
Learning Rate Grafting: Transferability of Optimizer Tuning (Machine Learning Research Paper Review)

#grafting #adam #sgd
Recent years of deep learning research have given rise to a plethora of different optimization algorithms, such as SGD, AdaGrad, Adam, LARS, LAMB, etc., each claiming its own peculiarities and advantages. In general, all of these algorithms modify two major things: the (implicit) learning rate schedule, and a correction to the gradient direction. This paper introduces grafting, a technique that transfers the induced learning rate schedule of one optimizer to another. In doing so, the paper shows that much of the benefit of adaptive methods (e.g. Adam) is actually due to this schedule, and not necessarily to the gradient direction correction. Grafting allows for more fundamental research into the differences and commonalities between optimizers, and a derived version of it makes it possible to compute static learning rate corrections for SGD, which potentially allows for large savings of GPU memory.
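
To make the magnitude/direction split concrete, here is a minimal sketch of one grafted update step (step size from optimizer M, direction from optimizer D), written in plain NumPy. This is an illustration of the idea as explained in the video, not the authors' reference code; the function names, hyperparameters, and toy numbers are assumptions of mine.

    import numpy as np

    def sgd_step(grad, lr=0.1):
        # Plain SGD proposes a step straight along the negative gradient.
        return -lr * grad

    def adam_step(grad, state, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
        # Standard Adam step with bias-corrected first/second moment estimates.
        state["t"] += 1
        state["m"] = b1 * state["m"] + (1 - b1) * grad
        state["v"] = b2 * state["v"] + (1 - b2) * grad ** 2
        m_hat = state["m"] / (1 - b1 ** state["t"])
        v_hat = state["v"] / (1 - b2 ** state["t"])
        return -lr * m_hat / (np.sqrt(v_hat) + eps)

    def grafted_step(m_step, d_step, eps=1e-16):
        # Grafting M#D: keep the step *size* of M, but the *direction* of D.
        return np.linalg.norm(m_step) * d_step / (np.linalg.norm(d_step) + eps)

    # Toy usage on a single parameter tensor (grafting is applied per layer):
    w = np.ones(4)
    adam_state = {"t": 0, "m": np.zeros(4), "v": np.zeros(4)}
    grad = np.array([0.3, -0.1, 0.2, 0.05])
    w = w + grafted_step(adam_step(grad, adam_state), sgd_step(grad))  # Adam#SGD

In this notation the first optimizer contributes the implicit step-size schedule and the second contributes the direction, which is exactly the disentanglement used to compare optimizers.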
OUTLINE
0:00 - Rant about Reviewer #2
6:25 - Intro & Overview
12:25 - Adaptive Optimization Methods
20:15 - Grafting Algorithm
26:45 - Experimental Results
31:35 - Static Transfer of Learning Rate Ratios
35:25 - Conclusion & Discussion
Paper (OpenReview): openreview.net...
Old Paper (Arxiv): arxiv.org/abs/...
Our Discord: / discord
Abstract:
In the empirical science of training large neural networks, the learning rate schedule is a notoriously challenging-to-tune hyperparameter, which can depend on all other properties (architecture, optimizer, batch size, dataset, regularization, ...) of the problem. In this work, we probe the entanglements between the optimizer and the learning rate schedule. We propose the technique of optimizer grafting, which allows for the transfer of the overall implicit step size schedule from a tuned optimizer to a new optimizer, preserving empirical performance. This provides a robust plug-and-play baseline for optimizer comparisons, leading to reductions to the computational cost of optimizer hyperparameter search. Using grafting, we discover a non-adaptive learning rate correction to SGD which allows it to train a BERT model to state-of-the-art performance. Besides providing a resource-saving tool for practitioners, the invariances discovered via grafting shed light on the successes and failure modes of optimizers in deep learning.
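
One plausible reading of the "non-adaptive learning rate correction" (see 31:35 in the outline): run grafting for a short warm-up, record the per-layer ratio between the adaptive step size and the plain SGD step size, then freeze those ratios and continue with SGD alone, so the adaptive optimizer's accumulators can be dropped from memory. The sketch below reuses the helper functions from the snippet above; the warm-up length (the ~2000 steps also discussed in the comments) and the simple averaging are illustrative assumptions, not values taken from the paper.

    # Estimate a static per-layer correction for SGD from a short warm-up.
    def estimate_ratio(grads, adam_state, warmup_steps=2000):
        # Average ||adaptive step|| / ||SGD step|| over the warm-up window.
        ratios = []
        for _, g in zip(range(warmup_steps), grads):
            a = adam_step(g, adam_state)
            s = sgd_step(g)
            ratios.append(np.linalg.norm(a) / (np.linalg.norm(s) + 1e-16))
        return float(np.mean(ratios))

    def corrected_sgd_step(grad, ratio, lr=0.1):
        # After warm-up: plain SGD rescaled by the frozen per-layer ratio.
        # No adaptive accumulators are kept, which is where the memory saving comes from.
        return ratio * (-lr * grad)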
Authors: Anonymous (Under Review)
Links:
TabNine Code Completion (Referral): bit.ly/tabnine-...
YouTube: / yannickilcher
Twitter: / ykilcher
Discord: / discord
BitChute: www.bitchute.c...
LinkedIn: / ykilcher
BiliBili: space.bilibili...
If you want to support me, the best thing to do is to share out the content :)
If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: www.subscribes...
Patreon: / yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

Comments: 61

  • @YannicKilcher
    2 years ago

    OUTLINE
    0:00 - Rant about Reviewer #2
    6:25 - Intro & Overview
    12:25 - Adaptive Optimization Methods
    20:15 - Grafting Algorithm
    26:45 - Experimental Results
    31:35 - Static Transfer of Learning Rate Ratios
    35:25 - Conclusion & Discussion

  • @TheParkitny
    2 years ago

    I feel like we need more reviews of reviews. Maybe one day it'll expose how poor the process is currently.

  • @dinoscheidt
    2 years ago

    Agreed 💯

  • @ulamss5
    2 years ago

    The current process is absolutely broken. Getting a paper published is more a function of luck and fame than of merit.

  • @rohananil8672
    2 years ago

    I loved watching this video especially the review of the reviewer - it made my day!

  • @sieyk
    2 years ago

    The mistake the authors made was not calling the paper "Step Size is All You Need". It would have been accepted without review.

  • @FreddySnijder-TheOnlyOne
    2 years ago

    Reviewing the reviewers should be standard practice; it would increase the quality of the reviews and, subsequently, the value of the accepted papers.

  • @visuality2541
    2 years ago

    Reviewing those OpenReviews is really fun to me personally.

  • @jabowery
    2 years ago

    Particularly since the Manhattan Project, a quasi-religious attitude toward Theory has been developing that is undoing the empirical tradition. While it is true that we cannot inherently generalize from empirical results, just about the only prior we have over those results is the theory of algorithmic information. Yet there have, in recent decades, been repeated examples of important empirical results rejected in peer review simply because there was no accompanying Theory to explain those results. If this doesn't sicken you, then there is something wrong with your rational faculties.

  • @sieyk
    2 years ago

    Scientist: "Hey, look! When I bump these two rocks together, this third rock floats! Isn't that neat?" Reviewer: "I don't think that should happen." Scientist: "Oh, neither did I. But it does! Look, try it yourself!" Reviewer: "No thanks. I don't think there's enough theory backing up what you're doing here." Scientist: "Well of course not. This is totally new. What do you expect?" Reviewer: "I'm going to have to reject your experiments because there isn't strong enough evidence." Scientist: "....."

  • @TheEbbemonster
    2 years ago

    Lol, so by training a lot of models, you can calculate a learning rate schedule. And then you can train your model with larger batch sizes because you save memory by just using SGD * scheduled-learning-rate. I wonder if this only works on the same dataset, and if you save compute/time in the end. I would expect that you would need a lot of runs to get a robust schedule 😂 Jeremy Howard already used learning rate scheduling years ago across datasets, so I would expect the schedule to somewhat generalise. Thanks for helping us all get smarter ❤️

  • @julianke455
    2 years ago

    The rant was hilarious lmao

  • @dracleirbag5838
    2 years ago

    I love your review of the reviews. 😂

  • @Mrbits01
    2 years ago

    This was the case with most ICLR reviews this year. I don't know what they did to the reviewer pool, but I feel the review quality was worse than the previous 2 years (not to say it was stellar then). I have never seen reviews on ICLR that are so all over the place, and so devoid of meaningful dialogue/feedback.

  • @dewinmoonl
    2 years ago

    There are pretty much two modes of this channel: the facecam with a more punchy persona, and without the facecam a more academic persona. Both are cool in the context of making YT videos. I went back and watched some of the older videos, and I think Yannic is still very much grounded and has not "taken off" into memes and clickbait in order to be "successful" on YT. Keep up the honest work, I very much liked your content.

  • @herp_derpingson
    2 years ago

    31:46 One thing that would be really helpful in building intuition would be how the parameters of Adam/Adagrad change over the course of the training. If we see that after some steps it converges to some value, then we graft and switch optimizers. We are assuming that the other variables would be roughly constant for the remainder of the training. I wonder how that would work out. That way we are not arbitrarily picking 2000 or so steps.

  • @YannicKilcher
    2 years ago

    true, but they must have done some experimentation to come up with that number in the first place

  • @ChocolateMilkCultLeader
    2 years ago

    I'm going to be writing about this paper. Thanks for sharing

  • @adokoka
    2 years ago

    Hi Yannic. Thanks for the paper review. Imho I don't see the innovation in this paper. These variants of gradient descent algorithms, normalisation, and preconditioning have been comprehensively studied in the field of optimisation in general. On a theoretical note, the authors seem to assume that the direction of the gradient is fixed and one only needs to find optimal step sizes by transferring learning rates between algorithms. Either overshooting or undershooting the optimal steps would result in inefficiency. Furthermore, no mention has been made of how the proposed methodology behaves in the neighbourhood of a saddle point.

  • @YannicKilcher
    2 years ago

    I don't think that the authors here go for innovation. They clearly propose this as an investigative tool to disentangle implicit step size and direction correction. Any "innovation" is simply a side-product. Also, I don't think they make any of these assumptions, but simply propose this as a transfer mechanism, explicitly discarding the direction of one algorithm. If they were to assume that the direction were fixed, there would be no need to transfer the step size at all, since the directions would be the same.

  • @adokoka
    2 years ago

    @YannicKilcher Thanks for the clarifications. I eventually noticed that the proposed grafting is not a guaranteed gradient descent method. Interesting paper anyway :)

  • @tarekshaalan3257
    2 years ago

    That's a great explanation / interesting paper, thank you 🙏🏻. I am wondering why, when people write a paper, they don't put all the signs/letters in a table in an appendix for the reader to refer back to. Why do you have to say "I am guessing M is magnitude and D is direction" ☹️? It shakes all my understanding once we start the guessing process; this happens to me all the time 🤷🏻‍♂️

  • @sieyk
    2 years ago

    I also really hate it when math gets dumped in a paper and you're expected to be fluent in whatever was going through their head.

  • @jasdeepsinghgrover2470
    2 years ago

    Going by a similar idea... Why don't we train the model in parts... Maybe train with one optimiser for 100 epochs, then test 3 possible optimisers for maybe 5 epochs each and continue with the best one... The loss function keeps changing in structure as models are optimised.

  • @paxdriver
    2 years ago

    This is a very similar approach to my current project, which looks to use PBR raytracing optimizers grafted onto SGD models using Blender and the Cycles render engine. 👍

  • @Coolguydudeness1234
    2 years ago

    Please share!

  • @paxdriver
    2 years ago

    Working on it. Blender runs Python natively though so if you learn material nodes you can apply it too right out of the box. I'm still trying to learn all the math and python lol

  • @LouisChiaki
    2 years ago

    39:12 "get enough sleep" Why Yannic knew that I watched this at 2:20 AM before sleep?

  • @priyamdey3298
    2 years ago

    2:34 This is gonna be a meme🤣

  • @swissdude9624
    2 years ago

    Did I get that correctly? We only save memory once we start learning how one optimizer "corrects" the other and then use that knowledge to continue with the one requiring less memory.

  • @YannicKilcher
    2 years ago

    correct

  • @sacramentofwilderness6656
    2 years ago

    An interesting question is whether the dependency fitted during the initial steps holds over the course of subsequent training. If the optimizer was initially in a region with small curvature and then got into a high-curvature region, I would expect that grafting wouldn't be beneficial.

  • @rishabhsahlot7481
    2 years ago

    That is probably why AdaBelief (derived from Adam) gives generalization similar to SGD.

  • @tarekshaalan3257
    2 years ago

    That's true, such comments will be really disappointing and confusing for PhD students. Please put yourself in the authors' shoes before saying anything, and consider whether it is constructive or destructive. For a researcher to grow requires support from the community / no one is perfect and we all have to start from somewhere.

  • @richardbloemenkamp8532
    2 years ago

    Before ML I used Conjugate Gradients, GMRES, and Levenberg-Marquardt. They all seemed pretty smart. Why does ML use SGD and Adam, which seem simpler and less smart? Especially CG with A-orthogonality, which in theory works great for elongated valleys with minimal memory requirements.

  • @mgostIH
    2 years ago

    Conjugate gradient solves a different problem, finding a solution of a linear system, and even if you could adapt it to NNs it requires strictly positive definite matrices. GMRES is more general but still for linear system solutions, while Levenberg-Marquardt is for least-squares curve fitting. The problem with all those classical iterative algorithms is that they assume quite strong conditions on the problem you are trying to solve (linearity, convexity, or a single global optimum) and are usually intractable in huge dimensions (like Newton's method). NNs, on the other hand, have almost no theoretical guarantees on anything, and a global solution isn't even important; we have a lot of experimental evidence on the behaviour of NN minima, and those reached by current methods are often as good as you can get.

  • @richardbloemenkamp8532
    2 years ago

    @mgostIH First, thanks a lot for your quite detailed answer! I think you have some important points, but on the other hand, I and others have applied these classical methods many times to non-linear problems with good results, and there exist many variations that overcome their limitations. E.g. there exist various optimizations to the update directions. It appears to me that the ML community has tested all available methods and found which works best. Maybe less stress was put on determining why certain methods work so well, or maybe I'm just not aware. I agree that in ML the global solution is not important, whereas in physics problems we want to prevent as much as possible getting stuck in a local minimum. BTW, if locally your matrix is indefinite then you have a saddle point and thus no optimum. If your matrix is semi-definite then I think classical optimization still works, including CG; only the gradient becomes zero. I still have the feeling that there is a disconnect between the traditional mathematicians working on optimization problems and the new big-tech researchers on this point.

  • @mgostIH
    2 years ago

    @richardbloemenkamp8532 Don't take it as a knock against older methods that may work; in my experience there's a lot of overlooked stuff in ML. You could try an implementation of those algorithms yourself, or see where they hit a bottleneck or lack assumptions in current regimes. If you have results that are competitive with current methods, that can be worth a paper!

  • @user-qk9mw1dx9u
    2 years ago

    The algebra must be consistent with the topology (N. Bourbaki). Dividing mathematics into "topology", "algebra", "analysis", "geometry", etc., is mainly done for clarity in (undergraduate) teaching. Mathematics should rather be seen as a highly coherent body of knowledge.

  • @afafssaf925
    2 years ago

    Saying "despite the evidence, I don't feel like will hold up can be perfectly reasonable. Anytime you're trying to "defeat" a model/optimizer/theory that has held up for a long time, you're more likely to find spurious results, or to have missed something. Case in point: if you were to say this about every paper that claimed to beat Adam, you'd be pretty much right all the time.

  • @YannicKilcher
    2 years ago

    Yes, I agree. But that's not the point here. The point is that this is an official reviewer listing their own misunderstandings as weaknesses of the paper, dismissing the authors as amateurs, and directly ignoring the paper's evidence. Sure, this might not generalize (which is what the other reviewers also point out), but you can't just look at a graph showing X and then say "I don't think X", especially as a reviewer. Maybe they meant generalization but failed to formulate that, but given the text of the review, I don't think that's the case.

  • @fast_harmonic_psychedelic
    2 years ago

    did you just tell me to get enough sleep

  • @YannicKilcher
    2 years ago

    you watched to the end, good job :D

  • @TimScarfe
    2 years ago

    Hilarious rant😂😂

  • @G12GilbertProduction
    2 years ago

    "Theory is not reasonable." OK. I'm done. ×DDD

  • @jaad9848
    2 years ago

    The worst part is the confidence of 5 the reviewer assigned to their own review. Also, I think the reviewer's first language isn't English, but on the other hand the reviewer critiques the language; insert something about "glass houses".

  • @ksy8585
    2 years ago

    what's the name of the discord channel?

  • @YannicKilcher
    2 years ago

    Link is in the video description

  • @peabrane8067
    2 years ago

    There's also no "theory" behind Adam or SGD, nor behind 99 percent of existing optimizer "tricks". So it's pointless to ask for any theory at all if the entire field is empirically motivated by nature (whatever trains fast is good stuff).

  • @Lee-vs5ez
    2 years ago

    academic environment

  • @blanamaxima
    2 years ago

    So today one can run experiments on 200 GPUs and publish a paper that has no conclusive result... The experiments do not show anything; there is no conclusion, as the results are within the margin of uncertainty (1-2%).

  • @YannicKilcher
    2 years ago

    If you don't think there is a conclusive result, I may not have done a good job explaining the paper. Keep in mind that not every paper's goal is to be better than all others.

  • @n1c0l1z
    2 years ago

    @YannicKilcher That reviewer you criticized is clearly not good at writing proper English, and probably wrote his review hastily. However, I think he has a strong point, and so does @John Doe: most of the elicited "phenomena" in this paper seem within the margin of uncertainty, and the paper does not provide any supporting theory. I would have rejected it as well for the lack of either of those two kinds of evidence. The authors probably should have run fewer experiments but provided better statistics (e.g. confidence intervals) for their measurements.

  • @jakelionlight3936
    2 years ago

    Yannic, do you know this person, Hudziak? He's trolling people in the comments with GPT. I'm embarrassed to say it took me two comments to notice...

  • @woowooNeedsFaith
    2 years ago

    2:31 - We should be able to vote these reviewers out. If your review is nonsensical, you should lose your right to review anything. And maybe more...

  • @fast_harmonic_psychedelic
    2 years ago

    this reviewer should be removed from his post

  • @matthewchiu9898
    2 years ago

    So there is no one right learning rate adaptation algorithm. It seems to me, the paper is a good example of hacking the optimizers, which themselves are a bunch of hacks.

  • @woowooNeedsFaith
    2 years ago

    4:17 - How do people who can't even write become reviewers? "Summary Of The Review... paper is insufficinet."

  • @piotr780
    2 years ago

    20:35 The level of science is so low - middle school mathematics is published as a "new method", no theoretical justification, no general theory in deep learning at all, only speculation.

  • @NavinF
    2 years ago

    Ok buddy, let’s see your papers

  • @flightrisk7566
    2 years ago

    Where does this claim to be a "new method"? I thought this paper only seeks to empirically demonstrate a somewhat more competitive application of SGD to problem domains where its performance was previously thought completely laughable, by searching for a corrective learning rate schedule using adaptive methods.