Gradients are Not All You Need (Machine Learning Research Paper Explained)

Science & Technology

#deeplearning #backpropagation #simulation
More and more systems are being made differentiable, which means that gradients of these systems' dynamics can be computed exactly. While this development has led to a lot of advances, there are also distinct situations where backpropagation can be a very bad idea. This paper characterizes a few such systems in the domain of iterated dynamical systems, often with some source of stochasticity, resulting in chaotic behavior. In these systems, it is often better to use black-box gradient estimators than to compute the gradients exactly.
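To make the failure mode concrete, here is a minimal sketch of my own (not code from the paper or the video): backpropagating through an unrolled chaotic map yields an exact but enormous, erratic gradient, while a Gaussian-smoothed black-box (evolution-strategies-style) estimate stays tame. The logistic map, the constants, and the terminal loss are arbitrary choices for illustration.

```python
# Illustrative sketch only (not from the paper): exact gradient through a
# chaotic iterated system vs. a smoothed black-box (ES-style) estimate.
import jax
import jax.numpy as jnp

def loss(r, x0=0.2, steps=100):
    # Unroll the logistic map x_{t+1} = r * x_t * (1 - x_t), chaotic near r = 3.9,
    # and return a simple terminal loss.
    x = x0
    for _ in range(steps):
        x = r * x * (1.0 - x)
    return (x - 0.5) ** 2

r = 3.9
exact_grad = jax.grad(loss)(r)  # backprop through the unrolled system

# Black-box estimate: antithetic Monte-Carlo gradient of a Gaussian-smoothed
# version of the loss (evolution-strategies style) -- biased, but low variance.
key = jax.random.PRNGKey(0)
sigma, n = 0.01, 256
eps = sigma * jax.random.normal(key, (n,))
es_grad = jnp.mean(
    (jax.vmap(loss)(r + eps) - jax.vmap(loss)(r - eps)) * eps
) / (2 * sigma ** 2)

print(exact_grad, es_grad)  # the exact gradient is typically astronomically large
```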
OUTLINE:
0:00 - Foreword
1:15 - Intro & Overview
3:40 - Backpropagation through iterated systems
12:10 - Connection to the spectrum of the Jacobian
15:35 - The Reparameterization Trick
21:30 - Problems of reparameterization
26:35 - Example 1: Policy Learning in Simulation
33:05 - Example 2: Meta-Learning Optimizers
36:15 - Example 3: Disk packing
37:45 - Analysis of Jacobians
40:20 - What can be done?
45:40 - Just use Black-Box methods
Paper: arxiv.org/abs/2111.05803
Abstract:
Differentiable programming techniques are widely used in the community and are responsible for the machine learning renaissance of the past several decades. While these methods are powerful, they have limits. In this short report, we discuss a common chaos based failure mode which appears in a variety of differentiable circumstances, ranging from recurrent neural networks and numerical physics simulation to training learned optimizers. We trace this failure to the spectrum of the Jacobian of the system under study, and provide criteria for when a practitioner might expect this failure to spoil their differentiation based optimization algorithms.
Authors: Luke Metz, C. Daniel Freeman, Samuel S. Schoenholz, Tal Kachman
Links:
TabNine Code Completion (Referral): bit.ly/tabnine-yannick
KZread: / yannickilcher
Twitter: / ykilcher
Discord: / discord
BitChute: www.bitchute.com/channel/yann...
LinkedIn: / ykilcher
BiliBili: space.bilibili.com/2017636191
If you want to support me, the best thing to do is to share out the content :)
If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: www.subscribestar.com/yannick...
Patreon: / yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

Comments: 52

  • @YannicKilcher (2 years ago)

    OUTLINE: 0:00 - Foreword 1:15 - Intro & Overview 3:40 - Backpropagation through iterated systems 12:10 - Connection to the spectrum of the Jacobian 15:35 - The Reparameterization Trick 21:30 - Problems of reparameterization 26:35 - Example 1: Policy Learning in Simulation 33:05 - Example 2: Meta-Learning Optimizers 36:15 - Example 3: Disk packing 37:45 - Analysis of Jacobians 40:20 - What can be done? 45:40 - Just use Black-Box methods

  • @amanvijayjindal5742 (2 years ago)

    Wut is a Jacobin 🤔

  • @IoannisNousias (2 years ago)

    Love those foundational explanations. Don’t ever apologize for them! Keep them coming.

  • @ThichMauXanh (2 years ago)

    The authors' history is big on gradient-everything (evidenced by the self-citations sprinkled throughout the paper; I didn't look at the author section and already knew who's on it, LOL), so instead of RL/Evo to learn optimizers/architectures, they make everything differentiable and cook up complicated schemes to use these gradients. Now I'm impressed they dug deeper into this direction, criticized it, and released this paper. Great science being done here.

  • @erhimonarowland9464 (2 years ago)

    Please try to simplify this paper for the benefit of some of us who are new entrants into the field of AI. Thank you.

  • @galileo3431 (2 years ago)

    Two things: 1. Your explanation of the reparameterization trick is amazing! It's exactly these Aha! moments I love science for! 2. I really love that you're so honest about the fact that your understanding of the paper might be wrong. For me, as a master's student aiming for a PhD in ML, it's so good to see that not every paper is self-explanatory. When I don't understand a paper, I always tend to doubt my skills. Watching you be honest about this issue really helps. Thank you! 🙏🏼

  • @jonathanballoch (2 years ago)

    It's interesting: if this problem is effectively solved for RNNs by simply using LSTMs, this kind of implies that there could be some sort of corrective layer added to differentiable simulators, meta-learning, and the other chaotic regimes, right? Overall great video and great paper! Keep up the good work, everyone!

  • @stefaniew433 (2 years ago)

    Thank you very much! I'm already looking forward to watching the video at my leisure.

  • @MichaelBrown-gt4qi (2 years ago)

    Such a great explanation! I seriously love the way you go through these papers.

  • @tiefkluehlfeuer (2 years ago)

    Finally someone showed that RNNs suffer from vanishing/exploding gradients! So cool!

  • @dariopassos (2 years ago)

    Yannic, I found the basic explanations very useful. You explain things in a very intuitive way, which is very good complementary information for someone like me who self-taught these kinds of subjects from books only. More... give us more! ;)

  • @fhub29 (1 month ago)

    Finally I understood the logic behind VAEs and the reparameterization trick. Thank you so much. I failed to get it during the deep learning lectures, but KZread saved my *ss. What a time to be alive.

  • @rnoro (2 years ago)

    I think the essential question is the limit theorem in analysis: given a function and a Hilbert space, under what conditions is the convergence good enough to take derivatives? It's mathematically subtle, and if you are not careful in carrying out the calculation or choosing the basis, then you will see the so-called large variance and so on.

  • @herp_derpingson (2 years ago)

    From what I understand, the paper says that the gradient variance can be thought of as how inefficient backprop is for a system. If the variance is high, it makes the weights ping-pong around chaotically without moving in a real direction.

    Then the paper argues that the gradient variance is high in certain chaotic systems, which makes sense, as a small change in initial conditions can make a large difference. Maybe I did not understand it correctly, but I don't think this is a shortcoming of SGD. Rather, it just means that predicting chaotic systems is hard, which SGD needs to do in order to optimize. If there are optimizers available that do not need to predict how chaotic systems behave and can still get the job done, then it is preferable to use those instead.

    Also, this kind of applies when training a very deep neural network. The top layers of the network then become the "chaotic system", and the gradients in the lower layers get high variance.

    44:58 Gradient clipping has always been a band-aid. We do not expect the neural network to learn as efficiently as it would otherwise have; it just makes the impossible possible.

    45:45 Black-box methods are usually the first thing we throw at a problem, and then we try differentiable systems. Black-box methods take forever to train compared to differentiable systems, so when differentiable systems work, it is much preferable to use them.

    Maybe I am a bit biased because I have put a lot of work into differentiable systems :P Regardless, good paper.

  • @scottmiller2591 (2 years ago)

    Black-box gradient is not a very good name here. The key is that the gradient is approximated over a scale which is large compared to the fluctuations. There are several "black box" gradients (like the complex-step method) whose scales are normally very, very small while still being accurate; these will have problems very similar to backprop's. In signal processing you find similar issues (though for different reasons) in estimating spectra. The solution there was the development of multi-taper spectral techniques, which are essentially multiple-scale methods weighted to optimize the desired results. I suspect a multi-scale analog for Jacobian estimates will be a solution here.
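For reference, a small sketch of the complex-step method mentioned in this comment (a standard numerical technique; the function here is my own toy example): it recovers the derivative of a real-analytic function from a tiny imaginary perturbation, so its scale is very small, which is exactly why it would inherit the same problems as backprop on chaotic losses.

```python
# Sketch of the complex-step derivative mentioned above (standard technique,
# toy function of my own): f'(x) ~= Im(f(x + i*h)) / h for a tiny step h.
import jax.numpy as jnp

def complex_step_grad(f, x, h=1e-20):
    # Works for real-analytic f; no subtractive cancellation, so h can be tiny.
    return jnp.imag(f(x + 1j * h)) / h

f = lambda x: jnp.sin(x) * jnp.exp(x)
print(complex_step_grad(f, 1.0))                     # ~= (cos(1) + sin(1)) * e
print((jnp.cos(1.0) + jnp.sin(1.0)) * jnp.exp(1.0))  # analytic derivative
```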

  • @barlowtwin (2 years ago)

    Very awesome revision for me thanks :)

  • @SenilerWhatever (2 years ago)

    Great video, thanks for the detailed explanation! How do you find the papers that you want to make a video about? Do you just go through arXiv and choose the papers that sound interesting or do you only choose papers from well-known conferences like NeurIPS etc.?

  • @ElieLabeca (2 years ago)

    absolutely love this channel

  • @stmandl (2 years ago)

    The index problem pointed out at 11:30 could actually be fine (notice the s_{i-1}).

  • @computerscienceitconferenc7375 (2 years ago)

    Good explanations

  • @laurenpinschannels (2 years ago)

    Curious to see comparisons and tests of theories about why we sometimes see success on these otherwise ill-posed test problems. It can't just be high dimensionality if the feedback is consistently, effectively random. How do you detect when a step of backprop was too chaotic?

  • @ivanvoid4910 (2 years ago)

    Amazing want more~

  • @beaconofwierd1883 (2 years ago)

    Are the eigenvalues of the Jacobian differentiable without the Hessian matrix? If so, could we add the eigenvalues to the loss so that the parameters try to keep the eigenvalues below 1 while updating?
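One way one might attempt this (a hedged sketch of my own, not from the paper or the video): instead of the eigenvalues themselves, penalize the spectral norm of the one-step Jacobian, i.e. its largest singular value, which upper-bounds the spectral radius and is differentiable via an unrolled power iteration. The dynamics function and shapes here are arbitrary assumptions.

```python
# Hedged sketch of the idea in the comment above (my own toy example, not from
# the paper): regularize the largest singular value of the one-step Jacobian
# toward <= 1. Spectral norm >= spectral radius, so this also bounds |eigenvalues|.
import jax
import jax.numpy as jnp

def step(params, s):
    # Hypothetical one-step dynamics s_{t+1} = tanh(W s_t + b).
    W, b = params
    return jnp.tanh(W @ s + b)

def spectral_norm(J, iters=20):
    # Largest singular value via unrolled power iteration (differentiable).
    v = jnp.ones(J.shape[1]) / jnp.sqrt(J.shape[1])
    for _ in range(iters):
        v = J.T @ (J @ v)
        v = v / jnp.linalg.norm(v)
    return jnp.linalg.norm(J @ v)

def jacobian_penalty(params, s):
    J = jax.jacobian(step, argnums=1)(params, s)           # d step / d s
    return jnp.maximum(spectral_norm(J) - 1.0, 0.0) ** 2   # penalize only if > 1

key = jax.random.PRNGKey(0)
params = (jax.random.normal(key, (4, 4)), jnp.zeros(4))
s = jnp.ones(4)
# This penalty would be added to the task loss; gradients flow into W and b.
print(jax.value_and_grad(jacobian_penalty)(params, s))
```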

  • @CharlesVanNoland (2 years ago)

    I consider backprop to be analogous to a horse-drawn carriage, whereas some kind of sparse hierarchical predictive modeling, like the brain uses, is the way forward to sports-car machine learning.

  • @dancar2537 (2 years ago)

    If you could also tell us how they came to this, that would be even better. Still, it's great that you do this explaining.

  • @julienherzen7191 (2 years ago)

    I guess the gradients having high variance speaks in favour of applying some kind of low-pass filter to them, such as what's done in practice with momentum, Adam, etc.
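A tiny sketch of this point (standard momentum, with my own toy numbers): an exponential moving average of the gradients is exactly a first-order low-pass filter, which cuts the variance substantially but cannot help if the noise completely dominates the signal.

```python
# Sketch of momentum as a low-pass filter on noisy gradients (standard idea,
# toy example of my own, not from the paper).
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
true_grad = 1.0
noisy_grads = true_grad + 10.0 * jax.random.normal(key, (1000,))  # high variance

beta, m = 0.9, 0.0
filtered = []
for g in noisy_grads:
    m = beta * m + (1.0 - beta) * g   # exponential moving average (EMA)
    filtered.append(m)

filtered = jnp.stack(filtered)
# For white noise, the EMA variance is roughly (1 - beta) / (1 + beta) ~ 20x lower.
print(jnp.var(noisy_grads), jnp.var(filtered[100:]))
```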

  • @paincake7865 (2 years ago)

    I don't really understand two parts of the reparametrization trick. First of all, you said you sample the vector from a Gaussian. Do you mean a multidimensional Gaussian, or are all of the values in the vector based on a single distribution? Second, if you always sample from the normal distribution for the reparametrization trick to work, wouldn't you just get noise? The normal distribution always has the same mean and standard deviation, so where would the learned part from the encoder to the latent vector be?

  • @YannicKilcher (2 years ago)

    - Yes, it's usually a multidimensional Gaussian with a diagonal covariance matrix, so you can treat each dimension independently.
    - The learned part is the weights that are used to produce mu and sigma, and every sample is multiplied by sigma and shifted by mu. So yes, there is noise from the sampling, but there is also signal coming from the encoder.
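A minimal sketch of the trick as described in this reply (a standard VAE-style construction, my own toy example; the shapes and the stand-in loss are arbitrary): the encoder weights produce mu and sigma, the noise comes from a fixed standard normal, and the sample mu + sigma * eps stays differentiable with respect to the weights.

```python
# Minimal sketch of the reparameterization trick as described above
# (standard VAE construction, toy example of my own; layer sizes are arbitrary).
import jax
import jax.numpy as jnp

def encode(params, x):
    W_mu, W_logvar = params
    mu = W_mu @ x                          # learned mean
    sigma = jnp.exp(0.5 * (W_logvar @ x))  # learned std (via log-variance)
    return mu, sigma

def sample_z(params, x, key):
    mu, sigma = encode(params, x)
    eps = jax.random.normal(key, mu.shape)  # noise from a fixed N(0, I)
    return mu + sigma * eps                 # differentiable w.r.t. params

def toy_loss(params, x, key):
    z = sample_z(params, x, key)
    return jnp.sum(z ** 2)                  # stand-in for decoder + KL terms

key = jax.random.PRNGKey(0)
k1, k2, k3 = jax.random.split(key, 3)
params = (jax.random.normal(k1, (2, 4)), jax.random.normal(k2, (2, 4)))
x = jnp.ones(4)
print(jax.grad(toy_loss)(params, x, k3))    # gradients flow through mu and sigma
```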

  • @paincake7865 (2 years ago)

    @YannicKilcher Alright, thanks. I did not get the part where (the learned) sigma and mu were used in the reparametrization the first time around, so I got confused. But now it makes sense. Thank you.

  • @logo2462 (2 years ago)

    First! Looking forward to hearing how this applies to inner-optimization routines.

  • @logo2462 (2 years ago)

    After finishing the video, I’m wondering if the authors could find ways of including some sort of group norm (batchnorm etc) to reduce the gradient variance.

  • @nielswarncke537 (2 years ago)

    Equations (4)-(6) seem wrong: d l_0 / d Phi = something + d l_0 / d Phi implies something = 0. Probably they mean something like an L1 or L2 loss for the second d l_0 / d Phi?

  • @Ronnypetson (2 years ago)

    Imagine backpropagating through backpropagation itself

  • @samanthaqiu3416 (2 years ago)

    my brain just went NaN trying to make sense of it 😂

  • @IoannisNousias (2 years ago)

    that would be forward propagation, silly…

  • @WhiteThunder121 (2 years ago)

    @IoannisNousias 5D multiverse propagation with time travel!

  • @artlenski8115 (2 years ago)

    You can. Think of the update rule, i.e. w = w + \Delta w. Why not extend this expression to a whole polynomial with higher-order terms? But then, why not go even further and replace the rule with an RNN? Now you have to optimise over the RNN that is then used to update the original network. Look for "learning to learn" models.
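A hedged toy sketch of what this comment describes (my own example, not any specific "learning to learn" paper): replace the hand-designed update w ← w + Δw with a small parametric function of the gradient and meta-optimize its parameters by backpropagating through the unrolled inner loop.

```python
# Toy sketch of a learned update rule as described above (my own example,
# not a specific paper's method): the update is a tiny parametric function
# of the gradient, and its parameters are meta-learned.
import jax
import jax.numpy as jnp

def task_loss(w):
    return jnp.sum((w - 3.0) ** 2)       # simple inner task: reach w = 3

def learned_update(meta_params, grad):
    a, b = meta_params[0], meta_params[1]  # replaces the hand-designed -lr * grad
    return a * grad + b * jnp.tanh(grad)

def meta_loss(meta_params, w0, inner_steps=10):
    w = w0
    for _ in range(inner_steps):         # unrolled inner optimization
        g = jax.grad(task_loss)(w)
        w = w + learned_update(meta_params, g)
    return task_loss(w)                  # how good is the learned optimizer?

meta_params = jnp.array([-0.01, 0.0])
w0 = jnp.zeros(3)
# Backpropagating through the unrolled inner loop = "backprop through backprop".
print(jax.value_and_grad(meta_loss)(meta_params, w0))
```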

  • @herp_derpingson (2 years ago)

    There are quite a few papers which have done that. There are a few Yannic videos on it too.

  • @AtomosNucleous (2 years ago)

    I laughed so hard alone in my room when you said "paper rejected" because of the index. Btw, the index is correct because of the i-1 in the product; it just applies for t >= 1.

  • @paulcurry8383 (2 years ago)

    One thing I'm always a bit confused by with these more theoretical papers is what task the models in the graphs are actually doing. I would assume it's an important thing to specify in order to argue that the behavior of gradients in those situations generalizes.

  • @etiennetiennetienne (2 years ago)

    In the end, is it that the gradient calculation, although correct, has large variance due to the stochastic conditions of its approximation (minibatches, truncated backprop), or just that using the exact derivative to converge is a bad idea?

  • @YannicKilcher (2 years ago)

    I think the main point is that in chaotic systems, the gradients can have very large variance, just due to the nature of these systems (small changes lead to large effects).

  • @45pierro (2 years ago)

    I would expect they try sampling multiple times and averaging the losses to reduce the variance.

  • @etiennetiennetienne (2 years ago)

    You are both right; the latter is the point of the paper. I guess they study problems where the loss landscape is super bumpy and we cannot smooth it using residuals or batchnorm, so we need another way of smoothing it, or we give up on gradients (or use RL for the bad-gradient parts)?

  • @peterszilvasi752 (2 years ago)

    "We found a mistake... Paper is rejected!" 😂

  • @amanvijayjindal5742 (2 years ago)

    Gradient is everything we were made to believe

  • @WhiteEyeTree (2 years ago)

    May I ask what kind of application you are using to read PDFs with such huge margins on the sides? I have been looking for something like that for a while now!

  • @erentas7391 (2 years ago)

    The names to be pressed on t-shirts

  • @mgostIH (2 years ago)

    16:30 "Turns out autoencoders by themselves don't really work" To be fair a result that proves you wrong has been published just a few days ago, indeed a field that goes *fast* 😆 The paper is "Masked Autoencoders Are Scalable Vision Learners"

  • @NavinF (2 years ago)

    I dunno man, their generated images look pretty blurry by my standards. In fact they look like the output images from every other autoencoder paper. Pretty sure autoencoders are still dead.

  • @aliabdulhussain8359 (2 years ago)

    Watching this video, the paper seems contradictory, since it says gradients are not all you need and then recommends approximating them using a "black-box method". I would say the gradient is not the issue; it is either the way we are computing it or the structure we are using to solve the problem.

  • @erhimonarowland9464 (2 years ago)

    This paper sounds great in principle. Please make it a little more practical for some of us trying to apply BP loss functions to interpret the real estate market here in Nigeria.
