A New Physics-Inspired Theory of Deep Learning | Optimal initialization of Neural Nets
Science & Technology
A special video about recent exciting developments in mathematical deep learning! 🔥 Make sure to check out the video if you want a quick visual summary of the contents of the book “The Principles of Deep Learning Theory”: deeplearningtheory.com/.
SPONSOR: Aleph Alpha 👉 app.aleph-alpha.com/
17:38 ERRATUM: Boris Hanin reached out to us and made this point: "I found the explanations to be crisp and concise, except for one point. Namely, I am pretty sure the description you give of why MLPs become linear models at infinite width is not quite correct. It is not true that they are equivalent to a random feature model in which features are the post-activations of the final hidden layer and that activations in previous layers don’t move. Instead, what happens is that the full vector of activations in each layer moves by an order 1 amount. However, while the Jacobian of the model output with respect to its parameters remains order 1, the Hessian goes to zero. Put another way, the whole neural network can be replaced by its linearization around the start of training. In the resulting linear model all parameters move to fit the data."
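The linearization Hanin describes can be checked numerically. The following is a toy numpy sketch (not from the video or the book; it assumes a hypothetical one-hidden-layer tanh network): it compares the network's true output after a small parameter update to its first-order Taylor expansion around initialization, which is the "linear model" the erratum refers to.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 512, 4          # hidden width, input dimension

def mlp(theta, x):
    """One-hidden-layer tanh network; theta packs (W1, w2)."""
    W1 = theta[: n * d].reshape(n, d)
    w2 = theta[n * d :]
    return w2 @ np.tanh(W1 @ x) / np.sqrt(n)

def grad(f, theta, eps=1e-6):
    """Numerical gradient of a scalar function via central differences."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        g[i] = (f(theta + e) - f(theta - e)) / (2 * eps)
    return g

x = rng.normal(size=d)
theta0 = rng.normal(size=n * d + n)          # standard Gaussian init
f = lambda th: mlp(th, x)

g = grad(f, theta0)                          # Jacobian at initialization
delta = 0.01 * rng.normal(size=theta0.size)  # a small parameter update
f_true = f(theta0 + delta)
f_lin = f(theta0) + g @ delta                # linearized model around init
print(abs(f_true - f_lin))                   # tiny: the network ≈ its linearization
```

At large width the Hessian correction shrinks, so this linearization stays accurate even over the order-1 parameter moves that happen during training — which is the sense in which all parameters "move to fit the data" in a linear model.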
Check out our daily #MachineLearning Quiz Questions: / aicoffeebreak
➡️ AI Coffee Break Merch! 🛍️ aicoffeebreak.creator-spring....
📕 The book: Roberts, Daniel A., Sho Yaida, and Boris Hanin. The principles of deep learning theory. Cambridge University Press, 2022. arxiv.org/abs/2106.10165
MAGMA paper 📜: arxiv.org/abs/2112.05253
Outline:
00:00 The Principles of Deep Learning Theory (Book)
02:12 Neural networks and black boxes
05:35 Large-width limit
07:59 How to get the large-width limit and Forward propagation recap
13:11 Why we need non-Gaussianity
16:28 No wiring for infinite-width networks
17:13 No representation learning for infinite-width networks
19:31 Layer recursion
22:36 Experimental verification
24:09 The Renormalisation Group
26:08 Fixed points
28:45 Stability
31:15 Experimental verification (activation functions)
34:57 Outro and thanks
35:26 Sponsor: Aleph Alpha
Thanks to our Patrons who support us in Tier 2, 3, 4: 🙏
Don Rosenthal, Dres. Trost GbR, banana.dev -- Kyle Morris, Julián Salazar, Edvard Grødem, Vignesh Valliappan, Kevin Tsai, Mutual Information, Mike Ton
▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
🔥 Optionally, pay us a coffee to help with our Coffee Bean production! ☕
Patreon: / aicoffeebreak
Ko-fi: ko-fi.com/aicoffeebreak
▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
🔗 Links:
AICoffeeBreakQuiz: / aicoffeebreak
Twitter: / aicoffeebreak
Reddit: / aicoffeebreak
KZread: / aicoffeebreak
#AICoffeeBreak #MsCoffeeBean #MachineLearning #AI #research
Music 🎵 : It's Only Worth It if You Work for It (Instrumental) - NEFFEX
Comments: 35
Hi Letitia, I'm a physicist working in deep learning at a company. I love physics, but I was convinced that everything I learnt in my degree and the mindset I acquired were useless in software and computer-science-related things. Thank you for proving that I was damn wrong. Now I am more grateful for knowing a bit about Thermodynamics, Statistical Mechanics, and Quantum Mechanics (perturbation theory, in particular).
Hi! My name is Rafael. I'm from Brazil and I've been following your videos about Data Science, and they are very... very... good. Congratulations!!
@AICoffeeBreak
2 years ago
I had help for this one. 😅 But thanks. I'll tell your good feedback to Karla too.
@rafalobo6633
2 years ago
@@AICoffeeBreak Nice to meet you, Karla!! 😄😄
This is excellent stuff! I'm glad you gave such a complete overview of a really technical topic. KZread will be the next Wikipedia :)
Absolutely my favourite Coffee Break so far. I'm really happy to see that you exploit your background as a physicist and share it with us and Ms. Coffee Bean. It is amazing to see how the quality of these videos has increased within a year. This is already top level!
I'm glad they footnoted Radford Neal's and other work that went before this - almost all of the techniques in this paper had been applied to neural networks before.
The manim animations were AMAZING!!
Came here from Yann LeCun's post on FB. Really loved this! The last few years have truly been interesting for DL. About the paper, with evermore powerful models coming into the scene and folks trying to crack the blackbox, such theoretical work is very much needed.
The presentation is simply amazing! On the other hand, the distribution of the parameters is assumed to be the initialization distribution. It seems that this should be true only in the beginning part of the training process. A theory that works in the entirety of the training process would be even more useful.
@drdca8263
2 years ago
I think (but I could easily be wrong; machine learning isn’t my area) part of the idea might be that, in the wide network limit, the weights don’t change very much over the course of training, and therefore the distributions don’t change much
This is my favorite Coffee Break so far. Thanks for all the effort!
Amazing seeing you mature as a content producer, this is awesome!
I love this video! You can tell how much love is poured into it. Can't wait to read the book
The cow is almost as cute as Ms. Coffee Bean 🐮☕
@AICoffeeBreak
2 years ago
😱
I loved this! I didn't expect we were already at this level in the theory of neural networks. I'm glad they are being treated more formally and am looking forward to more work in this regard!
Wow! This is fascinating. I would like to reproduce this model. In your experiments you mention using a 50-layer MLP with a hidden size of 500. Are you referring to 500 neurons? If so, what is the neuron distribution across the hidden layers? Thank you for your hard work.
Great job on top of the perfect paper!
Thank you for making this wonderful video!
I have so many questions regarding this. For instance, you start reinitializing the weights around the 9:30 mark to get the Gaussian distribution. However, how do you reinitialize the weights? From which distribution? Is there training involved here? I guess I now have to read the book. You've essentially given me homework... of however many pages it takes for them to get to this point... And, knowing me, I'll probably read all 472 of them.
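On the reinitialization question above: the standard setup (and the one the book analyzes) is to draw each weight i.i.d. from a zero-mean Gaussian whose variance scales inversely with the incoming layer width, with no training involved. A minimal numpy sketch, assuming ReLU activations, zero biases, and the He-style critical choice C_W = 2 (the specific widths mirror the 50-layer, width-500 experiment mentioned in the video):

```python
import numpy as np

rng = np.random.default_rng(1)

def init_mlp(widths, C_W=2.0):
    """I.i.d. Gaussian init: each weight ~ N(0, C_W / n_in).
    C_W = 2 is the critical choice for ReLU (He-style scaling)."""
    return [rng.normal(0.0, np.sqrt(C_W / n_in), size=(n_out, n_in))
            for n_in, n_out in zip(widths[:-1], widths[1:])]

def forward(Ws, x):
    for W in Ws[:-1]:
        x = np.maximum(W @ x, 0.0)   # ReLU hidden layers, zero biases
    return Ws[-1] @ x

# 50 layers of width 500, as in the video's experiment
widths = [500] * 51
out = forward(init_mlp(widths), rng.normal(size=500))
print(np.std(out))                   # stays order 1 instead of exploding/vanishing
```

With C_W at its critical value the typical preactivation scale is preserved from layer to layer; moving C_W away from 2 makes the output norm grow or shrink exponentially with depth, which is why the initialization distribution matters so much in the deep, wide regime the book studies.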
Would you please consider explaining causal attention(Causal Attention for Vision-Language Tasks)? Find it quite confusing🙏
big video here, I save it for the weekend xd
Awesome video
I haven't heard before that details are lost when you make the network width too large. Is this something that has been observed? Does this assume that you use stochastic gradient descent, or does this also occur with other optimizers such as Adam, which scale the gradient to make up for the fact that the magnitude of the gradient elements decreases as the width of the network increases?
Thank you for this! I'd really like to recreate the plots you show towards the end for different activation functions. Can you share the code?
What aspects are physics inspired, explained ELI5?
@JorgetePanete
A year ago
ATM Machine
Idea - In sklearn you can do random convolve() (2D) over the coefs_[0][28x28,k] matrices to get to the truth faster. Just turn the convolve off after first iteration. You get an image like coefs_ matrix in shape[0]. Another idea is to simulate the input of dark matter. Its dark so the input must have been put inside the hidden layers and evaporated inwards to a destroyer. So dark matter is a time material from dt to dt^10 fast enough environment for input information to be destroyed in the core. Guess //Per
@AICoffeeBreak
2 years ago
Are you GPT-3? 🤣
@perlindholm4129
2 years ago
@@AICoffeeBreak Are you a MonICA?
@AICoffeeBreak
2 years ago
@@things_leftunsaid so it is not AI generated?
@AICoffeeBreak
2 years ago
@@things_leftunsaid thanks. It makes sense in all the lack of sense. 😅
Make a video on Reformer: explain its key details, why it's special, and how it improves on the Transformer. Explain the reversible residual layers and LSH.