A New Physics-Inspired Theory of Deep Learning | Optimal initialization of Neural Nets
Science & Technology
A special video about recent exciting developments in mathematical deep learning! 🔥 Make sure to check out the video if you want a quick visual summary of the contents of the book “The Principles of Deep Learning Theory”: deeplearningtheory.com/.
SPONSOR: Aleph Alpha 👉 app.aleph-alpha.com/
17:38 ERRATUM: Boris Hanin reached out to us and made this point: "I found the explanations to be crisp and concise, except for one point. Namely, I am pretty sure the description you give of why MLPs become linear models at infinite width is not quite correct. It is not true that they are equivalent to a random feature model in which features are the post-activations of the final hidden layer and that activations in previous layers don’t move. Instead, what happens is that the full vector of activations in each layer moves by an order 1 amount. However, while the Jacobian of the model output with respect to its parameters remains order 1, the Hessian goes to zero. Put another way, the whole neural network can be replaced by its linearization around the start of training. In the resulting linear model all parameters move to fit the data."
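The linearization Hanin describes can be checked numerically. The following is a toy numpy sketch (not from the video or the book; it assumes a hypothetical one-hidden-layer tanh network): it compares the network's true output after a small parameter update to its first-order Taylor expansion around initialization, which is the "linear model" the erratum refers to.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 512, 4          # hidden width, input dimension

def mlp(theta, x):
    """One-hidden-layer tanh network; theta packs (W1, w2)."""
    W1 = theta[: n * d].reshape(n, d)
    w2 = theta[n * d :]
    return w2 @ np.tanh(W1 @ x) / np.sqrt(n)

def grad(f, theta, eps=1e-6):
    """Numerical gradient of a scalar function via central differences."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        g[i] = (f(theta + e) - f(theta - e)) / (2 * eps)
    return g

x = rng.normal(size=d)
theta0 = rng.normal(size=n * d + n)          # standard Gaussian init
f = lambda th: mlp(th, x)

g = grad(f, theta0)                          # Jacobian at initialization
delta = 0.01 * rng.normal(size=theta0.size)  # a small parameter update
f_true = f(theta0 + delta)
f_lin = f(theta0) + g @ delta                # linearized model around init
print(abs(f_true - f_lin))                   # tiny: the network ≈ its linearization
```

At large width the Hessian correction shrinks, so this linearization stays accurate even over the order-1 parameter moves that happen during training — which is the sense in which all parameters "move to fit the data" in a linear model.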
Check out our daily #MachineLearning Quiz Questions: / aicoffeebreak
➡️ AI Coffee Break Merch! 🛍️ aicoffeebreak.creator-spring....
📕 The book: Roberts, Daniel A., Sho Yaida, and Boris Hanin. The principles of deep learning theory. Cambridge University Press, 2022. arxiv.org/abs/2106.10165
MAGMA paper 📜: arxiv.org/abs/2112.05253
Outline:
00:00 The Principles of Deep Learning Theory (Book)
02:12 Neural networks and black boxes
05:35 Large-width limit
07:59 How to get the large-width limit and Forward propagation recap
13:11 Why we need non-Gaussianity
16:28 No wiring for infinite-width networks
17:13 No representation learning for infinite-width networks
19:31 Layer recursion
22:36 Experimental verification
24:09 The Renormalisation Group
26:08 Fixed points
28:45 Stability
31:15 Experimental verification (activation functions)
34:57 Outro and thanks
35:26 Sponsor: Aleph Alpha
Thanks to our Patrons who support us in Tier 2, 3, 4: 🙏
Don Rosenthal, Dres. Trost GbR, banana.dev -- Kyle Morris, Julián Salazar, Edvard Grødem, Vignesh Valliappan, Kevin Tsai, Mutual Information, Mike Ton
▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
🔥 Optionally, pay us a coffee to help with our Coffee Bean production! ☕
Patreon: / aicoffeebreak
Ko-fi: ko-fi.com/aicoffeebreak
▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
🔗 Links:
AICoffeeBreakQuiz: / aicoffeebreak
Twitter: / aicoffeebreak
Reddit: / aicoffeebreak
KZread: / aicoffeebreak
#AICoffeeBreak #MsCoffeeBean #MachineLearning #AI #research
Music 🎵 : It's Only Worth It if You Work for It (Instrumental) - NEFFEX
Comments: 35
Hi Letitia, I'm a physicist working in deep learning at a company. I love physics, but I was convinced that everything I learnt in my degree and the mindset I acquired were useless in software and computer-science-related things. Thank you for proving that I was damn wrong. Now I am more grateful for knowing a bit about Thermodynamics, Statistical Mechanics, and Quantum Mechanics (perturbation theory, in particular).
Hi! My name is Rafael. I'm from Brazil and I've been following your videos about Data Science, and they are very... very... good. Congratulations!!
@AICoffeeBreak
2 years ago
I had help for this one. 😅 But thanks. I'll tell your good feedback to Karla too.
@rafalobo6633
2 years ago
@@AICoffeeBreak Nice to meet you, Karla!! 😄😄
This is excellent stuff! I'm glad you gave such a complete overview of a really technical topic. KZread will be the next Wikipedia :)
Absolutely my favourite Coffee Break so far. I'm really happy to see that you exploit your background as a physicist and share it with us and Ms. Coffee Bean. It is amazing to see how the quality of these videos has increased within a year. This is already top level!
I'm glad they footnoted Radford Neal's and other work that went before this - almost all of the techniques in this paper had been applied to neural networks before.
The manim animations were AMAZING!!
Came here from Yann LeCun's post on FB. Really loved this! The last few years have truly been interesting for DL. About the paper, with evermore powerful models coming into the scene and folks trying to crack the blackbox, such theoretical work is very much needed.
The presentation is simply amazing! On the other hand, the distribution of the parameters is assumed to be the initialization distribution. It seems that this should be true only in the beginning part of the training process. A theory that works in the entirety of the training process would be even more useful.
@drdca8263
2 years ago
I think (but I could easily be wrong; machine learning isn’t my area) part of the idea might be that, in the wide network limit, the weights don’t change very much over the course of training, and therefore the distributions don’t change much
This is my favorite Coffee Break so far. Thanks for all the effort!
Amazing seeing you mature as a content producer, this is awesome!
I love this video! You can tell how much love is poured into it. Can't wait to read the book
The cow is almost as cute as Ms. Coffee Bean 🐮☕
@AICoffeeBreak
2 years ago
😱
I loved this! I didn't expect we were already at this level in the theory of neural networks. I'm glad they are being treated more formally and am looking forward to more work in this regard!
Wow! This is fascinating. I would like to reproduce this model. In your experiments you mention using a 50-layer MLP with a hidden size of 500. Are you referring to 500 neurons? If so, what is the neuron distribution across the hidden layers? Thank you for your hard work.
Great job on top of the perfect paper!
Thank you for making this wonderful video!
I have so many questions regarding this. For instance, you start reinitializing the weights around the 9:30 mark to get the Gaussian distribution. However, how do you reinitialize the weights? From which distribution? Is there training involved here? I guess I now have to read the book. You've essentially given me homework... of however many pages it takes for them to get to this point... And, knowing me, I'll probably read all 472 of them.
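On the reinitialization question above: the standard setup (and the one the book analyzes) is to draw each weight i.i.d. from a zero-mean Gaussian whose variance scales inversely with the incoming layer width, with no training involved. A minimal numpy sketch, assuming ReLU activations, zero biases, and the He-style critical choice C_W = 2 (the specific widths mirror the 50-layer, width-500 experiment mentioned in the video):

```python
import numpy as np

rng = np.random.default_rng(1)

def init_mlp(widths, C_W=2.0):
    """I.i.d. Gaussian init: each weight ~ N(0, C_W / n_in).
    C_W = 2 is the critical choice for ReLU (He-style scaling)."""
    return [rng.normal(0.0, np.sqrt(C_W / n_in), size=(n_out, n_in))
            for n_in, n_out in zip(widths[:-1], widths[1:])]

def forward(Ws, x):
    for W in Ws[:-1]:
        x = np.maximum(W @ x, 0.0)   # ReLU hidden layers, zero biases
    return Ws[-1] @ x

# 50 layers of width 500, as in the video's experiment
widths = [500] * 51
out = forward(init_mlp(widths), rng.normal(size=500))
print(np.std(out))                   # stays order 1 instead of exploding/vanishing
```

With C_W at its critical value the typical preactivation scale is preserved from layer to layer; moving C_W away from 2 makes the output norm grow or shrink exponentially with depth, which is why the initialization distribution matters so much in the deep, wide regime the book studies.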
Would you please consider explaining causal attention(Causal Attention for Vision-Language Tasks)? Find it quite confusing🙏
big video here, I save it for the weekend xd
Awesome video
I haven't heard before that details are lost when you make the network width too large. Is this something that has been observed? Does this assume that you use stochastic gradient descent, or does this also occur with other optimizers such as Adam, which scale the gradient to make up for the fact that the magnitude of the gradient elements decreases as the width of the network increases?
Thank you for this! I'd really like to recreate the plots you show towards the end for different activation functions. Can you share the code?
What aspects are physics inspired, explained ELI5?
@JorgetePanete
A year ago
ATM Machine
Idea - In sklearn you can do random convolve() (2D) over the coefs_[0][28x28,k] matrices to get to the truth faster. Just turn the convolve off after first iteration. You get an image like coefs_ matrix in shape[0]. Another idea is to simulate the input of dark matter. Its dark so the input must have been put inside the hidden layers and evaporated inwards to a destroyer. So dark matter is a time material from dt to dt^10 fast enough environment for input information to be destroyed in the core. Guess //Per
@AICoffeeBreak
2 years ago
Are you GPT-3? 🤣
@perlindholm4129
2 years ago
@@AICoffeeBreak Are you a MonICA?
@AICoffeeBreak
2 years ago
@@things_leftunsaid so it is not AI generated?
@AICoffeeBreak
2 years ago
@@things_leftunsaid thanks. It makes sense in all the lack of sense. 😅
Make a video on Reformer: explain its key details, why it's special, and how it improves on the Transformer. Explain the reversible residual layers and LSH.