Direct Feedback Alignment Scales to Modern Deep Learning Tasks and Architectures (Paper Explained)

Science & Technology

Backpropagation is one of the central components of modern deep learning. However, it's not biologically plausible, which limits how useful deep learning can be for understanding how the human brain works. Direct Feedback Alignment is a biologically plausible alternative, and this paper shows that, contrary to previous research, it can be successfully applied to modern deep architectures and solve challenging tasks.
OUTLINE:
0:00 - Intro & Overview
1:40 - The Problem with Backpropagation
10:25 - Direct Feedback Alignment
21:00 - My Intuition why DFA works
31:20 - Experiments
Paper: arxiv.org/abs/2006.12878
Code: github.com/lightonai/dfa-scal...
Referenced Paper by Arild Nøkland: arxiv.org/abs/1609.01596
Abstract:
Despite being the workhorse of deep learning, the backpropagation algorithm is no panacea. It enforces sequential layer updates, thus preventing efficient parallelization of the training process. Furthermore, its biological plausibility is being challenged. Alternative schemes have been devised; yet, under the constraint of synaptic asymmetry, none have scaled to modern deep learning tasks and architectures. Here, we challenge this perspective, and study the applicability of Direct Feedback Alignment to neural view synthesis, recommender systems, geometric learning, and natural language processing. In contrast with previous studies limited to computer vision tasks, our findings show that it successfully trains a large range of state-of-the-art deep learning architectures, with performance close to fine-tuned backpropagation. At variance with common beliefs, our work supports that challenging tasks can be tackled in the absence of weight transport.
Authors: Julien Launay, Iacopo Poli, François Boniface, Florent Krzakala
Links:
YouTube: / yannickilcher
Twitter: / ykilcher
Discord: / discord
BitChute: www.bitchute.com/channel/yann...
Minds: www.minds.com/ykilcher
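
For readers who want the core mechanic spelled out, here is a minimal NumPy sketch of the DFA update in the spirit of Nøkland's formulation referenced above. Everything in it (layer sizes, toy data, learning rate, variable names) is illustrative and not taken from the paper or the linked repository.

```python
# Minimal NumPy sketch of Direct Feedback Alignment (DFA) on a toy problem.
# Illustrative only: layer sizes, data, and learning rate are made up here,
# not taken from the paper or the linked repository.
import numpy as np

rng = np.random.default_rng(0)

# Toy 3-class data: Gaussian blobs in 20 dimensions.
n_per_class, d_in, n_classes = 200, 20, 3
centers = rng.normal(scale=3.0, size=(n_classes, d_in))
x = np.concatenate([c + rng.normal(size=(n_per_class, d_in)) for c in centers])
y = np.repeat(np.arange(n_classes), n_per_class)
onehot = np.eye(n_classes)[y]

# Two hidden layers of width 64.
d1, d2 = 64, 64
W1 = rng.normal(scale=0.1, size=(d_in, d1)); b1 = np.zeros(d1)
W2 = rng.normal(scale=0.1, size=(d1, d2)); b2 = np.zeros(d2)
W3 = rng.normal(scale=0.1, size=(d2, n_classes)); b3 = np.zeros(n_classes)

# Fixed random feedback matrices: they stand in for the transposed forward
# weights that backprop would use.
B1 = rng.normal(scale=1.0 / np.sqrt(d1), size=(n_classes, d1))
B2 = rng.normal(scale=1.0 / np.sqrt(d2), size=(n_classes, d2))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    ez = np.exp(z)
    return ez / ez.sum(axis=1, keepdims=True)

lr = 0.5
for step in range(500):
    # Ordinary forward pass.
    h1 = np.tanh(x @ W1 + b1)
    h2 = np.tanh(h1 @ W2 + b2)
    p = softmax(h2 @ W3 + b3)
    e = (p - onehot) / len(x)  # global error at the output

    # DFA backward: every hidden layer receives the *global* error projected
    # through its own fixed random matrix instead of the true gradient.
    delta2 = (e @ B2) * (1.0 - h2 ** 2)
    delta1 = (e @ B1) * (1.0 - h1 ** 2)

    W3 -= lr * h2.T @ e; b3 -= lr * e.sum(0)            # output layer: true local gradient
    W2 -= lr * h1.T @ delta2; b2 -= lr * delta2.sum(0)
    W1 -= lr * x.T @ delta1; b1 -= lr * delta1.sum(0)

print("train accuracy after DFA training:", (p.argmax(1) == y).mean())
```

The output layer is trained with its true local gradient, while each hidden layer receives the global error through its own fixed B_i, so no transposed forward weights (no weight transport) are needed, and the hidden-layer deltas can be computed in parallel once the output error is known.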

Comments: 73

  • @PM-4564 · 4 years ago

    12 seconds in and I'm already impressed by your French pronunciation skills

  • @theodorosgalanos9663 · 4 years ago

    Fascinating work, thanks for reviewing it Yannic! The random matrices reminded me of that alchemy in some metric learning models where you project triples through a random matrix and it still kind of works :)

  • @dmitrysamoylenko6775 · 4 years ago

    I watch all of these videos and I'm not even a data scientist, just a programmer. I think I can learn from them.

  • @firedrive45 · 3 years ago

    TLDR: Applying a random matrix transformation to a layer vector removes degeneracy of solution by being non-linear and having outputs separate more quickly than in linear transformations or low param count transformations.

  • @wyalexlee8578 · 4 years ago

    Thanks for this! The paper, explained!

  • @bluel1ng · 4 years ago

    This time not one but actually two related papers; you found an effective way to double the output! It feels a bit like magic that DFA works. Great to see basic research in the spotlight (after so many hype-bandwagon, number-pushing papers). Interesting that, due to the random projections, the weights of all neurons can be initialized with the same value (e.g. all zero when tanh is used), whereas BP needs some asymmetry/randomness in the initial weights for two neurons to become different. My first thought was that DFA needs layer normalization or some form of weight normalization, but it seems to work without it for simple tasks like MNIST.

  • @YannicKilcher · 4 years ago

    Yes, and it's entirely possible that if we find the correct normalizations etc. like we did for BP, we can push this pretty far.
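
A small, hedged check of the initialization point above (all sizes and names are invented): with all-zero weights and tanh, backprop's hidden-layer update is exactly zero because it is routed through the zero forward weights, while DFA's fixed random feedback still produces nonzero updates that differ per hidden unit, so the symmetry is broken.

```python
# Hedged check: with all-zero weights and tanh, backprop's hidden-layer update
# is exactly zero (it is routed through the zero forward weights), while DFA's
# fixed random feedback still produces nonzero, per-unit-distinct updates.
import numpy as np

rng = np.random.default_rng(1)
n, d_in, d_h, n_classes = 32, 10, 8, 3

x = rng.normal(size=(n, d_in))
y = rng.integers(0, n_classes, size=n)
onehot = np.eye(n_classes)[y]

W1 = np.zeros((d_in, d_h))                 # identical (zero) initialization
W2 = np.zeros((d_h, n_classes))
B1 = rng.normal(size=(n_classes, d_h))     # fixed random feedback for DFA

h = np.tanh(x @ W1)                        # all zeros
p = np.exp(h @ W2); p /= p.sum(1, keepdims=True)   # uniform predictions
e = (p - onehot) / n                       # global error, nonzero

delta_bp = (e @ W2.T) * (1.0 - h ** 2)     # backprop: uses W2^T, which is all zeros
delta_dfa = (e @ B1) * (1.0 - h ** 2)      # DFA: uses the random feedback instead

print("max |BP hidden delta| :", np.abs(delta_bp).max())   # 0.0 -> BP is stuck
print("max |DFA hidden delta|:", np.abs(delta_dfa).max())  # > 0 -> learning starts
# Each column (hidden unit) of the DFA update differs, so units become distinct.
print("per-unit DFA update norms:", np.linalg.norm(x.T @ delta_dfa, axis=0))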

  • @MrOlivm · 3 years ago

    I would love videos on topics/threads/ideas that are shared across papers. Also, a mention in your descriptions of the software you use to create content (drawing over papers as you read) would be great.

  • @jorenboulanger4347 · 4 years ago

    Much love and gratitude ε>

  • @dermitdembrot3091 · 3 years ago

    The idea that DFA induces clustering sounds good. However, the last layer really needs to capitalize on the clustered input and it's surprising that it apparently does, since it itself just gets a "random" feedback.

  • @CristianGarcia · 4 years ago

    Thanks Yannic! Question: Wouldn't DFA be a solution for the vanishing/exploding gradient problem? Is this mentioned anywhere?

  • @CristianGarcia · 4 years ago

    I know relu doesn't suffer from this, but what about e.g. sigmoid/tanh?

  • @YannicKilcher · 4 years ago

    It's not mentioned, but yes, it could be a remedy. I mean, the problem doesn't even arise with this, so yes :)

  • @tweak3871 · 3 years ago

    Your theory seems really intuitive to me: the idea is that we just need "a coordinate system" in which to communicate the errors. I've been trying to think about normal backprop in these terms, and if what you say is true, it basically means that in backprop we are potentially unnecessarily preserving the next layer's coordinate system just to communicate the error between the current layer and the next. As long as the error is ultimately preserved with respect to the final task, the whole system will learn. Super fascinating idea. I wonder if we can use a sparse coordinate system that is more biologically plausible.

  • @alexissalguero5751 · 4 years ago

    Set up a Patreon, man, and keep up the good work! It definitely helps.

  • @jadoo16815125390625 · 4 years ago

    There is a way to test your hypothesis. There are many ways to approximate random matrices B drawn from a Gaussian distribution, including sparse binary matrices! The key idea is that many other distributions, like the binomial or Poisson, approximate Gaussian distributions (shifted and scaled appropriately) if enough samples are drawn. So, if we replace the B matrices in DFA with sparse binary matrices sampled from binomial distributions and still get reasonable performance, it will strongly support your hypothesis. With sparse matrices, the weight updates will look like jagged coordinate-descent-type updates, with nothing in common with BP or DFA updates except the locality-preserving property of the matrix. We did some work in locality-sensitive hashing and showed that such sparse matrices provide amazing computational efficiency while preserving distances and angles: arxiv.org/abs/1812.01844

  • @YannicKilcher · 4 years ago

    Nice, thanks for the reference and the suggestion :)
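
A quick, non-authoritative way to check the geometric half of this suggestion: compare how well a dense Gaussian projection and a sparse {+1, 0, -1} projection (in the style of the Achlioptas construction cited further down this thread) preserve pairwise angles. This only probes the geometry, not DFA training itself; all sizes are arbitrary.

```python
# Compare angle preservation of a dense Gaussian random projection against a
# sparse sign projection (Achlioptas-style). Sizes are arbitrary; this is a
# geometry check only, not a DFA experiment.
import numpy as np

rng = np.random.default_rng(0)
d_high, d_low, n = 2000, 128, 50
X = rng.normal(size=(n, d_high))

# Dense Gaussian projection, scaled so norms are preserved in expectation.
G = rng.normal(size=(d_high, d_low)) / np.sqrt(d_low)

# Sparse sign projection: entries +1, 0, -1 with probabilities 1/(2s), 1-1/s,
# 1/(2s), rescaled by sqrt(s / d_low); s = 3 is Achlioptas' classic choice.
s = 3
S = rng.choice([1.0, 0.0, -1.0], p=[1 / (2 * s), 1 - 1 / s, 1 / (2 * s)],
               size=(d_high, d_low)) * np.sqrt(s / d_low)

def pairwise_cosines(A):
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    C = A @ A.T
    return C[np.triu_indices(len(A), k=1)]

base = pairwise_cosines(X)
for name, P in [("dense Gaussian", G), ("sparse sign", S)]:
    distortion = np.abs(pairwise_cosines(X @ P) - base)
    print(f"{name:14s} max cosine distortion: {distortion.max():.3f}")
# Both distortions stay small, i.e. the sparse matrix preserves angles about
# as well as the Gaussian one does.
```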

  • @PhucLe-qs7nx · 4 years ago

    My intuition on how random matrices can replace weight transport in BP: 1. Vectors in high-dimensional space are almost always (nearly) perpendicular. 2. Using the forward weights for BP gives you an absolute measure (the current output is 5 and the correct output is 7, so we need to reduce by 2 units). Because of 1), replacing the weight matrices with random matrices gives you a relative measure (I don't care where we're at, but we need to reduce by 2).

  • @YannicKilcher · 4 years ago

    Yes, but in half of the cases that "reduce by 2" would actually increase by 2, because you're rotated.
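
Point 1 above is easy to check numerically: the cosine similarity of two random high-dimensional vectors concentrates around zero, with spread on the order of 1/sqrt(d). A tiny sketch (dimensions and sample counts chosen arbitrarily):

```python
# Cosine similarity of random high-dimensional vectors concentrates around 0.
import numpy as np

rng = np.random.default_rng(0)
for d in (10, 100, 1000, 10000):
    u, v = rng.normal(size=(2, 500, d))                   # 500 random pairs
    cos = (u * v).sum(1) / (np.linalg.norm(u, axis=1) * np.linalg.norm(v, axis=1))
    print(f"d={d:6d}  mean |cos| = {np.abs(cos).mean():.4f}"
          f"  (expected ~ {np.sqrt(2 / (np.pi * d)):.4f})")
```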

  • @JoaoVitor-mf8iq · 4 years ago

    Could we use both feedback signals at the same time (in some way)? (E.g. sum the backpropagation from the last layer and the DFA of the current layer.)

  • @YannicKilcher · 4 years ago

    Sure, why not.

  • @nebularwinter · 4 years ago

    sorry for the noob question: what do you mean precisely when you say random matrices almost preserve angles and distances? what kind of stuff should I read to understand that?

  • @larrybird3729 · 4 years ago

    Think about those random matrices as reference points: if your reference point stays constant, you can use that point to understand something else.

  • @larrybird3729 · 4 years ago

    Take that with a grain of salt :)

  • @YannicKilcher · 4 years ago

    There is a class of random linear transformations that approximately preserves these quantities with high probability. Search for the Johnson-Lindenstrauss lemma.
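
For anyone following the Johnson-Lindenstrauss pointer, here is a small hedged demo using scikit-learn's random-projection utilities (toy data, arbitrary eps): a random Gaussian projection into the JL-prescribed dimension keeps all pairwise distances within roughly a factor of 1 ± eps.

```python
# Small demo of the Johnson-Lindenstrauss lemma with scikit-learn.
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.random_projection import (GaussianRandomProjection,
                                        johnson_lindenstrauss_min_dim)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10000))      # 100 points in 10,000 dimensions

eps = 0.25                             # allowed relative distortion
k = johnson_lindenstrauss_min_dim(n_samples=len(X), eps=eps)
print(f"JL bound: {k} dimensions suffice for distortion <= {eps}")

Y = GaussianRandomProjection(n_components=k, random_state=0).fit_transform(X)

ratio = pdist(Y) / pdist(X)            # pairwise distance ratios
print(f"distance ratios: min = {ratio.min():.3f}, max = {ratio.max():.3f}")
# The ratios stay within roughly (1 - eps, 1 + eps): distances are preserved.
```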

  • @KKara-fz6ib · 4 years ago

    Great review. Which software do you use for it?

  • @YannicKilcher · 4 years ago

    OneNote

  • @mattiasfagerlund · 4 years ago

    Wouldn't this technique also avoid the O(N^2) of transformers, and thus be way, way faster? Could it be used to pre-train the network and then have backprop fine-tune it?

  • @YannicKilcher · 4 years ago

    Yes, I mean technically, this could bypass anything, the question is just how well it performs :)

  • @herp_derpingson · 4 years ago

    I wish they showed some benchmarks of improved GPU performance. This certainly looks a lot more GPU friendly than backprop.

  • @varunnair7776 · 4 years ago

    Absolutely, this is a great point. In fact, I bet it's an order of magnitude or more, since the random B_i matrices are fixed at the start and we only need the gradient of the loss function with respect to the output.

  • @PhucLe-qs7nx · 4 years ago

    Lighton.ai is working on some kind of optical hardware; maybe this algorithm is more suited to their hardware than to a GPU, so they left it out.

  • @YannicKilcher · 4 years ago

    I thought so too, but then there's still forward prop, so it can't all be parallel.

  • @theodorosgalanos9663 · 4 years ago

    @YannicKilcher Could it be that you can do a sort of ensemble of backward passes and somehow approximate better, since that can be parallelized?

  • @bibiedf · 4 years ago

    Hey, author here. Showing benchmarks of improved GPU performance is complicated, because our implementation of DFA is not meant to be super-duper-fast, but rather easy to use and to retrieve info from (like alignment angles). The main issue is that the backward process is ingrained deep within ML libraries like Torch and TensorFlow. Hacking into it to make it do something completely different (such as DFA) is challenging. If you don't want to bother with rewriting a lot of the autograd system in C++/CUDA, you have to accept that your implementation will be less efficient and will still do things "the BP way" sometimes. @Phúc Lê is also right: at LightOn we are developing optical hardware that performs very large random projections much more rapidly than a GPU. You can put two and two together and guess why we are interested in DFA :). Also, @Theodoros Galanos and @Yannic Kilcher, there are ways to enable more parallelization with DFA. Because in classification problems the sign of the error is known in advance, you can directly update a layer with a modified, binarized DFA as soon as it has done its forward pass. See arXiv:1909.01311 for reference.
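
To make the "hacking into the backward" point concrete, here is one hedged way to wedge DFA into PyTorch without rewriting autograd internals: an identity op whose backward ignores the incoming gradient and instead returns the global error projected through a fixed random matrix. This is only a sketch of the idea; names like `_DFAHook` and `error_holder` are invented here, and the lightonai repository linked above may implement it differently.

```python
# Hedged sketch: inject DFA into PyTorch via a custom autograd.Function that
# is the identity in the forward pass and, in the backward pass, discards the
# incoming gradient and returns e @ B (global error times fixed random matrix).
import torch
import torch.nn as nn

class _DFAHook(torch.autograd.Function):
    @staticmethod
    def forward(ctx, activation, feedback, error_holder):
        ctx.feedback = feedback            # fixed random matrix (n_classes x width)
        ctx.error_holder = error_holder    # one-element list holding the global error
        return activation.view_as(activation)  # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        e = ctx.error_holder[0]            # stashed before loss.backward()
        return e @ ctx.feedback, None, None  # replaces the true upstream gradient

class DFAMLP(nn.Module):
    def __init__(self, d_in=20, width=64, n_classes=3):
        super().__init__()
        self.fc1 = nn.Linear(d_in, width)
        self.fc2 = nn.Linear(width, width)
        self.out = nn.Linear(width, n_classes)
        self.register_buffer("B1", torch.randn(n_classes, width) / width ** 0.5)
        self.register_buffer("B2", torch.randn(n_classes, width) / width ** 0.5)
        self.error_holder = [None]

    def forward(self, x):
        h1 = _DFAHook.apply(torch.tanh(self.fc1(x)), self.B1, self.error_holder)
        h2 = _DFAHook.apply(torch.tanh(self.fc2(h1)), self.B2, self.error_holder)
        return self.out(h2)

model = DFAMLP()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(128, 20), torch.randint(0, 3, (128,))

for _ in range(100):
    logits = model(x)
    loss = nn.functional.cross_entropy(logits, y)
    with torch.no_grad():
        # Global error e = (softmax - onehot) / batch, stashed so every
        # _DFAHook can read it during its backward call.
        e = torch.softmax(logits, dim=1)
        e[torch.arange(len(y)), y] -= 1.0
        model.error_holder[0] = e / len(y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Because the hook sits after the tanh, autograd still multiplies the random feedback by the local tanh derivative, so each hidden layer ends up with the DFA delta (e @ B_i) * f'(a_i) while the output layer keeps its true local gradient.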

  • @konghong3885 · 4 years ago

    Random matrices, hmmm... any possible combination with Bayesian methods (approximating the effect of the following layers)? It would be hard for an entire deep network, but for a pack of a few layers it's maybe doable (e.g. arxiv.org/pdf/1310.1867.pdf). (P.S. I'm just an undergrad student, so please forgive my ignorance.)

  • @YannicKilcher · 4 years ago

    I'm not sure there's a connection here, but maybe

  • @Notshife · 4 years ago

    For weight normalization, it has occurred to me that perhaps something very loosely aligned with the notion of conservation of energy could be applied to the weights. Something like: the input values that are summed for a neuron (usually weight x output of the previous layer) are proportional to the synaptic weight divided by the square root of the sum of the squares of the weights connected to that input, times the output. Do you know of any papers along this line of thinking? The science behind why I think it would be useful isn't very sound, except that in my mind it forces a feature to "become relevant", for reasons too long to explain in a youtube comment.

  • @YannicKilcher · 4 years ago

    There is "weight standardization" which goes on that direction. Also many initialization methods have properties like that and otherwise there are things like unitary neural networks. Not sure if any of those are exactly what you're describing

  • @Notshife · 4 years ago

    @YannicKilcher I will investigate, thanks.

  • @bengineer_the · 3 years ago

    I have been reading about the fact that a single soma can communicate with dual/multiple neurotransmitters. (In fact they may never act alone) This has been theorised to allow the necessary negative feedback at the synaptic junction. I found the subject to be called 'co-transmission' following on from a revised view of something called Dale's principle. :)

  • @YannicKilcher · 3 years ago

    Nice, thanks for the reference!

  • @oleksandrpopovych4841 · 4 years ago

    Man, do you have a Patreon? I would like to donate so you never stop! :)

  • @YannicKilcher · 4 years ago

    I will stop, with or without patreon 😁 but thanks

  • @TheKivifreak · 3 years ago

    @YannicKilcher :0

  • @PM-4564 · 4 years ago

    At 4:45: if you haven't already looked at Blake Richards' work, you should do so. He claims to have found a biologically plausible way in which (clusters of) neurons could do credit assignment, by "multiplexing" their signals (sending both signals on the same channel) to simultaneously propagate the forward signal and the backward error signal using different compartments in the neuron. Multiplexing: kzread.info/dash/bejne/c5qmt5SweLTUotY.html and credit assignment (modeling a cluster of neurons as the fundamental unit): kzread.info/dash/bejne/g2Vsw8qlnJDTgqg.html. At 21:00: I think Geoffrey Hinton somewhere has an explanation for how the forward weights adapt to the random backward weights, when talking about stacked autoencoders and temporal derivatives.

  • @YannicKilcher · 4 years ago

    Thanks a lot for these references, very much appreciated :)

  • @schollpiero · 4 years ago

    I don't know why the angles from different layers should be preserved, since there are non-linear activation functions between the layers.

  • @YannicKilcher · 4 years ago

    I don't think there are nonlinearities in the backward projection

  • @lohitkapoor1800 · 4 years ago

    This is an amazing channel. The rate at which you cover the latest papers on a daily basis is incredible. I also have one request: could you keep the background of the research-paper screen and your notes screen dark, with white letters for the papers and white/colorful letters for your notes? You could also use dark mode for your browser if you want, e.g. with the Nighteye extension.

  • @YannicKilcher · 4 years ago

    Sadly, research papers are still on white PDF pages. I like dark mode too, but I don't think it would work

  • @lohitkapoor1800 · 4 years ago

    @YannicKilcher Foxit Reader allows modifying the text and background color of a PDF.

  • @zikunchen6303 · 4 years ago

    Is the reason why the B matrices can preserve lengths and angles that they are Gaussian?

  • @bluel1ng · 4 years ago

    You might have a look at en.wikipedia.org/wiki/Restricted_isometry_property and en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_lemma ...

  • @bluel1ng · 4 years ago

    Or if you prefer a paper: "Database-friendly Random Projections" by Dimitris Achlioptas people.ee.duke.edu/~lcarin/p93.pdf also see: scikit-learn.org/stable/modules/random_projection.html

  • @DanielHesslow · 4 years ago

    Yeah, the elements being Gaussian is sufficient but not necessary. The Johnson-Lindenstrauss lemma and random projections are probably the keywords to search for if you want to know more.

  • @YannicKilcher · 4 years ago

    Thanks. Andreas is right. Gaussians are one of an entire class of random projections that preserve these quantities.

  • @zikunchen6303 · 4 years ago

    I see, thanks!

  • @Notshife · 4 years ago

    I have been really hoping you would cover this, this content is very sexy, thank you!

  • @patrickjdarrow · 4 years ago

    Can they really claim biological plausibility while using random matrices? Just because the technique doesn't follow BP's biological implausibilities doesn't mean that it is itself biologically plausible, no?

  • @YannicKilcher · 4 years ago

    Not per se, but it's more plausible than backprop

  • @not_a_human_being · 3 years ago

    It's a bit like building a mechanical horse... One of nature's criteria for evolution is continuously valid offspring; this heavily restricts the architecture.

  • @raguaviva · 4 years ago

    Ufff, the video only gets to DFA at minute 10 :/

  • @YannicKilcher · 4 years ago

    That's why there are timestamps.

  • @raguaviva · 4 years ago

    @YannicKilcher Oh! Thanks for that!!!

  • @JTMoustache · 4 years ago

    This is quite similar to policy gradients.

  • @snippletrap · 4 years ago

    Hebbian learning in spiking networks is biologically plausible (Numenta). Layer updates are not. It doesn't matter whether the layers are updated in series or parallel, since the architecture is wrong from the start. But so what? Backprop is just one way to implement the search for global minima. However the brain does it, the results are probably the same.
