A bio-inspired bistable recurrent cell allows for long-lasting memory (Paper Explained)

Science & Technology

Even though LSTMs and GRUs mitigate the vanishing and exploding gradient problems, they still have trouble learning to remember things over very long time spans. Inspired by bistability, a property of biological neurons, this paper constructs a recurrent cell with an inherent memory property, with only minimal modification to existing architectures.
OUTLINE:
0:00 - Intro & Overview
1:10 - Recurrent Neural Networks
6:00 - Gated Recurrent Unit
14:40 - Neuronal Bistability
22:50 - Bistable Recurrent Cell
31:00 - Neuromodulation
32:50 - Copy First Benchmark
37:35 - Denoising Benchmark
48:00 - Conclusion & Comments
Paper: arxiv.org/abs/2006.05252
Code: github.com/nvecoven/BRC
Abstract:
Recurrent neural networks (RNNs) provide state-of-the-art performance in a wide variety of tasks that require memory. This performance can often be achieved thanks to gated recurrent cells such as gated recurrent units (GRU) and long short-term memory (LSTM). Standard gated cells share a layer-internal state to store information at the network level, and long-term memory is shaped by network-wide recurrent connection weights. Biological neurons, on the other hand, are capable of holding information at the cellular level for an arbitrarily long amount of time through a process called bistability. Through bistability, cells can stabilize to different stable states depending on their own past state and inputs, which permits the durable storing of past information in the neuron state. In this work, we take inspiration from biological neuron bistability to equip RNNs with long-lasting memory at the cellular level. This leads to the introduction of a new bistable, biologically inspired recurrent cell that is shown to strongly improve RNN performance on time series which require very long memory, despite using only cellular connections (all recurrent connections are from neurons to themselves, i.e. a neuron's state is not influenced by the state of other neurons). Furthermore, equipping this cell with recurrent neuromodulation makes it possible to link it to standard GRU cells, taking a step towards the biological plausibility of GRU.
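
As a rough illustration of the idea described above, here is a minimal NumPy sketch of a BRC-style update step (the gate equations as I read them from the paper; parameter names, shapes, and the random initialization are purely illustrative, not the authors' code). The recurrent terms are element-wise, so each neuron feeds back only onto itself, and the a gate lies in (0, 2), where values above 1 can make a neuron bistable.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def brc_step(x, h_prev, U_a, U_c, U_x, w_a, w_c):
        # Bistability gate a in (0, 2): above 1, a neuron can sustain its own state.
        a = 1.0 + np.tanh(U_a @ x + w_a * h_prev)
        # Update gate c in (0, 1), as in a GRU.
        c = sigmoid(U_c @ x + w_c * h_prev)
        # Convex mix of the previous state and a candidate that feeds the
        # neuron's own state back through the a gate (element-wise only).
        return c * h_prev + (1.0 - c) * np.tanh(U_x @ x + a * h_prev)

    # Toy usage: 3 inputs, 4 hidden units, random parameters.
    rng = np.random.default_rng(0)
    n_in, n_hid = 3, 4
    U_a, U_c, U_x = (rng.normal(size=(n_hid, n_in)) for _ in range(3))
    w_a, w_c = rng.normal(size=n_hid), rng.normal(size=n_hid)
    h = np.zeros(n_hid)
    for t in range(10):
        h = brc_step(rng.normal(size=n_in), h, U_a, U_c, U_x, w_a, w_c)
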
Authors: Nicolas Vecoven, Damien Ernst, Guillaume Drion
Links:
YouTube: / yannickilcher
Twitter: / ykilcher
Discord: / discord
BitChute: www.bitchute.com/channel/yann...
Minds: www.minds.com/ykilcher

Comments: 67

  • @auridiamondiferous · 4 years ago

    This reminds me of logic circuits! The bistable part: keep the high value until a low enough value is detected, keep the low value until a high enough value is detected. This is exactly a Schmitt trigger: in electronics, a Schmitt trigger is a comparator circuit with hysteresis, implemented by applying positive feedback to the noninverting input of a comparator or differential amplifier.
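
    (As an editorial aside, a minimal sketch of the hysteresis behaviour described above; the thresholds and inputs are arbitrary:)

        def schmitt(x, state, low=0.3, high=0.7):
            # Hysteresis: the output flips high only when the input rises above
            # `high`, and flips low only when it falls below `low`.
            if state and x < low:
                return False
            if not state and x > high:
                return True
            return state

        out, trace = False, []
        for x in [0.1, 0.5, 0.8, 0.6, 0.4, 0.2, 0.5, 0.9]:
            out = schmitt(x, out)
            trace.append(out)
        print(trace)  # [False, False, True, True, True, False, False, True]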

  • @FreddySnijder-TheOnlyOne · 3 years ago

    What I find interesting is that the BRC setup will be much faster than an LSTM/GRU, because it only uses element-wise multiplications; it can also be parallelised more easily.
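
    (A rough illustration of the cost difference being pointed out, with arbitrary sizes; note that the input-to-hidden weights are still full matrices in a BRC, only the recurrent connections are per-unit:)

        import numpy as np

        n = 512
        h = np.random.randn(n)
        W = np.random.randn(n, n)   # full recurrent matrix (LSTM/GRU style)
        w = np.random.randn(n)      # one recurrent weight per unit (BRC style)

        full = W @ h      # O(n^2) multiply-adds, couples every unit to every other
        cellwise = w * h  # O(n) multiplies, each unit feeds back only onto itself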

  • @herp_derpingson · 4 years ago

    It would be funny if two papers down the line, it reclaims the crown from GPT. What a time to be alive.

  • @YannicKilcher · 4 years ago

    Haha, yeah. Though I think just because it can remember things well, that's not the same as the ability to attend to any token in the sequence. I guess the two have different strengths.

  • @PatrickOliveras · 4 years ago

    @@YannicKilcher Yeah, this seems to be more of a longer and stronger short-term memory. Perhaps it will have stronger implications in reinforcement learning?

  • @cwhy · 4 years ago

    @@YannicKilcher And attention would be kind of cheating if the purpose is the same as this paper's, because it lets in all the information at once.

  • @revimfadli4666 · 4 years ago

    @@PatrickOliveras For longer memory (and more biomimicry), maybe it can be combined with differentiable plasticity?

  • @kazz811 · 4 years ago

    Stellar walk through of a very nice paper! Thanks for doing these. Also, great job explaining the GRU by writing a diagrammatic version of it.

  • @alelasantillan · 4 years ago

    Amazing explanation! Thank you!

  • @mehermanoj45 · 4 years ago

    Damn man! A video every day 👌

  • @patrickjdarrow · 4 years ago

    Interesting that the presence of BRCs highlights the trade-off of long-term memory. It feels like a better analog for the function 'I' may be learned.

  • @mrityunjoypanday227 · 4 years ago

    Interesting to see the use in RL, replacing LSTMs.

  • @sphereron · 4 years ago

    Yes, many environments can require long term memory. Supervised problems not as often.

  • @revimfadli4666 · 4 years ago

    @@sphereron Especially in complex, stochastic environments where storing all inputs ever would be inefficient, in contrast to tasks like NLP or image processing, where all inputs are already accessible in memory.

  • @priyamdey3298 · 4 years ago

    One thing I would like to know: How do you keep track of the new papers coming in? Do you keep an eye on the sites everyday?

  • @SachinSingh-do5ju · 4 years ago

    Yeah we all want to know, what's the source?

  • @ulm287 · 4 years ago

    Twitter? just follow ML bots

  • @videomae6519 · 4 years ago

    Maybe arXiv

  • @user93237 · 4 years ago

    arXiv-sanity, Twitter accounts by various researchers, r/machinelearning

  • @YannicKilcher · 4 years ago

    Craigslist

  • @tsunamidestructor · 4 years ago

    Maybe I'm in the minority here but I really want to see LSTMs/GRUs outperform GPT-x models

  • @angrymurloc7626 · 4 years ago

    If history tells us anything, scalability will win over intelligent design.

  • @revimfadli4666 · 4 years ago

    Perhaps with some Turing machine modifications....

  • @NicheAsQuiche · 4 years ago

    I feel that too, but I don't think it makes computational sense. Like, why would only seeing one thing at a time, in one order, and having to remember it all, work better than being able to attend to all of it in parallel? I think the reason for our hope is that recurrence most likely better resembles what happens in humans, and we don't like the thought of designing something that works better (I know GPT-x is nowhere near human intelligence in language, but future improvements on transformers may make it so).

  • @revimfadli4666 · 4 years ago

    @@NicheAsQuiche For tasks where all input data are available at once (NLP, image processing, etc.), large-scale parallelization like that might work better using GPUs/TPUs/etc., but for autonomous agents, reinforcement learning and the like in a stochastic and complex environment, storing inputs like that would be inefficient compared to having the NN "compress" all that data (which LSTMs and the like already do).

  • @bluel1ng · 4 years ago

    Fantastic sound quality! Nice explanation of gating in unrolling / BPTT. Regarding the linear feedback simplification: you claim that it would self-stabilize over time, but if f(V_post) were a*V_post with a > 2, this would "explode" with zero input (an IIR filter with one delay unit is not always stable).
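
    (A tiny numerical sketch of the general point about one-tap linear feedback with zero input; the gains below are arbitrary:)

        def one_tap(a, h0=1.0, steps=20):
            # Iterate h_t = a * h_{t-1} with zero input.
            h = h0
            for _ in range(steps):
                h = a * h
            return h

        print(one_tap(0.9))  # ~0.12: |a| < 1 decays towards zero
        print(one_tap(2.5))  # ~9.1e7: |a| > 1 explodes even with no input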

  • @YannicKilcher · 4 years ago

    yes, very true. I was just thinking of the simplest case

  • @bosepukur · 4 years ago

    wonderful video

  • @harshpathak1247 · 4 years ago

    Nice overview, and nice paper! This seems similar to pitchfork bifurcations encountered when performing optimization via continuation methods. Hope these methods continue to explain more about deep learning optimization.

  • @damienernst5758 · 4 years ago

    The nBRC cell indeed undergoes a pitchfork bifurcation at a = 1; see the Appendix of arxiv.org/abs/2006.05252 for more details.

  • @harshpathak1247 · 4 years ago

    Thanks, I have been following this topic closely. Here is a list of papers on the dynamics of RNNs: github.com/harsh306/awesome-nn-optimization#dynamics-bifurcations-and--rnns-difficulty-to-train

  • @MrZouzan · 4 years ago

    Thanks!

  • @TheThirdLieberkind · 4 years ago

    This is so interesting. I wonder how the research on the math and functions in biological neurons is done. It really sounds like the brain does actual number crunching and handles signals through known mathematical functions, like we do with computers. There might be a lot we can learn from biology in machine learning research.

  • @004307ec · 4 years ago

    Then you might want to look up spiking neural networks (SNNs).

  • @n.lu.x · 4 years ago

    I would say it's the other way around. We use math as a language to describe the world around us. It's just that now we want to use it to model learning and intelligence, and math is the best tool/language to do that. What I'm getting at is that the brain doesn't necessarily do number crunching; we describe it as such because that's the closest we can get to modeling how it works.

  • @ekstrapolatoraproksymujacy412 · 4 years ago

    In the GRU they use, the signal from h_{t-1} is multiplied by the weight matrix before going through the reset gate; it's usually the other way around. If that's the case, that weight matrix can potentially have values that amplify h_{t-1} enough to get this positive feedback and bistable behaviour from a normal GRU with a standard sigmoid in the reset gate. And you missed that they get rid of this weight matrix completely in their BRC and add h_{t-1} without any processing (besides that 1+tanh reset gate) to the output tanh.

  • @YannicKilcher · 4 years ago

    Thanks for the clarifications

  • @ekstrapolatoraproksymujacy412 · 4 years ago

    @@YannicKilcher I checked their code and they use the default implementation of the Keras GRUCell, which has a "reset_after" argument that controls whether the reset gate is applied after or before the matrix multiplication. I changed it so that the gate is applied before the matmul, and it is now running their benchmark 3 on MNIST. Of course training is painfully slow, so time will tell...

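    (For anyone who wants to try the same thing, a minimal sketch of toggling that flag in tf.keras; layer sizes and inputs below are arbitrary. reset_after=True applies the reset gate after the recurrent matrix multiplication, reset_after=False applies it before:)

        import tensorflow as tf

        # Reset gate applied after the recurrent matmul (TF2/Keras default).
        gru_after = tf.keras.layers.GRU(64, reset_after=True)
        # Reset gate applied before the recurrent matmul (the original formulation).
        gru_before = tf.keras.layers.GRU(64, reset_after=False)

        x = tf.random.normal((8, 100, 32))              # (batch, time, features)
        print(gru_after(x).shape, gru_before(x).shape)  # (8, 64) (8, 64)
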
  • @Claudelu · 4 years ago

    A good question would be: what is a good source of new or interesting papers? We all want to know where you find these amazing papers!

  • @sagumekishin5748 · 4 years ago

    Maybe we can apply network architecture search here to find a good recurrent cell.

  • @darkmythos4457 · 4 years ago

    Thanks, very interesting. In case you are reading this, I am going to suggest a nice ICML 2020 paper: "Fast Differentiable Sorting and Ranking".

  • @SlickMona · 4 years ago

    At 36:33 - why would nBRC get *better* with higher values of T?

  • @NicheAsQuiche · 4 years ago

    I'm confused about that too. Maybe I'm not getting something, but it looks like in all the benchmarks that model performs better with longer sequences, as if somehow making the task harder makes it easier for it.

  • @YannicKilcher · 4 years ago

    I don't think the T changes. Just the N that specifies where the information is

  • @EditorsCanPlay · 4 years ago

    here we go again

  • @clivefernandes5435 · 4 years ago

    So if we have very long sentences or paragraphs, these will perform better than LSTMs, right?

  • @YannicKilcher · 3 years ago

    Maybe, I guess that's up for people to figure out

  • @marat61 · 4 years ago

    Why did the LSTM perform so much worse than the GRU starting at T = 50?

  • @YannicKilcher · 4 years ago

    who knows, it's more complicated

  • @patrickjdarrow · 4 years ago

    At 24:00, "...a biological neuron can only feed back onto itself". What is being referenced here? Surely not synapses

  • @YannicKilcher · 4 years ago

    I think they mean this bistability mechanism

  • @patrickjdarrow · 4 years ago

    @@YannicKilcher makes much more sense.

  • @grafzhl · 4 years ago

    Tried my hand at an experimental PyTorch implementation: github.com/742617000027/nBRC/blob/master/main.py

  • @aBigBadWolf · 4 years ago

    In their source code, the experiments all use truncated backprop at 100 steps. How does this learn with more than 100 padding symbols?

  • @supernovae34 · 4 years ago

    Hi! Author here. The backpropagation through time is actually done over all time-steps of the time series. The parameter you are talking about is actually not used in the model and is a left-over of old code; I apparently forgot to remove it when cleaning the code. Nice catch, it would obviously be impossible to learn anything on these time series with such settings. Sorry for the confusion! (I updated the code)

  • @aBigBadWolf · 4 years ago

    @@supernovae34 Thanks for the info. Which TensorFlow version is this code for? It would be helpful if you'd add such reproducibility details to the README.

  • @supernovae34 · 4 years ago

    @@aBigBadWolf It's on TensorFlow 2. Indeed, I plan on doing a clean README and probably cleaning the code a little more. I haven't had the time yet but will make sure to have it done in the very near future!

  • @aBigBadWolf · 4 years ago

    @@supernovae34 I ran benchmark 3 for three of the models with zs 300. The LSTM resulted in NaNs after 30% accuracy. BRC is stuck at 10%. nBRC achieved 94% accuracy (after 20k steps). The paper doesn't mention any instability issues. Additionally, this first run of BRC is not at all where the mean and std of table 4 would indicate, which raises the question: how fair is the comparison in table 4, really?

  • @supernovae34 · 4 years ago

    @@aBigBadWolf Hi, sorry to hear that. This is weird, I didn't run into such issues. For the nBRC, 94% accuracy is rather normal; it should gain a few more percent in the next 10k steps. However, I am rather surprised about the BRC, as it proved to work well over three runs (and even more which were done when testing the architecture before the final three runs). Did you run it for long enough? One thing we noticed on this particular benchmark is that the training curve of BRC has a bit of a "saw-tooth" shape. I am also quite surprised about the LSTM, as we never saw it learn anything (on the validation set!). One thing that might be worth noting, though, is that we saw GRU overfit benchmark 2 with "no-end" set at 200; that is, it achieved a loss of 0.6 on the training set, but a loss of 1.2 on the test set. Are you sure that the 30% accuracy of the LSTM is on the validation set? Have you had any problems recreating the results for the other benchmarks?

    I will upload today a self-contained example for benchmark 2 (as I already did for benchmark 1). Both these scripts should give pretty much exactly the same results as those presented in the paper. Note that benchmark 2 requires a "warming-up" phase for the recurrent cells to start learning, so it is normal for the loss to not decrease before 13 to 20 epochs (variance). Once benchmark 2 is done (currently running to make sure the new script using the Keras sequential model gives the same results as those presented in the paper), I will do the same with benchmark 3. This will result in three self-contained scripts, which should be much cleaner than what is currently available; sorry for the inconvenience.

    Also, come to think of it, we should probably have shown more learning curves for benchmarks 2 and 3, which would have given more insight than just the convergence results (it might also have answered some of your questions); unfortunately we lacked room in the paper to do so. Finally, we did try to be as fair as possible and our goal was never to crunch small percentages here and there, we just wanted to highlight a general behaviour!

  • @DavenH · 4 years ago

    Ahh, bi-stable. I kept reading it as though it would rhyme with 'listable' and didn't know what the heck that word meant.

  • @snippletrap · 4 years ago

    For the gold standard in biologically plausible neural modeling, check out the work by Numenta.

  • @user-kw9cu · 4 years ago

    ah yes 5Head
