Do we need Attention? - Linear RNNs and State Space Models (SSMs) for NLP

Science & Technology

(A more recent version covering Mamba: • Do we need Attention? ... )
A talk for MLSys surveying recent methods that use linear RNNs and State Space Models to replace attention in transformer-style models.
Slides: github.com/srush/do-we-need-a...
This talk predates the work on Mamba, but covers foundational preliminaries. Mamba version coming soon.

Comments: 29

  • @jasonzhai2584
    5 months ago

    I'm an MSc student who is new to deep learning and has only just heard of SSMs; without being taught these at school, I really struggled to get my head around these concepts at first. This introduction is ABSOLUTELY AMAZING! In just 40 minutes the material is presented in such a concise yet information-rich way that it is understandable even for a newbie like me. I am confident this video paves the way for me to understand more advanced papers on the topic. Thank you!

  • @varunsaagars
    7 months ago

    🎯 Key Takeaways for quick navigation:
    00:00 🤖 This talk explores whether we need attention mechanisms in neural networks for natural language processing.
    01:06 🧠 Transformers use attention layers to compute interactions between components, which becomes expensive for long sequences.
    04:32 ⏳ Transformer models face limitations from their quadratic dependency on sequence length, affecting both training and generation speed.
    07:04 🌐 Researchers are exploring alternatives to attention mechanisms in neural networks for NLP.
    11:30 🔄 Linear RNNs are more efficient than traditional RNNs for sequence tasks.
    15:52 💡 Linear RNNs can be computed efficiently using techniques like Fourier transforms or associative scans, making them faster to train (a small sketch of this idea follows this comment).
    20:58 📊 Continuous-time State Space Models (SSMs) are used to explore different parameterizations of linear RNNs, allowing effective long-range sequence modeling.
    23:18 🏆 Linear RNNs with SSM parameterizations have shown promising results in various machine learning tasks, including language modeling and natural language processing.
    27:00 🧠 Linear RNNs and SSMs can effectively handle the routing components in transformer-style models, simplifying their structure.
    27:40 📊 When fine-tuning linear RNN-based models for tasks involving long-range sentence matching, the kernels learn to look at longer ranges of information, adapting their coefficients accordingly.
    28:23 📈 Hybrid models combining attention layers with linear RNNs have shown improved perplexity compared to open-source Transformer models with a similar number of parameters.
    30:44 🧩 Researchers have explored simpler parameterizations for linear RNNs, such as a diagonal form or damped exponential moving averages, achieving good results on long-range tasks.
    32:44 🔄 A newer approach called RWKV combines linear RNNs into an efficient model inspired by Transformer attention, potentially competing with large-scale Transformers.
    34:16 💡 Scaling linear RNNs up to larger models, such as a 14-billion-parameter language model, shows promise for competing with Transformers on zero-shot prediction tasks.
    36:22 🛠️ Challenges in adopting linear RNNs in practice include support for complex numbers, efficient Fourier transforms, numerical stability, and the need for system-level improvements.
    39:06 📣 While attention remains dominant in NLP and deep learning, exploring alternative approaches, developing new algorithms, and building communities for scaling are essential for future innovations.
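
A minimal sketch of the 15:52 takeaway above (my own illustration, not code from the talk; the scalar parameters a and b are toy values, whereas an SSM learns matrices A and B): unrolling the linear recurrence x_t = a*x_{t-1} + b*u_t turns it into a causal convolution with the kernel [b, a*b, a^2*b, ...], so training can run in parallel while generation can still run one step at a time.

```python
# Toy check that a scalar linear RNN and its unrolled convolution form agree.
import numpy as np

def linear_rnn_sequential(a, b, u):
    # Step-by-step recurrence x_t = a*x_{t-1} + b*u_t (how you'd generate at inference).
    x, xs = 0.0, []
    for u_t in u:
        x = a * x + b * u_t
        xs.append(x)
    return np.array(xs)

def linear_rnn_as_convolution(a, b, u):
    # Same outputs computed all at once: x_t = sum_k a^k * b * u_{t-k}, i.e. a causal
    # convolution with the kernel [b, a*b, a^2*b, ...] (which an FFT can speed up).
    L = len(u)
    kernel = b * a ** np.arange(L)
    return np.convolve(u, kernel)[:L]

u = np.random.randn(16)
assert np.allclose(linear_rnn_sequential(0.9, 0.5, u),
                   linear_rnn_as_convolution(0.9, 0.5, u))
```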

  • @tharunbhaskar6795
    7 months ago

    Hands down, the best video I found after all my searching. I was trying to get into S4, Mamba, linear RNNs and the like, but most of the videos I visited were very difficult to understand. This one made a lot of sense, and I'm looking forward to more videos like it.

  • @srush_nlp
    7 months ago

    Thanks! Working on some follow-ups.

  • @wentropia
    7 months ago

    Very good! I was really lost in the Mamba paper, but now I understand a little. Thank you!

  • @srush_nlp
    7 months ago

    That's great to hear. I hope to add a Mamba version soon as well.

  • @kevon217
    7 months ago

    Incredible walkthrough. Appreciate the time taken to explain simply and succinctly.

  • @A2ATemp
    1 year ago

    Keep up the good work!

  • @AI_Financier
    7 months ago

    Somehow similar to the Rocket method. Thanks for the clear explanation.

  • @ninefates9882
    6 months ago

    In early 2024, your side of the bet looks a lot better than it did a mere 6 months ago. 😀

  • @dannyleybzon26
    5 months ago

    Wow, this is the best explanation I've seen! Is there a Mamba-specific one coming out? :D

  • @srush_nlp
    5 months ago

    Yes, but it's a lot to learn!

  • @kimchi_taco
    7 months ago

    Thanks for the kind introduction to SSMs. However, I'm not fully convinced, because:
    1. It learns static routing (27:00), which seems likely to overfit the training data. Do you think it generalizes well enough to out-of-distribution data, the way GPT-4 does?
    2. An SSM compresses the memory of the entire history into one latent vector. Can it provide all the relevant information to future tokens?
    3. LSTM was introduced to resolve the vanishing gradient issue that arises when the RNN's matrix is multiplied L times (eigenvalues explode or vanish). The SSM re-introduces that matrix recurrence, so doesn't it have a gradient vanishing issue?

  • @MartinBlais
    5 months ago

    Sasha, this was excellent. Thank you. Just a note that the volume was barely audible; it would be useful to normalize the audio levels before uploading. Thanks again.

  • @srush_nlp
    5 months ago

    Sorry! I was still learning how to do this at that point. Later videos are better.

  • @thomasjohnson4842
    7 months ago

    Great talk! You said "This talk predates the work on Mamba, but covers foundational preliminaries. Mamba version coming soon." Any word on when that video will be out? I'm also interested in RWKV v6.

  • @srush_nlp
    7 months ago

    It's at the top of my to-do list.

  • @tharunbhaskar6795
    7 months ago

    I'm looking forward to this.

  • @ln2deep
    6 months ago

    Is the performance of the linear models in Dao et al. better because dropping attention frees up additional parameters for training, alongside reasonable long-distance modelling? We are losing some of the completeness of attention, so it's surprising that the perplexity would be lower. I suppose it could also be that they need less data to learn reasonable approximations, so maybe they are more data efficient?

  • @randomman5188
    11 months ago

    Attention is all you need!

  • @simonl1938
    6 months ago

    What does the C HiPPO matrix look like? Is it learned?

  • @icant1112
    9 months ago

    😘

  • @simonl1938
    6 months ago

    How does the state in the SSM not explode in size without an activation function?

  • @user-xx9nt7zm8t
    7 months ago

    Why can't we form kernels using non-linear RNNs?

  • @srush_nlp
    7 months ago

    Because in the non-linear RNN, unrolling the recurrence x_t = relu(A x_{t-1} + B u_t) gives relu(A relu(A x_{t-2} + B u_{t-1}) + B u_t), and the relu doesn't let us rearrange the terms into either an associative or a kernel form. Basically, the relu breaks the algebra that lets us rearrange things.
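
A small numerical check of the answer above (my own illustration, not from the talk; the matrix sizes and random values are made up): two linear steps x -> A x + B u compose into a single step of the same affine form, which is exactly the rearrangement that associative scans and convolution kernels exploit, and which a relu blocks.

```python
# Composing two linear RNN steps collapses into one affine map (matrix A@A,
# offset A@B@u1 + B@u2); with a relu in between there is no such closed form.
import numpy as np

rng = np.random.default_rng(0)
d = 4
A, B = 0.3 * rng.normal(size=(d, d)), 0.3 * rng.normal(size=(d, d))
x0, u1, u2 = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)

x2_stepwise = A @ (A @ x0 + B @ u1) + B @ u2          # run the recurrence twice
x2_combined = (A @ A) @ x0 + (A @ B @ u1 + B @ u2)    # one pre-combined affine step
assert np.allclose(x2_stepwise, x2_combined)

# relu(A @ relu(A @ x0 + B @ u1) + B @ u2) cannot be rewritten as a single
# (matrix, offset) pair that works for every x0, so the scan/kernel trick is lost.
```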

  • @vertovitch9989
    7 months ago

    Nice Freudian slip on slide 13 ;)

  • @vertovitch9989
    7 months ago

    "On January 1, 2027, an Attention-based model will be state-of-the-art in natural language processing."

  • @PerFeldvoss
    7 months ago

    Did we really ask James, really?
