Mamba - a replacement for Transformers?

Mamba is a new neural network architecture proposed by Albert Gu and Tri Dao.
Timestamps:
00:00 - Mamba - a replacement for Transformers?
00:19 - The Long Range Arena benchmark
01:20 - Legendre Memory Units
02:07 - HiPPO: Recurrent Memory with Optimal Polynomial Projections
02:38 - Combining Recurrent, Convolutional and Continuous-time Models with Linear State-Space Layers
03:28 - Efficiently Modeling Long Sequences with Structured State Spaces (S4)
05:46 - The Annotated S4
06:13 - Mamba: Linear-Time Sequence Modeling with Selective State Spaces
07:42 - Motivation: Why selection is needed
09:59 - S5
12:00 - Empirical evaluation
The paper can be found here: arxiv.org/abs/2312.00752
Topics: #mamba #foundation
References for papers mentioned in the video can be found at
samuelalbanie.com/digests/202...
For related content:
- Twitter: / samuelalbanie
- personal webpage: samuelalbanie.com/
- YouTube: / @samuelalbanie1

Comments: 166

  • @shiholololo1053 · 6 months ago

    Stanford labs are thriving right now. To think all this work is made OPEN-SOURCE in a period of hostile and fierce competition among the big tech companies.

  • @nikoladjordjevic4477 · 6 months ago

    The original Transformers were open-sourced by Google. Also, GPT and GPT-2 were open source. This is no surprise to those in the community.

  • @user-qp1jh5vm8m · 6 months ago

    2 Timothy 3:16, New World Translation of the Holy Scriptures (Study Edition): "All Scripture is inspired of God and beneficial for teaching, for reproving, for setting things straight, for disciplining in righteousness."

  • @patrickangel4880 · 6 months ago

    Like a knife, a weapon available to everyone is not a weapon anymore; it's just a mere tool... #hail_to_the_open_source_and_public_research

  • @peterbennett2301 · 6 months ago

    Is not Mathematics the language of God?

  • @dezh6345 · 5 months ago

    @nikoladjordjevic4477 Those companies all turned closed source once money got involved.

  • @qwerasdliop2810 · 6 months ago

    Insane, I loved the way you went through multiple important prior papers before talking about Mamba!

  • @looksintolasers · 5 months ago

    Depth-first search of the dependency tree of papers :)

  • @adamshaw46 · 6 months ago

    I really, really like the build-up of ideas through papers; it's a great way to introduce the idea while giving references that we can look up and trace ourselves. Coming onto the scene with no context of the last few years of research, it provides a neat overview.

  • @mkamp · 5 months ago

    Absolutely fantastic. Personally, I would be happy to watch a much longer video: same structure, just slower and broken down a bit more. This is not a complaint. The video is awesome as it is. Just feedback.

  • @life_of_ccb · 6 months ago

    Thank you for such a good survey of the prior work! Your effort is noted and appreciated!

  • @SamuelAlbanie1 · 6 months ago

    Much appreciated!

  • @Rojfos · 6 months ago

    That's really high-quality content. I also really like the way you highlight the text when you read over it; this makes it easier to follow along!

  • @SamuelAlbanie1 · 6 months ago

    Thanks!

  • @Fritz0id · 6 months ago

    Thanks for this, I feel caught up again! I've seen several papers popping up with alternatives to the transformer architecture, but I lacked a framework to grok them. The way you put this paper in a broader context, both in terms of the Long Range Arena benchmark and the emphasis on "no free lunch" with respect to LTI vs SSM, was really helpful.

  • @triplanetary · 6 months ago

    Can you send some links to those papers on alternative transformer architectures?

  • @BradNeuberg · 6 months ago

    Always appreciate your excellent video explanations of cutting edge papers, thanks!

  • @SamuelAlbanie1 · 6 months ago

    Thanks!

  • @xyh6552 · 6 months ago

    The technique of solving long-term memory problems using polynomial projection is somewhat similar to using the FFT for multiplication. Essentially, both methods use highly efficient information representations with almost orthogonal channel capacity to represent the original information.
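A minimal numpy sketch of the polynomial-projection idea discussed in this thread: compressing a long 1-D history into a handful of Legendre coefficients and approximately reconstructing it. This only illustrates why a fixed-size polynomial basis can act as an approximate memory; it is not the HiPPO update rule from the papers, and the test signal, the state size N, and the use of numpy.polynomial.legendre are illustrative choices.

```python
# Compress a long 1-D signal into a few Legendre coefficients and reconstruct it.
import numpy as np
from numpy.polynomial import legendre as L

T = 1000                      # length of the observed history
t = np.linspace(-1, 1, T)     # Legendre polynomials live on [-1, 1]
signal = np.sin(6 * t) + 0.3 * np.sin(25 * t)

N = 16                                     # size of the compressed "state"
coeffs = L.legfit(t, signal, deg=N - 1)    # project onto the first N Legendre polynomials
recon = L.legval(t, coeffs)                # reconstruct the history from N numbers

err = np.sqrt(np.mean((signal - recon) ** 2))
print(f"compressed {T} samples into {N} coefficients, RMSE = {err:.4f}")
```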

  • @krox477 · 5 months ago

    I don't understand anything.

  • @johnsherfey3675 · 5 months ago

    Yeah, but only big math heads will actually ever fully understand it.

  • @pierrekinbrand · 5 months ago

    Ironically, many of the ML concepts in the video went over my head, but this Fourier analogy was more approachable for me.

  • @MeanGeneHacks · 6 months ago

    Hope the open source community builds on this

  • @dinoscheidt · 6 months ago

    Well, get on it. The open source community is also 🫵

  • @ItsRyanStudios · 6 months ago

    WE are the open source community ☺️

  • @rrestoring_faith · 6 months ago

    The authors already keep their code open source so the work is replicable. It's common practice in ML research.

  • @borregoayudando1481 · 6 months ago

    All you need is Mambas?

  • @rjarpa · 6 months ago

    Except for GPT-3 and 4 XD @rrestoring_faith

  • @SethuIyer95 · 6 months ago

    The crux of this network's performance lies in the fact that they use the coefficients of Legendre polynomials as a basis, which allows the information to be highly compressed with minimal information loss. Thinking about sequence memory this way moves away from iterative or recursive processing to a more holistic, algebraic form of memory management.

  • @xyh6552 · 6 months ago

    In line with your viewpoint, this work is actually similar to using the FFT for n-bit multiplication.
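A small sketch of the FFT-multiplication analogy raised in this thread: treat the digits of two integers as polynomial coefficients, multiply in the Fourier domain, then propagate carries. Purely illustrative (double-precision numpy FFT, small inputs); real big-integer libraries handle precision and carrying far more carefully.

```python
import numpy as np

def fft_multiply(a: int, b: int) -> int:
    # Digits as polynomial coefficients, least-significant digit first.
    da = [int(d) for d in str(a)][::-1]
    db = [int(d) for d in str(b)][::-1]
    n = 1
    while n < len(da) + len(db):
        n *= 2
    # Pointwise product in the Fourier domain == coefficient convolution.
    conv = np.rint(np.fft.irfft(np.fft.rfft(da, n) * np.fft.rfft(db, n), n)).astype(int)
    carry, digits = 0, []
    for c in conv:                      # propagate carries back into base 10
        carry, d = divmod(int(c) + carry, 10)
        digits.append(d)
    while carry:
        carry, d = divmod(carry, 10)
        digits.append(d)
    while len(digits) > 1 and digits[-1] == 0:
        digits.pop()
    return int("".join(map(str, reversed(digits))))

print(fft_multiply(123456789, 987654321) == 123456789 * 987654321)   # True
```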

  • @christophkogler6220 · 6 months ago

    @xyh6552 I think it basically is a high-dimensional FFT that's tracking location in the model's similarly high-dimensional memory/association space. It should provide near-perfect representation, recall, and higher efficiency for recurrent networks.

  • @derghiarrinde · 6 months ago

    U lost me at "legendre"

  • @SethuIyer95 · 6 months ago

    @xyh6552 Yep, the FFT is on a Fourier basis; this is using a Legendre basis.

  • @xyh6552 · 6 months ago

    @christophkogler6220 Similar to your viewpoint, from the perspective of solving the Kakeya conjecture in finite fields, I believe the main idea is to utilize the rigidity of polynomials to achieve efficient compression. I speculate that the effect of utilizing the relationship between polynomials and roots in polynomial splitting fields is essentially replacing one "n" in the complexity with "log n".

  • @Dart_ilder · 3 months ago

    I liked this video so much that I reached for the like button 3 times while watching it. Awesome context on S4. This is extremely helpful for getting the context and stripping the hype to get to the meaning. That's definitely a sub, and I am off to watch all the other videos.

  • @johnny02199 · 6 months ago

    Thanks for the video; would love to have a more detailed explanation based on the related works covered before!

  • @freedom_aint_free · 6 months ago

    Amazing work! Keep 'em coming!

  • @SamuelAlbanie1 · 6 months ago

    Thanks, will try!

  • @alileevil · 6 months ago

    Honestly, how do you make sense of these papers? I've listened to the whole video and still haven't got a clue what it is about. There are quite a lot of brilliant people out there doing work like this.

  • @TobiasWeg · 6 months ago

    Very interesting and well explained. Thanks a lot.

  • @XAheli · 6 months ago

    Keep these coming! Great video.

  • @drayg0n806 · 6 months ago

    I noticed that @havenhq had tuned a chat version of the pretrained Mamba-2.8B on Hugging Face. I played with it on Colab and it feels like a decent chatbot already. I'm very excited about the future of this architecture.

  • @ArnavMondal14 · 6 months ago

    You have any code for it?

  • @user-xk6rg7nh8y · 5 months ago

    Thanks for your work!! It is really helpful to look through the related works 😮😮

  • @user-hf3vn8et6j · 6 months ago

    Thank you for bringing this to our eyes; it has been really insightful.

  • @fiery_transition · 6 months ago

    As a person new to the field, I greatly appreciate the way you presented things here!

  • @SamuelAlbanie1 · 6 months ago

    Thanks!

  • @kobilica999 · 6 months ago

    Man, those papers include hardcore numerical linear algebra :D

  • @HaganeNoGijutsushi · 3 months ago

    S4 seems to go the hardest with its convolutional trick, but then everyone else goes "fuck this complicated shit, it's too constraining, let's just parallelize more!" and honestly if I had been the one coming up with that clever math I'd feel so cheated 😂.
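For readers curious what the "convolutional trick" refers to: a linear time-invariant SSM h_t = A h_{t-1} + B x_t, y_t = C h_t can be evaluated either as a recurrence or as a 1-D causal convolution with kernel K_k = C A^k B. The toy sketch below (random parameters, naive kernel construction) just checks that equivalence; S4's actual contribution is computing K efficiently for a structured A, which this sketch does not attempt.

```python
import numpy as np

L_seq, n = 64, 8
A = np.diag(np.random.uniform(0.5, 0.95, n))   # stable toy state matrix
B = np.random.randn(n)
C = np.random.randn(n)
x = np.random.randn(L_seq)

# Recurrent view: one step at a time with a hidden state.
h = np.zeros(n)
y_rec = np.empty(L_seq)
for t in range(L_seq):
    h = A @ h + B * x[t]
    y_rec[t] = C @ h

# Convolutional view: materialize kernel K_k = C A^k B, then causally convolve.
K = np.array([C @ np.linalg.matrix_power(A, k) @ B for k in range(L_seq)])
y_conv = np.array([np.dot(K[:t + 1][::-1], x[:t + 1]) for t in range(L_seq)])

print(np.allclose(y_rec, y_conv))   # True
```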

  • @MustafaAkben · 6 months ago

    Great review! Looking forward to playing with it soon :)

  • @SamuelAlbanie1 · 6 months ago

    Thanks!

  • @NoNTr1v1aL · 6 months ago

    Amazing video! Subscribed.

  • @JazevoAudiosurf · 6 months ago

    Tri Dao is one hell of a contributor

  • @Ben_D. · 6 months ago

    I need an ‘explain it like I’m five’ version of this. 😄 But I hope it means something strong is coming down the pipe.

  • @christophkogler6220 · 6 months ago

    Actual ELI5: Many current AI models rely on 'MLP (Multi-Layer Perceptron)' and 'Transformer' blocks in their design. The "problematic" (but also usually the 'smart') one is the 'Transformer' block. These need more and more resources to process the context as the context size increases, making scaling up VERY difficult: for an 8x larger context you need about 64x the resources. This is because Transformers compare every part of the context to every other part of the context, every time.

    The Mamba architecture swaps out both the MLP and Transformer blocks for the new 'Mamba' block. It needs the same amount of resources for an increase in context size no matter how large the context already is. For an 8x larger context, you would only need about 8x the resources. That means that, compared to a Transformer-based model, you could give it way more input at once and get way more output at once, with the same memory resources. If the method works at larger scales, Mamba could be another significant step forward for AI capabilities.

    Most current public-facing LLMs, like ChatGPT, use Transformers in their architecture. Transformers include 'self-attention', which basically weighs the importance of everything against everything else, all at once. This means they process any input in approximately O(N^2) time and memory (where N is the input length). As input/context length increases, their demands scale incredibly high. Anybody with a decent GPU technically CAN run a local LLM; it's just small, slow, and dumb. To run anything decent, you end up needing tens (maybe even hundreds) of gigabytes of extremely fast memory, which means workstation GPUs that cost thousands, or even entire GPU clusters.

    The Mamba architecture is basically an entirely different TYPE of AI, more similar to a Recurrent Neural Network, and is both faster and more memory efficient. It processes and considers information sequentially, instead of all at once, but can ALSO ignore unimportant information. The architecture would be able to process an input in approximately O(n+L) time and memory, where n is essentially some constant and L is the input length. If it continues to work so efficiently at increasingly large scales, it means literally orders of magnitude faster output and reduced memory requirements for a given context window, which can allow model context size to be increased massively while still using less computational resources than the previous methods.

    This part is basically educated guesswork, as this level of linear algebra / ML theory is a fair bit over my head: I think Legendre memory cells basically create a high-dimensional Fast Fourier Transform from the Legendre polynomials (1:25), which fits neatly into my mental model of AI. In a certain mathematical sense, everything an AI knows can be represented as an incredibly complex interplay of gradients/associations between multidimensional vectors. A multidimensional FFT thus allows you to track your 'location' (the context) within this interplay as the model works, efficiently and with little (to near-zero) accuracy loss. It also allows you to accurately recompose the input from memory, thus allowing the model to efficiently recall or ignore only the parts of the input that matter. The 'importance' of any specific part of the input to the output is probably based on some calculation of distance from the FFT to the 'current context'. If it isn't a 'near' association, it probably doesn't really matter to what you're doing, and so can be ignored.

    And here's a summary of the results of the paper from ChatGPT, after I had a little Q&A with it: "Summarize the benefits the Mamba architecture has over the Transformers architecture when at similar scales." The Mamba architecture offers several significant benefits over traditional Transformer architectures, particularly when both are at similar scales. Here's a summary of these advantages:

    1) Higher Throughput and Efficiency: Mamba achieves a much higher throughput in both inference and training compared to Transformers. Specifically, it has been noted to have 5× higher throughput during inference and up to 40× faster efficiency in training operations. This increased efficiency is especially beneficial when dealing with large-scale models and data.

    2) Linear Scaling with Sequence Length: Unlike Transformers, which have quadratic scaling with sequence length, Mamba scales linearly. This is a substantial advantage for processing long sequences, as it ensures more predictable and manageable growth in computational requirements and memory usage as sequence length increases.

    3) Improved Generation Throughput: In tasks like language modeling, Mamba not only outperforms Transformers of the same size but also matches or even exceeds the performance of Transformers that are twice its size. This indicates higher efficiency and effectiveness of Mamba in generating outputs.

    4) Effective Handling of Longer Sequences: Mamba is particularly adept at handling long sequences, outperforming Transformer models in tasks involving extended contexts. Its design allows it to focus on the most relevant parts of a sequence, enhancing its ability to generalize to much longer sequences than it was trained on.

    5) Simplified Architecture: By omitting attention and MLP blocks, Mamba's architecture is more streamlined than that of traditional Transformers. This simplification contributes to its efficiency, especially in dealing with long sequences.

    6) Hardware Optimization: Mamba's hardware-aware algorithm makes it more compatible with modern GPU architectures, leading to better performance on current hardware platforms. This optimization is crucial for achieving faster processing speeds and more efficient utilization of computational resources.

    In summary, Mamba offers significant improvements over Transformers in terms of efficiency, scalability, and effectiveness, particularly at similar scales. Its innovations in architecture and design enable it to handle longer sequences more efficiently, making it a strong candidate for various applications in fields requiring efficient sequence modeling.
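To make the quadratic-vs-linear contrast above concrete, here is a toy numpy sketch (random weights, no training, and not the Mamba selective-scan kernel): naive self-attention materializes an L×L score matrix, while a linear state-space recurrence carries a fixed-size state across the sequence in a single pass.

```python
import numpy as np

L_seq, d, n = 512, 32, 16            # sequence length, model width, state size
x = np.random.randn(L_seq, d)

# Attention-style: every token attends to every other token -> an L x L matrix.
Q, K, V = (x @ np.random.randn(d, d) for _ in range(3))
scores = (Q @ K.T) / np.sqrt(d)                        # shape (L_seq, L_seq)
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)
y_attn = attn @ V

# SSM-style: one fixed-size hidden state carried across the sequence -> O(L) work.
A = 0.95 * np.eye(n)                                   # toy state transition
B = 0.1 * np.random.randn(n, d)
C = 0.1 * np.random.randn(d, n)
h = np.zeros(n)
y_ssm = np.empty_like(x)
for t in range(L_seq):                                 # single pass, constant-size state
    h = A @ h + B @ x[t]
    y_ssm[t] = C @ h

print(y_attn.shape, y_ssm.shape)                       # (512, 32) (512, 32)
```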

  • @nartrab1 · 6 months ago

    Thank you! This was excellent.

  • @alexander191297 · 6 months ago

    I think this answer is wonderful… and I can tell it's ChatGPT-generated 😅

  • @kevinaud6461 · 6 months ago

    @christophkogler6220 I think this was more of an "explain like I have a bachelor's in CS," but that's exactly what I needed 🙂 Thanks for writing it out.

  • @christophkogler6220 · 6 months ago

    @alexander191297 Only the part after I mention ChatGPT :)

  • @michaelparis6039 · 6 months ago

    I'm only at 7:13, right after 'spicy'. Subscribed. Great format and amazing delivery!

  • @SamuelAlbanie1 · 6 months ago

    Thanks!

  • @EigenA · 5 months ago

    Great video, thanks for sharing!

  • @BlayneOliver · 6 months ago

    Would this help a regression-based transformer whose data is based on the stock market's price action? Or is it more for multimedia?

  • @sup5356 · 6 months ago

    beautifully developed narrative

  • @circulartext · 6 months ago

    super cool work

  • @synapsomorphy · 6 months ago

    Very encouraging that they included the situation in which S6 did poorly! If there are no other catches this looks incredible!

  • @Kobe29261 · 6 months ago

    This does it for my 'aspiration video' of the week.

  • @SamuelAlbanie1 · 6 months ago

    Great.

  • @qwertyuiop-ux6jk · 6 months ago

    thanks for the video

  • @vga7714 · 6 months ago

    Great summary, and an even better presenting voice.

  • @SamuelAlbanie1 · 6 months ago

    Thanks!

  • @matusstiller4219 · 6 months ago

    This video reminds me of the fact that I do not understand mathematics🙃

  • @iamr0b0tx · 6 months ago

    Thanks

  • @SamuelAlbanie1 · 6 months ago

    Thanks!

  • @KingPowa00 · 6 months ago

    What source do you suggest to understand the algebra and math behind these works? I really struggled to understand most of the concepts, though I have a fairly good grasp of the math behind transformers.

  • @raul36 · 6 months ago

    First of all, I recommend 3Blue1Brown's algebra videos to you guys. Then, if you already have solid knowledge, I would recommend the book "Linear Algebra Done Right".

  • @TheApgreyd · 6 months ago

    Thx YouTube for recommendations

  • @JerryFederspiel · 6 months ago

    Just as complex numbers work well for SSMs in audio, I can't help but wonder whether split-complex numbers would help SSM performance in language tasks (considering the hyperbolic flavor of split-complex numbers and the benefits of hyperbolic embeddings when encoding hierarchical data).

  • @SamuelAlbanie1 · 6 months ago

    It certainly seems plausible. In my experience, while hyperbolic embeddings make strong intuitive sense for hierarchical data, I've never seen them yield significant gains (the kinds of works I am familiar with are of this flavour: arxiv.org/abs/2304.09172). If your experience has been different, I'd be curious to hear.

  • @JorgetePanete · 6 months ago

    Remember, the RWKV mentioned is the one from its paper, RWKV v4; there isn't yet a paper for v5 and v6, but v6 is similar to Mamba. Edit: it was updated today.

  • @JorgetePanete · 6 months ago

    How similar? Well, I don't know; check it at the repo.

  • @porting400 · 6 months ago

    Great video

  • @h3techsme · 6 months ago

    This also begs the question of how the hardware-aware process fares when the memory between system and GPU is fully shared...

  • @couldntfindafreename · 6 months ago

    It is 100% sure that someone is already training a 7B+ Mamba model out there, most likely even bigger.

  • @circulartext · 6 months ago

    true

  • @luizpereira7165 · 3 months ago

    Can you use the Mamba architecture in conjunction with BitNet b1.58?

  • @dfparker2002 · 6 months ago

    How is Mamba similar to or different from multi-expert models? What is the minimum card spec (memory, CUDA, tensors, whatever) to run this model?

  • @honeymak · 6 months ago

    Is it conversational? Can it talk to itself or to several instances?

  • @Shnugs · 6 months ago

    When you stand back and squint your eyes at these papers they almost have a turbo encabulator quality to them.

  • @colejohnson2230 · 6 months ago

    Lol, yeah. I noticed that most fields tend towards that as you get towards the bleeding edge. Sometimes I have to stop what I'm working on and just appreciate how it looks like nonsense to an outside viewer.

  • @RudyMartinInvest · 6 months ago

    Thanks!

  • @SamuelAlbanie1 · 6 months ago

    Thanks!

  • @TheGreatestJuJu · 6 months ago

    This makes so much sense. So obvious..

  • @6lack5ushi · 6 months ago

    Is this not somewhat a proof of, or an addition to, Lee Cronin's Assembly Theory, if you can rebuild input u from the components of m?

  • @watcher8582 · 6 months ago

    cool presentation

  • @SamuelAlbanie1 · 6 months ago

    Thanks!

  • @grimsk · 6 months ago

    Feels like it's becoming more and more similar to physics... 🙂

  • @Sam-ri3hr · 6 months ago

    Good video Sam

  • @Robert_McGarry_Poems · 6 months ago

    This is my first time watching your channel. Impressive walkthrough. When I first heard of Q* my imagination started to build a very similar architecture... I don't follow too much of the technical, but I saw how the sandwiched gates, shown in the video, could be used almost in an analogue fashion. This is brilliant! Watching this made me grin like crazy... This might not be zero memory, but dang if it isn't a huge step in that direction. Using local memory is genius. And that token interpretation length, yes... So... physically, I guess, in my mind the next step is to localize the memory to the operation even more, but it looks like in that architecture it's as local as it's going to get... What about something like... "Sample-and-hold," from actual analogue circuits? That might be something to think about.

  • @s11-informationatyourservi44 · 6 months ago

    Can't wait for a model named Kobe to come out.

  • @qwertasd7 · 6 months ago

    Any LLM using it?

  • @astridwilde · 6 months ago

    great video

  • @SamuelAlbanie1 · 6 months ago

    Thanks!

  • @peteroliver7975 · 5 months ago

    I want to see this applied to reasoning tokens

  • @Sai_r2d2 · 5 months ago

    Lesssgo kobe ✨️

  • @stan-15 · 6 months ago

    Cool beans

  • @zlatanmessi2095 · 5 months ago

    Added to my playlist on AI.

  • @shyama5612 · 6 months ago

    Is Gemini based on this? The logo spiral seems to look like the Legendre polynomial graph.

  • @Verrisin · 6 months ago

    Turning an image into a flattened sequence... I wonder if they are using space-filling curves, or just going line by line? I wonder which "regularity" would be more useful? Or something else even? To be fair, having no implicit notion of the "relative position of 2 pixels" (which I believe brains have) seems really expensive, if it then has to fully recover that structure from just a sequence of tokens...

  • @SamuelAlbanie1 · 6 months ago

    Yes - this is a good point. I think the reason flattening is performed without retaining 2D structure is precisely because it makes for a particularly challenging modelling task.
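On the flattening question: a tiny sketch of two ways to serialize a 2-D image into a 1-D sequence, plain raster (row-by-row) order, which sequential-image benchmarks of this kind typically use, versus a "snake" order that keeps vertically adjacent pixels slightly closer together in the sequence. The 4x4 array and the choice of orderings are purely illustrative; I have not checked which ordering any specific benchmark in the video uses beyond raster order.

```python
import numpy as np

img = np.arange(16).reshape(4, 4)     # toy 4x4 "image"

raster = img.reshape(-1)              # row-by-row: 0, 1, 2, 3, 4, 5, ...

snake = img.copy()
snake[1::2] = snake[1::2, ::-1]       # reverse every other row
snake = snake.reshape(-1)             # 0, 1, 2, 3, 7, 6, 5, 4, 8, ...

print(raster)
print(snake)
```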

  • @memenga260 · 6 months ago

    I remember reading a paper on this in 2021; why wasn't it adopted earlier? Link to the page in the reply.

  • @memenga260 · 6 months ago

    drive.google.com/file/d/1-67LHZbCoDmzLWYp_4ZUXNzavcbGNMGa/view?usp=drivesdk

  • @SamuelAlbanie1 · 6 months ago

    Good find. I guess mamba is a popular name...

  • @DamaKubu · 6 months ago

    If you are interested in doing mechanistic interpretability on the Mamba model, hit me a DM. I'm thinking of writing something like Neel Nanda's TransformerLens for Mamba, or some lower-hanging fruit as a start.

  • @baab4229 · 6 months ago

    Idk man, I kinda like the shapeshifting sapient robots fighting over their home planet Cybertron; why would you wanna replace them?

  • @apidas · 4 months ago

    god, these kids really find the cure for cancer

  • @MemesnShet · 6 months ago

    Since the big companies are creating their LLMs on Transformers with all those resources and time, I doubt they'd change unless the results were dramatically better; so Mamba, while impressive, doesn't seem to be it.

  • @SamuelAlbanie1 · 6 months ago

    Thanks!

  • @Kram1032 · 6 months ago

    Finally, apparently near-infinite contexts!

  • @dhrumil5977 · 6 months ago

    Whattttt 😵‍💫😵‍💫😵‍💫

  • @Verrisin · 6 months ago

    "infinite" context length is effectively the main thing we needed. This is very exciting.

  • @aron2922 · 6 months ago

    I think about 8 people followed what you were saying but I appreciate the effort

  • @SamuelAlbanie1 · 6 months ago

    Thanks!

  • @luismeron4506 · 6 months ago

    Kobe and Gigi 🏀8️⃣💛💜2️⃣4️⃣🖤

  • @patrickangel4880 · 6 months ago

    Like a knife, a weapon available to everyone is not a weapon anymore; it's just a mere tool... #hail_to_the_open_source_and_public_research

  • @garethjax · 6 months ago

    That's enough math for a lifetime. Amazing.

  • @JohnViguerie · 3 months ago

    In the real world, LeCun and Hinton's ideas haven't yet been optimized and deployed to scale in commerce... 😂 But it's fun to try and keep up.

  • @ReflectionOcean · 5 months ago

    - Understand Mamba's significance by exploring its efficient state space model design and selective state mechanism (00:04).
    - Review the scale issues with Transformers and the emergence of efficient alternatives like Mamba for long sequence modeling (00:31).
    - Examine the HiPPO recurrent memory and its application in sequence modeling for improved performance (01:29).
    - Recognize the role of kernel fusion, parallel scan, and recomputation techniques in Mamba's efficient memory usage (09:55); see the sketch below.
    - Consider the empirical results showcasing Mamba's high performance on various tasks, including long sequence modeling and DNA classification (13:02).
    - Analyze the trade-offs in model design, noting how selection mechanisms can impact performance on different data modalities (15:27).
    - Investigate the limitations of current empirical evaluations and the need to test Mamba on larger model sizes (15:43).
    - Dive into the released GitHub code to experiment with the Mamba model firsthand (15:59).
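Since the summary above mentions the parallel scan, here is a minimal sketch of why the linear recurrence h_t = a_t * h_{t-1} + b_t admits one: composing affine maps is associative, so prefixes can be combined in a divide-and-conquer fashion rather than strictly left to right. This toy version is scalar and recursive; Mamba's kernel fuses the scan with the hardware-aware tricks described in the paper, which this sketch does not attempt.

```python
import numpy as np

def combine(left, right):
    # Composing h -> a1*h + b1 and then h -> a2*h + b2 gives another affine map.
    a1, b1 = left
    a2, b2 = right
    return a2 * a1, a2 * b1 + b2

def scan(pairs):
    # Prefix compositions of (a, b) pairs; the two halves are independent,
    # so the recursive calls (and the final combines) could run in parallel.
    if len(pairs) == 1:
        return pairs
    mid = len(pairs) // 2
    left, right = scan(pairs[:mid]), scan(pairs[mid:])
    total_left = left[-1]
    return left + [combine(total_left, r) for r in right]

a = np.random.uniform(0.5, 1.0, size=8)
b = np.random.randn(8)
prefix = scan(list(zip(a, b)))

# Check against the plain sequential recurrence with h_0 = 0.
h, seq = 0.0, []
for t in range(8):
    h = a[t] * h + b[t]
    seq.append(h)
print(np.allclose([B for (_, B) in prefix], seq))   # True
```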

  • @rkbiri5470 · 6 months ago

    Need an ELI5 section 😅😂

  • @ekstrajohn · 6 months ago

    If transformers scale pretty well, I can't think of a reason why Mamba wouldn't scale. At least off the top of my head. Let's see what happens!

  • @Adovid · 6 months ago

    Transformers don't scale on long sequence operations because generative AI neural networks work better spreading attention over the parameters. We shall see if Mamba can do what it claims after a large model is doing inference.

  • @Oler-yx7xj · 6 months ago

    I'm so tired that I read this title literally and it took me some time to understand why it is probably not a video about using snakes in place of ChatGPT.

  • @imded4014 · 6 months ago

    I can't be the only one who clicked on the video expecting the other Transformers...

  • @flambr · 6 months ago

    In the UK, mamba is the nickname for a hard drug.

  • @iTXS · 6 months ago

    The machines now can get epilepsy lol

  • @osbernperson · 6 months ago

    Aha yes, this are the OK! 👍 becas I is smart here to, and No can be maybi. Good! Do it Now!

  • @reinerwilhelms-tricarico344 · 5 months ago

    Interesting. But as usual it suffers from acronym overload.

  • @belzebubukas · 5 months ago

    what

  • @supperenet9090 · 6 months ago

    No, it's a replacement for conda.

  • @bootblacking · 4 months ago

    Why would a snake replace Transformers, it can't even turn into a truck

  • @derghiarrinde · 6 months ago

    Maybe you could better explain some sentences instead of just highlighting them and reading them aloud. I get you want a shorter video, but sometimes you could speak to us like we're 10 years old. It would help with understanding. In the worst case, generate special cases using a GPT ("explain this passage to me as if I was 15") and just read that. Thanks.

  • @SamuelAlbanie1 · 6 months ago

    Thanks for the feedback!

  • @jasonandrewismail2029 · 6 months ago

    superficial and misleading

  • @Cineenvenordquist · 6 months ago

    Remix it with your fixed leads. 🙏🏼