xLSTM: Extended Long Short-Term Memory
Science & Technology
xLSTM is an architecture that combines the recurrence and constant memory requirements of LSTMs with the large-scale trainability of transformers, and achieves impressive results.
Paper: arxiv.org/abs/2405.04517
Abstract:
In the 1990s, the constant error carousel and gating were introduced as the central ideas of the Long Short-Term Memory (LSTM). Since then, LSTMs have stood the test of time and contributed to numerous deep learning success stories, in particular they constituted the first Large Language Models (LLMs). However, the advent of the Transformer technology with parallelizable self-attention at its core marked the dawn of a new era, outpacing LSTMs at scale. We now raise a simple question: How far do we get in language modeling when scaling LSTMs to billions of parameters, leveraging the latest techniques from modern LLMs, but mitigating known limitations of LSTMs? Firstly, we introduce exponential gating with appropriate normalization and stabilization techniques. Secondly, we modify the LSTM memory structure, obtaining: (i) sLSTM with a scalar memory, a scalar update, and new memory mixing, (ii) mLSTM that is fully parallelizable with a matrix memory and a covariance update rule. Integrating these LSTM extensions into residual block backbones yields xLSTM blocks that are then residually stacked into xLSTM architectures. Exponential gating and modified memory structures boost xLSTM capabilities to perform favorably when compared to state-of-the-art Transformers and State Space Models, both in performance and scaling.
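The covariance update rule the abstract mentions for mLSTM can be sketched in a few lines. This is a toy sketch only: names and dimensions are my own, keys are made orthonormal so retrieval is exact, and the gates and normalization of the actual mLSTM are omitted.

```python
import numpy as np

# Toy sketch of a matrix memory with a covariance-style update: the memory C
# accumulates outer products of value and key vectors, and querying C with a
# stored key retrieves the associated value.
rng = np.random.default_rng(0)
d = 8  # hypothetical head dimension

Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
keys = Q[:3]                          # orthonormal keys -> exact retrieval
values = rng.standard_normal((3, d))

C = np.zeros((d, d))
for k, v in zip(keys, values):
    C += np.outer(v, k)               # covariance update: store the pair (k, v)

v_hat = C @ keys[0]                   # retrieval by key
print(np.allclose(v_hat, values[0]))  # True
```

With non-orthogonal keys the retrieved vector picks up cross-talk terms from the other stored pairs, which is why the memory benefits from high dimensionality.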
Authors: Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, Günter Klambauer, Johannes Brandstetter, Sepp Hochreiter
Links:
Homepage: ykilcher.com
Merch: ykilcher.com/merch
YouTube: / yannickilcher
Twitter: / ykilcher
Discord: ykilcher.com/discord
LinkedIn: / ykilcher
If you want to support me, the best thing to do is to share out the content :)
If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: www.subscribestar.com/yannick...
Patreon: / yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
Comments: 98
Funny to see my professors' names on the paper here. Feels odd, since I've known this channel since way before I started studying there.
@wurstelei1356
8 days ago
Thank god they had these techs decades ago, so nothing is patented and hidden from the public.
Seems like the title of this paper could have been, perhaps provocatively, “LSTMs are all you need.”
@nicolasmichel5163
27 days ago
I feel that's not really the conclusion here. More like "Billions of parameters is all you need"
"Matrices aren't circles" - Yannic Kilcher
I used to think of c and h as memory capacitor and hidden output. This was especially clear in word tagging problems where we had to align our outputs with the input tokens. So the h vector was directly corresponding to one of the tag classes that we used to predict and c was used strictly as the memory (I thought c was just from "capacitor" or "memory Cell").
I mean, the term "language model" was coined in the '90s. Even n-gram models were considered language models. We just didn't start prefixing "language model" with "large" till the early 2000s. The claim that LSTMs were LLMs in the '90s is an exaggeration, but also partially true.
mLSTMs are similar to Google's Infini-attention in how they handle memory retrieval
Extended Long Short-Term really sounds like upper lower middle class
@Hexanitrobenzene
26 days ago
Yeah, the adjacent words "long" and "short" don't clarify matters at all... In contrast, the authors of "Attention Is All You Need" could work for political campaigns writing slogans as a side hustle :)
Nice, thanks for covering this paper :)
finally approaching ART
brilliant thanks Yannic
So, the answer is kind of yes. If you scale a high-dimensional token mixer using backpropagation to adjust weights towards the desired result, you will achieve functionality. The question lingering in my mind is: do biological neural networks employ backpropagation? How do we one-shot learn new token sequences, and how are we able to remember them long term and bring them back when we need them if they are so low-probability (we only saw them once)?
@xxlvulkann6743
26 days ago
I imagine that when you have agentic models, you can implement more sophisticated memory encoding. For example, you might allow for particular memory samples to have a larger "significance" based upon your current level of arousal/reward. Also, exposure to a token doesn't have to come from the external environment, it may result from constantly "thinking" about the topic, essentially generating and training on synthetic data. We must remember that generative models are still not actual agentic models, they're basically just foundation models.
@ssssssstssssssss
26 days ago
Backpropagation is largely considered implausible for biological networks and BPTT is impossible because it is a non-causal system. Some do think the brain does employ some kind of "gradient" though.
@Hexanitrobenzene
26 days ago
@@ssssssstssssssss BPTT ?
@ChlorieHCl
26 days ago
@@Hexanitrobenzene Back-propagation through time
@eltongoaustriaco8268
26 days ago
The brain might generate a training signal from a single example in short term memory (you repeating your hotel room number in mind). Regarding BP, it is plausible that the brain uses a less optimised version of that.
This reminds me of serialization and parallelization mixing in various layers, which I actually observe in nature.
Thank you Mr Yannic for explaining xLSTM, which extends the famous Long Short-Term Memory model. P.S. I like your videos, so please stay healthy
@aintgonhappen
27 days ago
Pray for Mr Yannic 🙏🙏🙏
We need a new Mamba explanation. The current one has errors and doesn't really explain much.
@longvo7088
27 days ago
You need to read previous papers like HiPPO and S4 to be able to understand Mamba, along with some prerequisite CUDA programming skills.
@AM-yk5yd
27 days ago
Sasha Rush has several as he seems to be a big fan of SSM. "Mamba: The Hard Way" is very detailed.
Thank you Yannic! I thought I was crazy, but you seem to have read a similar tone in the early sections lol. That's pretty funny: "our paper is all about this addition, and this multiplication... Novel ideas, eh?". That's the headline, but only after that does the real new part start with memory management (soft memory, not hardware... also confusing).
33:10 - Is this sort of a built-in softmax? Exponentiate everything, then normalize?
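That reading can be made concrete with a toy sketch (my own simplification, not the paper's exact formulation): exponentiated gate pre-activations accumulate into a gated sum, a running normalizer tracks their total, and a max-tracking stabilizer keeps the exponentials in range, so the read-out is a softmax-like weighted average unrolled over time.

```python
import numpy as np

# Toy sketch of exponential gating with a normalizer state n and a
# max-tracking stabilizer m. The read-out c / n is a weighted average of the
# inputs whose weights sum to one -- structurally a softmax over time steps.
def exp_gated_average(inputs, gate_preacts):
    c, n, m = 0.0, 0.0, -np.inf
    for x, g in zip(inputs, gate_preacts):
        m_new = max(m, g)                               # stabilizer update
        scale = np.exp(m - m_new) if np.isfinite(m) else 0.0
        c = c * scale + np.exp(g - m_new) * x           # gated accumulator
        n = n * scale + np.exp(g - m_new)               # running normalizer
        m = m_new
    return c / n

print(exp_gated_average([1.0, 2.0, 3.0], [0.0, 0.0, 0.0]))  # 2.0 (plain mean)
```

With equal gates this reduces to a plain mean; a much larger gate at one step makes that step dominate, just as a peaked softmax would.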
*Summary*
*What is xLSTM?* [0:00]
* xLSTM aims to push the boundaries of LSTM architectures by incorporating lessons learned from the world of LLMs and Transformers.
* It introduces two modified LSTM cells: sLSTM and mLSTM.
* xLSTM architectures are formed by residually stacking these modified LSTM blocks.
*Key Features:* [7:35]
* *Exponential Gating:* [31:02] Replaces the traditional sigmoid non-linearity in LSTM gates with an exponential function to address vanishing gradient issues.
* *Normalization and Stabilization Techniques:* [32:38] Introduces methods to handle the rapid growth of the exponential function and stabilize training.
* *Modified Memory Structures:*
* *sLSTM:* [27:47] Utilizes a scalar memory, a scalar update, and "new" memory mixing (which leverages matrix properties for information routing between dimensions).
* *mLSTM:* [36:24] Employs a matrix memory and a covariance update rule for associative memory. It's fully parallelizable in training, similar to Transformers.
*Advantages:*
* *Constant Memory Usage:* Unlike Transformers, xLSTM maintains a fixed memory footprint regardless of sequence length.
* *Competitive Performance:* Achieves results comparable to state-of-the-art Transformers and State Space Models on language modeling benchmarks.
* *Parallelizable Training (mLSTM):* The mLSTM variant removes the non-linear dependency on past time steps, enabling parallel training like Transformers.
*Limitations:* [54:30]
* *Large Constant Memory Requirement:* While memory usage is constant, the mLSTM's matrix memory can be large, leading to higher computational costs.
* *No Fast Parallel Training for sLSTM:* The sLSTM variant still involves recurrency, making fast parallel training challenging.
* *Further Optimization Needed:* The authors acknowledge the need for further architecture and hyperparameter optimization, especially for larger xLSTM models.
*Overall:* [55:54]
* xLSTM demonstrates the potential of enhanced LSTM architectures to compete with Transformers in language modeling.
* Further research and real-world applications will determine its long-term impact and adoption.
I summarized the transcript with Gemini 1.5 Pro.
@XX-vu5jo
27 days ago
Gemini is a joke lol
@FunkyJeff22
27 days ago
Thanks!
@guillaumevermeillesanchezm2427
27 days ago
How much did it cost?
@wolpumba4099
27 days ago
@@guillaumevermeillesanchezm2427 Nothing. I'm in some kind of beta. It is also super fast (less than 10 seconds). Much better than GPT-4
@guillaumevermeillesanchezm2427
27 days ago
@@wolpumba4099 thank you for answering!
Hello Yannic, thanks for your videos! Are you going to make some videos related to KANs (Kolmogorov-Arnold Networks)? Thank you
@quickpert1382
26 days ago
KANs are fairly easy, and it's a nice lecture to venture into by yourself
@_XoR_
24 days ago
Unfortunately they are quite flawed for most applications, since they don't scale, and depending on the distribution shape they can be worse than MLPs.
@quickpert1382
24 days ago
@@_XoR_ Yep, for now we are waiting for an optimized implementation.
In the mLSTM block, isn't it very similar to attention, just without the softmax?
@GGlessGo
6 days ago
And is it? Can't completely follow, actually.
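The similarity can be checked numerically: a recurrent outer-product memory queried at each step reproduces exactly causal attention with the softmax removed (i.e., unnormalized linear attention). This sketch uses illustrative names and sizes, and omits the gates and normalizer of the actual mLSTM.

```python
import numpy as np

rng = np.random.default_rng(1)
T, d = 5, 4
Q, K, V = rng.standard_normal((3, T, d))

# Parallel form: causal attention scores with the softmax removed.
mask = np.tril(np.ones((T, T)))
out_parallel = (Q @ K.T * mask) @ V

# Recurrent form: accumulate outer products v_t k_t^T, then query with q_t.
C = np.zeros((d, d))
out_recurrent = np.zeros((T, d))
for t in range(T):
    C += np.outer(V[t], K[t])
    out_recurrent[t] = C @ Q[t]

print(np.allclose(out_parallel, out_recurrent))  # True
```

Both forms compute, for each step t, the sum over past steps s of (q_t . k_s) v_s, which is why the mLSTM can be trained in the parallel form and run in the recurrent one.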
Wait, I've been watching your channel for maaany years, how come it only has 245k subscribers, and something like 2minpapers has 1.5M?
@ChlorieHCl
27 days ago
I've felt a significant decline in quality for Two Minute Paper videos. The 2min are like 30s of unwanted background info, 30s of experimental results, and 1min of sponsor acknowledgment. And also “what a time to be alive” and “hold on to your papers”, apparently. No real info gained from those videos. To the point that I've unsubbed from that channel for months just to get rid of the annoyance.
@yakmage8085
27 days ago
@@ChlorieHCl There's been a decline for sure, but also Yannic's videos have a significantly higher minimum education requirement. 2 Minute Papers videos are just highlights, with no math, intuition, or criticism.
@AvastarBin
27 days ago
Because 2minpapers videos are 5 or 6 minutes long (ironically) and are understandable by anyone regardless of background, whereas Yannic's videos are an hour long, very in-depth, and require a lot of background knowledge in ML.
@GoldenBeholden
27 days ago
@@ChlorieHCl Yeah, seeing some guy get enthusiastic about research papers was nice enough when the channel just began and sat below 30k subscribers, but he really started playing into his "character" rather than the actual content of the papers. Not really worth your time anymore, to be honest. AI Explained is great if you're looking for another channel in the same vein as this one (albeit lighter on the academics).
@thirdeye4654
27 days ago
Why do influencers on Tiktok have millions of followers just talking bullshit all day long? Because people love entertainment and not many have a long attention span. Also there is just so much time you have in your own life to watch and do stuff.
I could have told you this when I was at the end of kindergarten. I hope there is more behind it than what it sounds to be.
The last few papers Yannic covered all follow the same line of bringing back some sort of recurrence with transformers. In this case not explicitly, but I don't see a fundamental difference why each step of the sequence couldn't be processed by one. There seems to be a clear research direction of resurgent recurrence; I wonder if this direction has a formal theory or even a name.
Is this the ultimate bitter lesson?
@herp_derpingson
26 days ago
Money is all you need
Isn’t that close to the mindset behind Mamba as well? What would be the key difference?!
Regarding the large memory requirements of the d*d matrix, perhaps they could take a page from the Vector Symbolic Architectures approach? In VSA, state, keys and values are all vectors of the same shared space (and so have the same dimension), so if all that's needed is to combine them in a way that would result in dot(new_state, key) ~= value, VSA's binding operation (e.g. component-wise / Hadamard product) sounds like a perfectly viable replacement 🤔 I suppose it would still benefit from large space dimensionality, but a vector size can be controlled on a more granular level than a square matrix size. If they use binary or ternary weights, the memory requirements would be even smaller (though that would probably require some changes in how the model is trained).
@JerryFederspiel
19 days ago
If I'm thinking about this right, the off-diagonal elements of the outer products of k and v can be thought of as "clues" that each vector element in the key gives about each other vector element in the value. The Hadamard product dispenses with these clues- each element is treated independently- but maybe each individual element only has to be kind-of right with a VSA because d is so high. It may also be possible to compromise between Hadamard and outer products by taking the key and value vectors and breaking them up into P parts of d/P elements each. Then you take the outer products of corresponding parts. This gives us a memory requirement of P * (d/P)^2 = d^2 / P. It means that each key element gives a clue about d/P value elements. Setting P to sqrt(d) feels good, so clearly that is the right choice 🙂
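The partitioning compromise sketched above can be written down directly. This is an illustrative sketch with hypothetical sizes; each key part is normalized so that per-block retrieval comes out exact.

```python
import numpy as np

# Block-partition compromise between a Hadamard product and a full outer
# product: split keys and values into P parts of d//P elements and keep only
# the P block-diagonal outer products, shrinking memory from d**2 to d**2 / P.
rng = np.random.default_rng(2)
d, P = 8, 4                            # illustrative sizes; P must divide d
k = rng.standard_normal(d)
v = rng.standard_normal(d)

kp = k.reshape(P, d // P).copy()
kp /= np.linalg.norm(kp, axis=1, keepdims=True)  # normalize each key part
vp = v.reshape(P, d // P)

blocks = np.einsum('pi,pj->pij', vp, kp)   # P blocks of shape (d/P, d/P)
assert blocks.size == d * d // P           # reduced memory footprint

# Retrieval: query each block with its own key part.
v_hat = np.einsum('pij,pj->pi', blocks, kp).reshape(d)
print(np.allclose(v_hat, v))  # True
```

Each key element now gives a "clue" about only d/P value elements, trading retrieval capacity for memory, exactly as the comment describes.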
Is there enough information in the pdf that some of the current bigger LLMs that can read pdfs would be able to produce the equivalent code to what the researchers used to get their alleged results?
@Hexanitrobenzene
26 days ago
This task probably requires AGI...
2nd for AI
3rd for AI :P
Don't you want to review KAN?
First for AI
@darshank8748
27 days ago
AI for fisting
So the matrix memory is simply old-school Kohonen maps from the '70s?
@Hexanitrobenzene
26 күн бұрын
It seems, if that's the name. They list Kohonen, Anderson and Nakano as references, all from 1972.
ngl that's gotta be among the top 20 stupidest names for anything i've ever heard
Feels more tLSTM than mLSTM, right?
I invented this and it got jacked, low key. I called it block-matrix LSTM and they changed the name to be dicks and get away with it, but the fact that it exactly follows my ipynb for it is like, ehhh
@jonnylukejs
27 days ago
my app is called hyper chat and I'm still going to launch it but yeah I've had this since i wrote the code for it
@wunder1385
27 days ago
Sure bro
Self-aggrandizing Boomer-posting admitted, there is a good reason for bringing to people's "attention" prior art and it has to do with the foundation of intelligence in Kolmogorov Complexity approximation: Don't multiply names for things beyond necessity. Now, don't get me wrong here. I'm not saying that the terms currently in use are inferior -- I'm just saying that unification of taxonomy can reduce the explosion of confusion that now besets the field. So the renaming can be beneficial, so long as one then describes prior art in terms of the current tech-argot with appropriate modifiers.
I kinda find this a joke of a paper
the baseline is unfair