xLSTM: Extended Long Short-Term Memory
Science & Technology
xLSTM is an architecture that combines the recurrence and constant memory requirements of LSTMs with the large-scale trainability of transformers, and achieves impressive results.
Paper: arxiv.org/abs/2405.04517
Abstract:
In the 1990s, the constant error carousel and gating were introduced as the central ideas of the Long Short-Term Memory (LSTM). Since then, LSTMs have stood the test of time and contributed to numerous deep learning success stories, in particular they constituted the first Large Language Models (LLMs). However, the advent of the Transformer technology with parallelizable self-attention at its core marked the dawn of a new era, outpacing LSTMs at scale. We now raise a simple question: How far do we get in language modeling when scaling LSTMs to billions of parameters, leveraging the latest techniques from modern LLMs, but mitigating known limitations of LSTMs? Firstly, we introduce exponential gating with appropriate normalization and stabilization techniques. Secondly, we modify the LSTM memory structure, obtaining: (i) sLSTM with a scalar memory, a scalar update, and new memory mixing, (ii) mLSTM that is fully parallelizable with a matrix memory and a covariance update rule. Integrating these LSTM extensions into residual block backbones yields xLSTM blocks that are then residually stacked into xLSTM architectures. Exponential gating and modified memory structures boost xLSTM capabilities to perform favorably when compared to state-of-the-art Transformers and State Space Models, both in performance and scaling.
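The covariance update rule the abstract mentions for mLSTM can be sketched in a few lines. This is a toy sketch only: names and dimensions are my own, keys are made orthonormal so retrieval is exact, and the gates and normalization of the actual mLSTM are omitted.

```python
import numpy as np

# Toy sketch of a matrix memory with a covariance-style update: the memory C
# accumulates outer products of value and key vectors, and querying C with a
# stored key retrieves the associated value.
rng = np.random.default_rng(0)
d = 8  # hypothetical head dimension

Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
keys = Q[:3]                          # orthonormal keys -> exact retrieval
values = rng.standard_normal((3, d))

C = np.zeros((d, d))
for k, v in zip(keys, values):
    C += np.outer(v, k)               # covariance update: store the pair (k, v)

v_hat = C @ keys[0]                   # retrieval by key
print(np.allclose(v_hat, values[0]))  # True
```

With non-orthogonal keys the retrieved vector picks up cross-talk terms from the other stored pairs, which is why the memory benefits from high dimensionality.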
Authors: Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, Günter Klambauer, Johannes Brandstetter, Sepp Hochreiter
Links:
Homepage: ykilcher.com
Merch: ykilcher.com/merch
YouTube: / yannickilcher
Twitter: / ykilcher
Discord: ykilcher.com/discord
LinkedIn: / ykilcher
If you want to support me, the best thing to do is to share out the content :)
If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: www.subscribestar.com/yannick...
Patreon: / yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
Comments: 98
Funny to see my professors' names on the paper here. Feels odd, since I've known this channel since way before I started studying there.
@wurstelei1356
8 days ago
Thank god they had these techs decades ago, so nothing is patented and hidden from the public.
Seems like the title of this paper could have been, perhaps provocatively, “LSTMs are all you need.”
@nicolasmichel5163
27 days ago
I feel that's not really the conclusion here. More like "Billions of parameters is all you need"
"Matrices aren't circles" - Yannic Kilcher
I used to think of c and h as memory capacitor and hidden output. This was especially clear in word tagging problems where we had to align our outputs with the input tokens. So the h vector was directly corresponding to one of the tag classes that we used to predict and c was used strictly as the memory (I thought c was just from "capacitor" or "memory Cell").
I mean, the term "language model" was coined in the '90s. Even n-gram models were considered language models. We just didn't start prefixing "language model" with "large" till the early 2000s. The claim that LSTMs were LLMs in the '90s is an exaggeration, but also partially true.
mLSTMs are similar to Google's Infini-attention in how they handle memory retrieval
Extended Long Short-Term really sounds like upper lower middle class
@Hexanitrobenzene
26 days ago
Yeah, the adjacent words "long" and "short" don't clarify matters at all... In contrast, the authors of "Attention Is All You Need" could work for political campaigns writing slogans as a side hustle :)
Nice, thanks for covering this paper :)
finally approaching ART
brilliant thanks Yannic
So, the answer is kind of yes. If you scale a high-dimensional token mixer using backpropagation to adjust weights towards the desired result, you will achieve functionality. The question lingering in my mind is: do biological neural networks employ backpropagation? How do we one-shot learn new token sequences, and how are we able to remember them long term and bring them back when we need them if they are so low-probability (we only saw them once)?
@xxlvulkann6743
26 days ago
I imagine that when you have agentic models, you can implement more sophisticated memory encoding. For example, you might allow for particular memory samples to have a larger "significance" based upon your current level of arousal/reward. Also, exposure to a token doesn't have to come from the external environment, it may result from constantly "thinking" about the topic, essentially generating and training on synthetic data. We must remember that generative models are still not actual agentic models, they're basically just foundation models.
@ssssssstssssssss
26 days ago
Backpropagation is largely considered implausible for biological networks and BPTT is impossible because it is a non-causal system. Some do think the brain does employ some kind of "gradient" though.
@Hexanitrobenzene
26 days ago
@@ssssssstssssssss BPTT ?
@ChlorieHCl
26 days ago
@@Hexanitrobenzene Back-propagation through time
@eltongoaustriaco8268
26 days ago
The brain might generate a training signal from a single example in short term memory (you repeating your hotel room number in mind). Regarding BP, it is plausible that the brain uses a less optimised version of that.
This reminds me of serialization and parallelization mixing in various layers, which I actually observe in nature.
Thank you Mr Yannic for explaining xLSTM, which extends the famous Long Short-Term Memory model. P.S. I like your videos, so please stay healthy
@aintgonhappen
27 days ago
Pray for Mr Yannic 🙏🙏🙏
We need a new Mamba explanation. The current one has errors and doesn't really explain much.
@longvo7088
27 days ago
You need to read previous papers like HiPPO and S4 to be able to understand Mamba, along with some prerequisite CUDA programming skills.
@AM-yk5yd
27 days ago
Sasha Rush has several as he seems to be a big fan of SSM. "Mamba: The Hard Way" is very detailed.
Thank you Yannic! I thought I was crazy, but you seem to have read a similar tone in the early sections lol. That's pretty funny: "our paper is all about this addition, and this multiplication... Novel ideas, eh?". That's the headline, but only after that does the real new part start with memory management (soft memory, not hardware... also confusing).
33:10 - Is this sort of a built-in softmax? Exponentiate everything, then normalize?
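That reading can be made concrete with a toy sketch (my own simplification, not the paper's exact formulation): exponentiated gate pre-activations accumulate into a gated sum, a running normalizer tracks their total, and a max-tracking stabilizer keeps the exponentials in range, so the read-out is a softmax-like weighted average unrolled over time.

```python
import numpy as np

# Toy sketch of exponential gating with a normalizer state n and a
# max-tracking stabilizer m. The read-out c / n is a weighted average of the
# inputs whose weights sum to one -- structurally a softmax over time steps.
def exp_gated_average(inputs, gate_preacts):
    c, n, m = 0.0, 0.0, -np.inf
    for x, g in zip(inputs, gate_preacts):
        m_new = max(m, g)                               # stabilizer update
        scale = np.exp(m - m_new) if np.isfinite(m) else 0.0
        c = c * scale + np.exp(g - m_new) * x           # gated accumulator
        n = n * scale + np.exp(g - m_new)               # running normalizer
        m = m_new
    return c / n

print(exp_gated_average([1.0, 2.0, 3.0], [0.0, 0.0, 0.0]))  # 2.0 (plain mean)
```

With equal gates this reduces to a plain mean; a much larger gate at one step makes that step dominate, just as a peaked softmax would.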
*Summary*
*What is xLSTM?* [0:00]
* xLSTM aims to push the boundaries of LSTM architectures by incorporating lessons learned from the world of LLMs and Transformers.
* It introduces two modified LSTM cells: sLSTM and mLSTM.
* xLSTM architectures are formed by residually stacking these modified LSTM blocks.
*Key Features:* [7:35]
* *Exponential Gating:* [31:02] Replaces the traditional sigmoid non-linearity in LSTM gates with an exponential function to address vanishing gradient issues.
* *Normalization and Stabilization Techniques:* [32:38] Introduces methods to handle the rapid growth of the exponential function and stabilize training.
* *Modified Memory Structures:*
* *sLSTM:* [27:47] Utilizes a scalar memory, a scalar update, and "new" memory mixing (which leverages matrix properties for information routing between dimensions).
* *mLSTM:* [36:24] Employs a matrix memory and a covariance update rule for associative memory. It's fully parallelizable in training, similar to Transformers.
*Advantages:*
* *Constant Memory Usage:* Unlike Transformers, xLSTM maintains a fixed memory footprint regardless of sequence length.
* *Competitive Performance:* Achieves results comparable to state-of-the-art Transformers and State Space Models on language modeling benchmarks.
* *Parallelizable Training (mLSTM):* The mLSTM variant removes the non-linear dependency on past time steps, enabling parallel training like Transformers.
*Limitations:* [54:30]
* *Large Constant Memory Requirement:* While memory usage is constant, the mLSTM's matrix memory can be large, leading to higher computational costs.
* *No Fast Parallel Training for sLSTM:* The sLSTM variant still involves recurrency, making fast parallel training challenging.
* *Further Optimization Needed:* The authors acknowledge the need for further architecture and hyperparameter optimization, especially for larger xLSTM models.
*Overall:* [55:54]
* xLSTM demonstrates the potential of enhanced LSTM architectures to compete with Transformers in language modeling.
* Further research and real-world applications will determine its long-term impact and adoption.
I summarized the transcript with Gemini 1.5 Pro.
@XX-vu5jo
27 days ago
Gemini is a joke lol
@FunkyJeff22
27 days ago
Thanks!
@guillaumevermeillesanchezm2427
27 days ago
How much did it cost?
@wolpumba4099
27 days ago
@@guillaumevermeillesanchezm2427 Nothing. I'm in some kind of beta. It is also super fast (less than 10 seconds). Much better than GPT-4
@guillaumevermeillesanchezm2427
27 days ago
@@wolpumba4099 thank you for answering!
Hello Yannic, thanks for your videos! Are you going to make some videos related to KANs (Kolmogorov-Arnold Networks)? Thank you
@quickpert1382
26 days ago
KANs are fairly easy, and it's a nice lecture to venture into by yourself
@_XoR_
24 days ago
Unfortunately they are quite flawed for most applications, since they don't scale, and depending on the distribution shape they can be worse than MLPs.
@quickpert1382
24 days ago
@@_XoR_ Yep, for now we are waiting for an optimized implementation.
In the mLSTM block, isn't it very similar to attention, just without the softmax?
@GGlessGo
6 days ago
And is it? Can't completely follow, actually.
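The similarity can be checked numerically: a recurrent outer-product memory queried at each step reproduces exactly causal attention with the softmax removed (i.e., unnormalized linear attention). This sketch uses illustrative names and sizes, and omits the gates and normalizer of the actual mLSTM.

```python
import numpy as np

rng = np.random.default_rng(1)
T, d = 5, 4
Q, K, V = rng.standard_normal((3, T, d))

# Parallel form: causal attention scores with the softmax removed.
mask = np.tril(np.ones((T, T)))
out_parallel = (Q @ K.T * mask) @ V

# Recurrent form: accumulate outer products v_t k_t^T, then query with q_t.
C = np.zeros((d, d))
out_recurrent = np.zeros((T, d))
for t in range(T):
    C += np.outer(V[t], K[t])
    out_recurrent[t] = C @ Q[t]

print(np.allclose(out_parallel, out_recurrent))  # True
```

Both forms compute, for each step t, the sum over past steps s of (q_t . k_s) v_s, which is why the mLSTM can be trained in the parallel form and run in the recurrent one.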
Wait, I've been watching your channel for maaany years, how come it only has 245k subscribers, and something like 2minpapers has 1.5M?
@ChlorieHCl
27 days ago
I've felt a significant decline in quality for Two Minute Paper videos. The 2min are like 30s of unwanted background info, 30s of experimental results, and 1min of sponsor acknowledgment. And also “what a time to be alive” and “hold on to your papers”, apparently. No real info gained from those videos. To the point that I've unsubbed from that channel for months just to get rid of the annoyance.
@yakmage8085
27 days ago
@@ChlorieHCl There's been a decline for sure, but also Yannic's videos have a significantly higher minimum education requirement. 2 Minute Papers videos are just highlights, with no math, intuition, or criticism.
@AvastarBin
27 days ago
Because 2minpapers videos are 5 or 6 minutes long (ironically) and are understandable by anyone regardless of background, whereas Yannic's videos are an hour long, very in-depth, and require a lot of background knowledge in ML.
@GoldenBeholden
27 days ago
@@ChlorieHCl Yeah, seeing some guy get enthusiastic about research papers was nice enough when the channel just began and sat below 30k subscribers, but he really started playing into his "character" rather than the actual content of the papers. Not really worth your time anymore, to be honest. AI Explained is great if you're looking for another channel in the same vein as this one (albeit lighter on the academics).
@thirdeye4654
27 days ago
Why do influencers on Tiktok have millions of followers just talking bullshit all day long? Because people love entertainment and not many have a long attention span. Also there is just so much time you have in your own life to watch and do stuff.
I could have told you this when I was at the end of kindergarten. I hope there is more behind it than what it sounds to be.
The last few papers Yannic covered all follow the same line of bringing back some sort of recurrence with transformers. In this case not explicitly, but I don't see a fundamental difference why each step of the sequence couldn't be processed by one. There seems to be a clear research direction of resurgent recurrence; I wonder if this direction has a formal theory or even a name.
Is this the ultimate bitter lesson?
@herp_derpingson
26 days ago
Money is all you need
Isn’t that close to the mindset behind Mamba as well? What would be the key difference?!
Regarding the large memory requirements of the d*d matrix, perhaps they could take a page from the Vector Symbolic Architectures approach? In VSA, state, keys and values are all vectors of the same shared space (and so have the same dimension), so if all that's needed is to combine them in a way that would result in dot(new_state, key) ~= value, VSA's binding operation (e.g. component-wise / Hadamard product) sounds like a perfectly viable replacement 🤔 I suppose it would still benefit from large space dimensionality, but a vector size can be controlled on a more granular level than a square matrix size. If they use binary or ternary weights, the memory requirements would be even smaller (though that would probably require some changes in how the model is trained).
@JerryFederspiel
19 days ago
If I'm thinking about this right, the off-diagonal elements of the outer products of k and v can be thought of as "clues" that each vector element in the key gives about each other vector element in the value. The Hadamard product dispenses with these clues- each element is treated independently- but maybe each individual element only has to be kind-of right with a VSA because d is so high. It may also be possible to compromise between Hadamard and outer products by taking the key and value vectors and breaking them up into P parts of d/P elements each. Then you take the outer products of corresponding parts. This gives us a memory requirement of P * (d/P)^2 = d^2 / P. It means that each key element gives a clue about d/P value elements. Setting P to sqrt(d) feels good, so clearly that is the right choice 🙂
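The partitioning compromise sketched above can be written down directly. This is an illustrative sketch with hypothetical sizes; each key part is normalized so that per-block retrieval comes out exact.

```python
import numpy as np

# Block-partition compromise between a Hadamard product and a full outer
# product: split keys and values into P parts of d//P elements and keep only
# the P block-diagonal outer products, shrinking memory from d**2 to d**2 / P.
rng = np.random.default_rng(2)
d, P = 8, 4                            # illustrative sizes; P must divide d
k = rng.standard_normal(d)
v = rng.standard_normal(d)

kp = k.reshape(P, d // P).copy()
kp /= np.linalg.norm(kp, axis=1, keepdims=True)  # normalize each key part
vp = v.reshape(P, d // P)

blocks = np.einsum('pi,pj->pij', vp, kp)   # P blocks of shape (d/P, d/P)
assert blocks.size == d * d // P           # reduced memory footprint

# Retrieval: query each block with its own key part.
v_hat = np.einsum('pij,pj->pi', blocks, kp).reshape(d)
print(np.allclose(v_hat, v))  # True
```

Each key element now gives a "clue" about only d/P value elements, trading retrieval capacity for memory, exactly as the comment describes.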
Is there enough information in the pdf that some of the current bigger LLMs that can read pdfs would be able to produce the equivalent code to what the researchers used to get their alleged results?
@Hexanitrobenzene
26 days ago
This task probably requires AGI...
2nd for AI
3rd for AI :P
Don't you want to review KAN?
First for AI
@darshank8748
27 days ago
AI for fisting
So the matrix memory is simply old-school Kohonen maps from the '70s?
@Hexanitrobenzene
26 күн бұрын
It seems, if that's the name. They list Kohonen, Anderson and Nakano as references, all from 1972.
ngl that's gotta be among the top 20 stupidest names for anything i've ever heard
Feels more tLSTM than mLSTM, right?
I invented this and it got jacked, low key. I called it block-matrix LSTM and they changed the name to be dicks and get away with it, but the fact that it exactly follows my ipynb for it is like, ehhh
@jonnylukejs
27 days ago
my app is called hyper chat and I'm still going to launch it but yeah I've had this since i wrote the code for it
@wunder1385
27 days ago
Sure bro
Self-aggrandizing Boomer-posting admitted, there is a good reason for bringing to people's "attention" prior art and it has to do with the foundation of intelligence in Kolmogorov Complexity approximation: Don't multiply names for things beyond necessity. Now, don't get me wrong here. I'm not saying that the terms currently in use are inferior -- I'm just saying that unification of taxonomy can reduce the explosion of confusion that now besets the field. So the renaming can be beneficial, so long as one then describes prior art in terms of the current tech-argot with appropriate modifiers.
I kinda find this a joke of a paper
the baseline is unfair