The Illusion of State in State-Space Models (like Mamba)

The paper:
arxiv.org/abs/2404.08819
Support my learning journey either by clicking the Join button above or becoming a Patreon member!
/ tunadorable
Discuss this stuff with other Tunadorks on Discord
/ discord
All my other links
linktr.ee/tunadorable

Comments: 59

  • @sadface7457 · 21 days ago

    Your work ethic the last few days is crazy

  • @Tunadorable · 21 days ago

    hahaha I've got 4 paper breakdowns a week pre-recorded through August 8th rn. makes 5 videos per week when you include the weekly paper abstracts. In theory assuming I find at least 4 papers per week that I think are worthy of reading on the channel (I tend to not do videos on maybe 2/3 or 3/4 of the papers I read) then 5 videos per week should be the new norm. that being said some weeks I only find like 2 so odds are it'll be more variable starting September-ish

  • @netherportals · 21 days ago

    And they cough into their arm instead of their hands like a barbarian

  • @mrpocock · 20 days ago

    A monoid is a type or set with an associative binary operator and an identity for that operation. So (nat, 0, +) and (nat, 1, *) are monoids, as are more interesting things like (set, empty, union). If your set is unary functions, then the natural monoid operation is function composition, and the identity function is the identity.
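
    A quick Python spot-check of those examples (the looks_like_monoid helper and the sample values below are illustrative, not from the comment above):

        from itertools import product

        def looks_like_monoid(op, identity, samples):
            # Spot-check associativity and identity on a few sample values (not a proof).
            assoc = all(op(a, op(b, c)) == op(op(a, b), c) for a, b, c in product(samples, repeat=3))
            ident = all(op(identity, a) == a and op(a, identity) == a for a in samples)
            return assoc and ident

        # (nat, 0, +) and (nat, 1, *)
        print(looks_like_monoid(lambda a, b: a + b, 0, [0, 1, 2, 5]))    # True
        print(looks_like_monoid(lambda a, b: a * b, 1, [0, 1, 2, 5]))    # True

        # (set, empty, union)
        sets = [frozenset(), frozenset({1}), frozenset({1, 2})]
        print(looks_like_monoid(lambda a, b: a | b, frozenset(), sets))  # True

        # Unary functions under composition, with the identity function as the identity
        # (functions are compared by their outputs on a few sample inputs).
        compose = lambda f, g: lambda x: f(g(x))
        fs = [lambda x: x + 1, lambda x: 2 * x, lambda x: x]
        print(all(compose(f, compose(g, h))(x) == compose(compose(f, g), h)(x)
                  for f, g, h in product(fs, repeat=3) for x in range(-3, 4)))  # True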

  • @minecraftermad · 21 days ago

    17:00 definitely a haskell enjoyer comment. he'd say "a monad is just a monoid in the category of endofunctors" to explain what a monad is.

  • @rikkathemejo · 21 days ago

    The fact that SSMs are less expressive than RNNs is not due to how they are trained. The parallel (or convolutional) form used during training and the recurrent form of SSMs are mathematically equivalent, meaning that the output of both computations is the same assuming that we use infinite precision. So even if you train SSMs in their recurrent form you would not be able to do state tracking.

  • @Tunadorable · 21 days ago

    so help me understand here, then the issue stems only from the finite number of layers in SSMs?

  • @rikkathemejo · 21 days ago

    The issue stems from S4 and S6 having a state update that is linear in the previous state, with the linear map being diagonal. It cannot be solved by adding (a finite number of) layers.
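
    Both points in this thread can be checked numerically. Below is a toy sketch (scalar/diagonal parameters chosen arbitrarily, not the paper's actual parameterization): first, the recurrent and convolutional forms of a linear time-invariant SSM agree up to floating-point error; second, diagonal transitions commute, whereas the permutation compositions needed for S5-style state tracking do not.

        import numpy as np

        rng = np.random.default_rng(0)

        # 1) Recurrent vs convolutional form of a toy scalar SSM:
        #    h_t = a*h_{t-1} + b*x_t,  y_t = c*h_t   <=>   y_t = sum_j (c*a^j*b) * x_{t-j}
        T, (a, b, c) = 16, (0.9, 0.5, 1.3)
        x = rng.normal(size=T)

        h, y_rec = 0.0, []
        for t in range(T):
            h = a * h + b * x[t]
            y_rec.append(c * h)

        k = c * (a ** np.arange(T)) * b                   # convolution kernel
        y_conv = [np.dot(k[:t + 1], x[t::-1]) for t in range(T)]
        print(np.allclose(y_rec, y_conv))                 # True (exact equality only at infinite precision)

        # 2) Permutations of 5 elements do not commute; diagonal updates always do.
        def perm_matrix(p):
            m = np.zeros((len(p), len(p)))
            m[np.arange(len(p)), p] = 1.0
            return m

        P = perm_matrix([1, 0, 2, 3, 4])                  # swap the first two elements
        Q = perm_matrix([1, 2, 3, 4, 0])                  # cyclic shift
        print(np.allclose(P @ Q, Q @ P))                  # False: order matters for state tracking

        D1, D2 = np.diag(rng.uniform(size=5)), np.diag(rng.uniform(size=5))
        print(np.allclose(D1 @ D2, D2 @ D1))              # True: diagonal transitions are order-insensitive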

  • @TiagoTiagoT · 21 days ago

    I wasn't expecting this. Inspired by this video (actually, my mind started to wander way too early and I'm now watching it again), I tried to get Llama 3 to produce additional generations of Wolfram's Rule 110 CA in several ways, including giving instructions, including the rules in Python format, providing several of the initial lines, and a few other variations of formats and instructions. It consistently failed to do it right even for the very first line it wrote...
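
    For reference, a minimal Rule 110 generator (a generic sketch with zero-padded boundaries and a single-cell seed; not the commenter's prompts or code):

        RULE = 110  # Wolfram's Rule 110: the new cell value is bit (left*4 + center*2 + right) of 110

        def step(cells):
            # Advance one generation; cells beyond the edges are treated as 0.
            padded = [0] + cells + [0]
            return [(RULE >> (padded[i - 1] * 4 + padded[i] * 2 + padded[i + 1])) & 1
                    for i in range(1, len(padded) - 1)]

        row = [0] * 31 + [1]            # classic seed: a single live cell on the right
        for _ in range(16):
            print("".join("#" if c else "." for c in row))
            row = step(row)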

  • @hjups · 21 days ago

    The claim that training in parallel limits recurrent algorithmic learning is nonsensical. What the authors probably mean is that you must alter the recurrent model (apply mathematical assumptions) in order to train it in parallel, and those assumptions necessarily limit the ability to learn recurrent algorithms. That would be a similar idea to how using linear attention in a transformer will limit its ability to form certain representations (when compared to using softmax) in exchange for removing the O(N^2) attention computation. Funny enough, the authors propose introducing non-linearities to Mamba, which would break the equivalency to linear attention transformers (from the more recent Mamba paper). Regarding state tracking, can some NC1 problems be decomposed into an arbitrary collection of TC0 problems? If so, this suggests that tracking state externally (e.g. via an agentic framework) may counter this failure case, although I believe that may necessarily require a direct copying ability (which Mamba lacks).

  • @TheSkypeConverser · 21 days ago

    Gud comment
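
    On the linear-attention point above, a toy comparison of causal softmax attention with a kernelized linear-attention recurrence (the elu+1 feature map follows common linear-attention formulations; the shapes and values are made up, and this is not the Mamba-2 duality itself):

        import numpy as np

        rng = np.random.default_rng(0)
        T, d = 6, 4
        Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))

        def softmax_attention(Q, K, V):
            scores = Q @ K.T / np.sqrt(d)
            scores = np.where(np.tril(np.ones((T, T))) == 1, scores, -np.inf)   # causal mask
            w = np.exp(scores - scores.max(axis=-1, keepdims=True))
            return (w / w.sum(axis=-1, keepdims=True)) @ V

        def linear_attention(Q, K, V):
            # Kernelized attention: an O(T) recurrent state instead of an O(T^2) score matrix.
            phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))                 # elu(x) + 1 feature map
            S = np.zeros((d, d))                                                # running sum of phi(k) v^T
            z = np.zeros(d)                                                     # running sum of phi(k)
            out = np.zeros((T, d))
            for t in range(T):
                S += np.outer(phi(K[t]), V[t])
                z += phi(K[t])
                out[t] = phi(Q[t]) @ S / (phi(Q[t]) @ z)
            return out

        # The outputs differ: the kernel trick buys O(T) cost and a fixed-size state
        # at the price of a restricted, linear form of attention.
        print(np.abs(softmax_attention(Q, K, V) - linear_attention(Q, K, V)).max())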

  • @netherportals · 21 days ago

    They gotta open their SSM mind's eye to get that parallel recurrence

  • @xt-89907 · 20 days ago

    In Reinforcement Learning, you've got methods like TD-lambda that incorporate the notion of a time horizon. This prevents the state from extending beyond reason. If we applied similar ideas to SSMs, we might not really need full recurrence through all previous observations.
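
    A tabular TD-lambda sketch of that time-horizon idea (the toy random-walk environment and every constant below are made up for illustration): the eligibility trace decays by gamma*lambda per step, so credit effectively stops flowing to states more than roughly 1/(1 - gamma*lambda) steps in the past.

        import numpy as np

        rng = np.random.default_rng(0)
        n_states, alpha, gamma, lam = 10, 0.1, 0.95, 0.8

        V = np.zeros(n_states)      # value estimates
        e = np.zeros(n_states)      # eligibility trace: a decaying memory of visited states

        s = 0
        for _ in range(5000):
            s_next = min(s + int(rng.integers(1, 3)), n_states - 1)   # toy random-walk dynamics
            r = 1.0 if s_next == n_states - 1 else 0.0                # reward only at the goal state
            delta = r + gamma * V[s_next] - V[s]                      # TD error
            e *= gamma * lam                                          # older states fade out
            e[s] += 1.0
            V += alpha * delta * e                                    # recent states get most of the credit
            if s_next == n_states - 1:                                # episode ends at the goal
                e[:] = 0.0
                s = 0
            else:
                s = s_next

        print(np.round(V, 2))       # values rise toward the goal; distant states learn mostly indirectly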

  • @digitalasylum369 · 21 days ago

    Great video

  • @juancarlospizarromendez3954 · 19 days ago

    Possible solutions: more RAM, more NN layers, more CPUs, more time for massive computation, Non-Deterministic computation, etc.

  • @ATH42069 · 20 days ago

    @16:45 I don't know if I remember what monoid means either bro. it's ok

  • @cmobarry · 18 days ago

    Just curious: what if the values along the diagonal of the A matrix were complex numbers, to allow rotations?

  • @be1tube · 18 days ago

    These sorts of simple functions seem to be the ones affected by grokking. Did the authors attempt to "overtrain" their models, or just stop when they fit the training data?

  • @drdca8263 · 20 days ago

    13:30 : fan-in is the number of inputs to the gate. So, not only can you do AND(x_1, x_2), you can do AND(x_1, x_2, …, x_k) as a single gate, all in the same layer. 17:59 : associativity means that a•(b•c) = (a•b)•c. It doesn't allow arbitrarily reordering things. I suppose one might call changing the parentheses "reordering" in one sense, as it changes when which parts are combined.

  • @Tunadorable · 20 days ago

    good catch on my skimmed-over, poorly worded definition; it didn't seem important enough to go in depth on that term
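
    A tiny illustration of that distinction, using string concatenation because it is associative but not commutative (the example strings are arbitrary):

        from functools import reduce

        a, b, c = "ab", "cd", "ef"

        # Associativity: regrouping the parentheses does not change the result...
        print((a + (b + c)) == ((a + b) + c))   # True

        # ...but it does not license reordering the operands.
        print((a + b) == (b + a))               # False

        # Regrouping is what lets a left-to-right fold be replaced by a balanced
        # tree of pairwise combines, which is the shape parallel scans exploit.
        words = ["the ", "illusion ", "of ", "state"]
        left_fold = reduce(lambda x, y: x + y, words)
        tree = (words[0] + words[1]) + (words[2] + words[3])
        print(left_fold == tree)                # True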

  • @rikkathemejo · 21 days ago

    Nice video! I think the WFA-SSM (eq. 7) does not really break parallelism, since matrix products are still associative and thus compatible with the parallel scan used in S6 (Mamba), which needs only a logarithmic number of sequential steps. However, implementing this efficiently on modern hardware might be trickier and more costly, although possible.
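
    A sketch of that point (the dimensions and the tree reduction below are a toy setup, not eq. 7 itself): the affine recurrence h_t = A_t h_{t-1} + b_t composes with an associative operator on (matrix, vector) pairs, so a scan with logarithmic depth still applies; the cost is that each combine becomes a dense matrix-matrix product instead of an elementwise one.

        import numpy as np

        rng = np.random.default_rng(0)
        d, T = 3, 8
        As = rng.normal(scale=0.5, size=(T, d, d))
        bs = rng.normal(size=(T, d))

        # Associative combine: applying (A1, b1) and then (A2, b2) equals (A2 @ A1, A2 @ b1 + b2).
        def combine(first, second):
            A1, b1 = first
            A2, b2 = second
            return A2 @ A1, A2 @ b1 + b2

        # Sequential evaluation of h_t = A_t h_{t-1} + b_t with h_0 = 0.
        h = np.zeros(d)
        for t in range(T):
            h = As[t] @ h + bs[t]

        # Tree (scan-style) evaluation: pairwise combines, log2(T) levels deep.
        level = [(As[t], bs[t]) for t in range(T)]
        while len(level) > 1:
            level = [combine(level[i], level[i + 1]) for i in range(0, len(level), 2)]
        A_total, b_total = level[0]

        print(np.allclose(h, A_total @ np.zeros(d) + b_total))  # True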

  • @hanyanglee9018 · 21 days ago

    Let me correct the understanding of RNNs a bit. Models have to be DAGs to be trainable; an RNN is really just a parameter-tying trick, and unrolled it is a lot of layers. Because it's that many layers, it's too deep, and it doesn't train.

  • @xernerac · 20 days ago

    31:56 isn't that exactly what mamba does? Have I misunderstood mamba this entire time? I was under the impression that mamba does exactly that: not powering the matrices, but instead using their associative properties to regroup the computation for a sequence of length n into O(log(n)) time, and that that was still better than O(n) time?

  • @ckq · 21 days ago

    13:44, just because chess and coding aren't in TC0 doesn't mean transformers are bad. They're amazing at copying human experts.

  • @ckq · 21 days ago

    22:56 yeah but I don't think that problem is particularly hard. It's easy to do in code, and all a transformer would need to do is keep track of each of the swaps involving 5 (via attention) and it's... oh, I guess it's hard for humans too.

  • @Tunadorable · 21 days ago

    yes for sure. and these models can still do recurrent problems that have as many or fewer iterations than the model's number of layers, so for the vast majority of recurrent problems they may even work perfectly fine as we get bigger & bigger models

  • @adamrak7560 · 21 days ago

    @Tunadorable these models can do problems which need many more iterations than their number of layers, if you allow them to use the context too.

  • @OpenSourceAnarchist · 20 days ago

    If your intuition about LLMs needing near-infinite precision to track states is correct, then why aren't we exploring f64, f128, and looking at the computational trade-offs of arbitrary-precision arithmetic vs. state tracking performance?

  • @Tunadorable · 20 days ago

    it takes time for the suggestions of a paper like this to be read, thought about, and derivative works to be written. plus i’m not a hardware person and we don’t necessarily know how fundamental/common state tracking problems are, but if i had to guess i’d say it’s probably not worth the trade off

  • @honkhonk8009 · 19 days ago

    Feels like we're living in the 60s all over again, highkey

  • @Reversed82 · 21 days ago

    i'm an absolute beginner at all of this stuff, but doesn't xLSTM also consciously make the tradeoff of having to choose either recurrence or training parallelism?

  • @Tunadorable · 20 days ago

    i’ve not read that paper but vaguely i think i heard that it does. that’s really one of the first big questions to be asked right now when any new architecture gets proposed “but can it train in parallel like transformers” because without that good luck charlie

  • @VincentKun · 21 days ago

    Still watching, but at min 6:10: I think that for a very autonomous agent we want an RL agent that learns and runs on-policy (policy gradient); it's the only viable way to me. By the way, I think the future in our case will be guided by models that are inherently recurrent, like the Linear Recurrent Unit and S5 (they can be trained with a parallel scan but are still fully recurrent, and use a complex diagonal matrix with a clever eigenvalue initialization).

  • @Tunadorable · 21 days ago

    yes i think RL on top of pre-trained LLMs is the way to go. to the best of my knowledge everything in this video applies the same to S5 but not to RWKV

  • @VincentKun · 20 days ago

    @Tunadorable Does it still apply to LRUs? You get linear recurrences + position-wise nonlinearity and stack them up.
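
    A simplified sketch of that ingredient (the initialization below is a stand-in with magnitudes just inside the unit circle and small phases, not the exact LRU/S5 scheme): a complex diagonal recurrence decays slowly and rotates, which gives long memory while staying elementwise and therefore scan-friendly.

        import numpy as np

        rng = np.random.default_rng(0)
        d, T = 8, 200

        # Stand-in for the "clever eigenvalue initialization": eigenvalues r * exp(i*theta)
        # with r just below 1 (slow decay) and small theta (slow rotation).
        r = rng.uniform(0.9, 0.999, size=d)
        theta = rng.uniform(0.0, np.pi / 8, size=d)
        lam = r * np.exp(1j * theta)

        x = rng.normal(size=(T, d))
        h = np.zeros(d, dtype=complex)
        norms = []
        for t in range(T):
            h = lam * h + x[t]          # elementwise (diagonal) linear recurrence
            norms.append(np.abs(h).mean())

        # The state neither blows up nor dies out over a long sequence.
        print(round(norms[0], 3), round(norms[-1], 3))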

  • @OpenSourceAnarchist · 20 days ago

    Why can't LLMs use a RAG-like tool with RNNs (and a memory) to make up for the fact that they can't track states?

  • @Tunadorable · 20 days ago

    for simple states like an iterator in a for loop they can. however, if state tracking ends up being a more foundational skill for actual abstract reasoning/creativity/thinking/etc, then good luck developing a RAG tool for it; that'd essentially require symbolic AI to catch up with neural networks, in which case why use NNs anymore anyways

  • @Khari99 · 21 days ago

    Curious how this compares to liquid models

  • @Tunadorable · 21 days ago

    been meaning to look into those, maybe one day

  • @VincentKun · 21 days ago

    What type of liquid models do you mean, Liquid State Machine based ones? Because I've followed an entire course on Reservoir Computing and now I'm wondering why physical systems can't be used more to emulate computation.

  • @sikunowlol · 21 days ago

    oi?

  • @444haluk · 18 days ago

    If these people had studied a little math, they would see that they are talking about Euclidean geometry. Geometry isn't confined to Euclidean, hence learning isn't confined to it either, hence position as a state isn't either. You need certain geometric priors to stay in Euclidean geometry.

  • @Tunadorable · 18 days ago

    interesting, didn't do much geometry in school so I didn't make the connection. what's your interpretation of the empirical results then?

  • @chickenp7038 · 20 days ago

    can you do a video about mamba2?

  • @Tunadorable · 20 days ago

    it's on my todo list to do a full from-scratch tensor-by-tensor code walkthrough on mamba 2, but don't hold your breath, it may be a while before i get to it

  • @chickenp7038 · 20 days ago

    @Tunadorable great, please also explain the attention duality

  • @Tunadorable · 19 days ago

    kzread.info/dash/bejne/gYucqs6kZa-sn5s.html

  • @chickenp7038 · 19 days ago

    @Tunadorable yes, but is this the exact same math that mamba2 figured out?

  • @Summersault666 · 21 days ago

    But isn't Mamba a type of RNN?

  • @Tunadorable · 21 days ago

    great question & common misconception. during inference they're effectively the same, but during training they're different. the manner in which they differ during training (being parallelizable) affects what they can actually learn, meaning that even though they train much more quickly than an RNN (hence why they're preferred), the things they learn are in theory not as good. no one can actually perform an equivalent amount of training on RNNs, because being non-parallelizable means they would take an ungodly amount of compute, so the gap comes both from the quantity of training we can perform and from the specifics of the way the training works leading to different dynamics

  • @wanfuse · 21 days ago

    Create a program and have the program do the recursion instead. The problem of course is a chicken-and-egg problem, but the RNN can make the program, so... I assume that the issue is one of speed?

  • @Tunadorable · 21 days ago

    i didn't clarify it well in the video, but as programmers we tend to assume this can be solved with a for loop or recursive function call. the reality is that the issue of state tracking is a far more abstract one that's not as simple as iterating a variable. in the best case it's this simple combinatorics or chess example, which yes can be done with simple code, but in the worst case it's more like keeping track of abstract concepts in your head

  • @wanfuse · 21 days ago

    @Tunadorable So it's a quantity issue? I deal a lot with recursion and combinatorics (not well, but a lot), so I know its limits and speed issues. For some of it I have made some home-grown math reduction solutions, but I'm unsatisfied: good but not nearly satisfactory results in dealing with speed issues. I inch toward my objectives; another 100,000x speed improvement is my goal, and I have a path to get there. Right now I am trying to get my "slow" proof-of-concept version working with a 100% success rate on all tests.

  • @Nick_With_A_Stick · 18 days ago

    But sambaaaaaaaaa. Hybrid SSMs > LLMs & SSMs

  • @Tunadorable · 18 days ago

    hoping to see a hybrid come out that inarguably beats transformers & SSMs and is actually SotA-level usable soon. likely requires one of the big labs getting behind it and a lot of very expensive hyperparameter tuning to find the right ratio of the two

  • @Nick_With_A_Stick · 18 days ago

    @Tunadorable agreed, if Microsoft allows them to drop Samba 3.6b that would go crazy, since it outperforms phi-3 mini (3.6b) on the same training dataset.

  • @stergiosbachoumas2476 · 21 days ago

    Interesting read. By the way, you cough in a lot of your videos, take a look at it if you haven't. Stay safe.

  • @Tunadorable · 21 days ago

    yeah thanks it’s persisting from a virus, supposed to take months to go away unfortunately

  • @ickorling7328 · 21 days ago

    Unpasteurized orange juice (a vitamin C source) is sold at Trader Joe's, and D3 + K2 supplements are sold together (rarely) in one pill because they balance each other in the body. The same is true for potassium iodide, yet it's dangerous to take a lot of iodine in any case, but it *is* required consistently for immune processes. The potassium iodide must be balanced in your body, same for D3 and K2. Same for magnesium and iron. These things all together will help defeat the immune suppression you are suffering from, @Tunadorable