GROKKED LLM beats RAG Reasoning (Part 3)

Science & Technology

We open the black box of GROKKED LLMs and analyze each layer of the transformer architecture for its performance in causal reasoning, after the grokking phase transition of our LLM.
Current research in AI clearly indicates that established LLMs like Gemini Pro 1.5 or GPT-4 Turbo fail at deep reasoning, even when integrated into complex RAG systems.
A grokking phase transition is essential for LLMs to activate their performance phase, reaching close to 99% accuracy on unseen tasks in the development and test datasets.
#airesearch
#ainews
#insights

Comments: 42

  • @SirajFlorida
    18 days ago

    Honestly, you have become my favorite vlog. Just fantastic.

  • @code4AI
    16 days ago

    Strong motivational feedback for future work. Thanks.

  • @antoniosmusic
    18 days ago

    Amazing! Thanks so much for sharing your amazing work!

  • @code4AI
    16 days ago

    Thank you for taking the time to write feedback. It's really important for me to also get positive comments.

  • @tanguero2k7
    18 days ago

    I was sooo waiting for this follow-up :D. Thank you!🤩

  • @LamontCranston-qh2rv
    17 days ago

    Truly outstanding! Thank you so much for creating and sharing such high quality content!

  • @christopherchilton-smith6482
    17 days ago

    Would love a tutorial on grokking phi-3; this grokking thing is hard to wrap my head around.

  • @novantha1
    12 days ago

    It's nothing terribly special once you get into it. The part that makes sense is that LLMs "learn" by producing patterns in matrix multiplication that are helpful for predicting information. I.e., if "dog" is the fifth token in your LLM's vocabulary and the next word in the sequence should be "dog", then it's helpful to have a pattern of weights that predicts "5" in that sequence. So you end up with these really complicated patterns of weights that are hard to understand and that progress towards predicting the correct answer. But there are multiple ways to get to the right answer, including "wrong" ways, in the sense that they're shortcuts which might work on the training data but not on all data you might want to predict (remember when your math teacher got mad if you didn't show your work?).

    Basically, grokking is just training after your LLM's loss goes down on the training set, until it snaps to the "underlying algorithm" and starts predicting the right answer on the test dataset. For example, if you want an LLM to predict sin(x) = y, at first it might be really bad, then it might start predicting the right values for some inputs but not for all of them... until you train it long enough that it generalizes (because understanding the formula is easier than memorizing every possible value in a lookup table of floating-point numbers).

    In other words: LLMs memorize answers in normal training, but in training aimed specifically at "grokking", they "understand".
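
    A minimal sketch of the kind of toy setup described above, assuming the classic modular-arithmetic benchmark from the grokking literature (an illustration, not code from the video): a tiny network trained with strong weight decay for far more steps than it needs to fit the training set, while validation accuracy only jumps much later.

        # Toy grokking run: memorize first, generalize (much) later.
        import torch
        import torch.nn as nn

        p = 97                                                    # learn (a + b) mod p
        pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))
        labels = (pairs[:, 0] + pairs[:, 1]) % p
        perm = torch.randperm(len(pairs))
        train_idx, val_idx = perm[: len(pairs) // 2], perm[len(pairs) // 2:]

        model = nn.Sequential(                                    # embeddings + small MLP head
            nn.Embedding(p, 128), nn.Flatten(),                   # (a, b) -> 256-dim vector
            nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, p),
        )
        opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
        loss_fn = nn.CrossEntropyLoss()

        for step in range(100_000):                               # far beyond train-set convergence
            opt.zero_grad()
            loss = loss_fn(model(pairs[train_idx]), labels[train_idx])
            loss.backward()
            opt.step()
            if step % 1000 == 0:
                with torch.no_grad():
                    val_acc = (model(pairs[val_idx]).argmax(-1) == labels[val_idx]).float().mean()
                print(f"step {step}: train loss {loss.item():.4f}, val acc {val_acc.item():.3f}")

    The width, weight decay, and step count are assumptions in the spirit of the original grokking papers, not values taken from the video.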

  • @christopherchilton-smith6482
    12 days ago

    @novantha1 Wow, that actually makes a lot of sense, thank you.

  • @luke2642
    17 days ago

    In LLMs, is there a concept of "strong generalisation" (defined as two equivalent/identical networks trained on non-overlapping sets of data that both perform at 100%), as seen in BF-CNNs? It's a bit off topic, but it's great work that also shows generalisation and geometric interpretations: "Generalization in diffusion models arises from geometry-adaptive harmonic representations". There's a great YouTube video; Zahra gives a great talk on it. It builds on earlier work by Eero Simoncelli's team on "bias-free CNNs" for denoising, which demonstrates that without the bias parameters the weights generalise much better: train on 5 dB of noise and it works on 30 dB of noise, whereas a regular wx+b fails. They visualise the manifolds too; it's really a good explanation!
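
    For readers unfamiliar with the bias-free idea mentioned above, a minimal sketch (an illustration, not code from the cited papers; the class name is made up): a small denoising CNN whose convolutions use bias=False, so the network contains no additive constants and scaling its input by a positive factor scales its output by the same factor, which is the property those papers connect to generalising across noise levels.

        # Bias-free denoising CNN: no additive constants anywhere in the network.
        import torch
        import torch.nn as nn

        class BiasFreeDenoiser(nn.Module):
            def __init__(self, channels: int = 64, depth: int = 5):
                super().__init__()
                layers = [nn.Conv2d(1, channels, 3, padding=1, bias=False), nn.ReLU()]
                for _ in range(depth - 2):
                    layers += [nn.Conv2d(channels, channels, 3, padding=1, bias=False), nn.ReLU()]
                layers += [nn.Conv2d(channels, 1, 3, padding=1, bias=False)]
                self.net = nn.Sequential(*layers)

            def forward(self, noisy: torch.Tensor) -> torch.Tensor:
                return self.net(noisy)   # f(a * x) == a * f(x) for a >= 0

        # e.g. train on lightly noisy images, then evaluate on much heavier noise:
        # denoiser = BiasFreeDenoiser(); estimate = denoiser(noisy_batch)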

  • @luke2642
    13 days ago

    Just discovered a related paper, 700 citations: "Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models"

  • @1MinuteFlipDoc
    18 days ago

    you are a legend sir!

  • @laslog
    18 days ago

    Incredible stuff, thank you for your work! This is just amazing.

  • @mazakielad
    18 days ago

    As always, thank you for an amazing explanation and review!

  • @code4AI
    16 days ago

    I'm glad you found the work interesting and valuable.

  • @bernardoramos9409
    16 days ago

    Part 4 could be based on "Grokfast: Accelerated Grokking by Amplifying Slow Gradients".
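
    For anyone curious, a rough sketch of the idea in that paper's title (an illustration, not the authors' reference implementation; the helper name and the alpha/lamb defaults are assumptions): keep an exponential moving average of each parameter's gradient as its "slow" component and add an amplified copy of it back onto the raw gradient before the optimizer step.

        # Hypothetical helper: one training step with EMA-amplified slow gradients.
        import torch

        def grokfast_ema_step(model, opt, loss, ema_grads, alpha=0.98, lamb=2.0):
            opt.zero_grad()
            loss.backward()
            for name, p in model.named_parameters():
                if p.grad is None:
                    continue
                ema = ema_grads.setdefault(name, torch.zeros_like(p.grad))
                ema.mul_(alpha).add_(p.grad, alpha=1 - alpha)   # update the slow (low-pass) component
                p.grad.add_(ema, alpha=lamb)                    # amplify it on top of the raw gradient
            opt.step()

        # usage: ema_grads = {}                                  # persists across steps
        #        grokfast_ema_step(model, opt, loss_fn(model(x), y), ema_grads)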

  • @timgorn8927
    18 days ago

    That's incredible! Does this mean that the path to AGI has been paved? Or am I overestimating the results?

  • @alexjensen990
    17 days ago

    Fantastic video as usual... Typically I find myself full of ideas after watching your videos. This time I find myself unsure how I might implement this information moving forward... I guess I will have to sit with it for a while. The irony is not lost on me that my favorite author growing up was Robert Heinlein, and Stranger in a Strange Land was the first book of his I read; yet this is the one topic whose knowledge I cannot immediately use in my projects... 😞

  • @project-asgard
    17 days ago

    Amazing topic, thank you!!!

  • @mlytle0
    18 days ago

    Great content.

  • @rajakumar9956
    18 days ago

    Great video..

  • @publicsectordirect982
    17 days ago

    Great video, thank you! Love your engaging style too :)

  • @code4AI
    16 days ago

    Thanks for watching!

  • @hotbit7327
    17 days ago

    Awesome video. So they have grokked some LLM to perform tests, but there is no grokked LLM that is in the public domain or publicly accessible? Why? Or am I missing something?

  • @andru5054
    18 days ago

    Great video, thanks

  • @code4AI
    16 days ago

    Glad you liked it!

  • @manslaughterinc.9135
    17 days ago

    I still don't understand why grokked models are pitted against RAG. Why can't we combine grokked models with RAG systems?

  • @code4AI
    16 days ago

    An upcoming video will explain my thoughts in more detail. Great comment.

  • @alpineparrot1057
    17 days ago

    I've been thinking about a kind of vector database for grokking; it seems it would still facilitate RAG quite nicely too... Opinions?

  • @DigitalAlligator
    17 days ago

    This is research-frontier knowledge.

  • @pedromoya9127
    18 days ago

    thanks

  • @raymond_luxury_yacht
    17 days ago

    Dayum

  • @user-dk8dm8db8t
    16 days ago

    I have some doubts I would like to clear up. Will grokking be effective just by focusing on dataset construction if we choose to extend the fine-tuning of preexisting pretrained transformer architectures such as Llama 3 8B? Do you pretrain using existing data as atomic facts and use fine-tuning for inferred facts? If you fine-tune, what strategy do you go by: do you fine-tune the entire network so that all gradients are affected and hopefully all weights can reach the grokked state? That strategy might induce drastic forgetfulness, not to mention the mind-splitting compute power required to essentially extend the pretraining. Or do you fine-tune with something like PEFT, or by training only the last few layers, resulting in only the trainable neurons, rather than all of them, reaching the grokked state? And the most important one for me (probably): any resources on how to start coding a grokked transformer?

  • @mulderbm
    18 days ago

    Could not wait for this one. At dinner, it dawned on me that papers on this topic from 3 years back were by OpenAI researchers. So if they played with this back then, are the current models even state of the art, or are they much farther ahead and just milking the current models like their adoption parent did in the desktop application area? It would make Sam's words true that they will steamroll many of the current layer-1 derivatives like RAG and CoT. Someone else also commented that this research is quite old, so if it is, why is this reasoning not already more ironed out and implemented in the OpenAI APIs? Even Google could have implemented it, as much of the grokking research is tied to researchers from them.

  • @densonsmith2
    17 days ago

    Some of Sam Altman's comments about "getting much more out of our training data" make me think that OpenAI groks grokking.

  • @mulderbm
    17 days ago

    @densonsmith2 That and smarter synthetic generation, probably. Feed the models on niche topics with little data from themselves, enough to grok. But we do not know; it is a black box. That should not prevent us from playing around with this, though 😀

  • @fabianaltendorfer11
    13 days ago

    @mulderbm In the previous video he mentioned that the researchers used the original transformer architecture for grokking (not the decoder-only, GPT-style transformer). I'm guessing, but it seems to me that the reason could be the architecture.

  • @publicsectordirect982
    17 days ago

    I can't seem to find part two. It's not easily searchable or linked in the description.

  • @code4AI
    16 days ago

    So if you want to find Part 2 and you are watching Part 3 on a YT channel, there is a tab called Videos, where you find all public videos of this channel in chronological order. And if you still struggle to find the video, when there are three videos whose titles include Grokked or Grokking, and you are at Part 3, the probability that one of the other two videos is Part 2 is non-zero. Thank you for your comment.

  • @publicsectordirect982
    16 days ago

    @code4AI Hi, yes, sorry, I had a brain-fart moment; I found it once I engaged my brain. Thanks for the help.

  • @darkmatter9583
    18 days ago

    qwen 2 please
