Matrix Multiplication is AI - What 1.58b LLMs Mean for NVIDIA

In this video, you'll learn why matrix multiplication is the most important primitive that modern computing infrastructure needs to perform efficiently. You'll learn why AI is built on matmul(), what that means for NVIDIA, and how 1-bit LLMs and integer quantization change the game.
♥️ Join my free email newsletter to stay on the right side of change:
👉 blog.finxter.com/email-academy/
Also, make sure to check out the AI and prompt engineering courses on the Finxter Academy:
👉 academy.finxter.com
🚀 Prompt engineers can scale their reach, success, and impact by orders of magnitude!
You can get access to all courses by becoming a channel member here:
👉 / @finxter

Comments: 31

  • @msclrhd
    21 days ago

    You would still need some circuitry to determine whether to use add (1), skip (0), or subtract (-1) in the ALU based on the 1.58-bit value. That would be a specialized 1.58-bit "multiply" operation that uses the data to select the operation (ADD, NOP, SUB) accordingly. It would be faster than a regular multiply, as it could complete in 1 cycle instead of several.
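
As a rough illustration of the add/skip/subtract selection @msclrhd describes, here is a minimal Python sketch of a ternary dot product that never performs a real multiplication (the function name and shapes are illustrative, not from the video):

```python
# Minimal sketch: a ternary "multiply" is just an operation select per weight,
# so a dot product needs no multiplier at all.
def ternary_dot(x, w):
    """x: list of activations, w: list of trits in {-1, 0, 1}."""
    acc = 0
    for xi, wi in zip(x, w):
        if wi == 1:        # ADD
            acc += xi
        elif wi == -1:     # SUB
            acc -= xi
        # wi == 0: NOP, skip the element entirely
    return acc

print(ternary_dot([2.0, -3.0, 5.0, 1.0], [1, 0, -1, 1]))  # 2 - 5 + 1 = -2.0
```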

  • @SapienSpace
    22 days ago

    Interesting video, I will need to look more into this, thank you for sharing!

  • @finxter
    21 days ago

    Thanks, man, great to have you here! :)

  • @jimp7148
    21 days ago

    Sharing this everywhere!

  • @finxter
    21 days ago

    Thanks, very kind of you. :)

  • @MichaelScharf
    22 days ago

    Thank you for this fantastic video🎉

  • @finxter
    21 days ago

    You like it? Thanks!

  • @user-wg3rr9jh9h
    16 days ago

    1.58-bit instruction ops would make great RISC-V vector CPU extensions 🧐.

  • @finxter
    14 days ago

    Can you elaborate?

  • @user-wg3rr9jh9h
    14 days ago

    @@finxter A vector ternary instruction extension to support ternary adds & comparison ops.
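
A hedged sketch of what such vector ternary ops could compute, emulated here with NumPy comparisons and masked adds; nothing below is an actual RISC-V intrinsic:

```python
# Vectorized version of the same idea: only comparisons and adds, no multiplies.
import numpy as np

def ternary_dot_vec(x, w):
    """x: float activations, w: trits in {-1, 0, 1} as an int array."""
    return x[w == 1].sum() - x[w == -1].sum()

x = np.array([2.0, -3.0, 5.0, 1.0])
w = np.array([1, 0, -1, 1])
print(ternary_dot_vec(x, w))  # -2.0
```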

  • @Alice_Fumo
    21 days ago

    After yesterday's video I initially thought that if it wasn't possible to use BitNet during training, it wouldn't end up giving us much better models, which disappointed me. However, I thought about it a bit more: if you can lower the price of inference far enough, you actually get much more capability, even if you don't scale the models further.

    With BitNet and specialized hardware, inference cost presumably drops by something like 20-100x, eventually hitting a memory bottleneck. I imagine the trit matmul operation takes something like 1000 times fewer transistors to implement than the 8-bit version.

    When inference is super cheap, fundamentally different techniques become viable: the approach that generates n tokens per output token (the example given was n = 8), all sorts of reinforcement learning techniques, running thousands of inferences and picking the best one, tree of thoughts, stupidly long context windows, etc.

    Thus, it really is a big deal if you have a way to reduce inference costs massively. Plus it makes robots that interact with the world much more viable due to lower latency, etc.
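
To make the "run many inferences and pick the best one" idea concrete, here is a hedged best-of-N sketch; generate_candidate() and score() are hypothetical placeholders for a cheap model call and a verifier or reward signal, not anything from the video:

```python
# Hedged best-of-N sampling sketch. Both helpers are hypothetical stand-ins.
import random

def generate_candidate(prompt: str) -> str:
    # Stand-in for one (now cheap) ternary-model inference.
    return prompt + " -> answer " + str(random.randint(0, 999))

def score(candidate: str) -> float:
    # Stand-in for a verifier, reward model, or heuristic.
    return random.random()

def best_of_n(prompt: str, n: int = 1000) -> str:
    # Cheap inference makes large n affordable; keep the highest-scoring sample.
    candidates = (generate_candidate(prompt) for _ in range(n))
    return max(candidates, key=score)

print(best_of_n("2 + 2 = ?", n=16))
```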

  • @finxter
    21 days ago

    That's very interesting! You should make a video on it and share it with me. :)

  • @Alice_Fumo
    21 days ago

    @@finxter I'll consider it. What exactly would you want me to go over? All the ways in which we can get models to be more capable if inference cost wasn't an issue without scaling up the models?

  • @alexeykulikov5661
    18 days ago

    I only learn about AI/neural networks where it doesn't come down to math/code (sadly, at least for now); I love to grasp the general concepts and ideas. While I was decent at programming and the "simple", everyday math I needed, I always pathologically feared math and quickly forgot the complex parts after passing whatever exams I had to... So I may be wrong due to lack of knowledge.

    But doesn't this approach at least speed up the forward pass/inference and make it more energy efficient by a lot? Roughly 10x on current-day chips, and potentially 100-1000x on specialized ones they could design in the future? As far as I understand, in 8-16-bit float networks the forward pass was already a few times faster than the backward pass, so the gains would be limited to a few dozen percent at most (the computational fraction of the forward pass). But it's still something...

    And couldn't they adjust the weights faster too, with bigger deltas, since there is almost no point in iterating slowly over the whole available float range when the weights are clamped to [-1, 0, 1] anyway? Although they might need to be careful (slow it down) close to the clamping thresholds, I guess? And it likely increases the math required to update the weights too, reducing the gains...

    I sense there must be ways to optimize this significantly: we only use the -1, 0, 1 range but still train in full precision, so we shouldn't need to iterate over all of it... I just don't know how, due to my lack of knowledge here. And maybe we could run multiple inferences in the forward pass too, and train on the "best" one / better ignore the randomness. Turn inference into a tree of thoughts, extract its steps and essence, and train on those separately... (by "train" I mean run multiple backward passes). There MUST be some way to accelerate training significantly thanks to this approach, either through computational efficiency or through the quality of the training (backward passes). I hope minds more knowledgeable than mine are already digging into something like this.
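
For context, the BitNet b1.58 paper quantizes full-precision weights to {-1, 0, 1} with an "absmean" rule. A simplified NumPy sketch (the real recipe also quantizes activations and handles per-layer scales):

```python
# Rough sketch of absmean weight quantization as in BitNet b1.58:
# scale by the mean absolute weight, round, and clip to {-1, 0, 1}.
import numpy as np

def ternary_quantize(W: np.ndarray, eps: float = 1e-8):
    gamma = np.abs(W).mean()                                   # per-tensor scale
    W_q = np.clip(np.round(W / (gamma + eps)), -1, 1).astype(np.int8)
    return W_q, gamma                                          # keep gamma to rescale outputs

W = np.random.randn(4, 4).astype(np.float32)
W_q, gamma = ternary_quantize(W)
print(W_q)
```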

  • @Alice_Fumo
    18 days ago

    @@alexeykulikov5661 The issue with low-precision training is that you need to be able to calculate gradients, which define a direction in which to make a weight update. The more granular the weight updates you can make, the better. The lower the precision, the faster you run into numerical instability and NaNs, and training feasibility drops immensely. Training an 8-bit-precision model is already challenging, but when your numbers are literally -1, 0 and 1, you can't make meaningful directional updates to them.

    However, as I mentioned, if your inference is much cheaper, many approaches to making better models become more viable. For example, if you are refining your model in a reinforcement-learning fashion, you might give it tasks and have it complete them. When it succeeds, its completion outputs become new training data. If inference is cheap enough, you can generate a lot of that and use it to train the base model to become more capable. You might even evaluate these task completions based on how quick or comprehensive they were and thus have many options to pick the best one from.

    This way, even though training is still slow, you use a lot of inferences to make a relatively small amount of very high-quality training data. You only need to re-quantize and redeploy the model occasionally, so that's not much of an issue. Thus far this wasn't very feasible or economical, but with 1.58-bit networks and hardware specialized for them, it really may become viable.
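
What makes this workable in practice, per the BitNet papers, is keeping full-precision latent weights and using a straight-through estimator: the forward pass sees ternary weights, but gradients update the full-precision copy. A simplified PyTorch-style sketch (assumptions noted in comments):

```python
# Hedged sketch of straight-through ternary quantization. Simplified relative
# to the actual BitNet training recipe (no activation quantization, no layers).
import torch

def ste_ternary(w: torch.Tensor) -> torch.Tensor:
    gamma = w.abs().mean()
    w_q = torch.clamp(torch.round(w / (gamma + 1e-8)), -1, 1) * gamma
    # Straight-through: forward uses w_q, backward behaves like identity in w.
    return w + (w_q - w).detach()

w = torch.randn(8, 8, requires_grad=True)   # latent full-precision weights
x = torch.randn(4, 8)
y = x @ ste_ternary(w).t()
y.sum().backward()                          # gradients land on w, not on the trits
print(w.grad.shape)
```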

  • @alexeykulikov5661
    18 days ago

    Also, they can definitely trade training time for inference time; the models just need to be trained in a specific way to be able to assess their own thoughts better, somehow, and to assess many paths of thought at once. They are working on it, if you listen to some of the top researchers! And I guess they were mostly bottlenecked by inference speed, at least for some initial but still significant gains.

    Hallucinations can also potentially be reduced a lot by giving the model more time to think about and analyze its response, correct it, focus on new details, expand on them, and so on, looking at the problem from all the angles it can think of. It would also require training them to be more creative and knowledgeable; right now they still often give basically the same response to the same or similar prompts, regardless of how many times you regenerate. That might increase hallucinations, I guess, but creativity plus being able to think 100-1000x faster than an average human's main thought stream should open up immense capabilities. Again, it would require training them in a specific multi-step, reasoning-supporting way, so they can ponder in a somewhat philosophical way, and training them to pay attention to some parts of their context more than others, I guess.

    Also, in the future, they might be able to fit some huge models entirely in a GPU's VRAM, which might bring quite some benefits? And even if it doesn't, server interconnect bandwidth is already increasing quickly, so that bottleneck is being reduced... Fitting models in SRAM will definitely help! Hugely efficient and parallelized Groq-like chips, but not with 256 MB of SRAM, rather with many gigabytes of it? 3D-stacked logic + memory, tightly packed, resembling the biological neuron approach as closely as possible (calculation happening as close to memory as possible, "in" it).

    Watch these if you are interested: kzread.info/dash/bejne/aJ2mwY9mfcqzeqw.html kzread.info/dash/bejne/lX6cmaeGh8ScepM.html

    3D-stacked, very early systems on chip built on a 90nm process, with carbon nanotube transistors (the nanotubes are not aligned properly yet, yet they still provide large gains), with 4 GIGABYTES of non-volatile RRAM on-chip instead of SRAM, performing at >50x performance/watt (if I got it correctly) compared to the then top-tier 7-nanometer processors. I am not sure if they include the very different number of transistors in the comparison too; their chips have ~ Still, it's super impressive. I read some more information and they project a potential ~1000x performance/watt increase with 3D systems on chips.

    What a time to be alive! Potentially... (or being replaced by AI/robots while nobody gives a shit about you, while still being unable to start adult life/a career properly despite lots of effort, or dying in wars where the now-excessive dozens of millions of people are utilized / where states, threatened by the future vast advantage of their competitors, clash while it's not too late ;/)

  • @elim.c.8563
    21 days ago

    Why isn't the FPGA the most flexible in the picture?

  • @finxter
    20 days ago

    It's still more general-purpose than specific-purpose AI chips.

  • @scottfranco1962
    19 days ago

    @@finxter I think you are correct. The arrangement shown, CPU-GPU-FPGA-ASIC, is also in order of difficulty to arrive at a solution.

  • @user-kp4sf9lc2u
    21 days ago

    Nvidia will just add it to their GPUs

  • @finxter
    21 days ago

    Agreed, they probably will.

  • @zxwxz
    11 days ago

    This is just a simple implementation of an op. The GPU only needs to integrate this computation to support it. Even current GPUs can implement it, although that may require combining multiple ops, which could reduce memory efficiency. If this is validated in ultra-large LLMs, it will push AI to another peak, and more edge devices will advance the field at the same time.
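
A crude illustration of "combining multiple ops": a ternary weight matrix splits into a +1 mask and a -1 mask, so the product becomes two ordinary matmuls and a subtraction, at the cost of extra memory traffic (a NumPy sketch, not a real GPU kernel):

```python
# Emulating a ternary matmul with ops GPUs already have: split W into its
# +1 and -1 masks and run two standard matmuls. More memory traffic than a
# native ternary kernel, as the comment notes.
import numpy as np

x = np.random.randn(4, 8).astype(np.float32)
W = np.random.choice([-1, 0, 1], size=(8, 16)).astype(np.float32)

y_native = x @ W
y_emulated = x @ (W == 1).astype(np.float32) - x @ (W == -1).astype(np.float32)

print(np.allclose(y_native, y_emulated))  # True
```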

  • @Dent42
    21 days ago

    *1-trit LLMs

  • @finxter
    20 days ago

    Haha. 100 meme points for this name

  • @Dent42
    20 days ago

    @@finxter That’s… literally the term for trinary/ternary digit