Quantization explained with PyTorch - Post-Training Quantization, Quantization-Aware Training

Science & Technology

In this video I will introduce and explain quantization: we will start with a short introduction to the numerical representation of integers and floating-point numbers in computers, then see what quantization is and how it works. I will explore topics like Asymmetric and Symmetric Quantization, Quantization Range, Quantization Granularity, Dynamic and Static Quantization, Post-Training Quantization and Quantization-Aware Training.
Code: github.com/hkproj/quantizatio...
PDF slides: github.com/hkproj/quantizatio...
Chapters
00:00 - Introduction
01:10 - What is quantization?
03:42 - Integer representation
07:25 - Floating-point representation
09:16 - Quantization (details)
13:50 - Asymmetric vs Symmetric Quantization
15:38 - Asymmetric Quantization
18:34 - Symmetric Quantization
20:57 - Asymmetric vs Symmetric Quantization (Python Code)
24:16 - Dynamic Quantization & Calibration
27:57 - Multiply-Accumulate Block
30:05 - Range selection strategies
34:40 - Quantization granularity
35:49 - Post-Training Quantization
43:05 - Quantization-Aware Training
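
Below is a minimal NumPy sketch of the asymmetric and symmetric schemes covered in the chapters above; it is an illustration written for this summary, not the code from the linked repository, and the tensor values and bit-width are arbitrary assumptions.

```python
import numpy as np

def asymmetric_quantize(x, bits=8):
    # Map [min(x), max(x)] onto the unsigned range [0, 2^bits - 1] with a zero-point.
    alpha, beta = float(x.max()), float(x.min())
    scale = (alpha - beta) / (2**bits - 1)
    zero_point = round(-beta / scale)
    q = np.clip(np.round(x / scale) + zero_point, 0, 2**bits - 1).astype(np.uint8)
    return q, scale, zero_point

def asymmetric_dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

def symmetric_quantize(x, bits=8):
    # Map [-max|x|, +max|x|] onto a symmetric signed range; the zero-point is always 0.
    scale = float(np.abs(x).max()) / (2**(bits - 1) - 1)
    q = np.clip(np.round(x / scale), -(2**(bits - 1) - 1), 2**(bits - 1) - 1).astype(np.int8)
    return q, scale

x = np.random.uniform(-3.0, 7.0, size=8).astype(np.float32)
q, s, z = asymmetric_quantize(x)
print(x)
print(asymmetric_dequantize(q, s, z))  # close to x, up to the quantization error
```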

Comments: 74

  • @zendr0 · 7 months ago

    If you are not aware, let me tell you: you are helping a generation of ML practitioners learn all this for free. Huge respect to you, Umar. Thank you for all your hard work ❤

  • @savvysuraj · 5 months ago

    The content made by Umar is helping me a lot. Kudos to Umar.

  • @aireddy · 10 days ago

    @umarjamilai Great job breaking down complex concepts into actionable insights; you explained them in a very simple and easy-to-understand fashion, with practical examples.

  • @ankush4617 · 7 months ago

    I keep hearing about quantization so much; this is the first time I have seen someone go so deep into this topic and come up with such clear explanations! Keep up all your great work, you are a gem to the AI community!! I'm hoping that you will have a video on Mixtral MoE soon 😊

  • @umarjamilai · 7 months ago

    You read my mind about Mistral. Stay tuned! 😺

  • @ankush4617 · 7 months ago

    @umarjamilai ❤

  • @krystofjakubek9376 · 7 months ago

    Great video! Just a clarification: on modern processors, floating-point operations are NOT slower than integer operations. It very much depends on the exact processor, and even then the difference is usually extremely small compared to the other overheads of executing the code. HOWEVER, the reduction in size from a 32-bit float to an 8-bit integer does by itself make the operations a lot faster. The cause is twofold: 1) modern CPUs and GPUs are typically memory bound, so simply put, if we reduce the amount of data the processor needs to load by 4x, we expect the time the processor spends waiting for the next set of data to shrink by 4x as well; 2) pretty much all machine learning code is vectorized. This means that instead of executing each instruction on a single number, the processor grabs N numbers and executes the instruction on all of them at once (SIMD instructions). However, most processors don't fix N; instead they fix the total number of bits all N numbers occupy (for example, AVX2 can do operations on 256 bits at a time), so if we go from 32 bits to 8 bits we can process 4x more numbers at once! This is likely what you mean by operations being faster. Note that CPUs and GPUs are very similar in this regard, only GPUs have many more SIMD lanes (many more bits).
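
    To make the memory-bandwidth point above concrete, here is a tiny PyTorch check; the 4096x4096 size is an arbitrary example chosen for this note, not a figure from the video.

```python
import torch

# Same number of elements, 4x less data to move through the memory hierarchy.
w_fp32 = torch.randn(4096, 4096)                                    # float32: 4 bytes per element
w_int8 = torch.randint(-128, 128, (4096, 4096), dtype=torch.int8)   # int8: 1 byte per element

print(w_fp32.numel() * w_fp32.element_size() / 2**20, "MiB")  # 64.0 MiB
print(w_int8.numel() * w_int8.element_size() / 2**20, "MiB")  # 16.0 MiB
```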

  • @umarjamilai · 7 months ago

    Thanks for the clarification! I was even going to talk about the internal hardware of adders (carry-lookahead adders) to show how a simple operation like addition works and compare it with the many steps required for floating-point numbers (which also involve normalization). Your explanation nailed it! Thanks again!

  • @vik2189 · 3 months ago

    Fantastic video! Probably the best 50 minutes spent on AI related concepts in the past 1 year or so.

  • @dariovicenzo8139 · 3 months ago

    Great job, in particular the examples regarding the conversion to/from integers, not only with formulas but with actual numbers too!

  • @user-rk5mk7jm7r · 6 months ago

    Thanks a lot for the fantastic tutorial. Looking forward to more of the series on LLM quantization! 👏

  • @mandarinboy · 6 months ago

    Great introductory video! Looking forward to GPTQ and AWQ

  • @myaseena · 7 months ago

    Really high quality exposition. Also thanks for providing the slides.

  • @shivamkaushik6637 · 15 days ago

    Thank you for this lecture.

  • @AbdennacerAyeb · 7 months ago

    Keep Going. This is perfect. Thank you for the effort you are making

  • @jiahaosu · 6 months ago

    The best video about quantization, thank you very much!!!! It really helps!

  • @Aaron-hs4gj · 4 months ago

    Excellent explanation, very intuitive. Thanks so much! ❤

  • @asra1kumar · 4 months ago

    This channel features exceptional lectures, and the quality of explanation is truly outstanding. 👌

  • @user-td8vz8cn1h · 4 months ago

    This is one of the few channels that I subscribed to after watching a single video. Your content is very easy to follow and you cover topics holistically with additional clarifications. What a man :)

  • @ojay666 · 4 months ago

    Fantastic tutorial!!!👍👍👍I’m hoping that you will post a tutorial on model pruning soon🤩

  • @jaymn5318 · 5 months ago

    Great lecture. Clean explanation of the field and gives an excellent perspective on these technical topics. Love your lectures. Thanks !

  • @user-lg3jo6ih1t · 4 months ago

    I was searching for Quantization basics and could not find relevant videos... this is a life-saver!! thanks and please keep up the amazing work!

  • @user-qo7vr3ml4c · 2 months ago

    Thank you for the great content, especially the explanation of QAT's goal of reaching a wider loss minimum and how that makes the model robust to errors due to quantization. Thank you.

  • @NJCLM · 6 months ago

    Great video ! Thank you !!

  • @HeyFaheem · 7 months ago

    You are a hidden gem, my brother

  • @koushikkumardey882 · 7 months ago

    becoming a big fan of your work!!

  • @sebastientetaud7485 · 5 months ago

    Excellent Video ! Grazie !

  • @RaviPrakash-dz9fm · 2 months ago

    Legendary content!!

  • @Youngzeez1 · 7 months ago

    Wow, what an eye-opener! I read lots of research papers, but they are mostly confusing; your explanation just opened my eyes! Thank you. Could you please do a video on the quantization of vision transformers for object detection?

  • @andrewchen7710 · 6 months ago

    Umar, I've watched your videos on LLaMA, Mistral, and now quantization. They're absolutely brilliant and I've shared your channel with my colleagues. If you're in Shanghai, allow me to buy you a meal haha! I'm curious about your research process. During the preparation of your next video, I think it would be neat if you documented the timeline of your research/learning and shared it with us in a separate video!

  • @umarjamilai · 6 months ago

    Hi Andrew! Connect with me on LinkedIn and we can share our WeChat. Have a nice day!

  • @Patrick-wn6uj · 4 months ago

    Glad to see fellow shanghai people here hhhhhhh

  • @manishsharma2211 · 7 months ago

    beautiful again, thanks for sharing these

  • @TheEldadcohen · 6 months ago

    Umar, I've seen many of your videos and you are a great teacher! Thank you for your effort in explaining all of these complicated topics in plain (Italian-accented) English. Regarding the content of the video: you showed quantization-aware training and were surprised by the worse result it gave compared to post-training quantization in your concrete example. I think it is because you calibrated the post-training quantization on the same data that you tested it on, so the learned parameters (alpha, beta) are overfitted to the test data; that's why the accuracy was better. I think that if you had tested it on truly held-out data, you probably would have seen the result you anticipated.

  • @bluecup25 · 7 months ago

    Thank you, super clear

  • @ngmson · 7 months ago

    Thank you for sharing.

  • @aminamoudjar4561 · 7 months ago

    Very helpful, thank you so much

  • @user-pe3mt1td6y · 5 months ago

    We need more videos about advanced quantization!

  • @tetnojj2483 · 6 months ago

    Nice video :) A video on the .gguf file format for models would be very interesting :)

  • @ziyadmuhammad3734 · 1 month ago

    Thanks!

  • @amitshukla1495 · 7 months ago

    wohooo ❤

  • @user-kg9zs1xh3u · 7 months ago

    Very good

  • @asra1kumar · 4 months ago

    Thanks

  • @DiegoSilva-dv9uf · 7 months ago

    Valeu!

  • @Erosis · 7 months ago

    You're making all of my lecture materials pointless! (But keep up the great work!)

  • @tubercn · 7 months ago

    Thanks, great video 🐱‍🏍🐱‍🏍 But I have a question: since we already dequantize the output of the last layer using the calibration parameters, why do we need another "torch.quantization.DeQuantStub()" layer in the model to dequantize the output? It seems we have two dequantization steps in a row.

  • @user-hd7xp1qg3j · 7 months ago

    One request: could you explain mixture of experts? I bet you can break down the explanation well.

  • @lukeskywalker7029 · 4 months ago

    @Umar Jamil you said most embedded devices don't support floating-point operations at all? Is that right? What would be an example, and what is that chip architecture called? Do a Raspberry Pi or an Arduino operate only on integer operations internally?

  • @swiftmindai · 7 months ago

    I noticed a small correction that needs to be done at timestamp @28:53 [slide: Low-precision matrix multiplication]. In the first line, the dot products are between each row of X and each column of Y [instead of Y, it should be W, the weight matrix].

  • @umarjamilai · 7 months ago

    You're right, thanks! Thankfully the diagram of the multiply block is correct. I'll fix the slides

  • @pravingaikwad1337 · 3 months ago

    For one layer Y = XW + b, if X, W and b are quantized so that we get Y in quantized form, what is the need to dequantize this Y before feeding it to the next layer?

  • @venkateshr6127 · 7 months ago

    Could you please make a video on how to build tokenizers for languages other than English?

  • @AleksandarCvetkovic-db7lm · 3 months ago

    Could the difference in accuracy between static/dynamic quantization and quantization-aware training be because the model was trained for 5 epochs for static/dynamic quantization and only one epoch for quantization-aware training? I tend to think that 4 more epochs make more of a difference than the quantization method.

  • @bamless95 · 4 months ago

    Be careful: CPython does not do JIT compilation, it is a pretty straightforward stack-based bytecode interpreter.

  • @umarjamilai · 4 months ago

    Bytecode has to be converted into machine code somehow. That's also how .NET works: first C# gets compiled into MSIL (an intermediate representation), and then it just-in-time compiles the MSIL into the machine code for the underlying architecture.

  • @bamless95 · 4 months ago

    Not necessarily; bytecode can just be interpreted in place. In a loose sense it is being "converted" to machine code, meaning that we are executing different snippets of machine code through branching, but JIT compilation has a very different meaning in the compiler and interpreter field. What Python is really doing is executing a loop with a switch branching on every possible opcode. By looking at the interpreter implementation in the CPython GitHub repo in `Python/ceval.c` and `Python/generated_cases.c.h` (alas, YouTube is not letting me post links), you can clearly see there is no JIT compilation involved.

  • @bamless95 · 4 months ago

    What you are saying about C# (and for that matter Java and some other languages like LuaJIT or V8 JavaScript) is indeed true: they typically JIT the code either before or during interpretation. But CPython is a much simpler (and thus slower) implementation of a bytecode interpreter that implements neither JIT compilation nor any form of serious code optimization (aside from a fairly rudimentary peephole optimization step).

  • @bamless95 · 4 months ago

    Don't get me wrong, I think the video is phenomenal. I just wanted to correct a little imperfection that, as a programming-language nerd, I feel is important to get right. Also, greetings from Italy! It is good for once to see a fellow Italian making content that is worth watching on YT 😄

  • @dzvsow2643 · 7 months ago

    Assalamu alaykum, brother. Thanks for your videos! I have been working on game development using pygame for a while and I just want to start deep learning in Python, so could you make a roadmap video?! Thank you again.

  • @umarjamilai · 7 months ago

    Hi! I will do my best! Stay tuned

  • @theguyinthevideo4183 · 5 months ago

    This may be a stupid question, but what's stopping us from just setting the weights and biases to be in integer form? Is it due to the nature of backprop?

  • @umarjamilai · 5 months ago

    Forcing the weights and biases to be integers means adding more constraints to the gradient descent algorithm, which is not easy and computationally expensive. It's like I ask you to solve the equation x^2 - 5x + 4 = 0 but only for integer X. This means you can't just use the formula you learnt in high school for quadratic equations, because that returns real numbers. Hope it helps
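
    For readers wondering how quantization-aware training (the last chapter of the video) gets around this in practice: the usual trick is to "fake-quantize" in the forward pass and let gradients pass through the rounding unchanged (a straight-through estimator). The sketch below is my own illustration of that idea, not the exact mechanism used in the video's code.

```python
import torch

class FakeQuantize(torch.autograd.Function):
    """Round to an integer grid in the forward pass; pass the gradient
    straight through in the backward pass (straight-through estimator)."""

    @staticmethod
    def forward(ctx, x, scale):
        # Simulate symmetric int8 quantization: quantize, then immediately dequantize.
        q = torch.clamp(torch.round(x / scale), -127, 127)
        return q * scale

    @staticmethod
    def backward(ctx, grad_output):
        # Pretend the rounding step was the identity function.
        return grad_output, None

w = torch.randn(4, 4, requires_grad=True)
scale = w.detach().abs().max() / 127
loss = FakeQuantize.apply(w, scale).sum()
loss.backward()
print(w.grad)  # all ones: gradients flow as if no rounding had happened
```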

  • @elieelezra2734 · 7 months ago

    Umar, thanks for all your content. I have stepped up a lot thanks to your work! But there is something I don't get about quantization. Let's say you quantize all the weights of your large model. The prediction is not the same anymore! Does it mean you need to dequantize the prediction? If yes, you do not talk about it, right? Can I have your email to get more details, please?

  • @umarjamilai · 7 months ago

    Hi! Since the output of the last layer (the matrix Y) will be dequantized, the prediction will be "the same" (very similar) as that of the dequantized model. The Y matrix of each layer is always dequantized, so the output of each layer is more or less equal to that of the dequantized model.
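
    A compact sketch of what "dequantizing the Y matrix of each layer" can look like, using a simplified symmetric scheme; this is an illustration for this thread, not the code from the video's repository.

```python
import numpy as np

def sym_quantize(x, bits=8):
    # Symmetric quantization: zero maps to zero, so no zero-point is needed.
    scale = float(np.abs(x).max()) / (2**(bits - 1) - 1)
    return np.round(x / scale).astype(np.int32), scale

def quantized_linear(x_fp32, w_fp32, b_fp32):
    qx, sx = sym_quantize(x_fp32)
    qw, sw = sym_quantize(w_fp32)
    acc = qx @ qw                       # integer multiply-accumulate
    return sx * sw * acc + b_fp32       # dequantize Y back to float for the next layer

x = np.random.randn(1, 16).astype(np.float32)
w = np.random.randn(16, 8).astype(np.float32)
b = np.random.randn(8).astype(np.float32)
print(np.abs(quantized_linear(x, w, b) - (x @ w + b)).max())  # small error vs. the float layer
```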

  • @alainrieger6905 · 7 months ago

    Hi, thanks for your answer @umarjamilai. Does it mean, for post-training quantization, that the more layers a model has, the greater the difference between the quantized and dequantized model, since the error accumulates at each new layer? Thanks in advance.

  • @umarjamilai · 7 months ago

    @alainrieger6905 That's not necessarily true, because the error in one layer may be "positive" and in another "negative", and they may compensate for each other. For sure, the number of bits used for quantization is a good indicator of the quality of the quantization: if you use fewer bits, you will have more error. It's like having an image that is originally 10 MB and trying to compress it to 1 MB or 1 KB: of course in the latter case you'd lose much more quality than in the first.

  • @alainrieger6905 · 7 months ago

    @umarjamilai Thank you, sir! Last question: when you talk about dequantizing a layer's activations, does it mean that the values go back to 32-bit format?

  • @umarjamilai · 7 months ago

    @alainrieger6905 Yes, it means going back to floating-point format.

  • @007Paulius · 7 months ago

    Thanks

  • @sabainaharoon7050 · 5 months ago

    Thanks!

  • @umarjamilai · 5 months ago

    Thanks for your support!
