QLoRA: How to Fine-tune an LLM on a Single GPU (w/ Python Code)

Need help with AI? Book a call: calendly.com/shawhintalebi
In this video, I discuss how to fine-tune an LLM using QLoRA (i.e. Quantized Low-Rank Adaptation). Example code is provided for training a custom YouTube comment responder using Mistral-7b-Instruct.
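At a high level, the fine-tuning recipe from the video boils down to the sketch below (a minimal outline, not the exact Colab code; the model repo ID, dataset column name, and hyperparameters are assumptions, so see the linked Colab for the working version):

    import transformers
    from datasets import load_dataset
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    # Load a pre-quantized (GPTQ) Mistral-7b-Instruct checkpoint; the repo ID here is an assumption
    base_id = "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ"
    model = transformers.AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")
    tokenizer = transformers.AutoTokenizer.from_pretrained(base_id)
    tokenizer.pad_token = tokenizer.eos_token

    # YouTube comment/response pairs used for fine-tuning (dataset linked below)
    data = load_dataset("shawhin/shawgpt-youtube-comments")

    # Prepare the quantized model for training and attach trainable LoRA adapters
    model = prepare_model_for_kbit_training(model)
    model = get_peft_model(model, LoraConfig(r=8, lora_alpha=32, lora_dropout=0.05,
                                             target_modules=["q_proj"], task_type="CAUSAL_LM"))

    # Tokenize and train ("example" is an assumed column name; hyperparameters are illustrative)
    tokenized = data.map(lambda x: tokenizer(x["example"], truncation=True), batched=True)
    trainer = transformers.Trainer(
        model=model,
        train_dataset=tokenized["train"],
        args=transformers.TrainingArguments(output_dir="shawgpt-ft", num_train_epochs=10,
                                            learning_rate=2e-4, fp16=True),
        data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()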
👉 Series Playlist: • Large Language Models ...
🎥 Fine-tuning with OpenAI: • 3 Ways to Make a Custo...
📰 Read more: medium.com/towards-data-scien...
💻 Colab: colab.research.google.com/dri...
💻 GitHub: github.com/ShawhinT/KZread-B...
🤗 Model: huggingface.co/shawhin/shawgp...
🤗 Dataset: huggingface.co/datasets/shawh...
Resources
[1] Fine-tuning LLMs: • Fine-tuning Large Lang...
[2] ZeRO paper: arxiv.org/abs/1910.02054
[3] QLoRA paper: arxiv.org/abs/2305.14314
[4] Phi-1 paper: arxiv.org/abs/2306.11644
[5] LoRA paper: arxiv.org/abs/2106.09685
--
Homepage: shawhintalebi.com/
Socials
/ shawhin
/ shawhintalebi
/ shawhint
/ shawhintalebi
The Data Entrepreneurs
🎥 KZread: / @thedataentrepreneurs
👉 Discord: / discord
📰 Medium: / the-data
📅 Events: lu.ma/tde
🗞️ Newsletter: the-data-entrepreneurs.ck.pag...
Support ❤️
www.buymeacoffee.com/shawhint
Intro - 0:00
Fine-tuning (recap) - 0:45
LLMs are (computationally) expensive - 1:22
What is Quantization? - 4:49
4 Ingredients of QLoRA - 7:10
Ingredient 1: 4-bit NormalFloat - 7:28
Ingredient 2: Double Quantization - 9:54
Ingredient 3: Paged Optimizer - 13:45
Ingredient 4: LoRA - 15:40
Bringing it all together - 18:24
Example code: Fine-tuning Mistral-7b-Instruct for YT Comments - 20:35
What's Next? - 35:22

Comments: 127

  • @ShawhinTalebi · 2 months ago

    👉 Series Playlist: kzread.info/head/PLz-ep5RbHosU2hnz5ejezwaYpdMutMVB0
    🎥 Fine-tuning with OpenAI: kzread.info/dash/bejne/ZoZ12KytY8m9n6w.html
    📰 Read more: medium.com/towards-data-science/qlora-how-to-fine-tune-an-llm-on-a-single-gpu-4e44d6b5be32?sk=4dccc921ab3bd4adc90248293cb13740
    💻 Colab: colab.research.google.com/drive/1AErkPgDderPW0dgE230OOjEysd0QV1sR?usp=sharing
    💻 GitHub: github.com/ShawhinT/KZread-Blog/tree/main/LLMs/qlora
    🤗 Model: huggingface.co/shawhin/shawgpt-ft
    🤗 Dataset: huggingface.co/datasets/shawhin/shawgpt-youtube-comments
    --
    Resources
    [1] Fine-tuning LLMs: kzread.info/dash/bejne/l3dqqsZqmKncn9Y.html
    [2] ZeRO paper: arxiv.org/abs/1910.02054
    [3] QLoRA paper: arxiv.org/abs/2305.14314
    [4] Phi-1 paper: arxiv.org/abs/2306.11644
    [5] LoRA paper: arxiv.org/abs/2106.09685

  • @mouadkrikbou4596 · 2 months ago

    Well done! Well explained. I am a data scientist as well and love your videos; a lot of work behind the scenes goes into bringing the concepts in such a simple yet interactive way!! Many thanks Shawhin!!

  • @ShawhinTalebi · 2 months ago

    @@mouadkrikbou4596 Thanks! This one took longer than usual to put together, so glad you enjoyed it :)

  • @soonheng1577 · 9 days ago

    Wow, you are a genius at explaining super hard math concepts in layman-understandable terms with good visual representations. Keep it coming.

  • @manyagupta6375 · 2 months ago

    Your explanations are amazing and the content is great. This is the best playlist on LLMs on YouTube.

  • @Ali-me4tv · 2 months ago

    So far the best explanation on YouTube about this topic.

  • @user-se8ty5nz7d · 1 month ago

    Much appreciation for your work, Shaw! The most transparent and logical explanation of QLoRA fine-tuning; you deserve much more. Wish you the best.

  • @MrCancerbero1983 · 2 months ago

    This is the best explanation that I've ever heard, thanks for all the work!!

  • @chris_zazzman · 1 month ago

    Amazing work Shaw - complex concepts broken down to 'bit-sized bytes' for humans. Appreciate your time & efforts :)

  • @africanbuffalo · 2 months ago

    Thank you Shaw for yet another awesome video succinctly explaining complex topics!

  • @ShawhinTalebi · 2 months ago

    Happy to help!

  • @RohitJain-ls2ov · 2 months ago

    Exactly what I was looking for! Thanks for the video. Keep going!

  • @ShawhinTalebi · 2 months ago

    Great to hear :)

  • @liubovnesterenko956 · 1 month ago

    Thank you for this amazing video, great explanations, very clear and easy to understand!

  • 2 months ago

    Learned a lot. Great video and very accessible. Well Done!

  • @ShawhinTalebi · 2 months ago

    Great to hear! Glad it was helpful :)

  • @el_artmaga_ · 2 months ago

    Great video and your slides are very well organized!

  • @ShawhinTalebi · 2 months ago

    Glad you like them!

  • @ai4sme · 2 months ago

    Amazing explanation!!! Thank you Shaw!

  • @ShawhinTalebi · 2 months ago

    Happy to help!

  • @aldotanca9430 · 2 months ago

    Loved this, very informative and clear!

  • @ShawhinTalebi · 2 months ago

    Thanks Aldo!

  • @ai-for-bim · 1 month ago

    Amazing video! You are the best, man! Thank you so much.

  • @younespiro · 2 months ago

    Thank you for sharing this knowledge, we need more videos like this.

  • @ShawhinTalebi · 2 months ago

    Happy to help! More to come :)

  • @BobTheZealot · 2 months ago

    Great content, thank you!

  • @Wonderfulworldhahaha · 5 days ago

    Great content, Thank you.

  • @ifycadeau · 2 months ago

    Another fire video in the books!

  • @ShawhinTalebi · 2 months ago

    Thanks! 🔥🔥😂

  • @wilfredomartel7781 · 2 months ago

    ❤ really amazing work

  • @madhurjindal1364 · 2 months ago

    Man, you are amazing!

  • @medec021 · 6 days ago

    Awesome, thank you

  • @Blooper1980 · 2 months ago

    Thanks for this!!

  • @mr.daniish · 1 month ago

    This was a value bomb!

  • @dennou2012 · 2 months ago

    Great content!!

  • @narendraparmar1631 · 1 month ago

    Thanks Shaw

  • @dhirajkumarsahu999 · 4 days ago

    Thank you so much! I have one doubt please: even if we set fp16=True, the optimization would still happen in fp32, right? Like you showed at 20:22.
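    (For context, the flag in question is the fp16 argument to TrainingArguments, sketched below with illustrative values. With mixed precision, the forward and backward passes run in fp16 while the optimizer keeps fp32 master weights and states, which matches the fp32 optimizer states mentioned at 20:22.)

        from transformers import TrainingArguments

        # a minimal sketch: fp16=True enables mixed-precision training
        training_args = TrainingArguments(
            output_dir="shawgpt-ft",
            fp16=True,            # fp16 compute for forward/backward passes
            learning_rate=2e-4,   # illustrative values
            num_train_epochs=10,  # optimizer updates still keep fp32 master copies
        )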

  • @felixmuller9062 · 2 months ago

    Great video!! How much GPU memory did you need in the end to fine-tune Mistral-7b?

  • @ShawhinTalebi · 2 months ago

    Glad you liked it! It runs on Colab with 12.7GB system RAM and 15GB GPU RAM. I don't think it went above 10GB of GPU utilization.

  • @manuelbradovent3562 · 1 month ago

    Great video, thanks !

  • @manuelbradovent3562 · 1 month ago

    Additionally, pruning was probably also performed besides quantization in order to get such a low number of trainable parameters.

  • @ShawhinTalebi · 1 month ago

    Thanks for the tip! I'll need to dig into that.

  • @CRCaritas · 1 month ago

    Thank you for the video! Just a small question: at the end, how would you run inference with your fine-tuned model? Do you save it first to the Hub and then load it again? I'm not really sure how to correctly apply the LoRA adapter to the original model after fine-tuning.

  • @ShawhinTalebi · 1 month ago

    Yes, that's how I do it here! There's example code for this in the Colab under "Load Fine-tuned Model" colab.research.google.com/drive/1AErkPgDderPW0dgE230OOjEysd0QV1sR?usp=sharing
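    A minimal sketch of that inference step (the base checkpoint ID is an assumption; the adapter repo is the one linked in the description):

        from transformers import AutoModelForCausalLM, AutoTokenizer
        from peft import PeftModel

        base_id = "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ"  # assumed base checkpoint
        model = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")
        tokenizer = AutoTokenizer.from_pretrained(base_id)

        # apply the fine-tuned LoRA adapter from the Hub on top of the frozen base model
        model = PeftModel.from_pretrained(model, "shawhin/shawgpt-ft")

        inputs = tokenizer("Great video, thanks!", return_tensors="pt").to(model.device)
        print(tokenizer.decode(model.generate(**inputs, max_new_tokens=100)[0]))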

  • @ccapp3389 · 1 month ago

    Good stuff

  • @jjen9595 · 2 months ago

    nice video bro

  • @Etienne_O · 26 days ago

    Thank you for sharing this! Have you tried fine-tuning a Mixture of Experts model like Mixtral 8x7B? Is the process really different? I want to do some testing of my own in the next week. Do you think this requires the same amount of VRAM as a 7B model, or more? I have a MacBook M3 Pro Max with 128GB of shared memory and a Mac Studio with 196GB of shared memory.

  • @ShawhinTalebi · 26 days ago

    I haven't played with Mixtral 8x7B yet, so I don't have much insight. Hope to cover this in a future video :)

  • @naehalmulazim · 1 month ago

    Thank you SO much for covering this, Sir! Small question: if I want to fine-tune a model to understand a new coding language whose syntax is similar to C++, any loose ideas or directions on how I would go about it?

  • @ShawhinTalebi · 1 month ago

    There are many ways you can go about this. While I haven't done anything like that, I'd try taking an existing programming model like CodeLlama and doing self-supervised fine-tuning on example code with docstring-like comments.
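    A rough sketch of what self-supervised (next-token prediction) fine-tuning on example code could look like; the model ID and file paths below are placeholders:

        from datasets import load_dataset
        from transformers import AutoTokenizer, DataCollatorForLanguageModeling

        tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")  # placeholder model
        tokenizer.pad_token = tokenizer.eos_token

        # raw source files of the new language (placeholder glob), no labels required
        raw = load_dataset("text", data_files={"train": "my_language/*.src"})
        tokenized = raw.map(lambda x: tokenizer(x["text"], truncation=True, max_length=1024), batched=True)

        # the collator builds next-token targets from the text itself (that is the "self-supervision")
        collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)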

  • @naehalmulazim · 1 month ago

    @@ShawhinTalebi Thank you so much! Whether QLoRA would be used there, or whether I should skip PEFT fine-tuning and go for a full fine-tune, would depend on experiments, I guess.

  • @pawan3133 · 26 days ago

    Beautifully explained, thanks!!! When you said, for PEFT, "we augment the model with additional parameters that are trainable", how do we add these parameters exactly? Do we add a new layer? Also, when we say "% trainable parameters out of total parameters", doesn't that mean that we are updating a certain % of the original parameters?

  • @ShawhinTalebi · 26 days ago

    I explain how LoRA works here: kzread.info/dash/bejne/l3dqqsZqmKncn9Y.htmlsi=_3PK3Kj4Zxs844qg&t=6 Good question. We do not touch any of the original parameters. This is just done to give a sense of the relative computational savings of PEFT.
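    Concretely, the additional trainable parameters are the low-rank A/B matrices that peft attaches beside the targeted layers; a minimal sketch (rank and target modules are illustrative):

        from peft import LoraConfig, get_peft_model

        lora_config = LoraConfig(
            r=8,                        # rank of the low-rank update (illustrative)
            lora_alpha=32,
            target_modules=["q_proj"],  # only augment the query projections
            lora_dropout=0.05,
            task_type="CAUSAL_LM",
        )

        model = get_peft_model(model, lora_config)  # original weights stay frozen
        model.print_trainable_parameters()          # reports trainable vs. total parameter counts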

  • @pisthaoct03 · 1 month ago

    Thanks for sharing. I have a question: instead of the quantized model, can I load the base Mistral model and follow this process?

  • @ShawhinTalebi · 1 month ago

    Yes, given that you have enough memory for the model.

  • @ahmadalhineidi6414 · 12 days ago

    Great video and explanation! Thanks a lot. For the code, have you tried to use:

        from transformers import BitsAndBytesConfig
        nf4_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_use_double_quant=True,
            bnb_4bit_compute_dtype=torch.bfloat16
        )

    and then pass that as the quantization config when loading the model? This would include the other aspects from the QLoRA paper, no?

  • @ShawhinTalebi · 5 days ago

    Thanks for sharing! I'll need to try that out. I remember running into issues when trying this on my first pass.
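    Passing that config when loading the model would look roughly like this (the full-precision base repo ID is an assumption; this is the bitsandbytes NF4 route from the QLoRA paper, as opposed to the GPTQ checkpoint used in the video):

        import torch
        from transformers import AutoModelForCausalLM, BitsAndBytesConfig

        nf4_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",             # ingredient 1: 4-bit NormalFloat
            bnb_4bit_use_double_quant=True,        # ingredient 2: double quantization
            bnb_4bit_compute_dtype=torch.bfloat16,
        )

        model = AutoModelForCausalLM.from_pretrained(
            "mistralai/Mistral-7B-Instruct-v0.2",  # assumed full-precision base model
            quantization_config=nf4_config,
            device_map="auto",
        )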

  • @pravingaikwad1337 · 26 days ago

    Is it like the base model is stored in 4-bit, and as the data (X vector) passes through a layer, that layer is first dequantized and then the matrix multiplication is done (X*W)? And the same thing for LoRA as well? And after we get Y (by adding the output of LoRA and the base layer), the W and LoRA layers are quantized back to 4-bit and Y is passed on to the next layer? Also, if the LoRA is at the base of the model, does that mean that to update the parameters of this LoRA we need to calculate the gradients of the loss wrt all the W and LoRA matrices above it?

  • @ShawhinTalebi · 26 days ago

    That's a great question. Honestly I'm not entirely sure, but what you said makes sense. For inference, weights are dequantized layer by layer so that multiplication is possible with FP16 inputs, and there's no need to dequantize the LoRA weights since these are already FP16. There's no need to compute gradients for the original parameters because those are frozen, i.e. we treat them as constants.
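    Conceptually (pseudocode, not actual library calls), the forward pass described above looks something like:

        # pseudocode sketch of one QLoRA-style linear layer's forward pass
        def qlora_linear_forward(x_fp16, W_4bit, quant_state, A_fp16, B_fp16, scaling):
            W_fp16 = dequantize(W_4bit, quant_state)    # 4-bit base weights -> fp16, on the fly
            base_out = x_fp16 @ W_fp16.T                # frozen base projection
            lora_out = (x_fp16 @ A_fp16.T) @ B_fp16.T   # LoRA path, already fp16 and trainable
            return base_out + scaling * lora_out        # the dequantized copy is discarded afterwards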

  • @Jordano7000 · 1 month ago

    Hey, would this model work if I wanted to input a DNA sequence, for example ATCGTGC, and have the model respond with the gene name (for example Gene X)?

  • @ShawhinTalebi · 1 month ago

    I don't know honestly, but it's worth a try. LLMs have a funny way of surprising us.

  • @Eliot-nr7zq · 2 months ago

    Thank you for sharing this fantastic video! Would it be worthwhile to explore a similar approach using unsupervised learning?

  • @ShawhinTalebi · 2 months ago

    Glad you liked it! When it comes to fine-tuning, the closest thing would be self-supervised learning. This could make sense if trying to further train a model on a knowledge base (e.g. sklearn documentation). However, empirically, fine-tuning tends to be a less effective way to endow a model with specialized knowledge compared to a RAG system.

  • @lalmimaj · 2 months ago

    Hi, how long did it take you to fine-tune Mistral in this example?

  • @ShawhinTalebi · 2 months ago

    Took about 10 min to run in Colab

  • @trsd8640 · 2 months ago

    Thank you for this great video! If you find a way to get this working on Apple silicon machines, we would love to see a video about it!

  • @ShawhinTalebi · 2 months ago

    Thanks for the suggestion! Once I get something working I'll be sure to share it.

  • @nimesh.akalanka · 21 days ago

    How can I fine-tune the Llama 3 8B model for free on my local hardware, specifically a ThinkStation P620 Tower Workstation with an AMD Ryzen Threadripper PRO 5945WX processor, 128 GB DDR4 RAM, and two NVIDIA RTX A4000 16GB GPUs in SLI? I am new to this and have prepared a dataset for training. Is this feasible?

  • @ShawhinTalebi · 19 days ago

    That's a lot of firepower! You should be able to do full fine-tuning with that setup. Perhaps you can try using the example code as a jumping-off point.

  • @jeffg4686 · 2 months ago

    Any idea if GPTQ support is coming to Mac M1 at some point?

  • @ShawhinTalebi · 2 months ago

    I doubt it. There is an alternative format that works on Mac called GGUF.

  • @jeffg4686 · 2 months ago

    @@ShawhinTalebi Thanks!

  • @pepeballesteros9488 · 26 days ago

    What's the loss function for this NLP task? I mean, what is the quantitative measure that determines a good response from a bad one?

  • @ShawhinTalebi · 26 days ago

    I believe cross entropy is used here.
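    (For reference, a minimal sketch of the token-level cross-entropy used for causal language modeling: the target for each position is simply the next token, so logits are shifted against labels.)

        import torch.nn.functional as F

        # logits: (batch, seq_len, vocab_size); labels: (batch, seq_len) token ids
        def causal_lm_loss(logits, labels):
            shift_logits = logits[:, :-1, :]   # predictions for positions 1..T
            shift_labels = labels[:, 1:]       # the actual next tokens
            return F.cross_entropy(shift_logits.reshape(-1, shift_logits.size(-1)),
                                   shift_labels.reshape(-1),
                                   ignore_index=-100)  # padding/prompt tokens can be masked out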

  • @pepeballesteros9488 · 26 days ago

    @@ShawhinTalebi Cheers, I'll look it up. Amazing content Shaw!

  • @u04vw9 · 14 days ago

    Have you solved the Mac issue? Thanks!

  • @HarshvardhanKanthode · 12 days ago

    Lemme know as well, I was pretty bummed when I found out bitsandbytes doesn't work on M2

  • @ShawhinTalebi · 12 days ago

    Not yet. However, now that Llama 3 is out, I have an excuse to spend more time with it. Hope to revisit this in June.

  • @yotubecreators47 · 1 month ago

    I can't save this video, do you know why? Can you please enable saving videos to a playlist?

  • @ShawhinTalebi · 1 month ago

    That's strange. Are you still having this issue?

  • @yotubecreators47 · 1 month ago

    No I can save it now :), thanks a lot

  • @samyio4256 · 1 month ago

    When you say "memory" do you mean RAM or VRAM?

  • @ShawhinTalebi · 1 month ago

    Both! QLoRA specifically uses Nvidia's unified memory feature.
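    (Related: the paged-optimizer ingredient is what leans on that unified memory. In the transformers Trainer it can be selected via the optim flag, sketched below; the other argument values are illustrative.)

        from transformers import TrainingArguments

        training_args = TrainingArguments(
            output_dir="shawgpt-ft",
            optim="paged_adamw_8bit",  # bitsandbytes paged optimizer: states can spill to CPU RAM on memory spikes
            fp16=True,
        )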

  • @yangyang1412 · 1 month ago

    What is the minimum VRAM spec for this tutorial?

  • @ShawhinTalebi · 26 days ago

    Runs on Google Colab using 13GB of memory (6.5 CPU RAM + 6.5 VRAM).

  • @gk_12344 · 1 month ago

    Does it work with GGUF models?

  • @ShawhinTalebi · 26 days ago

    I didn't try it, but I'm sure there is a way to do that.

  • @FrancescoFiamingo99 · 12 days ago

    Dear Shaw, I'm a passionate old guy (I'm 54 :)) into AI. It's amazing how you can explain concepts in simple words, so that even an old mammoth like me can understand. Of course, being a total "artisan" in this field (my job is totally different), I'm facing problems that will look very simple to your eyes. Usually I ask ChatGPT-4 for support to learn, understand, and correct, but this topic and some of the Python libraries are too recent and not yet in the latest version of ChatGPT-4, so I need your help. I'm not using Colab because I already have a similar setup on my machine (16 + 16, like in your example), and I downloaded both the model and the dataset to my machine, but I'm getting this error:

        ImportError: Found an incompatible version of auto-gptq. Found version 0.3.1, but only version above 0.4.99 are supported

    I tried to upgrade my version but it seems not to work:

        ERROR: No matching distribution found for auto-gptq== (any higher than 0.3.1)

    How can I solve the problem?

  • @ShawhinTalebi · 12 days ago

    It seems like an issue setting up the environment. You can try manually setting the package versions when installing them on your machine based on the Google Colab code.

  • @FrancescoFiamingo99 · 12 days ago

    @@ShawhinTalebi I will try and let you know, thanks for the feedback.

  • @nikandr8685 · 1 month ago

    When I tried this, I got this exception: "NotImplementedError: Cannot copy out of meta tensor; no data!" It happens in this step:

        # configure trainer
        trainer = transformers.Trainer(
            model=model,
            train_dataset=tokenized_data["train"],
            eval_dataset=tokenized_data["test"],
            args=training_args,
            data_collator=data_collator
        )

        # train model
        model.config.use_cache = False  # silence the warnings. Please re-enable for inference!

    Do you have any idea?

  • @ShawhinTalebi · 1 month ago

    This link might be helpful: github.com/AUTOMATIC1111/stable-diffusion-webui/issues/13087

  • @nikandr8685 · 1 month ago

    @@ShawhinTalebi Thank you. For others with the same problem, this solved it for me:

        import sys
        sys.argv.append("--disable-model-loading-ram-optimization")

  • @sparkledark3713 · 1 month ago

    It's 264M parameters because those are the only ones which are trainable. The rest are frozen.

  • @ShawhinTalebi · 1 month ago

    Frozen from LoRA or something else?

  • @sparkledark3713 · 28 days ago

    @@ShawhinTalebi Like the main model parameters are frozen except the LoRA parameters. Maybe that's why.

  • @edsonjr6972 · 2 months ago

    My guess is that q_proj has 264M parameters, and that's why it's showing only that.

  • @ShawhinTalebi · 2 months ago

    Wouldn't that make it 264M trainable parameters then?

  • @itchainx4375 · 1 month ago

    @@ShawhinTalebi The training is for a smaller low-rank matrix.

  • @itchainx4375 · 1 month ago

    Not for this reason; you can try changing target_modules to see changes in the trainable parameter count.

  • @BenLewisE · 1 month ago

    @@ShawhinTalebi I believe that @edsonjr6972 is right and that the trainable parameter count is reduced significantly because you are not _just_ targeting only certain layers, but you are also using LoRA decomposition into smaller low-rank matrices. So 264M is probably the number of all of the parameters in the `q_proj` layers, and the 2M is the ~1% of those parameters that you are actually training due to LoRA.
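    (A quick back-of-the-envelope check of that ~2M figure, assuming Mistral-7b's hidden size of 4096, 32 layers, and a LoRA rank of 8; the rank is an assumption:)

        d, r, n_layers = 4096, 8, 32        # hidden size, LoRA rank (assumed), transformer layers
        lora_params = 2 * d * r * n_layers  # one A (r x d) and one B (d x r) matrix per targeted q_proj
        print(lora_params)                  # 2,097,152 -> roughly the ~2M trainable parameters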

  • @xi8t-gk1oi · 2 months ago

    fp16=True causes training to fail with the error "No inf checks were recorded for this optimizer." Setting fp16=False lets training complete successfully, but the loss and eval loss are the same for every epoch.

  • @xi8t-gk1oi · 2 months ago

    I am trying to fine-tune on my own dataset with 20,000 messages in the format msg_id, sender_id, content, reply_to, interval (between this and the previous message) to generate similar messages with a similar format.

  • @ShawhinTalebi · 2 months ago

    Are you running the provided script in Colab?

  • @xi8t-gk1oi · 2 months ago

    @@ShawhinTalebi no, on my own machine

  • @MrCancerbero1983 · 2 months ago

    Same here, but training doesn't take effect; I got the same answer after training.

  • @MrCancerbero1983 · 2 months ago

    I changed the torch version to match Colab via pip install torch==2.1.0, and it worked.

  • @sapandeepsandhu4410 · 1 month ago

    Enlightening journey through the intricacies of Large Language Model (LLM) optimization! 🌌🖥 Your adept presentation not only demystifies the process but also serves as a beacon of inspiration for both burgeoning and seasoned developers navigating the vast seas of AI technology. The elegance with which you delineated the nuances of QLoRA and its transformative approach to fine-tuning LLMs on a singular GPU setup is nothing short of revelatory. 📘✨ It's a masterclass in making advanced AI technologies accessible and practical for a wider audience, empowering individuals to harness the full potential of LLMs without the necessity for extensive computational resources.

  • @PranavBaviskar · 12 days ago

    Getting a KeyError: Mistral

  • @ShawhinTalebi · 12 days ago

    I'm not able to replicate that error. Are you running the example in Colab?

  • @Ev4Nou4 · 1 month ago

    im so fucking lost

  • @ShawhinTalebi · 1 month ago

    This video goes pretty deep into the technical details. Watching some of the previous videos in the series might help give more context. I also do office hours if you have any specific questions: calendly.com/shawhintalebi/office-hours

  • @jjen9595 · 2 months ago

    I get this error:

        OSError: TheBloke/dolphin-2.6-mistral-7B-dpo-laser-GGUF does not appear to have a file named pytorch_model.bin, tf_model.h5, model.ckpt or flax_model.msgpack.

    What do I do? :c

  • @ShawhinTalebi · 2 months ago

    Not sure, I haven't come across that one before

  • @jjen9595 · 2 months ago

    @@ShawhinTalebi I solved it: you must put the correct model in the Colab, one that is similar to the one you have. I still don't know how to make a meta for Hugging Face :c