Fine-tuning LLMs with PEFT and LoRA
Science and Technology
LoRA Colab : colab.research.google.com/dri...
Blog Post: huggingface.co/blog/peft
LoRA Paper: arxiv.org/abs/2106.09685
In this video I look at how to use PEFT to fine-tune any decoder-style GPT model. It goes through the basics of LoRA fine-tuning and how to upload the result to the Hugging Face Hub.
My Links:
Twitter - / sam_witteveen
Linkedin - / samwitteveen
Github:
github.com/samwit/langchain-t...
github.com/samwit/llm-tutorials
00:00 - Intro
00:04 - Problems with fine-tuning
00:48 - Introducing PEFT
01:11 - PEFT other cool techniques
01:51 - LoRA Diagram
03:25 - Hugging Face PEFT Library
04:06 - Code Walkthrough
Comments: 112
Perfect balance of theory and hands-on, with a Colab attached to most of your videos. Much, much appreciated. I recommend this channel to everyone who wants to follow this crazy trend of LLM releases. It's the best way to keep all of us up to date! I learn so much thanks to you, Sam. Thanks a ton. Keep moving forward.
So this seems like the basis for a business: offer to train a custom model for product documentation, FAQ, etc with a specific product or company focus. Cool!
@Hypersniper05
1 year ago
Or closed-domain semantic search with summarization
@handsanitizer2457
1 year ago
@E Marrero can you explain that a bit more? I'm new to the machine learning space.
@Hypersniper05
1 year ago
@handsanitizer2457 It's a bit too much to explain here, but search YouTube for "openai embeddings" or "embedding searches" and you will get a general idea of how models can be used for search, not only OpenAI's but other open-source models as well. Fine-tuning a model on a closed domain will help it understand your company's data better. You can also fine-tune it to reply in a certain way, which opens the door to many options. ChatGPT was trained this way, but more on conversational outputs.
@ArjunKrishnaUserProfile
1 year ago
Does chatbase use this technique? It does the training on website or file data very fast.
@Hypersniper05
1 year ago
@ArjunKrishnaUserProfile I am pretty sure it doesn't train the model; that would be way more expensive than embedding.
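The embedding search mentioned in this thread can be sketched in a few lines. This is only a toy illustration under loud assumptions: the three-dimensional vectors are made up by hand (a real system would get them from an embedding model), and the helper names `cosine_similarity` and `search` are hypothetical, not from any library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_vec, doc_vecs, top_k=1):
    """Return the names of the top_k documents most similar to the query."""
    ranked = sorted(
        doc_vecs,
        key=lambda name: cosine_similarity(query_vec, doc_vecs[name]),
        reverse=True,
    )
    return ranked[:top_k]

# Hand-made toy vectors (assumption): a real embedding model would produce these.
docs = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.1],
    "warranty terms": [0.2, 0.1, 0.9],
}
query = [0.8, 0.2, 0.1]  # stand-in for the embedding of a refund question

print(search(query, docs))
```

No model weights change here: the documents are only embedded and compared, which is why this route is usually cheaper than fine-tuning.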
This is great. Not many channels on YT do this kind of stuff. I would appreciate more like this: other frameworks like DeepSpeed, useful datasets, training parameter experiments, etc. There is so much interesting stuff that is not covered on YT.
You continue to make videos on exactly the things I'm trying to understand more deeply! Fantastic! There are a lot of detailed parameters in this video that you could certainly continue to elaborate on for those of us who aren't programmers...yet :) Looking forward to more of your vids!
Many have said it but I'll reiterate -- your LLM videos are really great to watch, both the pace and the way you go from high level overviews to the detailed info. I also appreciate that it's not just focused on ChatGPT/GPT-4/hosted-models all the time and talks more about local training/finetuning/inferencing.
Awesome! been waiting for your take on this topic
Sam, thanks for giving your audience their requests! The alpaca training video you made makes much more sense now
Awesome explanation, this is exactly what I was looking for. Thank you!
Thanks for the awesome explanation. Going to binge your videos.
Awesome stuff, Sam. I'm in the process of using LangChain to build a vector store and, whilst it's fine for now, I would be really interested in understanding the best way to then take this and use it to generate a LoRA. Feels like the logical next step.
I would love to see more videos about this, showing people how we could adapt this to our own projects, and maybe even a video about 4-bit tuning.
Thank you! Be great to see more on the data section - everyone always seems to gloss over that part, despite the fact that is clearly the most important part. Seen a lot of (from diff youtubers) 20-40 min vids on the configuration, barely mentioning the actual use of the data?
I would love a vid covering examples of the differently formatted types of datasets that can be used to train a lora and the types of abilities that the different kinds of dataset training will allow - or put another way - what kinds of behavioral changes in abilities can we use lora to fine-tune for in a model, and how do we then know what types of data formatting to use in order to get a chosen outcome. :D
How does LoRA differ from transfer learning? If I understand correctly TL means adding additional layers onto frozen pre-trained network and training it on new dataset, right?
This is really useful, thank you!
Wow, thank you for your work.
Very underrated channel. You deserve more viewers and subs.
@samwitteveenai
11 months ago
Thanks for the kind words.
Hey, sorry I'm late to the party. I tried to load my LoRA model, but when I checked the weights, they were the same as the original model's. Is it supposed to do that? I already checked my model right after training and yes, the weights are different.
Can you create more videos on instruction-prompt-tuning as well, as a further extension to this video? Amazing work!
I’d love a quick video like this on how to use checkpoints from PEFT training to do inference. When I’m training, I’m never sure how much is too much, and I can save checkpoints automatically easily to resume in case training stops. What I need to learn is how to use these checkpoints with the base model to do inference so I can test output quality against several checkpoints. Ideally I’d like to be able to do inference on a base model plus checkpoint, and then once I find a good result, merge the checkpoint into the base model so I can use it in production and keep VRAM low. (I am assuming inference on base model + checkpoint will use more vram)
very useful!! thanks a ton
Great lectures.
So badass. Thanks!
These fine-tuning-related topics are especially relevant to me right now. Currently training llama-30b variants at 4-bit. I’m very interested in how to roll adapters/checkpoints back into base models to keep VRAM usage down during inference (under 24GB)
@MridulSharmaMID
1 year ago
Hi I am also interested. Can we connect via email?
@PavanAtGrowexx
5 months ago
Hey, I am also facing the same issue, did you find any update and could help me out please?
Does anyone know the proper settings for generation with the story model? Mine tends to start OK and then becomes word spew halfway through.
Hey Sam, thanks for the great, informative video as always! Do you know of a way to see which neurons get activated during training? I am asking because I was thinking of ways to shrink the big models, and the most obvious way I could think of would be to view which neurons get activated during training, especially with falcon 170b. Even 32b is too big for me, and considering I don't need multiple languages, I was hoping this would be a good approach to reduce the size of models. It would be cool to see a Brain Surgeon type debugger for LLMs. It would be good to run different training datasets through different LLMs to see which neurons get activated and which do not, and ideally have a way to disable them during inference to test and measure the differences in the output.
Excellent!
How does LoRA fine-tuning track changes by creating two decomposition matrices? How is ΔW determined?
This is a great way to understand how we can fine-tune a text classification task using an LLM. I want to know if there is a method through which we can make the LLM learn from data in JSON format, where there are multiple labels for information retrieval or conversational recommendation tasks.
How many examples are necessary in the dataset for it to learn a certain pattern? With OpenAI you are fine with just 200 examples, which I don't think would work here.
This is gold! Thank you!
@joshmabry7572
1 year ago
I'm looking to train the Wizard-Vicuna models but run into `ValueError: The following `model_kwargs` are not used by the model: ['token_type_ids']`
@samwitteveenai
1 year ago
This could be because they have already folded a LoRA in there or the base model setup is different.
How do you handle "CUDA out of memory" error in free Colab notebook?
Great quick tutorial. This is good for English-only pretraining/fine-tuning. What about non-English? What steps should we take to (1) extend the vocab, (2) pretrain (with or without LoRA) on a free, unstructured text corpus, and (3) fine-tune with LoRA for each task? I would love to have your tutorial on this road; it would be great. Thanks, Steve.
5:23 is it possible to train this on a Nvidia RTX 4090 FE (24GB RAM)?
Videos on this topic are very rare on YouTube.
@samwitteveenai
1 year ago
this is the first of a few on the topic
Hi Sam, thank you for the video. I'm getting RuntimeError: expected scalar type Half but found Float running in Colab with GPU 0: Tesla V100-SXM2-16GB (UUID: GPU-a971aa0c-5408-727a-3b72-48b1926b5f66) On the training loop, what am i missing?
@ShlomiSchwartz
10 months ago
It was a GPU issue, switching GPUs fixed it
@samwitteveenai
10 months ago
Yeah, I don't think bitsandbytes fully supports V100 GPUs. I have had issues with it in the past.
Hi Sam, thanks for the great video. I have a general question you might know the answer to. If I freeze pre-trained model weights (for example, BERT) and then train a classifier on top of its embeddings, is that called fine-tuning? If the weights are unfrozen, I know this can be called fine-tuning.
@samwitteveenai
1 year ago
You can freeze some of the weights and tune the top layer etc., and yes, it is fine-tuning.
How to re-train it with additional data? Great video!
What is the minimum GPU RAM Memory to run this code? I think I need a new GPU to run this on my local machine
If the only difference is in these added-on weights, is it possible to run multiple distinct finetuned models at the same time without duplicating the shared base pretrained model in memory?
@samwitteveenai
1 year ago
Yes, this is a trick we are working on for production. You have multiple LoRA weights for different tasks etc. Very much beyond the scope of this video, though.
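The idea in this thread, one shared base model with several small LoRA weight sets, can be sketched with toy matrices. Everything below is an illustrative assumption: the 2x2 base matrix, the adapter names `summarize` and `translate`, and the `forward` helper are hypothetical, and the usual `alpha / r` scaling factor is left out to keep the sketch short.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

# One shared, frozen base weight matrix (2x2), stored once in memory.
BASE_W = [[1.0, 0.0],
          [0.0, 1.0]]

# Each task stores only its small low-rank factors B (2x1) and A (1x2).
ADAPTERS = {
    "summarize": {"B": [[1.0], [0.0]], "A": [[0.0, 2.0]]},
    "translate": {"B": [[0.0], [1.0]], "A": [[3.0, 0.0]]},
}

def forward(x, task=None):
    """Compute y = W x, plus B (A x) when a task adapter is selected."""
    y = matvec(BASE_W, x)
    if task is not None:
        adapter = ADAPTERS[task]
        delta = matvec(adapter["B"], matvec(adapter["A"], x))
        y = [yi + di for yi, di in zip(y, delta)]
    return y

x = [1.0, 1.0]
print(forward(x))               # base model only
print(forward(x, "summarize"))  # same base, first adapter
print(forward(x, "translate"))  # same base, second adapter
```

The base matrix is never copied; switching tasks only changes which pair of small factors is applied on top of it.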
very great tutorial! with the saved pretrained model, how do we make prediction for classification problems?
@samwitteveenai
11 months ago
You can do that with a much simpler model like BERT or a T5, or structure the data to do it with the causal LM.
LoRA is not adding additional weights. Although it might seem so while training, at inference there are no additional parameters once the update is merged into the base weights. It acts more like diff and patch (though in vector space).
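The "diff and patch" view can be checked on a tiny example. The sketch below assumes the standard LoRA formulation, where the update is B A scaled by alpha / r; the matrices are toy values chosen by hand, and `merge_lora` is a hypothetical helper written for this illustration, not a library function.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def merge_lora(w, b, a, alpha, r):
    """Return W + (alpha / r) * B A, a matrix with the same shape as W."""
    scale = alpha / r
    delta = matmul(b, a)
    return [
        [wij + scale * dij for wij, dij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]

W = [[1.0, 2.0], [3.0, 4.0]]  # frozen base weights (2x2)
B = [[1.0], [2.0]]            # trained factor, shape 2 x r with r = 1
A = [[0.5, -1.0]]             # trained factor, shape r x 2
alpha, r = 2, 1
x = [1.0, 1.0]

# Unmerged: base path plus the scaled adapter path.
adapter_out = matvec(B, matvec(A, x))
unmerged = [y + (alpha / r) * d for y, d in zip(matvec(W, x), adapter_out)]

# Merged: a single matrix of the original shape, so no extra parameters remain.
W_merged = merge_lora(W, B, A, alpha, r)
merged = matvec(W_merged, x)

print(unmerged, merged)
```

Both paths give the same output, and the merged matrix has exactly as many entries as the original one, which is the sense in which nothing is added at inference.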
Thanks a lot~
I have two sample datasets like below: 1) [{ "en": "Hello, how are you today?", "fr": "Bonjour, comment ça va aujourd'hui ?" },...] 2) [ { "text": "Ravi is a young man from India who loves panipuri." },... ] How can I fine-tune on the above datasets using the Falcon LLM? Please help me.
Hey man, great video. I had a question: do you think a 500M or 1B model could give good results similar to Alpaca? What would be the smallest size at which a model can follow instructions?
@samwitteveenai
1 year ago
It's a really interesting question and something I am currently doing research on. 500M is probably too small. At 1.5B things get a bit more interesting. The big challenge with smaller models is that you can't expect them to know facts correctly, so you want to use them more as retrieval generation models. They can do language but need to have the facts and context fed in at generation time.
@theunknown2090
1 year ago
The cerebras-gpt models are really fast compared to gpt2, gpt-neo in inference like a cerebras2.7b inference speed is almost equal to gpt1.5b and gptneo 1.3B
@samwitteveenai I encounter RuntimeError: expected scalar type Half but found Float while running the training script specified in the Colab notebook. Can you please help me with pointers to solve the error? I am running in Colab (GPU 0: Tesla V100-SXM2-16GB).
@samwitteveenai
1 year ago
Ok v100s had some problems with the 8bit part in the past, so it could be that.
@nayakdonkey
1 year ago
@@samwitteveenai Thanks for the acknowledgement
@SubhamKumar-eg1pw
1 year ago
@@nayakdonkey Were you able to solve the above RuntimeError? I am facing the same with V100 machine
Hi, I have a question for you. When/or will you be uploading the video about seq2seq models? I would like to see that one as well!
@samwitteveenai
1 year ago
Yes I promised this and I will get to it. Will try to do it this week. Please remind me if I don't. Too many new LLMs and cool papers being released :D
@haticeobuz9081
1 year ago
@@samwitteveenai Okay, thank you so much.
Love It.
Hello! Can you fine-tune T5?
Can you edit LoRa to LoRA in the title? I was really confused for a second, asking myself what long-range radio has to do with LLMs.
@samwitteveenai
1 year ago
lol done, thanks for pointing it out.
Very informative! Does fine-tuning with QLoRA/LoRA support this kind of dataset? If not, what changes should I make in my output dataset? Review (col1): Nice cell phone, big screen, plenty of storage. Stylus pen works well. Analysis (col2): [{"segment": "Nice cell phone", "Aspect": "Cell phone", "Aspect Category": "Overall satisfaction", "sentiment": "positive"}, {"segment": "big screen", "Aspect": "Screen", "Aspect Category": "Design", "sentiment": "positive"}, {"segment": "plenty of storage", "Aspect": "Storage", "Aspect Category": "Features", "sentiment": "positive"}, {"segment": "Stylus pen works well", "Aspect": "Stylus pen", "Aspect Category": "Features", "sentiment": "positive"}]
Would it be practical to train a small model on a 1660 super 6gb? I just want to add a personality for a home voice assistant
@samwitteveenai
11 months ago
Probably not train it. You might be able to do some inference with that, but train it on something with more VRAM.
What is the difference between LoRA and embeddings?
At 10:19, why did you pass in data['train'] as train_dataset? How is the training process going to know that data['train']['quote'] is the feature and data['train']['prediction'] is the target?
@PavanAtGrowexx
5 months ago
Did you find any solution? I have the same query
@dolby360
15 days ago
I also have the same query
I fine-tuned BART, but the model output was exactly the same as the input IDs. What's possibly wrong?
@RushikeshTade
1 month ago
Did you merge weights?
Can I train any LLM from Hugging Face, like a LLaMA model?
@samwitteveenai
1 year ago
yes with most of them.
Maybe a tutorial on integrating LangChain with Flan, but accessing a REST API to query data.
In LoraConfig(), r is not the number of attention heads; it is the rank of the matrices you are decomposing into, going from high rank to low rank. Here the rank is 16.
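To make the role of the rank concrete, here is a small parameter count. It assumes the standard LoRA setup, where a d x k weight update is replaced by two factors of shape d x r and r x k; the layer size 4096 is a hypothetical example, not a value taken from the video, and both function names are made up for this sketch.

```python
def full_update_params(d, k):
    """Parameters needed to train a full d x k weight update."""
    return d * k

def lora_params(d, k, r):
    """Parameters in the two low-rank factors: B is d x r and A is r x k."""
    return r * (d + k)

d = k = 4096  # hypothetical square projection layer (assumption)
r = 16        # the rank mentioned in the comment above

full = full_update_params(d, k)
low_rank = lora_params(d, k, r)

print(full, low_rank, low_rank / full)
```

With these numbers the low-rank factors hold well under 1% of the parameters of a full update, and that fraction grows linearly with r.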
how do you customize your dataset?
@samwitteveenai
10 months ago
I am planning to make some vids on fine tuning LLaMA2 so I will go more into that there. Basically you just want to feed it strings.
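"Feeding it strings" can be sketched as follows. With a causal LM the target is simply the next token of the same string, so the input and the desired output are usually joined into one piece of text. The column name `tags`, the `" ->: "` separator, the sample row, and the `to_training_text` helper are all assumptions made for illustration, not necessarily what the Colab does.

```python
def to_training_text(row, separator=" ->: "):
    """Join the input text and its target into a single training string."""
    return row["quote"] + separator + str(row["tags"])

# Toy row (assumption): a real dataset would have many of these.
rows = [
    {"quote": "Stay curious.", "tags": ["learning"]},
]

texts = [to_training_text(row) for row in rows]
print(texts[0])
```

Each resulting string would then be tokenized as a whole; the model learns to continue the text after the separator with the target.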
Great! Can we merge the peft weights with the actual weights and use it for inference? any downside to it other than the size? Also, wouldn't the weights get tampered if we save them locally instead and use it for inference?
@samwitteveenai
1 year ago
Yes, you can do that. I might show that in a future video. There is no big downside for most use cases. Saving the LoRA weights locally is fine: when you load them, the original weights are loaded as well. Not sure what you mean by tampered.
Awesome video! 12:10 The loss was not going down though, brother... Try to update the video with a training run that converges. This one clearly did not.
Can you do a video on fine-tuning a multimodal LLM (Video-LLaMA, LLaVA, or CLIP) with a custom multimodal dataset containing images and texts for relation extraction or a specific task? Can you do it using an open-source multimodal LLM and multimodal datasets, like Video-LLaMA or others, so anyone can further their experiments with the help of your tutorial? Can you also talk about how we can boost the performance of the fine-tuned model using prompt tuning in the same video?
In the past, I tried fine-tuning some GPT models, but the results weren't good. Maybe this new technique will give me a better outcome
@samwitteveenai
1 year ago
Fine-tuning comes down a lot to what you are tuning on and how much etc. LoRA has a lot of advantages and is certainly worth a try.
👍
Why is it called causal?
Hey Sam is there a chance I can reach to out to you personally?
@samwitteveenai
1 year ago
just reach on Linkedin is easiest
bitsandbytes seems to have lots of issues in terms of compatibility with various CUDA versions and outright doesn't support windows directly
@samwitteveenai
11 months ago
Yes they don't support the older GPUs that well either
Hi Sam, can you provide more videos on fine-tuning, especially with the Mistral-Orca model? I like your videos very much. Thanks for sharing them.
@samwitteveenai
6 months ago
Yeah I have been meaning to do this for a while. Next week will do some new ones.
@hosseinaboutalebi9998
6 months ago
@@samwitteveenai Thanks so much Sam.
Great video! I was actually hitting an error while trying to fine-tune the Dolly 2.0 model: RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the forward function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes 2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple checkpoint functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases yet. This was fixed by commenting out: model.gradient_checkpointing_enable(). Do you know why that might be the issue?
@samwitteveenai
10 months ago
That video is quite old now, I think they have updated the library. I will try to take a look at it at some point. I am currently making some new Fine tuning vids so they should be out within a week.
Finally something real...
This should be a seq2seq model, because you are tagging (classifying) text. Actually a Sequence to Tag (Sequence Classification)
LORA is a low power wireless data transmission...