Fine-tune Mixtral 8x7B (MoE) on Custom Data - Step-by-Step Guide

Science & Technology

In this tutorial, we walk step by step through fine-tuning the Mixtral MoE model from Mistral AI on your own dataset.
LINKS:
Colab (free T4 will not work): tinyurl.com/2hfk2fru
Mistral 7B fine-tune video: • Mistral: Easiest Way t...
‪@AI-Makerspace‬
Want to Follow:
🦾 Discord: / discord
▶️️ Subscribe: www.youtube.com/@engineerprom...
Want to Support:
☕ Buy me a Coffee: ko-fi.com/promptengineering
|🔴 Support my work on Patreon: / promptengineering
Need Help?
📧 Business Contact: engineerprompt@gmail.com
💼Consulting: calendly.com/engineerprompt/c...
Join this channel to get access to perks:
/ @engineerprompt
Timestamps:
[00:00] Introduction
[00:57] Prerequisites and Tools
[01:52] Understanding the Dataset
[03:35] Data Formatting and Preparation
[06:16] Loading the Base Model
[09:55] Setting Up the Training Configuration
[13:22] Fine-Tuning the Model
[16:28] Evaluating the Model Performance
All Interesting Videos:
Everything LangChain: • LangChain
Everything LLM: • Large Language Models
Everything Midjourney: • MidJourney Tutorials
AI Image Generation: • AI Image Generation Tu...

Comments: 70

  • @MikewasG · 7 months ago

    Thank you for sharing, this is very helpful! Looking forward to the next videos!

  • @dev_navdeep · 5 months ago

Kudos, really simple and direct explanation.

  • @AbhishekShivkumar-ti6ru · 6 months ago

    very nicely explained!

  • @WelcomeToMyLife888 · 7 months ago

    Awesome content as usual! Thanks!

  • @engineerprompt · 7 months ago

    Thank you 😊

  • @ahmedmechergui8680 · 7 months ago

Thanks for the video 😃 I just have a question: is it possible to use the model through an API and also provide the source files for the data with the response?

  • @jprobichaud · 7 months ago

    🎯 Key Takeaways for quick navigation:
    00:00 🚀 *Introduction to Fine-Tuning the Mixtral 8x7B Model* - Overview of the video's purpose: fine-tuning the Mixtral 8x7B model from Mistral AI on a custom dataset. Mention of the popularity and potential of Mixtral 8x7B as a mixture-of-experts model. Emphasis on practical considerations for fine-tuning, such as VRAM requirements and dataset details.
    01:28 🛠️ *Installing Required Packages and Dataset Overview* - Installation of the necessary packages: Transformers, TRL, Accelerate, PyTorch, and bitsandbytes. Discussion of using the MosaicML instruct-v3 dataset for fine-tuning. Overview of the dataset structure, splits, and sources.
    03:45 📝 *Formatting Data for Fine-Tuning Mixtral 8x7B* - Explanation of the prompt template for fine-tuning, specific to the Mixtral 8x7B Instruct version. Discussion of rearranging the data to make the task more challenging by creating instructions from the provided text. Demonstration of a function that reformats the initial data into the desired prompt template.
    06:28 🧩 *Loading the Base Model and Configuring for Fine-Tuning* - Acknowledgment of the source of the notebook and clarification that the base version is used. Setting configurations and loading the model and tokenizer, along with using Flash Attention. Explanation of the importance of setting up configurations for a smooth fine-tuning process.
    08:18 🔄 *Checking Base Model Responses Before Fine-Tuning* - Use of a function to check responses from the base model before any fine-tuning. Illustration of the base model's behavior when generating responses to a given prompt. Recognition that the base model tends to follow next-word prediction rather than explicit instructions.
    10:06 📏 *Determining Max Sequence Length for Fine-Tuning* - Explanation of the importance of max sequence length in fine-tuning Mixtral 8x7B. Presentation of a code snippet to analyze the distribution of sequence lengths in the dataset. Emphasis on selecting a max sequence length that covers the majority of examples.
    12:20 🧠 *Adding Adapters with LoRA for Fine-Tuning* - Overview of the Mixtral 8x7B architecture, focusing on the linear layers used for adding adapters. Introduction to the LoRA configuration for attaching adapters to specific layers. Demonstration of setting hyperparameters and using the TRL package for supervised fine-tuning.
    14:36 🚥 *Setting Up the Trainer and Initiating Fine-Tuning* - Verification of multiple GPUs for parallelization during model training. Definition of the output directory and selection of training epochs or steps. Importance of configuring the trainer, including considerations for max sequence length.
    16:50 📈 *Analyzing Fine-Tuning Results and Storing the Model* - Presentation of training and validation loss graphs, indicating a gradual decrease. Acknowledgment that longer training may be needed for better model performance. Demonstration of storing the fine-tuned model weights locally and pushing them to a Hugging Face repository.
    17:46 🔄 *Testing Fine-Tuned Model Responses* - Use of the fine-tuned model to generate responses to a given prompt. Comparison of responses before and after fine-tuning, showcasing improved adherence to instructions. Acknowledgment that further training could enhance the model's performance.
    Made with HARPA AI
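
    A rough sketch of the model-loading step summarized above, assuming the Hugging Face transformers and bitsandbytes APIs (the flash-attention argument name varies across transformers versions, and the notebook's exact settings may differ):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit quantization so the 8x7B weights fit in roughly 45GB of VRAM
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1",            # the base (non-instruct) model
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation="flash_attention_2",  # requires the flash-attn package
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-v0.1")
tokenizer.pad_token = tokenizer.eos_token     # Mixtral ships without a pad token
```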

  • @user-hc5os4fs5k · 7 months ago

    Can you also make a video on fine-tuning multimodal models like LLaVA and CogVLM?

  • @kaio0777 · 7 months ago

    Can you make this for home computer use, in terms of my personal data, and teach it to use tools on your system and online?

  • @HarmeetSingh-ry6fm · 6 months ago

    Great video. Just one question: can we use the fine-tuned model as a pickle file?

  • @AI-Makerspace · 6 months ago

    Thanks for the tag @Prompt Engineering! What else is your audience requesting the most these days? Would love to find ways to create some value for them together!

  • @engineerprompt · 6 months ago

    Thanks for the amazing work you guys are doing! Really appreciate it. I think deployment is a topic that will be really valuable to my audience. Let's explore how to collaborate.

  • @AI-Makerspace · 6 months ago

    @@engineerprompt Absolutely! We started delving deeper into deployment with LangServe and vLLM events in recent weeks. We'll connect to figure out next steps!

  • @sysadmin9396 · 7 months ago

    Can I use this to train a model to answer questions from a list of PDFs?

  • @joaops4165 · 7 months ago

    Could you make a tutorial teaching how to convert a model to ggml format?

  • @rishabhkumar4443 · 7 months ago

    How can I use a generative model to manipulate the content of my website? E.g. showing a response from my site based on a prompt given by the user.

  • @user-nl4ry3wb1x · 2 months ago

    3:37 format · 4:15 follow a different format · 4:26 indicate the end of user input · 4:33 special token to indicate the end of the model response · 4:39 you need to provide your data in this format · 5:08 def create_prompt · 5:31 system message · 6:16 load our base model
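
    A minimal sketch of the create_prompt step those timestamps point at, assuming the Mixtral instruct template ([INST] ... [/INST] closed by an end-of-sequence token) and the MosaicML field names "prompt"/"response"; the notebook's exact system message may differ:

```python
def create_prompt(sample):
    """Wrap one dataset row in the Mixtral-instruct chat template."""
    eos = "</s>"  # special token marking the end of the model response
    system = "Write a response that appropriately completes the request."
    # [INST] ... [/INST] delimits the user turn; the text after it is the
    # response the model should learn to produce, closed with </s>.
    return (
        f"<s>[INST] {system}\n\n{sample['prompt']} [/INST] "
        f"{sample['response']}{eos}"
    )
```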

  • @alexxx4434 · 7 months ago

    Thanks for the guide! How do you continue the fine-tuning process in a case like this? Can you load previous work (LoRA) and carry on, or do you need to restart?

  • @engineerprompt · 7 months ago

    I think you can do that by storing different checkpoints.
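
    For instance, with the Hugging Face Trainer API this could look like the sketch below (the checkpoint path, trainer, and base_model names are hypothetical; it assumes checkpoints were saved to the output directory during the first run):

```python
# Resume training from the latest checkpoint in args.output_dir,
# restoring model, optimizer, and scheduler state.
trainer.train(resume_from_checkpoint=True)

# Alternatively, reload saved LoRA adapters onto the base model
# and continue training them.
from peft import PeftModel

model = PeftModel.from_pretrained(
    base_model,
    "mixtral-finetune/checkpoint-500",  # hypothetical checkpoint path
    is_trainable=True,                  # keep the adapter weights trainable
)
```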

  • @caiyu538 · 6 months ago

    Great

  • @garyhutson6270 · 5 months ago

    What were your VM instance specs? It is struggling even with an A100.

  • @researchforumonline · 6 months ago

    Thanks, what is the cost to do this? Server cost?

  • @Akshatgiri · 5 months ago

    I've noticed that Mixtral 8x7b-instruct ( and other mistral models ) constantly repeat part of the system prompt. Have you noticed this / found a fix for it?

  • @lukeskywalker7029 · 6 months ago

    I'm skeptical that this is actually effectively training the Mixtral MoE model and not making it worse!

  • @Tiberiu255 · 7 months ago

    why are you using packing in the SFTTrainer if you just said that you're going to pad the examples?

  • @big_sock_bully3461 · 5 months ago

    Can you explain?

  • @shinygoomy2460 · 5 months ago

    How do you format a prompt that has multiple requests and responses within the same context?

  • @Ai-Marshal · 5 months ago

    That's a great video, thanks for sharing. After pushing the model to Hugging Face, how do I host it independently on RunPod using vLLM? When I try to do that, it gives me an error. I've searched a lot of videos and articles, but to no avail so far.

  • @FunkyByteAcademy · 4 months ago

    did you come right?

  • @DistortedV12 · 7 months ago

    Awesome man, any idea how to get this running on a Colab GPU, or how to bring the inference cost down?

  • @engineerprompt · 7 months ago

    Probably no way at the moment to run it on the Colab GPU, but you can look at the 2-bit quantized version. If you are running this model as part of a production pipeline, I would suggest looking at API providers such as Together AI. They have really good pricing on it.

  • @user-ed2wf6wr5g · 3 months ago

    So with two 3090s this should work? And what about using multiple different GPUs for training? Like, I have one 3090 Ti 24GB and one 4060 8GB.

  • @user-cu3dr6pt7s · 4 months ago

    Could you please share the requirements.txt? I am having version conflicts despite using an A100 GPU!

  • @abdeldjalilmouaz · 4 months ago

    Does this require Colab Pro to work?

  • @user-rm8hx5ih4q · 6 months ago

    At 5:58, why is sample["response"] given as the input and sample["prompt"] given as the response?

  • @DistortedV12 · 7 months ago

    Are you fine-tuning the Mixtral instruct version they just released, or the base model?

  • @engineerprompt · 7 months ago

    In this video, just the base version

  • @VerdonTrigance · 5 months ago

    Hi, thanks for this step-by-step guide, but in case we want the LLM to learn something new about our domain (let's say the book Lord of the Rings) and we later want to ask our model open questions about this book (like 'where does Frodo get his sword?'), what should we do? We definitely cannot prepare the dataset in the form of Q&A, so it should be self-supervised training. But I never saw examples of doing this and I can't imagine how it is supposed to be done. Is it even possible? Looks like we should start from the base model, fine-tune it somehow with our book, and later apply instruct fine-tuning on top of it, right? But in that case someone still has to prepare the Q&A? I'm frustrated.

  • @xXCookieXx98 · 4 months ago

    Your use case sounds like a classic RAG one. It's not necessary to fine-tune for that. Although a fine-tuned model + RAG would probably produce even better results, the effort here doesn't seem worth it. The video "Building Corrective RAG from scratch with open-source, local LLMs" from LangChain (kzread.info/dash/bejne/d2anytOsidrek84.html) might help you; it also includes a web search option in case the provided context isn't sufficient, which should work pretty well with things like popular books. So it's not limited to that and can be used in basically any domain. But you could also just build a RAG app without that. I would suggest a combination of a MultiQueryRetriever and a ParentDocumentRetriever for retrieving your context. Nevertheless, if you still want to fine-tune: from what I have learned so far, it is possible to create datasets using LLMs: e.g. you prompt an instruct LLM to create questions based on context chunks and then use those questions and chunks to create answers. You will find similar methods on this channel, e.g. "automate dataset creation for Llama-2 with GPT-4".
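
    A rough sketch of that LLM-assisted dataset creation idea, assuming the openai Python client (v1 API) and a hypothetical list of text chunks; any instruct model would do:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_llm(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# chunks: hypothetical list of context passages split from your documents
dataset = []
for chunk in chunks:
    question = ask_llm(
        f"Write one question that can be answered only from this text:\n\n{chunk}"
    )
    answer = ask_llm(
        f"Answer the question using only this text.\n\n"
        f"Text:\n{chunk}\n\nQuestion: {question}"
    )
    dataset.append({"prompt": question, "response": answer})
```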

  • @lostInSocialMedia. · 7 months ago

    Can you fine-tune uncensored models of this with Gemini Pro AI?

  • @PotatoMagnet · 7 months ago

    The base model of Mistral is uncensored, but you can't fine-tune one model with another model. They are different architectures; you can't even merge or fine-tune between the same models at different parameter counts, like between 7B and 13B, so forget completely different models.

  • @divyagarh · 2 months ago

    Great video! Could you please consider training and deploying it in SageMaker?

  • @engineerprompt · 2 months ago

    I am going to create a video on deployment soon

  • @kanshkansh6504 · 4 months ago

    ❤👍🏼

  • @LeoAr37 · 7 months ago

    Can't we train the quantized version on a smaller GPU instead of training the full model?

  • @engineerprompt · 7 months ago

    Even training the quantized version of the full model will need a powerful GPU. That's why LoRA is used to add extra layers that are trained instead of the actual model. Hope this helps.
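
    A minimal sketch of that LoRA setup with the peft library (the target module list is an assumption; the notebook may attach adapters to different linear layers):

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model)  # make the 4-bit model trainable

lora_config = LoraConfig(
    r=16,               # rank of the low-rank update matrices
    lora_alpha=32,      # scaling factor applied to the adapter output
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    # attach adapters to the attention projection layers
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all weights
```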

  • @electricskies1707 · 7 months ago

    Can you clarify: 1 epoch would be one run over the full data (34,333 steps of your trimmed data). Why would you run this for 2 epochs? Does going over the data twice improve it? Also, how did you determine that 32 was a good batch size for this data size? (This is about 0.9% of the data?)

  • @LeoAr37 · 7 months ago

    I think the companies that trained big LLMs usually used 2-3 epochs

  • @engineerprompt · 7 months ago

    Batch size determines how much data is fed to your model at once. 32 is the max I could do on the available hardware; usually you will see it much lower. In regards to the epochs, you are right: in one epoch, the model sees each example once. If you have a small amount of data, you might want to go over multiple epochs so the model can actually learn from the data, but you need to be careful because the model can also overfit. For large amounts of data (billions or trillions of tokens), it's very expensive and time-consuming to run several epochs over the data; that's why you mostly see models trained for only one or two epochs. Hope this helps.
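
    A sketch of how those choices might appear in the TRL trainer setup the video describes (hyperparameters are illustrative rather than the notebook's exact values; model, tokenizer, and lora_config come from the earlier sketches, and train_data/eval_data are assumed dataset splits):

```python
from transformers import TrainingArguments
from trl import SFTTrainer

# Assumes train_data / eval_data have been mapped to a single "text" column,
# e.g. train_data = raw_train.map(lambda s: {"text": create_prompt(s)})

args = TrainingArguments(
    output_dir="mixtral-finetune",   # hypothetical output directory
    per_device_train_batch_size=32,  # max that fit on the available hardware
    num_train_epochs=2,              # small dataset, so more than 1 pass can help
    learning_rate=2e-4,
    logging_steps=10,
)

trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=train_data,
    eval_dataset=eval_data,
    peft_config=lora_config,
    tokenizer=tokenizer,
    dataset_text_field="text",       # column holding the formatted prompts
    max_seq_length=1024,             # chosen to cover most examples
)
trainer.train()
```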

  • @pallavggupta · 7 months ago

    Hi, I am trying to build an organisation-level AI trained on my company data. I would like to know how I can create a dataset from my data to be trained on Mistral AI. I was unable to find any tutorial on how to create a dataset for large data.

  • @conscious_yogi · 6 months ago

    Did you find a solution for this?

  • @nishhaaann · 6 months ago

    Looking for the same thing @@conscious_yogi

  • @user-yd3zk4hb1o · 7 months ago

    So we can't run this in a Colab or Kaggle notebook?

  • @ilianos · 7 months ago

    In the video description it says no (not on a T4).

  • @luciolrv · 7 months ago

    I could not run it on the A100 in Colab. It complains of a lack of memory, though not by much: actually less than 1GB. Colab's "copilot" gives some suggestions, such as reducing the batch size or the max_split_size_mb parameter, but that does not reduce it enough. Any ideas? Good notebook.

  • @jonjino · 7 months ago

    @@luciolrv It complains of less than 1GB of memory, but that's because it's loading the model a bit at a time, so the error message isn't accurate. Kaggle doesn't offer better GPUs either. You'll need to set up a VM with an A100 80GB or an H100. Unfortunately, you'll probably just have to go through the hassle of setting up a VM with one of those GPUs via GCP or AWS.

  • @protimaranipaul7107 · 5 months ago

    Thank you for sharing such a wonderful video! Waiting for a video that discusses fine-tuning so that we can use more than 32k tokens. Have you, or anyone, worked with the following? 0) How do we measure performance after fine-tuning? Did they perform well? Perplexity? 1) JSON files? Creating graphs to store the context? 2) And/or large CSV/SQL files? (Llama's SQL code is not working well.) 3) Any image/diffusion models? Appreciate it!

  • @AIEntusiast_ · 5 months ago

    I wish someone would make a video going from collecting data (e.g. PDFs) to converting it into a working dataset that can be used to train a model. Everyone is using Hugging Face models and just retraining another LLM.

  • @user-ig2og2yq3b · 5 months ago

    Please let me know how to create fixed forms with the structure below, using a special command to the LLM: Give me a score out of 4 (based on the TOEFL rubric) without any explanation, just display the score. General Description: Topic Development: Language Use: Delivery: Overall Score: Identify the number of grammatical and vocabulary errors, providing a sentence-by-sentence breakdown. 'Sentence 1: Errors: Grammar: Vocabulary: Recommend effective academic vocabulary and grammar:' 'Sentence 2: Errors: Grammar: Vocabulary: Recommend effective academic vocabulary and grammar:' .......

  • @scortexfire · 4 months ago

    How do I fine-tune without prompts and instructions? I basically want the model to "know" about a thousand very recent web articles.

  • @engineerprompt · 4 months ago

    In this case, you probably want to further pretrain the base model on your dataset (you don't need the prompt & instruction format) and then fine-tune it on an instruct dataset. Or just use RAG.
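
    A sketch of that continued-pretraining step, assuming TRL's SFTTrainer with packing over raw text (articles is a hypothetical list of strings; model and tokenizer as in the earlier sketches):

```python
from datasets import Dataset
from trl import SFTTrainer

# articles: hypothetical list of raw article strings,
# with no prompt/response structure at all
raw = Dataset.from_dict({"text": articles})

trainer = SFTTrainer(
    model=model,                # the base (non-instruct) model
    train_dataset=raw,
    tokenizer=tokenizer,
    dataset_text_field="text",  # plain next-token prediction over the articles
    packing=True,               # concatenate articles into fixed-length blocks
    max_seq_length=1024,
)
trainer.train()
```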

  • @tomski2671 · 7 months ago

    I think you can rent an H100 for $5/hour. So this would cost about $7

  • @hemeleh8683 · 7 months ago

    where?

  • @kunalr_ai · 7 months ago

    Where are you going to get 64 GB of VRAM from? No idea which dataset this was fine-tuned on, bro. This video is of no use. You've made your money from the views; how are we supposed to make ours?

  • @bashafaris5908 · 7 months ago

    🥹‼️ I am a student with no budget at all, but interested in training any of the LLMs with my own dataset. What are the cost-effective ways?

  • @jonjino · 7 months ago

    Get a 3B parameter model and play around with that. This can probably fit on the free T4 GPU in Google Colab since it's much smaller.

  • @matbeedotcom · 7 months ago

    How much VRAM is necessary?

  • @engineerprompt · 7 months ago

    About 45GB

  • @matbeedotcom · 6 months ago

    @@engineerprompt Do you suggest fine-tuning the base model first, and then fine-tuning further with Q&A instruct-format data?
