Fine-tuning Whisper for Speech Transcription
Science & Technology
Get lifetime access to the ADVANCED Transcription Repo:
- trelis.com/advanced-transcrip...
Video Resources:
- Dataset: huggingface.co/datasets/Treli...
- Slides: docs.google.com/presentation/...
- Simple Whisper Transcription Notebook: colab.research.google.com/dri...
- Basic fine-tuning notebook: colab.research.google.com/git...
- PEFT Example: colab.research.google.com/dri...
Other links:
➡️ Trelis Resources and Support: Trelis.com/About
Chapters
0:00 Fine-tuning speech-to-text models
0:17 Video Overview
1:39 How to transcribe YouTube videos with Whisper
7:39 How do transcription models work?
20:08 Fine-tuning Whisper with LoRA
43:32 Performance evaluation of fine-tuned Whisper
48:32 Final Tips
Comments: 45
easy, simple, well organized. Thank you
I can't thank you enough for the quality content you are providing. Please continue to upload videos like this!!
Great video! You are one of the best teachers I have ever heard.
Great explanation. The drum story! Good work.
great, thx! I needed it.
This video really saved my @$$. I had Whisper & Colab running a few months ago, but it broke. Your video and notebooks showed me why, and taught me several new tricks! Keep it up please.
@scifithoughts3611
1 month ago
@Trelis have you considered, instead of fine-tuning, using an LLM to correct the spelling of Whisper output? (Prompt it to fix “my strell” to “mistrell”, etc.)
@scifithoughts3611
1 month ago
Or, as another alternative, prompt Whisper with the context and the correct spelling of its common transcript mistakes?
good job!!
Great video. Can you do one on using WhisperX for diarisation and timestamping?
Thank you! It's a great investment :)
@TrelisResearch
1 month ago
you're welcome
This video was very instructive, thanks! For my case, I need a model that recognizes items on a list consisting mainly of medical vocabulary, so a plain Whisper model doesn't get them. I will record the terms and their pronunciations later, but are they inserted in the "DatasetDict()" part of the code instead of Hugging Face's "common_voice"? Also, how is the trained model saved and used in a new project? Until now I've only used a simple model = whisper.load_model("small") line in my projects.
@TrelisResearch
2 months ago
Your training data will need to be prepared and included in a Hugging Face dataset (like the new dataset I created). To re-use the model, it's easiest to push it to the Hugging Face Hub as I do here, and then you can load it back using the same loading code I used for the base model. Technically I think it's also possible to convert back to the OpenAI format and then load it with a code snippet like yours. See here: github.com/openai/whisper/discussions/830#discussioncomment-4652413
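As a rough sketch of the data-preparation step, the helper below arranges a list of (audio path, transcript) records into the column-oriented train/test layout a Hugging Face dataset expects. It is pure Python and illustrative only: the column names "audio" and "sentence" and the split fraction are assumptions, and in practice you would wrap each split with `datasets.Dataset.from_dict(...)` inside a `DatasetDict` and call `.push_to_hub(...)` as the answer describes.

```python
def build_splits(records, test_frac=0.2):
    """Split a list of {"audio": path, "sentence": text} records into
    column-oriented train/test dicts. Each dict is ready to wrap with
    datasets.Dataset.from_dict(...) before pushing to the Hub."""
    n_test = max(1, int(len(records) * test_frac))  # keep at least 1 test row

    def to_columns(rows):
        # Row-of-dicts -> dict-of-columns, the layout Dataset.from_dict wants.
        return {
            "audio": [r["audio"] for r in rows],
            "sentence": [r["sentence"] for r in rows],
        }

    return {
        "train": to_columns(records[:-n_test]),
        "test": to_columns(records[-n_test:]),
    }
```

With the `datasets` library installed, the follow-up would look roughly like `DatasetDict({k: Dataset.from_dict(v) for k, v in build_splits(records).items()}).push_to_hub("your-username/your-dataset")`, where the repo name is a placeholder.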
Great video! How much data (rows) do we need for training to get acceptable results? Is 5-6 rows enough?
@TrelisResearch
3 months ago
Yes, even 5-6 rows can be enough to add knowledge of a few new words. I only had 6 rows; probably 12 or 18 would have been better here.
Very instructive videos. Next one with Diarization ?
@TrelisResearch
3 months ago
interesting idea, I'll add to my notes
Do you know how to export it to ONNX and correctly use it in deployment? Helpful video!
@TrelisResearch
2 months ago
I haven't dug into that angle for ONNX, but here's the guide for getting back from the Hugging Face format to Whisper's, and you can probably go from there: github.com/openai/whisper/discussions/830#discussioncomment-4652413
Kindly make a video on the following: HiFi-GAN with a transformer; multi-modal (text+image).
@TrelisResearch
3 months ago
thanks, I'll add it to my list. I was already planning on multi-modal at some point; it will take me a bit of time before I get to it.
Recently I faced a situation where I fine-tuned a model on a training set and it returns good results on training-set or validation-set examples, but when I give it an input it has never seen, it tends to produce contextually irrelevant results. Could you suggest what one should do in such a case? One thing we can do is make the training dataset more extensive, but can we do something else besides that?
@TrelisResearch
3 months ago
Create a separate held-out set using data that is not from your training or validation set (it could just be wikitext) and measure the loss on it during training. If that loss is rising quickly, then you are overtraining and need to train for fewer epochs and/or lower the learning rate.
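The "is the held-out loss rising quickly?" check can be sketched as a small helper; this is a hypothetical illustration (not from the video), and the `patience` threshold is an arbitrary assumption:

```python
def overtraining(held_out_losses, patience=2):
    """True if the held-out loss has risen for `patience` consecutive
    evaluations - a simple signal to stop training or lower the LR."""
    if len(held_out_losses) <= patience:
        return False  # not enough evaluations yet to judge
    recent = held_out_losses[-(patience + 1):]
    # Every step in the recent window must be an increase.
    return all(recent[i] < recent[i + 1] for i in range(patience))
```

In practice you would call this after each evaluation pass and stop (or reduce the learning rate) once it returns True; trainer libraries often ship an equivalent early-stopping callback.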
Good job!! But I'm not finding the checkpoint folders.
@TrelisResearch
9 days ago
They'll be generated when you run through the training. Also, you need to set output_dir to somewhere you want the files to be saved.
Do you know if there's a way to downsample the frequencies? E.g. if I have a 24kHz sample I want to downsample to 16kHz, what would be the preferred way of doing this?
@TrelisResearch
3 months ago
Howdy! Actually, you can check this video: there's a part towards the middle where I show how to downsample.
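With Hugging Face `datasets`, the usual route is to cast the audio column, e.g. `dataset.cast_column("audio", Audio(sampling_rate=16000))`, which resamples on access. To illustrate the underlying idea in a standalone way, here is a minimal linear-interpolation resampler (a sketch only; real pipelines use proper filtered resampling from `librosa` or `torchaudio` to avoid aliasing):

```python
def resample_linear(samples, orig_sr, target_sr):
    """Naive resampling by linear interpolation, e.g. 24 kHz -> 16 kHz.
    Illustrative only: no anti-aliasing filter is applied."""
    if orig_sr == target_sr:
        return list(samples)
    new_len = max(1, round(len(samples) * target_sr / orig_sr))
    out = []
    for i in range(new_len):
        pos = i * orig_sr / target_sr          # position in the original signal
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)     # clamp at the last sample
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out
```

For Whisper specifically, 16 kHz is the expected input rate, which is why the dataset's audio column is cast to that rate before feature extraction.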
For a low-resource language, how do you train and add a tokenizer and then fine-tune Whisper?
@TrelisResearch
3 months ago
oooh, yeah, low-resource is going to be tough. The approach probably depends on the language and whether it has close relatives. Ideally you want to start with a tokenizer and fine-tuned model for a close language. If you do need to train a tokenizer, you can check out this guide: huggingface.co/learn/nlp-course/chapter6/2?fw=pt
Question: does the training file have to be in VTT format, or can it be in .txt?
@TrelisResearch
2 months ago
It has to have timestamps, so VTT (or SRT, which you can convert to VTT).
Is it possible to fine-tune for speech translation?
@TrelisResearch
2 months ago
Yes, you just need to format the Q&A for that.
I'm having trouble fine-tuning the large-v3 model. When I'm evaluating, the compute_metrics function doesn't call the tokenizer method properly and it doesn't work. Any idea why?
@TrelisResearch
1 month ago
Hmm, that's odd. I haven't trained the large model myself. I assume you tried posting on the GitHub repo? Any joy there? Feel free to share the link if you create an issue.
Can you compare it with XLS-R?
@TrelisResearch
3 months ago
thanks for the tip, will be a while before I get back to speech but I have noted that as a topic
What can I do when my language is not in the Whisper tokenizer?
@TrelisResearch
2 months ago
Probably imperfect, but maybe you could choose the closest language and then fine-tune from there.
I have transcribed text in .srt format; can I train with it??
@TrelisResearch
2 months ago
Yes! And for this script you can just convert SRT to VTT losslessly using an online tool.
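The SRT-to-VTT conversion is simple enough to do in a few lines rather than via an online tool: WebVTT needs a `WEBVTT` header and uses `.` instead of `,` as the millisecond separator in cue timestamps. A minimal sketch (handles basic SRT files; styling edge cases are out of scope):

```python
import re

# Matches an SRT timestamp like 00:00:01,000 so we can swap ',' for '.'
TIMESTAMP = re.compile(r"(\d{2}:\d{2}:\d{2}),(\d{3})")

def srt_to_vtt(srt_text):
    """Convert SRT captions to WebVTT: prepend the required header and
    switch the millisecond separator in timestamp lines from ',' to '.'.
    Numeric cue indices are left in place (valid as VTT cue identifiers)."""
    body = TIMESTAMP.sub(r"\1.\2", srt_text)
    return "WEBVTT\n\n" + body
```

Usage: read the .srt file, pass its contents through `srt_to_vtt`, and write the result out with a .vtt extension.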
@sumitjana7794
2 months ago
thanks a lot @TrelisResearch
Would a DPO method theoretically work for fine-tuning Whisper more effectively?
@TrelisResearch
3 months ago
Yeah, DPO could be good for general performance improvement. For adding sounds/words, standard fine-tuning (SFT) is probably best.