Local AI Voice Cloning with Tortoise TTS - 2024 Installation (Check LATEST update in description)
Science & Technology
Links referenced in the video:
LATEST Update - • Updated AI Voice Cloni...
Github Repo - github.com/JarodMica/ai-voice...
Curate Dataset - • How to Make the PERFEC...
Training Better Models - • A Tip on Training Bett...
Timestamps:
Demo - 0:07
Installation - 0:40
Starting and Using - 2:27
Add Voices/Zero Shot Voice Cloning - 6:05
Training a Voice Model - 9:04
Generate Config - 13:33
Run training - 15:32
Using Trained Model - 17:12
Hardware for my PC:
Graphics Card - amzn.to/3pcREux
CPU - amzn.to/43O66Ir
Cooler - amzn.to/3p98TwX
RAM - amzn.to/3NBAsIq
SSD Storage - amzn.to/42NgMFR
Power Supply (PSU) - amzn.to/430bIhy
PC Case - amzn.to/447499T
Motherboard - amzn.to/3CziMXI
Prebuilt alternatives to my PC:
Corsair Vengeance i7400 - amzn.to/3p64r22
MSI MPG Velox - amzn.to/42MnJHl
Cheapest recommended PC:
Cyberpower 3060 - amzn.to/3XjtZoP
Come join The Learning Journey!
Discord - / discord
Github - github.com/JarodMica
TikTok - / jarodsjourney
If you found anything helpful, please consider supporting me and the content I am trying to produce!
www.buymeacoffee.com/jarodsjo...
Comments: 458
We're all very lucky to have someone dedicated to not only teaching us how to use these awesome technologies, but making it as simple and up to date as possible. Keep up the great work, we don't deserve you 🙌
@Jarods_Journey
6 months ago
Thank you thank you 🙏🙏! Really much appreciate it and you're too kind 🥹
@PlaystationEu
6 months ago
@@Jarods_Journey Thanks a lot for your work, it's really awesome 😊
@pc_boy5371
6 months ago
I agree with you 100%, love the channel
@brianlink391
5 months ago
Speak for yourself - I totally deserve him! 😉
@SirRubyRed
3 months ago
Is it not possible to download pretrained voices?
Thank you so much for always making up to date and accessible guides for everyone!
I managed to find the original repo a little while back. Glad you're keeping it alive... thanks for this!
This made me so happy! I liked and subscribed!
After getting into AI and programs like Stable Diffusion over the last year, I had to learn some code with all that's required to get them to run properly. However, since I'm not a programmer, what ended up happening is I created more issues for myself, which took way too much time to google and fix my mistakes. Yes, I've learned a ton, but I've pulled nearly all of my hair out in the process. So, thank you for making this a code-free install. Saves me time and more hair-pulling. Again, thank you Jarod... your efforts are appreciated.
@Jarods_Journey
6 months ago
Appreciate it! I know there are a lot of folks that are interested in AI but all of the code revolving around it and dependency managing... Is a hell scape. So, glad that my code free install can help others out there and it also makes sure the tutorial stays the same throughout time :)!
@2mShortFormCC
5 months ago
GPT can code if you know what to ask for
@33rdframe
A month ago
i am the 24th person to REALLY feel this message, lol. i never wanted to learn python 😂
This is an amazing video. Not only has it gotten me started with voice cloning, it is an excellent summary of quick and dirty model training.
Thanks a lot, have been struggling with dependencies, and have been following a few of your videos :)
Thanks a lot ! I was trying stuff with Conda but all didn't work out as I expected. So followed your video, and with the own custom voices. It all works perfectly. Thanks :)
I'm so thankful for you making this video, and for the community who makes these tools. I really want to change my videos from silent-type videos to more entertainment-type videos, but my main problem is my voice. I was born with a bad voice, so I really need something like this for the voice in my videos.
We have quantized LLMs and Turbo SDXL and LCM models. I think it's time for a turbo/quantized TTS in 2024. Thank you as always for your tutorials and updates.
You're awesome! I love your videos. They make sense and are easy to follow.
thank you so much for putting this all together, I'm making an audiobook and this helps a lot !!!
The way ai tts companies charging people is ridiculous. I am glad there are people like you. Thank you.
@compositeur8455
6 months ago
You need an Nvidia GPU to run this crap, so it's not much better
@1ajayc
7 hours ago
@@compositeur8455 Most people have one already - it's the most popular GPU
Thanks so much for making this! it's awesome keep it up!
Thanks, mate! This is such an awesome job!
OMG you rock!!! Thank you so much for putting this package together for us. It works amazing!!!! Thank you again!
Fantastic video! I greatly appreciate the hard work and dedication you put into what you do on this channel. You've helped me out immensely.
Thanks for the work! And the tutorial! I have subscribed to your channel! Hope you are well and have a good start to the new year!
@Jarods_Journey
6 months ago
Thanks and you as well!
👌👌 Thanks, I have been trying to install Tortoise TTS since your first video about it, but I always got errors when installing packages. This time it was so easy and it actually worked. 😊😊😊😊😊😊
Thank you! Amazingly explained!
it worked!! thanks man!!
Genius! Great tutorial, thanks :)
@Jarods_Journey
5 months ago
Appreciate it :)!
super awesome, I tried doing that recently and gave up. Really good idea including all the dependencies so the process becomes 1. Download 2. Extract 3. Run like everything else people download 😊
@Jarods_Journey
6 months ago
Thanks! The key is using the python embeddable packages, though there are a lot of steps to getting a package up and running correctly😅
@black_dragon274
5 months ago
@@Jarods_Journey Why isn't there a GUI interface for this? Does it have to be through a terminal or browser? It's so primitive!
Great package. I will install and explore it. Thank you for sharing your valuable knowledge and experience. Big Like.
nice one ❤ Thanks!
Massive respect to you, my dude. Really needed this.
Will try it out; had issues with the cloning part a while back, so we will see. Thanks!
Thank you for making videos on rvc and tortoise tts , i hope that one click pipeline comes soon
Just noticed that you sync your voice with the video
Hey Jarod! Just wanted to drop a quick note of appreciation for your content on AI. Your journey into the world of artificial intelligence is both fascinating and informative. Thanks for making complex topics so engaging and easy to understand. Keep rocking those AI insights! 🚀 By the way, any chance trainig Spanish LATAM voices in the future? That would be fantastic! How would it work? Muchas muchas gracias! Abrazo de Argentina!
that's freaking awesome!!
Worth the sub, thanks a lot
LMAO this program now became the Stable Diffusion of voice generation, I admit that it won't take that long for this to improve . Thanks for the fork looking forward for the documentation.
Thank you!
My cloned voice came out sounding horrible. I used the same audio clips that I've used with RVC, which sounds really good. I used all the same settings and did it like you said, though for some reason my long clip was broken up into 0-to-4-second clips. I made sure all my settings matched what you used. The original audio clip was 54 minutes long. Took over a day to train. Edit: the loss_mel graph (the green line) was almost at zero at the end of training. I trained it for 500 epochs.
I wish all the github stuff (I'm a newbie/non-programmer) was this simple. Lol. Thank you!
@Jarods_Journey
6 months ago
And that's why I wanna try and make it as hands-off as possible :)! The learning curve sucks in the beginning, but it does get easier the more you learn GitHub!
😁Saved me hours. Keep working!
@Jarods_Journey
4 months ago
Thank you, appreciate it!
thank you so much
Thank you for this tutorial. I will ensure you will be greatly rewarded for it. --Tutankhamun
Thanks man
Yo! Any reason why my vocals end up sounding super robotic? I'm using custom vocals, but idk why they sound filtered and very bad. Any assistance would be greatly appreciated!
Bro, I got this working real quick. It is amazing. I copied and pasted voices from a different tortoise-tts and it sounds great! Thanks for sharing!
@Samuel-wl4fw
6 months ago
Where do you find some available voices? I tried to look but couldn't find any
@leighenhenkelman8648
6 months ago
@@Samuel-wl4fw I'm looking for voices too!
Awesome video! I'm a bit stuck, though. I have about a 45-minute clip of a character talking, and I've gone and processed it with UVR-5 and the audio-splitter project you linked, so I have a ton of smaller voice-line WAV files. But when I try to train the model on them for ~200 epochs, the results I get from using the model are awful! It's like around 50% of the words spoken by the generated audio are just noise, or the AI struggling very hard to speak a word. Any tips for getting clearer audio? Like, should I put my 45-minute video into the voice folder instead of the multiple clips?
Thank you so much for this release. Finally something that anyone can install and understand without problems! Btw are there any sort of pre-trained datasets or sound file databases available anywhere on the internet that you know of? (popular video game characters etc)?
@Jarods_Journey
6 months ago
Np! As for datasets, I'm not sure, but I'm pretty sure the audio exists somewhere out there on the web!
I did everything you showed in the video; in the end, my generated speech has an American accent, but my audio is in Italian. :D I spent so much time and training.
@prizegotti
5 months ago
It's not trained for Italian. Just American English and Japanese.
@cuccurese
5 months ago
@@prizegotti Thanks!!!!
5:12 As far as "Samples" are concerned, I noted that "sample_batch_size" is implicitly set to 16 in the code. You can see it in the console when generating. Having "Samples" set to 16 means there is one batch to process. If you set Samples=100, then 6 full batches will be processed, plus 4 samples in a 7th batch. The time needed is nearly proportional to the number of batches. That said, it is not "exponential". The iterations behave close to a square root: quadrupling "Iterations" would approximately double the processing time. A batch of samples will be placed in VRAM, and depending on the length of a text chunk, it could push your GPU to its VRAM limit. Setting "Samples" lower than 16 will free VRAM but potentially lower the quality, since fewer samples will be used. Do not feed it overly long sentences; use "Line delimiter" to separate your sentences during processing. You should also avoid the GPU spilling into "Shared GPU memory" (my RTX 3090 can do this), because by opting for PC RAM the processing becomes even slower (slow data swapping).
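The batch arithmetic the comment above describes can be sketched in a few lines (the default `sample_batch_size` of 16 is the commenter's observation, not something I've verified in the code):

```python
def batch_schedule(samples, sample_batch_size=16):
    """Return (full_batches, remainder, total_batches) for a given Samples setting.

    Generation time scales roughly with total_batches, per the comment above.
    """
    full, remainder = divmod(samples, sample_batch_size)
    total = full + (1 if remainder else 0)
    return full, remainder, total

# Samples=100 with the assumed default batch size of 16:
# 6 full batches + 4 samples in a 7th batch.
print(batch_schedule(100))
```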
Do you have any idea why custom trained models don't work using hifigan, which produces the error 'tuple' object has no attribute 'device'
Wait, so do you need the training part? The voices don't sound bad without training; it's not much of a difference.
Your videos are Awesome Jarod! You do such a good job explaining how to install and setup these repositories (even going the extra mile to fork them yourself to make them easier to work with)! Is it possible to fuse two voices together, or is it viable to train a model by combining two datasets from two different speakers?
@Jarods_Journey
5 months ago
Appreciate it! For tortoise, I believe if you train on two voices, you get a mix or average between the two as this does occur when you use two different files as reference audio files. I actually haven't yet tried this for training so this may be a useful experiment to try.
@syrcon
5 months ago
@@Jarods_Journey I'll have to try it out as well. I assumed that it would have negatively impacted the training of the model, but if it instead blends the two, then that would be really interesting.
After updating 7-Zip, I was at least able to unpack it, but when running the bat file, the command window just shuts down after "loading autoregressive model". Any ideas?
Is there an ideal script for voice training? That is, is there an ideal series of things to have the speaker saying to get the best results for new speech from the voice model?
Hey! I appreciate your work. I followed along and seemed to be successful all the way up to the point of generating a configuration. It created my dataset with a lot of files in my folder, but then it just won't let me select my dataset. I am not savvy with coding, so I don't know what the cause could be, since the files are put into training > vinny > audio. Help would be appreciated.
Spent a few hours learning from your videos on Tortoise TTS, then tried to make my own model. I decided to go with the Cloaker from Payday 3; the result was a stupidly high-pitched voice that sounded terrible. I trained it on 10 clips, 5 seconds each, with epochs set to 200. Would you say I should use more samples and more epochs in training? Also, should samples of the character yelling and speaking softly be trained together, or made into separate models?
Any suggestions how to make it clone a voice that had certain effects applied to it? Namely I mean Mr. House from Fallout New Vegas. I have the voice files from the game, but they have a slight "speaking through speaker" effect applied to them(Which is kinda important to keep too...), and the results are pretty bad, sounding nothing like they should and/or turning into completely another voice from one sentence to another. Should I try making entire model with them instead? If so what would be recommended settings?
great video - did you ever entertain to integrate this with Twilio for creating phone gpt agents?
Thank you, thank you man... finally I have this thing running! Question: does DeepSpeed make a difference in quality as well?
@Jarods_Journey
6 months ago
DeepSpeed does not, as far as I've observed, since it's just parallelizing the autoregressive model's processing to make it faster. At least that's my understanding of it :)!
I love this. But unlike other applications, why is this AI voice cloning messed up with large files?
13:01 - Should I click "Slice Segments" before the "Process and Transcribe" button if my dataset is 20 minutes long in a single .wav file?
Followed the steps, got this error: Something went wrong: 'tuple' object has no attribute 'squeeze'
Thank you for your useful content :). I have a question: is it possible to use models trained in RVC in Tortoise TTS?
@Jarods_Journey
6 months ago
Unfortunately not, they're different architectures so it won't work
Hi, thanks for the content. Does it work on Ubuntu, or only Windows? Facing a few issues while running on Ubuntu 22.
Legend
I have a question: is it possible to use it as an API to automate answers given by ChatGPT, and to automatically process and play the output audio?
My model came out sounding nothing like it was trained on. I had 2300 super clean chopped samples for a character and realized my 3080 would take forever. I trained on 250 samples over 3 hours. The output was 7 models, from 60_gpt to 402_gpt. I tried them all and the voice is simply pitched too high and sounded nothing like the source files. I followed your instructions to the T. Any suggestions?
so we can essentially take elevenlabs generated voices, use them as samples, and clone them?
Do you know if there's a Tortoise fork which runs on Ubuntu or macOS? Thanks
Nice video, super easy to understand how to install Tortoise TTS. I have a question: how can I access the web GUI from another computer on the same network?
Guys, if you are getting an error about the "voice" folder, put your voice samples in WAV format only and it will be resolved.
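As the comment above suggests, non-WAV files in a voices folder can trip things up. A minimal sketch (the folder path is yours to fill in) that flags files Python's stdlib cannot open as WAV:

```python
import pathlib
import wave

def find_non_wav(folder):
    """Return names of files in `folder` that the stdlib wave module cannot open.

    A quick sanity check before training; it only validates the WAV container,
    not sample rate or channel count.
    """
    bad = []
    for f in pathlib.Path(folder).iterdir():
        if not f.is_file():
            continue
        try:
            with wave.open(str(f)) as w:
                w.getframerate()  # force the header to be parsed
        except (wave.Error, EOFError):
            bad.append(f.name)
    return sorted(bad)
```

Running it over your voices folder before hitting train saves a confusing error later.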
I am in love with this install method! Your tutorials a year ago were usable but kinda hard to follow, this method however is f'ing perfect :) Is there a way to control tortoise through command line so I can run it with a batch file? What is the best way to run it for stable outputs at the expense of perfection?
How does 'Updated AI Voice Cloning with RVC Inference' differ from this? Is RVC a separate install procedure, or included?
Thanks so much for your tutorials! What does the temperature setting do?
@Jarods_Journey
5 months ago
Temperature is kind of like randomness. Higher means possibly more random and unstable; lower is more deterministic and stable.
@Soljarag5
5 months ago
@@Jarods_Journey thanks man
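The randomness-vs-stability tradeoff Jarod describes comes from temperature-scaling the model's output distribution before sampling. A minimal sketch of the idea (not Tortoise's actual code):

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by temperature before softmax.

    Low temperature sharpens the distribution (near-deterministic picks);
    high temperature flattens it (more random, less stable output).
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
cold = softmax_with_temperature(logits, 0.1)   # almost always picks token 0
hot = softmax_with_temperature(logits, 10.0)   # close to uniform
```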
Hi, thanks for the video. Do you know if there is a way to retrain the model with new data?
Congratulations on the work, I've been following you for a few months now. I would like to know how I can create a model in other languages and make voice cloning at least acceptable.
This is amazing. I can see there's an api option, do you have any guides on how to use it programmatically? Say for automation?
I already trained a voice model. So I only have a pth and an index file. How can I use those on TTS?
@corbinangelo3359
2 months ago
I'm very curious about that too, If you figured out a way. please let me know.
Is there anywhere to gather voice training audio files? Or anywhere to get already trained voices?
gettin ur 2024 video in early I see
@Jarods_Journey
6 months ago
If I put 2023 on it, it'd be outdated a month later 😂
Is there somewhere we can import pre-trained models that have been downloaded elsewhere? I was trying to get it to work, but kept winding up with crashes, so I'm guessing I've got something wrong.
I was really excited to try this because I have not been able to get deepspeed running on my machine, period. However when I run this I get the error about mismatched latents but it adds "The specified pointer resides on host memory and is not registered with any CUDA device" and recalculating latents doesn't make it go away, every time you generate it's back. I guess I'm stuck with old original slow Tortoise.
@Jarods_Journey
6 months ago
What's your Nvidia GPU? This error occurs on my machine if you don't wait for TTS to finish loading or you didn't re(load) TTS in the settings. It's specific to only when you have deepspeed enabled.
Please do a series just on how to install and use each of these TTS models. I'm not a programmer and I'm having a really hard time. I think you would get a lot of views from those video tutorials.
I got a CUDA out-of-memory error... can I fix it? I have an RTX 3050 with 4 GB VRAM.
When I have hifigan enabled, I'm getting very "warbly" under-water sounding audio. I checked with both the random voice and one that I sampled, same effect. If I turn hifigan off, it sounds normal. Any ideas? I've been using this with your audiobook maker (thanks for all your work on these, by the way!) and was really hoping for the speed boost.
@Jarods_Journey
6 months ago
Hifigan with some voices is MUCH lower fidelity for that speed gain. Some voices/trained models do better I've observed, but that's the tradeoff unfortunately
Thanks, all worked great. But then I suddenly started running into a problem: Tortoise TTS simply doesn't load the Whisper large (or higher) model, even though everything worked perfectly before; it just freezes with a connection error. In configuration I get "Batch size exceeds validation dataset size, clamping validation batch size to 0", and when I finally press train, I get "UserWarning: Passing `gradient_checkpointing` to a config initialization is deprecated and will be removed in v5 Transformers. Using `model.gradient_checkpointing_enable()` instead, or if you are using the `Trainer` API, pass `gradient_checkpointing=True` in your `TrainingArguments`." and it freezes. Sometimes this error appears: "DLL load failed while importing _iterative: the paging file is too small for this operation to complete." Or: "CUDA out of memory. Tried to allocate 12.00 MiB. GPU 0 has a total capacity of 15.99 GiB of which 13.53 GiB is free. Of the allocated memory 1.18 GiB is allocated by PyTorch, and 3.90 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF." With RVC it's the same thing, basically: it complains about not having enough memory even though everything worked fine two weeks ago, forcing me to reduce the dataset or batch size. UPD: I fixed it. It was all because the paging file size was too small (512 MB). I set it to an automatically managed size and that fixed it. If anyone encounters the same thing, just increase your paging file size or set it to automatic.
When I let it read about 14 lines of text, it always switches to random voice for a few lines. How can I stop this and keep it on the voice I picked?
How can I change the voice to mel, the one used in the demo?
I followed your data curation video and ended up with loads of short recordings; however, in this video you use just a single large recording. When I add all my short recordings to the voices folder and train, it does not work and does not create the dataset in the training folder.
When I open the start.bat file, this error appears: "SystemError: initialization of _internal failed without raising an exception". How can I handle this?
I haven't been able to find an answer to this, so I'm hoping you can help. What's going on when RVC training spits out a "nan"? More specifically, will it cause problems? My training output will look like: loss_disc=4.060, loss_gen=2.968 Then 15 epochs later I'll get a: loss_disc=nan, loss_gen=nan If I stop and restart training, it'll resume from the last checkpoint and start displaying normal numbers again. Anything you know about this would be appreciated, thanks! :D
@Jarods_Journey
6 months ago
Mmph, NaN is an undefined number. I'm not sure what causes it, but I've seen people report this occurring in logs. If you can still train successfully without problems, then you should be fine.
@weightlossmotivation4070
6 months ago
If you are trying to finetune the model and using the weights from the previous training instead of the base D and G pth, sometimes the generators die. So maybe stick with the base weights if you have changed them. Also you might have not trained them on enough steps (talking about the finetuned weights).
Is this able to fully utilize a multi-GPU system, or does it only utilize one card?
"I was not in the mood" hahahah so identified.
Hi Jarod, can I train my voice in Polish language as well or is it only for English?
How do I turn off the .wav saying "I AM VERY HAPPY" when using emotion?
Hello, perhaps I'm just unlucky, but I get an error at the start of the installation: OSError: [WinError 126] The specified module could not be found. Error loading "C:\AI\ai-voice-cloning\venv\Lib\site-packages\torch\lib\torch_python.dll" or one of its dependencies.
How do you create a voice model? For other languages? Great video
New question: I never get good quality. I already use the same audio file that I use for recording, free from music etc., pure vocal. But I'm getting a robot-like sound no matter what, with diffusion or hifigan. I already tried using "High Quality" too. I train for 500 epochs and try each result (every 100 epochs); none are good. I already followed the tutorial on splitting the audio file for the dataset. Are there any missing steps? Thanks. Also, what is "Voice chunk" when we want to generate a voice? Thank you so much... [nevermind this all] After I transcribe & process, all is done. In Generate Configuration, when I click "Validate Training Configuration", it says "Empty dataset". But I already checked the training folder, and my audio and srt folders all exist. Why is that? Thank you. I checked the code, and it checks whether the "train.txt" file is empty. What should be inside the train.txt file? Hi! There's another problem here: both the TTS and Whisper models run at once... My GPU can only hold one of them. So I can't transcribe & process while the TTS server is running (not even running the process, just starting up). What manual code do I need to run to transcribe all of the files? I mean, where's the source code located so I can run it manually without starting the TTS server? [/nevermind this]
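On the "Empty dataset" / train.txt question above: fine-tuning pipelines like this commonly use an LJSpeech-style manifest, one `audio_path|transcript` pair per line. That format is an assumption based on common setups, not verified against this repo, but a sketch of parsing and validating such a file looks like:

```python
import pathlib

def load_dataset_manifest(path):
    """Parse an assumed LJSpeech-style manifest: 'audio_path|transcript' per line.

    Raises ValueError for an empty manifest, mirroring the "Empty dataset"
    check the commenter found in the code.
    """
    lines = [ln.strip()
             for ln in pathlib.Path(path).read_text(encoding="utf-8").splitlines()
             if ln.strip()]
    if not lines:
        raise ValueError("Empty dataset: train.txt has no entries")
    pairs = []
    for ln in lines:
        audio, _, text = ln.partition("|")
        pairs.append((audio, text))
    return pairs
```

If validation fails, opening train.txt and checking it against this shape is a reasonable first debugging step.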
Has anyone found solution for missing VRAM while trying to train?
@pb2806
5 months ago
Tick 'Do not load TTS on Start' in settings. Works for me
Also, I click on start and now nothing pops up, just a System32 cmd.exe window with nothing in it.
I just hope there's a description of each function when I hover over the buttons; I don't know what to click.
Hey, thanks for this awesome video. Question - how is the autoregressive model tuned without the VQ-VAE? Since CLVP and CVVP operate on the VQ codes produced by the autoregressive output, wouldn't this harm selection of the samples generated by the autoregressor? I understand that the downstream diffusion model (and presumably the hifigan) operate on the final latents produced by the autoregressive model (and not the codes), so in theory this could be used to tune the autoregressive model weights, but wouldn't it result in poor sample selection performance -- since the autoregressive mel code head can't be trained without the VQ-VAE? Also, just curious - why choose to train the autoregressive model without training the diffusion model (possibly in tandem)? Has any experimenting been done in this area?
@Jarods_Journey
4 months ago
We do have the VQVAE; it's the dvae.pth model inside the models folder. I'll give you the 2 blog posts about this: 152334h.github.io/blog/tortoise-fine-tuned/ and 152334h.github.io/blog/tortoise-fine-tuning/, which are better explanations than I can give at the moment. As for training the diffusion model, I don't yet have a strong enough understanding of what fine-tuning would do for it, but as far as my understanding goes with the AR model, we are training in new representations for the tokens in its vocabulary so that it can output appropriate mel tokens for whatever dataset you use.
@michaelmezher9635
4 ай бұрын
Wow! Wish I'd known the VQVAE was available before! I'd think tuning the diffusion model may be useful for voices dramatically different from what's found in LibriTTS, since theoretically the space of what can be represented in the diffused mels is limited to those voice characteristics. This is especially true because the diffusion model is trained (fine-tuned after autoregressive model convergence) on the autoregressive latents, not the mel codes.
11:56 - Which model is better in the Generate tab: base, WhisperX, or something else? You need to explain what gives the most accurate cloning, not only what is fastest to train.