The Secrets Behind Voice Cloning & AI Covers
Science & Technology
To try everything Brilliant has to offer, free for a full 30 days, visit brilliant.org/bycloud . The first 200 of you will get 20% off Brilliant’s annual premium subscription!
Have you ever wondered how AI covers are made? How are presidents playing Overwatch together? In this video you'll find out all the details about AI-generated voices, AI voice cloning, or voice deepfakes, which are literally everywhere on the internet right now. From memes to AI covers, AI voice synthesis is in the spotlight, yet most people don't know what it is or how it is done. In this video, I'll cover the basics of how AI voice works and how people are using this technology to make the things you have seen.
Special thanks:
- Synthetic Voices
- JustinJohn
- and my editor Askejm
Online Services
[Uberduck] uberduck.ai/
[Fakeyou] fakeyou.com/
[ElevenLabs] elevenlabs.io/
Local UIs
[Tacotron2] github.com/BenAAndrew/Voice-C...
[Tacotron2 Tutorial] • Voice Cloning App
[Ultimate Voice Remover 5] github.com/Anjok07/ultimatevo...
[TorToiSe] git.ecker.tech/mrq/ai-voice-c...
[TorToiSe Tutorial] • Local Voice Cloning fo...
[so-vits-svc 4.0] github.com/voicepaw/so-vits-s...
[so-vits-svc 4.0 Tutorial] • Super Fast Voice To Vo...
[so-vits-svc 5.0 (NEW)] github.com/PlayVoice/so-vits-...
[RVC] github.com/RVC-Project/Retrie...
[RVC Tutorial] • AI Voice Cloning for S...
This video is supported by the kind Patrons & KZread Members:
🙏Andrew Lescelius, alex j, Chris LeDoux, Alex Maurice, Miguilim, Deagan, FiFaŁ, Tony Jimenez, Panther Modern, Jake Disco, Demilson Quintao, Shuhong Chen, Hongbo Men, happi nyuu nyaa, Carol Lo, Mose Sakashita, Miguel, Bandera, Gennaro Schiano, gunwoo, Ravid Freedman, Mert Seftali, Mrityunjay, Richárd Nagyfi, Timo Steiner, Henrik G Sundt, projectAnthony, Brigham Hall, Kyle Hudson, Kalila, Jef Come, Jvari Williams, Tien Tien, BIll Mangrum, owned, Janne Kytölä
[Discord] / discord
[Twitter] / bycloudai
[Patreon] / bycloud
[Music] massobeats - lotus
[Profile & Banner Art] / pygm7
[Video Editor] @askejm
0:00 Intro
2:06 Text-to-Speech AI backbones
3:38 Vocoder AI backbones
4:54 Voice2Voice AI backbones
7:51 TalkNET
8:10 Online services
10:42 Local UIs
11:46 Ultimate Combo?!
13:14 TorToiSe + RVC vs ElevenLabs Pro voice
15:29 Sponsor & Outro
Comments: 171
To plug the sponsor: try everything Brilliant has to offer free for a full 30 days, visit brilliant.org/bycloud . The first 200 of you will get 20% off Brilliant’s annual premium subscription! P.S. Nothing in this video is voiced by a real person. All the voices are fake (except for 12:32 lol) The first 1 min (0:00~0:58) is generated using voice2voice with my real voice as the reference. 0:58~12:47 is generated with the combo which I mentioned in 11:46. From 11:46 till the end is all ElevenLabs Pro Voice Cloning.
@bycloudAI
11 months ago
@@thelegendguyofficial dw the music and the content is not HAHAHA and will probably not be anytime soon here's the music yt link kzread.info/dash/bejne/YoatyaRmncTWo84.html this person makes banger lofi, go support them
@NevelWong
11 months ago
@@bycloudAI So.... if it's ai generated, it cannot be copyrighted, right? So if I use this copyright-free voice to train a model of, and I then use that model to narrate my own videos, that would be legal, right? I am equal parts concerned and titillated.
@jamessharpe2630
11 months ago
@@NevelWong voices in general can't be copyrighted. If it was a slogan (arrangement of sounds) or roar/yell then yeah, copyrightable.
@Mark_Rober
11 months ago
I was thinking to myself every so often 'his voice sounds a bit fake' but I swear it was just because this video was about cloning AI voices and if you had done anything else, like make a minecraft video for example, I wouldn't even have imagined it being AI.
@Deagan
11 months ago
based.
I didnt realize this was AI narrated until you said it was... I just assumed the scuff in the audio was due to using a worse mic like from a laptop or some screw up when editing, it sounded off but not AI off. As much as I believe AI is the future, we are clearly going to be in for a very very rough ride from here on out. You'll basically only be able to trust that something was real if you saw it in person, no audio, no pictures, and no video will be trustworthy.
@gh0stpyram1d
11 months ago
fr i had a whole mental picture of how this admin looked and i realize that was a mental picture of a robot lmaoooo
@asdfssdfghgdfy5940
11 months ago
Nah there are relatively simple ways of digitally signing things to prove you said them or filmed them etc. It will become a problem for the masses for sure especially if people keep believing whatever they see on Facebook. It will be easy enough for the more tech savvy peeps, or people who are required to vet things (e.g. Reporters) to work out if they are real or not. Or at least if they have been signed or not.
@quazar-omega
11 months ago
Then the Matrix credits roll in inside your eyes
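The signing idea raised in the thread above can be sketched roughly as follows. This is only an illustration, not how any platform actually does it: it uses Python's standard `hmac` module with a shared secret as a stand-in for a real public-key signature (which would let anyone verify without holding the secret), and the key and clip bytes are made up.

```python
import hashlib
import hmac


def sign_media(data: bytes, key: bytes) -> str:
    """Return a hex HMAC-SHA256 tag computed over the raw media bytes."""
    return hmac.new(key, data, hashlib.sha256).hexdigest()


def verify_media(data: bytes, key: bytes, tag: str) -> bool:
    """Recompute the tag and compare in constant time.

    Any change to the bytes (or a wrong key) makes this return False.
    """
    expected = sign_media(data, key)
    return hmac.compare_digest(expected, tag)


# Hypothetical values, for illustration only.
key = b"hypothetical-shared-secret"
clip = b"fake-audio-bytes-standing-in-for-a-recording"
tag = sign_media(clip, key)

original_ok = verify_media(clip, key, tag)
tampered_ok = verify_media(clip + b"x", key, tag)
```

A verifier only learns whether the bytes match what was signed; as the reply says, that tells you if something has been signed or not, not whether an unsigned clip is fake.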
WTF I thought that was your voice. I guess generative AI these days is something else.
@albertsitoe7340
11 months ago
I struggle to understand how society will even function in the next 50 to 100 years
@David.Alberg
11 months ago
@@albertsitoe7340 Bro all the experts struggle if the society will function in 3-5 years 😂
@Kynatosh
11 months ago
I heard artifacts so I had doubts
Everything you always wanted to know about speech synthesis* (*but you've never found). Thanks mate for this masterclass ! ❤
Nice information dump, good job on collecting all this info. To be honest, this tech is good enough that I wouldn't be surprised if any of your previous videos were voiced by AI too. As a random youtube viewer I have no idea if cartoon cloud's voice is a real person or totally generated anyway.
Your videos are the best, seriously! Not only do you keep us in the loop about all the cool AI stuff, but you also manage to make it super entertaining. Big thumbs up, man! :D
First time viewer here. When this video showed up in my feed, that click-baity title almost made me skip it, but this is definitely the best video about different options for TTS and voice cloning I've seen yet. Well done. I'll definitely stick around and see what other videos you've made.
"most boring" bit you mention is actually the most useful info in this video, links to websites and what theyre for
Thank you I ve been searching for this so long
That Asmongold cameo lol
You knocked this one out of the park. A+ video.
this is so cool i wanted to do this for so long! thank you!
Thanks so much for this - this is a great place to start for AI voice generation on local machines. I'm eager to experiment on mine
Fantastic overview. Much thanks, bycloud
I missed your videos man, good work, keep it up
The fact that you have to let us know that was not an actual real discord call with asmongold, as if the intelligence in the choice of words did not give it away already
@Askejm
11 months ago
TRUE
@shadowrealms2676
10 months ago
@@Askejm BIG W!
Best video about AI voice cloning I've found so far on the internet. I'm saving it to revisit later when I have more powerful hardware to run the Tortoise and RVC combo. In the meantime I think Eleven Labs will suit my needs. Thanks for all the great info. Subscribed.
Wew lad, one of the best vids that I watched in months. God-tier quality!
BRUH made whole video with this, EPIC!
6:47 nope. It was sovits. They used my weeknd model. Sovits is pretty good at raw studio quality vocals assuming the dataset is good. Which my weeknd model isnt it lol
At last I managed that! Thank You ByCloud !
Beautiful video! Really helpful :)
Might be good to mention you can run Whisper locally to transcribe audio. The large-v2 model is better than whatever KZread uses, even if slow.
@Askejm
11 months ago
Well it's included by default in MRQ's tortoise ui and i think RVC uses it too
So if I get it right:
1) record voice
2) use whisper to get a transcription (+ some fixes to the text)
3) use a text-to-voice model that is similar to our voice
4) use voice-to-voice (that model needs to be trained on our own voice)
---
- Training of the voice model happens once.
- We do all of that to make our dialog smoother, but we still do the voice-over against the video to get the correct speed and length (not the case when the video is created after the voice).
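The four steps in the comment above can be sketched as a small pipeline. This is a structural sketch only: `VoicePipeline` and its stage names are hypothetical, and the stand-in lambdas below do not call Whisper, TorToiSe, or RVC; in practice each stage would wrap whichever tool is used for that step.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoicePipeline:
    """Hypothetical wiring of the steps; each stage is a plain callable."""

    transcribe: Callable[[bytes], str]        # step 2: speech -> text (e.g. Whisper)
    fix_text: Callable[[str], str]            # step 2: manual or scripted text fixes
    text_to_speech: Callable[[str], bytes]    # step 3: TTS with a similar voice
    voice_to_voice: Callable[[bytes], bytes]  # step 4: V2V model trained once on your voice

    def run(self, recording: bytes) -> bytes:
        """Run a recording (step 1) through steps 2-4 in order."""
        text = self.fix_text(self.transcribe(recording))
        draft_audio = self.text_to_speech(text)
        return self.voice_to_voice(draft_audio)


# Stand-in stages so the sketch runs without any models installed.
pipeline = VoicePipeline(
    transcribe=lambda audio: audio.decode("utf-8"),
    fix_text=str.strip,
    text_to_speech=lambda text: text.encode("utf-8"),
    voice_to_voice=lambda audio: b"[cloned]" + audio,
)
result = pipeline.run(b"  hello world  ")
```

Keeping the stages as separate callables mirrors the point that the voice-to-voice model is trained once and then reused for every new script.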
super interesting, as an AI Product Owner i find your videos invaluable to quickly catching up with all tech at once.
Wow, that was dense - awesome!
amazing video, if this isn't a 1/10 confetti video just know it deserves to be
@bycloudAI
11 months ago
its a 10/10 bottom feeder lol rip
@pikaa-si9ie
11 months ago
@@bycloudAI I'll give you a like to try to push the algorithm 👍😁😁
I was actually fooled too and didn't realize it wasn't his voice until he pointed it out. Any imperfection you hear could be confused with his accent anyway and his monotone voice also helps so it makes it extra hard to spot
@dudedude-su7pt
11 months ago
There are thousands of channels like this lol. Most people don't know what voice is robotic or real
wtf this is the first time AI actually fooled me
@ojsef39
11 months ago
i was eating while watching and only noticed it because of the muffle and the red line in my peripheral vision hahaha
@ojsef39
11 months ago
oh damn, i wasn’t at the part where he revealed it yet. im shocked hahah
@handle__
11 months ago
@@ojsef39 same. When I first saw the comments, before I had reached that part, I thought people meant the red line parts, but then mind blown🤯😮
@wham7125
8 months ago
Definitely not the first time, but you wouldn't know that of course.
I had absolutely no idea that your voice was completely ai generated... WHAT?!?!?!
@quinnherden
4 months ago
Definitely not. Just that one section :)
At 0:09 I realized that was AI model of your voice. It's hilarious to listen to AI talking about how great voice deepfake is 😂
@Askejm
11 months ago
well thats funny because the first minute is his real voice
@ShepoPL
11 months ago
@@Askejm You're wrong my guy. Listen carefully when he talks with high pitch and compare it with his other videos where he talks this way. You will hear the slight difference
@Askejm
11 months ago
@@ShepoPL no, he did narrate it normally. the artifacts are probably because we added V2V for it to be consistent with the rest of the video. as this was done with RVC v1, it led to some artifacting despite a ground truth input
@quinnherden
4 months ago
@@Askejm He mentions at the end that this is AI
Goated Ai channel
Can you please tell how did you train TorToise TTS in your voice. I saw the repo but it is not mentioned how to fine-tune it on your voice
Hey! your videos are very professional and well edited! You deserve this like and comment.
I was watching this video at 2x speed and got giga fooled by your ai voice, I really couldn't tell this wasn't you.
Great video!
Love your videos!
Just waiting for Live V2V to become viable in the open source space. Would be insane for tabletop RPGs and VA for solo projects. Live RVC is kinda working, but not very well.
@4.0.4
11 months ago
VA for solo projects doesn't need to be live, why trade quality for speed in that case?
@Kisai_Yuki
11 months ago
It already is. You can use the RVC software to create an ONNX and then take the ONNX to MMVCServerSIO. It will work with very little tweaking. The problem is that RVC is more of an auto-tune. It will not change someone's gender, accent or age. It can only create a voice filter. And what is being passed off as "AI singing cover" is really just laundering someone else's singing through this pitch tuning. So taking one singer and using it to sing a different singer, tuned ON that singer, isn't actually a cover, at least not by what the term "cover" means. But it is useful for creating a character voice. So if one were so inclined, a D&D campaign could be made very interesting by using the RVC to train voices (eg a deeper voice for barbarian troll, and a higher pitch voice for a dwarf or halfling) and the GM could create unique NPCs for characters without having to strain their voice.
RVC retains the core trained voice while sounding smooth. so-vits-svc removes most of the trained voice's personality, bases the result more on the voice in the source audio, and, weirdly enough, makes the voice sound flat. Even for talking, RVC has the better strengths. Though it suffers from sharp note transitions like C2 to C5, which can cause issues.
@stephantual
7 months ago
Exactly. And don't get me started about accents ;) My 'charming' french accent is the bane of these tools.
this video is gold
Bro You Are Amazing.
WOAH i didnt notice it was AI and I work with audio constantly. trippy!
Thanks!!
That "listening to right now" hit me like a freight train. Came to the comments and happy to see everyone else is having a similar reaction.
Do we currently have any TTS pipeline with good enough quality for non-english languages?
@Askejm
11 months ago
your best bet is probably 11labs multilingual, which still only supports a handful of languages
another great video. keep it up brother! QUESTION: I want to wait until fall to buy a new gfx card, because AMD is gonna enable shader conversion (basically allowing high end consumer cards to use CUDA coded AI tools). I really struggle learning new things with my 6gb 1660 Super, but I also don't want to support Nvidia's incredible greed and market manipulation. Would you recommend I wait and support AMD, or what route would you go? I want to go full audio synth setup and I'm already using Stable Diffusion 1.5
are we limited to voice cloning only? is there any voice generator that creates a new voice, e.g. by changing parameters or combining two voices into one new voice?
great video. really love all of this AI content (keywords for youtube ;P )
Lol, on my smartphone i cant even tell a difference between your Real voice and fake ones!
@krishp1104
11 months ago
At the end he says ALL audio in this video is AI generated
@BHBalast
11 months ago
@@krishp1104 NOT all, there was a Little fragment. :)
@krishp1104
11 months ago
@@BHBalast no literally all audio in the video is AI generated
@BHBalast
11 months ago
@@krishp1104 I Dont get it, in his comment he says one fragment is not.
How do I use a cloned voice to read aloud a pdf file?
Can you please make a tutorial on how to do this its very confusing
Watching at 2x completely smooths out any bumps that rvc has. The cadence sounds off after pointing out that it is AI.
Unironically still the best primer on the topic, 5 months on! (which is prehistory in AI) - 🤠
Great vid
I really like this type of video from you! The ai news was great, but as a layman it was too scattered
Ah that “crappy” free KZread course Harvard let us have 😂 I actually took the Java CS50 class there and it was very good… I like that they record them so you can watch later!
What about BARK? But I guess it's not so good. Also, what option would be the best in terms of inference speed?
I can't wait for asmin to react to this
Can you have this narrate your weekly AI news videos? I loved that series, and I really would watch them all the same with this voice, I didn't notice until you exposed yourself.
Is RVC still better now that so-vits-svc 5.0 is out?
the biggest problem with TTS is that you need to make a transcription file for all your audio files. tacotron needs 1-3 hrs of transcribed audio, and that can take a very long time to do. RVC and SVC don't need transcripts, so it's much easier to make training data.
@Askejm
11 months ago
just use whisper
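To make the "transcription file" in this thread concrete, here is a minimal sketch that turns per-clip transcripts (for example, ones produced by Whisper) into one `path|text` line per clip. The `path|text` layout, the `wavs` folder, and the clip names are assumptions for illustration; check what the specific Tacotron2 training repo expects before relying on it.

```python
def build_filelist(transcripts: dict[str, str], wav_dir: str = "wavs") -> list[str]:
    """Build 'path|text' lines from a {clip_name: transcript} mapping.

    Whitespace and newlines inside a transcript are collapsed to single
    spaces, and clips with an empty transcript are skipped.
    """
    lines = []
    for name in sorted(transcripts):
        text = " ".join(transcripts[name].split())
        if text:
            lines.append(f"{wav_dir}/{name}.wav|{text}")
    return lines


# Hypothetical clips and transcripts.
example = {
    "clip_002": "second  line\n",
    "clip_001": " first line ",
    "clip_003": "   ",
}
filelist = build_filelist(example)
```

Writing the lines out is then a single `"\n".join(filelist)`; the slow part the comment describes is getting accurate transcripts, not assembling the file.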
What's up with skipping like a dozen webuis for audio. Not just for this video but many others on the audio AI also just end up showing some barebones default UI and completely miss the projects that are specifically improving the UI and UX.
@quinnherden
4 months ago
Can you suggest some? :)
@FenrirRobu
4 months ago
@@quinnherden I have forgotten a few but there's bark infinity, audio webui, tts webui, then for music there's also audiocraft-webui, Audiocraft plus. RVC has some specific additional UIs, there's also the tortoise RVC pipeline but I'm not sure if it's an UI. I watched the video again and I will say that it's well researched but it focuses on teaching about the technology, rather than showing the best ways to use it. If you want to hardcore go on tortoise, mrq might still be the best (although I think already during this video mrq was migrated to mrq's audio tools or something), RVC's original UI has the most buttons and unexplained options. I'm glad he didn't mention coqui because, at least 6 months ago it was just a closed source tortoise clone.
12:32 My mind blew up.
i need this tts cause i need to make videos that are usually long, and i have to keep moving, which means background noise. earlier i used to record the room and then start recording, but it used to take me over 2 weeks just to create a 5 min audio, and that is too damn long, period. i think i need to do research into all these tools, cause i don't have that much money to invest in what any of the companies are offering
Bark is also very interesting
The tacotron one sounded better than the tortoise one
9:13 they made it so you can make your own
How well does the TorToiSe + RVC combo work with other languages?
@Toliman.
11 months ago
It would be reliant on the RVC training of phoneme and language salience of the native recording. Accents are naturally difficult, i.e. accents and pronunciation are usually not neutral, so if you use a TTS to generate the non-English version, RVC will interpolate the accent and pronunciation based on the native accent it was generated with. So, if you generate an Austrian voice first, then pass it to a Japanese RVC, it will struggle to find matching properties. But, if you use a Japanese speaker to create English phonemes, and the RVC has examples of these equivalent phonemes, it will substitute. The effect is weird, which is why accents are difficult to emulate.
we're witnessing bycloud turning himself to an ai then he's gonna upload himself to a cloud and live forever
15:12 Is nobody absolutely terrified of this? We could get to the point that someone could grab a minute of you talking and be able to use it accurately anywhere for anything.
I played with a lot of these free tools, and the most difficult part (as usual) is installing them, lol
Some of those songs that sound good have a lot of work put into them as well. A lot of post processing as well with other audio tools
What I get from this video is EEC, VTC, CCT, VTC, and HIGHGAN. 😂
This is the best video ive seen on this topic many thanks brother! I sent you a message on twitter but i couldnt DM because im not verified but i would like you to help me create a pipeline.
Thank god, I skipped sleep, to click on this video. Awesome survey
for just tts VITS is one of the best options
Gothic-Bot ❤
Weird, I've always done Eleven Labs + RVC, not Tortoise
@Askejm
11 months ago
well imo 11labs is already good enough quality, its resemblance that it lacks. tortoise solves that, and RVC makes up for the subpar quality
@CassBOTRR
11 months ago
@@Askejm i mean for RVC i just set the index rate up super high and it sounds good enough to be the actual person lol
@Askejm
11 months ago
@@CassBOTRR well one should be a little cautious with just jamming the index rate up. the rvc v2 is a lot more intrusive tho in my experience while also sounding better, but i feel like the resemblance you can get is just lackluster since youre limited to only 1 minute
Wow
Neat
does the voice-to-voice follow the inflections in the original voice? ie if i scream, the generated voice would scream too. even if this is a 10/10 video, it's still good. i've been wanting to know how to clone the voice of a younger version of me, and now i know exactly what my options are (i tried researching before, to no avail). thank you ! :DD
@Askejm
11 months ago
yeah. i made bycloud do the crazy frog with V2V and it worked totally fine. the ding worked but surprisingly all the verbal sound effects were cloned too and it genuinely sounded like him
@homeyworkey
11 months ago
@@Askejm oh 100%, this video is extremely convincing. i do have a suspicion though that it is easy as most of his videos his voice is pretty flat (not an insult btw, its calming and i like it) if he had more variance in what voices he made, ie whispering, yelling, singing, speaking fast, speaking slow, even when you speak louder or quieter, its not as a simple as 'lowering the volume', the actual voice changes with it. this would require alot of data and to identify the 'tone' of the original voice recording, so you can interpret it for the fake voice generation. ik its complex but im just wondering where we are at with that sort of stuff.
@Askejm
11 months ago
@@homeyworkey well i found it to work pretty well. also, rvc uses a pretrained model
is it just me that thought that tacotron sounded a lot better than tortoise ?
@Askejm
11 months ago
I think tortoise sounds better but by far the most noticeable thing is how tacotron has very poor resemblance
Is this just me or this vid had a different thumbnail?
@Askejm
11 months ago
he switches it a lot after release, as he and other YouTubers often do
I'm stupid, I just heard lots of words jumbled together; RVC, VITS, VCS, JBC, RVC, BC?!?
Nice
Bar none the best video on the topic. If your mother tongue is American English, the FLOSS path is the best (use a cloud GPU for speed). But accents are unique to the person (I'm native French and my English is hit and miss on certain words, which currently no AI can learn, no matter how much data I give it). Even in the best case scenario, it's far from 'perfect' and the affect is overall very flat, as we can hear in this video. But it will get better over time, I'm sure.
us techno bros are not into karaoke xD
what about BARK?
@madcatlady
11 months ago
I have the Bark Webui on my PC and it's a crazy lucky dip what you get, some sing and none sound the same as the previous one
11:50
Okay... I zoned out playing Genshin with this playing on my second monitor and I hear Asmongold and go "Wait wtf !?" and I went back and rewatched the whole thing for context and HOLY CRAP I DID NOT DOUBT THAT IT WAS YOUR VOICE THE WHOLE TIME ! Man... what a time to be alive. A tad too early to pilot mechs in space... but just in time for AI Waifus and have food delivered to your door while you watch anime, explore the stars with hyper-realistic games and argue with strangers on the other side of the world about made up problems.
[solved] this is a channel by fireship, but completely run by ai
my god this is too many tools
@rootatnite
11 months ago
too little*
❤
After about 30 seconds I realised it was AI
gj
Bycloud will soon be THE AI news source as this stuff gets more complicated and controversial and eventually will be completely self sufficient and run by its own AI models trained on bycloud AI news videos 😶
Golf clap.
wtf I was actually fooled. I thought this was your real voice
the sponsor is not your real voice no way it is