Diffusion models explained in 4 difficulty levels

In this video, we will take a close look at diffusion models. Diffusion models are used in many domains, but they are most famous for image generation. You might have seen diffusion models at work through DALL-E 2 and Imagen.
Let's look into how diffusion models learn and manage to create high-resolution, realistic images.
Check out the blog post for a more detailed look at diffusion models: www.assemblyai.com/blog/diffu...
Get your free token for the AssemblyAI Speech-to-Text API 👇 www.assemblyai.com/?...
▬▬▬▬▬▬▬▬▬▬▬▬ CONNECT ▬▬▬▬▬▬▬▬▬▬▬▬
🖥️ Website: www.assemblyai.com
🐦 Twitter: / assemblyai
🦾 Discord: / discord
▶️ Subscribe: kzread.info?...
🔥 We're hiring! Check our open roles: www.assemblyai.com/careers
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
#MachineLearning #DeepLearning

Comments: 128

  • @pi5549 (a year ago)

    '3/4/5-levels' looks like a very powerful way of explaining concepts. I'd like to see the higher levels be longer and really drill down into the heart of the matter, so that the final level is communicating at an expert level. +1 / subbed.

  • @cosmingurau (a year ago)

    Sorry, but I don't understand something very important. WHY would you add the noise and then subtract it? Correct me if I'm wrong, but the rightmost noise image in this example is basically an encoded image of the original dog image that can be decoded deterministically with the neural network, in multiple steps. That's nice and dandy. And I do understand that the noise image is not like a RAR archive, which, were it to be slightly modified, would just yield corruption errors; a modified noise image would still generate... an image. NOW. 1. How do you get from the user's text prompt to the noise image of what the user WANTED, which will THEN be denoised (decoded)? 2. Why doesn't every OTHER noise input from the text prompt (except previously deterministically encoded images like this dog image) output just a garbled mess? And yes, I know that is sometimes the case; I use Stable Diffusion daily.

  • @synthoelectro (a year ago)

    Now that's some quantum technology, man... Being one of the beta testers of Stable Diffusion helps me understand this even more.

  • @AssemblyAI (a year ago)

    Awesome!

  • @uquantum (4 months ago)

    Thanks so much for a useful presentation... what a good idea to present in several levels!

  • @yricktube (a year ago)

    The way you describe it is how to get the 'original' picture back. But all the content generated is new (in that combination). I was waiting for the explanation of the step that describes how a new combination (i.e., new content) is generated into existence from the latent space through diffusion, not only the method for getting the starting picture back from the noise...

  • @polyfoxgames9006 (a year ago)

    You pair it with CLIP, which takes in a text string and an image and outputs the distance between them. You denoise while lowering this distance.
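
The guided-denoising idea in the comment above can be sketched with a toy stand-in for CLIP. Real CLIP embeds text and images with neural networks; the `embed`, `clip_distance`, and `distance_gradient` functions below are hypothetical simplifications just to show the "lower the distance while denoising" mechanic:

```python
import numpy as np

# Toy stand-in for CLIP-style guidance: the "embedding" is just the image's
# per-channel mean, and `target` plays the role of the text embedding.
def embed(image):
    return image.mean(axis=(0, 1))              # one value per channel

def clip_distance(image, text_embedding):
    return np.sum((embed(image) - text_embedding) ** 2)

def distance_gradient(image, text_embedding):
    # Analytic gradient of the toy distance w.r.t. each pixel.
    h, w, _ = image.shape
    return 2.0 * (embed(image) - text_embedding) / (h * w)

rng = np.random.default_rng(0)
image = rng.normal(size=(8, 8, 3))              # start from pure noise
target = np.array([0.9, 0.1, 0.1])              # "text" asks for a red image

guidance_scale = 2.0
for _ in range(200):
    # A real sampler would also subtract the U-Net's predicted noise here;
    # this sketch shows only the guidance term that lowers the distance.
    image = image - guidance_scale * distance_gradient(image, target)

print(clip_distance(image, target))             # close to zero after guidance
```

The design point is the same as in real CLIP guidance: the distance is differentiable, so its gradient can steer each denoising step.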

  • @I77AGIC (a year ago)

    The video did explain this, very briefly. It's as simple as creating a random noise image and feeding it into the same model you used during training. It will turn it into a real image exactly the same way it happened during training. You just get rid of the part that turns an image into noise and use the part that turns noise into an image. You don't have to use CLIP or any text at all; that's a whole other ball game.
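
That sampling procedure can be sketched in a few lines. The `predict_noise` function below is a purely hypothetical stand-in for a trained U-Net; the update rule follows the standard DDPM reverse step:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical stand-in for the trained U-Net: the real network predicts
# the noise contained in x_t; this toy version just returns a fraction of x_t.
def predict_noise(x_t, t):
    return 0.1 * x_t

T = 50
betas = np.linspace(1e-4, 0.02, T)     # noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

# Start from pure Gaussian noise -- no source image and no text are needed.
x = rng.normal(size=(8, 8, 3))

for t in reversed(range(T)):
    eps_hat = predict_noise(x, t)
    # DDPM reverse step: remove the predicted noise, rescale, then add a
    # little fresh noise at every step except the last.
    x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
    if t > 0:
        x = x + np.sqrt(betas[t]) * rng.normal(size=x.shape)

print(x.shape)                         # the final x is the generated "image"
```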

  • @kaushiks7303 (a year ago)

    Thank you so much for the elegant explanation.

  • @sotasearcher (6 months ago)

    Such a great video to dive in with! I'm live-streaming learning about diffusion right now!

  • @Kaleubs (4 months ago)

    Thanks for this video, this was very insightful. I still have a lot to learn about this topic that will revolutionize our world so much.

  • @user-wr4yl7tx3w (a year ago)

    This was so helpful. Love this format of starting easier and adding layers of explanation.

  • @AssemblyAI (a year ago)

    Great to hear, thanks!

  • @malikfahadsarwar2281 (8 months ago)

    It would be good if you also explained the reverse process in as much detail as the forward process.

  • @shashankshekharsingh2912 (2 months ago)

    Now, that's a great explanation of diffusion models.

  • @Democracy_Manifest (11 months ago)

    This is an excellent video. Love the format. Well done, more please!

  • @AssemblyAI (11 months ago)

    Thanks, will do!

  • @sinsernadeesoyo (6 months ago)

    This video was awesome! Well done :) and thank you.

  • @hamidzemirline7318 (3 months ago)

    Thanks for this great presentation.

  • @paramino (a year ago)

    This is a very good intro for a quick understanding of the concept 👍

  • @AssemblyAI (a year ago)

    Glad it was helpful!

  • @zhaoyufei9096 (a year ago)

    Really good video! I have checked a few blogs explaining how diffusion models work and still could not understand. But after seeing your video once, I have a better understanding of how diffusion actually works! Really, thanks!

  • @AssemblyAI (a year ago)

    That's great to hear, Zhao! Thank you for watching. :)

  • @AIMLDLNLP-TECH (7 months ago)

    Appreciate your explanation skill. Q: What is a diffusion model? A: Let's say you tell your best friend, Sarah, about this amazing new flavor. Sarah gets excited and tells her friend, Tom. Then Tom tells his cousin, Emily. Emily, in turn, tells her family, and the news keeps spreading from person to person, creating a chain reaction. This process of your ice cream flavor spreading from one person to another is like how a drop of ink spreads in water. At first, it's just a small spot, but then it spreads out and covers more and more area as time goes on. In the diffusion model, experts study how things, whether information, ideas, or products like your ice cream flavor, spread through a community of people. They try to understand how fast it spreads, how many people it reaches, and what factors influence its spread. By understanding these patterns, they can learn a lot about how people share and adopt new things!

  • @yousufmamsa (11 months ago)

    Great explanation of diffusion models. Thank you.

  • @AssemblyAI (11 months ago)

    Glad it was helpful!

  • @inetmiguel (5 months ago)

    Nice explanation! I feel like the video title is misleading: it is really one explanation going progressively deeper, not complete without the deeper levels, which differs a lot from other videos that restart the explanation from zero at each level. This is more like 4 shades of Diffusion :D Thanks for sharing!

  • @soulaymanal-abdallah6410 (a year ago)

    AMAZING!! Thanks so much!

  • @MrAlextorex (a year ago)

    Diffusion models actually predict a bit of noise to remove from the input noisy image at inference time. The noise is added to images only to produce training data.
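
The training-data step described above can be sketched with the usual closed-form forward process: a clean image can be jumped to any noise level in one shot, and the noise that was sampled is itself the prediction target. The schedule values below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative DDPM-style schedule: 1000 steps, linearly increasing variance.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def make_training_pair(x0, t):
    # Closed-form forward process: jump a clean image x0 straight to noise
    # level t in one shot. The sampled noise `eps` is itself the regression
    # target the network learns to predict.
    eps = rng.normal(size=x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return x_t, eps

x0 = rng.normal(size=(8, 8, 3))        # stands in for a clean training image
x_t, eps = make_training_pair(x0, t=999)

# At the final step almost no signal remains: x_t is nearly pure noise.
print(np.sqrt(alpha_bars[999]))
```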

  • @OpuYT (a year ago)

    Thank you for your explanation!

  • @AssemblyAI (a year ago)

    You're welcome!

  • @alirezaakhavi9943 (10 months ago)

    Really amazing video, thank you very much! Subbed! :)

  • @randomaccessofshortvideos6214 (5 months ago)

    ❤🎉 Amazing lecture.

  • @faridalaghmand4802 (3 months ago)

    Excellent :)

  • @whentheinternetwasgood8049 (a year ago)

    So much good info! Thank you!!!

  • @AssemblyAI (a year ago)

    You're very welcome!

  • @John-eq8cu (6 months ago)

    I want to understand diffusion models so I can understand how it's possible for artificial intelligence to produce an image. Your explanation helps. A bit.

  • @alaad1009 (5 months ago)

    Thank you!

  • @BartoszBielecki (a year ago)

    Regarding level 3: Is every single pixel diffused at each step, or is a randomly chosen subset? Is the sampling separate for every pixel, or do we take one value and then multiply each pixel by it? Subsequent diffusions work on the already-diffused value, I guess (we don't try to remember the mean of the original pixel, but just use the new one)?
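
For what it's worth, in standard DDPM-style diffusion the answer to the question above is: every pixel gets its own independently sampled Gaussian value at every step, and each step acts on the already-diffused image. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
beta = 0.02                              # variance added by this step

x_prev = rng.normal(size=(4, 4))         # an already partially-noised image

# One forward step: every pixel gets its OWN independently sampled Gaussian
# value (not one shared value multiplied into all pixels), and the step acts
# on the already-diffused x_prev, not on the original image.
eps = rng.normal(size=x_prev.shape)      # independent draw per pixel
x_next = np.sqrt(1.0 - beta) * x_prev + np.sqrt(beta) * eps

print(np.unique(eps).size)               # 16 distinct values -> one per pixel
```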

  • @JanMatusiewicz (a year ago)

    Thanks for the clear explanations and the link to the blog!

  • @AssemblyAI (a year ago)

    You're very welcome!

  • @andikafaishal2230 (a year ago)

    My brain cannot handle this.

  • @user-wr4yl7tx3w (a year ago)

    Wow, this is so helpful.

  • @AssemblyAI (a year ago)

    Great to hear!

  • @talktovipin1 (a year ago)

    Thanks for the nice explanation. I'd appreciate a similar explanation comparing DDPM vs. DDIM.

  • @AssemblyAI (a year ago)

    You're very welcome, Vipin! Noted your recommendation!

  • @jhanolaer8286 (a year ago)

    Beautiful ❤

  • @0xeb- (a year ago)

    Good job.

  • @dandogamer (a year ago)

    This was a great explanation! I tried to read the blog first, but the math notation was way over my head.

  • @AssemblyAI (a year ago)

    Thank you, Chewie :)

  • @al-aminibrahim1394 (a year ago)

    Thanks for this.

  • @AssemblyAI (a year ago)

    You bet!

  • @akrammekbal8936 (a year ago)

    Can a diffusion model add noise to image 1 and then, in the reverse process, make a different image (not the same one)? Please respond.

  • @thobeycampion5387 (a year ago)

    Wow, someone finally pulled this off.

  • @chaneydw (a year ago)

    A very confusing, yet somehow great explanation of diffusion models. Thank you!

  • @AssemblyAI (a year ago)

    Confusing but somehow positive feedback. :D Thank you!

  • @Grifter (a year ago)

    Fascinating stuff, great explanation.

  • @AssemblyAI (a year ago)

    Thank you!

  • @Fsh98 (4 months ago)

    NICE

  • @ahmedsinger9435 (a year ago)

    Tysm

  • @AssemblyAI (a year ago)

    You're very welcome. :)

  • @automatalearninglab (a year ago)

    Great video! B)

  • @AssemblyAI (a year ago)

    Thank you!

  • @harshadmane8785 (6 months ago)

    Great

  • @Arrogan28 (3 months ago)

    Wow, this is really great; it definitely helped me understand how these models work. However, I did have one question. Your explanation of how Gaussian noise is created for an image confused me a bit. When I have had to generate an image of pure Gaussian noise before, I just generated it per pixel, calling a function to get a random number following a Gaussian distribution, usually centered so that 0.5 is the zero value, basically remapping the -1 to 1 distribution to 0 to 1, i.e., Xnew = (X/2) + 0.5. Hopefully that makes sense.

    But the way you described it sounded like the noise was created by placing a sort of splat on the image following a Gaussian distribution, and then placing subsequent splats at positions based on the previous splat's position. I guess this is needed so you can generate all the in-between time steps from image to pure noise, rather than just the final image. But I didn't quite get exactly how the noise is created. For example, are you splatting a Gaussian distribution that spreads over several pixels at each position, or does it affect just one pixel? I could see it happening both ways and wasn't certain from your explanation which one it is. That is: do you pick a position and then, on that pixel, create one value drawn from the Gaussian distribution? Or are you placing a splat that is brightest at its center and falls off to zero following a Gaussian curve? If the latter, how wide is it, i.e., what is its radius in pixels? And in either case, how is that mixed with the image? Do you multiply the image by the noise value in that pixel, or blend between them?

    I doubt anyone will read this, as it's quite a long comment/question for a YouTube video, but I thought it wouldn't hurt to try, as I am very interested in how these models work and the under-the-hood details...

  • @jayseb (a year ago)

    Great explanation, easy to follow. So in essence, the first step is fixed, then decoding is variable, if I understand it right?

  • @MrAlextorex (a year ago)

    The first step is just to generate training data: final images with corresponding noisy images and the number of steps used to add noise.

  • @targetdexter (a year ago)

    Great explanation!

  • @AssemblyAI (a year ago)

    Glad you think so!

  • @rasmustoivanen2709 (a year ago)

    Can you post a follow-up explanation of how text-conditioned generation works? Imagen, for example, used T5, but how does that text embedding actually affect the generated image, and how is it trained? In the end we have a system where the "noise" is denoised conditioned on some text embedding, so I am curious how that process works.

  • @AssemblyAI (a year ago)

    Thank you for this suggestion, Rasmus, noted!

  • @MrAlextorex (a year ago)

    The text is transformed into visual embeddings (visual concepts) using another model and fed to the diffusion model alongside the noisy image. The other model is CLIP, which was trained separately on images with associated descriptions.

  • @cevxj (a year ago)

    Thanks

  • @AssemblyAI (a year ago)

    You're very welcome!

  • @audiogus2651 (a year ago)

    Anyone else see a horse in this drop of paint? 1:00

  • @BogdanEchoMilosevic (a year ago)

    Having just watched 5 videos on this, umm, "topic?", I feel as if I have been in a coma for 25 years. I am looking for the simplest possible explanation of how this whole AI thing works, yet there don't seem to be any videos that can explain it without using already-established terminology that, to me, is completely foreign. Your video is obviously well made, and you are good at explaining this, especially with the example of a drop of paint in water, but I am obviously far from even beginning anything beyond that. Apart from understanding "noise", I have no clue what "diffusion" or "model" or anything means. I can always watch videos on any topic, e.g. quantum physics, rocket science, robotics, and get the basic idea, but this time I feel like I'm years behind... If you could make a video explaining this as if to someone in kindergarten, I would definitely come back and watch.

  • @ONDANOTA (a year ago)

    What role do the images in the training set play? Are diffusion models violating copyright or not?

  • @abail7010 (a year ago)

    This depends on the data the model is trained on! There is not just one specific diffusion model; you can train as many models as you like. If your training data contains copyrighted images, you won't be able to use the model for commercial purposes, but there are many open-source, non-copyrighted datasets out there!

  • @adammason1587 (a year ago)

    @AssemblyAI Why 255 (probability density graph)? Does it have to do with binary? Network engineer here, trying to draw correlations between IP address ranges being 255, subnet ranges being 255, and the graph you displayed. They all have binary masks in common, hence why I am asking.

  • @vyndecimibd (a year ago)

    It has to do with binary indeed. 0-255 just represents all the 8-bit values possible. With standard colored pixels we have an 8-bit value for each of the red, green, and blue channels. A 24-bit value per pixel is simply the standard; it already gives 16+ million unique colors per pixel. en.wikipedia.org/wiki/List_of_monochrome_and_RGB_color_formats#24-bit_RGB
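
A quick numeric illustration of the reply above, including the rescaling to a symmetric range that diffusion pipelines commonly apply before noising (the specific `[-1, 1]` convention is the common one, though implementations vary):

```python
import numpy as np

# An 8-bit RGB pixel: three channels, each holding one of 2**8 = 256 values.
pixel = np.array([255, 128, 0], dtype=np.uint8)     # an orange-ish pixel

print(2 ** 8)       # 256 possible values per channel (0..255)
print(256 ** 3)     # 16777216 representable colors per pixel

# Diffusion models typically rescale 0..255 to a symmetric range such as
# [-1, 1] before adding Gaussian noise.
scaled = pixel.astype(np.float64) / 127.5 - 1.0
print(scaled)
```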

  • @vasudevankannan9823 (2 months ago)

    Can diffusion models be used to denoise audio? If yes, how?

  • @Gurugurustan (a year ago)

    Can someone explain why we need to know the initial position of the ink in the water if we already knew where the ink was first introduced?

  • @AssemblyAI (a year ago)

    Think of it not as the position but as the shape of the ink. We're trying to reach the initial shape right after the moment it was dropped.

  • @kartikpodugu (a year ago)

    Learnt a lot of new things from this video: why it is called a U-Net, why it is called a diffusion model, and what a diffusion model does and how it does it. Thanks.

  • @saraebrahimi3795 (5 months ago)

    That was awesome!

  • @S.Mullen (a year ago)

    The explanation of adding noise was well done, but the reverse process, by far the strangest process, was not really explained at all. You introduced, but did not explain, some learning process. This unexplained process "somehow" gets back the image. Every "explanation" of SD always skips over this step! Why? (Also skipped: how the text prompt is "combined" with the image. Folks mumble about CLIP but never clearly explain it.) You are a very, very good presenter. Please take 15 minutes to "explain" SD.

  • @RTukka (a year ago)

    Yeah, I am having this frustration as well, except I think I may understand the concepts even more poorly than you. Regarding how the Level 4 part "somehow" gets back to the image: it could be that U-Nets are just really complicated, so it almost has to be hand-waved? Every explanation I've seen of them (which is not many) immediately descends into highly technical language. It's evidently a stepwise process, but I don't really understand anything about what happens in each of those steps and what data is used during them.

    I also don't think I understand the point of _gradually_ adding noise to the original image if you just end up with 100% noise at the end, which is where the denoising process starts. Exactly when and how are the partially noised images used? In the U-Net? If so, either the explanations of U-Nets I've seen are missing that info, or they're explaining it in a way I completely fail to comprehend.

    In addition, the explanations I've seen tend to use a single image as an example of how the model is trained. But I understand that these models are trained on many images. So the steps laid out in this video are repeated on thousands of images to train the model to generate an image of a dog (or any image??), but how is information from repeating that process combined into the algorithm or latent space or whatever? Do you start with a virgin model, or some generalized model or latent space, which then gets modified when you train it on the first image, and then carry those modifications over when you train on the next image? It seems like that ought to be how it works, but if it is, a great explanation of this stuff would make that explicit.

    And then, yeah, how do text prompts work? Both at a basic level, with a single-word prompt like "dog," and also: how are complicated multi-word prompts managed? (I imagine many of the common "mistakes" of diffusion models might be illustrative.)

  • @abail7010 (a year ago)

    A U-Net is a standardized deep-learning model that takes an image as input and outputs another image with the same dimensions. It is trained the conventional way, with the so-called gradient descent algorithm, to minimize a least-squares error loss. In this case, the model aims to predict a mask representing the noise that was added in the previous step, so that we can simply subtract that noise from the noisy image to get back the original. I hope that was at least somewhat helpful? :)
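
The objective described above can be sketched with a deliberately tiny "network" (a single scalar weight, purely illustrative, not a real U-Net) trained by gradient descent on the least-squares noise-prediction loss:

```python
import numpy as np

rng = np.random.default_rng(7)

# Deliberately tiny "U-Net": a single scalar weight w that predicts the noise
# in a noisy image as w * x_noisy. A real U-Net is a deep image-to-image
# network, but the objective below is the same least-squares noise-prediction
# loss, minimized by gradient descent.
x0 = rng.normal(size=(8, 8))                  # clean image
eps = rng.normal(size=(8, 8))                 # noise that was added
a = 0.6                                       # signal level sqrt(alpha_bar)
x_noisy = a * x0 + np.sqrt(1 - a**2) * eps

w, lr = 0.0, 0.01
for _ in range(500):
    pred = w * x_noisy
    grad = np.mean(2 * (pred - eps) * x_noisy)    # d/dw of mean((pred-eps)^2)
    w -= lr * grad

# Subtracting the predicted noise moves x_noisy back toward the scaled clean
# image a * x0.
denoised = x_noisy - np.sqrt(1 - a**2) * (w * x_noisy)
err_before = np.mean((x_noisy - a * x0) ** 2)
err_after = np.mean((denoised - a * x0) ** 2)
print(err_before, err_after)
```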

  • @jocke8277 (a year ago)

    @abail7010 The U-Net predicts the noise, and a scheduler removes the noise from the image, right?

  • @abail7010 (a year ago)

    @jocke8277 On a high level, yes, that is true! :)

  • @jwithy (a year ago)

    "OK, level one... non-equilibrium thermodynamics" 🥴

  • @bayesianlee6447 (a year ago)

    Level 0: annealed Langevin dynamics

  • @AssemblyAI (a year ago)

    Haha, I understand the frustration :D But it's just what diffusion models are based on, so you don't actually have to understand non-equilibrium thermodynamics. :D -Mısra

  • @xgalarion8659 (11 months ago)

    Good explanation, but I do hate it when papers add needless math and physics that are tangential at best, when they should be describing their model in a simple way.

  • @PhilipRittscher (a year ago)

    "Full noise" that contains a message is not "white noise". These input "white noise" images are just a puzzle containing info for a computer algorithm to solve. I would not, at this point, want to bet our future, or even crossing the street, on "advanced AI".

  • @MrAlextorex (a year ago)

    "Full noise" is just used for the AI to see patterns in, like a kid seeing shapes in a noisy TV screen. It is a way to give imagination to the AI. To get what you want, you can guide the generation using text prompts.

  • @trentkuhn (a year ago)

    Would you say the process is fractal?

  • @generichuman_ (a year ago)

    It most definitely is not fractal.

  • @angelxiii3181 (a year ago)

    I wish my brain were smart enough to understand!

  • @jonathaningram8157 (a year ago)

    The whole language part is missing.

  • @dcodeai369 (a year ago)

    I have one question. I hate math but I love training models. I tried to learn math, but god, it's 😵😵. Any advice?

  • @ujjalkrdutta7854 (a year ago)

    Stick to applied ML, then. In that case you can use existing frameworks and libraries to implement models and solve problems without knowing how they work under the hood. 1. But if you do want to understand the math, the only way is to find better learning resources and keep trying iteratively. Often it's not the math alone, but the way it is taught, that makes a whole lot of difference in one's understanding. For example, back in grad school I used to refer to Salman Khan's math videos to get a real understanding of linear algebra concepts (which I could not attain even after reading a few standard books). 2. Having said that, each of us has to maintain a trade-off between deep dives into the math and actual implementation. No one knows everything 100%.

  • @dcodeai369 (a year ago)

    @ujjalkrdutta7854 What you said about math is true. I'm sticking with applied ML for now; there is a lot to explore there. Thank you for your time.

  • @zenchiassassin283 (7 months ago)

    Any level >= 5?

  • @yaruuvva (10 months ago)

    I need level 5/6/7 explanations.

  • @abraruralam3534 (a year ago)

    This feels like it should not be possible... then again, it's not too different from us humans imagining faces in the clouds. Computers just take this hallucination to the next level.

  • @terjeoseberg990 (6 months ago)

    Wow! There's another video of yours below this one, and your hair is so different that I didn't recognize that it's you.

  • @truejim (a year ago)

    In the level 1 explanation, what's the point of introducing the phrase "thermodynamic equilibrium"? Most lay people understand what it means when we say food coloring diffuses into clear water. Reminding the viewer why that happens from a physics standpoint makes the level 1 explanation less clear, not more clear.

  • @retroathlete5814 (a year ago)

    Fine, you add noise to an image and then restore it. VERY simple concepts (even if very hard in practice). But the magic of DALL-E, Midjourney, and Stable Diffusion is the creation of NEW images. This is the third video I'm watching that explains the same trivial diffusion concept. Guess I'll have to ask ChatGPT instead.

  • @curvingorbit8262 (3 months ago)

    Exactly! I've watched and read numerous explanations of diffusion models, but not one so far has told me how the process ends with an image DIFFERENT from the one with which it began.

  • @chrisyoutube8488 (5 months ago)

    I came to the comments to see if this was Mandy Moore.

  • @bunnystrasse (3 months ago)

    Who is the lady? Her @?

  • @kaiboshvanhortonsnort359 (a year ago)

    I dunno about all that; I just type in 'boobs' and the thing delivers. Whatever math those silicon wafers decide to subject themselves to, that's on them.

  • @resurrection355 (4 months ago)

    You are beautiful.

  • @samsara2024 (a year ago)

    6 minutes explaining nothing, and at the end... blablabla, super fast, about convolution... and nothing clear :/

  • @potrishead (a year ago)

    Sorry, but this video is very frustrating. Nothing was explained about either the technique for reversing or how it relates to new image creation when prompting, which is obviously what we are mostly interested in.

  • @rae1220 (a month ago)

    Then this just isn't the video for you. It was purely explaining the concept. Helped me a lot.

  • @milesgreb3537 (5 months ago)

    This stuff just sucks, man.

  • @Adityak1997 (2 months ago)

    Who all think she's AI-generated??

  • @MistereXMachina (a year ago)

    Can we take a moment to appreciate how silly it is to say, "We're going to explain this in 4 levels - 1 being the easiest, 4 being the hardest," and then immediately start level 1 with: "Diffusion models were inspired by non-equilibrium thermodynamics from physics, and as you can understand from the name, this field deals with systems that are not in thermodynamic equilibrium." Next time ask ChatGPT to write it for you, lmao. Imagine going up to a five-year-old and saying, "Hey kid, you're familiar with thermodynamic equilibrium, right? Well, the area of machine learning concerned with image generation using diffusion models takes that principle, but is inspired by its inverse."

  • @franzmkrumenacker2519 (a year ago)

    Please note that what she referenced in level 1 is secondary-school material. The authors obviously assumed *this* basis to build upon, not that of a five-year-old.

  • @cipherxen2 (11 months ago)

    She might not have a technical background. No technical person would mispronounce variance as variation.