Imagen, the DALL-E 2 competitor from Google Brain, explained 🧠 | Diffusion models illustrated

Imagen from Google Brain 🧠 is competing with DALL-E 2 when it comes to generating amazing images from just text! Here is an overview of Imagen, DALL-E 2, and GLIDE, which are all diffusion-based text-to-image generators.
SPONSOR: Weights & Biases 👉 wandb.me/ai-co...
📺 Diffusion models and GLIDE explained: • Diffusion models expla...
📺 DALL-E: • OpenAI's DALL-E explai...
📺 GPT-2 leaks training data: • Leaking training data ...
Thanks to our Patrons who support us in Tier 2, 3, 4: 🙏
Don Rosenthal, Dres. Trost GbR, banana.dev -- Kyle Morris, Julián Salazar, Edvard Grødem, Vignesh Valliappan, Kevin Tsai, Mutual Information, Mike Ton
Check out our daily #MachineLearning Quiz Questions: / aicoffeebreak
➡️ AI Coffee Break Merch! 🛍️ aicoffeebreak....
Paper 📜: Saharia, Chitwan, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L. Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, Seyedeh Sara Mahdavi, Raphael Gontijo Lopes, Tim Salimans, Jonathan Ho, David Fleet and Mohammad Norouzi. “Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding.” (2022). arxiv.org/abs/...
🔗 Imagen website (Google Brain): imagen.researc...
🔗 DrawBench: docs.google.co...
🔗 MIT technology review: www.technology...
💭 Gary Marcus on Imagen compositionality: garymarcus.sub...
🖼 Twitter thread generating images from your input text: / 1529497457234780162
📜 VQGAN-CLIP paper: arxiv.org/abs/...
📜 High-Resolution Image Synthesis with Latent Diffusion Models paper: arxiv.org/abs/...
💻 If interested in the basic code of diffusion models, here is a wonderful annotated diffusion model from 🤗: huggingface.co...
▶ Outline:
00:00 Generating images from text
00:40 Weights & Biases (Sponsor)
01:40 A brief history of text-to-image generators
04:44 How does Imagen work?
06:36 Classifier-free guidance
07:55 What is Imagen good at? & DrawBench
▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
🔥 Optionally, pay us a coffee to help with our Coffee Bean production! ☕
Patreon: / aicoffeebreak
Ko-fi: ko-fi.com/aico...
▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
🔗 Links:
AICoffeeBreakQuiz: / aicoffeebreak
Twitter: / aicoffeebreak
Reddit: / aicoffeebreak
YouTube: / aicoffeebreak
#AICoffeeBreak #MsCoffeeBean #MachineLearning #AI #research
Music 🎵 : Til I Hear'em Say (Instrumental) - NEFFEX

Comments: 50

  • @mizupof · 2 years ago

    Video after video. I don't understand how "only" 16.2k people follow you. I'll share this diamond. Thank you so much.

  • @AICoffeeBreak · 2 years ago

    Thanks! 🤗

  • @harumambaru · 2 years ago

    4:20 "We don't want to minimize the work done here or the achievements, we just want to emphasize how the naming of things and the right introduction to the public..." !!! I don't want to minimize the value of this review, but personally this observation is much more valuable to me than the technicalities of all of those papers (because I don't use them directly in my work).

  • @zoombapup · 2 years ago

    After having played with DALL-E 2 for a few weeks, a few things strike me. 1) This is a huge advance in coherent image generation; most of the resulting images feel "good" in a way I couldn't say about previous models. 2) That is going to hit the creative industries hard. 3) Having a corporation gatekeep these models is hugely problematic. They don't allow you to prompt for whatever you like, they assert copyright, etc. 4) We need an open-source, freely available version, not least because being "allowed" access is too tenuous for artistic use, where you might break the terms and conditions. 6) We need better methods of guiding. DALL-E 2 has "variations", which helps, but I want to tell the model what I like in parts of generated images and have it understand the latents that produced those outputs specifically. So overall, this is a really exciting new set of models. I agree with you about the photorealism part; I often switch between them for more creative outputs. Truly revolutionary!

  • @drdca8263 · 2 years ago

    Point 5?

  • @zoombapup · 2 years ago

    @@drdca8263 Dammit :) 5) We need more resolution and non-square outputs!

  • @undrash · 2 years ago

    Yes!! Thank you for saying this! I am absolutely in love with this tech, but this keeps me up at night. I started working on this as soon as I noticed these issues. There are a few very promising projects out there already, for example Boris Dayma's dalle-mini and ruDALL-E, both amazing. I am a software developer, but I have a background in visual art and design, so my plan is to get the software people together with the art people to collaborate on this. It cannot be left in the hands of corporations. This is basically a copyright loophole that they managed to exploit to loot every single digital artist and photographer on the open web. Current copyright laws are designed to protect humans from humans; a human against an AI doesn't stand a chance. The genie, however, is out of the bottle now. There's no way of putting it back. The only way to fight this is to create a much better open-source and freely available option for the public, and I am convinced it can be pulled off. Anyway, sorry for the rant. I am passionate about this. If anybody is interested in getting involved, please reach out to me, let's bounce some ideas around and try to steer the future in a positive direction.

  • @undrash · 2 years ago

    @@zoombapup I joined the LAION discord about a week ago and have been going through their material. Amazing community!

  • @TheGamingChad. · 2 years ago

    How did you get access to DALL-E 2?

  • @samanthaqiu3416 · 1 year ago

    Thanks for these videos, watching the whole series!

  • @liam9519 · 2 years ago

    Omg, FINALLY I understand "classifier-free guidance" after watching this. I really dislike whoever came up with this name; it should be called "classifier-overweighted guidance" or something, because it's not at all 'classifier-free', is it? Great vid as always!

  • @arirahikkala · 2 years ago

    No, they're two different tricks. Overweighting the classifier guidance came first. Classifier-free guidance, really informally and practically, means: You train a conditional diffusion model, but sometimes hide the conditioning signal, so you get a generative model that can work as both conditional and unconditional. Then at each diffusion step, you get both of their score estimates (directions to step toward), and step further toward the class-conditioned direction and away from the unconditioned direction.

  • @liam9519 · 2 years ago

    @@arirahikkala Ahhh ok. So in the context of this approach vs. just overweighting I can see how they came to that name, as they compare gradients to a 'classifier-free' model, which wasn't there before. It's still not the best naming, imo, but makes more sense.

  • @Skinishh · 2 years ago

    @@liam9519 I think you did not get it. Watch the video again. There is no classifier; it's just passing the input with and without text, measuring the difference, and then taking a step along that difference multiplied by a scalar.

  • @liam9519 · 2 years ago

    @@Skinishh Yes but to train the text-conditional part of the model, we needed the model to learn to "classify" (perhaps implicitly) whether a text/image pair belong together or not. A conditional model implies a classifier, to some extent. So I don't think it's really 'classifier-free'.

  • @Skinishh · 2 years ago

    @@liam9519 The text part is a pretrained, frozen language model. There is no classifier.
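
    A minimal PyTorch sketch of the classifier-free guidance step discussed in this thread; the `model` call signature and the guidance scale are hypothetical stand-ins, not Imagen's actual code:

        import torch

        @torch.no_grad()
        def guided_eps(model, x_t, t, text_emb, w: float = 7.5):
            # Noise estimate conditioned on the text embedding.
            eps_cond = model(x_t, t, cond=text_emb)
            # Unconditional estimate: the same model with the conditioning
            # hidden (e.g. a null embedding, as during training dropout).
            eps_uncond = model(x_t, t, cond=None)
            # w = 0 gives the unconditional model, w = 1 the plain
            # conditional one; w > 1 steps further toward the
            # text-conditioned direction and away from the unconditional one.
            return eps_uncond + w * (eps_cond - eps_uncond)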

  • @bryanpiguave9445 · 2 years ago

    Your channel is a hidden gem! Thanks for uploading DL content. 😁👍

  • @bpmsilva · 2 years ago

    Excellent video with a rich and detailed discussion comparing DALL-E 2 and Imagen! I'm not sure if Imagen's results are better than DALL-E 2's, although the authors of the former gave a decent argument using the DrawBench benchmark. I thought they didn't show human faces due to copyright and legal issues, but the bias issues you mentioned seem to be the correct answer.

  • @jessedeng3300 · 2 years ago

    Hi! I very much enjoy your videos. Thank you! If you take requests, I would greatly appreciate an explanation of flow-based models, as I have been trying to wrap my head around them for a while, and your educational abilities are excellent.

  • @deeper_learning · 2 years ago

    Would you mind recommending the social media accounts that discuss the latest hot papers by Facebook and Google?

  • @AICoffeeBreak · 2 years ago

    You can follow @ak92501 on Twitter for a daily digest of the most important papers. There are a lot of Facebook and Google papers among them, but luckily it is not limited to those.

  • @AICoffeeBreak · 2 years ago

    Another great one is @arankomatsuzaki .

  • @DerPylz · 2 years ago

    I suggest AI Coffee Break! ☕

  • @deeper_learning · 2 years ago

    @@DerPylz Cool, fan of the channel 👏

  • @conan_der_barbar · 2 years ago

    A video on LDM / Stable Diffusion would be great, especially since it's not as prominent in the public eye as DALL-E 2.

  • @the_nows · 2 years ago

    Can't wait for my corgi-al visit

  • @killers31337 · 2 years ago

    People are more critical of LLMs than they are of other people. It's absolutely normal for people to mishear or misinterpret something. People ask clarifying questions all the time. But if a model misinterprets something, it's considered a limitation, a "lack of understanding", or a failure. Don't forget that a model is required to produce an answer instantly; it's not allowed to take a pause and think, or to ask clarifying questions, etc. I'd argue that drawing "a horse riding an astronaut" as a horse-riding astronaut is actually correct in these circumstances. There's likely a non-zero amount of noise and non-grammatical language in the training data. So if a person riding a horse is vastly more likely than a horse riding a person, it is more reasonable to interpret the prompt as a possible error than to draw what is requested literally. In that case, it might be possible to solve it with a prompt like "Draw this precisely".

  • @alexandrupapiu3082 · 1 year ago

    From their paper: "On the set with no people, there is a boost in preference rate of Imagen to 43.6%, indicating Imagen's limited ability to generate photorealistic people." So they don't show examples with people because they're probably not great and/or likely scary-looking :)

  • @GRAMBOStudio · 2 years ago

    Is Ms. Coffee Bean controlled by you during editing? Or is she auto-selected based on emotion detection from a language model?

  • @AICoffeeBreak · 2 years ago

    I always think of implementing that. I wonder which would come first: me hiring an editor, or me sitting down to automate her. What scares me is the integration with the editing software, not so much the project per se in a controlled environment where input and output are already given.

  • @GRAMBOStudio · 2 years ago

    @@AICoffeeBreak Well, I'm a professional editor. I like your stuff a lot. Let me know if I can help someday :)

  • @AICoffeeBreak · 2 years ago

    Thanks! Just checked out your channel. Cool stuff! 😎

  • @mixer0014 · 2 years ago

    @@AICoffeeBreak A YouTuber, 'carykh', has created an open-source tool for that exact purpose. It was released only a few days ago. You might want to check it out; it should be his newest video.

  • @AICoffeeBreak · 2 years ago

    Thanks, I just had a look at it! It is not exactly what I want, since he focuses on lip-syncing (which Ms. Coffee Bean does not do). One still needs to annotate the emotions / stances in the script manually. Lip-syncing is something that Adobe tools can do too, if one has the money. 🤑 I really should sit down and do this: just fine-tune an emotion-recognition Coffee Bean model on my existing videos.

  • @Skinishh · 2 years ago

    T5 does not use a CLS token. What is the CLS token you refer to?

  • @AICoffeeBreak · 2 years ago

    Great question! I was re-using the visuals from GLIDE. 🙈 For Imagen, it is like this: "The network is conditioned on text embeddings via a pooled embedding vector, added to the diffusion timestep embedding similar to the class embedding conditioning method used in [they cite Prafulla Dhariwal and Alex Nichol. Diffusion models beat GANs on image synthesis. In NeurIPS, 2021.]" The idea is to summarize the text into one vector, similar to the [CLS] token.

  • @Skinishh · 2 years ago

    @@AICoffeeBreak thanks for the answer! And keep up with the great videos! Huge fan here!
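
    A minimal PyTorch sketch of the pooled-embedding conditioning quoted in this thread; the dimensions and names are hypothetical illustrations, not Imagen's actual values:

        import torch
        import torch.nn as nn

        d_text, d_model = 1024, 512            # made-up sizes for illustration
        pool_proj = nn.Linear(d_text, d_model)

        def condition(timestep_emb, text_token_embs):
            # Mean-pool the frozen T5 token embeddings into one summary
            # vector, playing the role of a [CLS] token...
            pooled = text_token_embs.mean(dim=1)      # (batch, d_text)
            # ...and add its projection to the diffusion timestep embedding.
            return timestep_emb + pool_proj(pooled)   # (batch, d_model)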

  • @johnvonhorn2942 · 2 years ago

    To allow more creativity, just apply a Unix approach: pipe in text from yet another program that takes a _laissez faire_ attitude and can play _fast and loose_ with said text. Extending the avocado chair, we might input "a chair", where we define the set within the text program, and the image generator pumps out a load of fruit chairs. Let's get Dr. Pârcălăbescu seated and comfortable, she is a VIP. ❤ We love you, Romania ❤

  • @mianzhipan3327 · 1 year ago

    Wowwwww! I just realized you are the author of VALSE and SEE PAST WORDS. I am also interested in probing VLP models, and your papers inspire me a lot. But I have always wondered what the reason is for the phenomenon discovered in VALSE 🤣🤣🤣 Why can't these models distinguish those small but meaning-changing modifications in language? 🤪🤪🤪

  • @AICoffeeBreak · 1 year ago

    Thanks a lot for writing! 😃 It's mostly the training of these models; they are almost never in a situation where they must distinguish small changes, only very large ones (an image with its caption vs. the image with a random caption).

  • @mianzhipan3327 · 1 year ago

    @@AICoffeeBreak Thanks for your reply!! I agree with your opinion. Actually, we can say these models cannot 'understand' language. But I think enabling models to distinguish such changes during pretraining is quite difficult. From the perspective of adversarial training, we want the model to be less sensitive to small, meaning-preserving changes (adversarial examples), while we want it to be sensitive to the changes in VALSE (some papers call them contrastive examples). Maybe an image-text matching head cannot learn such a complicated decision boundary 😄

  • @yolgezerisvicrede · 2 years ago

    The revolution will come when explicit context can be used to drive these models (just my humble opinion).

  • @hayvenforpeace · 1 year ago

    I'm so impressed that Imagen can produce coherent text. DALL-E 2's gibberish text is cringeworthy (and sometimes hilarious).

  • @hmistry · 2 years ago

    But how do we USE it???

  • @sdmarlow3926 · 2 years ago

    But is this really AI?

  • @maalqua · 2 years ago

    yes 100%

  • @danielalorbi · 2 years ago

    Yes, 10%. But in the colloquial sense: "A system/ML model that does something that looks intelligent." ...But of course, you of all people already knew that, didn't you, Steven? Why even ask, lol.