OpenAI Sora and DiTs: Scalable Diffusion Models with Transformers

Sora: openai.com/sora
Sora paper (Video generation models as world simulators): openai.com/research/video-gen...
DiTs - Scalable Diffusion Models with Transformers paper: arxiv.org/abs/2212.09748
My notes: drive.google.com/file/d/1h2pc...

Comments: 14

  • @AbhishekSingh-qp5xk · 4 months ago

    Incredible explanations. Love the clarity of thought and illustrations to visualize the concepts.

  • @yuxiangzhang2343 · 4 months ago

    All concepts beautifully explained! Very intuitive but accurate at the same time! Thank you so much!

  • @xplained6486 · 2 months ago

    Great explanation, not too much detail and not too little. You hit a very good balance, which makes it easy to follow the concepts :)

  • @systemdesignstudygroup315 · 4 months ago

    I was just looking for this on your channel! Thanks!

  • @progzyy · 2 months ago

    Hey! I'd already watched some of your videos before, but I randomly landed on this one again when I needed to learn about DiTs. I love how well and how deeply you explain things; even when you cover basic stuff, it helps reinforce the learning. And even though it's an hour long, it feels like everything in it is needed.

  • @maxziebell4013 · 4 months ago

    Great walkthrough.

  • @johntanchongmin · 2 months ago

    Enjoy your content! Keep it up!

  • @signitureDGK · 4 months ago

    Great explanation! I could see how they probably used a ViViT-style model for Sora. ViViT models have separate temporal and spatial self-attention encoders, probably realized as two DiT blocks (the factorized-encoder ViViT variant; see the sketch below). Also, when would the multi-head cross-attention version of the block be used? Say, for generating images from text prompts with more than 1000 classes, or perhaps for conditioning on even more modalities like audio? Would the DiT block with cross-attention be preferred there? Great video!
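
    As a rough sketch of the factorized spatio-temporal attention described above, here is a minimal PyTorch example (the `SpatioTemporalBlock` name and layout are illustrative assumptions, not Sora's or ViViT's actual implementation):

    ```python
    import torch
    import torch.nn as nn

    class SpatioTemporalBlock(nn.Module):
        """Hypothetical factorized attention block (ViViT-style):
        spatial self-attention over patches within each frame,
        then temporal self-attention over frames at each patch position."""
        def __init__(self, dim: int, heads: int = 8):
            super().__init__()
            self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm1 = nn.LayerNorm(dim)
            self.norm2 = nn.LayerNorm(dim)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, frames, patches, dim)
            B, T, P, D = x.shape
            # Spatial: attend across the P patches of each frame independently.
            s = x.reshape(B * T, P, D)
            n = self.norm1(s)
            s = s + self.spatial_attn(n, n, n)[0]
            # Temporal: attend across the T frames at each patch position.
            t = s.reshape(B, T, P, D).permute(0, 2, 1, 3).reshape(B * P, T, D)
            n = self.norm2(t)
            t = t + self.temporal_attn(n, n, n)[0]
            return t.reshape(B, P, T, D).permute(0, 2, 1, 3)

    # Example: 2 videos, 8 frames, 16 patches per frame, 64-dim tokens.
    x = torch.randn(2, 8, 16, 64)
    print(SpatioTemporalBlock(64)(x).shape)  # torch.Size([2, 8, 16, 64])
    ```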

  • @thebgEntertainment1 · 4 months ago

    Great video

  • @bibiworm · 4 months ago

    11:58 Are you talking about an ODE solver, i.e., an ordinary differential equation solver? Thanks.

  • @regrefree · 4 months ago

    Good explanation of the background part. Question: when you explained cross-attention, you said q = z and k, v = [c; t]. They don't give the detail in the paper, but I think it should be the other way around, q = [c; t] and k, v = z, right?

  • @gabrielmongaras · 4 months ago

    Usually the conditioning goes into the keys and values, as in the Stable Diffusion paper. If Q, K, V have shapes (N, d), (M, d), and (M, d), where N is the sequence length and M is the context length, then the output shape is softmax[(N, d)(d, M)](M, d) -> (N, M)(M, d) -> (N, d). However, if we invert this so that Q, K, V have shapes (M, d), (N, d), and (N, d), then the output shape is softmax[(M, d)(d, N)](N, d) -> (M, N)(N, d) -> (M, d), which is a sequence with the length of the conditioning sequence.
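
    A quick shape check of this argument (a minimal single-head sketch without learned projections, for illustrating shapes only):

    ```python
    import torch
    import torch.nn.functional as F

    def cross_attention(q, k, v):
        # q: (N, d) query tokens; k, v: (M, d) key/value tokens.
        weights = F.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)  # (N, M)
        return weights @ v  # (N, d): one output token per query token

    N, M, d = 256, 77, 64   # e.g., 256 latent patches, 77 conditioning tokens
    z = torch.randn(N, d)   # latent sequence
    c = torch.randn(M, d)   # conditioning sequence [c; t]
    print(cross_attention(z, c, c).shape)  # (256, 64): stays in latent space
    print(cross_attention(c, z, z).shape)  # (77, 64): inverted, conditioning-length output
    ```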

  • @regrefree · 4 months ago

    @@gabrielmongaras Yep, I agree. I just read the Stable Diffusion paper carefully, and you are right: they use Q = Z and K, V = C. I would have guessed the reverse, since the UNet produces Z_{T-1} from Z_T. Also, their cross-attention weights' shapes don't make sense; I'm sure they made a mistake. They should have said Q, V = Z and K = C.

  • @bibiworm · 4 months ago

    14:13 I don't quite understand the equation for x_{t-1}.
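
    For reference, the equation at that timestamp is presumably the standard DDPM reverse (sampling) step from Ho et al. (2020), which in the usual notation reads:

    ```latex
    % Standard DDPM reverse step; \epsilon_\theta is the noise predicted
    % by the network and z ~ N(0, I) is fresh Gaussian noise (z = 0 at
    % the final step). Presumably the equation referenced at 14:13.
    x_{t-1} = \frac{1}{\sqrt{\alpha_t}}
              \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}
                     \, \epsilon_\theta(x_t, t) \right)
              + \sigma_t z
    ```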