No Priors Ep.61 | OpenAI's Sora Leaders Aditya Ramesh, Tim Brooks and Bill Peebles

Science & Technology

AI video generators are not just leveled-up image generators; they could be a big step forward on the path to AGI. This week on No Priors, the team from Sora is here to discuss OpenAI’s recently announced generative video model, which can take a text prompt and create realistic, visually coherent, high-definition clips that are up to a minute long.
Sora team leads Aditya Ramesh, Tim Brooks, and Bill Peebles join Elad and Sarah to talk about developing Sora. The generative video model isn’t yet available for public use, but the examples of its work are very impressive. However, they believe we’re still in the GPT-1 era of AI video models, and they’re focused on a slow rollout: to ensure the model offers real value to users and, more importantly, to put every feasible safety measure in place against deepfakes and misinformation. They also discuss what they’re learning from implementing diffusion transformers, why they believe video generation takes us one step closer to AGI, and why entertainment may not be the main use case for this tool in the future.
Show Notes:
0:00 Sora team introduction
1:05 Simulating the world with Sora
2:25 Building the most valuable consumer product
5:50 Alternative use cases and simulation capabilities
8:41 Diffusion transformers explanation
10:15 Scaling laws for video
13:08 Applying end-to-end deep learning to video
15:30 Tuning the visual aesthetic of Sora
17:08 The road to “desktop Pixar” for everyone
20:12 Safety for visual models
22:34 Limitations of Sora
25:04 Learning from how Sora is learning
29:32 The biggest misconceptions about video models

Comments: 30

  • @jonkraghshow
    @jonkraghshow 14 days ago

    Really great interview. Thanks to all.

  • @garsett
    @garsett 13 days ago

    Smart! 😊 Personalisation and aesthetics. Cool. But also PRACTICAL worldbuilding, please. How can this help create quality lifestyles? Happy communities? A convivial society?

  • @erniea5843
    @erniea5843 13 days ago

    Cool interview, awesome to see a glimpse into the innovation being done to develop these video models

  • @leslietetteh7292
    @leslietetteh7292 14 days ago

    Interesting video! It really highlights the potential of using 3D tokens with time as an added dimension :). My experience with diffusion models and video generation didn't show anything quite like Sora's temporal coherence. Looking ahead, I'm excited about the prospects of evolving from polygon rendering to photorealism via image-to-image inference. While I might be biased due to my interest in this rendering approach, I think incorporating 'possibility' as an additional dimension, as suggested by "imagining higher dimensions", could address issues like the leg-switching effects we currently see. Such physics-consistent behavior could potentially be borrowed from game-engine scenarios, where, unlike an apple that behaves predictably when dropped, a leg has specific movement constraints (also affected by perspective shifts). It's a speculative route, but it might be worth exploring if it promises substantial improvements.
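
    To make the "3D tokens with time as an added dimension" idea concrete: Sora reportedly represents video as spacetime patches. Below is a minimal sketch of cutting a video into such patches; the patch sizes and shapes are illustrative assumptions, not OpenAI's actual configuration.

        import numpy as np

        def video_to_spacetime_tokens(video, pt=4, ph=16, pw=16):
            # video: (T, H, W, C) array; pt/ph/pw are assumed patch sizes.
            T, H, W, C = video.shape
            assert T % pt == 0 and H % ph == 0 and W % pw == 0
            # Split the video into a grid of (pt, ph, pw) blocks, then flatten
            # each block into one row: a single "spacetime patch" token.
            v = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
            v = v.transpose(0, 2, 4, 1, 3, 5, 6)       # gather each 3D block
            return v.reshape(-1, pt * ph * pw * C)     # one row per patch

        video = np.random.rand(16, 256, 256, 3)        # 16 frames of 256x256 RGB
        print(video_to_spacetime_tokens(video).shape)  # (1024, 3072)

    Each row is one "3D token" a transformer can attend over, which is the time-as-an-extra-dimension framing the comment describes.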

  • @tianjiancai1118
    @tianjiancai1118 13 days ago

    Maybe internal 3D modeling should be introduced to solve the issue you mentioned (leg switching, or so-called "entity inconsistency").

  • @leslietetteh7292
    @leslietetteh7292 10 days ago

    @tianjiancai1118 How so? (NB: are you familiar with how diffusion models work? It's just learning to denoise an image, or a cube in this case. I'm just suggesting that it learn to denoise the branching possibilities rather than a cube, so it knows what is not a possibility; suggesting, not guaranteeing, that the idea will work. There are things like ControlNets, though, so if this internal 3D modelling is a valid idea, please share.)
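
    For readers following the denoising point: a diffusion model is trained to predict the noise that was added to a clean sample, whether that sample is a 2D image or a video "cube". A simplified, assumed epsilon-prediction training step is sketched below; `model` and the noise schedule are placeholders, not Sora's implementation.

        import torch

        def diffusion_loss(model, x0, alphas_cumprod):
            # x0: clean video cube (B, C, T, H, W); alphas_cumprod: noise schedule.
            b = x0.shape[0]
            t = torch.randint(0, len(alphas_cumprod), (b,), device=x0.device)
            a = alphas_cumprod[t].view(b, 1, 1, 1, 1)      # signal level at step t
            eps = torch.randn_like(x0)                     # the noise to predict
            x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps     # noised video cube
            return torch.mean((model(x_t, t) - eps) ** 2)  # learn to denoise

    Denoising "branching possibilities" instead of a single cube, as the comment suggests, would change what x0 represents rather than this basic objective.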

  • @tianjiancai1118
    @tianjiancai1118 10 days ago

    Sorry, to clarify: internal 3D modeling is hard to achieve in a diffusion model (as far as I know). What I mean is something like a totally new architecture.

  • @Glowbox3D
    @Glowbox3D 13 days ago

    As a 3D artist, filmmaker, and actor, I'm super excited about Sora. I can't wait to play around with this tech. It's pretty crazy how all these modalities are happening at once: image, video, voice, sound effects, and music. All the pipelines needed to create media. There will be a time, not far off, when we can plug in the prompt and Sora 5 will create all the needed departments. As the human working with this, I would of course be heavily involved in the iterative generation and direction of each piece of media, and in the end the edit would be mine. I wonder how much 'authorship' a creator will have or be given.

  • @boonkiathan
    @boonkiathan 10 days ago

    But prior to commercially utilizing the Sora output, there must be clarity on the source of the training data. It can't be OpenAI pushing it to creators, and the creators saying they trust OpenAI. This is almost the exact same issue as text generation for fun and brainstorming; fair use, I suppose.

  • @EnigmaCodeCrusher
    @EnigmaCodeCrusher 13 days ago

    Great interview

  • @JustinHalford
    @JustinHalford 14 days ago

    Compute and data are converging on becoming interchangeable sides of the same coin. FLOPs are all you need.

  • @oiuhwoechwe
    @oiuhwoechwe 14 days ago

    I'm old. These guys look like they just left high school.

  • @voncolborn9437
    @voncolborn9437 13 days ago

    Haha, I'm 71. I know exactly what you mean. The average age of the developers of the first Mac was 28. The AI community seems so young, but that gives these super-smart people a lot of years to get things straightened out.

  • @mosicr
    @mosicr 12 days ago

    They almost have. Peebles is just out of university.

  • @amritbro
    @amritbro 14 days ago

    I'm definitely following these three talented guys on X. Really great interview, and without a doubt Sora is already making an impact in Hollywood like Pixar once did during the Steve Jobs era.

  • @AIlysAI
    @AIlysAI 14 days ago

    Really, all these amazing things are possible just with transformers; there isn't much innovation beyond applying transformers to X and scaling it. The most innovative thing they did was the tokenization method using boxes; the rest is mechanics.

  • @leslietetteh7292
    @leslietetteh7292 14 days ago

    Adding another axis in the form of imaginary numbers has improved our ability to model higher-dimensional interactions before. That take is negative, bordering on bias - if it isn't innovation, then why didn't everyone else do it?

  • @BadWithNames123
    @BadWithNames123 10 days ago

    vocal fry contest

  • @phen-themoogle7651
    @phen-themoogle7651 13 days ago

    The Matrix basically

  • @davidh.65
    @davidh.65 13 days ago

    Why would they hype Sora up and then not even have a timeline for releasing a product??

  • @tianjiancai1118
    @tianjiancai1118 13 days ago

    Because they are still working on preventing misuse.

  • @jeffspaulding43
    @jeffspaulding43 14 days ago

    Our subconscious does a much better job at modeling physics. Your conscious mind imagines the apple falling only vaguely; your subconscious mind can learn to juggle several apples without dropping them, so it knows when they will be where.

  • @leslietetteh7292
    @leslietetteh7292 13 days ago

    We perceive possibility (which can be thought of as an extra dimension; the idea comes from "imagining extra dimensions"). I would think that if it were trained on branching "possibilities", the physics would be much more consistent. But especially with the idea of polygon-rendering-to-photoreal image-to-image inference on the horizon, there's more of a focus on speeding up inference these days (see Meta's amazing work on "Imagine Flash" with Emu). With this sort of temporal consistency, if OpenAI manages to get inference speed up, you could just use a traditional videogame physics engine with photoreal inference laid on top. It'll probably sell a lot, especially if they map electrical signals through the spinal cord to touch input and replicate that. Seeing and touching the real world through VR will be epic, and yeah, it will probably sell loads. You could train the next gen of AI engineers (think deep-sea or deep-space repair) in a simulation that looks and behaves identically to the real world.

  • @tianjiancai1118
    @tianjiancai1118 13 days ago

    Branching possibilities introduce exponentially higher cost, so knowing how to (relatively) precisely predict something is also important. Humans certainly learn possibility, and we learn certainty too.

  • @leslietetteh7292
    @leslietetteh7292 12 days ago

    @tianjiancai1118 Certainly. I'm almost sure it'd have a positive effect on modelling what are essentially 4D interactions, but with the sort of inference speed-ups we're seeing now, I'm pretty sure image-to-image inference, from polygon rendering to photorealistic, is the way to go for the easy win.

  • @tianjiancai1118
    @tianjiancai1118 12 days ago

    You mentioned an "easy win". I would argue that any generation without understanding its nature can't be precise enough. Inference speed is important, but inference quality is also important to achieve indistinguishable (or so-called no-mistake) results. Though you can speed up inference and offer real-time generation, there are still cases requiring reasonable results.

  • @leslietetteh7292
    @leslietetteh7292 10 days ago

    @tianjiancai1118 "Imagine Flash: Accelerating Emu Diffusion Models with Backward Distillation" is a really good paper by Meta that you should read; it achieves super-fast inference without really compromising on quality. There are some pretty good demos of the quality they're achieving with real-time inference.
