OpenAI's Sora team thinks we've only seen the "GPT-1 of video models"

Primary Topic

This episode delves into the advancements and future potential of OpenAI's Sora, a generative video model designed to produce complex, high-definition video content from text prompts.

Episode Summary

In this enlightening discussion on the "No Priors" podcast, the OpenAI Sora team explores the capabilities and implications of their groundbreaking video model, Sora. The conversation covers technical aspects of the model, such as its use of scalable transformers and diffusion techniques to generate visually and temporally coherent video clips. The team discusses the potential of Sora as a step towards artificial general intelligence (AGI), emphasizing its ability to simulate realistic, complex environments. They highlight feedback from early users, mostly artists, and discuss the challenges and ethical considerations in developing such advanced AI technology. The episode provides a comprehensive overview of how generative video models could revolutionize content creation, education, and simulation, underscoring the vast untapped potential akin to the early stages of language model development.

Main Takeaways

  1. Sora represents an early but significant advancement in video modeling, akin to the initial stages of language models.
  2. The model can generate complex scenes, showing potential as a tool for creating detailed, dynamic simulations.
  3. Feedback from artists has emphasized the need for greater control over the generated content.
  4. Ethical and safety considerations are crucial as the technology evolves, particularly regarding misinformation and deepfakes.
  5. The technology's future applications could extend beyond entertainment to fields like education and simulation.

Episode Chapters

1: Introduction to Sora

An overview of Sora's capabilities and its role in advancing towards AGI. The team discusses technical aspects and potential applications.

  • Aditya Ramesh: "Sora is on a critical pathway to AGI."

2: Feedback and Future Plans

Discussion on the initial feedback from artists and the ongoing development to enhance Sora's functionality and safety.

  • Tim Brooks: "We're learning about Sora's impact through feedback from artists and red teamers."

3: Ethical Considerations

The team addresses safety concerns related to the potential misuse of video models and strategies to mitigate these risks.

  • Aditya Ramesh: "We need to think hard about safety issues to reach a position that we think is going to be best."

4: Vision for the Future

Insights into the long-term potential of Sora and generative video models, including their implications for AGI and various industries.

  • Bill Peebles: "Sora will supersede traditional capabilities and enable more intelligent world modeling."

Actionable Advice

  1. Explore generative AI tools to enhance creative projects, especially in video production.
  2. Engage with AI ethics discussions to understand and mitigate potential risks associated with advanced models.
  3. Stay informed about the latest advancements in AI to leverage new tools for educational and simulation purposes.
  4. Participate in beta testing and feedback opportunities for new technologies to influence their development.
  5. Consider the long-term implications of integrating AI into your business or creative endeavors.

About This Episode

AI-generated videos are not just leveled-up image generators; rather, they could be a big step forward on the path to AGI. This week on No Priors, the team from Sora is here to discuss OpenAI's recently announced generative video model, which can take a text prompt and create realistic, visually coherent, high-definition clips that are up to a minute long.

Sora team leads Aditya Ramesh, Tim Brooks, and Bill Peebles join Elad and Sarah to talk about developing Sora. The generative video model isn't yet available for public use, but the examples of its work are very impressive. However, they believe we're still in the GPT-1 era of AI video models and are focused on a slow rollout to ensure the model is in the best place possible to offer value to the user and, more importantly, that they've applied all the safety measures possible to avoid deepfakes and misinformation. They also discuss what they're learning from implementing diffusion transformers, why they believe video generation is taking us one step closer to AGI, and why entertainment may not be the main use case for this tool in the future.

People

Aditya Ramesh, Tim Brooks, Bill Peebles, Sarah Guo, Elad Gil

Companies

OpenAI

Books

None

Guest Name(s):

None

Content Warnings:

None

Transcript

Elad Gil
Hi, listeners. Welcome to another episode of No Priors. Today we're excited to be talking to the team behind OpenAI's Sora, which is a new generative video model that can take a text prompt and return a clip that is high definition, visually coherent, and up to a minute long. Sora also raised the question of whether these large video models are world simulators and applied the scalable transformers architecture to the video domain. We're here with the team behind it, Aditya Ramesh, Tim Brooks, and Bill Peebles.

Welcome to No Priors, guys. Thanks so much for having us. To start off, why don't we just ask each of you to introduce yourselves so our listeners know who we're talking to? Aditya, mind starting us off? Sure.

Aditya Ramesh
I'm Aditya. I lead the Sora team together with Tim and Bill. Hi, I'm Tim. I also lead the Sora team. And I'm Bill. I also lead the Sora team.

Elad Gil
Simple enough. Maybe we can just start with this: the OpenAI mission is AGI, greater intelligence. Is text-to-video on the path to that mission? How'd you end up working on this? Yeah, we absolutely believe models like Sora are really on the critical pathway to AGI.

Bill Peebles
We think one sample that illustrates this kind of nicely is a scene with a bunch of people walking through Tokyo during the winter. And in that scene, there's so much complexity. So you have a camera which is flying through the scene. There's lots of people which are interacting with one another. They're talking, they're holding hands.

There are people selling items at nearby stalls. And we really think this sample illustrates how Sora is on a pathway towards being able to model extremely complex environments and worlds, all within the weights of a neural network. And looking forward, in order to generate truly realistic video, you have to have learned some model of how people work, how they interact with others, how they think, ultimately, and not only people, also animals and really any kind of object you want to model. And so, looking forward, as we continue to scale up models like Sora, we think we're going to be able to build these world simulators where essentially anybody can interact with them. I, as a human, can have my own simulator running, and I can go and give a human in that simulator work to go do, and they can come back with it after they're done.

And we think this is a pathway to AGI, which is just going to happen as we scale up Sora in the future. It's been said that we're still far away, despite massive demand for a consumer product. Like, what is that on the roadmap, what do you have to work on before you have broader access to Sora? Tim, you want to talk about that? Sure.

Tim Brooks
Yeah. So we really want to engage with people outside of OpenAI and thinking about how Sora will impact the world, how it will be useful to people. And so we don't currently have immediate plans or even a timeline for creating a product. But what we are doing is we're giving access to Sora to a small group of artists, as well as to red teamers to start learning about what impact Sora will have. And so we're getting feedback from artists about how we can make it most useful as a tool for them, as well as feedback from red teamers about how we can make this safe, how we could introduce it to the public.

And this is going to set our roadmap for our future research and inform whether we do, in the future, end up coming up with a product or not, and exactly what timelines that would have. Can you tell us about some of the feedback you've gotten? Yeah. So we have given access to Sora to, like, a small handful of artists and creators just to get early feedback. Um, in general, I think a big thing is just controllability.

Aditya Ramesh
So right now, the model really only accepts text as input. And while that's useful, it's still pretty constraining in terms of being able to, uh, specify, like, precise descriptions of what you want. So we're thinking about, like, you know, how to extend the capabilities of the model potentially in the future so that you can supply inputs other than just text. Do you all have a favorite thing that you've seen artists or others use it for, or a favorite video or something that you found really inspiring? I know that when it launched, a lot of people were really struck by just how beautiful some of the images were, how striking, how you'd see the shadow of a cat in a pool of water, things like that.

Sarah Guo
But I was just curious what you've seen emerge as more and more people have started using it. Yeah, it's been really amazing to see what the artists do with the model because we have our own ideas of some things to try, but then people who, for their profession, are making creative content are so creatively brilliant and do such amazing things. So Shy Kids had this really cool video that they made, this short story "Air Head," with this character that has a balloon, and they really, like, made this story. And there it was really cool to see a way that Sora can unlock and make this story easier for them to tell. And I think there it's even less about a particular clip or video that Sora made and more about this story that these artists want to tell and are able to share, and that Sora can help enable that.

Tim Brooks
So that is really amazing to see. You mentioned the Tokyo scene. Others? My personal favorite sample that we've created is the bling zoo. So I posted this on my Twitter the day we launched Sora, and it's essentially a multi-shot scene of a zoo in New York, which is also a jewelry store.

Bill Peebles
And so you see saber-toothed tigers kind of decked out with bling. It was very surreal. Yeah, yeah. I love those kinds of samples because as someone who loves to generate creative content, but doesn't really have the skills to do it, it's so easy to go play with this model and to just fire off a bunch of ideas and get something that's pretty compelling. The time it took to actually generate that in terms of iterating on prompts was really less than an hour to get something I really loved.

I had so much fun just playing with the model to get something like that out of it. It's great to see that the artists are also enjoying using the models and getting great content from that. What do you think is a timeline to broader use of these sorts of models for short films or other things? Because if you look at, for example, the evolution of Pixar, they really started making these Pixar shorts, and then a subset of them turned into these longer format movies.

Sarah Guo
And a lot of it had to do with how well could they actually world model, even little things like the movement of hair or things like that? And so it's been interesting to watch the evolution of that prior generation of technology, which I now think is 30 years old or something like that. Do you have a prediction on when we'll start to see actual content, either from Sora or from other models, that will be professionally produced and sort of part of the broader media genre? That's a good question. I don't have a prediction on the exact timeline, but one thing related to this I'm really interested in is what things other than traditional films people might use this for.

Tim Brooks
I do think that maybe over the next couple years we'll see people starting to make more and more films, but I think people will also find completely new ways to use these models that are just different from the current media that we're used to, because it's a very different paradigm when you can tell these models kind of what you want them to see, and they can respond in a way, and maybe they're just like new modes of interacting with content that really creative artists will come up with. So I'm actually most excited for what totally new things people will be doing that's just different from what we currently have. It's really interesting because one of the things you mentioned earlier is that this is also a way to do world modeling. And you've been at OpenAI for something like five years, and so you've seen a lot of the evolution of models in the company and what you've worked on. And I remember going to the office really early on, and it was initially things like robotic arms and self-play for games and things like that.

Sarah Guo
As you think about the capabilities of this world simulation model, do you think it'll become a physics engine for simulation where people are actually simulating wind tunnels? Is it a basis for robotics and uses it there? Is it something else? I'm just sort of curious, where are some of these other future forward applications that could emerge? Yeah, I totally think that carrying out simulations in the video model is something that we're going to be able to do in the future at some point.

Aditya Ramesh
Bill actually has a lot of thoughts about this sort of thing, so maybe you can. Yeah, I mean, I think you hit the nail on the head with applications like robotics. There's so much you learn from video which you don't necessarily get from other modalities that companies like OpenAI have invested a lot in in the past, like language, you know, like the minutiae of how arms and joints move through space, you know, again, getting back to that scene in Tokyo, how those legs are moving and how they're making contact with the ground in a physically accurate way. So you learn so much about the physical world just from training on raw video that we really believe it's going to be essential for things like physical embodiment moving forward.

Elad Gil
Talking more about the model itself, there are a bunch of really interesting innovations here. Not to put you on the spot, Tim, but can you describe for a broad technical audience what a diffusion transformer is? Totally. Sora builds on research from both the DALL-E models and the GPT models at OpenAI. And diffusion is a process that creates data, in our case videos, by starting from noise and iteratively removing noise many times until eventually you've removed so much noise that it just creates a sample.

Tim Brooks
And so that is our process for generating the videos. We start from a video of noise, and we remove it incrementally. But then architecturally, it's really important that our models are scalable and that they can learn from a lot of data and learn these really complex and challenging relationships in videos. And so we use an architecture that is similar to the GPT models, and that's called a transformer. And so diffusion transformers combine these two concepts, and the transformer architecture allows us to scale these models.

And as we put more compute and more data into training them, they get better and better. And we even released a technical report on Sora, and we show the results that you get from the same prompt when you use a smaller amount of compute, an intermediate amount of compute and more compute. And by using this method, as you use more and more compute, the results get better and better. And we strongly believe this trend will continue, so that by using this really simple methodology, we'll be able to continue improving these models by adding more compute, adding more data, and they will be able to do all these amazing things. We've been talking about having better simulation and longer term generations.
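Purely as an illustration of the sampling process described above (start from noise, remove a little at each step), here is a minimal sketch in numpy. It is not OpenAI's implementation: `denoise_step`, the tensor shapes, and the step count are hypothetical placeholders for a trained diffusion transformer.

```python
# Illustrative only: iterative denoising, with a placeholder standing in
# for a trained noise-prediction model.
import numpy as np

def denoise_step(noisy_video: np.ndarray, step: int, num_steps: int) -> np.ndarray:
    """Hypothetical noise estimate; a real diffusion transformer would run
    over spacetime patches of `noisy_video` here."""
    return noisy_video / num_steps  # placeholder: shrink the noise a little each step

def sample_video(shape=(16, 64, 64, 3), num_steps: int = 50) -> np.ndarray:
    """Start from pure Gaussian noise and subtract predicted noise per step."""
    x = np.random.randn(*shape)  # frames x height x width x channels
    for step in reversed(range(num_steps)):
        x = x - denoise_step(x, step, num_steps)
    return x

video = sample_video()
print(video.shape)  # (16, 64, 64, 3)
```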

Elad Gil
Bill, can we characterize at all what the scaling laws for this type of model look like yet? Good question. So, as Tim alluded to, one of the benefits of using transformers is that you inherit all of their great properties that we've seen in other domains, like language. So you absolutely can begin to come up with scaling laws for video, as opposed to language. And this is something that we're actively looking at in our team and not only constructing them, but figuring out ways to make them better.

Bill Peebles
If I use the same amount of training compute, can I get an even better loss without fundamentally increasing the amount of compute needed? So these are a lot of the questions that we tackle day to day on the research team to make Sora and future models as good as possible. One of the questions about applying transformers in this domain is tokenization. And so, by the way, I don't know who came up with this name, but latent spacetime patches is a great sci-fi name here. Can you explain what that is and why it is relevant here?

Elad Gil
Because the ability to do minute-long generation and get to visual and temporal coherence is really amazing. I don't think we came up with it as a name so much as a descriptive thing of exactly what it is. That's just what we call it. Even better, though. One of the critical successes for the LLM paradigm has been this notion of tokens.

Bill Peebles
So if you look at the Internet, there's all kinds of text data on it. There's books, there's code, there's math. And what's beautiful about language models is that they have this singular notion of a token which enables them to be trained on this vast swath of very diverse data. There's really no analog for prior visual generative models. So what was very standard in the past, before Sora, is that you would train, say, an image generative model or a video generative model on just like 256 by 256 resolution images, or 256 by 256 video.

That's exactly like 4 seconds long. This is very limiting because it limits the types of data you can use. You have to throw away so much of the visual data that exists on the Internet, and that limits the generalist capabilities of the model. So with Sora, we introduced this notion of spacetime patches, where you can essentially just represent data however it exists, whether in an image, in a really long video, or in a tall vertical video, by just taking out cubes.

So you can essentially imagine a video is just like a stack, a vertical stack of individual images. And so you can just take these 3d cubes out of it. And that is our notion of a token when we ultimately feed it into the transformer. And the result of this is that Sora can do a lot more than just generate, say, like 720p video for some fixed duration. You can generate vertical videos, widescreen videos.

You can do anything from a 1:2 aspect ratio to 2:1. It can generate images. It's an image generation model. And so this is really the first generative model of visual content that has breadth in a way that language models have breadth. So that was really why we pursued this direction.
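As an illustration of the patching idea described here, the sketch below cuts a (frames, height, width, channels) clip into 3D spacetime cubes and flattens each cube into one token. The function name, patch sizes, and shapes are assumptions for the example, not Sora's actual configuration.

```python
# Illustrative only: turn a video into flattened "spacetime patch" tokens.
import numpy as np

def video_to_spacetime_patches(video: np.ndarray, pt: int = 2, ph: int = 16, pw: int = 16) -> np.ndarray:
    """Split a (T, H, W, C) video into flattened spacetime patches (tokens)."""
    T, H, W, C = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0, "dims must divide evenly"
    patches = (video
               .reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
               .transpose(0, 2, 4, 1, 3, 5, 6)   # group the cube indices first
               .reshape(-1, pt * ph * pw * C))   # one row per token
    return patches

# A 16-frame, 256x256 RGB clip becomes a token sequence; any duration or
# aspect ratio works as long as the dimensions divide evenly into cubes.
clip = np.random.rand(16, 256, 256, 3)
tokens = video_to_spacetime_patches(clip)
print(tokens.shape)  # (2048, 1536)
```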

Elad Gil
It feels just as important on the input and training side, right? In terms of being able to take in different types of video. Absolutely. And so a huge part of this project was really developing the infrastructure and systems needed to be able to work with this vast data in a way that hasn't been needed for previous image or video generation systems. A lot of the models before Sora that were working on video were really looking at extending image generation models.

Tim Brooks
And so there was a lot of great work on image generation. And what many people have been doing is taking an image generator and extending it a bit. Instead of doing one image, you can do a few seconds. But what was really important for Sora and was really this difference in architecture was instead of starting from an image generator and trying to add on video, we started from scratch. And we started with the question of how are we going to do a minute of HD footage?

And that was our goal. And when you have that goal, we knew that we couldn't just extend an image generator. We knew that in order to generate a minute of HD footage, we needed something that was scalable, that broke data down in a really simple way so that we could use scalable models. So I think that really was the architectural evolution from image generators to what led us to Sora. That's a really interesting framework because it feels like it could be applied to all sorts of other areas where people aren't currently applying.

Sarah Guo
End-to-end deep learning. Yeah, I think that's right. And it makes sense because, in the short term, right, we weren't the first to come out with a video generator. A lot of people have done impressive work on video generation, but we were like, okay, we'd rather pick a point further in the future and just work for a year on that. There is this pressure to do things fast because AI is so fast, and the fastest thing to do is let's take what's working now and let's add on something to it that probably is, as you're saying, more general than just image to video, but other things.

Tim Brooks
But sometimes it takes taking a step back and saying, what will the solution to this look like in three years? Let's start building that. Yeah, it seems like a very similar transition happened in self-driving recently, where people went from bespoke, edge-case sorts of predictions and heuristics and a little bit of DL to end-to-end deep learning in some of the new models. So it's very exciting to see it applied to video. One of the striking things about Sora is just the visual aesthetic of it.

Sarah Guo
And I'm a little bit curious, how did you go about either tuning or crafting that aesthetic? Because I know that in some of the more traditional image gen models, you both have feedback that helps impact evolution of aesthetic over time. But in some cases, people are literally tuning the models. And so I'm a little bit curious how you thought about it in the context of Sora. Yeah, well, to be honest, we didn't spend a ton of effort on it.

Aditya Ramesh
For Sora, the world is just beautiful. Yeah. Oh, this is a great answer. Yes, I think that's maybe the honest answer to most of it. I think Sora's language understanding definitely allows the user to steer it in a way that would be more difficult with other models.

So you can provide a lot of hints and visual cues that will sort of steer the model toward the type of generations that you want. But it's not like the aesthetic is deeply embedded. Yeah, not yet. But I think moving to the future, I feel like the model kind of empowering people to sort of get it to grok your personal sense of aesthetic is going to be something that a lot of people will look forward to.

Many of the artists and creators that we talk to, they'd love to just upload their whole portfolio of assets to the model and be able to draw upon a large body of work when they're writing captions and have the model understand the jargon of their design firm accumulated over many decades and so on. So I think personalization and how that will kind of work together with aesthetics is going to be a cool thing to explore later on, I think. To the point Tim was making about just new applications beyond traditional entertainment: I work and I travel, and I have young kids, and so I don't know if this is something to be judged for or not, but one of the things I do today is generate what amount to short audiobooks with voice cloning, DALL-E images, and stories in the style of the Magic Tree House or whatever, around some topic that either I'm interested in, like, oh, you know, hang out with Roman emperor X. Right.

Elad Gil
Or something the girls, my kids, are interested in. But this is computationally expensive and hard and not quite possible. But I imagine there's some version of desktop Pixar for everyone, which is like, you know, I think kids are going to find this first, but I'm going to narrate a story and have magical visuals happen in real time. I think that's a very different entertainment paradigm than we have now. Totally.

Tim Brooks
I mean, are we gonna get it? Yeah, I think we're headed there. And a different entertainment paradigm and also a different educational paradigm and a communication paradigm. Entertainment's a big part of that, but I think there are actually many potential applications. Once this really understands our world, and so much of our world and how we experience it is visual.

And something really cool about these models is that they're starting to better understand our world and what we live in and the things that we do, and we can potentially use them to entertain us, but also to educate us. And, like, sometimes if I'm trying to learn something, the best thing would be if I could get a custom tailored educational video to explain it to me, or if I'm trying to communicate something to someone. You know, maybe the best communication I could do is make a video to explain my point. So I think that entertainment, but also kind of a much broader set of potential things that video models could be useful for. That makes sense.

Elad Gil
I mean, that resonates, in that I think if you asked people under a certain age cutoff, they'd say the biggest driver of education in the world is YouTube today, for better or worse. Yeah.

Sarah Guo
Have you all tried applying this to things like digital avatars? I mean, there are companies like Synthesia, HeyGen, et cetera. They're doing interesting things in this area, but having a true something that really encapsulates a person in a very deep and rich way seems kind of fascinating as one potential adaptive approach to this. I'm just sort of curious if you've tried anything along those lines yet, or if it's not really applicable, given that it's more like text-to-video prompts. So we've really focused on just the core technology behind it so far.

Tim Brooks
So we haven't focused that much, for that matter, on particular applications, including the idea of avatars, which makes a lot of sense, and I think that'd be very cool to try. I think where we are in the trajectory of Sora right now is like, this is the GPT-1 of this new paradigm of visual models, and we're really looking at the fundamental research into making these way better, making it a way better engine that could power all these different things. So our focus is just on this fundamental development of the technology right now, maybe more so than specific downstream applications. That makes sense. Yeah.

Sarah Guo
One of the reasons I ask about the avatar stuff as well is it starts to open questions around safety. And so I was a little bit curious how you all thought about safety in the context of video models and the potential for deepfakes or spoofs or things like that. Yeah, I can speak a little bit to that. It's definitely a pretty complex topic. I think a lot of the safety mitigations could probably be ported over from DALL-E 3, for example, the way we handle racy images or gory images, things like that. There's definitely going to be new safety issues to worry about.

Aditya Ramesh
For example, misinformation, or, for example, do we allow users to generate images that have offensive words on them? And I think one key thing to figure out here is how much responsibility do the companies deploying this technology bear? How much should social media companies do, for example, to inform users that content they're seeing may not be from a trusted source? And how much responsibility does the user bear for using this technology to create something in the first place? So I think it's tricky, and we need to think hard about these issues to sort of reach a position that we think is going to be best for people.

Sarah Guo
That makes sense. Also, there's a lot of precedent. Like, people used to use Photoshop to manipulate images and then publish them and make claims. And it's not like people said that therefore the maker of Photoshop is liable for somebody abusing the technology. So it seems like there's a lot of precedent in terms of how you can think about some of these things as well.

Aditya Ramesh
Yeah, totally. We want to release something where people feel like they really have the freedom to express themselves and do what they want to do. But at the same time, sometimes that's at odds with doing something that is responsible and sort of gradually releasing the technology in a way that people can get used to it. I guess a question for all of you, maybe starting with Tim, is, and if you can share this, great, if not, understood, what is the thing you're most excited about in terms of the future product roadmap or where you're heading or some of the capabilities that you're working on next? Yeah, great question.

Tim Brooks
I'm really excited about the things that people will create with this. I think there are so many brilliant, creative people with ideas of things that they want to make, and sometimes being able to make that is really hard because it requires resources or tools or things that you don't have access to. And there's the potential for this technology to enable so many people with brilliant creative ideas to make things. And I'm really excited for what awesome things they're going to make and that this technology will help them make. Bill, maybe one question for you would just be, if this is, as you just mentioned, the GPT-1, we have a long way to go.

Elad Gil
This isn't something that the general public has an opportunity to experiment with yet. Can you sort of characterize what the limitations are or the gaps are that you want to work on besides the obvious around length? Right? Yeah. So I think in terms of making this something that's more widely available, there's a lot of serving kind of considerations that have to go in there.

Bill Peebles
So a big one here is making it cheap enough for people to use. So we've said in the past that in terms of generating videos, it depends a lot on the exact parameters of the resolution and the duration of the video you're creating, but it's not instant, and you have to wait at least a few minutes for these really long videos that we're generating. And so we're actively working on threads here to make that cheaper in order to democratize this more broadly. I think there's a lot of considerations, as Aditya and Tim were alluding to, on the safety side as well. So in order for this to really become more broadly accessible, we need to make sure that, especially in an election year, we're being really careful with the potential for misinformation and any surrounding risks.

We're actively working on addressing these threads today. That's a big part of our research roadmap. What about just core, for lack of a better term, quality issues? Are there specific things, like object permanence or certain types of interactions, you're thinking through? Yeah.

So as we look forward to the GPT-2 or GPT-3 moments, I think we're really excited for very complex long term physical interactions to become much more accurate. So to give a concrete example of where Sora falls short today, if I have a video of someone playing soccer and they're kicking around a ball, at some point that ball is probably going to vaporize and maybe come back. So it can do certain kinds of simpler interactions pretty reliably, you know, things like people walking, for example. Um, but these types of more detailed object-to-object interactions are definitely, uh, you know, still a feature that's in the oven, and we think it's going to get a lot better with scale, but that's something to look forward to moving forward.

Elad Gil
There's one sample that I think is like a glimpse of the future. I mean, I'm sure there are many, but there's one I've seen, uh, which is, um, you know, a man taking a bite of a burger and the bite mark being in the burger, in terms of, like, keeping state, which is very cool. Yeah, yeah, we are really excited about that one. Also, there's another one where it's like a woman painting with watercolors on a canvas, and it actually leaves a trail. So there's glimmers of this kind of capability in the current model, as you said, and we think it's going to get much better in the future.

Is there anything you can say about how the work you've done with Sora affects the broader research roadmap? Yeah, so I think something here is about the knowledge that Sora ends up learning about the world just from seeing all this visual data. It understands 3D, which is one cool thing, because we haven't trained it to. We didn't explicitly bake 3D information into it whatsoever. We just trained it on video data.

Tim Brooks
And it learned about 3D because 3D exists in those videos. And it learned that when you take a bite out of a hamburger, that you leave a bite mark. So it's learning so much about our world, and when we interact with the world, so much of it is visual. So much of what we see and learn throughout our lives is visual information. So we really think that just in terms of intelligence, in terms of leading toward AI models that are more intelligent, that better understand the world like we do, this will actually be really important for them to have this grounding of, like, hey, this is the world that we live in.

There's so much complexity in it. There's so much about how people interact, how things happen, how events in the past end up impacting events in the future, that this will actually lead to just much more intelligent AI models, more broadly than even generating videos. It's almost like you invented, like, the future visual cortex plus some part of the reasoning parts of the brain or something sort of simultaneously. Yeah, and that's a cool comparison, because a lot of the intelligence that humans have is actually about world modeling. Right.

All the time, when we're thinking about how we're going to do things, we're playing out scenarios in our head. We have dreams where we're playing out scenarios in the head. We're thinking in advance of doing things. If I did this, this thing would happen. If I did this other thing, what would happen?

Right? So we have a world model, and building Sora as a world model is very similar to a big part of the intelligence that humans have. How do you guys think about the analogy to humans as having a very approximate world model versus something that is as accurate as, let's say, a physics engine in the traditional sense, because if I hold an apple and I drop it, I expect it to fall at a certain rate. But most humans do not think of that as articulating a path with the speed as a calculation. Do you think that sort of learning is parallel in large models?

Bill Peebles
I think it's a really interesting observation. I think how we think about things is that it's almost like a deficiency in humans that it's not so high fidelity. So the fact that we actually can't do very accurate long term prediction, when you get down to a really narrow set of physics, is something that we can improve upon with some of these systems. And so we're optimistic that Sora will supersede that kind of capability and will, in the long run, enable it to be more intelligent one day than humans as world models. But it is certainly an existence proof that it's not necessary for other types of intelligence.

Regardless of that, it's still something that Sora and models in the future will be able to improve upon. Okay, so it's very clear that the trajectory prediction for throwing a football is going to be better in the next versions of these models than mine, let's say. If I could add something to that: this relates to the paradigm of scale and the bitter lesson a bit, about how we want methods that, as you increase compute, get better and better.

Tim Brooks
And something that works really well in this paradigm is doing the simple but challenging task of just predicting data. And you can try coming up with more complicated tasks. For example, something that doesn't use video explicitly, but is maybe in some like space that simulates approximate things or something. But all this complexity actually isn't beneficial when it comes to the scaling laws of how methods improve as you increase scale. And what works really well as you increase scale is just predict data.

And that's what we do with text, we just predict text. And that's exactly what we're doing with visual data with Sora, which is, we're not making something complicated or trying to figure out some new thing to optimize. We're saying, hey, the best way to learn intelligence in a scalable manner is to just predict data, and that makes sense. And relating to what you said, Bill, predictions will just get much better with no necessary limit that approximates humans. Is there anything you feel like the general public misunderstands about video models or about Sora, or you want them to know?

Aditya Ramesh
I think maybe the biggest update to people with the release of Sora is that internally, we've always made an analogy, as Bill and Tim said, between Sora and GPT models, in that when GPT-1 and GPT-2 came out, it started to become increasingly clear to some people that simply scaling up these models would give them amazing capabilities. It wasn't clear right away if scaling up next-token prediction would result in a language model that's helpful for writing code. To us, it's felt pretty clear that applying the same methodology to video models is also going to result in really amazing capabilities. And I think Sora 1 is kind of an existence proof that there's one point on the scaling curve now, and we're very excited for what this is going to lead to.

Elad Gil
Yeah, amazing. Well, I don't know why it's such a surprise to everybody, but the bitter lesson wins again. Yeah, yeah. I would just say that, as both Tim and Aditya were alluding to, we really do feel like this is the GPT-1 moment, and these models are going to get a lot better very quickly, and we're really excited both for the incredible benefits we think this is going to bring to the creative world and what the implications are long term for AGI.

Bill Peebles
And at the same time, we're trying to be very mindful about the safety considerations and building a robust stack now to make sure that society is actually going to get the benefits of this while mitigating the downsides. But it's exciting times, and we're looking forward to what future models are going to be capable of. Yeah, congrats on such an amazing, amazing release. Find us on Twitter @NoPriorsPod. Subscribe to our YouTube channel if you want to see our faces.

Elad Gil
Follow the show on Apple Podcasts, Spotify, or wherever you listen. That way you get a new episode every week, and sign up for emails or find transcripts for every episode at no-priors.com.