Interview with the Sora team: How was it built? How long does generation take? When can we use it?
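The core Sora team just gave an interview
and shared a lot of previously undisclosed details.
I listened to the recording four times,
then put together a verbatim English transcript, paired with a translation of each turn.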
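Host
It's a real honor to have you all take time out of your busy schedules for this conversation.
Before we start, could you each introduce yourselves: your names and what you're responsible for?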
First of all thank you guys for joining me. I imagine you're super busy, so this is much appreciated. If you don't mind, could you go one more time and give me your names and then your roles at OpenAI.
Bill Peebles
I'm Bill Peebles, a lead on the Sora project at OpenAI.
My name is Bill Peebles. I'm a lead on Sora here at OpenAI.
Tim Brooks
I'm Tim Brooks, a research lead on Sora.
My name is Tim Brooks. I'm also a research lead on Sora.
Aditya Ramesh
I'm Aditya; likewise, I'm also a lead.
I'm Aditya. I lead the Sora team.
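Host
I know a bit about Sora, mostly from the announcement, the website, and the demo videos you released, and it's really impressive. Can you briefly explain how Sora actually works? We've covered DALL-E and diffusion before, but honestly I can't quite figure out how Sora does it.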
Okay, so I've reacted to Sora. I saw the announcement and the website and all those prompts and example videos that it made that you guys gave, and it was super impressive. Can you give me a super concise breakdown of how exactly it works? Cause we've explained DALL-E before and diffusion before, but how does Sora make videos?
Bill Peebles
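Put simply, Sora is a generative model. The past few years have seen a lot of very cool generative models, from language models like the GPT family to image generation models like DALL-E.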
Yeah, at a high level Sora is a generative model, so there have been a lot of very cool generative models over the past few years, ranging from language models like the GPT family to image generation models like DALL-E.
Bill Peebles
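Sora is a model built specifically to generate video. By looking at a huge amount of video data, it has learned to generate video of all kinds of real and digital scenes.
More specifically, it borrows from diffusion-based models like DALL-E while also using the architecture of the GPT family of language models. You could say Sora is trained much like DALL-E, but architecturally it is closer to the GPT family.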
Sora is a video generation model, and what that means is it looks at a lot of video data and learns to generate photorealistic videos. The exact way it does that kind of draws techniques from both diffusion-based models like DALL-E as well as large language models like the GPT family. It's kind of like somewhere in between; it's trained like DALL-E, but architecturally it looks more like the GPT family. But at a high level, it's just trained to generate videos of the real world and of digital worlds and of all kinds of content.
Host
So, like the other models, Sora creates content based on what it was trained on. What is Sora's training data?
It creates a huge variety of stuff, kind of the same way the other models do, based on what it's trained on. What is Sora trained on?
Tim Brooks
We can't go into much detail on that 😊
Broadly, though, it includes publicly available data as well as data that OpenAI has licensed.
We can't go into much detail on it, but it's trained on a combination of data that's publicly available as well as data that OpenAI has licensed.
Tim Brooks
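There is one thing worth sharing, though:
Previously, image and video models were typically trained at a single fixed size. We trained Sora on videos of different durations, aspect ratios, and resolutions.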
One innovation that we had in creating Sora was enabling it to train on videos at different durations, as well as different aspect ratios and resolutions. And this is something that's really new. So previously, when you trained an image or video generation model, people would typically train them at a very fixed size like only one resolution, for example.
Tim Brooks
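As for how: we take all kinds of images and videos, whether widescreen or tall, long or short, high resolution or low resolution, and cut them all into small pieces we call patches.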
But what we do is we take images, as well as videos, of all wide aspect ratios, tall long videos, short videos, high resolution, low resolution, and we turn them all into these small pieces we call patches.
Tim Brooks
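Then we can train the model on a different number of patches depending on the size of the input video.
This lets the model learn flexibly from a wider variety of data, and also generate content at different resolutions and sizes.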
Note from 大聪明: see my earlier piece, 《中学生能看懂:Sora 原理解读》 ("Sora Explained So a Middle Schooler Can Follow").
And then we're able to train on videos with different numbers of patches, depending on the size of the input, and that allows our model to be really versatile to train on a wider variety of data, and also to be used to generate content at different resolutions and sizes.
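The patch idea Tim describes can be illustrated with a small sketch. The interview does not say how Sora actually forms its patches, so everything here is an assumption made for illustration: the function name `patchify`, the patch sizes, the single-channel nested-list video, and the choice to drop leftover frames or pixels. It only shows why clips of different durations and aspect ratios yield different numbers of equally sized patches.

```python
from typing import List

# A toy single-channel video, indexed as video[frame][row][col].
Video = List[List[List[float]]]


def patchify(video: Video, pt: int, ph: int, pw: int) -> List[List[float]]:
    """Split a video into non-overlapping spacetime patches.

    Each patch covers pt frames x ph rows x pw columns and is flattened
    into one vector. Trailing frames, rows, or columns that do not fill
    a whole patch are dropped (an assumption of this sketch; padding
    would be another option).
    """
    frames, height, width = len(video), len(video[0]), len(video[0][0])
    patches = []
    for t0 in range(0, frames - pt + 1, pt):
        for y0 in range(0, height - ph + 1, ph):
            for x0 in range(0, width - pw + 1, pw):
                patches.append([
                    video[t][y][x]
                    for t in range(t0, t0 + pt)
                    for y in range(y0, y0 + ph)
                    for x in range(x0, x0 + pw)
                ])
    return patches


def make_video(frames: int, height: int, width: int) -> Video:
    """Build a dummy video whose pixel values encode their own position."""
    return [
        [[float(t * height * width + y * width + x) for x in range(width)]
         for y in range(height)]
        for t in range(frames)
    ]


# A short widescreen clip and a longer vertical clip, same patch size.
wide = patchify(make_video(4, 4, 8), pt=2, ph=2, pw=2)
tall = patchify(make_video(8, 8, 4), pt=2, ph=2, pw=2)
print(len(wide), len(tall))  # prints: 16 32
```

Both clips produce patches of the same length (8 values each), but the number of patches differs, which is the property that lets a single model consume inputs of varying size.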
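Host
You've been using, building, and developing it for a while now, so maybe you can clear something up for me.
I make videos myself, and I know there is a lot to handle: lighting, reflections, all kinds of physics, moving objects, and so on.
So here's my question: as it stands, what do you find Sora is good at, and where does it still fall short? For example, I saw one video where a hand had six fingers.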
You've had access to using it, building it, developing it for some time now. And obviously, there's a, maybe not obviously, but there's a ton of variables with video. Like I make videos, I know there are lighting, reflections, you know, all kinds of physics and moving objects and things involved. What have you found that Sora in its current state is good at? And maybe there are things that are specifically weaknesses, like I'll show the video that I asked for in a second, where there are six fingers on one hand. But what have you seen are the particular strengths and weaknesses of what it's making?
Tim Brooks
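Sora is especially good at photorealistic video, and the videos can be long, up to a minute, which is a real leap over what was possible before.
But it still falls short in some areas. As you mentioned, Sora doesn't yet handle hands well, and it also struggles with some aspects of physics. In the 3D printer example we released, for instance, it doesn't quite get things right. It also has trouble with very specific requests, such as a particular camera trajectory over time. So there is still room to improve on certain physical phenomena and on motion or trajectories that unfold over time.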
It definitely excels at photo realism, which is a big step forward. And the fact that the videos can be so long, up to a minute, is really a leap from what was previously possible. But some things it still struggles with. Hands in general are a pain point, as you mentioned, but also some aspects of physics. And like in one of the examples with the 3D printer, you can see it doesn't quite get that right. And also, if you ask for a really specific example like camera trajectory over time, it has trouble with that. So some aspects of physics and of the motion or trajectories that happen over time, it struggles with.
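Host
It's really interesting to see what Sora does well.
As you said, some of the videos are very refined in their lighting, reflections, and even close-ups and textures. And just like with DALL-E, you can ask Sora for a style such as shot on 35mm film, or shot on a DSLR with a blurred background.
But these videos have no sound yet. I'm wondering whether adding AI-generated sound to AI-generated video would be a huge extra lift, or whether it's more complicated than I realize. How far away do you think that is?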
It's really interesting to see the stuff it does well, because like you said, there are those examples of really good photorealism with lighting and reflections and even close-ups and textures. And just like DALL-E, you can give it styles like shot in 35mm film or shot, you know, like from a DSLR with a blurry background. There are no sounds in these videos, though. I'm super curious if it would be a gigantic extra lift to add sound to these, or if it's more complicated than I'm realizing. How far does it feel like you are from being able to also have AI-generated sound in an AI-generated video?
Bill Peebles
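It's hard to give exact timelines for this kind of thing.
Our first priority has been to make the video generation model itself more capable. Before this, a lot of AI-generated video was about four seconds long, with a fairly low frame rate and mediocre quality. So that is where most of our effort has gone so far.
Of course, we agree that adding sound would make the videos much more immersive. But for now, Sora is mainly focused on video generation.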
It's hard to give exact timelines with these kinds of things. For this first version, we were really focused on pushing the capabilities of video generation models forward, because before this, you know, a lot of AI-generated video was like 4 seconds of pretty low frame rate and the quality wasn't great. So that's where a lot of our effort so far has been. We definitely agree though that, you know, adding in these other kinds of content would make videos way more immersive. So it's something that we're definitely thinking about. But right now, Sora is mainly just a video generation model and we've been focused on pushing the capabilities in that domain, for sure.
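Host
You've put a great deal of work into Sora, and the progress is plain to see. I'm curious how you decided it had reached the point where it was ready to show to the world.
Like DALL-E, it stunned everyone when it was announced, a real mic drop moment. And given what Sora already does well, how do you decide what to improve next? Are there any criteria or reference points?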
So okay, DALL-E has improved a lot over time. It's gotten better, it's improved in a lot of ways and you guys are constantly developing and working towards making Sora better. First of all, how did you get to the point where you'd gotten good enough with it that you knew it was ready to share with the world and we had this mic drop moment? And then how do you know how to keep moving forward and making things that it's better at?
Tim Brooks
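You may have noticed that we haven't formally released Sora; so far we've only shared it in the form of a blog post with example videos. The main reason is that, before it's truly ready, we want to gather feedback to understand how this technology could be valuable to people and what safety work still needs to be done. That will set our research roadmap going forward.
Sora today is not a product, and it isn't available in ChatGPT or on any other platform. We'll keep improving it based on the feedback we collect, but exactly how is still an open question.
By showing it publicly we hope to hear more, for example from safety experts on how to make it safe and from artists on how it could fit into their workflows. That will shape our agenda going forward.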
A big motivation for us, really the motivation for why we wanted to get Sora out in this, like a blog post form, but it's not yet ready, is to get feedback to understand how this could be useful to people, also what safety work needs to be done. And this will really set our research roadmap moving forward. So it's not currently a product. It's not available in ChatGPT or anything. And we don't even have any current timelines for when we would turn this into a product. But really, right now we're in the feedback-getting stage. So we want to, you know, we'll definitely be improving it, but how we should improve it is kind of an open question. And we wanted to show the world this technology that's on the horizon, and start hearing from people about how could this be useful to you, hear from safety experts how could we make this safe for the world, hear from some artists how could this be useful in your workflows, and that's really going to set our agenda moving forward.
Host
What feedback have you heard so far?
What have you heard so far?
Tim Brooks
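One thing: people want finer, more direct control over the generated video, beyond just a short prompt.
That's an interesting direction, and it's definitely something we'll be looking into.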
One piece of feedback we've definitely heard is that people are interested in having more detailed controls. So that will be an interesting direction moving forward, whereas right now it's about, you know, you have this maybe kind of short prompt. But people are really interested in having more control over exactly the content that's generated, so that's definitely one thing we'll be looking into.
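Host
Right, some people may just want to make sure the video is widescreen or vertical, or well lit, without having to put effort into elaborate prompt engineering. That's an interesting idea.
Next topic: is there a future where Sora can generate video that is indistinguishable from real footage? My guess is yes.
That's how DALL-E has gone: it has become more and more capable over time.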
Interesting, I can imagine just wanting to make sure it's widescreen or make sure it's vertical or it's well-lit or something like that, just to not have to worry about prompt engineering, I guess. Okay, so I guess as if you've been working on this stuff for a long time, is there a future where you can generate a video that is indistinguishable from a normal video? Because that's how it feels like DALL-E has evolved over time where you can ask for a photorealistic picture and it can make that. Is that something you could imagine actually being possible? I guess probably yes, because we've seen it do so much already.
Aditya Ramesh
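I think so too, and as we approach that point we want to be more careful.
People should be able to tell whether a video they see is real or AI-generated. We want to make sure these capabilities aren't used to spread misinformation.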
Eventually I think it's going to be possible, but of course as we approach that point we want to be careful about releasing these capabilities so that, you know, people on social media are aware of when a video they see could be real or fake. You know, when a video that they see comes from a trusted source, we want to make sure that these capabilities aren't used in a way that could perpetuate misinformation or something.
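Host
Sora-generated videos have a watermark in the bottom corner, which is obviously important. But a watermark like that can be cropped out.
I'm curious whether there are other ways to identify AI-generated video.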
I saw there's a watermark in the bottom corner of Sora-generated videos, which obviously is pretty important, but a watermark like that can be cropped. I'm curious if there are other ways that you guys think about being able to easily identify AI-generated videos, especially with a tool like Sora.
Aditya Ramesh
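For DALL·E 3, we trained provenance classifiers that can tell whether an image was generated by the model.
We're working on adapting that to video as well. It won't be a complete solution, but it's a first step.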
For DALL·E 3, we trained provenance classifiers that can tell if an image was generated by the model or not. We're working on adapting that technology to work for Sora videos as well. That won't be a complete solution in and of itself, but it's kind of like a first step.
Host
Got it. So it's like adding metadata or some kind of embedded flag, so that if you work with the file you know it's AI-generated.
Got it. Kind of like metadata or like a sort of embedded flag, so that if you play with that file, you know it's AI generated.
Aditya Ramesh
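C2PA does that, but the classifier we trained can be run directly on any image or video, and it tells you whether it thinks the media was generated by one of our models.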
C2PA does that but the classifier that we trained can just be run on any image or video and it tells you if it thinks that the media was generated by one of our models or not.
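Host
Got it. I'd also like to know how you personally feel.
Obviously you had to wait until you felt Sora was ready to show the world what it can do. What has it been like seeing other people's reactions to Sora?
A lot of people say "this is super cool, this is amazing," but others worry, "oh no, my job is in danger." How do you take in all these different reactions?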
Got it. What I'm also curious about is your reaction. You obviously had to get to the point where Sora comes out and you think it's ready for the world to see what it's capable of. What's been your reaction to other people's reactions to Sora? There's a lot of "this is super cool, this is amazing" but there's also a lot of "oh my God, my job is in danger." How do you digest all of the different ways people react to this thing?
Aditya Ramesh
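I can sense people's anxiety about what comes next. As part of our mission, we want to deploy this technology in a safe and responsible way, taking into account its impact on everything people already do with video.
At the same time, I see a lot of opportunity. Right now, if someone wants to make a movie, it can be very hard to get funding because budgets are so large, and production companies have to weigh the risk of their investment carefully. This is where AI could help, by drastically lowering the cost of going from an idea to a finished video.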
I felt like a lot of the reception was like, definitely, you know, some anxiety as to what's going to happen next. And we definitely feel that in terms of, you know, our mission to make sure that this technology is deployed in a safe way, in a way that's responsible to all of the things people are already doing involving video generation. But I also felt like a lot of opportunity, like right now, for example, if a person has an idea for a movie they want to produce, it can be really difficult to get funding to actually produce the movie because the budgets are so large. You know, production companies have to be aware of the risk associated with the investment that they make. One cool way that I think AI could help is if it drastically lowers the cost to go from idea to a finished video.
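Host
Sora and DALL·E do have a lot in common, especially in how people will use them.
I often use DALL·E as a brainstorming tool, for example to sketch out concept images such as a video thumbnail, and it helps a lot. I can see similar creative use cases being especially great with Sora.
I know there's no specific timeline for opening up Sora yet, but do you think it will be soon?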
Yeah, there's a lot of parallels with DALL·E just in the way I feel like people are going to use it. Because when DALL·E got really good, I started - I mean, I can use it as a brainstorming tool. I can use it to sort of visualize a thumbnail for a video, for example. I could see a lot of the same cool-like use cases being particularly awesome with Sora. I know you're not giving timelines, but you're in the testing phase now. Do you think it's going to be available for public use anytime soon?
Aditya Ramesh
Not that soon, I think 😊
Not any time soon, I think.
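Host
One last question: far down the road, when Sora can make five-minute YouTube videos with sound and perfect photorealism, which medium makes sense to tackle next?
After all, video is far more complex than images: it involves time, physics, and many other dimensions, plus new challenges such as reflections and sound.
Honestly, you got into video generation much faster than I expected. So what's next on the horizon for AI-generated media in general?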
I guess my last question is, way down the road, way down into the future, when Sora is making five-minute YouTube videos with sound and perfect photorealism. What medium makes sense to dive into next? I mean, photos is one thing, videos have this whole dimension with time and physics and all these new variables with reflections and sound. You guys are, you jumped into this faster than I thought. What is next on the horizon for AI-generated media in general?
Tim Brooks
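I'm excited to see people use AI to create completely new things. (Note from 大聪明: come take a look at 离谱村.)
Recreating things that already exist isn't hard; using new tools to create things that have never existed is what's really exciting!
What motivates me most is putting these tools in the hands of truly creative people, so they can make things that weren't previously possible and keep pushing the boundaries of creativity. That's incredibly exciting!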
So something I'm really excited for is how the use of AI tools evolves into creating completely new content and I think a lot of it will be us learning from how people use these tools to do new things. But often it's easy to think about how they could be used to create existing things. But I actually think they'll enable completely new types of content. It's hard to know what that is until it's in the hands of the most creative people, but really creative people when they have new tools do amazing things. They make new things that were not previously possible. That's really what motivates me a lot. Long term, it's like how could this turn into completely new experiences in media that currently aren't possible, that currently we're not even thinking about. It's hard to picture exactly what that is, but I think that will be really exciting to just be pushing the creative boundaries and allowing really creative people to push those boundaries by making completely new tools.
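Host
That is interesting!
My sense is that since these models are trained on existing content, what they generate can only build on what already exists. The only way to get them to be creative is probably through the prompt you give them.
You have to work at phrasing your request cleverly and figuring out how to steer the model. Is that right?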
Yeah, it's interesting. I feel like the way it works is that since it's trained on existing content, it can only produce things based on what already exists. The only way to get it to be creative is with your prompt, I imagine. You have to get clever with the learning curve of prompt engineering and figuring out what to say to it. Is that accurate?
Bill Peebles
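Beyond text prompts, Sora can be guided in other ways too.
For example, in the research post we released, we showed blending between two input videos: the video on the left starts as a drone flying through the Colosseum, and it gradually transitions into the one on the right, a butterfly swimming underwater. At one point in between, the Colosseum gradually decays, looks as if it's covered in coral, and is partially submerged.
Generated videos like these really do feel new compared with what was possible before, in both the technology and the experience.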
There are other kinds of cool capabilities that the model has, sort of beyond just like text-based prompting. So in our research post that we released with Sora, we had an example where we showed blending between two input videos, and there was one really cool example of that where the video on the left starts out as a drone flying through the Colosseum, and on the right it gradually transitions into like a butterfly swimming underwater. There's a point in there where the Colosseum gradually begins decaying and looking as if it's covered in coral reefs and is partially underwater. These kinds of, you know, generated videos really do kind of start to feel a bit new relative to what's been possible in the past with older forms of technology, and so we're excited about these kinds of things, even beyond just prompting, as being new experiences that people can generate with technology like Sora.
Aditya Ramesh
In a sense, we see modeling reality as the first step toward transcending it!
In some ways we really see modeling reality as the first step to being able to transcend it.
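Host
Wow, that's really cool and really interesting!
The better Sora can model reality, the faster we can build on top of it. Ideally it becomes a tool that unlocks new creative possibilities and sparks more creative thinking.
Super cool!
If there's anything else you'd like people to know, now is a good time. After all, you've been working on this longer than anyone else has been able to see what it can do. What else would you like people to know about Sora and OpenAI?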
Wow! I like that, it's really interesting yeah. The better it is able to model reality, the faster you're able to sort of build on top of it, and ideally that unlocks new creative possibilities as a tool and all kinds of other things. Super cool! Well, I'll leave it open to if there's anything else you want people to know. Obviously, you guys have been working on this longer than anyone else has gotten to see what it does or play with it. What else do you want the world to know about Sora and OpenAI?
Tim Brooks
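Another thing we're especially excited about is that learning from video data will make AI useful well beyond video creation. We live in a world full of visual information, and a lot of information about that world can't be conveyed through text alone.
Models like GPT are already very intelligent and understand a lot about the world, but if they can't "see" the world the way we do, they are missing some information.
So we have high hopes for Sora and for future AI models built on top of it. By learning from visual data, they should come to understand the world we live in better, and with that deeper understanding they'll be able to help us better in the future.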
I think another thing we're excited about is how learning from video data will make AI more useful a bit more broadly than just creating videos, because we live in a world where we see things kind of like a video that we're watching, and there's a lot of information about the world that's not in text. While models like GPT are really intelligent and understand a lot about the world, there is information that they're missing when they don't see the visual world in a way similar to how we do. So one thing we're excited about for Sora and other AI models moving forward that build on top of Sora is that by learning from visual data about the world, they will hopefully just have a better understanding of the world we live in, and in the future be able to help us better just because they understand things better.
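Host
That is super cool! I imagine there's a huge amount of compute and a lot of talented engineering behind it.
Honestly, I'm looking forward to the day I can use Sora myself, so keep me posted!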
That is super cool. I imagine there's a lot of computing and a lot of talented engineering that goes into that. So I wish you guys the best of luck. I mean eventually when I'm able to plug in more stuff into Sora, I'm very excited for that moment too. So keep me posted.
Bill Peebles
No problem.
Will do.
Host
Thanks!
Thank you.
OpenAI Team
Thank you.
Thanks.
A thousand years later...
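Host
By the way, there's a fun question I forgot to ask them. It didn't come up during the recording, but everyone wants to know: how long does it take Sora to generate a video from a single prompt?
I asked them off camera, and the answer was: it depends, but you could go get a coffee and come back, and it might still be working on the video.
So the answer is "quite a while."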
One more fun fact: I forgot to ask them during the recording, but everyone wanted to know, how long does it take to generate a video with Sora with a single prompt? I did ask them off camera, and the answer was: it depends, but you could go get coffee and come back, and it would still be working on the video. So "a while" seems to be the answer.