A couple of weeks ago, we saw two interesting breakthroughs in the world of AI-generated content: Sora and V-JEPA. In this post I want to touch on Sora and its promise (and potential dangers) for the field of content production.
Sora
OpenAI has now introduced Sora, an AI model that generates video content from text. In its promo video you can watch examples of video created with Sora, from a cartoon kangaroo disco dancing to a drone view of waves crashing against the rugged cliffs along Big Sur’s Garay Point Beach.
OpenAI claims that Sora is able to generate complex scenes with multiple characters, specific types of motion, and accurate details of both subject and background. Similar to Google Gemini and DALL-E, you give Sora a text prompt and it generates a video for you. OpenAI also claims that the model behind Sora understands not only what the user has asked for in the prompt, but also how those things exist in the physical world.
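Sora itself has no public API yet, but the prompt-in, content-out workflow is the same one DALL-E already exposes. As a rough illustration of that interaction pattern, here is how a text prompt drives generation through OpenAI’s published Python SDK and DALL-E 3; anything Sora-specific remains unknown, so treat this purely as an analogy.

```python
# Illustrative only: Sora has no public API at the time of writing.
# This shows the analogous prompt-to-content pattern with DALL-E 3,
# which OpenAI exposes through its Python SDK today.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.images.generate(
    model="dall-e-3",
    prompt="A cartoon kangaroo disco dances",  # one of the Sora demo prompts
    size="1024x1024",
    n=1,
)
print(response.data[0].url)  # URL of the generated image
```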
Sora hasn’t yet been released to the general public, and little is known about the data it was trained on. Dr. Jim Fan at Nvidia speculates that Sora is being trained on large amounts of synthetic data generated with a game engine. Stefano Ermon at Stanford thinks that OpenAI might be compressing Sora’s training data into a more compact ‘latent representation’ so that it requires less computing power; generation would then start from a noisy, low-resolution state in that latent space and be gradually denoised into a high-resolution video.
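Ermon’s hypothesis matches the standard latent-diffusion recipe: compress pixels into a compact latent, run the denoising steps there, and only decode back to pixels at the end. Here is a minimal numpy sketch of that loop, with made-up shapes and toy stand-ins for the learned encoder, decoder, and denoiser, purely to show where the compute savings come from:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: a 480x480 RGB frame vs. an 8x-downsampled latent.
PIXEL_SHAPE = (480, 480, 3)   # ~691k values per frame
LATENT_SHAPE = (60, 60, 4)    # ~14k values per frame (~48x smaller)

def encode(frames):
    """Stand-in for a learned encoder (used at training time): pixels -> latent."""
    return rng.standard_normal((len(frames), *LATENT_SHAPE))

def decode(latents):
    """Stand-in for a learned decoder: latent -> pixels."""
    return rng.standard_normal((len(latents), *PIXEL_SHAPE))

def denoise_step(z, t):
    """Stand-in for the learned denoiser; a real model predicts the noise."""
    return z * 0.98  # toy: just shrink toward zero

# Generation starts from pure noise *in latent space*, not pixel space,
# so every denoising step touches ~48x fewer values.
z = rng.standard_normal((16, *LATENT_SHAPE))  # 16 noisy latent "frames"
for t in reversed(range(50)):                 # 50 denoising steps
    z = denoise_step(z, t)

video = decode(z)  # a single decode back to pixel space at the very end
print(video.shape)  # (16, 480, 480, 3)
```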
OpenAI uses a transformer-based architecture to create high-resolution video. In its technical report (linked below), OpenAI compares video samples generated with fixed seeds and inputs at different points during training: the quality of each sample improves dramatically as training compute increases.
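That technical report describes representing videos as ‘spacetime patches’ that a diffusion transformer processes as tokens, much like an LLM processes text tokens. Here is a rough numpy sketch of that patching step; the patch dimensions are my assumption, since OpenAI hasn’t published Sora’s actual values:

```python
import numpy as np

# A toy video tensor: (time, height, width, channels).
video = np.zeros((16, 64, 64, 3))

# Assumed patch size: 2 frames x 16 x 16 pixels per patch.
PT, PH, PW = 2, 16, 16

T, H, W, C = video.shape
patches = (
    video.reshape(T // PT, PT, H // PH, PH, W // PW, PW, C)
         .transpose(0, 2, 4, 1, 3, 5, 6)
         .reshape(-1, PT * PH * PW * C)
)

# Each row is one "spacetime patch" -- a token for the transformer.
print(patches.shape)  # (128, 1536): 8 * 4 * 4 patches of 2*16*16*3 values
```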
The compute required to generate video of a given resolution and length will be substantial. It will be interesting to see whether Sora will eventually become available to the public through a (paid) web and mobile app.
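To see why resolution and length drive cost, count tokens: with the spacetime-patch framing above, token count grows linearly with duration and with pixel area, and full transformer self-attention scales roughly quadratically in token count. A back-of-envelope calculation, where the patch size, frame rates, and resolutions are all my assumptions:

```python
# Back-of-envelope token counts, reusing the assumed 2x16x16 patch size above.
def num_tokens(frames: int, height: int, width: int,
               pt: int = 2, ph: int = 16, pw: int = 16) -> int:
    return (frames // pt) * (height // ph) * (width // pw)

short_clip = num_tokens(frames=120, height=480, width=864)   # ~5 s at 480p
long_clip = num_tokens(frames=1440, height=720, width=1280)  # ~60 s at 720p

ratio = long_clip / short_clip
print(f"{short_clip:,} vs {long_clip:,} tokens "
      f"(~{ratio:.0f}x more tokens, ~{ratio**2:.0f}x more attention compute)")
```

Running this gives 97,200 vs 2,592,000 tokens, i.e. roughly 27x more tokens and on the order of 700x more attention compute for the longer, sharper clip, which is why resolution and duration matter so much for cost.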
Main learning point: Even from watching the first content samples generated with Sora, it’s plain to see its potential impact (and the threat it poses to existing content producers). Sora will enable people to create high-quality short-form (and potentially long-form) content.
However, without the right guardrails in place, people could use Sora to easily create violent or sexually explicit content, or to reuse IP-protected content. In light of the generative AI acceleration over the past year, I can’t wait to see how Sora will evolve over the coming months!
Related links for further learning:
- https://www.fastcompany.com/91029951/meta-v-jepa-yann-lecun
- https://twitter.com/ylecun/status/1758195635444908409
- https://ai.meta.com/blog/v-jepa-yann-lecun-ai-model-video-joint-embedding-predictive-architecture/
- https://encord.com/blog/meta-v-jepa-explained/
- https://a16z.simplecast.com/episodes/beyond-uncanny-valley-breaking-down-sora-iPPRTIOb
- https://www.techradar.com/computing/artificial-intelligence/openai-sora
- https://openai.com/research/video-generation-models-as-world-simulators
- https://www.bloomberg.com/news/newsletters/2024-02-22/openai-s-sora-video-generator-is-impressive-but-not-ready-for-prime-time