Learning about video and generative AI (2)

MAA1
3 min read · Mar 4, 2024

Last week I wrote about the arrival of Sora, and this week I’ll cover two new contributions to the field of video and AI: Meta’s “V-JEPA” method and Alibaba’s “EMO” model.

EMO

With EMO (Emote Portrait Alive — why wasn’t it called “EPO”?!), users can create singing or talking videos based on a static image and audio input. If you feed it a single character image and vocal audio, such as speech or singing, EMO will generate a video of that character performing the audio.
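EMO itself hasn’t been released, so there is no public interface to try. Purely for illustration, here is a hypothetical sketch of what an image-plus-audio-to-video call along these lines could look like; every name in it is an assumption for this post, not Alibaba’s code.

```python
# Hypothetical sketch only: EMO has no public API, so the interface below is an
# assumption used to illustrate the described workflow
# (single portrait image + vocal audio in, talking/singing video out).
from dataclasses import dataclass
from pathlib import Path


@dataclass
class PortraitAnimationJob:
    portrait_image: Path   # one static character image
    vocal_audio: Path      # speech or singing audio track
    output_video: Path     # where the generated clip would be written


def animate_portrait(job: PortraitAnimationJob) -> Path:
    """Stand-in for an EMO-style audio-driven portrait animation model."""
    # A real system would condition a video generator on the portrait's identity
    # and the audio signal, producing lip-synced, expressive frames.
    raise NotImplementedError("EMO is not publicly available yet.")


# How such a call might look once a model like this ships:
# animate_portrait(PortraitAnimationJob(
#     portrait_image=Path("character.png"),
#     vocal_audio=Path("song.wav"),
#     output_video=Path("singing_character.mp4"),
# ))
```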

The EMO model seems to support songs in multiple languages and brings diverse portrait styles to life, creating expression-rich avatars. Similar to Sora, the EMO method hasn’t been released to the public yet, and we don’t know which data sets it was trained on.

Image Credit: Alibaba on GitHub

Whereas the people in the videos generated through Sora appear very static in their movements and expressions, the characters in the EMO sample clips are all singing or talking, with facial expressions that feel more natural. EMO is another great example of how fast generative AI technology seems to be moving. Putting this kind of technology in the hands of the public is a different story though, as are the ethical safeguards that would need to be put in place for general usage.

V-JEPA

Last month, Meta’s VP & Chief AI Scientist Yann LeCun announced “V-JEPA”, a non-generative model that learns by predicting the missing or masked parts of a video. The model thus enables machines to predict and learn about concepts from the physical world.

Image Credit: Yann LeCun on X

V-JEPA is the vision model that builds on JEPA (Joint Embedding Predictive Architecture), which LeCun first introduced in 2022. V-JEPA is a self-supervised model, meaning that it trains itself on unlabelled video data. Unlike more traditional machine learning models, V-JEPA doesn’t rely on labelled data sets (e.g. text descriptions or human annotations). And instead of trying to fill in every single pixel, V-JEPA predicts the missing or masked parts of a video in a more abstract, conceptual representation space.
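To make that masked-prediction idea a bit more concrete, here is a minimal, hypothetical sketch of a JEPA-style training step in PyTorch. It is not Meta’s V-JEPA code: the module names (PatchEncoder, Predictor), the toy patch sizes and the tiny networks are all assumptions for illustration. What it does show is the core idea that the loss compares predicted and target embeddings of the masked patches, rather than reconstructing pixels.

```python
# Minimal, hypothetical sketch of JEPA-style masked prediction in embedding space.
# This is NOT Meta's V-JEPA code; shapes, modules and names are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 128
NUM_PATCHES = 64              # toy example: a clip split into 64 spatio-temporal patches
PATCH_DIM = 3 * 16 * 16 * 2   # toy patch size: channels x height x width x frames


class PatchEncoder(nn.Module):
    """Maps raw video patches to embeddings (stand-in for a ViT-style backbone)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(PATCH_DIM, 256), nn.GELU(), nn.Linear(256, EMBED_DIM)
        )

    def forward(self, patches):           # (batch, num_patches, patch_dim)
        return self.net(patches)          # -> (batch, num_patches, embed_dim)


class Predictor(nn.Module):
    """Predicts an embedding for every patch position, attending across the clip
    so that masked positions can draw on the visible context."""
    def __init__(self):
        super().__init__()
        self.mixer = nn.TransformerEncoderLayer(
            d_model=EMBED_DIM, nhead=4, dim_feedforward=256, batch_first=True
        )
        self.head = nn.Linear(EMBED_DIM, EMBED_DIM)

    def forward(self, context_embeddings):          # (batch, num_patches, embed_dim)
        return self.head(self.mixer(context_embeddings))


context_encoder = PatchEncoder()   # sees only the visible patches
target_encoder = PatchEncoder()    # sees the full clip; in practice an EMA copy
predictor = Predictor()
optimizer = torch.optim.AdamW(
    list(context_encoder.parameters()) + list(predictor.parameters()), lr=1e-4
)


def training_step(video_patches, mask):
    """video_patches: (batch, num_patches, patch_dim); mask: (num_patches,) bool."""
    with torch.no_grad():                          # targets are not backpropagated
        targets = target_encoder(video_patches)    # embeddings of ALL patches

    visible = video_patches.clone()
    visible[:, mask] = 0.0                         # hide the masked patches
    predictions = predictor(context_encoder(visible))

    # The loss compares predicted vs. target embeddings at the masked positions,
    # i.e. in representation space rather than pixel space.
    loss = F.smooth_l1_loss(predictions[:, mask], targets[:, mask])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Toy usage with random tensors standing in for real video patches.
patches = torch.randn(4, NUM_PATCHES, PATCH_DIM)
mask = torch.rand(NUM_PATCHES) < 0.5               # mask roughly half the patches
print(training_step(patches, mask))
```

The detail to notice is the torch.no_grad() block: the target embeddings act as fixed prediction targets for the step, so the model is rewarded for getting the higher-level representation of the hidden patches right, not for repainting their pixels.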

Main learning point: Whether it’s bringing static images to life (EMO) or teaching models about the physical world by having them watch video (V-JEPA), it will be interesting to see how people build on both use cases and when both approaches will be made publicly available.

Related links for further learning:

  1. https://venturebeat.com/ai/alibabas-new-ai-system-emo-creates-realistic-talking-and-singing-videos-from-photos/
  2. https://mashable.com/article/alibaba-emo-ai-facial-animation
  3. https://twitter.com/ylecun/status/1758195635444908409
  4. https://encord.com/blog/meta-v-jepa-explained/
  5. https://ai.meta.com/blog/v-jepa-yann-lecun-ai-model-video-joint-embedding-predictive-architecture
