
Emote Portrait Alive – Turning a Static Portrait into Realistic Video

Alibaba’s Institute for Intelligent Computing has unveiled EMO (Emote Portrait Alive), following the launch of China’s first AI cartoon series. The technique animates static portrait images, bringing them to life as talking and singing videos with striking realism.

Emote: A Breakthrough in AI Animation Technology

EMO combines artificial intelligence with video generation in impressive ways. Here’s what EMO can accomplish:

1. Animating Portraits: EMO can bring a single portrait photo to life, creating lifelike videos in which the person in the photo appears to speak or sing.

2. Audio-to-Video Generation: Unlike previous techniques that rely on intermediate 3D models or facial landmarks, EMO generates video directly from audio input. This approach enables smooth frame transitions and consistent identity preservation, resulting in highly expressive and lifelike animations.

3. Expressive Facial Expressions: EMO captures the dynamic and intricate relationship between audio signals and facial motion. It goes beyond static expressions, producing a diverse range of human emotions and facial styles.

4. Versatility: EMO can create realistic speaking and singing videos in a variety of styles, from emotional conversations to song performances.


AI Training for Emote Portrait Alive

Alibaba’s EMO is a framework for audio-driven portrait video generation: given a single image and an audio recording, it produces character head videos. By removing the need for intermediate representations, it achieves strong visual and emotional fidelity that matches the audio input. Emote Portrait Alive uses diffusion models to generate character head videos that capture subtle micro-expressions and natural head motion.

To train EMO, researchers compiled a diverse audio-video dataset containing over 250 hours of footage and more than 150 million images. The dataset spans a variety of content categories, including speeches, film and television clips, and song performances in many languages. This breadth ensures that EMO captures a wide spectrum of human expressions and vocal styles, laying a solid foundation for generalization.

Emote Portrait Alive’s AI Method

1. The EMO framework is divided into two major stages: frame encoding and the diffusion process.

  • Frame Encoding extracts features from the reference image and motion frames.
  • The Diffusion Process employs a pretrained audio encoder, facial region mask integration, and denoising.

2. Attention mechanisms maintain character identity and modulate movement.

3. Temporal Modules adjust motion velocity to produce smooth video.

4. EMO adds a FrameEncoding module to preserve consistency with the supplied reference image.

5. Stability control mechanisms, such as a speed controller and a face region controller, improve visual stability while preserving expressive diversity.
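The two-stage pipeline above can be sketched in miniature. The code below is a toy, hypothetical illustration, not Alibaba’s implementation: `encode_frames`, `audio_encoder`, and `denoise_step` are invented stand-ins, and the “denoising” is a trivial relaxation rather than a learned U-Net. It only shows how reference-image features, motion-frame features, and audio features condition an iterative denoising loop that starts from noise.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_frames(reference_image, motion_frames):
    """Stage 1 (hypothetical sketch): pool reference-image and
    motion-frame features into a single conditioning vector."""
    ref_feat = reference_image.mean(axis=(0, 1))      # identity features
    motion_feat = motion_frames.mean(axis=(0, 1, 2))  # motion context
    return np.concatenate([ref_feat, motion_feat])

def audio_encoder(audio_chunk):
    """Stand-in for the pretrained audio encoder."""
    return np.array([audio_chunk.mean(), audio_chunk.std()])

def denoise_step(noisy_frame, condition, t):
    """One toy denoising step: pull the frame toward the conditioning
    signal; a real diffusion model would predict noise with a network."""
    target = condition.mean()
    return noisy_frame + (target - noisy_frame) / (t + 1)

def generate_frame(reference_image, motion_frames, audio_chunk, steps=10):
    """Stage 2: condition on frame and audio features, then denoise."""
    cond = np.concatenate([
        encode_frames(reference_image, motion_frames),
        audio_encoder(audio_chunk),
    ])
    frame = rng.normal(size=reference_image.shape)  # start from pure noise
    for t in reversed(range(steps)):
        frame = denoise_step(frame, cond, t)
    return frame

ref = rng.random((8, 8, 3))       # single reference portrait (toy size)
motion = rng.random((4, 8, 8, 3)) # preceding motion frames
audio = rng.random(160)           # one chunk of audio samples
out = generate_frame(ref, motion, audio)
print(out.shape)
```

The output frame has the same shape as the reference image; in the real system this loop runs per video frame, with attention and temporal modules keeping identity and motion consistent across frames.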


EMO’s performance was evaluated on the HDTF dataset, where it outperformed existing state-of-the-art approaches such as DreamTalk, Wav2Lip, and SadTalker across a range of metrics. Quantitative analyses using FID, SyncNet, F-SIM, and FVD demonstrated EMO’s superiority, and user studies and qualitative evaluations confirmed its capacity to produce realistic and emotionally expressive talking and singing videos, establishing it as a leading solution in the field.
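For intuition on one of these metrics, here is a simplified Fréchet Inception Distance (FID) computation. This sketch assumes diagonal covariances over toy feature vectors; real FID uses the full covariance matrices of Inception-network features, which requires a matrix square root.

```python
import numpy as np

def fid_diagonal(feats_real, feats_gen):
    """FID under a simplifying diagonal-covariance assumption:
    ||mu1 - mu2||^2 + sum(var1 + var2 - 2*sqrt(var1*var2))."""
    mu1, mu2 = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    var1, var2 = feats_real.var(axis=0), feats_gen.var(axis=0)
    mean_term = np.sum((mu1 - mu2) ** 2)
    cov_term = np.sum(var1 + var2 - 2.0 * np.sqrt(var1 * var2))
    return mean_term + cov_term

rng = np.random.default_rng(1)
real = rng.normal(0.0, 1.0, size=(1000, 16))   # "real video" features
close = rng.normal(0.05, 1.0, size=(1000, 16)) # near the real distribution
far = rng.normal(2.0, 1.0, size=(1000, 16))    # far from it

print(fid_diagonal(real, close), fid_diagonal(real, far))  # lower is better
```

A generator whose feature distribution is closer to the real one scores a lower FID, which is why EMO’s lower FID on HDTF indicates more realistic frames.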

Challenges of the Conventional Method

Conventional methods for creating talking-head videos often produce output with a limited variety of facial expressions. Techniques such as extracting head-movement sequences from base videos or using 3D models simplify the problem, but at the expense of naturalness. EMO instead aims to provide a framework that enables natural head movements and captures a wide range of genuine facial expressions.

Disadvantages of Emote Portrait Alive

1. Slower Generation: Because EMO relies on diffusion models, it takes more time than approaches that do not.

2. Unintended Body Part Generation: EMO lacks explicit control signals to guide the character’s body movement. This absence can lead the generated video to unintentionally include extra body parts, such as hands, resulting in artifacts.


Emote Portrait Alive is a groundbreaking technology that synchronizes lip movements with audio in still images, producing smooth and expressive animations that captivate viewers. Imagine transforming a static portrait into a dynamic, talking or singing avatar: EMO makes it possible!