
LivePortrait: Efficient Portrait Animation with Stitching and Retargeting Control

This article introduces the technical principles behind LivePortrait.


Figure: portrait animation results from the LivePortrait model.

Introduction

Unlike the current mainstream diffusion-based methods, LivePortrait explores and extends the potential of the implicit-keypoint framework, balancing computational efficiency and controllability. LivePortrait aims for better generalization, controllability, and practical efficiency. To improve generation quality and controllability, it trains on roughly 69 million high-quality frames with a mixed image-video training strategy, upgrades the network architecture, and designs better motion modeling and optimization methods.

In addition, LivePortrait treats implicit keypoints as an effective implicit representation of facial blendshapes and, building on this idea, proposes stitching and retargeting modules. These modules are lightweight MLPs, so their computational cost is negligible while they improve controllability. Even compared to some existing diffusion-based methods, LivePortrait still performs well. On an RTX 4090 GPU, it reaches a single-frame generation time of 12.8 ms, and with further optimization such as TensorRT this is expected to drop below 10 ms. Training proceeds in two stages: the first trains the base model, and the second trains the stitching and retargeting modules.
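To make the implicit-keypoint pipeline concrete, below is a minimal PyTorch-style sketch of a single animation step: extract the source appearance features and implicit keypoints, extract the driving motion, warp the source features from the source keypoints to the driven keypoints, and decode the output frame. The function and module names (`appearance_net`, `motion_net`, `warp_net`, `decoder`, `transform_keypoints`) are illustrative placeholders, not LivePortrait's actual API.

```python
import torch

@torch.no_grad()
def animate_frame(source_img, driving_img, appearance_net, motion_net,
                  warp_net, decoder, transform_keypoints):
    """One inference step of an implicit-keypoint animation pipeline
    (an illustrative sketch; the real LivePortrait internals may differ)."""
    # 3D appearance feature volume of the source portrait
    # (can be cached and reused across driving frames)
    f_s = appearance_net(source_img)

    # Canonical implicit keypoints plus motion parameters
    # (scale, rotation, expression deltas, translation) for both frames
    x_c, motion_s = motion_net(source_img)
    _, motion_d = motion_net(driving_img)

    # Source and driven keypoints in image space
    x_s = transform_keypoints(x_c, *motion_s)
    x_d = transform_keypoints(x_c, *motion_d)

    # Warp the source features from x_s toward x_d, then decode the frame
    warped = warp_net(f_s, x_s, x_d)
    return decoder(warped)
```

Because the source-side appearance features and canonical keypoints only need to be computed once per source image, per-frame work is dominated by motion extraction, warping, and decoding, which is what makes millisecond-level frame times plausible.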

You can also try it for free on our playground.

Methodology

Stage I: Base Model Training

Figure: LivePortrait pipeline of the first stage (base model training).

In the first stage of training, LivePortrait makes a series of improvements to implicit-keypoint-based frameworks such as Face Vid2vid, including the following (a sketch of the scaled motion transformation follows the list):

1. High-quality data curation.

2. Mixed image and video training.

3. Upgraded network architecture.

4. Scalable motion transformation.

5. Landmark-guided implicit keypoint optimization.
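As a concrete illustration of item 4, the sketch below assumes the keypoint model used by implicit-keypoint frameworks such as Face Vid2vid and LivePortrait, where driven keypoints are obtained from canonical keypoints via a rotation, an expression offset, a scale, and a translation, roughly x = s · (x_c R + δ) + t. The exact shapes and the keypoint count are assumptions for illustration.

```python
import torch

def transform_keypoints(x_c: torch.Tensor, s: torch.Tensor, R: torch.Tensor,
                        delta: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Scaled motion transformation of canonical implicit keypoints:
    x = s * (x_c @ R + delta) + t.

    x_c:   (B, K, 3) canonical keypoints
    s:     (B, 1, 1) scale factor
    R:     (B, 3, 3) head-pose rotation matrix
    delta: (B, K, 3) expression deformation
    t:     (B, 1, 3) translation
    """
    return s * (x_c @ R + delta) + t


# Identity motion leaves the canonical keypoints unchanged (K = 21 assumed here).
x_c = torch.randn(1, 21, 3)
x_driven = transform_keypoints(
    x_c,
    s=torch.ones(1, 1, 1),
    R=torch.eye(3).unsqueeze(0),
    delta=torch.zeros(1, 21, 3),
    t=torch.zeros(1, 1, 3),
)
assert torch.allclose(x_driven, x_c)
```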

Stage II: Stitching and Retargeting

Figure: LivePortrait pipeline of the second stage (stitching and retargeting module training).

LivePortrait treats implicit keypoints as an implicit representation of facial blendshapes and finds that this blending can be learned well with just a lightweight MLP, at negligible computational cost. With practical use in mind, LivePortrait designs a stitching module, an eye-retargeting module, and a mouth-retargeting module. Because the source portrait is cropped before animation, the driven portrait has to be mapped back from the cropped space to the original image space; the stitching module is added to avoid pixel misalignment during this paste-back, for example around the shoulders. As a result, LivePortrait can drive motion on larger images or group photos.

The eye-retargeting module addresses incomplete eye closure during cross-identity driving, especially when a portrait with small eyes drives a portrait with large eyes. The mouth-retargeting module follows the same idea: it normalizes the input by driving the mouth of the source image to a closed state, thereby achieving better driving results.
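Since these modules are described as lightweight MLPs that predict keypoint offsets, the following is a hedged sketch of what such a module could look like. The class name `RetargetingMLP`, the hidden sizes, and the exact conditioning inputs (flattened source/driving keypoints here) are assumptions for illustration; the actual stitching and retargeting modules may condition on different quantities, such as eye- or lip-openness ratios.

```python
import torch
import torch.nn as nn

class RetargetingMLP(nn.Module):
    """Lightweight MLP in the spirit of LivePortrait's stitching / eye / mouth
    modules: it maps a small conditioning vector to per-keypoint 3D offsets
    that are added to the driven implicit keypoints. Sizes are illustrative."""

    def __init__(self, in_dim: int, num_kp: int = 21, hidden: int = 128):
        super().__init__()
        self.num_kp = num_kp
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_kp * 3),  # one 3D offset per keypoint
        )

    def forward(self, cond: torch.Tensor) -> torch.Tensor:
        # cond: (B, in_dim) conditioning vector, e.g. flattened source and
        # driving keypoints for stitching, or keypoints plus an openness
        # ratio for the eye / mouth retargeting modules
        return self.net(cond).view(-1, self.num_kp, 3)


# Usage sketch: nudge the driven keypoints so the animated crop pastes back
# into the original image space without pixel misalignment (e.g. shoulders).
num_kp = 21
stitcher = RetargetingMLP(in_dim=2 * num_kp * 3, num_kp=num_kp)
x_s = torch.randn(1, num_kp, 3)   # source keypoints
x_d = torch.randn(1, num_kp, 3)   # driven keypoints
offsets = stitcher(torch.cat([x_s.flatten(1), x_d.flatten(1)], dim=1))
x_d_stitched = x_d + offsets
```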

Experiment Results

1. Self-reenactment Results

The first four source-driving pairs are from TalkingHead-1KH [5] and the remaining ones are from VFHQ [51]. Our model faithfully preserves lip movements and eye gaze, handles large poses more stably, and maintains the identity of the source portrait better than other methods.

Qualitative comparisons of self-reenactment.

2. Cross-reenactment Results

Qualitative comparisons of cross-reenactment. The first three source portraits are from FFHQ [52] and the last two are celebrities. Driving portraits are randomly selected from TalkingHead-1KH [5], VFHQ [51], and NeRSemble [53]. We present the animated portraits without stitching in the cropped space, as well as the final results after stitching and pasting back into the original image space. As in self-reenactment, our model better transfers lip movements and eye gaze from another person while maintaining the identity of the source portrait.

Qualitative comparisons of cross-reenactment.