A new AI model from Microsoft Research Asia has given us something you probably thought you’d never see: the Mona Lisa rapping. The model can generate deepfake videos from a single still image and an audio track.
This isn’t the first AI model of its kind, but it is more realistic and accurate than previous versions. Called VASA-1, the model was trained on footage of 6,000 talking faces from the VoxCeleb2 data set. Once trained, it needs only a single headshot image of a person and an audio clip to create a realistic video of that person lip-syncing the supplied audio.
It generates video at 512x512 pixels and 40 frames per second “with negligible starting latency.” You can also adjust controls for facial dynamics and head pose to achieve effects such as specific emotions, expressions, and gaze direction. “Such technology holds the promise of enriching digital communication, increasing accessibility for those with communicative impairments, transforming education methods with interactive AI tutoring, and providing therapeutic support and social interaction in health care,” said a paper describing the technology.
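Microsoft has not released VASA-1 or any public API, so there is no real code to show. The Python sketch below is purely illustrative: a hypothetical interface for the workflow described above, in which every name (generate_talking_head, GenerationControls, the parameter defaults) is an assumption modeled on the paper’s description, not actual released software.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path

# Illustrative sketch only: VASA-1 has no public release or API, so every
# name and signature below is a hypothetical assumption.


@dataclass
class GenerationControls:
    """Optional conditioning signals of the kind the paper describes."""
    gaze_direction: tuple[float, float] = (0.0, 0.0)  # left/right, up/down offsets
    head_distance: float = 1.0   # apparent distance from the camera
    emotion: str = "neutral"     # e.g. "happy", "angry", "surprised"


def generate_talking_head(
    headshot: Path,
    audio: Path,
    controls: GenerationControls | None = None,
    resolution: int = 512,   # 512x512 output, per the paper
    fps: int = 40,           # frame rate reported by the paper
) -> Path:
    """Hypothetical one-shot call: single headshot + audio clip -> video file.

    A real implementation would roughly (1) encode the headshot into an
    identity/appearance latent, (2) generate audio-conditioned facial
    dynamics and head motion in that latent space, then (3) decode the
    latents into video frames.
    """
    controls = controls or GenerationControls()
    raise NotImplementedError("VASA-1 is not publicly available.")


# How such an interface might be invoked, if it existed:
if __name__ == "__main__":
    try:
        generate_talking_head(
            headshot=Path("mona_lisa.jpg"),
            audio=Path("rap_verse.wav"),
            controls=GenerationControls(gaze_direction=(0.2, -0.1), emotion="happy"),
        )
    except NotImplementedError as err:
        print(err)
```

The point of the sketch is the shape of the interface: one image, one audio clip, and a handful of optional controls matching the knobs (gaze direction, head pose, emotion) that the researchers say the model exposes.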