Robotics

Robots learn to lip-sync

A humanoid learned to control its facial motors by watching itself in a mirror before imitating human lip movement from online videos.

Researchers have developed a robot that can learn facial lip movements for tasks such as speech and singing. Designed by a team at Columbia Engineering, the humanoid aims to address one of the most persistent challenges in robotics: reproducing natural, human-like facial motion.

Almost half of our attention during face-to-face conversation is focused on lip movement, yet robots have long struggled to replicate it convincingly. Even the most advanced humanoids manage little more than exaggerated, puppet-like mouth motions, if they have a face at all.

Humans place disproportionate importance on facial gestures, particularly lip movement. While an unusual walking gait or stiff hand motion is often overlooked, even minor errors in facial expression tend to stand out. This sensitivity contributes to the “uncanny valley”, where near-human robots appear unsettling rather than lifelike.

Lip movement plays a central role in this effect. When it is absent or poorly executed, robots can seem lifeless or eerie. The system developed at Columbia Engineering is designed to address this gap by enabling a robot to learn and reproduce coordinated lip movements for speech and singing.

In a new study published in Science Robotics, the researchers demonstrate how the robot articulates words in multiple languages and even sings a song from its AI-generated debut album, Hello World.

Rather than relying on predefined rules, the robot acquired this ability through observational learning. It first learned to control its 26 facial motors by watching its own reflection in a mirror, before learning to imitate human lip movement by analysing hours of YouTube videos.

“The more it interacts with humans, the better it will get,” says Hod Lipson, James and Sally Scapa Professor of Innovation in the Department of Mechanical Engineering and director of Columbia’s Creative Machines Lab, where the work was done.

Robot watches itself talking 

Achieving realistic robot lip motion is challenging for two reasons. First, it requires specialised hardware containing a flexible facial skin actuated by numerous tiny motors that can work quickly and silently in concert. Second, the specific pattern of lip dynamics is a complex function dictated by sequences of vocal sounds and phonemes. 

Human faces are animated by dozens of muscles that lie just beneath a soft skin and sync naturally to the vocal cords and lip motions. By contrast, humanoid faces are mostly rigid, operating with relatively few degrees of freedom, and their lip movement is choreographed according to rigid, predefined rules. The resulting motion is stilted, unnatural, and uncanny.

In this study, the researchers overcame these hurdles by developing a richly actuated, flexible face and then allowing the robot to learn how to use its face directly by observing humans. First, they placed a robotic face equipped with 26 motors in front of a mirror so that the robot could learn how its own face moves in response to motor activity. Like a child making faces in a mirror for the first time, the robot made thousands of random facial expressions and lip gestures. Over time, it learned how to move its motors to achieve particular facial appearances, an approach known as a vision-language-action (VLA) model.

Photo courtesy Science Robotics.
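In essence, the mirror stage is motor babbling followed by fitting an inverse model from observed facial appearance back to motor commands. The sketch below illustrates that idea only in broad strokes: the landmark count, the `observe_face` stand-in for the mirror camera, and the simple linear regression are all assumptions for illustration, not the researchers' actual learned models.

```python
import numpy as np

rng = np.random.default_rng(0)

N_MOTORS = 26         # facial motors, per the article
N_LANDMARKS = 2 * 68  # hypothetical: 68 (x, y) lip/face landmarks seen in the mirror

# Stand-in for the robot's real face physics: a fixed, unknown motor -> landmark map.
_W_TRUE = 0.1 * rng.standard_normal((N_MOTORS, N_LANDMARKS))

def observe_face(motors):
    """Placeholder for the mirror camera plus a landmark detector."""
    return motors @ _W_TRUE + 0.01 * rng.standard_normal(N_LANDMARKS)

# 1. Motor babbling: issue thousands of random commands and record what the face does.
commands = rng.uniform(-1.0, 1.0, size=(5000, N_MOTORS))
faces = np.array([observe_face(c) for c in commands])

# 2. Fit an inverse model (landmarks -> motors) by ridge-regularised least squares.
lam = 1e-3
A = faces.T @ faces + lam * np.eye(N_LANDMARKS)
B = faces.T @ commands
inverse_model = np.linalg.solve(A, B)   # shape: (N_LANDMARKS, N_MOTORS)

# 3. Given a target facial appearance, recover motor commands that reproduce it.
target_face = observe_face(rng.uniform(-1.0, 1.0, size=N_MOTORS))
predicted_motors = target_face @ inverse_model
print("reconstruction error:",
      np.linalg.norm(observe_face(predicted_motors) - target_face))
```

In the real system the mapping between motors and facial appearance is far from linear, which is why the researchers learn it from thousands of self-observations rather than specifying it by hand.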

Then, the researchers placed the robot in front of recorded videos of humans talking and singing, giving the AI that drives the robot an opportunity to learn exactly how human mouths move in relation to the sounds they produce. With these two models in hand, the robot's AI could translate audio directly into lip motor action.
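At run time, the two learned models are chained: audio is mapped to the lip shapes a human would make, and those lip shapes are mapped to motor commands via the self-model learned in the mirror. The sketch below shows that chaining only; the per-frame audio feature size, the `lip_sync` helper, and the two random matrices standing in for the learned models are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

N_AUDIO_FEATS = 40    # hypothetical per-frame audio features (e.g. mel bands)
N_LANDMARKS = 2 * 68  # same landmark layout assumed in the mirror stage
N_MOTORS = 26

# Stand-ins for the two learned models described in the article:
# (a) audio -> lip landmarks, learned from videos of humans talking and singing,
# (b) landmarks -> motor commands, the inverse self-model from the mirror stage.
AUDIO_TO_LIPS = 0.05 * rng.standard_normal((N_AUDIO_FEATS, N_LANDMARKS))
LIPS_TO_MOTORS = 0.05 * rng.standard_normal((N_LANDMARKS, N_MOTORS))

def lip_sync(audio_features):
    """Translate per-frame audio features into per-frame motor commands
    by chaining the two models."""
    lip_targets = audio_features @ AUDIO_TO_LIPS      # what a human mouth would do
    motor_commands = lip_targets @ LIPS_TO_MOTORS     # what the robot's motors should do
    return np.clip(motor_commands, -1.0, 1.0)         # respect actuator limits

# Example: 100 frames of (fake) audio features -> 100 frames of motor commands.
audio = rng.standard_normal((100, N_AUDIO_FEATS))
motors = lip_sync(audio)
print(motors.shape)   # (100, 26): one command per motor per audio frame
```

Because the pipeline works from the audio signal alone, it does not need to understand what is being said, which is consistent with the robot syncing to languages it has no knowledge of.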

The researchers tested this ability using a variety of sounds, languages, and contexts, as well as some songs. Without any specific knowledge of what the audio clips meant, the robot was able to move its lips in sync with each one.

The researchers acknowledge that the lip motion is far from perfect.

“We had particular difficulties with hard sounds like ‘B’ and with sounds involving lip puckering, such as ‘W’. But these abilities will likely improve with time and practice,” says Lipson. 

More important, however, is seeing lip sync as part of a more holistic robot communication ability.

Yuhang Hu, who led the study for his PhD, says: “When the lip sync ability is combined with conversational AI such as ChatGPT or Gemini, the effect adds a whole new depth to the connection the robot forms with the human. The more the robot watches humans conversing, the better it will get at imitating the nuanced facial gestures we can emotionally connect with.

“The longer the context window of the conversation, the more context-sensitive these gestures will become.” 

The missing link of robotic ability

The researchers believe that facial affect is the ‘missing link’ of robotics. 

Lipson says: “Much of humanoid robotics today is focused on leg and hand motion, for activities like walking and grasping. But facial affect is equally important for any robotic application involving human interaction.”

Lipson and Hu predict that warm, lifelike faces will become increasingly important as humanoid robots find applications in areas such as entertainment, education, medicine, and even elder care. Some economists predict that over a billion humanoids will be manufactured in the next decade.

“There is no future where all these humanoid robots don’t have a face,” says Lipson. “And when they finally have a face, they will need to move their eyes and lips properly, or they will forever remain uncanny.”

Hu says: “We humans are just wired that way, and we can’t help it. We are close to crossing the uncanny valley.”

Risks and limits

The work is part of Lipson’s decade-long quest to find ways to make robots connect more effectively with humans, through mastering facial gestures such as smiling, gazing, and speaking. He insists that these abilities must be acquired by learning, rather than being programmed using stiff rules. 

“Something magical happens when a robot learns to smile or speak just by watching and listening to humans,” he says. “I’m a jaded roboticist, but I can’t help but smile back at a robot that spontaneously smiles at me.”

Hu says: “Robots with this ability will clearly have a much better ability to connect with humans because such a significant portion of our communication involves facial body language, and that entire channel is still untapped.” 

The researchers are aware of the risks and controversies surrounding granting robots greater ability to connect with humans. 

Lipson says: “This will be a powerful technology. We have to go slowly and carefully, so we can reap the benefits while minimising the risks.”
