The Codec Avatars project is all about defying distance
UPDATE: April 14, 2023 — A few years ago we introduced Codec Avatars, a Reality Labs (RL) Research project that uses groundbreaking computer vision technology and machine learning systems to automatically create highly realistic digital representations that accurately reflect how a person looks and moves in real life, in real time. Codec Avatars will be crucial to how we connect and interact with people who are geographically distant from us in the future. You can read more about the work below, or check out last year’s Connect for a look at our most recent breakthroughs.
The next step in our effort is enhancing avatar representation through the Codec Avatar Inclusive Dataset Research Project. Today, creating a high-quality and unique Codec Avatar requires several hours of image and audio capture in a lab using highly customized cameras and microphones, followed by weeks spent processing that data to generate the final avatar.
To scale Codec Avatars, generating an avatar will need to be relatively quick and easy using just your mobile phone, a process we call Instant Codec Avatars. We’re working on this, but to succeed, we’ll need a vast and truly diverse data set that represents a broad range of gender expression, body types, hairstyles and textures, skin tones, facial features, and more. Everything that makes up “you,” no matter who you are.
And capturing this level of diversity requires more people than we could reasonably ask to visit us in our Pittsburgh lab, so we’re taking this show on the road. While transporting this lab technology isn’t easy, we’ve found an innovative way to load our equipment onto trucks and will begin taking it to Meta campuses, recruiting both internal and external research participants who want to contribute to the data set through this fully voluntary research project.
As part of a session, participants will do things like mimic expressions, read sentences and converse, and make eye movements — a process that takes a few hours and is similar to what participants in our Pittsburgh lab already experience. The information received won’t be used to create Codec Avatars of each contributor, but will help us get to a future where everyone can create their own Codec Avatar easily, at home, without specialized equipment.
We’re starting with Meta campuses so we can optimize the process, but our goal is to eventually take this mobile setup to communities outside of tech hubs like the San Francisco Bay Area to enhance representation within our data set. Participants are given the option to self-identify using demographic forms, and our hope is that the Codec Avatar Inclusive Dataset Research Project will help us accurately represent a more diverse group of people as a result.
Still, there’s a long way to go before people are fully empowered to generate avatars that reflect every little nuance. For instance, we don’t yet have the ability to show avatars with accessories like glasses, which is pretty fundamental. We have much more to do, but we’re excited to take this next step as we work to make Codec Avatars better represent us all.
ORIGINAL STORY: Yaser Sheikh, the Director of Research at Facebook Reality Labs in Pittsburgh, is deeply invested in creating new and better ways for people to connect, even when they’re on opposite sides of the world. “Most of us, myself included, don’t live in the places where we grew up,” he says. “I’ve spent my life moving from city to city, and each time, I’ve left relationships that are important to me.”
That focus on connection is what’s driving Sheikh’s work leading a project called Codec Avatars, which seeks to overcome the challenges of physical distance between people, and between people and opportunity. Using groundbreaking 3D capture technology and AI systems, Codec Avatars could let people in the future create lifelike virtual avatars of themselves quickly and easily, helping social connections in virtual reality become as natural and common as those in the real world. While avatars have been a staple of video games and apps for years, Sheikh believes incredibly accurate virtual representations of people — those that can perfectly capture a wry smile or a furrowed brow — will be a game changer.
This is the first in a series of blog posts exploring the work happening at Facebook Reality Labs. We’ll take you inside FRL and introduce you to the people helping build the future of connection. Learn more below, and click here to read a note from FRL’s Chief Scientist, Michael Abrash.
Codec Avatars is an active research project today, but it could radically change the way we connect through VR headsets and AR glasses tomorrow. It’s not just about cutting-edge graphics or advanced motion tracking. It’s about making it as natural and effortless to interact with people in virtual reality as it is with someone right in front of you. The challenge lies in creating authentic interactions in artificial environments.
If telepresence lets you feel like you’re somewhere else, then social presence lets you share that sensation with other people. Sheikh talks about two simple but important ways to measure success. “We colloquially refer to this as passing the ‘ego test’ and the ‘mother test,’” he says. “You have to love your avatar and your mother has to love your avatar before the two of you feel comfortable interacting like you would in real life. That’s a really high bar.”
The first time you answered a video call, no one had to tell you why the technology mattered. It brought you closer to everyone — and it meant you could work in your pajamas. The jump from video calls to avatar calls will deliver genuine social presence, a bit like talking to someone in a Star Trek holodeck, where participants could hang out in simulated environments like they were actually there. Getting Codec Avatars to work in a way that’s authentic and comfortable is a huge design challenge the Pittsburgh team has been working on for years, and the team is deeply invested in getting it right.
Work on Codec Avatars is an essential milestone on the road to authentic social presence that’s accessible, practical, and ready for future headsets. “Right now, proximity determines whom we have relationships with,” Sheikh says. “The real promise of augmented reality and virtual reality is that it lets us spend time with whomever we wish and build meaningful relationships no matter where people live.” This is the future of connection, which makes it an essential part of Facebook’s core mission to help build communities and bring people closer together.
Erasing physical distance between people is a significant undertaking that requires long-term commitment. In the fall of 2014, Sheikh met Michael Abrash, then the Chief Scientist at Oculus Research. At the time, Sheikh was leading the Panoptic Studio, a 3D capture laboratory at the Robotics Institute at Carnegie Mellon University. The two met to discuss the creation of a new research facility in Pittsburgh and eventually homed in on social presence as the overarching goal. Their first order of business: assemble a multidisciplinary team of engineers, technicians, and scientists to “build the future,” as Abrash put it. Sheikh joined Facebook in 2015 and has been leading the Pittsburgh team since.
Facebook Reality Labs has offices across the U.S., including in Redmond, Washington; Sausalito, California; and Pittsburgh, Pennsylvania. Each location is tackling its own set of challenges required to establish AR and VR as the next computing platform, from machine learning and materials science to optics and haptics. “FRL is the holy grail of institutions for practical research work,” says Stephen Lombardi, a Research Scientist at FRL. “We’ve got amazing resources and support to do our job, and I’m able to work with incredibly intelligent people. This has allowed me to achieve more than I ever could have on my own.”
For Danielle Belko, a Technical Program Manager at FRL, her work at the Pittsburgh lab started with a wild suggestion from Sheikh. He asked her if she’d like to “analyze data on systems that aren’t invented yet, on a scale that no one has done before, to do things people say are impossible.” She signed up. “I have a background in linguistics and entertainment technology, so I’m fascinated by how people communicate. The opportunity was too good to ignore,” she says.
Jason Saragih, a Research Scientist at FRL, chased his passion for computer vision straight through the doors of FRL. “I’ve worked on human modeling in computer vision and graphics for over a decade and consider AR and VR the ultimate vehicles for this kind of technology,” says Saragih. Lombardi agrees. “FRL is making significant investments in the future of immersive platforms,” he says. “It’s exciting to contribute to it, especially now that we’re using computer vision, machine learning, and cutting-edge graphics tech to make realistic avatars.”
Chuck Hoover, General Manager at FRL Pittsburgh, was looking to get in on the ground floor of something big. “It’s really the long-term impact that has me excited,” he says. “Could we live anywhere and eliminate commuting altogether?” Decoupling the social aspects of life from physical dependencies is potentially world changing, Hoover says. “Being a part of something that can change everything — from such an early stage — is exhilarating.”
While the social and cultural implications of Codec Avatars and social presence are huge, working out of the Pittsburgh office has other perks, like exploring the world’s most advanced hardware systems just because you can. “It dawned on us that we owned the world’s most advanced scanning device,” says FRL Research Scientist Shoou-I Yu. “We started scanning people’s shoes, toys, dry ice, burning candles, and anything we could think of.” Scanning everyday objects sounds random, but it’s all in the name of building a better algorithm so future hardware can easily render even the most complex avatars.
Lifelike avatars are a popular concept in science fiction, like in the movie TRON, where a software programmer finds himself being reconstructed bit by bit inside a computer. That’s not what’s happening here, of course; you’re not getting sucked into a machine, and avatars aren’t video game characters that happen to look like you. But the idea is similar: digitally beaming yourself from one location to another and feeling like it’s real.
The key to lifelike avatars is physical details, even subtle ones we take for granted every day, like the way your eyes are darting around this paragraph. They’re all crucial pieces to the puzzle. “We have to capture all of these subtle cues to get it all to work properly,” says Yu. “It’s both challenging and empowering because we’re working to let you be you.”
The visual effects industry has been working on lifelike avatars for years, but those require the talents of artists to match the likeness of each avatar with its actor. It’s a manual process that takes months of production time. Live interactions between avatars in artificial reality is uncharted territory and requires a fresh approach.
Facebook has worked on virtual avatars for several years. At F8 2016, Facebook Chief Technology Officer Mike Schroepfer introduced new avatars for Facebook Spaces, replacing the floating blue head in use at the time with an updated model featuring new facial features and lip movement. At F8 last year, he debuted Facebook’s work on more lifelike avatars, developed by FRL Pittsburgh. In the brief demo, audiences saw two realistic digital people animated in real time by members of the team.
The FRL team has made significant progress since Schroepfer debuted their work on lifelike avatars. “We’ve completed two capture facilities, one for the face and one for the body,” says Sheikh. “Each one is designed to reconstruct body structure and to measure body motion at an unprecedented level of detail. Reaching these milestones has enabled the team to take captured data and build an automated pipeline to create photorealistic avatars.” With recent breakthroughs in machine learning, these ultra-realistic avatars can be animated in real time.
Codec Avatars isn’t the only approach to realistic avatars that FRL is pursuing. A different team at FRL Sausalito is exploring physics-based avatars that can interact with any virtual environment. This work combines fundamental research in areas like biomechanics, neuroscience, motion analysis, and physically driven simulations. This technique still relies on live data capture, just like Codec Avatars, but instead of the live sensor data driving a neural network, it drives a physics-based model inspired by human anatomy (more to come on that approach later this year).
If you’re going to replicate something as nuanced as a chat between two people, you first need to understand how human interaction works. Then you need to package it in a way computer systems can understand. It might sound simple, but holding even a basic conversation requires a complex web of signals all working in concert to convey meaning between participants. It’s these signals, made up of speech, body language, linguistic cues, and so on, that the Codec Avatars system packages into quantifiable data for use in rendering realistic virtual humans. The goal, as mentioned previously, is to create virtual interactions that are indistinguishable from real ones.
“The cornerstone is measurement,” says FRL Research Scientist Tomas Simon. “Realism is driven by accurate data, which requires good measurements. The key to building real avatars, then, is finding a way to measure physical details in human expression, like the way a person squints their eyes or wrinkles their nose.”
At the Pittsburgh lab, Codec Avatars measure human expression through two primary functions: an encoder and a decoder. First, the encoder uses a system of cameras and microphones on the headset to capture what the subject is doing and where he or she is doing it. Once captured, the encoder takes the information and assembles a unique code, a numeric representation of the state of a person’s body and environment that is ready to send wherever it needs to go. The decoder then translates this code into audio and visual signals the recipient sees as a picture-perfect representation of the sender’s likeness and expression.
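The encoder-to-code-to-decoder flow described above can be sketched as a toy round trip. Everything in this snippet is a hypothetical illustration of the codec idea only: the function names, the fixed code size, and the averaging/upsampling math are placeholders, not FRL's actual neural models.

```python
CODE_SIZE = 8  # size of the compact numeric "code" sent over the wire (illustrative)

def encode(sensor_frame):
    """Compress raw headset sensor readings into a small numeric code.

    sensor_frame: list of floats standing in for camera/microphone features.
    Returns a fixed-size list of floats, the "code" that gets transmitted.
    """
    # Toy compression: average-pool the frame into CODE_SIZE buckets.
    bucket = max(1, len(sensor_frame) // CODE_SIZE)
    return [sum(sensor_frame[i:i + bucket]) / bucket
            for i in range(0, len(sensor_frame), bucket)][:CODE_SIZE]

def decode(code, output_size):
    """Reconstruct an audio/visual signal of output_size samples from the code.

    Toy reconstruction: nearest-neighbor upsampling back to full resolution.
    """
    return [code[min(i * len(code) // output_size, len(code) - 1)]
            for i in range(output_size)]

# A constant input round-trips exactly under this toy scheme.
frame = [0.5] * 64          # stand-in for one frame of sensor data
code = encode(frame)        # small code, cheap to transmit
recon = decode(code, 64)    # receiver's reconstruction
assert len(code) == CODE_SIZE
assert all(abs(x - 0.5) < 1e-9 for x in recon)
```

The point of the sketch is the shape of the system, not its math: only the small code crosses the network, and the heavy lifting happens at each end.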
Codec Avatars represent a major leap on the road to social presence. It takes what’s happening in the Pittsburgh lab today — creating a database of physical traits using a small group of participants — and delivers a way for consumers in the future to create their own avatars without a capture studio and without much data. “It is one of the first approaches to produce a real digital likeness of a person automatically,” says Saragih. “It presents a way forward in virtual face-to-face communication that can scale broadly. Virtual interactions that feel like the person’s right there in front of you are a big step in achieving our goal of connecting people.”
Your average 10-megapixel smartphone camera uses millions of light sensors to produce vivid pictures. Using captured data and fancy software, a smartphone can automatically adjust ambient light, field of view, and other factors to give you the best possible photo. Building Codec Avatars is also a combination of physical data and sophisticated software, but there’s a lot more involved than what’s in your average Instagram post.
Codec Avatars need to capture your three-dimensional profile, including all the subtleties of how you move and the unique qualities that make you instantly recognizable to friends and family. And, for billions of people to use Codec Avatars every day, making them has to be easy and without fuss. FRL approached the challenge by creating a pair of world-class capture studios — one for capturing faces and another for capturing full bodies. There are hundreds of high-resolution cameras across both studios, with each camera capturing data at a rate of 1 GB per second.
“To put this into perspective, a laptop with 512 GB disk space will survive for three seconds of recording before running out of space,” says Yu. “And our captures last around 15 minutes. The large number of cameras really pushes the limits of our capture hardware, but pushing these limits lets us collect the best possible data to create one of the most photorealistic avatars in existence.” One of the studios has 1,700 microphones, enabling the reconstruction of sound fields in 3D for truly immersive audio — an essential component of immersive environments.
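Yu's back-of-the-envelope figures can be checked directly. The camera count below is an assumption inferred from the quoted numbers (hundreds of cameras at 1 GB/s each, filling 512 GB in about three seconds), not a figure from the lab:

```python
DISK_GB = 512          # laptop disk capacity from the quote
PER_CAMERA_GBPS = 1.0  # each camera captures roughly 1 GB per second
NUM_CAMERAS = 170      # assumed count, consistent with "hundreds of cameras"

aggregate_rate = NUM_CAMERAS * PER_CAMERA_GBPS   # GB written per second
seconds_until_full = DISK_GB / aggregate_rate
print(round(seconds_until_full))                 # → 3, matching the quote

# A 15-minute capture session at this rate:
capture_gb = aggregate_rate * 15 * 60
print(capture_gb)                                # → 153000.0 GB, roughly 150 TB
```

Under these assumptions, a single 15-minute session generates on the order of 150 TB of raw data, which is why the capture pipeline, not the cameras, is the practical bottleneck.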
FRL’s approach is to use these captures to train AI systems that can quickly and easily build your Codec Avatar from just a few snaps or videos. But doing this for the diversity of humans is a considerable challenge, and the team is just getting started. “This has taught me to appreciate how unique everyone is,” Yu notes. “We’ve captured people with exaggerated hairstyles and someone wearing an electroencephalography cap. We’ve scanned people with earrings, lobe rings, nose rings, and so much more.”
Working at FRL Pittsburgh has even led to some profound on-the-job moments. “Yaser’s parents came in and recorded a message to their grandchildren and future great-grandchildren,” says Belko. “They created an interactive time capsule. I had never really thought about the impact that telepresence could have to connect the future generations to the past, but can you imagine being able to watch a personalized message from someone who is no longer with you?”
The two capture studios are vitally important to FRL Pittsburgh’s efforts, but they’re large and impractical. The goal is to achieve the same results through lightweight headsets sometime in the future. The Pittsburgh team had to stretch beyond capture solutions available today — largely focused on a subject’s head and hands — and invent a series of prototype Head Mounted Capture systems (HMCs) equipped with cameras, accelerometers, gyroscopes, magnetometers, and microphones to capture the full range of human expression. These HMCs are what animate the Codec Avatars while users talk to each other in virtual environments.
Building HMCs is no easy feat. Sensors need to be packed into headsets people will find comfortable. Illuminating the face leads to an unpleasant user experience, so the HMCs created at the Pittsburgh lab use infrared, which is invisible to the human eye. “If the experience is to be indistinguishable from a physical face-to-face experience, we need to have comprehensive sensing ability while making sure the headset won’t limit users’ ability to gesture and express themselves,” says FRL Research Scientist Hernan Badino.
Software is an equally important part of the equation, and the team has cooked up a suite of programs to work with data from HMCs. “A researcher might want to obtain very specific images from a device or have full control on the capture system to test a particular hypothesis,” says Badino. “The software our team developed gives us flexible control over the capture system, letting us focus and study specific areas. Within the software there are also plenty of tools for deploying headsets within the lab, such as calibration, data diagnostics, and analysis tools.”
Trust is a critical component when talking to people in real life, and it shouldn’t be any different in virtual reality. The system needs to deliver lifelike avatars that people can trust immediately. A big part of this is accurately capturing the subtle expressions, like the way a person blinks or chuckles, so there’s no mistaking who’s behind the virtual face. “The only proof we have for what makes social engagements compelling, physical or otherwise, is authenticity. There is an implicit trust that you are receiving ‘real’ information from the other person,” says Sheikh.
Giving people a way to build their own lifelike avatars quickly and easily is only part of the challenge. Making sure people (and their avatars) stay safe is the other. The Pittsburgh team is mitigating potential issues through a combination of user authentication, device authentication, and hardware encryption. But it all starts with the proper handling of data. “This is incredibly important to all of us,” says Belko. “Before starting any collection efforts, we made sure we had a robust system in place for handling and storing data.”
One technology the team is keenly aware of is “deepfakes” — images and videos that use AI and preexisting images and footage in order to fabricate a scene, such as a person saying something they never actually said in real life. This technology will only improve in the future, making it hard to tell the difference between a real event, such as a live television interview, and one artificially created using deepfake technology. “Deepfakes are an existential threat to our telepresence project because trust is so intrinsically related to communication,” says Sheikh. “If you hear your mother’s voice on a call, you don’t have an iota of doubt that what she said is what you heard. You have this trust despite the fact that her voice is sensed by a noisy microphone, compressed, transmitted over many miles, reconstructed on the far side, and played by an imperfect speaker.”
FRL Pittsburgh is thinking pragmatically about safeguards to keep avatar data safe. For example, the team is exploring the idea of securing future avatars through an authentic account. How we work with real identities on our platforms will be a key part of this, and we have discussed several security and identity verification options for future devices. We’re still years away from this type of technology reaching consumer headsets, but FRL is already working through possible solutions.
The team also has regular reviews with privacy, security, and IT experts to make sure they’re following protocol and implementing the latest and most rigorous safeguards possible. “We’ve considered all possible use cases for this technology,” says Hoover. “We’re aware of the risk and routinely talk about the positive and negative impacts this technology can have. As a lab, we’re excited about making this technology, but only if it’s done the right way. Everyone knows how important this research is and how important it is that people trust it.”
Imagine putting on a headset and being transported thousands of miles away to attend class, go to work, or attend a relative’s birthday party. You’d be recognized immediately by everyone there because, for all intents and purposes, you’ve arrived at the event. You’d look, move, and sound just like you do in real life. It’s not just for convenience; a lifelike avatar can be somewhere you can’t be physically, whether because of circumstances or simple distance. It would help solve a lot of the challenges people today face in maintaining long-distance friendships and finding community.
The point isn’t to replace physical connection but rather to give people new tools when they can’t interact in person, as telephone and video calls have. There’s a lot of work to do, and several remaining challenges to solve, before lifelike avatars are ready for prime time. When you’re building a new way for people to spend time together at distance — to see each other, talk to each other, and literally feel like they’re together in the same room — there are plenty of issues to resolve and breakthroughs to make before the project is ready to take the stage.
But this kind of authentic closeness is exactly what the folks at FRL’s Pittsburgh office have been working on through Codec Avatars. “We have the resources to drive new ideas,” says Sheikh. “And when you add the chance to bring together diverse expertise to fully tackle these massive design challenges, it fuels a pace of innovation unlike anything I have seen before.”
Reality Labs brings together a world-class team of researchers, developers, and engineers to build the future of connection within virtual and augmented reality.