This month, we’re updating the Oculus Avatar system to allow people to be more expressive in social experiences in VR.
This blog post looks at the origins and evolution of Oculus Avatars, and the important role they play in creating immersive experiences that help defy distance and allow people to use VR to interact meaningfully no matter where they are. The newest changes to the Oculus Avatars system — the expressive updates — are live now across Oculus mobile and PC platforms.
For all the computing enhancements we’ve seen over the past few decades, the level of immersion in virtual worlds has remained roughly unchanged, because the experience has still been relayed through a computer monitor, with a mouse or controller for input.
In comparison, virtual reality hardware is increasingly able to relay human motion directly into an experience and embody that experience via the head-mounted display in a way that closes the action-perception feedback loop. When you can adjust your view of the world by simply looking around, you feel increasingly immersed in the experience, as if it were real. With the addition of accurately tracked hand controllers, allowing you to see your own hands move and interact in the world, you begin to achieve self-presence — the sensation of being there.
And in VR experiences that use avatars to deliver verbal and nonverbal social interactions in real time, you get closer and closer to social presence — the sensation of being there with others — in a way that is unique to VR.
You can feel connected to others with very little in the way of input. With just the tracked position of the headset and two hand controllers, people on opposite sides of the world can collaborate in a shared virtual environment and begin to feel like they’re really sharing a room. We are very receptive to one another’s behavioral cues, such as body language, gestures, and head tilt. We can infer that someone is paying attention and, by the subtlest of head movements, appreciate social norms that we see in the real world.
As the representation of avatars becomes more nuanced, adding eye and mouth movements and body simulation, the challenge shifts to reproducing the nuances in behavior that we are hardwired to see in others. Doing so with today’s VR hardware means achieving a high degree of social presence without any additional face cameras or body tracking, requiring us to combine high-quality tracked head and hand motions with simulated behaviors and expressions. When this is done well, your sense of social presence is heightened. When there’s a disconnect between the perceived quality of fake and real parts of the avatar performance, it can feel quite jarring or uncomfortable.
Formed in 2016, the Oculus Avatars project aims to tackle some of these challenges on the way to delivering rich self-presence and social presence, and to make this technology available for developers who want to build social VR experiences to help connect the world.
The Oculus Avatars project evolved from Toybox. Developed in 2015, Toybox was one of the first experiences Oculus created to showcase the value of tracking both hands in addition to positional tracking of the VR headset, allowing people to grab objects and, more important, to interact with one another in VR.
During the creation of Toybox, we realized a simple human representation (a box representing the head and hands) was enough to read a lot of the nuance of human body language. In fact, the partner of one test participant was able to recognize her spouse, represented as three simple shapes. “That’s the way he shrugs!” she said.
As we iterated on the demo, we settled on a simplified head and hands, learning that the simplest information — like the direction of the bridge of the nose — could powerfully communicate where someone was focusing his or her attention and enhance communication in a shared VR experience.
We also quickly learned that the things we faked, unless reproduced with the nuance of human motion we are accustomed to interpreting, could seem out of place to the point where it would break immersion in the virtual world. Seeing a pair of hands tracked with high accuracy was indeed incredibly immersive, but seeing an incorrectly simulated elbow immediately caused the human brain to exclaim, “That’s not where my elbow is!”
We were fighting against proprioception (the innate sense of where parts of your body are in relation to one another) and learning the importance of being deliberate about what we would simulate in the absence of camera or controller tracking.
We also saw a huge amount of value in giving users a human representation to call their own. In VR you can be anyone, which is a powerful form of personal expression. We wanted to create a system that lets users customize their appearance and take it with them across myriad social VR experiences. Such a system would also give developers the tools and content to quickly bring personalized avatars to their social experiences.
Video games have had realistic-looking, customizable humans for decades — this should be simple, right?
It is not simple. When you look at another player in a traditional multiplayer game, there’s generally no expectation for the character to move exactly like a human would — you know it’s being piloted by someone sitting on a couch, holding a gamepad. This lack of expectation is a saving grace.
In VR, when you see an avatar moving in a realistic and very human way, your mind begins to analyze it, and you see what’s wrong. You could almost think of this as an evolutionary defense mechanism. We should be wary of things that move like humans but don’t behave like us. (Cue fifty years of sci-fi filmmaking and paperbacks.)
Human skin, we learned, is really difficult to fake. When humans talk, the skin stretches over cheekbones and coloration changes, all of which is very hard to reproduce in a convincing manner, particularly in the tightly constrained compute budgets of VR experiences.
Eye and mouth movements are equally important to get right. It’s not enough to make the eyes blink and to articulate the jaw when you are talking. There are other social cues delivered by the brows, cheeks, and lips. These are nonverbal and, without a face-tracking camera, difficult to simulate.
It’s because of challenges like these that our teams at Facebook Reality Labs are having to invent completely new technologies to enable more realistic avatars over the next decade.
In the absence of mature technology, however, our initial approach was to abstract away from these problems.
Making humans look significantly less human and breaking the expectation of realism worked well. Our colleagues working on Facebook Spaces pursued a fully articulated avatar with simulated eyes, mouth, and arms, using a cartoon style to abstract away from anything that might come across as uncanny or heighten the discontinuity of simulated and tracked behaviors.
At the same time, we learned the value of greater realism. Presented with a more human likeness and proportions, people automatically understood the space someone was occupying; the distinct shape of the nose bridge could indicate attention and facilitate conversational turn-taking, even when viewed side-on. And realism opened up the use of avatars in contexts where toon avatars felt less appropriate, such as business meetings in VR.
For the launch of Oculus Avatars in 2016, we started with a volumetric and sculpturally accurate human representation, and we used a monochrome texture to abstract away from anything that felt too much like skin. We faded or covered areas of the body we couldn’t reliably simulate, all to deemphasize what was not really there and to focus attention on the very human motion of the head and hands that we could track using our hardware.
Going into 2017, as we looked at ways to evolve the product, we performed countless hours of user research and different experiments with avatar design. Many people loved their sci-fi stylized avatar, but many more wanted to choose a skin tone and hair color that felt more personal to the expression of their identity in VR.
There was also an overwhelming desire to see eye and mouth movement, to help make connecting in VR feel more authentic and meaningful. We saw some rudimentary gains with very simple mouth animation (dubbed wibble mouth), but any experiments adding eyes to our monochromatic (and partially transparent) avatars ended up looking very, very strange.
We needed to evolve our visual style in order to deliver richer social presence.
This meant solving for some really hard challenges. Not only did we need to deliver eye and mouth movements without any cameras, but we also had to do it in a way that built upon what we’d already learned about realism of avatar form, and with a visual treatment that would work across the gamut of experiences people had already built using the Oculus Avatars SDK.
We also had to invent new ways to measure the impact of changes to our system. Working in conjunction with our excellent user research team in London, we created a scalable framework for evaluating avatar style and behaviors across multiple axes of self-presence and social presence, giving us a much better sense of when we were headed in the right direction.
As we prepared the ultimate makeover, we learned that overindexing on either behavioral or visual realism quickly led to an uncomfortable, or uncanny, interpersonal experience. An ultra-realistic human avatar without ultra-realistic facial expression felt jarring and lifeless. This was very much a case of balancing the art and the science.
We started with the science. We understood very early on that without face-tracking technology, our ability to re-create eye gaze and facial expressions represented the upper limit of believability we could achieve. Understanding our limits in re-creating human behavior helped us decide what level of realism we needed to convey through an avatar’s appearance.
We approached avatar behaviors by breaking the face into a few different components (speech, gaze, blinking) on top of which we could layer a more holistic understanding of the face and body language to ensure a cohesive performance.
Fortunately, we weren’t the first people to try to understand how humans work!
There is a wealth of academic research on eye kinematics, governing how the eyes behave when tracking a moving object or snapping focus from one object to another. Blinking is similarly well understood. You blink subconsciously to pull water over your eyes as air dries them out. If you look around, more eyeball is exposed to the air, and there’s a greater chance that you’ll blink. It has also been observed that people blink more often as they finish a sentence.
We were able to codify these models of behavior for VR, and then had the fun job of tuning them as we realized all the ways wearing a headset actually makes people behave differently in a virtual social environment!
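As a rough illustration of how blink observations like these might be codified, here is a minimal sketch. It is not the Oculus Avatars implementation: the function names, the base blink rate, and the gain values are all hypothetical, chosen only to show blink likelihood rising as the eyes rotate away from center and as a sentence ends.

```python
import random

def blink_probability(dt, eye_angle_deg, sentence_just_ended,
                      base_rate_hz=0.25, angle_gain=0.02, sentence_boost=3.0):
    """Probability of starting a blink during a time step of dt seconds.

    All default constants are illustrative assumptions, not measured values.
    """
    # Looking away from center exposes more of the eyeball, so raise the rate.
    rate = base_rate_hz * (1.0 + angle_gain * abs(eye_angle_deg))
    # People tend to blink more often as they finish a sentence.
    if sentence_just_ended:
        rate *= sentence_boost
    return min(1.0, rate * dt)

def should_blink(dt, eye_angle_deg, sentence_just_ended, rng=random):
    """Randomly decide whether a blink starts in this time step."""
    return rng.random() < blink_probability(dt, eye_angle_deg, sentence_just_ended)
```

Calling `should_blink` once per animation frame with the frame time as `dt` would produce irregular blinks that cluster around glances and pauses in speech.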
Gaze modeling is a great example of where things are different in VR: People normally look at things within a comfortable range of eye movement, tending to move their eyes about 30 degrees to the left or right before turning their head. It’s uncomfortable to look at something out of the corner of your eye for more than a few seconds.
With a VR headset on, however, we observed this range decrease to about 10 degrees. With a reduced field of view and a smaller sweet spot of visual clarity, people turned their heads more to look at things in VR.
This realization made it a little easier to predict where someone was looking based on head direction and the objects or people in front of them in a virtual environment, giving us more confidence in being able to simulate compelling eye behaviors.
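One way to picture this kind of gaze prediction is sketched below, under stated assumptions. The code picks the object closest to the head's forward direction and only lets the eyes rotate within the roughly 10-degree range mentioned above. The function names, the flat 2D (x, z) coordinates, and the nearest-to-center rule are hypothetical simplifications, not the shipped system.

```python
import math

MAX_EYE_YAW_DEG = 10.0  # approximate in-headset eye range quoted in the text

def yaw_to_target(head_pos, head_yaw_deg, target_pos):
    """Signed horizontal angle in degrees from the head's forward direction to a target.

    Positions are (x, z) pairs; a yaw of 0 faces +z and positive yaw turns toward +x.
    """
    dx = target_pos[0] - head_pos[0]
    dz = target_pos[1] - head_pos[1]
    target_yaw = math.degrees(math.atan2(dx, dz))
    # Wrap the difference into the range [-180, 180).
    return (target_yaw - head_yaw_deg + 180.0) % 360.0 - 180.0

def pick_gaze(head_pos, head_yaw_deg, targets, max_eye_yaw_deg=MAX_EYE_YAW_DEG):
    """Return (target_name, eye_yaw_deg) for the most central target within eye range.

    Returns (None, 0.0) when nothing is in range, so the eyes rest looking straight ahead.
    """
    best_name, best_offset = None, None
    for name, pos in targets.items():
        offset = yaw_to_target(head_pos, head_yaw_deg, pos)
        if abs(offset) > max_eye_yaw_deg:
            continue  # too far off-center; the wearer would turn their head instead
        if best_offset is None or abs(offset) < abs(best_offset):
            best_name, best_offset = name, offset
    if best_name is None:
        return None, 0.0
    return best_name, best_offset
```

For example, with a head at the origin facing +z, a person at `(0.1, 2.0)` sits about 3 degrees off-center and would be chosen, while a lamp at `(2.0, 2.0)` sits 45 degrees off-center and would be ignored.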
With speech, we similarly looked to decades of study and mountains of data to characterize the behaviors of the mouth as it forms different sounds. Leveraging our latest Oculus Lipsync feature, we were able to use spoken audio to derive visemes: shapes that your mouth makes in order to form speech.
But your mouth moves in a very complex way. Animating between a set of mouth shapes, while accurate, felt decidedly artificial. We had to think about how to model contracting and relaxing face muscles in a way that felt much more comfortable to look at, and we brought in experts in linguistics and face tagging to help us out.
In this case, our issue was the order of events. You move your lips before, while, and after sound passes through them in order to shape words. How could we simulate anything that came before the sound had left a person’s mouth, gone into their microphone, and been decoded into visemes?
One option was to take advantage of the fact that social interactions in VR involved networking people in different locations and delay the audio long enough to allow us to animate into the visemes, but doing so would exacerbate common frustrations that occur when people use videoconferencing or long-distance calls, leading to worse social interactions.
We wanted to avoid this at all costs, so we instead looked at how the mouth moved when chaining together multiple sounds (the many visemes involved in saying “Hello”). We found that we could model the intermediate mouth shapes between each sound and the movements into the next sound in quite a convincing way, by controlling how quickly individual mouth muscles could move. We nicknamed this technique differential interpolation (setting a limit on the distance each muscle could move between two different positions), and it resulted in mouth movement that felt less repetitive and choppy, and was much more readable.
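The idea of limiting how far each muscle can move can be sketched as a per-muscle rate limit. This is only an illustration of the general technique, assuming muscle activations between 0 and 1 and a separate maximum speed per muscle; the function name, the muscle names, and the speed values are hypothetical and do not come from the Oculus Avatars SDK.

```python
def step_toward(current, target, max_speed, dt):
    """Move each muscle activation toward its target, capped by that muscle's speed.

    current, target: dicts mapping muscle name -> activation (assumed 0.0 to 1.0)
    max_speed: dict mapping muscle name -> maximum change in activation per second
    dt: time step in seconds
    """
    result = {}
    for muscle, goal in target.items():
        value = current.get(muscle, 0.0)
        max_delta = max_speed[muscle] * dt
        # Clamp the requested change so slow muscles lag behind fast ones.
        delta = max(-max_delta, min(max_delta, goal - value))
        result[muscle] = value + delta
    return result
```

With a slow jaw and fast lips, a single step toward a new viseme leaves the jaw partway there while the lips arrive immediately, so the in-between mouth shape depends on which muscles are involved rather than being a uniform blend:

```python
pose = step_toward({"jaw": 0.0, "lips": 0.0},
                   {"jaw": 1.0, "lips": 1.0},
                   {"jaw": 2.0, "lips": 10.0},
                   dt=0.1)
```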
Finally, having a clear understanding of how to render mouth and eye behaviors, we looked at how to better characterize the movements that accompany speech. We found that, with eye and mouth movements alone, avatars felt very inorganic, especially when they weren’t talking.
Our faces are constantly moving, whether it’s the slight tightening around the eyes and cheeks, periodic twitches around the mouth and brow, or even just subtle asymmetry in the way we move. These micro-expressions are the base layer of what makes a face feel alive. Because many of them happen seemingly at random, we were able to add these elements to our face model and help it feel more dynamic and lively.
The more we analyzed the signal of human interactions, the more we uncovered common patterns that we might be able to factor into modeling or training data — a sharp change in voice pitch and a jerking head motion, both of which we can track, accompanied by eyes widening and eyebrows raising, which we couldn’t otherwise infer with today’s hardware. While we only scratched the surface of these behaviors and their triggers, we identified a wealth of work ahead. We’ll be able to update our system with new features, like pupils dilating and constricting with changing light levels or avatars making mutual eye contact, over the next few months.
We took a much more restrained approach to overt expressions of emotional state, however. We realized we needed to tread incredibly carefully in predicting and simulating when someone might look angry, or even happy, in the absence of any data about the social context. Adding even a small degree of anger evoked a visceral response in VR. To that end, we also explored enabling people to trigger emotions on their face, while being conscious of the fact that asking someone to press the “Angry” button might disrupt natural conversation flow (or, perhaps worse, they might press it by accident!).
As we planned for more expressive avatars, we had to consider how we’d evolve our art style in line with the introduction of a lot more behavioral realism.
We started this work back in 2017, and it informed both the system-wide style overhaul and the initial expressive prototype that were revealed at Oculus Connect 4.
As we deepened our understanding of what we could achieve with behavior simulation, we also looked to where we could achieve believability in both model sculpture and rendering.
The first thing we did was look at the gamut of movie and video game characters and establish a spectrum, from cartoon and abstract to highly realistic. Through experimentation, we learned that as avatars became more realistic in terms of sculptural fidelity, the texturing increasingly became the distinguishing factor in how uncanny the avatar felt when moving in VR. By comparison, with more abstract or exaggerated face shapes, texture mattered much less as the sense of uncanny realism had already been greatly diminished along a different spectrum. We could make a realistic face seem less real with textures and shading. It was much harder to make a cartoony face uncanny.
Given our learning to date, we determined that we would use a more sculpturally accurate form, but we’d also use texture and shading to pull it back from being too realistic, in order to match the behavioral fidelity that we were increasingly confident we could simulate. Our goal was to create something that was human enough that you’d read into the physiological traits and face behaviors we wanted to exemplify, but not so much that you’d fixate on the way that the skin should wrinkle and stretch, or the behavior of hair (which is incredibly difficult to simulate).
Compute cost was also a huge factor in our work. With Oculus Go and Quest on the horizon, we needed a path to shipping multiple expressive avatars on a mobile chipset. We’d been able to cram 250 avatars into a shared experience on Oculus Venues, so we had to be very cost conscious about every vertex we added to the avatar face.
Normal maps (using a relief map to augment a 3D mesh with contours at much lower cost) had already helped greatly in bringing the high quality of PC meshes to mobile. But with the addition of mouth movement, we had to completely resculpt the avatar faces and optimize for fidelity in areas that would stretch and crease (the regions surrounding the mouth, eyelids, and brows), shaving weight from areas that wouldn’t move (teeth, eyeballs, ears).
Finally, we spent weeks iterating on texture and shaders, ensuring that many of the qualities we’d honed on our original avatars, like subtle rim lighting to help accentuate face readability, weren’t lost as we transitioned to the new style.
The end result, demoed at Oculus Connect 5 late last year and now available to Oculus users, combines everything we’ve learned about the art and science of avatars so far.
So that’s the journey to date. What we’re really interested in solving for next is the complexity of the full body.
We’ve already seen people invest deeply in their avatar, choosing casual and fantastical outfits from their virtual wardrobe. Having a fully embodied avatar allows people to get incredibly creative and expressive, and it allows us to deliver a much richer and more complete sense of social presence, fully encapsulating the projection of a virtual person into the space with you and making more of their body language for nonverbal social cues.
But doing so introduces amazing challenges.
Back in 2016, we made a conscious decision to avoid showing what we didn’t know in order to better represent what we knew with certainty. Since then, we’ve learned a great deal not only about how our hardware can help us simulate believable behaviors with higher confidence, but also about how we can use machine learning and well-understood priors to translate subtle signals into great social presence.
We’re incredibly excited to see what we learn next on the path to realizing full-body avatars.