Introducing DeepFocus: The AI rendering system powering Half Dome
December 19, 2018
Earlier this year, Facebook Reality Labs (FRL) unveiled Half Dome, an industry-first prototype headset whose eye-tracking cameras, wide-field-of-view optics, and independently focused displays demonstrated the next step in lifelike VR experiences. By adjusting its displays to match your eye movements, Half Dome’s varifocal design makes every virtual object come into sharp focus. This approach showed real progress in creating a more comfortable, natural, and immersive sense of perception within VR.
But to reach its full potential, Half Dome’s advanced hardware needed equally innovative software. Today, we’re sharing details about DeepFocus, a new AI-powered rendering system that works with Half Dome to create the defocus effect that mimics how we see the world in everyday life. DeepFocus is the first system able to generate this effect, which blurs the portions of the scene that the wearer isn’t currently focusing on, in a way that is realistic, gaze-contingent, and fast enough to run in real time. We presented our research paper at the SIGGRAPH Asia conference in Tokyo this month, and we are also open-sourcing DeepFocus, including the system’s code and the data set we used to train it, to help the wider community of VR researchers incorporate blur into their work.
DeepFocus, which was developed by a multidisciplinary team of researchers at FRL, has nothing to do with cinematic aesthetics or splashy visuals. In fact, the more accurate the rendered blur, the less likely the viewer is to notice it. “Our end goal is to deliver visual experiences that are indistinguishable from reality,” says Marina Zannoli, a vision scientist at FRL who joined the DeepFocus project early on. And the key to a truly realistic experience is a combination of focused and defocused visuals. “Our eyes are like tiny cameras: When they focus on a given object, the parts of the scene that are at a different depth look blurry. Those blurry regions help our visual system make sense of the three-dimensional structure of the world, and help us decide where to focus our eyes next. While varifocal VR headsets can deliver a crisp image anywhere the viewer looks, DeepFocus allows us to render the rest of the scene just the way it looks in the real world: naturally blurry.”
One of the biggest potential benefits of realistic retinal blur is more comfortable VR experiences. “This is about all-day immersion,” says Douglas Lanman, FRL’s Director of Display Systems Research. “Whether you're playing a video game for hours or looking at a boring spreadsheet, eye strain, visual fatigue and just having a beautiful image you’re willing to spend your day with, all of that matters.”
Lanman recognized the need for rendered blur back in 2015, in the early stages of the Half Dome project (which he also leads). Even just a few months into the project, early prototypes were showing promising results for creating sharp focus within VR. Software-based defocusing, however, was proving to be a major obstacle. Our process couldn’t draw from existing techniques for rendering real-time blur in non-VR games, which have more to do with cinematography than realism, generating eye-catching cinematic effects (such as a pleasantly defocused background) geared specifically for flatscreen monitors and TVs. These fast but inaccurate methods of creating “game blur” ran counter to Half Dome’s mission, which is to faithfully reproduce the way light falls on the human retina.
After months spent exploring traditional techniques for optimizing computational displays, the results still weren’t fast enough to produce truly real-time blur that accurately matched physical reality. Those early efforts exposed the dual challenge of rendering truly realistic blur in VR, which requires combining incredibly high render speeds with the levels of image quality required by advanced head-mounted displays. And rendered blur isn’t a one-off process applied to a scene while it’s being developed or when the viewer first encounters it. Gaze-contingent blur has to deliver rapid-fire and near-instant defocusing to match essentially every eye movement, with a level of fidelity that can’t be achieved by simply dropping the resolution of objects that the wearer is no longer focusing on.
Lanman had already learned that throwing more processing power at the problem wasn’t feasible. A 2016 demo of Half Dome achieved real-time blur through a process called accumulation buffer rendering, where the scene was rendered 32 times per eye. But this approach worked only because the overall scene was simple; it wouldn’t be possible for a wider range of VR experiences, particularly given Lanman’s focus on making any software solution accessible to the entire VR community. “I wanted something that could work with every single game immediately, so we wouldn’t have to ask developers to alter their titles—they’d just work out of the box with Half Dome,” says Lanman.
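To make the cost of that approach concrete, here is a minimal sketch of accumulation buffer rendering. It is a toy, not the 2016 demo's code: the one-dimensional "renderer", the sunflower sampling pattern, and every name and default value (render_row, aperture_radius_m, px_per_rad and so on) are assumptions made for illustration. The only figure taken from the text above is the 32 views.

```python
import math

def lens_samples(n, radius_m):
    """Spread n sample points over a disk (sunflower pattern) standing in for the pupil."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    samples = []
    for i in range(n):
        r = radius_m * math.sqrt((i + 0.5) / n)
        theta = i * golden_angle
        samples.append((r * math.cos(theta), r * math.sin(theta)))
    return samples

def render_row(points, lens_x_m, focus_m, width, px_per_rad):
    """Toy pinhole render of one pixel row as seen from one lens position.

    Each point is (angle_rad, depth_m, intensity). Points on the focal plane
    land on the same pixel for every lens position; other points shift by an
    amount proportional to the lens offset and their defocus in diopters.
    """
    row = [0.0] * width
    for angle_rad, depth_m, intensity in points:
        shift_rad = lens_x_m * (1.0 / focus_m - 1.0 / depth_m)
        px = int(round(width / 2 + (angle_rad + shift_rad) * px_per_rad))
        if 0 <= px < width:
            row[px] += intensity
    return row

def accumulate(points, focus_m, n_views=32, aperture_radius_m=0.002,
               width=64, px_per_rad=2000.0):
    """Render the scene once per lens sample and average the results."""
    accum = [0.0] * width
    # Only the horizontal lens offset matters for this one-row toy image.
    for lens_x, _lens_y in lens_samples(n_views, aperture_radius_m):
        view = render_row(points, lens_x, focus_m, width, px_per_rad)
        accum = [a + v for a, v in zip(accum, view)]
    return [a / n_views for a in accum]

# A point on the focal plane stays sharp; a distant point smears across pixels.
sharp = accumulate([(0.0, 0.5, 1.0)], focus_m=0.5)
blurry = accumulate([(0.0, 10.0, 1.0)], focus_m=0.25)
```

The loop in accumulate is the expensive part: every extra lens sample is another full render of the scene, which is why the technique only held up on simple content.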
Bringing deep learning to VR
Instead of waiting for future processors to meet our requirements or asking customers to foot the bill for more total processing power, Lanman decided to develop software powered by AI. Specifically, he wanted to explore the use of deep learning, an approach in which AI systems learn to carry out a given task by training on large sets of relevant data. Deep learning algorithms are often used to analyze or even generate images. And while chipmakers have been moving in this direction, boosting the upper limits of image quality by adding AI-compatible learning cores to their latest video cards, deep learning is relatively unheard of in VR-related systems. “We decided to leverage those same AI tools that are driving industry trends,” says Lanman, “to go beyond just generating the pixels and actually give you more realism than you’ve seen before.”
Lanman’s deep learning strategy began in earnest when he hired Lei Xiao, an AI researcher who was fresh out of graduate school, where his PhD studies included numerical optimization and machine learning for computational photography. “I believe it was Lei’s first day in the lab when I told him, ‘I want to make computational displays like Half Dome run in real time, for the first time,’” says Lanman. “And that solution has to work for every single title in the Oculus Store, without asking developers to recompile.”
Xiao, who is now a research scientist at FRL, was tasked with generating realistic blur not from some new set of complex, focus-related parameters, but from the basic color and depth (RGB-D) inputs that our ASW 2.0 frame-rate smoothing technology already uses and that most game engines commonly provide. Previous work in this area had been plagued by artifacts at depth discontinuities in virtual scenes and by runtime performance that fell short of what modern VR display resolutions demand. In theory, an AI system with a sufficient understanding of defocusing could predict how neighboring pixels would mix together, regardless of their relative depth or the 3D gaze position (such as the wearer’s point of view). And if this technique could work with simple RGB-D inputs, realistic blur would be feasible for nearly any VR experience.
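As a rough illustration of why depth plus the wearer's focus distance carries enough information, the sketch below computes the classical per-pixel blur size from a depth map using the standard small-angle approximation (blur angle is roughly pupil diameter times defocus in diopters). This is not the DeepFocus network. The function names, the 4 mm pupil, and the 20 pixels-per-degree display are all assumptions chosen for the example.

```python
import math

def blur_diameter_rad(depth_m, focus_m, pupil_diameter_m=0.004):
    """Small-angle estimate of the angular diameter of defocus blur."""
    defocus_diopters = abs(1.0 / focus_m - 1.0 / depth_m)
    return pupil_diameter_m * defocus_diopters

def blur_map_px(depth_map_m, focus_m, px_per_degree=20.0, pupil_diameter_m=0.004):
    """Convert a 2D depth map (meters) into per-pixel blur diameters in display pixels."""
    return [
        [math.degrees(blur_diameter_rad(d, focus_m, pupil_diameter_m)) * px_per_degree
         for d in row]
        for row in depth_map_m
    ]

# Hypothetical 2x3 depth map; the wearer is focused at 0.5 m.
depth_map = [[0.5, 0.5, 2.0],
             [0.5, 1.0, 10.0]]
blur = blur_map_px(depth_map, focus_m=0.5)
```

A per-pixel blur size like this is only the easy half of the problem. Near occlusion edges, pixels at very different depths mix, which is where the depth-discontinuity artifacts described above tend to appear.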
To pull off this combination of sophisticated image understanding and straightforward inputs, Xiao built an entirely new neural network architecture, one that’s specifically optimized for rendering blur in real time. Unlike more traditional AI systems used for deep learning–based image analysis, this system could process visuals while maintaining the ultrasharp image resolutions necessary for high-quality VR.
But like all deep learning-based systems, FRL’s needed a wealth of training data to learn from. Specifically, DeepFocus — as the small but growing team had started calling the system — needed to develop its understanding of focusing and defocusing by looking at thousands of images that featured a wide variety of objects positioned at different distances. No data set existed that had the variety of surfaces and shapes the DeepFocus team needed. So Xiao and FRL technical artist Matt Chapman created one.
Chapman had come to FRL from the Oculus product team, where he had built some of our most well-known and polished demos. For DeepFocus, Chapman set aesthetics aside and gave Xiao an interactive junkyard of virtual objects. Chapman’s random scene generator produces scenes populated by scores of objects, including 3D scans of sculptures from the Louvre as well as synthetic spheres, cubes and 3D curves. The objects are randomly placed in 3D space with depths ranging from 25 centimeters to 10 meters.
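A random scene generator of this kind might be sketched as follows. This is a hypothetical stand-in, not Chapman's tool: the shape labels, field names, and value ranges are invented for illustration, and sampling depth uniformly in diopters (1/meters) rather than in meters is an assumption. Only the 25-centimeter-to-10-meter depth range comes from the description above.

```python
import random

# Hypothetical labels for the kinds of objects described in the text.
SHAPES = ("scanned_sculpture", "sphere", "cube", "curve")

def sample_scene(n_objects, rng, near_m=0.25, far_m=10.0):
    """Return a list of randomly placed objects with depths in [near_m, far_m]."""
    objects = []
    for _ in range(n_objects):
        # Assumption: sample uniformly in diopters so near depths, where
        # defocus changes fastest, are not underrepresented.
        diopters = rng.uniform(1.0 / far_m, 1.0 / near_m)
        depth_m = min(max(1.0 / diopters, near_m), far_m)
        objects.append({
            "shape": rng.choice(SHAPES),
            "depth_m": depth_m,
            # Hypothetical placement within a +/-45 degree field of view.
            "azimuth_deg": rng.uniform(-45.0, 45.0),
            "elevation_deg": rng.uniform(-45.0, 45.0),
            "scale": rng.uniform(0.1, 1.0),
            "texture_id": rng.randrange(1000),
        })
    return objects

# Seeding the generator makes a scene reproducible.
scene = sample_scene(50, random.Random(0))
```

Passing in a seeded random.Random keeps each scene reproducible, which helps when ground truth images for the same scene have to be re-rendered.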
The resulting collections of objects are bewildering to look at, but there’s a method to the random scene generator’s visual madness. This unnaturally rich range of geometric shapes and occlusions — with a greater variety of textures, surfaces and other features than you’d find in real life — functions as a kind of focal analysis boot camp for our deep learning system, preparing it to render blur in VR experiences it hasn’t seen before. “That was the first time I'd worked closely with a technical artist,” says Xiao. Technical artists like Matt Chapman are rare in research organizations but essential to FRL’s approach to AR and VR innovation. “Matt and I went through a lot of iterations to improve the random scene generator, from fine-tuning the distribution of objects, textures and materials to reducing the rendering time for ground truth images,” says Xiao. In total, they trained the system on 196,000 images drawn from the random scene generator, giving DeepFocus its core understanding of how to render blur in even the most varied and unfamiliar VR environments.
Over the course of the next year, the DeepFocus team grew to include a vision scientist (Zannoli) as well as research scientists Alexander Fix and Anton Kaplanyan, who helped design the system’s deep learning approach. “All previous methods of rendering highly realistic blur were based on hand-crafted mathematical models, with corner cases and limitations that lead to low-quality results and artifacts,” says Kaplanyan, who leads the Graphics Research team at FRL. “With deep learning, our system was able to grasp complex effects and relations, such as foreground and background defocusing, as well as correct blur at occlusion boundaries. And by generating a rich database of ground truth examples, we were able to cover a much wider range of defocus effects and set a new bar for depth-of-field synthesis.”
Salah Nouri, a research software engineer at FRL, joined the project to help demonstrate that DeepFocus could actually run on Half Dome and render real-time blur on present-day processors at a resolution fit for VR. “When I joined the team, the architecture of the network was already established, and the runtime was good enough for a regular PC or console game running at 1080p resolution,” says Nouri, who had worked on AAA video game titles before coming to FRL. “But we needed to at least quadruple that performance because VR is much more demanding.”
Nouri was able to demo DeepFocus and Half Dome on a four-GPU machine — a significantly more powerful setup than what consumers currently have available but still a major technical feat. “We needed to be very careful about parallelizing the work between the four GPUs, so that the memory transfers between them are pipelined in such a way that they don’t introduce any extra latency and have virtually zero compute cost,” says Nouri.
FRL isn’t done with either the software or hardware components of this technology, and our ultimate goal is to make real-time rendered blur run at VR resolutions on a single GPU. But our four-GPU demo and the research that we presented at SIGGRAPH Asia represent a significant milestone, both for integrating AI technology into graphics rendering and for developing new, more immersive and lifelike VR experiences. “We wanted to see what rendered blur could add to VR,” says Lanman. “But it had to be on real games and in a real VR setting. We achieved that. And that unlocks this whole universe of understanding.”
The future is bright, and defocused
With DeepFocus and Half Dome, we now have the tools to better understand how realism contributes to the user’s experience in VR and AR. And though we’re currently using DeepFocus with Half Dome, the system’s deep learning–based approach to defocusing is hardware agnostic. Our research paper shows that in addition to rendering real-time blur on varifocal displays, DeepFocus supports high-quality image synthesis for multifocal and light-field displays. This makes our system applicable to the entire range of next-gen head-mounted display technologies that are widely seen as the future of more advanced VR.
By making our DeepFocus source and training data available, we’ve provided a framework not just for engineers developing new VR systems, but also for vision scientists and other researchers studying long-standing perceptual questions. For example, how does our visual system use the blur in the environment to refocus our eyes? What can blur tell our brains about the three-dimensional structure of the world? DeepFocus may have provided the last piece of the puzzle for rendering real-time blur, but the cutting-edge research that our system will power is only just beginning.