By Paul Wu and Nikita Lutsenko
In today’s post, Meta Product Manager Paul Wu and Software Engineer Nikita Lutsenko walk us through how Meta Spark powers AR experiences across Meta’s family of applications and technologies.
We’ve developed an augmented reality (AR) engine that gives creators the core technologies they need to create AR experiences.
At Meta, our AR engine group works to ensure that our AR services are available to everyone, regardless of the device they’re using. AR and virtual reality (VR) experiences shouldn’t be restricted to the most sophisticated devices.
To achieve this, we’re focusing on performance optimization. Meta’s AR platform is one of the largest in the world, helping the billions of people on Meta’s apps experience AR every day and giving hundreds of thousands of creators a means to express themselves. Meta’s AR tools are unique because they can be used on a wide variety of devices — from mixed reality headsets like Meta Quest Pro to phones, as well as lower-end devices that are much more prevalent in low-connectivity parts of the world.
Here are some of the challenges we’ve faced and lessons we’ve learned in the process of building a large-scale, cross-platform AR runtime since we began in 2017.
Many teams within Meta want to build AR experiences. This requires several pieces of technology: for example, managing device input, such as input from your camera, and using computer vision tracking to anchor things on the face, the body, or the environment. It also requires advanced rendering to deliver quality imagery on a spectrum of edge devices with different hardware.
Our teams generally want to avoid the overhead of building all this technology from scratch. So Meta’s AR engine provides the core technologies developers need to build AR experiences.
As a platform, we believe that creators can unlock their creativity only when they can use various AR capabilities as building blocks. We think about creative capabilities as either people-centric or world-centric. People-centric capabilities use people tracking to anchor things on the person (using iris tracking to control a game, for example). World-centric capabilities put your content into the real world, using things like plane tracking and target tracking. For example, you can use target tracking to create a QR code greeting card that shows someone a unique animation and custom text. In addition to computer vision–based capabilities, there are plenty of others that might not require a camera or even computer vision but rely on things like audio. For example, you can use our platform to build digital content that responds to the beat of a song or even allows you to transform your voice while you’re recording a video.
We give all creators the option to mix and match these various capabilities as they please, somewhat like LEGO bricks, to create and deliver unique experiences. We deliberately aim to make these capabilities platform- and device-agnostic whenever possible in order to provide maximum flexibility to customize content for the form factor that creators are targeting.
Once an experience is built, we do several things behind the scenes to ensure that we can deliver AR experiences with the widest possible reach, from mobile all the way to advanced hardware and VR. The first step involves significant server-side optimization. This includes asset optimization, such as removing empty pixels from images and stripping comments from scripts, as well as compressing assets for the target device. It also includes adjusting how assets are packaged, either into one large file or into multiple smaller ones, so we can support a wide variety of connection speeds when delivering them.
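To make the packaging idea concrete, here is a minimal sketch, not Meta’s actual pipeline. The function names, the `//` comment syntax, the bandwidth threshold, and the chunk size are all assumptions chosen for illustration. It strips full-line comments from a script and decides whether to ship assets as one bundle or as several size-capped bundles.

```python
# Hypothetical threshold: below this bandwidth, ship several small bundles
# so an experience can start before every asset has downloaded.
SLOW_CONNECTION_KBPS = 1000


def strip_line_comments(script: str) -> str:
    """Drop blank lines and full-line `//` comments from a script asset."""
    kept = []
    for line in script.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("//"):
            continue
        kept.append(line.rstrip())
    return "\n".join(kept)


def plan_bundles(asset_sizes_kb: dict[str, int], bandwidth_kbps: int,
                 chunk_limit_kb: int = 256) -> list[list[str]]:
    """Return one bundle on fast links, size-capped bundles on slow ones."""
    names = sorted(asset_sizes_kb)
    if bandwidth_kbps >= SLOW_CONNECTION_KBPS:
        return [names]

    bundles, current, current_size = [], [], 0
    for name in names:
        size = asset_sizes_kb[name]
        # Close the current bundle if adding this asset would exceed the cap.
        if current and current_size + size > chunk_limit_kb:
            bundles.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        bundles.append(current)
    return bundles
```

A real system would also weigh compression ratios and per-request overhead. The sketch only shows that the packaging decision can be driven by the connection being served.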
Next, we optimize the runtime. This includes preloading assets and prerendering content, similar to what you’d expect in a modern game engine, but we take it much further across all components. For example, tracking is one of the biggest performance hogs in our system, but we can generally improve on this by parallelizing operations. That lets us balance running inference on the latest data against keeping the visuals of a given experience smooth and fluid.
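One common way to get this balance is to run tracking on its own thread and let the renderer reuse the most recent result instead of waiting for a new one. The sketch below illustrates that general pattern under stated assumptions; it is not a description of our engine’s internals. `LatestValue`, `fake_track`, `tracking_worker`, and `render` are hypothetical names, and `fake_track` stands in for a real tracking model.

```python
import threading


class LatestValue:
    """Thread-safe slot that keeps only the most recent value written."""

    def __init__(self, initial=None):
        self._lock = threading.Lock()
        self._value = initial

    def set(self, value):
        with self._lock:
            self._value = value

    def get(self):
        with self._lock:
            return self._value


def fake_track(frame_id: int) -> dict:
    """Stand-in for an expensive tracking model (hypothetical)."""
    return {"frame": frame_id, "anchor_x": frame_id * 0.5}


def tracking_worker(frames: LatestValue, poses: LatestValue,
                    stop: threading.Event, new_frame: threading.Event) -> None:
    """Run inference on the newest frame only; older frames are skipped."""
    while not stop.is_set():
        # Wake up periodically so the stop flag is noticed.
        if not new_frame.wait(timeout=0.05):
            continue
        new_frame.clear()
        poses.set(fake_track(frames.get()))


def render(frame_id: int, pose) -> str:
    """Rendering never waits on tracking; it reuses the last known pose."""
    anchor = "none" if pose is None else pose["anchor_x"]
    return f"frame={frame_id} anchor={anchor}"
```

In use, the camera callback would call `frames.set(frame)` followed by `new_frame.set()`, while the render loop calls `render(frame_id, poses.get())` every frame. The trade-off is that a rendered frame may use a pose that is a frame or two old, in exchange for never stalling on inference.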
A big win has come from estimating hardware performance so that we can optimize ahead of time. Knowing how the target hardware performs informs the choice between higher- and lower-precision tracking models, as well as the selection of assets of different fidelity, for performance gains. As shown in the image below, a big lever within optimization is the ability to move important buckets and shorten the time it takes to invoke each step.
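As a rough illustration of choosing models and assets from a hardware estimate, consider the sketch below. The tier thresholds, model names, texture sizes, and scoring formula are all invented for the example and do not reflect real values from our platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    tracking_model: str  # hypothetical model names
    texture_size: int    # longest texture edge, in pixels


# Hypothetical tiers, ordered from most to least demanding.
# Each entry: (minimum benchmark score, profile used at or above it).
TIERS = [
    (80, Profile("tracker_fp32", 2048)),
    (40, Profile("tracker_fp16", 1024)),
    (0, Profile("tracker_int8", 512)),
]


def benchmark_score(ms_per_inference: float) -> int:
    """Map a measured inference time to a 0-100 score (assumed scale)."""
    return max(0, min(100, round(100 - ms_per_inference * 2)))


def select_profile(score: int) -> Profile:
    """Pick the first tier whose threshold the device meets."""
    for threshold, profile in TIERS:
        if score >= threshold:
            return profile
    return TIERS[-1][1]
```

For instance, a device that needs 25 ms per inference scores 50 under this assumed scale and would get the half-precision tracker with 1024-pixel textures. The decision can be made before the experience starts, which is the point of optimizing ahead of time.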
A specific example is when we expanded reach to bring AR to low-connectivity parts of the world. We achieved this by bringing AR to Facebook Lite, a smaller version of the Facebook mobile app that uses significantly less data. Doing this meant off-loading several elements to the server, as many devices don’t have the compute capabilities to run heavyweight components of the system.
Our platform needs to optimize for many diverse experiences as well as consider different form factors. We want to let creators focus on building amazing experiences, while we focus on the complexity of having them run everywhere.
Our runtime is largely organized in terms of tracking, simulation, and rendering. While each is a large system in itself, one way we simplify is by building all these systems on a single set of OS APIs. This enables us to abstract away platform specifics of operating systems as well as hardware components, like GPUs.
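The general shape of such an abstraction layer can be sketched as follows. This is a simplified illustration, not our actual API: `GpuBackend`, the two backend classes, and the platform keys are hypothetical names. The idea is that tracking, simulation, and rendering code is written against one interface, and the platform-specific choice is made in a single place.

```python
from abc import ABC, abstractmethod


class GpuBackend(ABC):
    """Hypothetical interface the rest of the runtime is written against."""

    @abstractmethod
    def name(self) -> str:
        """Identify the backend, e.g. for logging."""

    @abstractmethod
    def draw(self, mesh: str) -> str:
        """Submit one mesh; returns a description of the call for this demo."""


class MobileGpuBackend(GpuBackend):
    def name(self) -> str:
        return "mobile_gpu"

    def draw(self, mesh: str) -> str:
        return f"[{self.name()}] draw {mesh}"


class HeadsetGpuBackend(GpuBackend):
    def name(self) -> str:
        return "headset_gpu"

    def draw(self, mesh: str) -> str:
        return f"[{self.name()}] draw {mesh}"


# The platform-specific choice is made once, in one place.
BACKENDS = {"phone": MobileGpuBackend, "headset": HeadsetGpuBackend}


def create_backend(platform: str) -> GpuBackend:
    if platform not in BACKENDS:
        raise ValueError(f"unsupported platform: {platform}")
    return BACKENDS[platform]()


def render_scene(backend: GpuBackend, meshes: list[str]) -> list[str]:
    """Higher-level code only ever sees the GpuBackend interface."""
    return [backend.draw(mesh) for mesh in meshes]
```

Supporting a new device then means adding one backend class and one registry entry, while `render_scene` and everything built on it stay unchanged.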
We’ve also deliberately split our monolithic runtime into smaller plugins. This way, if an app doesn’t require a specific capability, that capability can be excluded with a quick configuration toggle. For example, mobile phones generally wouldn’t benefit from having a capability for handling VR controller-based input. The capabilities needed for any given experience dictate which plugins are loaded.
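A capability-driven plugin lookup might look something like the sketch below. The capability names echo ones mentioned earlier in this post, but the plugin names, the registry, and the `plugins_to_load` function are hypothetical and only illustrate the mechanism.

```python
# Hypothetical registry: capability name -> plugin that provides it.
PLUGIN_FOR_CAPABILITY = {
    "face_tracking": "people_tracking_plugin",
    "iris_tracking": "people_tracking_plugin",
    "plane_tracking": "world_tracking_plugin",
    "target_tracking": "world_tracking_plugin",
    "audio_analysis": "audio_plugin",
    "controller_input": "vr_input_plugin",
}


def plugins_to_load(required: set[str], enabled_in_app: set[str]) -> list[str]:
    """Resolve an experience's capabilities to the plugins the app ships.

    `required` lists the capabilities the experience uses; `enabled_in_app`
    lists the plugins that were not toggled off in the app's configuration.
    """
    needed = {PLUGIN_FOR_CAPABILITY[cap] for cap in required}
    missing = needed - enabled_in_app
    if missing:
        raise LookupError(f"app build excludes: {sorted(missing)}")
    return sorted(needed)
```

A mobile build that enables only the people-tracking, world-tracking, and audio plugins would load two plugins for an iris-tracking game with audio effects. It would report an error for an experience that needs controller input, because that plugin was excluded from the build.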
Avatars, powered by our AR engine, are one of our most recent success stories, and the work drew on the techniques above. Avatars, digital replicas of people, need to be supported on a wide range of mobile devices as well as VR headsets. In addition, because every avatar is personalized and avatar expressivity keeps increasing, we already have to support hundreds of thousands of avatar permutations.
Our first tactic was to break down the process and order of operations in the entire avatar rendering pipeline. We started to shave time off each step and to rearrange steps where doing so would increase efficiency. Second, we parallelized the operations that go into delivering, simulating, and rendering the avatar on each target form factor, and we began applying a wide variety of ahead-of-time optimization techniques. Last but not least, we partnered with another Meta team to integrate Avatar SDK, a tool for customizing and displaying avatars, as an AR engine plugin. This allowed us to reduce latency even further, to bring some of the newest features, like kinematics and full-body avatars, into the AR engine, and to accelerate new avatar feature development overall.
Building for scale really is a game of inches. A lot of what we achieved came through iteration. We didn’t design for maximum reach from the beginning; instead, we evolved toward it over time. We’re still looking for new opportunities to further expand our platform’s reach and support more use cases with a single AR engine at the core.
Take a deeper dive: Watch "Building a Cross-platform Runtime for AR Experiences."
Reality Labs brings together a world-class team of researchers, developers, and engineers to build the future of connection within virtual and augmented reality.