How Meta production engineers solve the problem of scale

It's all about empowering engineers to do what they do best

At Meta, everything is a scale problem. That’s always going to be the case when you have 3.65 billion users globally.

For this reason, we have to build our own products and infrastructure. Quite simply, no third-party tool is powerful enough to handle our services and applications across our infrastructure. When we’re building these products and the infrastructure that supports them, our mandate must be forward-looking. These systems, after all, need not only to support the massive user base we already have, but also to grow as our user base and the new capabilities on the horizon expand. This applies to everything from our infrastructure to our entire family of apps. It also applies to our research projects.

At other companies, this task tends to fall to a well-known position in the engineering field: the site reliability engineer, or SRE. An SRE’s duty is to build reliability into a product or infrastructure as its user base scales. At Meta, this role has another name: production engineer, or PE. And the job title isn’t the only thing that’s different: We’ve rethought the entire role. Rather than being restricted to putting out daily reliability fires, as is the case with SREs at many companies, our PE group acts as an engineering team focused on holistic scale problems, reliability challenges, and efficiency efforts. Instead of troubleshooting after a product has been released, our PEs have a key place at the table from the outset and take a zero-tolerance approach to performing operations manually. Manual operations simply do not work at our scale.

All about the passion

Our approach begins with a simple philosophy: Engineers do their best work when it’s for something they’re passionate about. That’s the basis for our Engineer Bootcamp, typically a six- to 10-week program that every new hire goes through when they join our team. At Bootcamp, engineers are introduced to our culture, exposed to different parts of our infrastructure, and schooled in each of our unique areas of focus. Upon completing the program, they get the chance to choose where they want to work. This helps ensure that everyone buys into the work they’re doing. It also flips the model from a hierarchical one to one driven by the employee, placing managers in more of an influence-without-authority role and creating the “bottom-up” influence in which engineers have a level of autonomy seldom seen at other tech companies.

Next, Meta structures the role differently. Once they’ve chosen an area, PEs work alongside software engineers (SWEs). In some cases, they’ll lead projects, and at other times they’ll support them. But PEs are not the ops team. This partnership of equals is both empowering and validating, and it speaks to the highly collaborative work environment we strive to create. It also makes our products better by creating a natural tension between features and reliability at scale, in a world where both are required to create a great experience for our community.

My own journey at Meta is a good example of our iterative, engineer-friendly culture. I began as a director in Production Engineering, and upon completing my Bootcamp I decided I wanted to work in an area that had not yet been implemented, one that helped with secure content moderation. We had tens of thousands of users working on this crucial task, and the third-party tool we were using wasn’t sufficient to handle that scale. I had the idea to build a product that created a more efficient and secure way to handle this challenge and our users’ data. I explored building with technologies that were not the norm at the time. I started by writing some simple code and built a prototype, and as I progressed, others began to join the team of their own accord. The result became a fully functioning product we still use today. By being given the autonomy to explore and create, and then the support to scale, I was able to bring a “0 to 1” solution to the company, together with other awesome engineers, and it became integral to our success. In my current role as VP of Production Engineering, I work hard to ensure that our talented PEs get the same chances I did, while shaping opportunities that point engineers toward solving company-level needs and priorities.

How we structure our department is also unique. PE as a function is centrally organized but locally embedded. This gives my role and the PE organizations a company-wide perspective across infrastructure, research, and our entire family of products, including Facebook, WhatsApp, Instagram, Messenger, and Ads, as well as the metaverse and many emerging technologies.

Scaling our best-known products

The work of Syamla Bandla, who leads our PE Products division, and her team illustrates Meta’s unique approach to production engineering. Syamla’s team oversees scale and reliability for some of Meta’s best-known products, including Instagram, WhatsApp, Messenger, and Ads. To give some context, Meta supports millions of advertisers on Facebook every year, and we’re seeing rapid growth in Instagram Reels.

Creating a shared infrastructure that supports the scale we’re talking about, especially with Meta’s diverse portfolio of products, each of which has its own metrics and requirements, requires a holistic yet unique set of skills. That’s where PEs come in. They focus on complex dependency management, ensuring that reliability is built into the products, and they are the glue between the infrastructure and product teams. They make sure we’re measuring the right service-level objectives (SLOs) and service-level indicators (SLIs), and adjusting as needed along the way. As one can imagine, these products are massively complicated and require a highly matrixed team to ensure success.
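To make the SLO and SLI idea concrete, here is a minimal Python sketch of one common way to compare a measured indicator against an objective. It is an illustration under stated assumptions, not a description of Meta’s tooling: the `evaluate_slo` function, the `SloReport` type, the event counts, and the 99.9% target are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloReport:
    sli: float               # measured fraction of good events
    slo_target: float        # objective: required fraction of good events
    budget_remaining: float  # share of the error budget still unspent (negative if overspent)
    meets_slo: bool


def evaluate_slo(good_events: int, total_events: int, slo_target: float) -> SloReport:
    """Compare a ratio-style SLI against an SLO target (hypothetical helper)."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0 <= good_events <= total_events:
        raise ValueError("good_events must be between 0 and total_events")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")

    # SLI: the fraction of events that met the success criterion.
    sli = good_events / total_events

    # Error budget: how many bad events the target tolerates in this window.
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    budget_remaining = 1.0 - actual_bad / allowed_bad

    return SloReport(sli, slo_target, budget_remaining, sli >= slo_target)


if __name__ == "__main__":
    # Made-up numbers: 1,000,000 requests, 500 failures, 99.9% target.
    report = evaluate_slo(good_events=999_500, total_events=1_000_000, slo_target=0.999)
    print(report)
```

In this sketch, “adjusting as needed” would correspond to revisiting the target, or what counts as a good event, when the remaining budget is consistently far above or below zero.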

Our biggest priority right now is building a state-of-the-art AI infrastructure and platform that will fuel the growth and user experiences anticipated in areas such as advertising, Reels, and, of course, the metaverse. This type of growth, in turn, leads to the need for larger models that are more advanced and complex, so PEs focus on making sure our AI infrastructure and platform can scale with reliability and observability in mind. This is a large, cross-functional initiative. It touches all core Infrastructure teams, including provisioning, core services, compute, network, storage, data, and AI Infra, and it ultimately enables the AI infrastructure and platform for some of our largest workloads in Ads, Instagram/Reels, and Feed. PE’s embedded model across our multiple Software Engineering teams enables the strong, deep collaboration needed to drive such large initiatives, with a global perspective that is quite unique.

Whether in the area of products, infrastructure, or research and reliability labs, PEs at Meta add tremendous value and allow our services and business to grow at the rates we anticipate. In the world of technology, there will always be change, and we’re OK with that. That said, one thing that won’t change is our commitment to keeping Meta a place where engineers have an opportunity to work at massive scale, with the autonomy to make an impact on some of the industry’s largest challenges, in a culture that enables bottom-up engineering to meet top company priorities and needs.