From servers to silicon, how Meta infrastructure VP Alexis Björlin is building the foundation for the company’s AI-driven future


May 18, 2023
When most people think of Meta, they think of apps like Facebook, Instagram, and WhatsApp that are used by billions worldwide. Which is to say, they think about software. What fewer people know is that Meta is also in the computer systems business, creating cutting-edge hardware, silicon chips, and related systems designed to power the next generation of artificial intelligence computing for Meta’s apps and services. That effort is led by Alexis Björlin, a technology veteran who held executive positions at Broadcom and Intel before joining Meta over a year ago. In the conversation below, Björlin explains her vision for Meta’s emerging AI-driven infrastructure.

Tell us about your role at Meta.

Alexis Björlin: I lead AI Systems and Accelerated Platforms at Meta. Our team is responsible for designing and delivering highly performant compute and storage platforms that are uniquely optimized to serve Meta’s workloads.

If you think about it, the infrastructure to support 3 billion users requires us to be world-leading in performance, energy efficiency, reliability, serviceability, and capital efficiency, all at the same time. This necessitates owning the full stack, from software to hardware, all the way down to custom silicon.

What are the main drivers behind your team’s current focus?

AB: Looking back over the last decade, Meta’s workloads have changed a lot. Web was the predominant story of the past 10 years, and the systems supporting web scale were focused on increasing the speeds, feeds, and span of the serving systems in a capital- and energy-efficient way. The advent of real-time apps such as Instagram, WhatsApp, and Reels brought additional sensitivities around latency, availability, and load balancing. But these apps were still servable using a shared infrastructure.

The advent and growing dominance of AI is now pivoting us to build data centers specifically dedicated to high-performance AI workloads using 'software-defined hardware.' This implies working up and down the software stack with our Meta peers in AI research, AI/ML Modeling, and across AI Infra to co-design tightly coupled AI training clusters and inference-serving platforms all the way down to the silicon.

We’ve long partnered with leading AI silicon and systems companies to deliver the most capable infrastructure. But, at the same time, we’re developing our own custom-built, domain-specific accelerators to allow tighter coupling between innovations in silicon, algorithms, and the PyTorch software ecosystem. Designing systems across these three dimensions in a synergistic fashion is what we refer to as co-designing, which enables faster progress simultaneously across all three dimensions.

What’s the status of Meta’s in-house silicon efforts?

AB: For the past several years, we’ve focused on growing an internal team to develop custom silicon, taking a long-term view on the arc of silicon innovation for AI workloads. Having the capability to tune our own silicon for both performance and power efficiency, and to control our supply chain at our massive scale, is incredibly important to Meta.

Having an internal capability doesn’t mean going it alone. The speed of innovation and scale of deployment, especially within AI, require unwavering commitment to our long-standing partnerships with the industry’s leading silicon and systems providers, and we expect these relationships to continue to grow. That said, there are areas where Meta’s workload specificities are not well served by vendor silicon, and that requires us to build silicon that is custom to our needs.

(To learn more about Meta’s silicon strategy, tune into the AI Infra @Scale event today, Thursday, May 18, at 9am PT. We’ll be sharing progress on our in-house silicon efforts across inference and video encoding. Register here.)

What led you to this role?

AB: I came here a little over a year ago to contribute to the big transformation that was underway within Meta’s infrastructure and to build vertically integrated AI systems at a scale only possible within a handful of companies in the world.

Meta’s leadership in open source software, from Thrift, Hive, and Cassandra in the early years to PyTorch today, has been widely admired across the world. For the technologist in me, Meta offered the opportunity to bring together its software innovations with all of the work I’ve done in semiconductors, supercomputing, optics, and networking. For the business leader in me, it offered a once-in-a-lifetime opportunity to apply my experience building businesses from the ground up and running large engineering teams, while retaining the agility of a startup with a shared vision for the future.

Has Meta’s storied startup-like culture changed as the company has grown to its current size? What did you see coming in from the outside?

AB: Companies that grow under founder-CEOs are fortunate in that the founding culture often remains at the core through the company’s growth. I’ll give you a concrete example.

Meta’s 'Hack Culture' continues to encourage engineers to try what’s not proven. This, coupled with 'title-less' engineering ranks, improves the chance of discovering new possibilities by decoupling ideation from tenure, rank, or organizational boundaries. Our Hack Culture has been our hallmark since the beginning and still serves us well.

Has risk tolerance changed as Meta has grown?

AB: Risk tolerance has multiple thresholds in a company as large as Meta.

AB: When we were a younger company, 'Move fast and break things' helped us reframe failure as a way to push the boundaries of what’s possible and as a harbinger of future success. We fostered a bottom-up, blameless culture that still encourages openness and honesty, and fosters trust. But as happens with any company on which much of the world comes to depend daily, the cost of failure grew, and we started to balance moving fast with a focus on stable infrastructure and long-term impact. Now, especially for Meta’s Infrastructure team, it’s about ensuring stability while still taking bold bets (in silicon development, for example) and not becoming risk averse.

As we embrace the pivot to AI within Infra, where our hardware systems can provide real competitive differentiation for the research and production teams developing new models and serving our global users, we need to accelerate our systems technologies and capabilities, especially since we can no longer rely on Moore’s Law to deliver significant advances with every new process node. Some of that risk lies in introducing application-specific silicon and custom systems into the fleet, architecturally tuned to the requirements of our internal customers and software stack. We see teams across AI coming together to embrace these risks, increase developer velocity, and adopt new technologies that advance our fleet and capabilities.

What about working on hardware and software simultaneously? What kinds of challenges does that present?

AB: It’s really about raising awareness of what we’re doing and the problems we’re trying to solve, today and in the future. You have to remember that Meta is, first and foremost, a software company, and the time scale for software development is much shorter than that of hardware. Designing a chip can take two or three years; think about how much software changes in that period! So we’re focused on aligning our teams on the longevity of our investments and their long-term impact. One way we do this is by sharing corollaries and business-case plans that show how our work contributes to Meta and to our product groups. We also share how others have thought this through, how they’ve viewed their journey, and how we can accelerate our own, leveraging both internal learnings and those of the broader ecosystem.

Creating a new chip from scratch is a tremendous undertaking. How do you create a culture of innovation, specifically one that is able to tolerate, learn from, and even encourage failure?

AB: Creating a culture of innovation, creating the psychological safety required to take risks and make big bets, requires a combination of broad context and deep domain expertise.

A broad context is required to formulate the problems that are of greatest value to solve. Deep domain expertise is required to build the right teams, and to enable empowered decision making at the most competent layer in the organization. An innovation culture encourages broad ideation, and enables those ideas to take root, even if it radically changes how things have traditionally been done.

That’s one of the reasons to create startups within a larger organization — to create deeply technical teams with leaders who have the agency and autonomy to invest in experimentation and incubating new technologies, new processes, and new business models.

What’s the most surprising thing you’ve experienced during your first year at Meta?

AB: Coming in, I wondered whether we would be able to move fast in a consensus-driven culture.

One concrete example is the Grand Teton program, our next-generation platform for AI at scale that we are contributing to the Open Compute Project community. We needed to dramatically reduce the time to production by reducing the time it takes to validate such a complex product, which involves a tightly coordinated effort among hundreds of people. While there was plenty of well-reasoned skepticism around the ability to pivot fast at our scale, the team outperformed even our wildest expectations.

How did we do it? Two things: First, we set a very clear goal. Second, we made an upfront investment to get everybody to buy into that goal, answering questions such as: Why are we doing this? Why does it matter? Will it be recognized and valued? At more traditional companies, you don’t have to do as much upfront work; you just set a direction and go. Here, there’s more work on the front end. But once this incredibly talented workforce internalizes a goal, it outperforms with ingenuity, hard work, and extreme focus. It feels like once we lock onto a target with the team fully bought in, there is nothing we can’t achieve.

To hear more from Alexis Björlin, check out the Meta AI Infra @Scale event today, May 18, at 9am PT. Meta’s Engineering and Infrastructure teams are hosting the one-day virtual event, featuring a range of speakers from Meta who will unveil the latest AI infrastructure investments powering the company’s products and services. Meta’s Head of Infrastructure, Santosh Janardhan, will deliver opening and closing remarks and guide attendees through six technical presentations on some of Meta’s latest AI infra investments. The event will also feature a fireside chat in which Björlin joins three other Meta Infra leaders to discuss "The Future of AI Infra: The Opportunities and Challenges That Await Us On Our Journey."

To learn more and register for the event, click on the link.
