“AI is no longer a minor workload. It is central to everything we do in infrastructure at Meta.”
At an organization as vast and complex as Meta, it’s easy to overstate the importance of any single executive. But it seems fair to say that there are few individuals as crucial to the company’s daily operations as Santosh Janardhan. Appointed head of infrastructure last year, Janardhan is Meta’s top engineering leader, responsible for developing and operating the hardware, network, software, and data centers that all of Meta’s services run on. That means keeping everything humming for the more than 3.7 billion people who use Meta’s family of apps each month, while simultaneously laying the groundwork and vision for a highly advanced AI infrastructure to power the company’s products of today and tomorrow. A native of Ahmedabad, India, who held top roles at PayPal and YouTube before joining Facebook in 2009, Janardhan recently sat down with Tech@ to discuss the leadership lessons that come from hockey-stick growth, coping with the pressure of operating at hyperscale, and how Meta is shaping its AI infrastructure for the future.
From an infrastructure perspective, do you think we’re at an inflection point when it comes to AI?
Santosh Janardhan: Over the last 25 to 30 years, there have been two major inflection points in the history of infrastructure. The first one was the internet browser. The browser came along, and there’s a whole slew of things that ended up happening. The second was the iPhone, which introduced a new range of mobile and app infrastructure. AI is the third wave. AI is already inflecting a whole range of use cases, and the rate of change we’re seeing is phenomenal. In five to 10 years, I believe we’re going to see a completely different world based on advances powered by the next generation of infrastructure — infrastructure specifically built for AI.
When it comes to infrastructure, one thing that most people miss is that AI is no longer a minor workload. It is central to everything we do in infrastructure at Meta. It’s a game changer that comes with great promise and equally great challenges.
How is this inflection point changing your approach to infrastructure?
SJ: The unique hardware and software needs required to support and develop AI are profoundly different from the basic compute technologies we’ve been familiar with for decades. It has inspired us to fundamentally rethink how we approach scale. It has also called for a customized approach to interconnect our infrastructure technologies from top to bottom. And you don’t have an option, by the way. If you treat AI as a workload, it will be nothing but a workload. It’ll actually hold you back. If you flip it around and make it part of your fabric, that’s when you can unlock AI’s potential and promise. When you position AI as integral to almost everything we do across the company, and then multiply that by the scale at which we operate, you realize that what we are building is truly world class and cutting edge. You build it at scale, you build it for billions of users, you build it for millions of servers. The magic multiplied by the scale means that it can positively impact billions of people. That’s true fun.
How vast is the physical infrastructure that you currently manage?
SJ: We have 21 owned and operated data center regions, each with multiple data center buildings that are approximately the size of four football fields placed end to end. These structures are full of servers. So we have millions of servers and hundreds of thousands of miles of fiber optics. We also have an edge network that helps extend our infrastructure to places where we don’t have data centers. And all of this is interconnected.
Of course, one of the challenges of operating at hyperscale is managing the inevitable outages and technical issues. The goal is to ensure that the people using our services never experience them. Day in and day out, billions of people use our products. And they’re not just using them nominally. This is how they connect with their loved ones, how they make their living, where they turn to communicate and coordinate services during emergencies. We’re the only hyperscaler that is not a public cloud.
If technical issues are inevitable, to what extent is your life a matter of constant triage, putting out one fire after another?
SJ: Think about a submarine. A typical submarine is made up of a number of hermetically sealed chambers. So if it is at the bottom of the ocean, and something pokes a hole in the side, it’s not bye-bye. You just seal off that one chamber, and the rest of the submarine keeps chugging along. That’s our operating principle. When you’re talking about millions of servers, it’s statistically impossible that some of them won’t fail. The trick is in making sure that we have application and service boundaries so that when there is a failure, we can detect it quickly — we are fast enough to route around it before the people using our services notice. Life goes on. We detect what’s wrong, fix it, yank it out of the fleet if necessary, and replace it. And you never knew. That’s the whole idea.
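The sealed-chamber idea Janardhan describes (detect a failure quickly, route around it, and pull the broken machine from the fleet) can be sketched in a few lines of Python. This is an illustrative toy and not Meta’s implementation: the `Fleet` class, the heartbeat mechanism, and the three-second timeout are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    name: str
    last_heartbeat: float  # timestamp in seconds of the last health report


@dataclass
class Fleet:
    """Toy fleet that only routes traffic to servers with a recent heartbeat."""
    heartbeat_timeout: float = 3.0  # hypothetical detection threshold, in seconds
    servers: dict[str, Server] = field(default_factory=dict)
    _next: int = 0  # round-robin counter

    def heartbeat(self, name: str, now: float) -> None:
        # A server reports that it is alive at time `now`.
        self.servers[name] = Server(name, now)

    def healthy(self, now: float) -> list[str]:
        # Servers that have been silent for too long are treated as failed
        # and sealed off from traffic.
        return sorted(
            s.name
            for s in self.servers.values()
            if now - s.last_heartbeat <= self.heartbeat_timeout
        )

    def route(self, now: float) -> str:
        # Pick the next healthy server in round-robin order.
        candidates = self.healthy(now)
        if not candidates:
            raise RuntimeError("no healthy servers available")
        choice = candidates[self._next % len(candidates)]
        self._next += 1
        return choice


fleet = Fleet(heartbeat_timeout=3.0)
for name in ("a", "b", "c"):
    fleet.heartbeat(name, now=0.0)

# Server "b" goes silent; "a" and "c" keep reporting.
fleet.heartbeat("a", now=5.0)
fleet.heartbeat("c", now=5.0)

print(fleet.healthy(now=6.0))  # ['a', 'c']
```

Requests sent through `route` after that point alternate between `a` and `c`, so a caller never sees the failed server. A production system would add active probing, retries, and capacity checks, none of which are modeled here.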
Tell us about your background. What led you to this role?
SJ: I trained as a mechanical engineer back in India and then did my master’s in information technology in Sydney, Australia. Frankly, I was out of money. I had always been fond of taking computers apart, so I started doing that, and it was lucrative. I realized that writing code and organizing databases came naturally to me, and in about 2000 came to Silicon Valley, where I ended up at this very small startup called PayPal. It was a special place and a fun journey, and not only because of the quality of people — the famous PayPal mafia. I literally came in and within a year we went from being relative nobodies to having millions of people on our site. Going through that teaches you so much about computing, about people, about leadership. After eBay bought PayPal, I ended up at another small startup; this one was called YouTube. Again, I was there very early, and it was another wild ride, another example of hockey-stick growth, but even steeper. Google bought us, and I stayed there a couple of years, and then in 2009, I ended up here. It was a crazy time. We were adding a million users a day to the site, but didn’t have enough servers to sustain that growth. Every day, we would work between midnight and 9 a.m. to resurrect servers that had either broken down or died. So I ended up doing this hockey-stick thing again, putting in a degree of discipline, helping to scale the company’s infrastructure from the early startup days to what it is today.
So PayPal, eBay, YouTube, Google, Facebook — my standing joke is that I’m probably more responsible than anyone for the bulk of humanity spending time online!
You mentioned that these rapid-growth experiences were a kind of crash course in leadership. What were some of the key lessons you learned?
SJ: The higher your vantage point and the bigger your charter, the more important it is to do a few things really well. You need to be able to do the trust-and-verify thing, where you empower people to make decisions and work independently. But to do that, you need to have a viable understanding of exactly what it is you’re managing. In that way, I am blessed: I’ve been here 13 years, helped build a lot of our infrastructure, and know a fair bit about it. I am still part of our on-call rotation, monitoring our infrastructure. I am still what we call a CMOC (crisis manager on call) for when things go wrong. Why? First, it keeps me fresh. And second, it gives me a line of sight and a working relationship with the people who are on the ground. You need to have a pulse on what’s happening, and you need to be accessible and approachable and have conversations with people who do this job day in and day out. The general who always commands from HQ is not as respected as the general who’s willing to go into battle.
You also need to have a strong leadership bench that is diverse and complementary. Most people hesitate to hire people who are better than they are. Most people hesitate to hire people who don’t agree with what they say. And most people do not encourage confrontation within their leadership teams. But confrontation can be so productive if you manage it right. And finally, you need to be very comfortable letting your folks have the limelight when things are going well — while you take responsibility when they are not. That’s how you create trust.
That seems very much in line with Meta’s no-blame culture.
SJ: A predecessor in this job, Jay Parikh, introduced what is called the SEV review process. On a weekly basis, the engineering team gets together and reviews all of the site events, or SEVs, that occurred that week: basically all the technical things that went wrong or caused an issue. This is what caused it. This was the customer implication. We are focused on identifying what we can do to get better so it doesn’t happen again. Rinse, repeat. Rinse, repeat. The important thing is that this is never about blaming the person who might have caused it. It’s about using these as teaching moments and identifying paths to prevention, making our services resilient to future issues.
Are there any SEV reviews that are particularly memorable?
SJ: There’s one incident, probably five, six years ago. Someone new, who was still in boot camp, caused a site issue. He came to the SEV review, but he didn’t know how to present to the group. So he said, “I’m not sure how to proceed.” I was joking with him and said, “First you need to apologize to everyone in the room. Then you can proceed with your presentation.” And the poor guy starts apologizing. I was mortified. I jumped in and said, “Hold on — pause. How many people in this room have caused a site issue? Raise your hand.” And half the room, including me, raises their hands. I remember telling him, “See, this is not a career mistake. This is something we’re learning from. Please, please, please: Never apologize for this ever again.” That has remained the defining moment for the SEV review.
All companies try to learn from mistakes. How is Meta’s SEV review process different from what you’ve seen elsewhere in the industry?
SJ: The focus is on figuring out what went wrong and fixing it, not placing blame on an individual. That is really unique. I’ve been in places where people tremble before presenting a problem to senior leadership, because they know they’ll be shouted at or they’ll be told, “How could you have let this happen?” That can really be a scarring experience. The second thing is that at almost every other company I’ve been at, if someone does something wrong, they fix it by saying, “Next time you do this, get approval from your director.” So you end up adding layers of process. That is almost never helpful.
How do you balance maintaining the core functionality of the existing business with laying the groundwork for the future of the company? That seems like a tricky, even fraught, line to navigate.
SJ: Think about a stock portfolio. You have your value stocks, you have your stocks that have more risk, and the most important thing is how you balance them. As a leader, you have to constantly ensure that you’re devoting the correct portion of your time, your attention, and your people to the right tasks. These are different skill sets, so in addition to having the right balance, you need to make sure the right people are leading the right things.
As we move into this new era of AI, what kind of demand is that putting on the infrastructure that you manage and how are you planning for that?
SJ: AI represents a baseline shift for the entire industry. For the longest time, we have tried our best to commoditize infrastructure. Take servers — they’re like LEGO blocks. I can deploy them linearly at scale. I know when they will break. I know how to fix them. But AI changes that. AI changes that because the generation-over-generation improvements that are happening in software are taking place not every four years but, like, every six to nine months. Your server footprint needs to be redesigned just as rapidly. It is breaking a lot of the fundamental guidance the industry has had.
Are you saying that innovation in AI goes far beyond AI? That it fundamentally alters your entire approach to infrastructure — whether we’re talking about cutting-edge technologies or just keeping the core functions of the company running as efficiently as possible?
SJ: Yes. Exactly. It is not just about deploying AI but about using it to do our work better. Server placement, heat dissipation, and failure prediction are all examples where we are leveraging it.
Finally, billions of people around the world rely on your infrastructure to communicate, work, play — everything. How do you cope with the pressure of keeping everything running smoothly?
SJ: Pressure only gets to you if you think about your job as a task, as a chore. If you think about it as a privilege, as something that very few people in the world have the opportunity to do, your approach changes. You approach your work very differently. It becomes a question of managing your workload, making sure you take care of yourself, etc. You figure out your coping mechanisms, whatever works for you. I work out a lot. I do a lot of biking and running uphill. Why? Because I get so out of breath that I cannot think of anything else. It focuses me and helps me recharge. Each one of us has to figure out the outlet that works the best.
To hear more from Meta’s Head of Infrastructure, check out the Meta AI Infra @Scale event on May 18, 2023. Meta’s Engineering and Infrastructure teams will be hosting the one-day virtual event featuring a range of speakers who will unveil the latest AI infrastructure investments powering Meta’s products and services. Janardhan will deliver opening and closing remarks and guide attendees through six exciting technical presentations on some of Meta’s latest AI infra investments. The event will also feature a fireside chat with a panel of Meta Infra leaders, called “The Future of AI Infra: The Opportunities and Challenges That Await Us on Our Journey.” To learn more and register for the event, click on the link.