In 2008, Meta (then Facebook) was nowhere near the size it is today. Before we built our first data center in Prineville, Oregon, and founded the Open Compute Project, we did what many other companies that need data center capacity do — we leased or rented data center space from colocation providers. This sort of arrangement works fine unless the market experiences a major impact … something like the 2008 financial crisis.
The financial crisis hit the data center business right at a time when we were in the middle of a negotiation with one of the big colocation providers. They didn’t want to commit to all this spending until they had a better idea of what 2009 would be like. This was totally understandable from a business perspective, but it put us, as a potential customer, in a rather uncomfortable position.
We ended up making smaller deals, but they weren’t an efficient way to get what we ultimately wanted: a way to keep pace with how rapidly Facebook was growing. On the Infrastructure team, we have always wanted an infrastructure that facilitates the growth of the business rather than holds it back. That’s not easy when your plan for the next two years effectively gets thrown in the trash.
That was the moment when we really asked what we could do to ensure that the company had the infrastructure it would need going forward. The only answer was that we had to take control of our data centers, which meant designing and building our own.
In 2009, we started looking at what it would really mean to build and operate our own data centers, and what our goals should be. We knew we wanted the most efficient data center and server ecosystem possible. To do that, we decided to create an infrastructure that was open and modular, with disaggregated hardware, and software that is resilient and portable. Having disaggregated hardware — breaking down traditional data center technologies into their core components — makes it easy and efficient to upgrade our hardware as new technologies become available. And having software that can move around and be resilient during outages allows us to minimize the number of redundant systems and build less physical infrastructure. It means the data centers will be less expensive to build and operate, and more efficient.
I had previously designed and constructed data centers for Exodus Communications and Yahoo, so I knew what we needed to do and who I wanted to work with on this for Meta: Jay Park, a brilliant electrical engineer I had worked with at Exodus, whom I ultimately brought on to lead the Data Center Design & Engineering team. Jay joined the team in early 2009, and we spent those first six months trying to decide exactly what the scope for this project would be. We had an idea that there is a symbiosis between the data center itself and the hardware inside it, so we stood up the data center and hardware development teams at the same time.
One thing to remember when designing a data center for high availability (operating with limited to no downtime) is that less is often more. Less equipment can yield higher reliability because every component you eliminate is one less potential point of failure. Jay’s view of the electrical system was the same: we wanted to limit the number of times we convert electricity from one voltage to another, because each conversion results in some loss of efficiency. Every time you make a conversion, whether from utility voltage to medium voltage or from medium voltage to the voltage used inside the data center, some energy is lost as heat from the transformer. It’s inefficient, and efficiency has always been a core objective of the Infrastructure team.
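To make the cost of stacked conversions concrete, here is a minimal sketch of the arithmetic, assuming each stage passes along a fixed fraction of its input power so that the end-to-end efficiency is the product of the stage efficiencies. The function name and the per-stage figures are hypothetical, chosen only for illustration; they are not measurements from our facilities.

```python
from math import prod


def end_to_end_efficiency(stage_efficiencies):
    """Return the fraction of input power left after a chain of conversions.

    Each entry is the efficiency of one conversion stage, in the range (0, 1].
    """
    for eff in stage_efficiencies:
        if not 0.0 < eff <= 1.0:
            raise ValueError(f"stage efficiency must be in (0, 1], got {eff}")
    return prod(stage_efficiencies)


# Hypothetical per-stage efficiencies, for illustration only.
many_conversions = [0.98, 0.98, 0.95, 0.95]
fewer_conversions = [0.98, 0.95]

for label, chain in (("four stages", many_conversions),
                     ("two stages", fewer_conversions)):
    delivered = end_to_end_efficiency(chain)
    print(f"{label}: {delivered:.1%} delivered, {1 - delivered:.1%} lost as heat")
```

Because the losses multiply, removing a stage helps twice: less energy is wasted in the conversion itself, and there is less waste heat for the cooling system to remove.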
The challenge was how to deal with these transitions in the electrical system, plus the fact that we have to convert from AC to DC. You need AC voltage driving your servers, but you also need a DC battery of some kind to power things in case of an outage. Some big data centers use very large battery banks that serve the whole facility. In our case, we opted to keep the batteries inside the same racks that the servers are in. The catch, however, was that there weren’t any server power supplies available that could switch from AC to the needed DC voltage from the batteries.
Then Jay had an epiphany. He told me he was lying in bed, thinking about our need for this shift from AC to DC, when the idea hit. He jumped up and all he had was a napkin by his bedside. He scratched down what he thought this electrical circuit would look like and then went to the hardware team the next day and asked if they could make it work.
That was the origin of our highly efficient electrical system, which uses fewer transitions, and the idea that the servers themselves could toggle between AC and DC reasonably simply and quickly. Once this piece of the puzzle was in place, it laid the groundwork for us to start designing and building our very first data center in Prineville.
Once we lined up on the strategy to limit the electrical conversions in the system, we sought the most efficient way to remove the heat that’s generated when conversions are necessary. That meant thinking about things like making the servers a bit taller than usual, allowing for bigger heat sinks, and having efficient air flow through the data center itself.
We knew we wanted to avoid large-scale mechanical cooling (e.g., air- or water-cooled chillers) because it is very energy intensive and would have significantly reduced the overall electrical efficiency of the data center. One idea was to run outside air through the data center and let that be part of the cooling medium. Instead of a traditional air conditioning system, then, we’d have one that uses outside air and direct evaporative cooling to cool the servers and carry the heat they generate out of the building entirely.
What’s more, today we use an indirect cooling system in locations with less-than-ideal environmental conditions (e.g., extreme humidity or high dust levels) that could interfere with direct cooling. Not only do these indirect cooling systems protect our servers and equipment, but they’re also more energy- and water-efficient than traditional air conditioners or water chillers. Strategies like this have allowed us to build data centers that use at least 80 percent less water than typical data centers.
Optimization and sustainability
In the 10 years since we built our first data center in Prineville, the fundamental concepts of our original design have remained the same. But we’re continually making optimizations. Most significantly, we’ve added power and cooling capacity to handle our increasing network requirements.
In 2018, for example, we introduced our StatePoint Liquid Cooling (SPLC) system into our data centers. SPLC is a first-of-its-kind liquid cooling system that is energy- and water-efficient and allows us to build new data centers in areas where direct cooling isn’t a viable solution. It is probably the single most significant change to our original design and will continue to influence future data center designs.
The original focus on minimizing electrical voltage transitions and determining how best to cool our servers is still at the core of our data centers. It’s why Facebook’s facilities are some of the most efficient in the world. On average, our data centers use 32 percent less energy and 80 percent less water than the industry standard.
Software plays an important role in all of this as well. As I mentioned, we knew from the start that software resiliency would play a big part in our data centers’ efficiency. Take my word for it when I say that, back in 2009, the software couldn’t do any of the things it can do today. The strides we’ve made in capability and resiliency on the software side are unbelievable. For example, today we employ a series of software tools that help our engineers detect, diagnose, remediate, and repair Peripheral Component Interconnect Express (PCIe) hardware faults in our data centers.
If I were to characterize the differences between how we thought about our data center program and how more traditional industries do, I think we were much more calculating about trying to assess risk versus the reward to efficiency. And risk can be mitigated by software being more resilient. Software optimizations allow us, for example, to move the server workload away from one data center to another in an emergency without interrupting any of our services.
10 years ahead
Now that we have 10 years of history behind us, we’re thinking about the next 10 years and beyond. We share our designs, motherboards, schematics, and more through the Open Compute Project in the hope of spurring collective innovation. In 2021, we furthered our disaggregation efforts by working with new chipmakers and OEMs to expand the open hardware in our data centers. Open hardware drives innovation, and working with more vendors means more opportunity to develop next-generation hardware to support current and emerging features across Meta’s family of technologies.
As I’m writing this, we have 48 active buildings and another 47 buildings under construction, so we’re going to have more than 70 buildings in the near future that all look like our original concept. But they also need to stay relevant and in line with future trends — particularly when it comes to sustainability.
In 2020, we reached net zero emissions in our direct operations. Our global operations are now supported by 100 percent renewable energy. As of today, we have contracted for over 7 gigawatts of new wind and solar energy, all on the same grids as the data centers they support.
The data centers we build in the future will continue this trend. We think about sustainability at every step, from the energy sources that power them all the way down to the design and construction of the data centers themselves. For example, we have set ambitious goals to reach net zero emissions for our value chain and be water positive by 2030, meaning we will restore more water into local watersheds than our data centers consume. In building our newest data centers we’ve been able to divert, on average, 80 percent of our waste footprint away from landfills by reusing and recycling materials.
There is a lot of activity in the data center and construction industries today, which puts pressure on us to find the right sites and partners. It also means we need to create more flexible site selection and construction processes.
All this effort also involves treating our vendors and contractors more as partners. We can’t just make this about dollars. We have to make it about performance. We have to make it about driving best practices and continuous improvement.
But that’s not the way the construction industry typically works. So, we’ve had to bring a lot of our own ideas about running operations and making improvements and impress them on the companies we work with.
Moving into the data center arena was never going to be easy. But I think we’ve ended up with an amazing program at a scale that I never, ever would have imagined. And we’re always being asked to do more. That’s the business challenge, and it’s probably one of the main things that keep me and my team coming in to work every day. We have this enormous challenge ahead of us: to keep building at an unbelievably massive scale.