A faulty configuration change led Facebook, Instagram and WhatsApp to all go down for at least six hours on Monday, October 4. Those trying to reach the social media platforms were met with browsers and apps displaying DNS errors on connection attempts.
Facebook’s routing prefixes suddenly disappeared from the internet’s Border Gateway Protocol (BGP) tables. BGP is the routing protocol that holds the internet together, making it possible for networks around the world to exchange traffic with one another.
Because Facebook’s domain and its authoritative DNS servers are hosted within the company’s own routing prefixes, withdrawing those BGP prefixes left no one able to reach the IP addresses or the services running on top of them.
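The effect of a prefix withdrawal can be sketched in a few lines. The following is a deliberately simplified, hypothetical model (the `RoutingTable` class and the example prefix and address are illustrative, not Facebook's actual configuration): a routing table is a set of announced prefixes, and once a prefix is withdrawn, every address inside it, including any DNS server hosted there, becomes unreachable.

```python
import ipaddress

# Hypothetical, simplified illustration of a BGP-style routing table:
# reachability is just "is there an announced prefix covering this address?"
class RoutingTable:
    def __init__(self):
        self.prefixes = set()

    def announce(self, prefix):
        self.prefixes.add(ipaddress.ip_network(prefix))

    def withdraw(self, prefix):
        self.prefixes.discard(ipaddress.ip_network(prefix))

    def is_reachable(self, address):
        ip = ipaddress.ip_address(address)
        return any(ip in net for net in self.prefixes)

table = RoutingTable()
table.announce("129.134.30.0/24")            # illustrative prefix hosting a nameserver

print(table.is_reachable("129.134.30.12"))   # True: a route exists

# Withdrawing the prefix makes every address inside it unreachable,
# including the authoritative nameservers hosted there -- so DNS
# resolution for the domain fails everywhere.
table.withdraw("129.134.30.0/24")
print(table.is_reachable("129.134.30.12"))   # False: no route, lookups fail
```

Real BGP involves path selection, peering, and propagation delays, but the core failure mode is this simple: no announced prefix, no route, no DNS.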
During a routine maintenance job, a command was issued with the intention of assessing the availability of global backbone capacity. Instead, it unintentionally took down all the backbone connections, effectively disconnecting Facebook’s data centers from one another and triggering the worldwide outage.
According to a Facebook blog post, its systems are designed to audit commands like this to prevent mistakes, but a bug in the audit tool kept it from stopping the command.
The total loss of connectivity made things worse for Facebook. Engineers trying to figure out what went wrong could not access the data centers through normal means because the networks were down, and the loss of DNS broke many of the internal tools Facebook normally uses to investigate and resolve outages like this.
Engineers were sent onsite to the data centers to debug the issue and restart the systems, but this took time because the data centers are designed with high levels of physical and system security in mind.
Once backbone network connectivity was restored, Facebook feared that a surge in traffic, reversing the earlier drop in power consumption, could put everything from electrical systems to caches at risk. Services therefore had to be flipped back on slowly.
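The cautious restoration described above can be sketched as a staged restart: rather than re-enabling everything at once, services come back in small batches, with a pause between batches for load and power metrics to stabilize. This is a hypothetical illustration of the general pattern; the function name, batch sizes, and the list of services are invented for the example, not Facebook's actual tooling.

```python
import time

# Hypothetical sketch of a staged ramp-up: services are re-enabled in
# batches so that traffic -- and the power draw it causes -- climbs
# gradually instead of spiking all at once.
def staged_restart(services, batch_size, delay_s=0.0):
    restarted = []
    for i in range(0, len(services), batch_size):
        batch = services[i:i + batch_size]
        # In a real system, this is where traffic would be re-enabled for
        # the batch, followed by a wait for electrical-load and cache
        # metrics to settle before the next batch is brought up.
        restarted.extend(batch)
        time.sleep(delay_s)
    return restarted

services = ["feed", "messaging", "photos", "ads", "search", "video"]
print(staged_restart(services, batch_size=2))
```

The tradeoff is recovery speed versus safety: smaller batches and longer delays mean a slower recovery but a gentler ramp on the electrical systems and caches.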
While Facebook continually stress tests its systems, the company had never tested its global backbone being taken offline. “In the end, our services came back up relatively quickly without any further systemwide failures. And while we’ve never previously run a storm that simulated our global backbone being taken offline, we’ll certainly be looking for ways to simulate events like this moving forward,” said Santosh Janardhan, VP of Infrastructure at Facebook, in a blog post.
“We’ve done extensive work hardening our systems to prevent unauthorized access, and it was interesting to see how that hardening slowed us down as we tried to recover from an outage caused not by malicious activity, but an error of our own making,” he said.
“I believe a tradeoff like this is worth it — greatly increased day-to-day security vs. a slower recovery from a hopefully rare event like this. From here on out, our job is to strengthen our testing, drills, and overall resilience to make sure events like this happen as rarely as possible,” said Janardhan.