It was only a few weeks back, in July of this year, when I remarked that an Akamai report of an outage was unusual for this industry. It was unusual in that it was informative in detailing their understanding of the root cause of the problem, describing the response that they performed to rectify the immediate problem, the measures being undertaken to prevent a recurrence of this issue, and the longer-term measures to improve the monitoring and alerting processes used within their platform.
At the time, I noted that it would be a positive step forward for this industry if Akamai’s outage report was not unusual in any way. It would be good if all service providers spent the time and effort, post-rectification of an operational problem, to produce such outage reports as a matter of standard operating procedure. It’s not about apportioning blame or admitting liability. It’s all about positioning these services as the essential foundation of our digital environment and stressing the benefit of adopting a shared culture of open disclosure and constant improvement as a way of improving the robustness of all these services. It is about appreciating that these services are very much within the sphere of public safety, and their operation should be managed in the same way. We should all be in a position to improve the robustness of these services by appreciating how vulnerabilities can lead to cascading failures.
On October 4th, Facebook managed to achieve one of the more impactful outages in the entire history of the Internet, assuming that the metric of “impact” is how many users one can annoy with a single outage. In Facebook’s case, the 6-hour outage affected the services it provides to some 3 billion users, if we can believe Facebook’s marketing hype.
So, what did we learn about this outage? What was the root cause? What were the short-term mitigations that they put in place? Why did it take more than 6 hours to restore service? (Yes, for a configuration change that presumably had a back-out plan, that’s an impressively long time!) What are they doing now to ensure that this situation won’t recur? What can we as an industry learn from this outage to ensure that we can avoid a recurrence of such a widespread outage in other important and popular service platforms?
These are all good questions, and if we are looking for answers, then Facebook’s outage report is not exactly a stellar contribution. It’s short enough for me to reproduce in its entirety here:
To all the people and businesses around the world who depend on us, we are sorry for the inconvenience caused by today’s outage across our platforms. We’ve been working as hard as we can to restore access, and our systems are now back up and running. The underlying cause of this outage also impacted many of the internal tools and systems we use in our day-to-day operations, complicating our attempts to quickly diagnose and resolve the problem.
Our engineering teams have learned that configuration changes on the backbone routers that coordinate network traffic between our data centers caused issues that interrupted this communication. This disruption to network traffic had a cascading effect on the way our data centers communicate, bringing our services to a halt.
Our services are now back online and we’re actively working to fully return them to regular operations. We want to make clear that there was no malicious activity behind this outage—its root cause was a faulty configuration change on our end. We also have no evidence that user data was compromised as a result of this downtime. (Updated on Oct. 5, 2021, to reflect the latest information)
People and businesses around the world rely on us every day to stay connected. We understand the impact that outages like these have on people’s lives, as well as our responsibility to keep people informed about disruptions to our services. We apologize to all those affected, and we’re working to understand more about what happened today so we can continue to make our infrastructure more resilient.
https://engineering.fb.com/2021/10/04/networking-traffic/outage/
Yes, they are “sorry.” Well, they could hardly say anything else, could they?
Yes, they did this to themselves. Again, nothing unusual here, in that configuration changes are the most common cause of service faults. That’s why most communications service providers impose a configuration freeze over important periods, such as “Black Friday” in the US or the new year holiday period, and that’s why such freeze periods are typically the most stable periods of the entire year! But in Facebook’s case, whatever pre-installation tests they performed, if indeed they did any at all, failed to identify the risk in the change process. I guess the engineering team was still applying Mark Zuckerberg’s operational mantra of moving fast and breaking things, and doing so with a little too much zeal.
And “they are working to understand more about what happened today so we can continue to make our infrastructure more resilient.” No details.
I must admit this report is a state-of-the-art example of a vacuous statement that takes four paragraphs to be largely uninformative.
NBC News reported that: “A Facebook employee said it appeared to be a problem with the Domain Name System, the ‘phone book’ of the internet, which computers use to look up individual websites. ‘I wish I knew. No internal tooling, DNS seems totally borked. Everyone is just sort of standing around,’ the source said. ‘No reason at this point to suspect anything malicious, but the outage is affecting pretty much everything. Can’t even access third-party tools.’”
It seems sad that this NBC report was far more informative than the corporate blather that Facebook posted as their statement from engineering.
What really did happen, and what can we learn from this outage?
For this, I had to turn to Cloudflare!
They posted an informative description of what they observed, using only a view from the “outside” of Facebook. Cloudflare explained that Facebook managed to withdraw BGP routes to the authoritative name servers for the facebook.com domain. Now in the DNS, this would normally not be a problem, provided that the interruption to the authoritative servers is relatively short. All DNS information is cached in recursive resolvers, including name server information. If the DNS cache time to live (TTL) is long (and by “long” I mean a day or longer), then it’s likely that only a small proportion of recursive resolvers would have their cached values expire over a short (order of seconds) outage. Any user who used multiple diverse recursive resolvers would not notice the interruption at all. After all, the Facebook domain names are widely used (remember those 3 billion Facebook users?), so it is probably a widely cached name. So caching would help in a “normal” case.
At this point, the second factor in this outage kicked in. Facebook uses short TTLs in their DNS, so the effect of a withdrawal of reachability of their authoritative name servers was felt almost immediately. As the locally cached entries timed out, they could not be refreshed from the now-uncontactable authoritative servers, and the name disappeared from the Internet’s recursive resolvers.
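To make the TTL arithmetic concrete, here’s a minimal sketch in Python of a recursive resolver’s cache. The TTL and outage durations are illustrative assumptions on my part, not Facebook’s actual zone settings: with a day-long TTL a brief loss of the authoritative servers goes unnoticed by this resolver’s clients, while with a 60-second TTL the name is effectively gone within a minute or two.

```python
# Minimal sketch of a recursive resolver's cache, illustrating how record TTLs
# determine how quickly an authoritative-server outage becomes visible.
# The TTL and outage values below are illustrative assumptions only.

class ResolverCache:
    def __init__(self):
        self._cache = {}  # name -> (answer, expiry_time)

    def store(self, name, answer, ttl, now):
        self._cache[name] = (answer, now + ttl)

    def lookup(self, name, authoritative_reachable, now):
        entry = self._cache.get(name)
        if entry and now < entry[1]:
            return entry[0]        # answered from cache, outage invisible
        if authoritative_reachable:
            return "fresh answer"  # cache refreshed from the authoritative server
        return "SERVFAIL"          # cache expired and no authoritative server reachable


def outage_visible(ttl_seconds, outage_seconds):
    """True if a client querying at the end of the outage sees a failure."""
    cache = ResolverCache()
    cache.store("facebook.com", "192.0.2.1", ttl=ttl_seconds, now=0)
    answer = cache.lookup("facebook.com", authoritative_reachable=False,
                          now=outage_seconds)
    return answer == "SERVFAIL"


print(outage_visible(ttl_seconds=60, outage_seconds=300))     # True: short TTL exposes the outage
print(outage_visible(ttl_seconds=86400, outage_seconds=300))  # False: long TTL masks a brief outage
```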
But this form of disappearance in the DNS is one that raises the ire of the DNS gods. In this situation, where the name servers all go offline, the result of a query is not an NXDOMAIN response code (“I’m sorry, but that name does not exist in the DNS, go away!”) but a far more indeterminate timeout with no response whatsoever. A recursive resolver will retry the query using all the name server IP addresses stored in the parent zone (.com in this case), and then return the SERVFAIL response code (which means something like: “I couldn’t resolve this name, but maybe it’s me, so you might want to try other resolvers before giving up!”). So, the client’s stub resolver then asks the same question of all the other recursive resolvers with which it has been configured. As the Cloudflare post points out: “So now, because Facebook and their sites are so big, we have DNS resolvers worldwide handling 30x more queries than usual and potentially causing latency and timeout issues to other platforms.”
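As a rough illustration of that amplification, here’s a small back-of-the-envelope sketch in Python. The counts of configured recursive resolvers, delegated name server addresses, and retry attempts are assumptions chosen purely for illustration; Cloudflare’s 30x figure is an observed aggregate, not something derived from these numbers.

```python
# Back-of-the-envelope sketch of query amplification when a name's
# authoritative servers are unreachable. All counts below are assumptions
# for illustration, not measured values from the Facebook outage.

def queries_per_failed_lookup(stub_resolvers, ns_addresses, retries_per_address):
    """Total queries fired at unreachable authoritative servers for one lookup.

    Each recursive resolver tries every name server address it learned from the
    parent zone (.com), retrying each before giving up with SERVFAIL; the stub
    resolver then repeats the whole exercise with its next configured resolver.
    """
    per_resolver = ns_addresses * retries_per_address
    return stub_resolvers * per_resolver


healthy = 1  # normally a single cached or freshly fetched answer suffices
failed = queries_per_failed_lookup(stub_resolvers=2, ns_addresses=4,
                                   retries_per_address=2)
print(f"healthy lookup: {healthy} query, failed lookup: {failed} queries "
      f"({failed / healthy:.0f}x amplification)")
```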
Then the third factor kicked in. Once the domain name facebook.com and all the names in this space effectively disappeared from the Internet, their own internal command and control tools also disappeared. Whether this was a consequence of the DNS issue or the original BGP route withdrawal isn’t possible to determine from here, but the result was that they lost control of the service platform. And this then impacted the ability of their various data centers to exchange traffic, which further exacerbated the problem. As Facebook’s note admitted, the outage “impacted many of the internal tools and systems we use in our day-to-day operations, complicating our attempts to quickly diagnose and resolve the problem.” Other reports on Twitter were more fanciful, including a report that the Facebook office buildings defaulted to a locked mode, preventing staff from entering the facilities, presumably to work on the outage! At this point, Facebook’s service platform was presumably running solo, as no one could get into the platform to chase down the issue and rectify it directly. They were evidently locked out!
There are numerous lessons to learn from this outage, so let’s look at a few:
If you want your customers, your investors, your regulators, and the broader community to have confidence in you and assurance that you are doing an effective job, you need to be open and honest about what you are doing and why. The entire structure of public corporate entities was intended to reinforce that assurance by insisting on full and frank public disclosure of the corporation’s actions. However, these are not rules that seem to apply to Facebook.
Now I could be surprised if, in the coming days, Facebook released a more comprehensive analysis of this outage, including a root cause analysis and the factors that led to the cascading failures. It could explain why efforts to rectify the immediate failure took an amazingly long 6 hours. It could describe the measures they took to restore their service and the longer-term actions they will undertake to avoid similar failure scenarios in the future. It could detail the risk profile that guides their engineering design decisions and how this affects service resilience. And more.
Yes, I could be surprised if this were to happen.
But, between you and me, I have absolutely no such expectations. And, I suspect, neither do you!
Geoff, good summary! FYI, Facebook DID publish a post with a bit more technical information about what happened and some of the issues they had in returning their networks to service:
https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/
Thanks for the pointer, Dan. As Santosh points out in his followup: "Every failure like this is an opportunity to learn and get better, and there’s plenty for us to learn from this one. After every issue, small and large, we do an extensive review process to understand how we can make our systems more resilient. That process is already underway." I suppose I am saying that the "we" who are trying to make our digital environment more resilient is more than just Facebook - it's all of us. We need to look at other industries, such as the aviation or nuclear power industries, which have gone through a sometimes painful process of understanding that such post-event analysis can allow the broader industry to learn from these incidents. I believe that the Internet is now so pervasive and our reliance on it so critical that we are talking about topics that can be seen as matters of public safety and security.