Big Internet Outages - There Is No Such Thing as a Routine Software Upgrade

Last year I wrote about big disruptive outages on the T-Mobile and the CenturyLink networks. Those outages demonstrate how a single circuit failure on a transport route or a single software error in a data center can spread quickly and cause big outages. I join a lot of the industry in blaming the spread of these outages on the concentration and centralization of networks where the nationwide routing of big networks is now controlled by only a handful of technicians in a few locations.

In early October, we saw the granddaddy of all network outages when Facebook, WhatsApp, and Instagram all crashed for much of a day. This was a colossal crash because the Facebook apps have billions of users worldwide. It’s easy to think of Facebook as just a social media company, but the suite of apps is far more than that. Much of the third world uses WhatsApp instead of text messaging to communicate. Small businesses all over the world communicate with customers through Facebook and WhatsApp. The Facebook crash also affected many other apps. Anybody who automatically logs into other apps using Facebook login credentials was also locked out since Facebook couldn’t verify their credentials.

Facebook blamed the outage on what it called routine software maintenance. I had to laugh the second I saw that announcement and the word ‘routine’. Facebook would have been well advised to have hired a few grizzled telecom technicians when it set up its data centers. We learned in the telecom industry many decades ago that there is no such thing as a routine software upgrade.

The telecom industry has long been at the mercy of telecom vendors that rush hardware and software into the real world without fully testing them. An ISP comes to expect issues and glitches when it is taking part in a technology beta test. But during the heyday of the telecom industry throughout the 80s and 90s, practically every system small telcos operated was in beta test mode. Technology was changing quickly, and vendors rushed new and improved features onto the market without first testing them in real-life networks. The telcos and their end-user customers were the guinea pigs for vendor testing.

I feel bad for the Facebook technician who introduced the software problem that crashed the network. But I can’t blame him for making a mistake—I blame Facebook for not having basic protocols in place that would have made it impossible for the technician to crash the network.

I bet that Facebook has world-class physical security in its data centers. I’m sure the company has redundant fiber transport, layers of physical security to keep out intruders, and fire suppression systems to limit the damage if something goes wrong. But Facebook didn’t learn the basic Telecom 101 lesson that any general manager of a small telco or cable company could have told them. The biggest danger to your network is not from physical damage—that happens only rarely. The biggest danger is from software upgrades.

We learned in the telecom industry to never trust vendor software upgrades. Instead, we implemented protocols where we created a test lab and then tried each software upgrade on a tiny piece of the network, so that a faulty upgrade was caught before it was inflicted on the whole customer base. (The even better lesson most of us learned was to let the telcos with the smartest technicians in the state tackle the upgrade first before the rest of us considered it.)
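For readers who think in code, the test-first protocol described above can be sketched as a staged rollout. This is only an illustration of the general idea, not a description of how Facebook or any telco actually deploys software. The function name `staged_rollout` and the three callbacks (`apply_upgrade`, `is_healthy`, `rollback`) are hypothetical placeholders for whatever tooling a real network would use.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    stages: Sequence[Sequence[str]],
    apply_upgrade: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> bool:
    """Upgrade nodes one stage at a time (test lab first, then a small
    slice of the network, then the rest). If any node in a stage fails
    its health check, undo every upgrade made so far and stop.

    All three callbacks are hypothetical hooks supplied by the caller.
    """
    upgraded: List[str] = []
    for stage in stages:
        for node in stage:
            apply_upgrade(node)
            upgraded.append(node)
        # Verify the whole stage before touching the next, larger one.
        if not all(is_healthy(node) for node in stage):
            for node in reversed(upgraded):
                rollback(node)
            return False
    return True
```

A caller would pass the stages in order of increasing blast radius, for example `[["lab"], ["edge-1"], ["edge-2", "core-1"]]` (made-up node names), so that a bad upgrade fails in the lab and never reaches customers.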

Shame on Facebook for having a network where a technician can implement a software change directly without first testing it and verifying it a dozen times. It was inevitable that a network without a prudent upgrade and testing process would eventually suffer the big crash we saw. It’s not too late for Facebook: there are still a few telco old-timers around who could teach them to do this right.

By Doug Dawson, President at CCG Consulting

Dawson has worked in the telecom industry since 1978 and has both a consulting and operational background. He and CCG specialize in helping clients launch new broadband markets, develop new products, and finance new ventures.
