
Big Regional Network Outages

T-Mobile had a major network outage last week that cut off some voice calls and most texting for nearly a whole day. Neville Ray, the company’s president of technology, offered this explanation of the outage:

“The trigger event is known to be a leased fiber circuit failure from a third party provider in the Southeast. This is something that happens on every mobile network, so we’ve worked with our vendors to build redundancy and resiliency to make sure that these types of circuit failures don’t affect customers. This redundancy failed us and resulted in an overload situation that was then compounded by other factors. This overload resulted in an IP traffic storm that spread from the Southeast to create significant capacity issues across the IMS (IP Multimedia Subsystem) core network that supports VoLTE calls.”

In plain English, the electronics failed on a leased circuit, and then the backup circuit also failed. This then caused a cascade that brought down a large part of the T-Mobile network.

You may recall that something similar happened to CenturyLink about two years ago. At the time, the company blamed the outage on a bad circuit card in Denver that somehow cascaded to bring down a large swath of fiber networks in the west, including numerous 911 centers. Since that outage, there have been numerous regional outages, which is one of the reasons that Project THOR recently launched in Colorado: the cities in that region could no longer tolerate the recurring multi-hour or even day-long regional network outages.

Having electronics fail is a somewhat common event. This is particularly true on circuits provided by the big carriers, which tend to push the electronics to the max and keep equipment running to the last possible moment of its useful life. Anyone visiting a major telecom hub would likely be aghast at the age of some electronics still being used to transmit voice and data traffic.

I can recall two of my clients that have had similar experiences in the last few years. Each had a leased circuit fail and then saw the redundant path fail as well. In both cases, it turned out that the culprit was the provider of the leased circuits, which did not provide true redundancy. Although my clients had paid for redundancy, the carrier had sold them primary and backup circuits that shared some of the same electronics at key points in the network, and when those key points failed, their whole network went down.
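The shared-electronics problem is easy to check for if the carrier will disclose what each circuit actually passes through. The sketch below is only an illustration: the component names and both path inventories are hypothetical, and it assumes each circuit can be described as an ordered list of the sites and electronics it traverses.

```python
def shared_components(primary, backup):
    """Return the components that appear on both paths, in primary-path order."""
    backup_set = set(backup)
    return [c for c in primary if c in backup_set]


# Hypothetical inventories for a primary and a "redundant" backup circuit.
primary_path = ["site-A", "mux-1", "hub-East", "amp-7", "site-B"]
backup_path = ["site-A", "mux-2", "hub-East", "amp-9", "site-B"]

# The two endpoints are shared by definition, so exclude them from the check.
endpoints = {"site-A", "site-B"}
single_points = [
    c for c in shared_components(primary_path, backup_path) if c not in endpoints
]

print(single_points)  # anything listed here can take down both circuits at once
```

An empty list is what true redundancy would look like in this toy model. Any component that remains is a single point of failure that the backup circuit does not protect against.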

However, what is unusual about the two big carrier outages is that they somehow cascaded into big regional outages. That was largely unheard of a decade ago. This reminds me more of what we saw in the past in the power grid, when a power outage in one town could cascade over large areas. The power companies have been trying to remedy this situation by breaking the power grid into smaller regional networks and putting in protection so that failures can’t overwhelm the interfaces between regional networks. In essence, the power companies have been trying to introduce some of the good lessons learned over time by the big telecom companies.
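The difference that protection at the regional interfaces makes can be seen in a toy model. This is a deliberately simplified sketch, not a description of any carrier's or utility's network: the four hubs, their loads and capacities, and the rule that a failed hub's traffic is split evenly among its surviving neighbors are all assumptions made for illustration.

```python
from collections import deque


def simulate_cascade(loads, capacity, links, first_failure, isolate):
    """Return the set of hubs that end up failed.

    loads:    dict hub -> current load
    capacity: dict hub -> maximum load before the hub overloads
    links:    dict hub -> list of neighboring hubs
    isolate:  if True, a failed hub's traffic is dropped at the boundary
              instead of being pushed onto its neighbors
    """
    loads = dict(loads)  # work on a copy so the caller's data is untouched
    failed = {first_failure}
    queue = deque([first_failure])
    while queue:
        hub = queue.popleft()
        if isolate:
            continue  # traffic is shed, so neighbors see no extra load
        alive = [n for n in links[hub] if n not in failed]
        if not alive:
            continue
        share = loads[hub] / len(alive)
        for n in alive:
            loads[n] += share
        for n in alive:
            if loads[n] > capacity[n]:
                failed.add(n)
                queue.append(n)
    return failed


# Hypothetical chain of four hubs, each running at 80% of capacity.
hubs = ["A", "B", "C", "D"]
loads = {h: 80 for h in hubs}
capacity = {h: 100 for h in hubs}
links = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}

print(sorted(simulate_cascade(loads, capacity, links, "A", isolate=False)))
print(sorted(simulate_cascade(loads, capacity, links, "A", isolate=True)))
```

Without isolation, the failure of hub A overloads each neighbor in turn and all four hubs go down. With isolation, only A is lost. Real networks reroute traffic in far more complicated ways, but the model shows why a boundary that refuses to pass along an overload keeps a local failure local.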

But it seems that the big telecom carriers are going in the opposite direction. I talked to several retired telecom network engineers, and they all made the same guess about why we see big regional outages. The telecom network used to be composed of hundreds of regional hubs. Each hub had its own staff and operations, and it was physically impossible for a problem at one hub to take down a neighboring hub. The worst that would happen is that routes between hubs could go dark, but the problem never moved past the original hub.

The big telcos have all had massive layoffs over the last decade, and those purges have emptied the big companies of the technicians who built and understood the networks. Meanwhile, the companies are trying to find efficiencies to get by with smaller staffing. It appears that the efficiencies that have been found are to introduce network solutions that cover large areas or even the whole nation. This means that the identical software and technicians are now being used to control giant swaths of the network. This homogenization and central control of a network means that a failure in any one place in the network might cascade into a more significant problem if the centralized software and/or technicians react improperly to a local outage. It’s likely that the big outages we’re starting to routinely see are caused by a combination of human error and software failure.

A few decades ago, we saw somewhat regular power outages that affected multiple states. At the prodding of the government, the power companies undertook a nationwide effort to stop cascading outages, and in doing so, they effectively emulated the old telecom network world. They ended the ability for an electric grid to automatically interface with neighboring grids, and the last major power outage that wasn’t due to weather happened in the west in 2011.

I’ve seen absolutely no regulatory recognition of the major telecom outages we’ve been seeing. Without the FCC pushing the big telcos, it’s highly likely nothing will change. It’s frustrating to watch the telecom networks deteriorate when the electric companies got together and fixed their issues.

By Doug Dawson, President at CCG Consulting

Dawson has worked in the telecom industry since 1978 and has both a consulting and operational background. He and CCG specialize in helping clients launch new broadband markets, develop new products, and finance new ventures.
