Ookla recently published an interesting article that emphasizes what I have been telling folks for a long time. Not that many years ago, telephone and broadband networks were structured in such a way that most outages were local events. A fiber cut might kill service to a neighborhood; an electronics failure might kill service to a larger area, but for the most part, outages were contained within a discrete and local area.
There were exceptions. Rural areas have long been susceptible to cuts in the fiber that provides their connection to the Internet backbone. Years ago, I worked with Cook County, Minnesota, which would lose voice and broadband every time there was a cut in the single fiber between Minneapolis and northern Minnesota that supported the area. A public-private partnership was established to build the THOR network to address backhaul failures in a significant portion of southeastern Colorado.
As the article points out, this has all changed because network operators have consolidated and interconnected networks across large geographic areas. Ookla says that the new phenomenon of large-scale outages is a direct result of digital transformation. As carriers, companies, and governments have grown increasingly reliant on cloud services, managed providers, and interconnected networks, they now have to contend with outages that can cascade from a local problem to a regional or even national issue.
The article examines the recent power outage in Spain and Portugal, which quickly escalated from a local incident to a widespread power outage across much of the Iberian Peninsula. Ookla points out that in today’s world, there is not that much difference between outages of a power grid, a cellular network, or a fiber network.
The article points out that outages can cascade much faster than anybody expects. The difference between a temporary disruption and a system-wide crisis depends on how quickly the network operators can recognize and analyze the causes of a problem. Ookla says there are five key steps needed to keep disruptions from escalating. Every major network outage is likely due to network operators failing at one of the early steps of this process.
Ookla believes that the local reaction within the first hour can make a huge difference in the extent and length of an outage. There was one power company in Iberia that was able to isolate itself from the cascading shutdown because it was prepared to react quickly. I wonder how many local ISPs are prepared to respond quickly to problems originating outside their local network. The Ookla article suggests that local operators can do a lot more to protect themselves and their customers against major outages.
I agree that we are facing a major problem on the internet, one that is not adequately comprehended and whose solutions are not only complex but tend to fly against the strong headwinds of security.
This problem is becoming more serious as the net is tied into other forms of infrastructure, such as our power grids, telephones, air traffic control, water delivery, etc.
I wrote about some of this in a piece I titled “Is The Internet At Risk From Too Much Security?” at https://www.cavebear.com/cavebear-blog/netsecurity/
And more than twenty years ago I gave a presentation titled "From Barnstorming to Boeing – Transforming the Internet Into a Lifeline Utility" (speaker's notes at https://www.cavebear.com/archive/rw/Barnstorming-to-Boeing.pdf and presentation slides at https://www.cavebear.com/archive/rw/Barnstorming-to-Boeing-slides.pdf).
I grew up in a family with generations of repairmen - radio and TV - and I’ve long been involved with diagnosis and repair of networks. (I built the internet’s first “buttset” back in the early 1990s.)
The telco world long ago learned that getting things to work is merely the first step. An additional step was to add test and monitoring facilities, to create tools to keep watch and perform isolation and diagnosis, and have a cadre of trustworthy people who can deploy repairs.
In our internet world, we have rather forgotten that additional step.
Tools like ping, traceroute, SNMP, etc. are all nice, but they are relatively weak. We need much better test points and systems to use them. One thing that I've long wanted is a database of network pathologies through which a reasoning system can work back from symptoms towards possible causes, with the ability to call upon test tools (which may be under different administrative regimes) to distinguish between potential causes. A rough sketch of that idea follows.
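To make the idea a bit more concrete, here is a minimal sketch in Python of what a pathology database and the symptom-to-cause reasoning step might look like. Every entry, symptom name, and suggested test below is hypothetical and invented purely for illustration; a real system would need far richer failure models and actual hooks into test tools across administrative boundaries.

```python
# Hypothetical sketch: a tiny "pathology database" mapping observed
# symptoms to candidate causes, plus tests that help discriminate
# among them. All entries and names here are invented for illustration.
from dataclasses import dataclass, field

@dataclass
class Pathology:
    name: str
    symptoms: set                      # symptoms this failure mode typically produces
    discriminating_tests: list = field(default_factory=list)

# A few made-up entries; a real database would hold hundreds.
PATHOLOGIES = [
    Pathology("upstream fiber cut",
              {"total loss of connectivity", "BGP sessions down"},
              ["check optical power on transport links"]),
    Pathology("DNS resolver failure",
              {"names fail to resolve", "pings to raw IPs succeed"},
              ["query an alternate resolver directly"]),
    Pathology("route leak or withdrawal",
              {"some prefixes unreachable", "BGP sessions down"},
              ["compare views from external looking glasses"]),
]

def candidate_causes(observed: set) -> list:
    """Rank pathologies by how many of the observed symptoms they explain."""
    scored = [(len(observed & p.symptoms), p) for p in PATHOLOGIES]
    return [p for score, p in sorted(scored, key=lambda s: -s[0]) if score > 0]

if __name__ == "__main__":
    observed = {"some prefixes unreachable", "BGP sessions down"}
    for p in candidate_causes(observed):
        print(p.name, "->", "; ".join(p.discriminating_tests))
```

The point of even a toy like this is the workflow: start from symptoms, narrow to a ranked set of plausible causes, then run only the tests that actually distinguish between them, rather than poking blindly with ping and traceroute.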
Security barriers make much of this difficult.
My sense is that we need to have pre-qualified people who can be trusted to dig into network issues - often crossing administrative boundaries - and open security windows to observe, run tests, and deploy repairs. That, of course, would make many administrators quite reasonably nervous.
However, many users and businesses have come to believe, incorrectly, that the internet is a lifeline-grade utility. The internet is an informal, largely unregulated utility, but certainly not a lifeline-grade one that users should trust to protect health or life. (I was appalled to see remotely operated surgery being performed over the net.)
It is time for us to get serious about this stuff. But after 30 years of saying these things I am feeling rather like Cassandra.