Time flies. Although it was over 18 months ago, it seems just like yesterday that a small Czech provider, SuproNet, caused global Internet mayhem by making a perfectly valid (but extremely long) routing announcement. Since Internet routing is trust-based, within seconds every router in the world saw this announcement and tried to pass it on. Unfortunately, due to the size of this single message, quite a few routers choked—resulting in widespread Internet instability. Today, over a year later, we were treated to a somewhat different version of the exact same story.
First, let’s review the Czech incident from February 2009. There were many positives to take away.
The complete technical details can be found here.
Deja vu all over again
Fast forward to today: Friday, 27 August 2010. What do you think would happen if another large and unusual routing announcement were made on the Internet? Do you think all the router vendors have perfected their code in the past 18 months? Do you think the entire planet has upgraded to this new, improved and perfect code base? Do you think it makes sense to use the Internet as your testbed? I doubt you answered “yes” to any of these questions.
We’ll begin to describe what happened today with a snippet from a private mailing list. We’ll purposely leave out the technical details so that we don’t inadvertently contribute to the building of a Cybernuke.
On Friday 27 August, from 08:41 to 09:08 UTC, the RIPE NCC Routing Information Service (RIS) announced a route with an experimental BGP attribute. During this announcement, some Internet Service Providers reported problems with their networking infrastructure.
Immediately after discovering this, we stopped the announcement and started investigating the problem. Our investigation has shown that the problem was likely to have been caused by certain router types incorrectly modifying the experimental attribute and then further announcing the malformed route to their peers. The announcements sent out by the RIS were correct and complied to all standards.
Um, while standards compliance is nice, it is foolhardy to assume that all BGP implementations are perfectly compliant, especially given recent history. Over 3,500 prefixes (announced blocks of IP addresses) became unstable at the exact moment this “experiment” started. Not surprisingly, they were located all over the world: 832 in the US, 336 in Russia, 277 in Argentina, 256 in Romania and so forth. We saw over 60 countries impacted by a “correct” announcement that “complied with all standards”. The following graph shows the timeline of the event, followed by a map of the impacted countries by prefix count. Notice that it takes a bit for the Internet to stabilize after RIPE claims to have withdrawn the announcement at 09:08 UTC.
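To put the per-country figures in proportion, here is a rough sketch of the arithmetic using only the counts quoted above. It assumes the total is exactly 3,500, whereas the real total was "over 3,500", so the percentages are upper bounds and the remainder is a lower bound. The names `impacted`, `TOTAL_UNSTABLE` and `share` are purely illustrative.

```python
# Per-country counts of unstable prefixes, as quoted in the post.
impacted = {"US": 832, "Russia": 336, "Argentina": 277, "Romania": 256}

# Assumption: the post says "over 3,500", so this is a lower bound on the total.
TOTAL_UNSTABLE = 3500


def share(count, total):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100.0 * count / total, 1)


# Share of the unstable prefixes attributed to each named country.
shares = {country: share(n, TOTAL_UNSTABLE) for country, n in impacted.items()}

# Prefixes accounted for by the four named countries, and what is left over
# for the remaining impacted countries.
named_total = sum(impacted.values())
remainder = TOTAL_UNSTABLE - named_total

for country, pct in shares.items():
    print(f"{country}: at most {pct}% of unstable prefixes")
print(f"Four named countries: {named_total} prefixes")
print(f"All other impacted countries: at least {remainder} prefixes")
```

Under that assumption, the four named countries account for just under half of the unstable prefixes, with the rest spread across the other impacted countries.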
Conclusions
On the positive side, the incident was very brief, the damage was limited to under 2% of the Internet and the responsible parties quickly fessed up, aborting their “experiment”. On the negative side, the Internet remains a very fragile place, even if that fragility is highly localized and varies from place to place. Standards aren’t followed, code isn’t tested and people make mistakes. That’s life with any complex system and, while we can certainly do a better job, we will continue to see these types of events no matter what safeguards we might take. What puzzles me is how anyone thought it might be a good idea to tempt fate in this way. The end result was completely predictable.
Well, the announcement was re-posted to the NANOG list earlier today:
http://mailman.nanog.org/pipermail/nanog/2010-August/024837.html
And as an aside, I see that Cisco posted a security advisory regarding this late this afternoon:
http://www.cisco.com/warp/public/707/cisco-sa-20100827-bgp.shtml
- ferg
Not true. The fact that the impact was very small, unnoticed by the vast majority of users, and quickly fixed means that “The Internet” was just fine, and as robust as ever. Perhaps the point was that “Global Internet routing remains one of the most complex systems ever built, and subject to degradation just as any other highly complex system”?
/John
All due respect to Renesys but some weightage for the percentage of significant ASNs affected should have been built in.
> 500 ASNs in the USA is a drop in the bucket. > 20 ASNs in another country with fewer networks might be most of the country.
Some study along those lines might have painted the bright colors elsewhere on the map.