When a network is subject to a rapid increase in traffic, perhaps combined with a rapid decrease in capacity (for example, due to a fire or a natural disaster), there is a risk of congestion collapse. In a congestion collapse, the remaining capacity is so overloaded with access attempts that virtually no traffic gets through. In the case of telephony, everyone attempts to call their family and friends in a disaster area. The long-standing telephony approach is to restrict new call attempts upstream of the congested area, for example by call gapping. This limits the amount of new traffic to what the network can handle. Thus, if only 30% of capacity is available, the network at least completes 30% of the calls, not 3% or zero.
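To make call gapping concrete, here is a minimal sketch in Python. The CallGapper class, its parameters, and the numbers are hypothetical, purely for illustration: after each admitted call attempt, every further attempt is rejected until a fixed gap interval has elapsed, so the congested switch downstream only sees traffic it can actually complete.

```python
import time

class CallGapper:
    """Sketch of telephony-style call gapping: admit at most one new
    call attempt per gap interval and reject the rest upstream, so the
    congested switch only receives calls it can actually complete."""

    def __init__(self, gap_seconds):
        self.gap_seconds = gap_seconds  # minimum spacing between admitted attempts
        self.next_allowed = 0.0         # earliest time the next attempt may pass

    def admit(self, now=None):
        now = time.monotonic() if now is None else now
        if now >= self.next_allowed:
            self.next_allowed = now + self.gap_seconds
            return True   # forward this call attempt toward the disaster area
        return False      # reject locally (caller hears a reorder tone)

# With a 0.5 s gap, at most 2 new attempts per second get through,
# no matter how many thousands of callers are dialing at once.
gapper = CallGapper(gap_seconds=0.5)
admitted = sum(gapper.admit(now=t * 0.01) for t in range(1000))  # 10 s of attempts
print(f"admitted {admitted} of 1000 attempts")  # -> admitted 20 of 1000 attempts
```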
There are comparable issues in the Internet backbone, but they have not been solved as completely. As Wikipedia puts it:
Congestion collapse was identified as a possible problem as far back as 1984 (RFC 896). It was first observed on the early internet in October 1986, when the NSFnet phase-I backbone dropped three orders of magnitude from its capacity of 32 kbit/s to 40 bit/s, and continued until end nodes started implementing Van Jacobson’s congestion control between 1987 and 1988.
TCP congestion control solved day-to-day congestion collapse, but it didn’t deal with disasters such as the Taiwan earthquake of December 2006.
As a result of the problems that earthquake caused in Asia's Internet backbone, there's renewed discussion of Internet congestion. I stumbled on one interesting thread on the NANOG (North American Network Operators Group) mailing list.
In this thread, the NANOG group seems to be arriving at conclusions similar to those of the telecom industry, i.e., throttle new traffic attempts upstream of the congested area. From Fred Baker:
So plan B would be to in some way rate limit the passage of TCP SYN/SYN-ACK and SCTP INIT in such a way that the hosed links remain fully utilized but sessions that have become established get acceptable service (maybe not great service, but they eventually complete without failing).
And from Sean Donelan:
This would be a useful plan B (or plan F - when things are really FUBARed), but I still think you need a way to signal it upstream 1 or 2 ASNs from the Extreme Congestion to be effective.
...what should the alternate queue plan B be?
Probably not fixed capacity numbers, but a distributed percentage across different upstreams…
Session protocol start packets (TCP SYN/SYN-ACK, SCTP INIT, etc) 1% queue
Everything else (UDP, ICMP, GRE, TCP ACK/FIN, etc) normal queue

And finally, why only do this during extreme congestion? Why not always do it?
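Here's a minimal sketch, in Python, of the two-queue plan quoted above. The classification (TCP SYN/SYN-ACK, SCTP INIT) and the 1% figure come from the thread; the scheduler structure, the credit mechanism, and all names are my own illustration, not an implementation anyone on NANOG proposed.

```python
from collections import deque

# Hypothetical classifier for the plan sketched in the thread: session-start
# packets (TCP SYN/SYN-ACK, SCTP INIT) go to a queue limited to ~1% of the
# link, so already-established flows keep the other ~99% and can complete.

SESSION_START = {("tcp", "SYN"), ("tcp", "SYN-ACK"), ("sctp", "INIT")}

def classify(packet):
    key = (packet.get("proto"), packet.get("flags"))
    return "start" if key in SESSION_START else "normal"

class TwoQueueScheduler:
    def __init__(self, start_share=0.01):
        self.queues = {"start": deque(), "normal": deque()}
        self.start_share = start_share  # fraction of service slots for new sessions
        self.credit = 0.0               # accumulates until a "start" packet may go

    def enqueue(self, packet):
        self.queues[classify(packet)].append(packet)

    def dequeue(self):
        # Serve the "start" queue only when it has earned a slot (~1% of
        # service opportunities); otherwise serve established traffic.
        self.credit += self.start_share
        if self.credit >= 1.0 and self.queues["start"]:
            self.credit -= 1.0
            return self.queues["start"].popleft()
        if self.queues["normal"]:
            return self.queues["normal"].popleft()
        if self.queues["start"]:  # link otherwise idle: let new sessions through
            return self.queues["start"].popleft()
        return None
```

The credit counter is just a crude way to approximate "serve new-session packets on about 1% of service opportunities"; a real router would enforce this with weighted queueing or a policer at line rate rather than per-packet Python.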
The thread contains over 50 messages and doesn't reach a specific call to action (that I could detect), but it does show problems, and potential solutions, similar to telephony's.
I would have thought there’d be a few old telecom folks on the NANOG list, but I guess not…