Yesterday’s Wikipedia outage, which resulted from invalid DNS zone information, provides some good reminders about the best and worst attributes of active DNS management. The best part of the DNS is that it gives knowledgeable operators a powerful tool for managing traffic around trouble spots on a network. In this case, Wikipedia was attempting to route around its European data center because of an overheating problem that caused Wikipedia’s servers at that location to shut down. This is a classic example of how important DNS is to disaster recovery planning. Effectively implemented, the failover away from the European site would have been transparent to the user and the ‘disaster’ would have been averted, showcasing the resiliency of the Internet at its best.
Unfortunately, it is also a classic example of how devastating even ‘small’ errors in implementing such a DNS-based failover strategy can be to site uptime. The reason small errors can grow into big problems is that, while the DNS is flexible, it can also be incredibly unforgiving. The root (pun intended) of the issue is that the DNS works by storing information in relatively few authoritative name servers, from which recursive name servers pull and cache information in response to users’ requests for the zone information. Thus, depending on the Time-To-Live (TTL) setting for the zone file at issue, a recursive server may keep the information anywhere from seconds to hours before going back to the authoritative server for an update. Once an invalid zone is pulled to a recursive DNS server, that server won’t check back for new information until the TTL expires, which means that bad information can linger long after the zone is fixed, sending user after user (after user) to the wrong place or to no place at all. This architecture is why, at least with respect to DNS zone information, Spock was wrong: long life is not prosperous.
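To make the TTL mechanics concrete, here is a minimal, hypothetical BIND-style zone fragment; the names and addresses are placeholders, not Wikipedia’s actual configuration. Any resolver that has cached the ‘www’ record will keep serving that cached answer, good or bad, until its TTL runs out before asking the authoritative server again:

    $TTL 300                  ; default time-to-live: resolvers may cache records up to 5 minutes
    @    IN SOA ns1.example.org. hostmaster.example.org. (
              2010032601      ; serial - must be bumped whenever the zone changes
              3600            ; refresh
              600             ; retry
              604800          ; expire
              300 )           ; negative-caching TTL
         IN NS  ns1.example.org.
    ns1  IN A   192.0.2.53    ; authoritative name server
    ; a low TTL on the record you intend to fail over keeps stale answers short-lived
    www  60 IN A   192.0.2.10       ; primary data center
    ;www 60 IN A   198.51.100.10    ; fail-over data center, swapped in during an outage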
The good news is that you can get the best from the DNS while avoiding the problems caused by its architecture. The key is a DNS solution that mitigates both of the risks that caused the Wikipedia outage: the failure to notice (or prevent) the introduction of invalid zone data, and the failure to use low-latency DNS resolution (or short TTLs). What constitutes best practice in this area is up for some debate, but our perspective is that, with respect to zone data, using a utility such as ‘named-checkzone’ for BIND or ‘tinydns-data’ for djbdns is a must, as it gives administrators an opportunity to check zone data for errors and correct them before they go live. With respect to TTLs, best practice suggests that an optimal TTL is one that is half of the desired ‘mean time to repair’ (MTTR) for the site, which is generally determined by the site’s larger disaster recovery strategy.
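As a sketch of how those two practices might fit together (the zone name, file path, and 30-minute MTTR figure are illustrative only): validate the edited zone before it is published, reload it only if the check passes, and derive the TTL from the recovery target:

    # check the edited zone for syntax and consistency errors before publishing it (BIND)
    named-checkzone example.org /var/named/example.org.zone \
        && rndc reload example.org    # reload the zone only if the check passes

    # TTL rule of thumb: half the desired mean time to repair
    # e.g. a 30-minute (1800-second) MTTR suggests a TTL of about 900 seconds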
We have produced a tool that will automate the switching of DNS records for disaster recovery or maintenance purposes. It integrates with Infoblox (using bloxTools) and VitalQIP.
Please see here for more info: tuscany networks DNS Contingency Switcher
Cheers,
Paul