No matter how much robustness and redundancy you build around your multi-tiered infrastructure, you are bound to suffer outages. I’m not talking about the failure of a single server, but about a complex outage that’s usually external to the operation of the infrastructure. What matters is how you communicate outage notifications when things do go awry. I think the words that I’m searching for are transparency and openness.
We’ve seen over and over again how notification and continual updates get overlooked or ignored during an outage. I do understand that the first and foremost goal for any organization during an outage is to stabilize its network infrastructure, but what’s even more frustrating is the lack of a communication channel (status blog, public health dashboard, etc.), which leaves customers with no clue what is happening, let alone a way to get through to the NOC for a straight answer.
What matters to most people, myself included, is how outage analysis is communicated in a timely manner during downtime events, and how the outcome of such an event leads to lessons learned (a post-mortem), since it serves as a great example of how not to do things.
The intention here is not to criticize companies, but quite the opposite; those who have chosen to publicize the causes of outages should be applauded by customers for being open. They have shared their efforts to learn and improve the availability, robustness, scalability, and performance of network services. After all, every ISP encounters the same challenges.
Fred Brooks nicely articulated:
“You can learn more from failure than success. In failure you’re forced to find out what part did not work. But in success you can believe everything you did was great, when in fact some parts may not have worked at all. Failure forces you to face reality.”
However, the “real” costs include some losses that are harder to quantify but may be far greater. For example,
There seem to be three simple rules of outages:
Keep in mind that a good deal of what I’ve outlined here will seem a lot like common sense, which is a good thing. Quite often the simplest approaches to problem solving are the best ones, and openness and transparency are no exception.
There, got that off my chest.
Full Disclosure: I’m a moderator for wiki.outages.org, and I’m doing this to provide transparency about what I do and why I do it.
Although I read it between the lines, you should make risk management more explicit. Service providers need to have in place monitoring and operating procedures to handle such events (including unknown unknowns), which require, as you note, technical and customers’ emotions management solutions. Unfortunately, however, service providers, in general, ignore the emotional side.
With risk management, the losses that you note would be minimized and, when “properly” implemented, can result in sticky customers. Moreover, you can use risk management successes as emotional springboard stories.
Hi Alex, the 64-bit question is: how can we engage and/or encourage providers to be more forthcoming and report outages w/o being concerned about bottom line and instead putting their customer's interest first? I will even go out on a limb and say this: it's a matter of time before the heavy-handedness of government, aka "regulation", forces companies into a corner if closed-door outage reporting continues, and this will further diminish the "free market".
Now, I am confused.
The original post suggests that you are an advocate of technical and emotional solutions. If true, emotional solutions should improve their bottom line. But in the comment you say, “being concerned about bottom line and instead putting their customer’s interest first.” I am pointing out that emotional solutions improve the bottom line because they put customers’ interest first.
Many solutions providers (whether individuals, corporations, or governments) seem not to understand the value to the bottom line of risk management and the need to integrate technical/operational with emotional solutions.
I think we are saying the same thing, and maybe I wasn't clear. We want the customer's interest first. IMO, openness and transparency are key to building trust with customers. I agree with you, though providers are reluctant to publicly report their service as “bad”, especially if not everyone has to report on the same basis and/or the measurement is not universally recognized. Even with a protective agreement in place, no one wants to report, and how that's defined is a separate discussion for some other day.
Actually, rule #2 is wrong. I’ve been in many places where many folks have unreasonable expectations, effectively believing things don’t break. Senior management and sales first and foremost, both of which are typically not technical. And customers too. It’s actually gotten bad, because most don’t understand the cost implications of redundancy and its downstream impact on pricing that ultimately gets passed through. It’s a losing battle to fight, though. It is what it is: the expectation is basically the very unrealistic “always-up” 100% availability. You just have to take your lumps from time to time when things do break, which they definitely will sooner or later.
As Ken Scafer of OpenSRS summed it up very well,
What I'm advocating here is really simple: openness. Those who provide it are simply showing the quality of service of their organization.