North American Network Operators Group

Estimated Time To Repair (was Re: History: lengthy outages)

  • From: Sean Donelan
  • Date: Thu Jan 25 22:14:37 2001

On Thu, 25 January 2001, Clayton Fiske wrote:
> 1. I think it might be prudent to weed out the 2-5 hour outages here.
> While that's still an excessively long time to recover a change that
> should have been monitored and tested properly in the first place
> (and still probably cause for firing in some shops), I can at least
> conceive of it taking this amount of time. Too long, yes, but not
> quite in the jaw-dropping category.

I deliberately included them for a reason.

Historically, when I look at a lot of network problems I notice
an interesting coincidence.  Outages involving "operator error"
tend to take the longest time to fix, while outages involving
equipment failure tend to take the shortest.

Complete hardware box failure (smoke makes debugging easy): 1 hour
Power failure (utility, generator, etc): 3 hours
External malicious attack (ddos, etc): 4 hours
Fiber/Cable cut: 5 hours
Electronic DCS failure: 18 hours
Operator error: 1 business day (24-72 hours depending on whether the operator
made the change before leaving on a Friday night or a Tuesday night)
Vendor software error: 2 business days (1 day to "escalate" the problem
through customer channels, 1 day to actually get the fix, can take as
long as 5 days if the problem happens after 3pm on Friday)
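
To make the 24-72 hour spread for the operator-error case concrete, here
is a minimal Python sketch.  It assumes a "business day" just means
skipping Saturday and Sunday, and that the fix lands at the same clock
time on the next working day.  The function names and the 18:00 change
time are hypothetical, chosen only for illustration.

```python
from datetime import datetime, timedelta


def add_business_days(start, days):
    """Advance `start` by `days` business days, skipping Saturday and Sunday."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 .. Friday=4
            remaining -= 1
    return current


def wall_clock_hours(start, business_days):
    """Elapsed clock hours between `start` and the business-day deadline."""
    end = add_business_days(start, business_days)
    return (end - start).total_seconds() / 3600


# Assumed change times: 18:00 on a Tuesday and on a Friday in January 2001.
tuesday_night = datetime(2001, 1, 23, 18, 0)
friday_night = datetime(2001, 1, 26, 18, 0)

print(wall_clock_hours(tuesday_night, 1))
print(wall_clock_hours(friday_night, 1))
```

Under those assumptions the Tuesday-night change comes out to 24 hours
and the Friday-night change to 72, which matches the range above.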

Psychologists study why people have a difficult time recognizing their
own mistakes.  It is a very difficult problem.  The problem is worse with
"smart" people.