North American Network Operators Group

Limits of reliability, or is 99.999999999% realistic?

  • From: Sean Donelan
  • Date: Sat Nov 25 23:28:55 2000

On Fri, 24 November 2000, Roeland Meyer wrote:
> After all the discussion that we had on the Datacenter list, I am surprised
> at this. You'd think that they'd have redundant PS's with redundant UPS's.

There are interesting electrical faults which will kill redundant UPSes and
redundant power supplies.  However, I don't know if Sprint uses redundant
power supplies, or had a failure affecting multiple power supplies on the
STP.  They use Compaq/Tandem NonStop hardware for their SCPs.
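
The back-of-the-envelope math (all numbers invented, just to show the
shape of it) is why common-mode faults matter: the usual redundancy
calculation multiplies failure probabilities, which assumes the two
units fail independently.

# Illustrative redundancy math in Python; probabilities are made up.
p_single = 0.01                      # assumed annual failure prob. of one PS
p_pair_independent = p_single ** 2   # both fail independently

# A common-mode fault (surge, wiring fault, bad transfer switch) takes
# out both units at once, so it adds straight onto the pair's number:
p_common_mode = 0.001                # assumed prob. of a fault killing both
p_pair_real = p_pair_independent + p_common_mode

print(p_pair_independent)   # ~0.0001
print(p_pair_real)          # ~0.0011 -- dominated by the common-mode term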
 
After previous power problems at Sprint COs (e.g. Kansas City) which
affected my service, I asked my Sprint sales person what happened.
He was never able to get anyone to call me back with an answer.

> For the internet, I see an amazing number of systems with no redundancy
> whatsoever. Of course, the first hardware failure usually corrects the
> problem, at the cost of substantial down-time. But many second-tier ISPs and
> dot-coms are still operating on brand-new equipment that hasn't started
> hitting its MTBF specs yet and they don't even have a clue on their MTTR
> ratings. In the next few years, I expect to see a lot more failures, as the
> equipment starts to age.

I'm not sure that is true.  Brand-new electronic equipment tends to have a
period of infant mortality (the front end of the classic bathtub curve).
If it survives that, it tends to be reliable for a fairly long time.  I had
customers still using Proteon routers 10 years after Proteon discontinued
the model.  However, scaling requires most dot-coms to replace or upgrade
their equipment every few months, so they are always dealing with infant
mortality.
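
To put rough numbers on the MTBF/MTTR point, the standard steady-state
availability formula is A = MTBF / (MTBF + MTTR).  A quick sketch (the
hour figures are invented):

# Availability from MTBF and MTTR; numbers are illustrative only.
def availability(mtbf_hours, mttr_hours):
    return mtbf_hours / (mtbf_hours + mttr_hours)

# A 50,000-hour-MTBF box is five nines only if you can repair it in
# about half an hour; a four-hour truck roll drops you a whole nine.
print(availability(50000, 0.5))   # ~0.99999  (five nines)
print(availability(50000, 4.0))   # ~0.99992  (four nines, not five)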

But back to my question.  What is the real requirement?  Amazon.COM had
system problems on Friday, and their site was unusable for 30 minutes,
definitely not 99.999%.  But what did that really mean?  The FAA loses
its radar for several hours at a time in various parts of the country.
What did that really mean?  Essentially every system I've looked at that
is held up as an example of "high-availability, high-reliability" doesn't
survive close examination.
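
For scale, the downtime budgets behind these marketing numbers are easy
to work out (a quick sketch; the 30 minutes is Friday's Amazon outage):

# Allowed downtime per year for N nines of availability.
MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600

for nines in (3, 5, 11):
    downtime = MINUTES_PER_YEAR * 10 ** -nines
    print(nines, "nines ->", downtime, "minutes/year")
# 3 nines  -> ~525 min/year (about 8.8 hours)
# 5 nines  -> ~5.3 min/year
# 11 nines -> ~0.000005 min/year, i.e. roughly 0.3 milliseconds
# And a single 30-minute outage caps the whole year at about 99.994%:
print(1 - 30 / MINUTES_PER_YEAR)   # ~0.99994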

Is 99.999% just F.U.D. created by consultants?

Instead of pretending we can build systems which will never fail, should
we work on a realistic understanding of what can be delivered?