North American Network Operators Group


Re: Quick question.

  • From: Robert E. Seastrom
  • Date: Sun Aug 01 18:40:22 2004

"Michel Py" <[email protected]> writes:

> The dead processor still has to be replaced, but this is scheduled
> maintenance, not outage. A little extra ammo when you have to hunt five
> or six nines.

MTTR on a single box is irrelevant when you are off playing Ponce de
León, hunting the Fountain of Five or Six Nines.  Even when your
architecture doesn't depend on any one particular machine (or even on
whole big sets of machines) being available, you still don't get to
"five or six nines"; just ask Google, Akamai, or Microsoft.  Other
things beyond your control spoil the picnic first.
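For a sense of scale, here is a minimal Python sketch of the downtime
budget that each count of nines allows.  It assumes availability is
measured over a plain 365-day year; real SLAs may define the period
differently, and the function name is hypothetical.

```python
# Downtime budget implied by "N nines" of availability.
# Assumption: availability is measured over a 365-day (non-leap) year.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds


def downtime_budget_seconds(nines: int, period_s: int = SECONDS_PER_YEAR) -> float:
    """Allowed downtime (seconds) for availability of 1 - 10**-nines over period_s."""
    if nines < 1:
        raise ValueError("nines must be >= 1")
    return period_s * 10.0 ** -nines


if __name__ == "__main__":
    for n in (3, 4, 5, 6):
        budget = downtime_budget_seconds(n)
        print(f"{n} nines: {budget:10.2f} s/year ({budget / 60:8.2f} min/year)")
```

Under that assumption, five nines leaves about 315 seconds (roughly 5.3
minutes) of downtime per year, and six nines about 31.5 seconds.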

As has been observed time and time again, the tried-and-true way to
achieve five or six nines of reliability in a system of more than
trivial complexity is to take a lesson from the telcos (the
progenitors of the "five nines" lie) and build a framework and
evaluation methodology that excludes broad classes of
unavailability-causing events or prorates them in such a way as to
make them non-reportable.  Keep adding to that list of exclusions
until the downtime that remains works out to your target number of
nines.  Presto, five nines.

                                        ---Rob