North American Network Operators Group
Limits of reliability or is 99.999999999% realistic
On Fri, 24 November 2000, Roeland Meyer wrote:

> After all the discussion that we had on the Datacenter list, I am surprised
> at this. You'd think that they'd have redundant PS's with redundant UPS's.

There are interesting electrical faults which will kill redundant UPSes and
redundant power supplies. However, I don't know whether Sprint uses redundant
power supplies, or had a failure affecting multiple power supplies on the
STP. They use Compaq/Tandem NonStop hardware for their SCPs. After previous
power problems at Sprint COs (e.g. Kansas City) which affected my service, I
asked my Sprint salesperson what happened. He was never able to get anyone
to call me back with an answer.

> For the internet, I see an amazing number of systems with no redundancy
> whatsoever. Of course, the first hardware failure usually corrects the
> problem, at the cost of substantial down-time. But many second-tier ISPs and
> dot-coms are still operating on brand-new equipment that hasn't started
> hitting its MTBF specs yet and they don't even have a clue on their MTTR
> ratings. In the next few years, I expect to see a lot more failures, as the
> equipment starts to age.

I'm not sure that is true. Brand-new electronic equipment tends to have a
period of infant mortality; if it survives that, it tends to be reliable for
a fairly long time. I had customers still using Proteon routers 10 years
after Proteon discontinued the model. However, scaling requires most
dot-coms to replace/upgrade their equipment every few months, so they are
always dealing with infant mortality.

But back to my question: what is the real requirement? Amazon.COM had system
problems on Friday, and their site was unusable for 30 minutes, definitely
not 99.999% (see the quick arithmetic below). But what did that really mean?
The FAA loses its radar for several hours in various parts of the country.
What did that really mean? Essentially every system given as an example of
"high-availability, high-reliability" I've looked at doesn't hold up under
close examination.

Is 99.999% just F.U.D. created by consultants? Instead of pretending we can
build systems which will never fail, should we work on a realistic
understanding of what can be delivered?
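P.S. For anyone who wants to run the numbers, here is a quick
back-of-the-envelope sketch in Python. The MTBF and MTTR figures are
invented for illustration; they are not Sprint's or anyone else's:

    # Downtime budget implied by each "nines" level (365-day year).
    MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

    for nines in range(3, 12):
        availability = 1 - 10.0 ** -nines
        budget_min = MINUTES_PER_YEAR * 10.0 ** -nines
        print(f"{availability:.{nines}%} uptime allows {budget_min:12.6f} minutes/year down")

    # A single 30-minute outage, even if it is the only one all year,
    # already caps availability well below four nines.
    outage_min = 30
    print(f"one {outage_min}-minute outage/year -> "
          f"{1 - outage_min / MINUTES_PER_YEAR:.4%} availability at best")

    # Steady-state availability from MTBF and MTTR (hypothetical figures):
    # even a 50,000-hour MTBF only buys about four nines when repairs take
    # four hours, which is why not knowing your MTTR matters.
    mtbf_h, mttr_h = 50_000.0, 4.0
    print(f"MTBF {mtbf_h:.0f}h, MTTR {mttr_h:.0f}h -> "
          f"{mtbf_h / (mtbf_h + mttr_h):.5%} availability")

Note that 99.999% works out to about 5.3 minutes of downtime a year, and the
99.999999999% in the subject line to well under a millisecond.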