North American Network Operators Group

Re: "They all suck!" Re: UPS failure modes (was: fire at NAC)

  • From: Sean Donelan
  • Date: Thu May 29 16:57:15 2003

On Thu, 29 May 2003, Alex Rubenstein wrote:
> Even in instances where 'High availability' is designed, in the case where
> one of the units has a failure that causes a fire and FM200 dump, either
> the FM200 will still trigger an EPO, or the fire department will.

Why do you think most telephone central offices don't have EPOs?  It is
possible to meet code without an EPO, if you have a smart PE on the

> So, the second 'high available' unit will generally not prevent you from
> dropping the critical load, but instead, will help you get back on line
> quicker.

That's why you have geographic diversity: if one node goes down, the
other location may be unaffected.

> A much cheaper and easier to implement external maintenance
> make-before-break bypass will accomplish the same thing.

Pick two out of three.  The "Internet philosophy" has tended to be
lots of cheap equipment connected by diverse paths.  Designing for
failure also means defining "failure" in terms of the service, not
particular pieces of equipment.  I don't care how many 9's your switch
is, I just care if my packets get through.
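
As a rough illustration of measuring the service rather than the box,
here is a minimal sketch of the arithmetic.  It assumes path failures
are statistically independent, and the availability figures are
made-up numbers for illustration, not measurements of any real
equipment.  The function name is hypothetical.

```python
def service_availability(path_availabilities):
    """Probability that at least one path is up.

    Assumes each path fails independently of the others, which is
    only true if the paths really are diverse (no shared conduit,
    power feed, or building).
    """
    p_all_down = 1.0
    for availability in path_availabilities:
        p_all_down *= (1.0 - availability)
    return 1.0 - p_all_down


# Illustrative numbers only: three cheap "two nines" paths
# versus a single expensive "four nines" switch.
cheap_diverse = service_availability([0.99, 0.99, 0.99])
single_box = service_availability([0.9999])

print(f"three cheap diverse paths: {cheap_diverse:.6f}")
print(f"one expensive box:         {single_box:.6f}")
```

Under those assumptions the three cheap paths come out ahead, but only
as long as the independence assumption holds.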

> I've heard many a story of the paralleling gear causing the problem in the
> first place, as well...

Yep, tying together "redundant" systems with paralleling gear turns two
independent systems into one "co-dependent" system.  In a failure
situation, you want to compartmentalize the failure.  Losing half your
systems may be better than losing all your systems.
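
A back-of-the-envelope sketch of why the shared gear matters.  The
model is an assumption for illustration: two units each fail
independently with probability p_unit, and any shared paralleling gear
fails with probability p_shared and takes both units down with it.
The probabilities below are invented, not field data.

```python
def total_outage_probability(p_unit, p_shared=0.0):
    """Probability of losing BOTH units in some interval.

    p_unit   -- chance a single unit fails on its own (assumed equal
                for both units and independent of each other)
    p_shared -- chance the shared paralleling gear fails and drops
                both units at once (0.0 means no shared gear)
    """
    both_units_fail = p_unit * p_unit
    # Either the shared gear fails, or it survives and both units
    # happen to fail independently.
    return p_shared + (1.0 - p_shared) * both_units_fail


# Illustrative numbers only.
independent = total_outage_probability(p_unit=0.01)
coupled = total_outage_probability(p_unit=0.01, p_shared=0.001)

print(f"two independent halves, lose everything: {independent:.6f}")
print(f"tied together by shared gear:            {coupled:.6f}")
```

In this toy model the shared gear dominates the chance of a total
outage even though it is ten times more reliable than either unit.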