North American Network Operators Group

Re: Operate until failure

  • From: Shawn McMahon
  • Date: Mon Jan 08 10:15:57 2001

On Mon, Jan 08, 2001 at 08:49:17AM -0600, Eric Whitehill wrote:
> 
> We've had issues here with power outages and usually the UPS' will hold.
> The one time they didn't, we went and brought all the machines down
> gracefully as we didn't have the auto-shutdown installed on the systems.  

We don't shut anything down without a management call, unless it's going to
fail and break something in the next 15 minutes.

We have a generator, but we have had two amazing coincidences cause it
to fail.  The first time, the generator was fine, but the switch didn't
switch.  The person who was signing off (erroneously) that he was checking
that switch monthly lost his job shortly before we stopped using his
company entirely.  We discovered the problem when the batteries reached
the point where the switch was supposed to cut over to the generator, and
the entire data center went dark.  That was a very, very bad day.

The second time, an o-ring blew out, and we dumped so much oil on the
ground, we were told that if it'd been a tiny bit more we'd have had to
call the EPA.  This one gave us enough warning to shut things down, but
we had to hustle and a few things were triaged as "let it die, we don't have
time."

In general, however, we start planning for a controlled shutdown the minute
we know there's a problem, and we attempt to schedule that shutdown for
our scheduled weekly outage window if possible.  If not, we try to make it
after peak processing time for the affected components.
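
A minimal sketch of the scheduling policy described above, in Python.  The
function name, its parameters, and the notion of a predicted failure time
are assumptions made for illustration; only the 15-minute threshold and the
order of preference (outage window first, then after peak) come from the
text.

```python
from datetime import datetime, timedelta

# Threshold taken from the text: inside this margin, shut down right away.
URGENT = timedelta(minutes=15)


def pick_shutdown_time(now, fails_at, next_window_start, peak_end):
    """Return when to begin a controlled shutdown.

    All arguments are datetimes (hypothetical inputs):
      now               - current time
      fails_at          - estimated time the component will fail
      next_window_start - start of the next scheduled weekly outage window
      peak_end          - end of peak processing for the affected components
    """
    # About to fail and break something: no time to schedule anything.
    if fails_at - now <= URGENT:
        return now
    # Preferred: wait for the scheduled outage window if it comes first.
    if next_window_start < fails_at:
        return max(now, next_window_start)
    # Otherwise, shut down after peak processing if that is still in time.
    if peak_end < fails_at:
        return max(now, peak_end)
    # Neither option fits before the failure: start shutting down now.
    return now
```

For example, with a failure expected in six hours, the next window two days
out, and peak processing ending in four hours, this picks the end of peak
as the shutdown time.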
