North American Network Operators Group


Re: Why do we use facilities with EPO's?

  • From: Valdis . Kletnieks
  • Date: Thu Jul 26 10:28:59 2007

On Wed, 25 Jul 2007 12:43:17 PDT, Roy said:

> > Funny story about that and the EPO we have here...
> > ...
> Story #1

> Story #2

Story #3

So about 4-5 years ago, we were in the middle of a major renovation of our
server room.  Moving machines all over the place, trying to clear about
6K contiguous square feet of floor space to drop a top-5 supercomputer in.
Upgrading the power, bringing in another 1.5MW feed, cooling to get the
resulting BTUs *out*, etc.  And we decide it's time to put in a new 600kW
diesel backup generator to replace the old one that was way too small, for
all the non-supercomputer systems in the room.

So we take a multi-hour outage one Saturday for a full powerdown so we can wire
all the new UPS gear in.  And one of our scarier moments is rebooting the Sun
E10K, because it was a bit long in the tooth, and had 400 disk drives, and
hadn't been powered off in so long we weren't sure if it *would* power up again
without field engineering assistance.  And it *had* to come back up, because
it had all the Oracle databases that had all our business records, HR,
student records, everything.  There's a few tense moments - we lose about a
dozen drives, but fortunately they're all in RAID sets and no more than one
drive per set died.  We also notice that we dodged a bullet - the main boot
drive was supposed to be mirrored, but due to a config error, wasn't.

Tuesday, that boot drive is moved; it's now mirrored on 2 drives.

Friday, some construction guys come in to move the main entrance door into the
room - it has to move about 20 feet to the right so you can go *around* the
supercomputer, rather than walk straight into it.  And as per plan, one of them
starts moving the kind of odd light switch junction box next to the door, to
its new location next to the new door. Unfortunately, as *not* per plan, he
fails to double-check with our Facilities team that it's been disarmed first...

5 seconds later, it's very quiet and foggy in the room, as the Halon has dumped
and the interlock with the EPO has killed the power.

Several hours later, we finally get to start powering up the Sun E10K.

The good news:  We only lost 2 drives out of 400 this time, rather than a dozen.

The bad news:  Guess which 2 failed.....
