North American Network Operators Group


Re: NSI bulletin 097-004 | Root Server Problems

  • From: Perry E. Metzger
  • Date: Thu Jul 17 17:09:00 1997

Randy Bush writes:
> Gosh, Perry.  I wish I could be and make things that perfect.

Randy, we are engineers, not witch doctors.

Real engineers can and do build things to far higher levels of safety
than you seem to feel is acceptable.

I work for Wall Street firms. I've been around to witness things like
a head trader at a client order the purchase of ~$20 Billion in
securities on heavy leverage and have a couple dozen people work at
executing the trade for half the day in such a way as to avoid moving
the market.

In the sort of firms I work at, literally hundreds of millions to
billions of dollars are on the line in our systems.  When failures
happen in this environment, people can lose money. If people lose
money, YOU NEVER WORK AGAIN. No excuses. No "oh, what were you
expecting, perfection?" You just don't work again.

As a result of that, WE BUILD THINGS NOT TO FAIL. We do things like
putting every machine on two nets, dual connecting all networks to
every other network. We "red/black" checkerboard the workstations
on our trading floors so that even if something really bad happens
your neighbor's machine will still be up and you can get out of your
positions. We engineer things so no single power cord, router, server
or communications link is a potential source of failure.

It's not trivial. You actually have to work to do this sort of
thing. However, it can be done, and it is done.

Moreover, we tend to build things such that dangerous infrastructure
is treated with care. We automate our management systems. We don't
have humans touch anything that could cause a network or system to go
bye-bye because we can't afford to.

Do we get failures? Sure, we get failures. Do we get catastrophic
failures? Very, very, very rarely. No, not never -- but very, very
rarely and when they happen we almost always recover very, very
fast. I've heard tales of things not recovering fast. I've never heard
tales of the people working on them after that, though.

As I said, we are engineers here, NOT WITCH DOCTORS. Saying
sarcastically "I wish I could be perfect, too" is to belittle your
profession.

When civil engineers build bridges, they almost never collapse. If
they collapse, the engineer probably isn't going to work again,
because he's just killed someone. Do collapses happen? Sure, once
every great while. Do they happen often? No.

When telephone network engineers build telephone networks, they rate
their switches at two minutes downtime per year, and they work very,
very carefully. Do failures happen? Sure, back in 1989 AT&T had a
catastrophic failure for a couple of hours. Do they happen regularly?
No, these things are goddamned rare.

When aerospace guys build fly-by-wire computer systems, people die if
they fail, so they build them so that they don't fail. Do they
sometimes fail? Sure, very rarely -- no one is perfect -- but compared
to the way the DNS has been failing of late it isn't even a contest.

Internet guys are no more stupid than aerospace guys or civil
engineers, and frankly our job is no harder. We just have lots of lazy
people who spit out lots of excuses.

I'm used to an environment where excuses get you fired, though, so I
find myself less tolerant of "oh, poor us" claims from others.

There is no excuse for any system that will let you install a broken
zone file without requiring some serious levels of manual override --
there is, in fact, little excuse for building such a system so that
humans have to be involved in the first place.
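
To make that concrete, here is a minimal sketch of the kind of gate I
mean, not a description of how any real registry does it. Every name in
it (count_records, check_zone, install_zone, the 10% shrink threshold,
the force flag) is made up for illustration, and the record counting is
a crude line-based proxy rather than a real master-file parser. The
point is only that a candidate zone gets compared against the one
currently in service, and anything suspicious is refused unless someone
deliberately overrides it.

```python
from dataclasses import dataclass, field


@dataclass
class ZoneCheck:
    ok: bool
    reasons: list = field(default_factory=list)


def record_lines(zone_text: str) -> list:
    """Return non-blank lines with ';' comments and '$' directives dropped.

    Crude proxy: a ';' inside a quoted string is treated as a comment,
    and a multi-line record is counted once per physical line.
    """
    lines = []
    for line in zone_text.splitlines():
        stripped = line.split(";", 1)[0].strip()
        if stripped and not stripped.startswith("$"):
            lines.append(stripped)
    return lines


def count_records(zone_text: str) -> int:
    return len(record_lines(zone_text))


def check_zone(candidate: str, current: str, max_shrink: float = 0.10) -> ZoneCheck:
    """Sanity-check a candidate zone against the zone currently in service."""
    reasons = []
    new_lines = record_lines(candidate)
    new_n = len(new_lines)
    old_n = count_records(current)

    if new_n == 0:
        reasons.append("candidate zone has no records")
    if not any("SOA" in line.split() for line in new_lines):
        reasons.append("candidate zone has no SOA record")
    # Hypothetical threshold: refuse if the zone shrank by more than max_shrink.
    if old_n > 0 and new_n < old_n * (1 - max_shrink):
        reasons.append(f"record count fell from {old_n} to {new_n}")

    return ZoneCheck(ok=not reasons, reasons=reasons)


def install_zone(candidate: str, current: str, force: bool = False) -> str:
    """Return the zone to put in service, or raise if the checks fail.

    force=True stands in for the deliberate manual override.
    """
    result = check_zone(candidate, current)
    if result.ok or force:
        return candidate
    raise ValueError("refusing to install zone: " + "; ".join(result.reasons))
```

A truncated or empty file fails the checks and install_zone raises
instead of quietly putting it in service; the old zone stays up until a
human explicitly passes force=True.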

I'll say this for a third time: we are engineers, not witch
doctors. These things *can* be built correctly.

Perry