North American Network Operators Group


Re: Monitoring highly redundant operations

  • From: Greg A. Woods
  • Date: Thu Jan 25 02:06:53 2001

[ On Wednesday, January 24, 2001 at 23:23:11 ( -0500), Howard C. Berkowitz wrote: ]
> Subject: Re: Monitoring highly redundant operations
>
> My first point is having what physicians call a "high index of 
> suspicion" when seeing a combination of minor symptoms.  I suspect 
> that we need to be looking for patterns of network symptoms that are 
> sensitive (i.e., high chance of being positive when there is a 
> problem) but not necessarily selective (i.e., low probability of 
> false positives).

Your analogy is very interesting because just like in this case with
M$'s DNS, the root cause may very well not have been in failing to
notice the symptoms or diagnose them correctly, but rather in allowing a
situation to build such that these symptoms even occur in the first
place.

I don't wish to read more into your analogy and your personal life (in a
public forum, no less!) than I have a right to, so let's say
"theoretically" that if past events in your life were under your direct
personal control, and were known at the time to be almost guaranteed to
bring on your condition, then presumably you could have avoided that
condition by actively avoiding or counteracting those events.

In the same way M$'s DNS would not likely have suffered any significant
visible problems, even if their entire campus had been torn to ruin by a
massive earthquake or whatever, if only they had deployed registered DNS
servers in other locations around the world (and of course if they'd
been careful enough to use them fully for all relevant zones).

The DNS was designed to be, and at least in theory can be, one of the
most reliable subsystems on the Internet.  However it isn't
that way by default -- every zone must be specifically engineered to be
that way, and then of course the result needs to be managed properly
too.  Luckily the engineering and management are extremely simple and in
most cases only require periodic co-operation of autonomous entities to
make it all fit together.  No doubt M$'s zones get a larger than average
number of queries, but still it's just basic engineering to build an
enormously reliable DNS system to distribute those zones and answer
those queries.  If this were not true the root and TLD zones would have
crumbled long ago (and stayed that way! :-).
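
A minimal sketch of the kind of check this implies, assuming you already
have the list of nameservers delegated for a zone and their addresses.
The function names, the /24 grouping and the example addresses (taken
from the RFC 5737 documentation ranges) are all assumptions made for
illustration.  Sharing a prefix is only a rough proxy for sharing a
failure point: servers in different prefixes can still sit behind the
same router.

```python
import ipaddress
from collections import defaultdict


def group_by_prefix(nameservers, prefix_len=24):
    """Group nameserver names by the IPv4 network that contains them."""
    groups = defaultdict(list)
    for name, addr in nameservers.items():
        # strict=False lets us pass a host address and get its network.
        net = ipaddress.ip_network(f"{addr}/{prefix_len}", strict=False)
        groups[net].append(name)
    return dict(groups)


def is_network_diverse(nameservers, prefix_len=24, min_networks=2):
    """True if the servers are spread over at least min_networks prefixes."""
    return len(group_by_prefix(nameservers, prefix_len)) >= min_networks


# Hypothetical delegations, not real data.
same_subnet = {
    "ns1.example.com": "192.0.2.10",
    "ns2.example.com": "192.0.2.11",
    "ns3.example.com": "192.0.2.12",
}
spread = {
    "ns1.example.com": "192.0.2.10",
    "ns2.example.com": "198.51.100.10",
    "ns3.example.com": "203.0.113.10",
}

if __name__ == "__main__":
    for label, servers in (("same_subnet", same_subnet), ("spread", spread)):
        print(label, "diverse:", is_network_diverse(servers))
```

A zone that fails even this crude test has every authoritative server
in one place, which is the situation described above.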

> Only after there is a clear clinical problem, or several pieces of 
> laboratory evidence, does a physician jump to more invasive tests, or 
> begin aggressive treatment on suspicion.  In like manner, you 
> wouldn't do a processor-intensive trace on a router, or do a possibly 
> disruptive switch to backup links, unless you had reasonable 
> confidence that there was a problem.

No, perhaps not, but surely in an organisation the size of M$ there
should have been enough operational procedures in place to have
identified the events shortly preceding the beginning of the incident
(e.g. the configuration change).  Similarly of course there should have
been procedures in place to roll back all such changes to see if the
problem went away.
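
The first of those procedures can be sketched in a few lines, assuming
changes are recorded somewhere with a timestamp.  The change log format,
the six-hour window and every entry below are hypothetical and exist
only to show the idea; they are not taken from the incident.

```python
from datetime import datetime, timedelta


def changes_before(changes, incident_start, window=timedelta(hours=6)):
    """Return changes made within `window` before the incident, newest first.

    `changes` is a list of (timestamp, description) tuples.
    """
    earliest = incident_start - window
    recent = [c for c in changes if earliest <= c[0] <= incident_start]
    # Newest first: the most recent change is the first rollback candidate.
    return sorted(recent, key=lambda c: c[0], reverse=True)


# Hypothetical change log and incident time.
CHANGE_LOG = [
    (datetime(2001, 1, 1, 9, 0), "rotate log files on ns1"),
    (datetime(2001, 1, 1, 17, 30), "edit router config in front of DNS servers"),
    (datetime(2001, 1, 1, 14, 0), "add new zone file"),
    (datetime(2001, 1, 1, 19, 0), "restart web frontend"),
]
INCIDENT_START = datetime(2001, 1, 1, 18, 0)

if __name__ == "__main__":
    for when, what in changes_before(CHANGE_LOG, INCIDENT_START):
        print(when.isoformat(), what)
```

Walking that list from the top and undoing each change in turn is the
rollback procedure in its simplest form.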

Obviously such operational recovery procedures are not always perfect,
as history has shown, but in the case of something as simple as a set of
authoritative nameservers is supposed to be, they should have been
highly effective.

Furthermore in this particular case there's no need for expensive or
disruptive tests -- a company the size of M$ should have had (and
perhaps do have, but don't know how to use effectively) proper test gear
that can passively analyse the traffic at various points on their
networks (including their connection(s) to the Internet) without having
to actually use their routers or servers for diagnostic purposes.

Finally in this particular case the outage was so long that there was
ample time for them to have deployed new, network-diverse servers,
added their IP#s to the TLD delegations for their zone, and had them show
up world-wide well before they'd fixed the actual problem!

-- 
							Greg A. Woods

+1 416 218-0098      VE3TCP      <[email protected]>      <robohack!woods>
Planix, Inc. <[email protected]>; Secrets of the Weird <[email protected]>