North American Network Operators Group

Monitoring highly redundant operations

  • From: Sean Donelan
  • Date: Wed Jan 24 18:38:55 2001

Not to pick on Dave, since I suspect he is going to have to face
the Microsoft PR department for re-indoctrination for speaking out
of turn, but I'm glad to see someone from Microsoft made an appearance.

But he does raise an interesting problem.  How do you know if your
highly redundant, diverse, etc. system has a problem?  With an ordinary
system it's easy.  It stops working.  In a highly redundant system you
can start losing critical components, but not be able to tell if
your operation is in fact seriously compromised, because it continues
to "work."

As many of us have found out as we moved from simple networks to more
complex networks, managing the network is often much harder than
designing its architecture.  Instead of relying on being
notified when stuff "breaks," you have to actively monitor the state
of your systems.  Fairly frequently I see cases where the backup system
has failed, but no one knows about it until after the primary system also
fails.
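
To make that concrete, here is a rough sketch in Python of what I mean
by actively monitoring state.  It probes every member of a redundant
group on its own and reports "degraded" as soon as any one of them is
gone, even though the service as a whole still answers.  The function
name, the member names and the probe callback are all made up for
illustration; plug in whatever check fits your own gear.

```python
from typing import Callable, Dict, List


def redundancy_status(
    members: List[str],
    probe: Callable[[str], bool],
    required: int = 1,
) -> Dict[str, object]:
    """Probe each member individually and report lost redundancy.

    members  -- hypothetical names of the redundant components
    probe    -- caller-supplied check; returns True if the member is healthy
    required -- how many healthy members the service needs to keep working
    """
    failed = [m for m in members if not probe(m)]
    healthy = len(members) - len(failed)

    if healthy < required:
        state = "down"        # the service itself can no longer work
    elif failed:
        state = "degraded"    # still "works", but redundancy has been lost
    else:
        state = "ok"

    return {"state": state, "healthy": healthy, "failed": failed}


if __name__ == "__main__":
    # Hypothetical inventory: the backup has quietly died.
    up = {"primary": True, "backup": False}
    print(redundancy_status(list(up), lambda name: up[name]))
```

The point of the middle case is that "degraded" should page someone
just like "down" does, since it is the only warning you get before
the primary goes too.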

From the typical monitoring stations Dave sees, everything appears
"normal."  Yet, out in the real world there is a problem.  Like most
things, it's rarely a single thing that breaks, but a chain of problems
resulting in the final failure.  So what should you be monitoring
in addition to the typical graphs and logins to detect the problem
seen by Microsoft yesterday and today?
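
One partial answer, offered only as a sketch: run the same check from
several vantage points, including some outside your own network, and
alarm when they disagree.  The Python below assumes you already have a
pass/fail result per vantage point; the vantage names and the function
are hypothetical, and I am not claiming this is what would have caught
this particular incident.

```python
from typing import Dict


def vantage_summary(results: Dict[str, bool]) -> Dict[str, object]:
    """Summarize one check as seen from several vantage points.

    results -- maps a (hypothetical) vantage point name to True if the
               check passed from there, False if it failed
    """
    failing = sorted(name for name, ok in results.items() if not ok)

    if not failing:
        verdict = "ok"
    elif len(failing) == len(results):
        verdict = "down"
    else:
        # Some locations see a problem and others do not.  This is the
        # case an internal-only monitoring station never notices.
        verdict = "partial"

    return {"verdict": verdict, "failing": failing}


if __name__ == "__main__":
    # Hypothetical results: fine from inside, broken from two outside sites.
    seen = {"internal-noc": True, "outside-east": False, "outside-west": False}
    print(vantage_summary(seen))
```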


On Wed, 24 January 2001, Dave McKay wrote:
> Microsoft's ITG is investigating this issue.  I haven't been clued in as 
> of yet as to what is the main issue.  Hotmail's graphs and logins are
> currently following the same trends as normal, they seem unaffected, 
> however this is not the case in all locations.  DNS seems to be the 
> obvious choice for the blame.  This is not the case in all areas, however.
> At this point Microsoft is not willing to put the blame on anyone, or
> any protocol for that matter.  (Unless they already released a public 
> statement saying so, then who knows?)  Anyway, the issues are being worked
> on and service will be restored as soon as possible.  I apologize for not
> being able to disclose more information.
> 
> -- 
> Dave McKay
> [email protected]
> Microsoft Global Network Architect