North American Network Operators Group


Provider outage explanations...

  • From: Jonathan Disher
  • Date: Sun Mar 25 07:05:41 2001

[Names withheld to protect the inept]

Tonight has been a frustrating night.  Two outages, both totally
unexpected, and as of yet, totally unexplained.  Admittedly, it's only
been 8 hours since the first, and 2 since the second.  But it really
should not take 48 hours (or, oftentimes, MORE) to come up with a valid
explanation of what happened.  We've had outages that have -never- been
explained... just ignored.

Basically, during the first outage, around 8 tonight, we saw everything blip for
about 3 minutes.  We're dual homed to $provider within the facility, one
gig-e pipe to each of two hosting routers.  What it looked like to me was
that both were rebooted, or something was cycled in between us and the
world.  When I called $provider, I was shooed off by one tech who promised
(and failed) to call back.  The second crammed the "No, you're an idiot,
you missed the published maintenance list which I'm emailing to you, go
away" line down my throat.  Of course, I don't know of providers that
perform that level of maintenance at 8pm PST.  None of their scheduled
maintenance was listed.

The second was, indeed, scheduled maintenance: replacing the secondary
hosting router.  Then someone (apparently) reloaded the primary router
while the secondary was -in pieces-... Then when I called, I was told "Oh,
no, we have no idea what happened, but we'll let you know in 48 hours..."  
My BGP session was reset.  -SOMETHING- happened, guys.

Neither of these outages generated an email to the customer notification
list.

IMO, this shouldn't be acceptable.  I may be new to some of this stuff,
but in my experience, it doesn't take long to figure out whether someone
reloaded the wrong router, or if the GRP let go of the precious magic
smoke.  What are some of the other major hosting/transit providers' outage
notification and post-mortem policies?  Should our next contract include
provisions for Rogaine, so that when I finish tearing out my hair I can
recover?

-j

-- 
-Jonathan Disher
-Sr. Systems and Network Engineer, Web Operations
-Internet Pictures Corporation, Palo Alto, CA
-[v] (650) 388-0497 | [p] (877) 446-9311 | [e] [email protected]