North American Network Operators Group

Netcom Outage (Was: My InfoWorld Column About NANOG)

  • From: Peter Kaminski
  • Date: Fri Jun 21 16:58:56 1996

On Fri, 21 Jun 1996 11:31:32 EDT, Bob Metcalfe wrote:

>By the way, there are reports from two days ago that 400,000 people lost
>their Internet access for 13 hours.  Sounds like an outage approaching
>"collapse."  Was that just a Netcom thing that NANOG has no interest in?
>Netcom is not talking very much about what happened.  Any clues/facts out
>there?  Were any NAPs involved?

According to the news reports, numerous external routes somehow leaked
into Netcom's internal backbone routing, and the extra load set off a
chain reaction that took everything down.

Apparently it was mainly confined to Netcom's network.  Whether by design
or by dumb luck, I don't know.  We currently hang off of AGIS in San
Jose, and for about four hours after Netcom came back, we were up and
down.  Couldn't tell from here if it was AGIS, MAE West, or what, or if
Netcom coming back had anything to do with it.

I watched this outage from the periphery, and was completely blown away by
the non-reaction to it.  Official statements from Netcom (essentially
confirming Bob's numbers above) were quoted on the Reuters newswire, and on
the front page of the San Jose Mercury News Business section the next day
(although the editor played down the impact of it a little, and mixed a
one-hour AOL email outage into the same story and turned it into "outages
affect online services").

On the other hand, Netcom has said essentially nothing to its subscriber
base about the outage.  I've seen only a little mention of it around the
net.  Am I looking in the wrong places -- or is there no good way to
communicate about these sorts of things yet?  (I've signed up to the outage
discussion list, as Sean suggested.)

My impression is starting to be that most Netcom subscribers didn't really
notice the difference between normal Internet operations and the 13-20 hour
outage, and/or didn't have the diagnostic capabilities to be able to tell.
There were technically oriented folks who could see that something was
going on, but even for them, it was hard to tell what.

I'm wearing two hats for the next set of questions -- the first as
a technical manager for an ISP growing an international backbone, and
the second as someone who's concerned about marketing the Internet
(and my company) to the public.

Can other big parts of the backbone fall down and take 13 (or more) hours
to get back up?  Or is the rest of the net engineered more redundantly than
Netcom?  Should I build two backbones, each with separate technologies?
Was this a foreshock of the coming Metcalfean Big One, or just lousy
procedures at one of the bigger ISPs?

Inquiring minds want to know.  Right now, there appear to be just a few
of them (thankfully?).  And now is the time to develop communications and
publicity strategies for outages like this one -- along with the
engineering that will, with luck, prevent them.

-- 
Pete Kaminski