North American Network Operators Group

NETCOM downtime a programming error. (fwd)

  • From: Michael Dillon
  • Date: Fri Jun 28 19:35:58 1996

---------- Forwarded message ----------
Date: Fri, 28 Jun 1996 17:42:30 -0400 (EDT)
From: Brian Tao <[email protected]>
Reply-To: [email protected]
To: [email protected]
Subject: NETCOM downtime a programming error. (fwd)
Resent-Date: Fri, 28 Jun 1996 16:45:28 -0500 (CDT)
Resent-From: [email protected]

    Not sure where this first showed up, but it sounds like some sort
of trade publication...

---------- Forwarded message ----------

IA: Take us through what happened...

GARRISON: Think about the network in three layers. The first layer is
Network Access Points, where we have peering agreements with other
carriers. It's the entry point to the Internet. (There are about a half
dozen across the U.S.) The next level down are hubs, which are our
internal virtual private network routing hubs that look at traffic and
direct it along the speediest line available. The third level down is
where the customer actually logs on, at an access POP (Point Of
Presence). At each of those levels there are routers made by Cisco and
others that have instruction tables on them -- "IF THEN" statements
that tell the traffic where to go, what route to take to reach its
destination.
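
The "IF THEN" tables described above can be pictured as a small lookup:
if the destination falls inside a prefix, then forward to that next
hop. What follows is only a rough sketch under stated assumptions, not
NETCOM's actual configuration. The prefixes, the next-hop names, and
the next_hop() helper are all hypothetical, and the "most specific
prefix wins" rule is the usual IP forwarding convention rather than
something stated in the interview.

```python
import ipaddress

# Hypothetical forwarding table: (destination prefix, next hop).
# Prefixes and next-hop names are invented for illustration only.
TABLE = [
    (ipaddress.ip_network("0.0.0.0/0"), "nap-uplink"),   # default route
    (ipaddress.ip_network("10.0.0.0/8"), "hub-1"),
    (ipaddress.ip_network("10.1.0.0/16"), "pop-7"),
]


def next_hop(dest, table=TABLE):
    """Return the next hop for dest, or None if no rule matches."""
    addr = ipaddress.ip_address(dest)
    # IF the destination is inside the prefix THEN the rule is a candidate.
    matches = [(net, hop) for net, hop in table if addr in net]
    if not matches:
        return None
    # Among the candidates, the longest (most specific) prefix wins.
    return max(matches, key=lambda m: m[0].prefixlen)[1]


print(next_hop("10.1.2.3"))
print(next_hop("192.0.2.1"))
```

An address that matches only the default route falls through to the
uplink, which is roughly the "entry point to the Internet" role given
to the first layer above.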

At the network access level, you have some pretty complex code that
says, 'If the traffic comes from this party, then do the following
thing with it.' And because of the number of new access providers or
changes in the access providers, there are daily changes made at the
network access layer. And these are changes that are made in software
to the routers. It's done in a language called BGP, or Border Gateway
Protocol. So, there was one line of code that said, literally, "No
redist bgp access list 25 in," just a line of code that revised an
instruction. Because the two sentences were put together as opposed to
being done on separate lines, the network read it as an "AND"
statement instead of an "OR" or an "IF" statement.

So, what happens is the network automatically replicates the
instruction set from the network access point from where this was
entered, which was Washington, DC, and it replicated itself to the
other network access points.  Because of the way the code was written,
it then said, 'ah hah, it's a network instruction, not a peering
instruction -- I'd better send it out to the hubs.' The hubs saw it,
and said, 'ah hah, I'd better send it out to the POPs.' Well, the POPs
-- the routers at the lower levels of the network -- do not have the
memory or capacity for the peering instructions because they don't
interface with anybody else, so they don't need that capacity.
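
The cascade described above (NAP to hubs to POPs, with the POPs lacking
the memory to hold what they were sent) can be modeled in a few lines.
This is a toy sketch, assuming each router has a fixed table capacity
and freezes when an update would exceed it. The Router class, the
replicate() helper, the router names, and every number are
hypothetical. The interview gives no actual table sizes.

```python
from dataclasses import dataclass, field


@dataclass
class Router:
    name: str
    capacity: int                      # max table entries (made-up figure)
    entries: int = 0
    frozen: bool = False
    downstream: list = field(default_factory=list)


def replicate(router, new_entries):
    """Push new_entries to router and everything below it.

    Returns the names of the routers that froze along the way.
    """
    if router.entries + new_entries > router.capacity:
        # Not enough memory for the update: the router locks up
        # and passes nothing further down.
        router.frozen = True
        return [router.name]
    router.entries += new_entries
    frozen = []
    for child in router.downstream:
        frozen.extend(replicate(child, new_entries))
    return frozen


# Three tiers, as in the interview: NAP -> hub -> POPs.
pops = [Router("pop-1", capacity=100), Router("pop-2", capacity=100)]
hub = Router("hub-1", capacity=10_000, downstream=pops)
nap = Router("nap-dc", capacity=100_000, downstream=[hub])

frozen = replicate(nap, 5_000)
print(frozen)
```

In this model the NAP and the hub absorb the update and only the small
POP routers freeze, which matches the failure pattern described in the
interview.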

So, when they got it, it basically froze the routers down at the third
level of the network. Meanwhile, we're sitting there reprogramming the
routers, but as fast as we can reprogram, the replication feature of
the intelligent network overwhelms our ability to keep up.
Basically our decision was to shut down the network to reboot the
routers, to put in a fresh instruction set.

That's a long-winded explanation, but because your readers are more
technical, it's worthwhile!



============================== ISP Mailing List ==============================
Email ``unsubscribe'' to [email protected] to be removed.
inet-access archives are at ftp://ftp.earth.com/pub/archive/inet-access/
