North American Network Operators Group

Re: Followup British Telecom outage reason

  • From: Jared Mauch
  • Date: Sat Nov 24 16:24:53 2001

On Sat, Nov 24, 2001 at 02:16:38PM -0500, Sean Donelan wrote:
> On Sat, 24 Nov 2001, Neil J. McRae wrote:
> > I'd be surprised if it was the GSR, and in any case that doesn't
> > absolve anyone. If it was a software issue, why wasn't the software
> > properly tested? Why was such a critical upgrade rolled out across
> > the entire network at the same time? It doesn't add up.

	Even after a full lab test and a limited "real-life"
deployment, you can never see all the cases that might come back
to haunt you later.  Sometimes you do an across-the-board upgrade
for security as well as specific feature/bugset reasons, to pin
the set of bugs down to ones where "we know what they are and how
to deal with them".

	No responsible vendor claims to have perfect software, and
only an irresponsible one would suggest that any specific image
is "perfect".

> It appears to be yet another CEF bug.  If you want to use a GSR
> you are stuck using some version of IOS with a CEF bug.  The
> question is which bug do you want.  Each version of IOS has
> a slightly different set.  Several US network providers have also
> been bitten by CEF bugs.

	True, but most of those are in the past.  I'm not familiar
with the specifics of the bugs that BT encountered, but one thing
worth noting is how well a Cisco router can function while in a
"broken" state when you need to get a 'fixed' image onto it.

	It would be nice if there were easier ways to do it in some
cases, but you can't have a perfect environment.  Especially when you
do software upgrades, you don't always have on-site hands standing by
to help you swap flash cards or deal with whatever logistical issues
you may encounter.

> While trying to fix one set of bugs, BT upgraded their network.
> I'm not sure if they were upgrading at 9am, or had upgraded
> earlier and the bug finally came out under load at 9am.
> When the BT network melted down, Cisco suggested installing a
> different version of IOS, which had previously been tested.  At
> noon, BT found the new version had an even worse bug, sending packets
> out the wrong interface.  It wasn't until 2200 (13 hours later) that
> BT and Cisco found a version of IOS which stabilized the network.
> "Stabilized", not fixed.  The running version of IOS still has a bug,
> but it isn't as severe.

	I'm sure that, after such a public outage, BT and Cisco are
having some conversations about what can be done to improve Cisco's
testing so that it better simulates BT's network.

-- 
Jared Mauch  | pgp key available via finger from [email protected]
clue++;      | http://puck.nether.net/~jared/  My statements are only mine.