North American Network Operators Group

Re: AT&T Outage cause last week.

  • From: Brett Frankenberger
  • Date: Wed Apr 22 18:21:06 1998

:: Rodney Joffe writes ::
> 
> According to AT&T, in a story just released, the problem was a Cisco
> code problem. It would be interesting to hear what the actual software
> cause was, if anyone from Cisco cares to let us know?.../rlj

http://www.cisco.com/warp/public/770/fn042198.shtml

is what's available publicly.  Summary:

Most StrataCom cards support what they call "Y-Redundancy", in which you
connect two physical cards with a Y cable so that the switch can fail
over between them in the event of a hardware failure.  The problem in
question concerns BXM cards (a DS3/OC-3/OC-12 trunk card) that are
configured for Y-Redundancy.

If the active card restarts, the previously-standby card will quickly
become the active card, and the previously-active card, when it comes
back, will become the standby card.  If the newly-active card then
restarts or fails within a certain period of time, the system is
vulnerable to the problem.
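
To make the sequence above concrete, here is a minimal sketch of the
two-restart condition.  It is only an illustration, not Cisco's logic:
the class name, the card labels, and especially the window length are
hypothetical, since the summary only says "a certain period of time".

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical value: the actual length of the window is not stated.
ASSUMED_WINDOW_SECONDS = 60.0


@dataclass
class YRedundantPair:
    """Toy model of two cards joined by a Y cable (names are made up)."""
    active: str = "A"
    standby: str = "B"
    last_switchover: Optional[float] = None

    def restart_active(self, now: float) -> bool:
        """Restart the active card and swap roles.

        Returns True if this restart happened within the assumed window
        after the previous switchover, i.e. the pair is in the state
        described as vulnerable.
        """
        vulnerable = (
            self.last_switchover is not None
            and now - self.last_switchover < ASSUMED_WINDOW_SECONDS
        )
        # The standby takes over; the restarted card returns as standby.
        self.active, self.standby = self.standby, self.active
        self.last_switchover = now
        return vulnerable


pair = YRedundantPair()
first = pair.restart_active(now=0.0)    # first restart: B becomes active
second = pair.restart_active(now=10.0)  # B restarts soon after: vulnerable
```

A single restart never trips the condition in this model; it takes a
second restart of the newly-active card inside the window.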

The specific problem is that the two cards get into a loop sending
cells to each other, and during each iteration of the loop, the active
card also sends one cell out on the trunk.  The cells that loop are
control cells, and when the node at the far end of the trunk receives
these cells at full line rate, it will eventually end up aborting (and
restarting).
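
For a sense of what "full line rate" means for the far-end node, the
sketch below converts trunk payload rates into cells per second.  The
payload figures are the standard ones for these interfaces (OC-3c and
OC-12c SONET payload, DS3 with PLCP framing) and are assumptions about
how the trunks are framed; the field notice summary gives no numbers.

```python
ATM_CELL_BITS = 53 * 8  # a 53-byte ATM cell is 424 bits

# Assumed payload rates in bits per second (not from the field notice).
PAYLOAD_BPS = {
    "DS3 (PLCP)": 40_704_000,
    "OC-3c": 149_760_000,
    "OC-12c": 599_040_000,
}


def cells_per_second(payload_bps: float) -> float:
    """Cells per second when every cell slot on the trunk is filled."""
    return payload_bps / ATM_CELL_BITS


for name, bps in PAYLOAD_BPS.items():
    print(f"{name}: about {int(cells_per_second(bps)):,} cells/s")
```

Under these assumptions the receiving node's control processor would
be handed somewhere between roughly 96 thousand and 1.4 million
control cells every second, depending on the trunk type.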

One way of becoming vulnerable to this is to upgrade BXM firmware in a
specific sequence that leads to the above-mentioned card restarts.
Specific sequences of hardware failures can also cause this, as can
specific sequences of card-level control commands (reset card, etc.). 

Based on ATT's press release, I would guess that they were upgrading
firmware on a pair of Y-Redundant BXMs in the order that opens up
vulnerability to this issue.  ATT's release said only one node was
being upgraded, and Cisco's field notice isn't clear on how this
failure on one node cascades throughout a large network.

Firmware fixes are available today.  Software upgrades are being
developed to contain the problem if it or a similar problem does occur.


          - Brett  ([email protected])  (Who's glad that the BPXs in
his network don't have BXMs).
 
------------------------------------------------------------------------------
                               ... Coming soon to a      | Brett Frankenberger
.sig near you ... a Humorous Quote ...                   | [email protected]