North American Network Operators Group


update regarding 12/3/94 service disruption

  • From: Steve Heimlich
  • Date: Sat Dec 10 16:19:02 1994

All,

ANS is exercising a fix on our testnet for the bug in the routing
software (known as "GateD" - gateway daemon) that caused the service
disruption on Saturday, 12/3/94.  The sequence of events leading to the
problem is extremely obscure, as should be evident from the description
below.  This particular bug has been triggered only twice before in the
history of our use of this software and will not appear again following
deployment of the new software.

We will implement a phased rollout of the new routing software.  The
rollout will begin this coming week on a small number of routers.
During the course of the week the behavior of the new software will be
observed.  Pending successful results, a network-wide deployment will
take place the week of 12/19.

Steve Heimlich
Manager, Infrastructure Development
ANS
-------

We have these prerequisites:

- there is a network X which is announced into our backbone 

- there is a primary announcement (1), a secondary announcement (2),
  and a tertiary announcement (3)

- one ENSS A acts as (1) and one ENSS B acts as both (2) and (3) (e.g.,
  MAE-East may speak with Sprint and Alternet, which may be secondary
  and tertiary providers, respectively)

- ENSS A must have a lower router ID than ENSS B (i.e., A < B)

and this sequence of events:

- ENSS A goes away non-gracefully such that iGP connectivity from the
  backbone to ENSS A is withdrawn but the iBGP session stays up (e.g.,
  a power loss or circuit outage but not a clean GateD shutdown)

- all routers notice loss of iGP connectivity to ENSS A within
  one minute and reset the next hop for route (1) to network X to be
  null, keeping the route in the BGP RIB in case iGP connectivity is
  restored

- in addition to the above, ENSS B injects (2) into the backbone via
  iBGP

- the exterior peer providing (2) withdraws the route to network X
  within 2 minutes of the initial AS 690 loss of iGP connectivity to
  ENSS A

- ENSS B then injects (3) into the backbone via iBGP 

- all other routers see that the preference for network X has worsened
  and therefore traverse the BGP RIB to find the best current route to
  network X, attempting to verify as well that any route under
  consideration has a valid next hop

- during the traversal, the routers mistakenly use an incorrect pointer
  to verify existence of a good next hop, not realizing that the former
  primary route (1) has a null next hop

- due to a bug in some comparison logic, the formerly primary route (1)
  is selected from the BGP RIB if A < B (the router ID prerequisite
  above) and is installed into the kernel

- the iBGP sessions from all backbone machines to ENSS A time out three
  minutes after loss of iGP connectivity to ENSS A 

- GateD crashes when it attempts to delete the mistakenly installed
  formerly primary route (1) from the kernel
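
The selection step in the sequence above can be illustrated with a small
sketch.  This is not GateD code: the Route fields, the "lower preference
is better" convention, and the exact form of both faults (a next-hop
check made through the wrong reference, and a comparison that falls
through to the router ID) are assumptions chosen only to reproduce the
behavior described, namely that route (1) with a null next hop wins when
ENSS A has the lower router ID.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Route:
    label: str               # "(1)", "(2)" or "(3)", as in the description
    router_id: int           # router ID of the announcing ENSS (made-up values)
    preference: int          # lower is better (an assumption of this sketch)
    next_hop: Optional[str]  # None models the "null" next hop


def select_correct(rib: List[Route]) -> Optional[Route]:
    """Pick the best route, skipping any route whose next hop is null."""
    usable = [r for r in rib if r.next_hop is not None]
    return min(usable, key=lambda r: (r.preference, r.router_id), default=None)


def select_buggy(rib: List[Route], announced: Route) -> Optional[Route]:
    """Hypothetical faulty traversal, for illustration only."""
    best = None
    for cand in rib:
        # Assumed fault 1: validity is checked through the wrong reference
        # (the newly announced route), so cand.next_hop is never examined.
        if announced.next_hop is None:
            continue
        # Assumed fault 2: the comparison looks only at the router ID.
        if best is None or cand.router_id < best.router_id:
            best = cand
    return best


def scenario(id_a: int, id_b: int):
    """RIB state after (2) is withdrawn and ENSS B injects (3)."""
    route1 = Route("(1)", id_a, 1, None)      # kept in the RIB, next hop null
    route3 = Route("(3)", id_b, 3, "peer-3")  # tertiary, next hop valid
    rib = [route1, route3]
    return select_correct(rib).label, select_buggy(rib, route3).label


if __name__ == "__main__":
    print(scenario(1, 2))  # A < B
    print(scenario(2, 1))  # A > B
```

In this sketch the correct selection always returns (3), while the faulty
one returns (1) only when A's router ID is lower than B's.  A route
installed this way has no usable next hop, which is the state the
description says GateD later fails to clean up when the iBGP session to
ENSS A times out.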

