North American Network Operators Group


Re: flap flap: AS 10916

  • From: John Todd
  • Date: Fri Aug 11 17:04:33 2000

A heavily flapping AS struck my curiosity: AS 10916.

Somehow, AS 6138 constantly appears and disappears out of the
path leading through AS 1239 - the other path to them via AS 701
is barely flapping at all.

# sh ip bgp regexp 10916
Network Next Hop Metric LocPrf Weight Path
*d censored 95 110 0 13789 1239 10916 i
*> censored 98 100 0 701 10916 i

And every few minutes:

*d censored 95 110 0 13789 1239 6138 10916 i
*> censored 98 100 0 701 10916 i

What could be the cause for an AS appearing/disappearing in a path
every few minutes? Is it really AS 6138 that is flapping for 10916?
For some reason they prefer the indirect route through 6138 to 1239
(SprintLink), instead of their direct connection to 1239. These are
the times when such a peer should be shut down for the sanity of the
rest of the network.


A few easily identified reasons for such a flap (certainly not a complete catalog) come to mind, knowing nothing about them beyond what you describe above:

(a) Router with insufficient memory for a full BGP table from that view (it fills up to memory capacity, collapses, BGP is reset, routes flap, wash, rinse, repeat)

(b) Link that both BGP and traffic pass through is insufficient for continued keepalives once traffic moves in that direction (line becomes preferred by a large amount of traffic, traffic floods line, BGP keepalives fail, BGP session fails, traffic moves away, wash, rinse, repeat - see the RED discussion in the archives from some months ago for more detail on traffic flow dampening with similar patterns.)
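
To make the loop in (b) concrete, here is a toy simulation. It is only a sketch under invented assumptions: the function name, the capacity and load numbers, and the tick-based hold and reconnect timers are all hypothetical and are not taken from any real router or from the networks discussed here.

```python
def simulate_flap(ticks, capacity=100, offered_load=150,
                  hold_time=3, reconnect_delay=2):
    """Toy model of scenario (b); every parameter is a made-up illustration.

    While the session is up the link is preferred and carries offered_load.
    If that exceeds capacity, keepalives are assumed lost; after hold_time
    consecutive losses the session drops, traffic moves away, and the
    session comes back after reconnect_delay ticks.
    """
    session_up = True
    missed = 0      # consecutive ticks with lost keepalives
    down_for = 0    # ticks spent with the session down
    history = []
    for _ in range(ticks):
        if session_up:
            congested = offered_load > capacity
            missed = missed + 1 if congested else 0
            if missed >= hold_time:
                # Hold timer expired: session fails, traffic moves away.
                session_up = False
                missed = 0
                down_for = 0
        else:
            down_for += 1
            if down_for >= reconnect_delay:
                # Link is idle again, so the session re-establishes.
                session_up = True
        history.append(session_up)
    return history


if __name__ == "__main__":
    states = simulate_flap(20)
    print("".join("U" if up else "d" for up in states))
```

With the load above capacity the session keeps cycling between up and down; with the load at or below capacity it stays up, which is the difference between a flapping and a stable peer in this simplified model.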

Problem type (a) happens to quite a few people, more often than you might think - I've run across it several times in the dim past, either as a memory problem or with BGP implementations that choked on certain corrupted/unusual advertisements halfway through the table transfer. If it's a memory problem, it's often an ACL issue: someone removed the "sanity" ACL that would otherwise protect a smaller router from the failures that occur with a full table update. ("Gee, this ACL seems to be preventing a full table from being sent to the customer. I'm sure the customer really wanted a full table - I'll remove the list.")

This all being said, I'm willing to bet that neither (a) nor (b) is at the root of the problem here, but they're both possible. Your question centered more on the path than on the cause, so I'll take a swing at it.

Since you're looking at the insertion of an AS into a path, one possible situation is that AS10916 peers with AS1239 and also peers with AS6138, which is itself a transit customer of AS1239. Sprint (AS1239) would prefer and re-advertise the route heard directly from its customer (AS10916) whenever it could hear it, and that would override the less-preferred announcement coming via AS6138. When (link 1) goes away, Sprint would then prefer and re-announce the route from 10916 being heard via (link 2).

| link 2
/ \________________/
/ link 1
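
The path change described above can be sketched in a few lines. This is a simplified illustration, not Sprint's actual policy: it assumes link 1 is the direct 10916-1239 session and link 2 is the path through 6138, that both announcements carry the same local preference, and that the tie is then broken on AS-path length alone. The function names and the local-pref value are hypothetical.

```python
def best_path(candidates):
    """Pick a route: highest local-pref first, then shortest AS path.

    This is a heavily reduced version of BGP best-path selection; the
    real decision process has many more tie-breakers.
    """
    if not candidates:
        return None
    best = min(candidates,
               key=lambda c: (-c["local_pref"], len(c["as_path"])))
    return best["as_path"]


def sprint_view(direct_link_up):
    """Routes AS1239 is assumed to hear for AS10916's prefix."""
    # Announcement relayed by AS6138 (assumed to arrive over link 2).
    candidates = [{"as_path": [6138, 10916], "local_pref": 100}]
    if direct_link_up:
        # Announcement heard straight from AS10916 (assumed link 1).
        candidates.append({"as_path": [10916], "local_pref": 100})
    return best_path(candidates)


def path_seen_downstream(direct_link_up):
    """AS path as seen behind AS13789, which hears the route from AS1239."""
    return [13789, 1239] + sprint_view(direct_link_up)


if __name__ == "__main__":
    for up in (True, False):
        path = " ".join(str(asn) for asn in path_seen_downstream(up))
        print("link 1 up:" if up else "link 1 down:", path)
```

Under these assumptions, each time the direct session drops and returns, the path seen through 1239 alternates between "13789 1239 10916" and "13789 1239 6138 10916", which matches the two outputs quoted in the question.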