North American Network Operators Group
Re: Persistent BGP peer flapping - do you care?
> A good rule of thumb (possibly from RFC 822) is, be liberal in what you
> accept and strict in what you send.

That's a good rule of thumb in general, but I'm not sure it makes sense to apply it to the routing fabric of the entire Internet. Any router that sends you a malformed update is without question broken. I think the point of saying "drop the session on receipt of a bad update" is that accepting updates from broken routers is bad practice when those updates are being used as the basis for routing in the global Internet. If you leave the session up, and just drop the malformed update, you are then accepting and passing on (assuming you have peers or downstreams) routing updates from a router known to be broken.

In terms of the original rule, if you're liberal in what you accept from BGP (i.e. you reject the malformed update, but accept the other updates from the same router), you are also (if you have any peers or downstreams) effectively being liberal (instead of strict) in what you send them. (Sure, you'll be strict about the *formatting* of what you send. But you're being liberal in the sense that you're passing on routes from a known-to-be-broken router.)

> I would suspect that during implementation, brand C routers were the
> victims during testing, and perhaps the change was made to avoid that
> happening.

ISTM that if that were the case, Brand C would have chosen to reject the update but maintain the session, as opposed to accepting and passing on the update. My guess is that it was just an ordinary bug in the AS_PATH validation code that resulted in the BGP implementation failing to realize that the update was malformed.

The ensuing meltdowns caused by the bug are essentially a problem of the homogeneity of the Internet. The malformed update could only spread from Brand C to Brand C; had there been a lot more diversity in the core of the Internet, the update would probably not have spread as far or had as great an impact.
> My suggestion would be, rather than a back-off of resetting BGP sessions,
> that first attempt strict interpretation (to insulate against completely
> insane routers), and then loose interpretation. The model is "Fool me once,
> shame on you, fool me twice, shame on me."
>
> On first receiving a bad update, reset. If upon re-establishing the session,
> the same bad update is heard, drop the bad update but keep the session up
> (along with the messages back, etc.)

The potential risk I see is that you are still passing on updates from a router known to be broken. From a purely reactive perspective, we look at past failures and say "when it happened last time, that would have been a good idea, because all the other updates were good". But from a proactive, more general perspective, the receiving router really has no way of knowing just how broken the router on the other end of the link is.

I do agree, though, with the observation that this can vary on a case-by-case basis. For example, a multi-homed end user isn't generally propagating any of its BGP-received routes. So it might make sense in such a case to reject only the malformed update, because the alternative is to significantly degrade their connectivity over a single routing update. (And there is no offsetting benefit to the core routing fabric of the Internet, because such an end user isn't really participating in that.)

So, yes, having a knob to control the behavior might be a good idea. But I would stop short of saying that everyone in the core should configure that knob to leave the session up with a known-to-be-broken neighbor.

> Resetting BGP more than a small, finite number of times is, IMHO, a bad
> idea. After all, BGP is a stateful protocol, and state changes should be
> triggered deterministically, even if that requires operator input.

Yes, I agree with that also.
Dropping a session to a misbehaving peer is a good idea; restarting it immediately after every drop (so you can just drop it again when it misbehaves again) is bad.

-- Brett
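
The policy discussed in this thread (reset on the first bad update, drop a repeated bad update but keep the session up, and stop resetting automatically after a small number of attempts) can be sketched roughly as follows. This is only a minimal Python illustration under stated assumptions, not any vendor's behavior or a real BGP implementation: the names PeerPolicy, tolerate_repeat, max_resets and operator_clear, and the idea of identifying a bad update by a "fingerprint" string, are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    RESET_SESSION = "reset session"
    DROP_UPDATE = "drop update, keep session up"
    HOLD_DOWN = "leave session down until an operator clears it"


@dataclass
class PeerPolicy:
    """Hypothetical per-peer handling of malformed UPDATE messages."""
    tolerate_repeat: bool = True  # the "knob": False means never keep the session up
    max_resets: int = 3           # illustrative cap on automatic resets
    resets: int = 0
    seen: set = field(default_factory=set)

    def on_malformed_update(self, fingerprint: str) -> Action:
        # "Fool me twice": the same bad update came back after a reset,
        # so (if the knob allows it) drop it and keep the session.
        if self.tolerate_repeat and fingerprint in self.seen:
            return Action.DROP_UPDATE
        # Past the cap, stop flapping and wait for operator input.
        if self.resets >= self.max_resets:
            return Action.HOLD_DOWN
        # "Fool me once": remember the update and reset the session.
        self.seen.add(fingerprint)
        self.resets += 1
        return Action.RESET_SESSION

    def operator_clear(self) -> None:
        # Manual intervention re-arms automatic resets.
        self.resets = 0
        self.seen.clear()
```

With tolerate_repeat=True (perhaps reasonable for a multi-homed end user, per the argument above), a second copy of the same bad update is dropped and the session stays up. With tolerate_repeat=False (the stricter choice argued for in the core), the session keeps being reset until max_resets is reached and then stays down until operator_clear() is called.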