North American Network Operators Group

Re: Time to revise RFC 1771

  • From: Clayton Fiske
  • Date: Tue Jun 26 23:54:58 2001

[Note: I am not trying to bash on Cisco for this. Everyone has their
bugs; this one is just good for illustrating the point.]

On Tue, Jun 26, 2001 at 07:29:24PM -0700, Sean Donelan wrote:
> 
> On Tue, 26 June 2001, Clayton Fiske wrote:
> > Plus, a CRC error can occur between two valid, compliant, bug-free
> > implementations. A bad route, by definition, can't. We're not talking
> > about external faults here, but broken implementations. When one side
> > of a protocol session simply breaks the rules, I don't think it's
> > reasonable to say that the other side needs to be "fixed" to accept
> > that breakage. Fix the broken side.
> 
> Uhm, let's see what you think of this press announcement.
> 
> "Pardon us, we must shut down the Internet while we decide whose software
> needs to be fixed.  Don't worry, as soon as it is fixed, the Internet
> will be rebooted."

That's funny, I thought it was pretty clear who was broken here. How
about one more accurately reflecting the situation?

"A major vendor has discovered a bug in their BGP implementation which
can cause unnecessary instability in the event of a malformed
announcement. While this has been a relatively rare occurrence thus far,
they have already issued a patch to correct this behavior. Network
providers are encouraged to apply this patch as soon as possible."

> Yep, somebody's implementation was broken.  One part of the response
> is to fix their implementation.  While we were waiting to get the fix, the
> rest of the Internet should not have been flapping.

I happen to have some Vendor X routers on my network, and I sure didn't
notice The Internet flapping. I guess I'm in that small section that's
not a downstream or direct peer of the offending network.

I don't object to the discussion of changing the RFC (whether I agree
or not), and I accept that Vendor [everyone else except Cisco] having
a knob for this would have prevented some routing disruptions for some
networks. But then again, static routes would have prevented that too.
It doesn't mean they're a good idea. What I object to is that people
are using this particular case as justification for said discussion.

Suppose the bug in question had manifested itself differently. Suppose
it thought the announcement was malformed, when in fact it was correct.
Suppose it behaved the same way, by passing on the announcement and
then dropping the originating session. All of the same BGP sessions on
the offending provider's [presumably] homogeneous network would still
have dropped. Sure, the border router's session with Vendor X would
have stayed up, but the border router probably wouldn't have any routes
left to feed to Vendor X, because its iBGP mesh was bouncing all over
the place. Now is it the fault of Vendor X for not having a knob? The
damage would have been essentially the same: major routing disruptions
for traffic transiting that network.

My point is that the nature of this bug was particularly nasty, and it
bit several people when it was triggered. However, it was only because
of someone else's protocol violation that it was triggered. If we are
going to decide that the RFC needs to change to allow for what -might-
happen if one implementation's bug triggers another implementation's
bug to wreak havoc, I think we're going to be here for a long time.

Had the first router(s) to receive the malformed route behaved as
the RFC dictates and dropped the offending session, damage would have
been limited to the offending router and its downstreams only. I don't
agree that this behavior makes The Internet more brittle.

This is my final post on the subject. I will be happy to continue the
discussion privately if anyone wishes.

-c