North American Network Operators Group


Re: Scalability issues in the Internet routing system

  • From: Andre Oppermann
  • Date: Wed Oct 26 16:07:25 2005

Blaine Christian wrote:

Another thing, it would be interesting to hear of any work on breaking the "router code" into multiple threads. Being able to truly take advantage of multiple processors when receiving 2M updates would be the cats pajamas. Has anyone seen this? I suppose MBGP could be rather straightforward, as opposed to one big table, in a multi-processor implementation.

You may want to read this thread from the beginning. The problem is not
the routing plane or routing protocol but the forwarding plane, the ASICs
or whatever. The two have very different scaling properties. The forwarding
plane is at a disadvantage here because it simultaneously faces growth
in table size and less time to perform each lookup. With current CPUs you
can handle a 2M-prefix DFZ quite well without killing the budget. For the
forwarding hardware this is unfortunately not the case.

Hi Andre...

I hear what you are saying but don't agree with the above statement. The problem is with the system as a whole, and I believe that was the point Valdis, and others, were making as well. The forwarding plane is only one part of the puzzle. How do you get the updates into the forwarding plane? How do you get the updates into the router in the first place, and how fast can you do that? I have seen at least one case where the issue did not appear to be the ASICs but getting the information into them rapidly. If you go and create a new ASIC without taking into account the manner in which you get the data into it, you probably won't sell many routers <grin>.
Sure, if you have a bottleneck at FIB insertion you fail much earlier.
I'd say if that happens it's an engineering oversight or a design
tradeoff.  However, I don't think this is the choke point in the entire
routing table size equation.

Depending on the type of prefix churn you don't have that many transactions
reaching the FIB.  Most far-away churn doesn't change the next hop, for
example.  Local churn, when direct neighbors flap, mostly just changes the
nexthop (egress interface).  In a high-performance FIB (ASIC, TCAM, or
whatever) a nexthop change can be done quite trivially.  A prefix withdrawal
can be handled by marking the entry invalid and garbage collecting it later.
A prefix insertion may either salvage an invalidated entry or require a full
insertion, whose cost depends on the algorithms of the FIB table
implementation.

For all practical purposes a FIB can be designed to be quite speedy in this
regard without busting the budget.
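To make the update model above concrete, here is a rough sketch in Python.  All names are invented for illustration; a real FIB lives in hardware and its insertion cost depends on the table algorithm, as noted above.

```python
# Illustrative sketch of the FIB update handling described above:
# nexthop changes rewrite in place, withdrawals only mark the entry
# invalid, and a later insert can salvage the invalidated slot.
# Names are hypothetical, not from any real router.

class Fib:
    def __init__(self):
        self.entries = {}  # prefix -> [nexthop, valid_flag]

    def insert(self, prefix, nexthop):
        slot = self.entries.get(prefix)
        if slot is not None:
            # salvage an invalidated (or existing) entry cheaply
            slot[0], slot[1] = nexthop, True
        else:
            # full insertion; in hardware this is the expensive path
            self.entries[prefix] = [nexthop, True]

    def change_nexthop(self, prefix, nexthop):
        # trivial in-place rewrite, the common case for local churn
        self.entries[prefix][0] = nexthop

    def withdraw(self, prefix):
        # mark invalid now, defer the actual removal
        self.entries[prefix][1] = False

    def gc(self):
        # garbage-collect invalidated prefixes outside the critical path
        dead = [p for p, (_, ok) in self.entries.items() if not ok]
        for p in dead:
            del self.entries[p]
        return len(dead)
```

The point of the invalid flag is that withdrawal and re-announcement, a very common churn pattern, never pay the full insertion cost twice.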

The link speed between two DFZ routers has seldom been the limit for initial
routing table exchanges.  Neither has TCP.  The exchange time is mostly
dominated by the algorithm choice and the CPU of the RIB processor on both
ends.

BTW, I do agree that spinning new ASICs is a non-trivial task and certainly one you want to start early when building a new system.
It is non-trivial because of the prefix storage size and ultra-fast lookup
times required.  Longest prefix match is probably the most difficult thing
to scale properly, as a search must always consider a number of overlapping
prefixes.
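A toy binary-trie lookup in Python shows why the overlapping prefixes hurt: the search cannot stop at the first match but must walk every covering prefix and remember the longest one seen.  This is a software sketch only; hardware uses very different structures.

```python
# Minimal binary-trie longest-prefix match, purely illustrative.

def ip_bits(addr):
    """IPv4 dotted-quad -> 32-character bit string."""
    return "".join(f"{int(o):08b}" for o in addr.split("."))

class TrieNode:
    __slots__ = ("children", "value")
    def __init__(self):
        self.children = {}  # '0'/'1' -> TrieNode
        self.value = None   # nexthop if a prefix ends here

class LpmTable:
    def __init__(self):
        self.root = TrieNode()

    def add(self, prefix, nexthop):
        addr, plen = prefix.split("/")
        node = self.root
        for bit in ip_bits(addr)[: int(plen)]:
            node = node.children.setdefault(bit, TrieNode())
        node.value = nexthop

    def lookup(self, addr):
        node, best = self.root, None
        for bit in ip_bits(addr):
            if node.value is not None:
                best = node.value  # a shorter covering prefix matches too
            node = node.children.get(bit)
            if node is None:
                break              # no longer match possible
        else:
            if node.value is not None:
                best = node.value
        return best  # the longest of all overlapping matches
```

Note that a /8 covering a more-specific /16 forces the search to keep going past the /8; that is the "overlapping" cost referred to above.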

To scale this much better and remove the bottleneck you may drop the 'overlapping'
part or the 'longest-match' part and the world suddenly looks much brighter.

This is the crucial thing that got forgotten during the IPng design phase
which brought us IPv6.

So far we have learned that limiting the number of IPv[46] prefixes in the
DFZ is not an option for commercial and socio-technical reasons.  That leaves
only the other option of changing the routing lookup to something with better
scaling properties.

I did read your comment on BGP lending itself to SMP. Can you elaborate on where you might have seen this? It has been a pretty monolithic implementation for as long as I can remember. In fact, that was why I asked the question, to see if anyone had actually observed a functioning multi-processor implementation of the BGP process.
I can make the SMP statement with some authority, as I have done the internal
design of the OpenBGPd RDE and my co-worker Claudio has implemented it.  Given
proper locking of the RIB, a number of CPUs can crunch on it and handle neighbor
communication independently of each other.  If you look at Oracle databases,
they manage to scale performance by a factor of 1.9-1.97 per CPU.  There is no
reason to believe we can't do this with the BGP 'database'.
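The shape of the idea, per-neighbor workers applying updates to one shared, locked RIB, can be sketched in a few lines of Python.  The threads, names, and update feeds here are all invented for illustration; the real OpenBGPd RDE is written in C and differs in detail.

```python
# Toy illustration of SMP BGP processing: each neighbor session runs
# on its own thread and applies updates to a shared RIB under a lock.
# Hypothetical sketch only, not the OpenBGPd design.

import threading

class Rib:
    def __init__(self):
        self.lock = threading.Lock()
        self.routes = {}  # prefix -> (nexthop, neighbor)

    def update(self, prefix, nexthop, neighbor):
        with self.lock:  # proper locking of the shared RIB
            self.routes[prefix] = (nexthop, neighbor)

def neighbor_worker(rib, neighbor, updates):
    # neighbor communication handled independently per thread/CPU
    for prefix, nexthop in updates:
        rib.update(prefix, nexthop, neighbor)

rib = Rib()
feeds = {
    "peerA": [(f"10.{i}.0.0/16", "192.0.2.1") for i in range(1000)],
    "peerB": [(f"172.{i % 32}.0.0/16", "192.0.2.2") for i in range(1000)],
}
threads = [threading.Thread(target=neighbor_worker, args=(rib, n, u))
           for n, u in feeds.items()]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With fine-grained locking (per prefix or per table shard instead of one big lock) contention stays low, which is what makes near-linear per-CPU scaling plausible.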