North American Network Operators Group


RE: OSPF multi-level hierarchy: side question

  • From: Vadim Antonov
  • Date: Fri May 28 18:47:46 1999

Steve Meuse <[email protected]> wrote:

> On the other hand, you can choose to build a box that can handle thousands
> of customers, and decrease the traffic load, but also increase the
> likelihood of a failure that can directly affect a larger percentage of
> customers.

Dan Rabb <[email protected]> wrote:

> Routers will inevitably fail.  The question becomes how much exposure do you
> want when it does?

First, you have to stop thinking of routers as "black boxes" and expose the
internal structure of large boxes so you can compare it with clusters.

In this respect, the big router designs I know of are eminently more reliable
than clusters of traditional routers, for a number of reasons:

1) the connectivity between components ("elementary routers") is significantly
   richer, with many diverse paths between components.

2) the design is inherently simpler than that of a multi-vendor,
   multi-standard cluster, with significantly fewer distinct components
   and a much more regular topology.  Simplicity directly translates into
   reliability.

3) there is built-in support for extensive fault tolerance and self-diagnostics
   at a level simply unachievable with standard routing protocols (which by
   their nature do not have the foggiest idea of the internal structure and
   diagnostic possibilities of the routers, and do not provide any support
   for state mirroring).

4) the individual failure blocks are much smaller (i.e. one "line card" vs an
   entire router, at least in the Pluris design -- the line card interface is
   not a bus but a serial line with a protocol that cannot be screwed up by a
   misbehaving line card, unlike any known bus protocol).

5) power supplies are distributed (the Pluris box simply has a separate DC-DC
   converter on every card)

6) at least one vendor (Pluris) has all card cages completely isolated electrically

7) last (but not least), terabit routing inherently relies on
   inverse-multiplexing over multiple parallel channels, which allows service
   to degrade gracefully on individual channel or path failures -- without any
   need to make the problem visible at the IP level, and therefore without
   being limited by the performance of distributed routing algorithms.
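The inverse-multiplexing point above can be sketched in a few lines.  This is
my own illustration, not Pluris code: packets are striped round-robin over the
currently live parallel channels, and a channel failure only shrinks aggregate
capacity -- the remaining channels keep carrying the full stream, with nothing
for IP-level routing to react to.

```python
# Sketch of inverse-multiplexing with graceful degradation.  A real
# design would also handle per-channel sequencing and reassembly at
# the far end; here a "channel" is just a name.

from itertools import cycle

class InverseMux:
    def __init__(self, channels):
        self.live = list(channels)

    def send(self, packets):
        """Stripe packets round-robin over the live channels."""
        assignment = {}
        chans = cycle(self.live)
        for p in packets:
            assignment.setdefault(next(chans), []).append(p)
        return assignment

    def fail(self, channel):
        # Losing a channel degrades capacity but never blocks traffic,
        # as long as at least one channel survives.
        self.live.remove(channel)

mux = InverseMux(["ch0", "ch1", "ch2", "ch3"])
full = mux.send(range(8))      # 8 packets over 4 channels
mux.fail("ch2")                # one parallel path dies
degraded = mux.send(range(8))  # same 8 packets over 3 channels
```

The point of the sketch is that `fail()` is purely local bookkeeping: no
routing update is generated, only the striping pattern changes.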

Alex Zinin <[email protected]> wrote:

> The "have more, bigger boxes rather than fewer, smaller ones" approach is not
> for everybody and not for every case. If you have clusters sitting in one room,
> powered from the same source, sharing the same ceiling that can fall, running
> the same version of soft, using the same config., etc., then yes, it's ok,
> because they will more likely crash at the same moment.

A big router does not have to be all in one place physically.  The Pluris
design allows hundreds of feet of component separation with optical cabling.

> Also, even if you do use a large box, you probably don't wanna know
> all the details about its connections at some level of your network.

The whole premise of the big-box design is that its internal capacity is so
much bigger than its interface capacity that from the outside it looks like a
single point, w/o any need to optimize routing inside.  From the perspective
of network management, of course, big boxes have to provide detailed internal
status info.  A sane design for a big router has an out-of-band diagnostic
network within the box.

>>    to eliminate updates which "do not matter" unlike SPF-based algorithms
>>    which have to inform everyone about local topology changes.
>
> In SPF-based protocols we have areas for this purpose---we do not propagate
> topology information across the area boundaries.

Across boundaries which have to be configured _manually_.  DV and diffuse
algorithms tend to squelch topology updates automatically _within_ an area
when a same-metric alternative path is found.  SPF has to have a coherent
picture of the network topology at all times, so flap can easily kill it off.
Diffuse algorithms, by design, work well in a network with rapidly changing
topology.
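That squelching behavior can be sketched as follows.  This is my own
illustration, not any particular protocol's machinery (the rule resembles the
feasible-successor idea in diffusing-update algorithms such as DUAL): on a
next-hop failure, the node switches to a same-metric alternative silently, and
only generates an update for its neighbors when the best metric actually
changes.

```python
# Sketch of a distance-vector node squelching an update: if the lost
# next hop has a same-metric alternative, switch to it silently; only
# when the best reachable metric changes must neighbors be informed.

def handle_link_failure(routes, dest, failed_hop):
    """routes: dest -> list of (next_hop, metric) alternatives."""
    old_best = min(m for _, m in routes[dest])
    alts = [(h, m) for h, m in routes[dest] if h != failed_hop]
    routes[dest] = alts
    if alts and min(m for _, m in alts) == old_best:
        return None            # same-metric path exists: no update sent
    return (dest, alts)        # metric changed: must inform neighbors

routes = {"10.0.0.0/8": [("A", 5), ("B", 5), ("C", 9)]}
quiet = handle_link_failure(routes, "10.0.0.0/8", "A")  # B still costs 5
noisy = handle_link_failure(routes, "10.0.0.0/8", "B")  # best is now 9
```

An SPF protocol, by contrast, has to flood both failures to every router in
the area so each can rerun its computation, even though the first one changes
nothing about reachability or cost.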

--vadim