North American Network Operators Group


Re: Server Redundancy

  • From: Joe Abley
  • Date: Thu Aug 07 12:41:32 2003

On Thursday, 7 August 2003, at 07:28AM, Rob Pickering wrote:

Then you've just got your BGP convergence time and unequal load balancing effects to worry about.

Whilst I'm not knocking Paul's solution in an application like running a root NS for which it is perfect, I'm not so sure it's necessarily best for every kind of service load balancing.
We're using the technique Paul used in local clusters with OSPF; the convergence time in a single OSPF area which contains only a small number of servers and a couple of routers is pretty small. There's no BGP convergence issue in this application (there's no BGP within the server cluster).

We're using another anycast technique in the wide area, using BGP to advertise covering supernets for services which are offered autonomously in multiple locations. BGP is involved in this one, but we are mitigating the potential for flap damage or transient convergence loops by offering service from remote nodes to a local community only, and not the whole Internet (i.e. the service supernet is offered as a peering route, with restricted propagation, and not for global transit).
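
To make the restricted-propagation idea concrete, here is a minimal Python sketch of an export decision for the service supernet. It is an illustration only, not the configuration described in the tech note: the prefix, the session kinds and the `export` function are all hypothetical, and tagging the route with the well-known NO_EXPORT community (RFC 1997) is an assumption about one common way to keep a peer from passing a route on.

```python
from dataclasses import dataclass
from enum import Enum
from ipaddress import ip_network
from typing import Optional, Tuple

# Well-known BGP community from RFC 1997: do not advertise outside the
# receiving AS. Using it here is an assumption, not the author's stated method.
NO_EXPORT = "65535:65281"

# Hypothetical service supernet (documentation address space).
SERVICE_SUPERNET = ip_network("192.0.2.0/24")


class SessionKind(Enum):
    LOCAL_PEER = "local-peer"  # peer reached at the node's own exchange point
    TRANSIT = "transit"        # upstream that would propagate the route globally


@dataclass(frozen=True)
class Announcement:
    prefix: object
    communities: Tuple[str, ...]


def export(prefix, session: SessionKind) -> Optional[Announcement]:
    """Decide what, if anything, to announce for `prefix` on `session`."""
    if prefix == SERVICE_SUPERNET:
        if session is not SessionKind.LOCAL_PEER:
            return None  # never offer the service supernet for global transit
        return Announcement(prefix, (NO_EXPORT,))
    # Any other prefix is announced unchanged in this sketch.
    return Announcement(prefix, ())


to_peer = export(SERVICE_SUPERNET, SessionKind.LOCAL_PEER)
to_transit = export(SERVICE_SUPERNET, SessionKind.TRANSIT)
```

With this policy a flap at a remote node is only ever seen by that node's local peers, which is the containment property described above.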

The general approach we're taking with the wide-area, global service distribution technique is described here:

http://www.isc.org/tn/isc-tn-2003-1.html
http://www.isc.org/tn/isc-tn-2003-1.txt

I've used both the route hack based and commercial NAT load balancers, and they both have their place.
It's not really that much of a hack; it's just anycast over an IGP coupled with routers which can populate the FIB with multiple equal-cost routes with different next-hops, with some manner of flow hash to keep traffic from a single session pointing at the same server.
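
As a rough illustration of the flow-hash idea, here is a minimal Python sketch that maps a flow's 5-tuple onto one of several equal-cost next hops. Everything in it is an assumption made for illustration: the `Flow` type, the `pick_next_hop` function and the addresses are hypothetical, and SHA-256 simply stands in for whatever hash a given router actually implements in its forwarding path.

```python
import hashlib
from typing import NamedTuple, Sequence


class Flow(NamedTuple):
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    proto: int  # IP protocol number, e.g. 6 for TCP, 17 for UDP


def pick_next_hop(flow: Flow, next_hops: Sequence[str]) -> str:
    """Choose a next hop deterministically from the flow's 5-tuple."""
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    # Reduce the first 8 bytes of the digest to an index into the ECMP set.
    index = int.from_bytes(digest[:8], "big") % len(next_hops)
    return next_hops[index]


# Hypothetical servers, all advertising the same anycast service address.
servers = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]

flow = Flow("192.0.2.7", 40001, "198.51.100.53", 53, 17)
first = pick_next_hop(flow, servers)
second = pick_next_hop(flow, servers)  # same flow, same server
```

Every packet of a given flow hashes to the same index, so a session keeps reaching the same server for as long as the set of equal-cost next hops is unchanged. With a simple modulo like this one, adding or removing a server remaps many existing flows.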

If you are running complex web services (think expensive per-server software licences, etc.) then the investment in a pair of redundant load balancers for the front end, to give more consistent performance under load as well as resilience, can look very sane indeed.
I've deployed services behind Foundry layer-4/layer-7/content/SLB/buzzword-du-jour switches before, and they worked very well; from the brief time I spent with them, they seemed well-designed and feature-rich.

However, the Foundry switches still suffered from the (near) single point of failure problem. It only takes one person messing up the switch config whilst modifying a service or adding a new one, or one firmware upgrade that goes bad, and you lose all your services at once.

As Paul mentioned, the advantage of using local-scope anycast with an IGP to build a cluster is that there are no additional components, and hence no additional points of failure.


Joe