North American Network Operators Group

Date Prev | Date Next | Date Index | Thread Index | Author Index | Historical

Re: ultradns reachability

  • From: Joe Abley
  • Date: Fri Jul 02 10:26:21 2004

On 2 Jul 2004, at 00:18, Christopher L. Morrow wrote:

So, I thought of it like this:
1) Rodney/Centergate/UltraDNS knows where all their 35000billion copies of
the 2 .org TLD boxes are, what network pieces they are connected to at
which bandwidths and the current utilization
2) Rodney/Centergate/UltraDNS knows which boxes in each location (there
could be multiple inside each pod, right?) are running their dns process
and answering at which rates
3) Rodney/Centergate/UltraDNS knows when processes die and locally stop
pushing requests to said system inside the pod
4) Rodney/Centergate/UltraDNS knows when a pod is completely down (no
systmes responding inside the local pod) so they can stop routing the /24
from that pod's location

So, Rodney/Centergate/UltraDNS should know almost exactly when they have a
problem they can term 'critical'... I most probably left out some steps
above, like wedged proceseses or loss of outbound routing to prefixes
sending reqeusts. I'm sure Paul/ISC has a fairly complete list of failure
modes for anycast DNS services.
All the failure modes that ISC has seen with anycast nameserver instances can be avoided (for the authoritative DNS service as a whole) by including one or more non-anycast nameservers in the NS set.

This leaves the anycast servers providing all the optimisation that they are good for (local nameserver in toplogically distant networks; distributed DDoS traffic sink; reduced transaction RTT) and provides a fall-back in case of effective reachability problems for the anycast nameservers.

This is so trivial, I continue to be amazed that PIR hasn't done it.

The problem then becomes the "Hey, .org is dead!" From where is it dead?
What pod are you seeing it dead from? Is it routing TO the pod from you?
FROM the pod to you? The pod itself? Stuck/stale routing information
somewhere on the path(s)? This is very complex, or seems to be to me :(
With the fix above, the problem becomes "hey, *some* of the nameservers for ORG are dead! We should fix that, but since not *all* of them are dead, at least ORG still works."

I think more failure modes will be investigated before that comes :)
fortunately lots of people are already investigating these, eh?
I don't know about lots, but I know of a few. None of the people I know of are using an entire production TLD as their test-bed, however.


Joe