North American Network Operators Group

Re: .ORG problems this evening

  • From: Majdi S. Abbas
  • Date: Thu Sep 18 18:06:07 2003

On Thu, Sep 18, 2003 at 02:22:19PM -0400, Todd Vierling wrote:
> Sucks to be anyone trying to use the service whose routers pick those nodes
> as the only ones available.  That's the fault of the implementor, not the
> client.

	I have a sneaking suspicion that if UltraDNS's tld cluster that is
apparently located in Equinix-Ashburn stopped responding to queries for
two hours last night, a lot more people would have noticed.  A *lot* more
people.

	I think it's out of line to speculate, without any knowledge of
their configuration, on how UltraDNS has set up these clusters,
particularly on how reachability information is verified and propagated.

> The major issue here is that no *gTLD*, particularly one of the Big Three,
> should be subject to a SPOF -- even if it's only a regionally visible SPOF
> due to anycast selection.  It should *always* be possible to attempt queries
> to more than one physical location's servers for a gTLD.  Yet last night, I
> could not query .ORG from several different locations in the continental US,
> even though there were perfectly functional servers available (in the same
> country, no less).

	First it was two locations, one of which you can't tell us about
(Deep inside OSPF Area 51?) -- now it's several?  I've tried myself from 
many different hosts today, and they all route to different clusters.  I'm
having trouble finding even two geographically diverse hosts that
route to the same cluster.

> BGP errors happen (everyone here should be able to attest to that readily),
> and they did.  What's to stop some other boneheaded DoS or oversight from
> causing this again?  And again?

	Are you absolutely, positively sure this cluster was answering zero
queries, yet still announcing those two /24s?

> This particular outage was in the late evening in what appeared to be the
> affected area from my probing, which is why people like you don't appear to
> care; it "didn't affect you".  What about when it happens in the middle of
> the day in your neck of the woods?

	The reason for this is simple -- given the query volume a tld like
.org receives, and given just how "close" this cluster is to so many 
millions of users in the eastern US, the odds of you being the *only* 
person, even amongst the few thousand on this list, to notice a problem...
are incredibly slim.

	Since you won't tell us the addresses of these "several" hosts you
queried from, or exactly which queries you tried and how, it is
incredibly hard to look into.

	This is the equivalent of calling every fire department in the
nation and telling them that there is a fire, but refusing to tell them
where you are, or what you've witnessed.
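
	For illustration only, here is a minimal sketch of the detail that
makes such a report something others can act on: which server was
queried, what was asked, when, and what came back.  It assumes Python
with only the standard library; the names build_query and probe are
made up for this example, and any server address is for the reporter to
fill in.  It says nothing about how UltraDNS actually runs its clusters.

```python
import socket
import struct
import time


def build_query(name, qtype=6, qclass=1, query_id=0x1234):
    """Build a minimal DNS query in wire format (default: SOA, class IN)."""
    # Header: ID, flags (0 = standard query, no recursion), QDCOUNT=1, rest 0.
    header = struct.pack("!HHHHHH", query_id, 0, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, qclass)


def probe(server, name="org", timeout=3.0):
    """Send one SOA query over UDP and return the facts worth reporting."""
    report = {
        "when_utc": time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime()),
        "server": server,
        "query": name + " SOA",
    }
    started = time.monotonic()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(name), (server, 53))
            reply, _ = sock.recvfrom(4096)
        except socket.timeout:
            report["result"] = "timeout after %.1fs" % timeout
            return report
    report["rtt_ms"] = round((time.monotonic() - started) * 1000, 1)
    # The low four bits of the fourth header byte carry the response code.
    report["result"] = "rcode %d" % (reply[3] & 0x0F)
    return report
```

	Calling probe() with the address of the server under suspicion
returns a small dict that can be pasted into a message.  The sketch does
not check that the reply ID matches the query, and it does not record
the source address or the path taken; a real report would add both,
along with a traceroute from each vantage point.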

> Uh-huh.  Quite a few people here know better; they also know I am surrounded
> by <cloak/> on this list and others.  If my public resume were up to date
> and filled in more detail, you'd know otherwise.  Don't try to speak for my
> experience from your pedestal when you don't have the information to make
> that kind of baseless judgment.

> On the other hand, if you can't see the fatal flaw in a major Internet
> infrastructure service depending on a single point of failure, I can point
> you at a few books that could enlighten you.

	It isn't a single point of failure, but even if it were, I can
assure you that the collective experience of this list would fill quite
a few more volumes than you are capable of referring us to.

	You ask that we make no assumptions as to your experience --
grant us the same courtesy.

	--msa