North American Network Operators Group


Re: Journal of Internet Disasters

  • From: Paul Vixie
  • Date: Fri Nov 13 12:28:39 1998

> This should go on the name-droppers list, but here goes....

these days it's not clear whether namedroppers is an operations list
or a protocol list or still both.  i think nanog is a fine forum for this:

> What do we know about the events with the name servers
> 
>    - f.root-servers.net was not able to transfer a copy of some of
> 	the zone files from a.root-servers.net
>    - f.root-servers.net became lame for some zones

just COM.

>    - tcpdump showed odd AXFR from a.root-servers.net

just a lot of missed/retransmitted ACKs.

>    - [fjk].gtld-servers.net have been reported answering NXDOMAIN to
> 	some valid domains, NSI denies any problem

the nanog archives include some dig results that are hard for NSI to deny.
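
for anyone comparing servers the same way: the fixed header of each response carries the rcode and the AA bit that dig prints, so an NXDOMAIN for a known-good name, or a non-authoritative answer from a listed server, both stand out. a minimal python sketch, assuming you already have the raw response bytes in hand. the function names are made up for illustration, and the "possibly lame" label is a rough heuristic, not a definition.

```python
import struct

# RCODE values from RFC 1035, section 4.1.1
RCODES = {0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL",
          3: "NXDOMAIN", 4: "NOTIMP", 5: "REFUSED"}

def parse_header(msg: bytes) -> dict:
    """Decode the fixed 12-byte DNS message header."""
    if len(msg) < 12:
        raise ValueError("short DNS message")
    ident, flags, qd, an, ns, ar = struct.unpack("!6H", msg[:12])
    rcode = flags & 0x000F
    return {
        "id": ident,
        "qr": bool(flags & 0x8000),   # 1 = response
        "aa": bool(flags & 0x0400),   # authoritative answer
        "rcode": RCODES.get(rcode, str(rcode)),
        "ancount": an,
    }

def classify(msg: bytes) -> str:
    """Hypothetical helper: summarize what a server said about a name."""
    h = parse_header(msg)
    if not h["qr"]:
        return "not a response"
    if h["rcode"] == "NXDOMAIN" and h["aa"]:
        return "authoritative NXDOMAIN"
    if h["rcode"] == "NOERROR" and not h["aa"]:
        # a server listed in the NS set should set AA for its own zones
        return "non-authoritative (possibly lame)"
    return h["rcode"]
```

running classify() over the answers from each listed server for the same name makes a disagreement between them easy to see.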

> Other events which may or may not have been related
>     - BGP routing bug disrupted connectivity for some backbones in the
> 	preceding days

this turned up a performance problem in BIND's retry code, btw, but was
not otherwise related to the COM lossage of yesterday (as far as i know).

>     - Last month the .GOV domain was missing on a.root-servers.net due
> 	to a 'known bug' affecting zone transfers from GOV-NIC

different bug.  that one causes truncated zone transfers; the secondary
zone files on [fjk].gtld-servers.net yesterday were not truncated and it
just took a restart to make them stop behaving badly.

>     - Someone has been probing DNS ports for an unknown reason
> 
> Things I don't know
>     - f.root-servers.net and NSI's servers reacted differently.  What
> 	are the differences between them (BIND versions, in-house source
> 	code changes, operating systems/run-time libraries/compilers)

they are completely different systems (solaris vs. digital unix) running
the same (unmodified) bind 8.1.2 sources, which had completely different
failure modes for completely different reasons.

>     - how long were servers unable to transfer the zone?  The SOA says
> 	a zone is good for 7 days.  Why did they expire/corrupt the old zone
> 	before getting a new copy?

damn good question.  i'll look into that.  shouldn't've happened.
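
for reference, the behaviour the question expects follows from the SOA timers in RFC 1035: a secondary keeps answering from its last good copy until EXPIRE seconds have passed since the last successful refresh, retrying at the RETRY interval in the meantime. a minimal sketch of that rule, not BIND's actual code. the class and function names are hypothetical, and only the 7-day expire value comes from this thread.

```python
from dataclasses import dataclass

@dataclass
class SoaTimers:
    refresh: int  # seconds between routine refresh checks
    retry: int    # seconds to wait after a failed refresh
    expire: int   # seconds a copy stays valid without a successful refresh

def zone_is_servable(now: int, last_success: int, soa: SoaTimers) -> bool:
    """The old copy stays good until EXPIRE has elapsed since the last
    successful refresh, however many transfers fail in between."""
    return (now - last_success) < soa.expire

def next_attempt(last_attempt: int, last_attempt_ok: bool, soa: SoaTimers) -> int:
    """After a success wait REFRESH; after a failure wait only RETRY."""
    return last_attempt + (soa.refresh if last_attempt_ok else soa.retry)

# 7 days of expire, as quoted above; refresh and retry values are made up
DAY = 86400
com = SoaTimers(refresh=3 * 3600, retry=3600, expire=7 * DAY)
```

under these timers a secondary that last refreshed two days ago and has failed every transfer since should still be answering from its old copy.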

>     - Routing between ISC and NSI for the preceding period before the
> 	problem was discovered

there was asymmetry (they reached me via bbnplanet, i reached them via
alternet).  they are now preferring alternet to reach me, so we have
better path symmetry now.  but their first mile is still congested and
i am still retransmitting a lot of ACKs.

> Theories
>     - Network connectivity was insufficient between NSI and ISC for long
> 	enough the zones timed out (why were other servers affected?)

other servers are more conservative, and had switched to manual daily FTP
of the COM zone well before F did.  (with manual daily FTP you
get the advantages of gzip, and of the pretense of "zone master" status
while you manually retry after timeouts.  AXFR needs those properties.)
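
the two properties named above (compression, and retrying after a timeout while the old copy stays in service) can be sketched in a few lines. this is only an illustration under stated assumptions: `fetch` is a hypothetical callable standing in for whatever actually pulls the gzip'd zone file, and the retry count and delay are arbitrary.

```python
import gzip
import time

def fetch_zone(fetch, attempts=5, delay=0.0):
    """Call fetch() until it yields a usable gzip'd zone, then unpack it.

    fetch: hypothetical callable returning gzip-compressed bytes and
    raising OSError (TimeoutError included) when the transfer fails.
    The caller keeps serving its current zone file until this returns.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            # a corrupt download raises gzip.BadGzipFile, which is also
            # an OSError, so it is retried like a timeout
            return gzip.decompress(fetch())
        except OSError as err:
            last_error = err
            time.sleep(delay * (2 ** attempt))  # back off before the next try
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error
```

the point of the design is that a failed attempt costs nothing: the new zone replaces the old one only once a complete, intact copy has arrived.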

>     - Bug in BIND (or an in-house modified version) (why did vixie's and
> 	NSI's servers return different responses?)

there's definitely a bug in BIND if [fjk].gtld-servers.net were able to
return different answers after restarts with no new zone transfers.  (i'm
sitting here wishing i had core dumps.)

>     - Bug in a support system (O/S, RTL, Compiler, etc) or its installation
>     - Operator error (erroneous reports of failure)
>     - Other malicious activity?

i think there were a goodly number of procedural errors.
-- 
Paul Vixie <[email protected]>