North American Network Operators Group
Re: Journal of Internet Disasters
> This should go on the name-droppers list, but here goes....

these days it's not clear whether namedroppers is an operations list or a
protocol list or still both.  i think nanog is a fine forum for this:

> What do we know about the events with the name servers
>
>   - f.root-servers.net was not able to transfer a copy of some of
>     the zone files from a.root-servers.net
>   - f.root-servers.net became lame for some zones

just COM.

>   - tcpdump showed odd AXFR from a.root-servers.net

just a lot of missed/retransmitted ACKs.

>   - [fjk].gtld-servers.net have been reported answering NXDOMAIN to
>     some valid domains, NSI denies any problem

the nanog archives include some dig results that are hard for NSI to deny.

> Other events which may or may not have been related
>   - BGP routing bug disrupted connectivity for some backbones in the
>     preceding days

this turned up a performance problem in BIND's retry code, btw, but was
not otherwise related to the COM lossage of yesterday (as far as i know).

>   - Last month the .GOV domain was missing on a.root-servers.net due
>     to a 'known bug' affecting zone transfers from GOV-NIC

different bug.  that one causes truncated zone transfers; the secondary
zone files on [fjk].gtld-servers.net yesterday were not truncated and it
just took a restart to make them stop behaving badly.

>   - Someone has been probing DNS ports for an unknown reason
>
> Things I don't know
>   - f.root-servers.net and NSI's servers reacted differently.  What
>     are the differences between them (BIND versions, in-house source
>     code changes, operating systems/run-time libraries/compilers)

they are completely different systems (solaris vs. digital unix) running
the same (unmodified) bind 8.1.2 sources, which had completely different
failure modes for completely different reasons.

>   - how long were servers unable to transfer the zone?  The SOA says
>     a zone is good for 7 days.  Why they expire/corrupt the old zone
>     before getting a new copy?

damn good question.  i'll look into that.  shouldn't've happened.

>   - Routing between ISC and NSI for the preceding period before the
>     problem was discovered

there was asymmetry (they reached me via bbnplanet, i reached them via
alternet).  they are now preferring alternet to reach me, so we have
better path symmetry now.  but their first mile is still congested and
i am still retransmitting a lot of ACKs.

> Theories
>   - Network connectivity was insufficient between NSI and ISC for long
>     enough the zones timed out (why were other servers affected?)

other servers are more conservative, and had switched to manual daily FTP
of the COM zone longer ago than F has done.  (with manual daily FTP you
get the advantages of gzip, and of the pretense of "zone master" status
while you manually retry after timeouts.  AXFR needs those properties.)

>   - Bug in BIND (or an in-house modified version) (why did vixie's and
>     NSI's servers return different responses?)

there's definitely a bug in BIND if [fjk].gtld-servers.net were able to
return different answers after restarts with no new zone transfers.
(i'm sitting here wishing i had core dumps.)

>   - Bug in a support system (O/S, RTL, Compiler, etc) or its installation
>   - Operator error (erroneous reports of failure)
>   - Other malicious activity?

i think there were a goodly number of procedural errors.

--
Paul Vixie <[email protected]>
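
On the question above about the SOA saying a zone is good for 7 days: the sketch below shows how a secondary is generally expected to treat the SOA REFRESH and EXPIRE timers, as described in RFC 1034/1035. It is an illustration only, not BIND's actual code. The names `SoaTimers` and `zone_state` are hypothetical, and the refresh and retry values are assumed for the example; only the 7-day expire figure comes from the thread.

```python
from dataclasses import dataclass

DAY = 86400  # seconds


@dataclass
class SoaTimers:
    """Timer fields from a zone's SOA record, in seconds."""
    refresh: int  # how often the secondary checks the master's serial
    retry: int    # how long to wait between attempts after a failed check
    expire: int   # how long old data may be served without a successful check


def zone_state(timers: SoaTimers, seconds_since_last_success: int) -> str:
    """Classify a secondary's copy of a zone (hypothetical helper).

    'fresh'    : no check is due yet
    'retrying' : a check is due or has been failing; old data is still served
    'expired'  : no successful contact for EXPIRE seconds; stop answering
    """
    if seconds_since_last_success >= timers.expire:
        return "expired"
    if seconds_since_last_success >= timers.refresh:
        return "retrying"
    return "fresh"


# Assumed values for illustration; only expire = 7 days is from the thread.
timers = SoaTimers(refresh=3 * 3600, retry=3600, expire=7 * DAY)

for elapsed in (0, 2 * DAY, 7 * DAY):
    print(elapsed, zone_state(timers, elapsed))
```

Under this model a secondary that has been failing to transfer for two days should still be answering from its old copy, which is why an early expiry or corruption of the old zone looks like a bug rather than expected behavior.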