North American Network Operators Group

Journal of Internet Disasters

  • From: Sean Donelan
  • Date: Fri Nov 13 10:44:21 1998

Eric M. Carroll writes:
>Do we actually need the cooperation of the organizations in question to
>effect this?

Yes and no.  It would be fairly 'easy' to become an editor, start Donelan's
Journal Of Internet Disasters, get a number of noted experts to contribute
articles analyzing failures with no cooperation from the organizations.  But
I can predict what the organizations in question would say about such an
endeavor:

   1) Donelan is engaging in FUD to sell his journal.
   2) They are making rash assumptions without knowing all the facts.
   3) You know the Internet, you can't please everyone.  It's just a
	small group of people with an axe to grind.
   4) It didn't happen.  If it did happen, it was minor.  If it wasn't
	minor, not many people were affected.  If many people were
	affected, it wasn't as bad as they said.  If it was that bad,
	we would have known about it.  Besides we fixed it, and it
	isn't a problem (anymore).

Sure, sometimes a problem breaks through to the public even when the
company tries all those things.  Just ask Intel's PR department about
their handling of the Pentium math bug.  But that is relatively rare,
and not really the most efficient way to handle problems.

>For large enough failures, the results are obvious and the data is fairly
>clear. Perhaps a first stage of a Disruption Analysis Working Group would
>simply be for a coordinated group to gather the facts, sort through the
>impact, analyze the failure and report recommendations in a public forum. 

I'm going to get pedantic.  The results may be obvious, but the cause isn't.
I would assert there are a number of large failures where the initial obvious
cause has turned out to be wrong (or only a contributing factor).  Was the
triggering fault for the western power grid failure last year a terrorist
bombing or a tree growing too close to a high tension line?  From
just the results you can't tell the cause.  It may have been possible for
an outside group, with no cooperation from the power companies, to have
discovered the blackened tree on the utility right of way.  But without the
utility's logs and access to their data, I think it would have been very
difficult for an outside group to analyze the failure.  In particular I
think it would have been close to impossible for an outside group to
find the other contributing factors.

This should go on the name-droppers list, but here goes....

What do we know about the events with the name servers?

   - f.root-servers.net was not able to transfer a copy of some of
	the zone files from a.root-servers.net
   - f.root-servers.net became lame for some zones
   - tcpdump showed odd AXFR from a.root-servers.net
   - [fjk].gtld-servers.net have been reported answering NXDOMAIN to
	some valid domains, NSI denies any problem
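
The "lame" and NXDOMAIN observations above both come down to bits in the
DNS response header, so a rough sketch of that check may help.  This is
only an illustration, not the tooling anyone used at the time: the
function names build_soa_query and classify_response are made up here,
the flag layout is the one in RFC 1035, and treating any non-authoritative
answer from a listed server as "lame" is a simplifying assumption.

```python
import struct

# Header flag masks from RFC 1035, section 4.1.1.
AA_MASK = 0x0400     # Authoritative Answer bit
RCODE_MASK = 0x000F  # response code, low four bits
NXDOMAIN = 3         # "name error"


def build_soa_query(zone, query_id=0x1234):
    """Build a minimal non-recursive SOA query for `zone` in wire format.

    The query_id default is an arbitrary illustrative value.
    """
    # ID, flags (all zero: plain query, RD off), QDCOUNT=1, other counts 0.
    header = struct.pack("!HHHHHH", query_id, 0x0000, 1, 0, 0, 0)
    # Encode each label as <length><bytes>, then the zero-length root label.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in zone.rstrip(".").split(".")
        if label
    ) + b"\x00"
    question = qname + struct.pack("!HH", 6, 1)  # QTYPE=SOA, QCLASS=IN
    return header + question


def classify_response(packet):
    """Classify a raw DNS response by looking only at its header flags."""
    if len(packet) < 12:
        raise ValueError("truncated DNS header")
    _id, flags, _qd, _an, _ns, _ar = struct.unpack("!HHHHHH", packet[:12])
    if flags & RCODE_MASK == NXDOMAIN:
        return "nxdomain"
    if not flags & AA_MASK:
        # Assumption: the server is listed as an NS for the zone, so a
        # non-authoritative answer means it is lame for that zone.
        return "lame"
    return "authoritative"
```

Here packet would be the bytes read back from a server after sending it
the output of build_soa_query over UDP port 53.  Running the same query
against each listed server and comparing the classifications is one way
an outside group could document what it saw without anyone's cooperation.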

Other events which may or may not have been related
    - BGP routing bug disrupted connectivity for some backbones in the
	preceding days
    - Last month the .GOV domain was missing on a.root-servers.net due
	to a 'known bug' affecting zone transfers from GOV-NIC
    - Someone has been probing DNS ports for an unknown reason

Things I don't know
    - f.root-servers.net and NSI's servers reacted differently.  What
	are the differences between them (BIND versions, in-house source
	code changes, operating systems/run-time libraries/compilers)
    - how long were servers unable to transfer the zone?  The SOA says
	a zone is good for 7 days.  Why did they expire/corrupt the old
	zone before getting a new copy?
    - Routing between ISC and NSI during the period before the
	problem was discovered
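
To make the 7-day question concrete, here is a minimal sketch of how the
SOA EXPIRE timer is supposed to behave on a secondary, as I read RFC 1034
section 4.3.5.  It is not BIND's code; SecondaryZone, may_still_serve and
on_transfer_attempt are hypothetical names, and the only number taken from
the post is the 7-day expire value.

```python
from dataclasses import dataclass


@dataclass
class SecondaryZone:
    serial: int                   # serial of the copy currently held
    last_good_refresh: float      # time of the last successful refresh
    expire: int = 7 * 24 * 3600   # SOA EXPIRE in seconds (7 days)


def may_still_serve(zone, now):
    """The old copy stays usable until EXPIRE seconds pass without a refresh."""
    return now - zone.last_good_refresh < zone.expire


def on_transfer_attempt(zone, now, new_serial=None):
    """Handle one transfer attempt; new_serial=None means it failed.

    A failed attempt leaves the existing data untouched.  Only a successful
    transfer replaces the serial and restarts the expire timer.
    """
    if new_serial is not None:
        zone.serial = new_serial
        zone.last_good_refresh = now
    return may_still_serve(zone, now)
```

Under this model a secondary that cannot reach its master keeps answering
from the old copy for the full expire interval, which is why a zone going
bad sooner than that points at something other than a plain timeout.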

Theories
    - Network connectivity was insufficient between NSI and ISC for long
	enough that the zones timed out (why were other servers affected?)
    - Bug in BIND (or an in-house modified version) (why did Vixie's and
	NSI's servers return different responses?)
    - Bug in a support system (O/S, RTL, Compiler, etc) or its installation
    - Operator error (erroneous reports of failure)
    - Other malicious activity?

-- 
Sean Donelan, Data Research Associates, Inc, St. Louis, MO
  Affiliation given for identification not representation