North American Network Operators Group

Re: Katrina Network Damage Report

  • From: Todd Underwood
  • Date: Sat Sep 10 15:05:20 2005

interesting discussion.  at least we're talking about networking now.
:-)

wrt sean's comment, the only thing i can think he means by 'partition'
is that the networks may have power and may be in some routing table,
just not the routing table of any of renesys's (or routeviews' or ripe's)
peers.  in that case, i guess i would agree.  our use of 'outage' is a
special case of 'partition' where the whole internet is on one side
and it's possible that the networks in question are on the other.
they may route somewhere.  just not to the internet.

quick question below...

> There are some inconsistent terms used in computer
> dependability research, but I prefer and use two
> key definitions: failure (something is offline)
> and outage (customer sees the service offline).

not sure i understand these definitions.  i'm happy to use any
well-defined terms (vocabulary never being worth fighting over).
again, when i use 'outage' i mean:  previously in global internet
tables of a consensus of a large peerset and now removed from those
tables.  which is that in your terms?
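
to make that definition concrete, here is a minimal sketch in python.
everything in it is an assumption for illustration, not a description
of how renesys actually computes outages: the names (PeerTables,
consensus_prefixes, outages), the 0.9 threshold, and the choice to
treat 'removed' as 'no peer in the set carries it any more' are all
hypothetical.

```python
from typing import Dict, Set

# peer name -> set of prefixes that peer currently carries (assumed shape)
PeerTables = Dict[str, Set[str]]


def consensus_prefixes(tables: PeerTables, min_fraction: float = 0.9) -> Set[str]:
    """Prefixes carried by at least min_fraction of the peers."""
    if not tables:
        return set()
    counts: Dict[str, int] = {}
    for prefixes in tables.values():
        for prefix in prefixes:
            counts[prefix] = counts.get(prefix, 0) + 1
    needed = min_fraction * len(tables)
    return {prefix for prefix, n in counts.items() if n >= needed}


def outages(before: PeerTables, after: PeerTables,
            min_fraction: float = 0.9) -> Set[str]:
    """Prefixes in the consensus table before that no peer carries after."""
    still_seen: Set[str] = set().union(*after.values())
    return consensus_prefixes(before, min_fraction) - still_seen
```

a prefix that only a minority of peers ever carried is never counted
as an outage here, because it was not in the consensus table to begin
with.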

> Looking at the routing tables you see failures.

not necessarily, if i'm understanding your definitions (which i guess
i'm not).  

> If a prefix goes away completely and utterly,
> and is truly unreachable, then anyone trying to
> see it is going to see an outage.  But you can
> have a lot of intermediate cases where routes are
> mostly down but not completely, or where parts
> of the net can see it but other parts can't
> due to the vagaries of route propagation
> and partial failures.

yes.  we cover all of these by having a large peerset and integrating
our data across them.  the outages that we report are not from a
particular point on the net.  they are from a consensus of a large,
selected peerset.
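
a rough sketch of what 'integrating across a peerset' could look like
for the intermediate cases, again in python.  the function names and
the 0.1 / 0.9 cutoffs are made up for the example; they are not our
real thresholds.

```python
from typing import Dict, Set


def visibility(tables: Dict[str, Set[str]], prefix: str) -> float:
    """Fraction of peers in the set that currently carry the prefix."""
    if not tables:
        return 0.0
    seen = sum(1 for prefixes in tables.values() if prefix in prefixes)
    return seen / len(tables)


def classify(fraction: float, low: float = 0.1, high: float = 0.9) -> str:
    """Bucket a visibility fraction; the cutoffs are illustrative only."""
    if fraction >= high:
        return "visible"
    if fraction <= low:
        return "withdrawn"
    return "partial"
```

the point is that no single peer's view decides the label.  a prefix
that half the peers still carry lands in 'partial' rather than being
reported as either up or down.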

> And there are situations where the route is
> down but the service is still up.

unless you use words differently, this is not true.  by 'service' i
mean 'IP service'.  if the route is down, no one can reach anything
associated with that route, obviously.  do you mean 'service' as local
loop service? 

> There are other network monitoring groups
> that do end to end connectivity tests from
> geographically distributed clients out to
> sample systems around the net.  Some for research
> and some for hire for network monitoring.
> 
> I think what they do is much closer to
> identifying true outages than your method.

yes, that may be.  those are good ways of identifying certain kinds of
outages.  the problem is that they only measure what they measure.
frequently these systems measure well-connected sites monitoring
well-connected sites.  this creates a bias in the data, tending to
suggest that no big event ever really impacts the internet.  this is
obviously a false conclusion.

for reference compare the analysis of the 2003 US blackouts from
keynote:

http://www.keynote.com/news_events/releases_2003/03august14.html  

(summary:  nothing to see here, move along)

with those from renesys:

http://www.renesys.com//resource_library/blackout_results.html

(summary:  >4K prefixes disappeared from the global table impacting
connectivity to hospitals, schools, government and lots of
businesses).  

i would agree that our method of routing table analysis has
significant limitations and needs to be combined with other data.  but
it's a fantastic way of showing a lower bound on what was affected:
prefixes without entries in the global table almost certainly have no
service.

t.

-- 
_____________________________________________________________________
todd underwood
director of operations & security
renesys - interdomain intelligence
[email protected]   www.renesys.com