Re: Resilience: faults, causes, statistics, open issues

North American Network Operators Group

From: David Andersen
Date: Thu Jan 27 11:38:58 2005

On Jan 27, 2005, at 6:39 AM, András Császár (IJ/ETH) wrote:

Hi people!

I've begun research on carrier-grade (aka telecom-grade) resiliency in IP transport networks. The first step is to collect possible failure events, their causes and consequences, and statistics on downtime (mean time to repair) and mean time between failures. I would also like to identify which problems are most typical (HW bug, SW bug, cable cut, cable unplugged (link going down), severe misconfiguration).

I think this is the perfect forum to get some feedback from real network-operational experience.

Is there anyone out there who has statistics or documents that would help me in any way?

This is self-serving, but see the intro and related work sections of my thesis (we'll have a conference paper version of it done soon for NSDI, but we're still revising it; apologies for not having a shorter reference to give you):

http://nms.lcs.mit.edu/papers/index.php?detail=113

It doesn't focus specifically on carrier failures, but it has a batch of references that might get you started on what the academic side knows. I've also got some refs in there to some of the earlier telco studies, which I recommend taking a peek at. Again, the relation to year 2005 ISP failures isn't totally clear, but it's a starting point.

Unfortunately, the reality is that we don't actually know all that much as far as what's _really_ happening! Nick Feamster and I took a look at some of the BGP routing failures (but didn't get back to root causes):

http://nms.lcs.mit.edu/papers/index.php?detail=23

Nick's also done some work on configuration management and building a better routing protocol that's somewhat related to your question.

Ratul Mahajan examined BGP configuration errors - but it's not clear exactly what fraction of failures or downtime is really due to those errors:

http://www.cs.washington.edu/homes/ratul/bgp/index.html

David Oppenheimer studied failures at a few edge companies (app. service providers, hosting providers, etc.). Has a nice breakdown of failure causes and durations, but it's not clear if those numbers directly translate to the carrier realm:

http://roc.cs.berkeley.edu/papers/usits03.pdf

Finally, google back for some of Sean Donelan's NANOG posts. You'll get some good individual cases from those, though the last time I looked, I didn't find a big overall analysis.

Also, do you have any suggestions on open research issues to be solved in the area?

Most of it. :) I (and probably others on this list) would be interested in what you find.

-Dave
