North American Network Operators Group

Ph.D. student looking for data on network failure causes

  • From: David L. Oppenheimer
  • Date: Mon Apr 08 19:25:53 2002

Hello network operators! I'm a Ph.D. student at UC Berkeley working for Dave
Patterson on the ROC project, which is investigating techniques for
improving the availability and manageability of large-scale Internet
services and systems. I'm currently conducting a study of the root causes
(hardware, software, human, etc.) and durations of failures in such systems.
To do this, I have been examining the operations trouble ticket databases
from several large-scale Internet services (of the Hotmail, eBay, Yahoo!,
etc. type).

In doing this research, it has become apparent that for many services
(especially geographically distributed ones, e.g. those that use multiple
colocation facilities), a major cause of problems is failures of various
types in the Internet. Thus I've become interested in finding out the types
and root causes of problems in wide-area networks, e.g. within the kinds of
large-scale ASes that are administered by the folks on this list. I'm not
sure how your services track failures and problems; the problem tracking
databases at the services I've examined have been a great source of data
about problem scale, symptoms, root causes, durations, steps (and missteps!)
taken in diagnosing and fixing problems, etc.

I'm writing to the list because I'm very interested in working with network
operators to study the causes of failures in large networks. I realize this
type of data is very sensitive to your organizations. I would be happy to
talk offline with anyone who is interested in the possibility of sharing
data, about how I've overcome the multitude of objections that have been
raised by folks I have solicited for data (protecting their customers'
privacy, securing datasets when they are not examined on the premises of the
services, anonymizing and aggregating data in reporting, etc.). I'm
interested in the relative causes of failures, *not* overall availability
numbers. As a result of the precautions we've taken, several household-name
Internet services have allowed me to examine and report on the problems
their services have experienced.

If you're interested in discussing the possibility of sharing access to this
kind of data about your service, please contact me. I'm willing to examine
data on the premises of your service, to anonymize it fully, to submit any
results I want to publish to your organization prior to publication, to sign
any necessary NDAs, etc. In return, I'm happy to share with you any insights
I have about the problems your service experiences, and you'll contribute to
the world's knowledge of why bad things happen to good networks. :-)

If you're not the right person in your organization to contact with this
request, but you think your organization might be interested in
participating in this study, perhaps you could forward this email to the
appropriate person or let me know who the right person to contact in your
organization would be.

Many thanks!

David