NANOG Meeting Presentation Abstract
Root Cause Analysis of BGP Routing Dynamics
Meeting: NANOG30
Date / Time: 2004-02-10 1:50pm - 2:10pm
Room: Symphony Ballroom II - IV
Presenters: Matthew C. Caesar, UC Berkeley; L. Subramanian, UC Berkeley; R.H. Katz, UC Berkeley
Matthew Caesar is a graduate student in Computer Science at the University of California, Berkeley. His current work focuses on improving the properties of interdomain routing. He has previously worked on IP telephony and fast restoration from failures.
Abstract:
I. Introduction
Internet routing is plagued with several problems today, including chronic instability, convergence problems, and misconfigurations of routers [2]. We believe that a first step towards making BGP robust to these dynamics is to develop a systematic methodology for analyzing routing changes and inferring why they happen and where they originate. Answers to these questions can provide useful insights into the sources of anomalous routing events and instabilities.
We are working towards development of a BGP health inferencing system for determining the root cause of routing changes. The health inferencing system collects and correlates route updates from multiple vantage points to determine the routing events that trigger each route update. We envision deploying our inference algorithms in data collection centers such as Routeviews [3] and RIPE, which receive streams of route updates from multiple vantage points (views). More generally, we can use a BGP health monitor to continuously infer the state of the network. Such inferences may then be used: (a) offline for network performance monitoring and troubleshooting; or (b) online to improve path selection and damping of instability.
II. Inference techniques
II.a Turbulent vs. quiescent periods
The rate at which prefixes get updated signifies the type(s) of event that caused the stream of updates. In a turbulent period, one or a few major routing events cause several routes to simultaneously get updated. We assume that many observations in such a period are correlated (i.e., arise from the same routing event). In a quiescent period, when very few prefixes are updated, it is harder to determine which updates are caused by the same routing event. In this case, we analyze updates to each prefix in isolation.
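The distinction between the two kinds of period can be sketched as a count of distinct prefixes updated per time window. This is only an illustration: the function name `label_periods`, the 60-second window, and the cutoff of 1000 prefixes are assumptions made for the example and are not values taken from our system.

```python
from collections import defaultdict


def label_periods(updates, window=60, turbulent_min=1000):
    """Label each fixed-size time window as turbulent or quiescent.

    updates: iterable of (timestamp_seconds, prefix) pairs from one view.
    window and turbulent_min are hypothetical tuning knobs.
    """
    buckets = defaultdict(set)
    for ts, prefix in updates:
        # Collect the distinct prefixes updated in each window.
        buckets[int(ts // window)].add(prefix)
    return {
        b * window: "turbulent" if len(prefixes) >= turbulent_min else "quiescent"
        for b, prefixes in sorted(buckets.items())
    }


# A burst of 2000 distinct prefixes within about 20 seconds,
# followed later by two isolated updates (documentation prefixes).
burst = [(5 + k * 0.01, f"10.{k // 256}.{k % 256}.0/24") for k in range(2000)]
quiet = [(130, "192.0.2.0/24"), (150, "198.51.100.0/24")]
labels = label_periods(burst + quiet)
print(labels)  # {0: 'turbulent', 120: 'quiescent'}
```

Updates in windows labelled turbulent would then be analyzed jointly, while updates in quiescent windows would be analyzed per prefix.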
For example, suppose at view V we have 2000 prefixes which all use AS Path [V,A,B,C], and 2000 prefixes which use AS Path [V,A,B,D]. Suppose within a short period of time we observe updates to 2000 prefixes traversing the inter-AS link (B,C), and suppose very few other prefixes are updated. Then, it is likely that a single major event took place at (B,C). On the other hand, suppose we instead observed updates to 4000 prefixes that traverse (A,B), and of these prefixes, 2000 traverse (B,C) and 2000 traverse (B,D). Then, the major event likely took place on the link (A,B).
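The reasoning in this example can be sketched in a few lines of Python. The sketch is a simplification under stated assumptions, not our inference algorithm: it picks the link whose prefixes were (almost) all updated and that carries the most updated prefixes. The names `localize` and `min_fraction`, the 0.95 cutoff, the placeholder prefix labels, and the 500 extra prefixes routed via a hypothetical AS G (added so that link (V,A) also carries unaffected routes) are all assumptions of the example.

```python
from collections import Counter


def links(as_path):
    """Return the inter-AS links (adjacent AS pairs) along an AS path."""
    return list(zip(as_path, as_path[1:]))


def localize(routes, updated, min_fraction=0.95):
    """Guess which link a burst of updates originated on.

    routes: dict mapping prefix -> AS path (tuple) as seen at one view.
    updated: set of prefixes updated during the turbulent period.
    min_fraction: hypothetical cutoff on the share of a link's prefixes
    that must have been updated for the link to be a candidate.
    """
    carried = Counter()  # prefixes routed over each link
    hit = Counter()      # updated prefixes routed over each link
    for prefix, path in routes.items():
        for link in links(path):
            carried[link] += 1
            if prefix in updated:
                hit[link] += 1
    candidates = [l for l in hit if hit[l] / carried[l] >= min_fraction]
    if not candidates:
        return None
    # Prefer the candidate that explains the most updated prefixes.
    return max(candidates, key=lambda l: hit[l])


# Placeholder prefix labels stand in for real prefixes.
via_c = {f"c{k}": ("V", "A", "B", "C") for k in range(2000)}
via_d = {f"d{k}": ("V", "A", "B", "D") for k in range(2000)}
via_g = {f"g{k}": ("V", "A", "G") for k in range(500)}  # assumption, see above
routes = {**via_c, **via_d, **via_g}

print(localize(routes, set(via_c)))               # ('B', 'C')
print(localize(routes, set(via_c) | set(via_d)))  # ('A', 'B')
```

Without the unaffected routes via G, links (V,A) and (A,B) would be indistinguishable from this single view, which is one reason to combine several views.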
II.b Matching causes with observations
For every potential cause of a routing event, there exist different patterns of route updates that can be observed at a vantage point. Based on the pattern of observations, we classify the causes into equivalence classes, where each class contains different causes that might trigger the same pattern of updates. While Griffin et al. [1] have shown that matching causes with observations is a hard problem, we find that certain patterns of updates (e.g., presence of route withdrawals) can help in narrowing down the set of possible causes.
For example, suppose view V's routing table contains an AS path [V,A,B,C,D] to a prefix X, and assume for simplicity that ASes are singly peered (we do not make this assumption in our design). Suppose after some time the path changes to [V,A,B,Y,C,D] and remains stable for some time. There are several possible events that could explain this change: perhaps Y advertised a lower MED to B, or perhaps the link (B,C) failed, or perhaps D changed a community attribute in the message, triggering a route change at B. However, certain events could not explain this change: a failure of link (B,Y), or Y advertising a higher MED to B, could not have caused this observation. In general, there are three possible explanations: either (1) some event happened on the path [B,C] to make it less desirable to B, (2) some event happened on the path [B,Y,C] to make it more desirable to B, or (3) some router on the path [C,D] changed a community attribute.
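The three explanations correspond to three segments of the old and new paths, which can be extracted mechanically. The following is a minimal sketch under the singly-peered simplification used in the example; the function name `split_change` is hypothetical and the code is illustrative rather than part of our system.

```python
def split_change(old, new):
    """Split a path change into the three segments discussed above.

    Returns (old_segment, new_segment, shared_tail):
      old_segment: part of the old path that was abandoned
      new_segment: part of the new path that replaced it
      shared_tail: downstream part common to both paths
    """
    if old == new:
        raise ValueError("paths are identical")
    # Length of the common prefix (ASes nearest the view).
    i = 0
    while i < min(len(old), len(new)) and old[i] == new[i]:
        i += 1
    if i == 0:
        raise ValueError("paths from one view must share the first AS")
    # Length of the common suffix, not overlapping the common prefix.
    j = 0
    while j < min(len(old), len(new)) - i and old[-1 - j] == new[-1 - j]:
        j += 1
    fork = i - 1  # last AS shared before the paths diverge
    old_segment = old[fork:len(old) - j + 1]
    new_segment = new[fork:len(new) - j + 1]
    shared_tail = old[len(old) - j:]
    return old_segment, new_segment, shared_tail


old_path = ("V", "A", "B", "C", "D")
new_path = ("V", "A", "B", "Y", "C", "D")
segments = split_change(old_path, new_path)
print(segments)  # (('B', 'C'), ('B', 'Y', 'C'), ('C', 'D'))
```

Each returned segment maps to one explanation class: the old segment became less desirable, the new segment became more desirable, or an attribute changed somewhere on the shared tail.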
II.c Multiple vantage points
Observing the same event from several vantage points allows us to acquire additional information about the event. By comparing similarities and differences in observations across the views, and by measuring the magnitudes of the event at each view, we can distinguish the signature of the event from effects introduced by intermediate routers along the path.
For example, suppose in the previous scenario another view V2 simultaneously observed a routing change to the same prefix X from AS Path [V2,F,G,C,D] to [V2,F,Y,C,D]. Then, it is most likely that a single event caused the observations at V and V2. Moreover, it is highly likely that the event took place at C, Y, or D, and not at B.
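One simple way to see why B is ruled out is to intersect the sets of candidate ASes implied by each view. The sketch below assumes that a single event caused both observations and treats every AS from the divergence point onward (on either the old or new path) as a candidate; the names `candidate_ases` and `correlate` are hypothetical, and the real inference also weighs the magnitude of the event at each view, which this sketch does not.

```python
def candidate_ases(old, new):
    """ASes that could host the event behind one observed path change.

    Candidates are all ASes from the last shared AS before the paths
    diverge to the origin, on either the old or the new path.
    """
    if old == new:
        raise ValueError("paths are identical")
    i = 0
    while i < min(len(old), len(new)) and old[i] == new[i]:
        i += 1
    if i == 0:
        raise ValueError("paths from one view must share the first AS")
    fork = i - 1
    return set(old[fork:]) | set(new[fork:])


def correlate(observations):
    """Intersect candidates across views, assuming one common event."""
    return set.intersection(*(candidate_ases(o, n) for o, n in observations))


observations = [
    (("V", "A", "B", "C", "D"), ("V", "A", "B", "Y", "C", "D")),  # view V
    (("V2", "F", "G", "C", "D"), ("V2", "F", "Y", "C", "D")),      # view V2
]
suspects = correlate(observations)
print(sorted(suspects))  # ['C', 'D', 'Y']
```

View V alone leaves B, C, D, and Y as candidates; adding V2 removes B because B does not appear on either of V2's paths.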
III. Validation and results
Most ISPs do not wish to reveal the types or frequency of events taking place in their networks, making validation of our approach difficult. However, there are several well-known major events that are public knowledge, such as the spread of Internet worms, or routing problems suffered by major ISPs. In addition, we know the location where certain classes of updates are caused, for example updates pertaining to prefixes originated by the AS containing the vantage point, or updates generated by BGP Beacons [4]. We considered a large number of such updates, and found that inference was performed correctly in every case. Although we aren't able to directly validate all of our inferences using this approach, we are able to verify the correctness of a base set of rules that we used to acquire our results.
To demonstrate the utility of such a system, we apply our inference methodology to updates collected from Routeviews and RIPE over a period of 18 months. We make several observations from our analysis:
We can pinpoint the location where the update was generated to a single pair of ASes for over 70% of updates. Additionally, we output a list of potential causes for each event, but may not always be able to identify the specific cause.
Our system can detect major routing anomalies, many of which were previously unknown.
We detected nearly 1,400 resets per month, and found certain inter-AS links to be perennially unstable.
Roughly 25% of prefixes continuously flap at least every 30 minutes, and these account for a large fraction (20%) of routing updates.
Routing events in the Internet core usually trigger short-term flaps, but an event taking place at the network edge is nine times more likely to cause a long-term route change.
Bibliography:
[1] T. Griffin, "What is the sound of one route flapping?," presentation made at the Network Modeling and Simulation Summer Workshop, 2002.
[2] C. Labovitz, A. Ahuja, F. Jahanian, "Experimental study of Internet stability and wide-area network failures," in Proc. of Fault Tolerant Computing Symposium, June 1999.
[3] "Route Views Project," http://www.routeviews.org.
[4] Z. Mao, R. Bush, T. Griffin, M. Roughan, "BGP beacons," in Proc. Internet Measurement Conference, October 2003.
[5] M. Caesar, L. Subramanian, R. Katz, "BGP health monitoring: realtime analysis," web site
Files: Matthew Caesar Presentation (PDF), Root Cause Analysis of BGP Routing Dynamics
Sponsors: None.