North American Network Operators Group
2006.06.05 NANOG-NOTES BGP tools BOF notes
(ok, last set of notes for tonight, and then it's off to bed for 90 minutes of sleep before heading back to the convention center. ^_^; --MNP)

2006.06.05 Welcome to the 4th BGP Tools BOF!
[slides are at http://www.nanog.org/mtg-0606/pdf/lixia-zhang.pdf]

Nick Feamster, Georgia Tech
Dan Massey, CSU
Mohit Lad and Lixia Zhang, UCLA

The Goal
 sharing some tools developed from our research efforts.
 hopefully will be useful for operations community.
 Also to collect input on new tools we would like to see so they can develop them.

Routing Configuration Checker
 Nick Feamster
O-BGP data organization tool
 Dan Massey
 [slides are at http://www.nanog.org/mtg-0606/pdf/dan-massey.pdf]
The Datapository
 by Nick Feamster
 [I'm sorry, that just sounds *far* too much like something you do *NOT* want your bedside nurse administering...--MNP]
Visualizing BGP dynamics using Link-Rank
 by Mohit Lad
Open discussions and demos

Nick Feamster
Network Troubleshooting: rcc and beyond

rcc: router configuration checker
 proactive routing configuration analysis
 idea: analyze configs before deployment
 many faults can be detected with static analysis.

rcc implementation.
http://nms.csail.mit.edu/rcc/
 preprocessor -> parser -> relational database (MySQL), constraints <-> verifier <-> faults
 verifier is a template checker and set of constraints your configs are checked against.

He's looking for GUI developers. very bare-bones command line right now.

Parsing configurations--shows some output.
He shows examples of the abilene configs, which are non anonymized.
show all routers peering with a given AS, can look at route maps in each direction, etc.

After running rcc on it, you get a web output which shows relationships--oh, pictures don't matter, with some more grease could be a reasonable representation of your network.

Q: Randy Bush asks if it could show which peering sessions are missing?
A: Not yet, but it could be added, thank you!
Shows processing and errors; you get a page that summarizes the things RCC thinks are errors.
Signalling partition? that's a missing iBGP session; he needs some better lingo in places.
Also shows anomalous imports, could be intended for traffic engineering; that's "inconsistent policy" in ISP speak.
Some of the names will get fixed to make Randy Bush happy.

Yes, but surprises happen!
 link failures
 node failures
 traffic volumes shift
 network devices "wedged"
 ...

two problems
 detection
 localization

Need to marry static config analysis with dynamic information (route is configured but isn't in the dynamic table)
he skips a closer look, just some jargon.

Detection: analyze routing dynamics; drill down on interesting operational issues.
 idea: routers exhibit correlated behaviour
 blips across signals may be more operationally interesting than any spike in one signalling system.
How do you spot things in the churn?

Detection
 three types of events
  single-router bursts
  correlated bursts
  multi-router bursts <---common; and commonly missed using simple thresholds

Localization: joint dynamic/static
 which routers are "border routers" for that burst
 topological properties of routers in the burst.

proactive analysis -> deployment -> dynamic -> reactive detection -> diagnosis/correction -> static ->
By going back to the configs, lets you see if it's something happening inside the network, or on the edge.

Specific Focus: firewall configuration
 difficult to understand and audit
 configs subject to continual modifications
  roughly 1-2 touches per day
 federated policy, distributed dependencies
  each department has independent policies
  local changes may affect global behaviour
(These are pulled from Georgia Tech; 130 firewall configs. Builds static connectivity matrix.)
Reactive monitoring...use probes from subnets to verify reachability/connectivity.
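
The static-versus-probe comparison described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the tool shown at the BOF: the subnet names, the list of allowed (source, destination) pairs, and both function names are hypothetical.

```python
from itertools import product


def build_expected_matrix(subnets, allow_rules):
    """Static view: map each (src, dst) pair to True if a rule permits it.

    allow_rules is a hypothetical, already-parsed list of (src, dst) pairs;
    real firewall configs would need a parser in front of this step.
    """
    allowed = set(allow_rules)
    return {
        (src, dst): (src, dst) in allowed
        for src, dst in product(subnets, repeat=2)
        if src != dst
    }


def compare_with_probes(expected, probe_results):
    """Dynamic view: report pairs where a probe disagrees with the configs."""
    mismatches = []
    for pair, should_reach in sorted(expected.items()):
        observed = probe_results.get(pair)
        if observed is None:
            continue  # no probe was run for this pair
        if observed != should_reach:
            kind = "unexpectedly reachable" if observed else "unexpectedly blocked"
            mismatches.append((pair[0], pair[1], kind))
    return mismatches


subnets = ["cs", "ee", "admin"]
rules = [("cs", "ee"), ("ee", "cs")]
probes = {("cs", "ee"): True, ("ee", "cs"): False, ("cs", "admin"): True}

expected = build_expected_matrix(subnets, rules)
for src, dst, kind in compare_with_probes(expected, probes):
    print(f"{src} -> {dst}: {kind}")
```

Pairs without a probe result are skipped rather than flagged, since a missing probe says nothing about reachability.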
(immediate) open issues
 reachability and reliability of controller
 service-level probes
 diagnostic tools != service-level happiness
 policy conformance.

Q: can it give suggested remediation, or provide config templates for new routers being added?
A: Good idea!

OK, over to next presenter. Helps with understanding BGP data.

BGP data collection and organization (OBGP) Tool
Colorado State University/University of Arizona/UCLA

BGP data collection
 takes lots of BGP data, from RIPE RIS, etc.
 ISP BGP peer router -> update oreg -> rib+update -> feeds into gigabytes of data, different formats, potential errors enter in, and severe lack of metadata.
 Other tools can use it, LinkRank, BGP-Inspect, and a bunch of people cite it in reports and research.

OBGP motivation
 Large Volume of Data
  data from many sources (RIPE, RV, private data)
  Long time scales and very recent (real-time?) data
 Slightly different formats
  RIPE/RV use different naming conventions
  different dump intervals
  different timezones for older data
 Lack of MetaData
  would like to only see desired peers and desired update types
 Possible errors in the data
  are updates missing due to log errors?
  what is lost due to session failures?

So, OBGP is the "thing" in the middle.
A simple perl script called oBGP that simplifies data.

Features:
 Uniform data organization
  consistent and easy to use for scripts
  consistent view of multiple monitoring points
 annotations/labels
  can be stripped, help locate useful data easily
 table transfer detection
  distinguish updates from data collection peering
 Data inconsistency detection and correction
  understand and fix possible data errors

Uniform data organization
 Uniform naming and organization conventions for all monitoring points
 RIB and update data split by peer
 One rib and update file per peer per day, dumped at beginning of the day.
Labels and Annotations are more interesting
 Existing format labels update as announce (A) or Withdraw (W)
  also includes some STATE messages
 OBGP enhances the labels
  Adds a status message
  Adds an update type
  More STATE messages
   route table dump
   table transfers
 A:INC:DPATH (shows it's an announcement, it's incremental, and it's updating the destination path)

OBGP Added labels
 |<original update type>:<status info>:<OBGP update type>|
 <original update type>
  add E for error correction
 <status info>
  INC  incremental update
  TT   table transfer update
  RIB  correction update
 <OBGP update type>
  new announcement
  duplicate announcement
  change in AS path (DPATH)
  change in other attribute (not ASpath)
  withdraw
  duplicate withdraw
If you don't need this, it's just a few extra characters in your log; but could be useful.

Using Labels to filter data
 example: find suballocation hijacks
 Only need new announcements and withdraws
 so 83% of the update data can be ignored.

Is the collected data accurate?
 May lose updates due to data collection errors
 start with an accurate RIB
 apply updates in log
 should match the next RIB dumped by the router
  modulo some race conditions near dump time
 does this clearly work with RouteViews?
 85 of 111 peers from RV suffered inconsistencies in May 2006
 About 25 were rock solid right on.
 One peer had 378,998 inconsistencies in one day.
 Is this evenly distributed? Not really.

Inconsistencies and session failures
 session down: RIB-IN drops to empty
 session up: table transfer
 (failure to recognize a session dropping)
 look for table transfer, can estimate where sessions went down and came back up.

How long does an error persist?
 Lifetime of correction updates can last 43 days!
 If you miss an update, you can have bad data for a long, LONG time!!

Correction updates added by OBGP
 E:RIB updates; figure a change in RIB had to happen due to a routing update that was missed.
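
The consistency check above (start from a RIB, replay the logged updates, compare against the next RIB dump) can be sketched as follows. This is a minimal sketch and not OBGP's actual code or file format: the record layout, the function names, and the sample prefixes and AS numbers (documentation ranges) are all assumptions made for illustration.

```python
def apply_updates(rib, updates):
    """Replay (kind, prefix, as_path) records onto a copy of the RIB.

    kind is 'A' (announce) or 'W' (withdraw); the tuple layout is hypothetical.
    """
    table = dict(rib)
    for kind, prefix, as_path in updates:
        if kind == "A":
            table[prefix] = as_path
        elif kind == "W":
            table.pop(prefix, None)
    return table


def correction_updates(replayed, next_rib):
    """Emit E:RIB-style corrections wherever the replay disagrees with the dump."""
    corrections = []
    for prefix in sorted(set(replayed) | set(next_rib)):
        have, want = replayed.get(prefix), next_rib.get(prefix)
        if have == want:
            continue
        if want is None:
            # Route is gone from the next dump: a withdraw was missed.
            corrections.append(("E:RIB", "W", prefix, None))
        else:
            # Route is new or changed in the next dump: an announce was missed.
            corrections.append(("E:RIB", "A", prefix, want))
    return corrections


rib = {"192.0.2.0/24": (64496, 64497), "198.51.100.0/24": (64496, 64498)}
log = [("W", "198.51.100.0/24", None), ("A", "203.0.113.0/24", (64496, 64499))]
next_rib = {"192.0.2.0/24": (64496, 64500), "203.0.113.0/24": (64496, 64499)}

for record in correction_updates(apply_updates(rib, log), next_rib):
    print(record)
```

The sketch ignores the race conditions near dump time mentioned above; a real check would need some tolerance window around each dump.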
Summary:
 consistent format
 adds label to easily sort and limit
 adds additional state messages
 identifies and corrects update error messages
http://netsec.cs.colostate.edu/tools.html
[NOTE URL at end of slide deck is WRONG --MNP]

If you're using RouteViews or RIPE RIS, consider using this tool, and give feedback!
Randy is using it to check propagation of his prefixes, and for research.

RIPE NCC--performance of these tools? With multiple collectors, perl didn't scale.
Perl is mainly demonstration. He pulls data from RIPE and has it stored, hopes to make it public some day.
he has stacks of disks with text format data for easy search; considering binary format for it.

Randy--on that subject: Matt Rowan, he's spent half his life getting the data out of the system, making it in funny format and sticking it back in. Disk is cheap! Look at raw data. With binary data, what tools are there? Hard enough to look at router configs. One tool to look at binary data, lots of tools to look at text.

Q: Matt asks how much space it takes to store data
A: Takes about 1TB to store all the RIS data.

Q: Are they planning to make it available to the public?
A: Well, he'd like to host it at route-views or ripe, rather than create a new site.

How long does it take to process the RIPE data?
Need a fast CPU, will take a couple of days to process the data.

Q: can it deal with live updates?
A: It can keep up with route-views and RIPE, but that's not live; there is a lag; route-views is every 15 minutes. The update files sometimes lag 8 hours before showing up on the site.

The Datapository
Nick Feamster and David Anderson?
 Architecture: raw data -> compute engines -> storage and DB plus archival storage -> analysis.
 Very alpha right now
 datapository.net
 NOT realtime! inserting data in greedy approach; when he needs it, he inserts it, and starts running queries.
 You can see a list of feeds, he has abilene but not route-views yet.
 Can restrict it, look at neighbor ASes, etc.
 see it in graphical form, or list form
 can diagnose issues, has an XML query engine and output for programmatically accessing it.
 If you use matlab, could be interesting to throw this into a multidimensional time series.
Randy Bush notes all his tools take MRT output.
Oh, he can spit out sparse matrices
He could spit out MRT format; he has python that speaks MRT format. he'll look at adding that.

Do spammers hijack BGP routes?
 Theory:
  1 announce BGP route for mail server
  2 send lots of spam
  3 withdraw route, becoming invisible
 reality? let's check!

export formats
 Web interface
 XML/RPC
 text-based output
 programmatic interface
 output to matlab
 and per Randy Bush, MRT format would be good too!

BGP-Inspect vs this tool?
 this has additional datasets beside BGP, like active probes, traffic, etc.
 This has a better collection setup as well; unified formats.

Mohit will do last one, Link-Rank
show the dynamics

Visualizing BGP dynamics with Link-Rank
 constructing rank-change graphs
 closest to BGPlay.
 weight is number of prefixes reached across that link.
 weight changes are on specific links, can do easy root-cause analysis.
 Activity bar--routing activity across time.
http://linkrank.cs.ucla.edu/

 green shows gains, red shows losses, sums all gains, sums all losses.
 visualization graph of where prefixes gained and lost.
 again, green are links that gain, red are links that lost.
 other observation points highlighted in orange

May 23rd, instability
 293 flapping from 1239 to 3356
 for multiple observation points, dashes are lost, solids are gains.
 highlight sources and sinks
 cutting one link explains most of the errors
 3561 to 4134 link issue
case II, one node that sucks in all the flows, no single link, 3356

Ongoing work, automated root cause identification
 min-cut scheme characterizing
Can look at destination link-rank graphs, see how the rest of the internet changes going to you.
 connectivity issues to 7018
should show your prefix hijackings moving from one link to another.
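
The link weight described above (number of prefixes reached across a link) and the gain/loss view can be sketched roughly as follows. This is an illustrative sketch only, not Link-Rank's implementation: the function names, the input layout (prefix mapped to an AS path from one observation point), and the sample AS numbers and prefixes (documentation ranges) are assumptions.

```python
from collections import Counter


def link_weights(routes):
    """Count, per AS-level link, how many prefixes are reached across it.

    routes maps prefix -> AS path (tuple of ASNs) as seen from one
    observation point; this input layout is hypothetical.
    """
    weights = Counter()
    for path in routes.values():
        # A set ignores AS prepending and counts each link once per prefix.
        links = {(a, b) for a, b in zip(path, path[1:]) if a != b}
        for link in links:
            weights[link] += 1
    return weights


def rank_changes(before, after):
    """Per-link weight delta: positive is a gain (green), negative a loss (red)."""
    w0, w1 = link_weights(before), link_weights(after)
    changes = {}
    for link in set(w0) | set(w1):
        delta = w1[link] - w0[link]
        if delta:
            changes[link] = delta
    return changes


before = {
    "192.0.2.0/24": (64500, 64501, 64510),
    "198.51.100.0/24": (64500, 64501, 64511),
}
after = {
    "192.0.2.0/24": (64500, 64502, 64510),  # this prefix moved to another link
    "198.51.100.0/24": (64500, 64501, 64511),
}

for link, delta in sorted(rank_changes(before, after).items()):
    print(link, "gain" if delta > 0 else "loss", abs(delta))
```

A prefix that moves from one link to another shows up as a matching loss and gain, which is the pattern the notes describe for spotting where routes shifted.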
Could simplify this to BGPlay if you wanted.

open source
http://linkrank.cs.ucla.edu/
 current version
  download client, configure...
 Future version
  work with any BGP data

Q: Matt asks when we'll be able to use our own BGP data;
A: about 4-5 months, hopefully! Haven't looked at netflow yet.

Q: Randy Bush. Common problem we all face. I'm at 42 peering points; my neighbors are X. I have route views dumps, I have my BGP dumps. I have my netflow data. Want a whatifatron that shows what happens to my traffic if I depeer someone, or add someone, or peer with SingTel in Singapore, or stop peering with Joe in SF. That's a question many operators ask every day.
A: Matt notes that if they can solve that question/write something that does all that, they'll have Arbor and others beating on their door. ^_^

Panel wraps up at 1728 hours Pacific time.