North American Network Operators Group


Re: Extreme congestion (was Re: inter-domain link recovery)

  • From: Fred Baker
  • Date: Wed Aug 15 14:02:46 2007


Let me answer at least twice.


As you say, remember the end-2-end principle. The end-2-end principle, in my precis, says "in deciding where functionality should be placed, do so in the simplest, cheapest, and most reliable manner when considered in the context of the entire network. That is usually close to the edge." Note the presence of advice and absence of mandate.

Parekh and Gallager, in their 1993 papers on the topic, proved using control theory that if we can specify the amount of data that each session keeps in the network (for some definition of "session") and, for each link the session crosses, define exactly what the link will do with it, we can mathematically predict the delay the session will experience. TCP congestion control as presently defined tries to manage delay by adjusting the window; some algorithms literally measure delay, while most measure loss, which is the extreme case of delay. The math tells me that the place to control the rate of a session is in the end system. Funny thing, that is found "close to the edge".
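For concreteness, the bound I have in mind has roughly this shape (my own notation, from memory, so take it as a sketch of the Parekh-Gallager result rather than a quotation of it): a session constrained by a leaky bucket with burst \sigma_i and rate \rho_i, crossing K WFQ-like hops that each guarantee it a service rate g_i \ge \rho_i, sees a worst-case end-to-end delay of about

    D_i \;\le\; \frac{\sigma_i}{g_i} \;+\; (K-1)\,\frac{L_i}{g_i} \;+\; \sum_{k=1}^{K} \frac{L_{\max,k}}{C_k}

where L_i is the session's maximum packet size and L_{\max,k}, C_k are the maximum packet size and speed of link k. The point is simply that once the per-hop service rate is pinned down, the delay is pinned down too.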

What ISPs routinely try to do is adjust routing in order to maximize their ability to carry customer sessions without increasing their outlay for bandwidth. It's called "load sharing", and we have a list of ways we do that, notably in recent years using BGP advertisements. Where Parekh and Gallager calculated what the delay was, the ISP has the option of minimizing it through appropriate use of routing.
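As an aside, here is a toy illustration of the "spread the sessions over the links you still have" idea - this is per-flow hashing rather than BGP traffic engineering per se, and every name in it is made up:

import hashlib

def pick_egress(src_ip, dst_ip, src_port, dst_port, links):
    """Hash a flow's 4-tuple onto one of the usable egress links."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}".encode()
    index = int.from_bytes(hashlib.sha1(key).digest()[:4], "big") % len(links)
    return links[index]

# After two of four links fail, flows simply re-hash onto the survivors.
usable = ["linkA", "linkC"]
print(pick_egress("192.0.2.1", "198.51.100.7", 51515, 80, usable))

In practice the steering is done with BGP policy (prepending, MEDs, more-specifics and the like); the sketch just shows why placement is done per flow, so an individual session isn't reordered across paths.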

i.e., edge and middle both have valid options, and the totality works best when they work together. That may be heresy, but it's true. When I hear my company's marketing line on intelligence in the network (which makes me cringe), I try to remind my marketing folks that the best use of intelligence in the network is to offer intelligent services to the intelligent edge that enable the intelligent edge to do something intelligent. But there is a place for intelligence in the network, and routing is its poster child.

In your summary of the problem, the assumption is that both of these are operative and have done what they can - several links are down, the remaining links (including any rerouting that may have occurred) are full to the gills, TCP is backing off as far as it can back off, and even so due to high loss little if anything productive is in fact happening. You're looking for a third "thing that can be done" to avoid congestive collapse, which is the case in which the network or some part of it is fully utilized and yet accomplishing no useful work.

So I would suggest that a third thing that can be done, after the other two avenues have been exhausted, is to decide not to start new sessions unless there is some reasonable chance that they will be able to accomplish their work. This is a burden I would not want to put on the host, because the probability of the situation arising is vanishingly small - any competent network operator is going to solve the problem with money if it is other than transient. But from where I sit, it looks like the "simplest, cheapest, and most reliable" place to detect overwhelming congestion is at the congested link, and given that sessions tend to be of finite duration and present semi-predictable loads, if you want to allow established sessions to complete, you want to run the established sessions in preference to new ones. The thing to do is delay the initiation of new sessions.
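To make that concrete, here is a rough strawman of my own (the threshold, hold time, and packet representation are all invented for illustration) of what the congested link could do: forward traffic belonging to established sessions as usual, and defer packets that would start a new session while utilization stays pathological.

CONGESTION_THRESHOLD = 0.95   # fraction of link capacity; an assumed knob
HOLD_SECONDS = 5.0            # how long to defer a new-session attempt

def is_session_open(pkt):
    """True for packets that start a session: a bare TCP SYN or an SCTP INIT."""
    if pkt["proto"] == "tcp" and "SYN" in pkt["flags"] and "ACK" not in pkt["flags"]:
        return True
    return pkt["proto"] == "sctp" and pkt.get("chunk") == "INIT"

def admit(pkt, utilization):
    """Forward established traffic; defer new sessions while the link is swamped."""
    if utilization < CONGESTION_THRESHOLD or not is_session_open(pkt):
        return ("forward", 0.0)
    # Overwhelming congestion: delay the birth of the session rather than
    # shedding packets from sessions that are already making progress.
    return ("delay", HOLD_SECONDS)

# e.g. admit({"proto": "tcp", "flags": "SYN"}, utilization=0.99) -> ("delay", 5.0)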

If I had an ICMP that went to the application, and if I trusted the application to obey me, I might very well say "dear browser or p2p application, I know you want to open 4-7 TCP sessions at a time, but for the coming 60 seconds could I convince you to open only one at a time?". I suspect that would go a long way. But there is a trust issue - would enterprise firewalls let it get to the host, would the host be able to get it to the application, would the application honor it, and would the ISP trust the enterprise/host/application to do so? Is DDoS possible? <mumble>
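If such an advisory existed and actually reached the application, honoring it might look something like this - entirely hypothetical, since the signal, the hook, and the numbers are all invented for illustration:

import threading
import time

class ConnectionGovernor:
    """Caps how many TCP sessions the application opens in parallel."""

    def __init__(self, normal_parallel=6):
        self.restricted_until = 0.0
        self.state_lock = threading.Lock()
        self.single_open = threading.Lock()        # one setup at a time when restricted
        self.normal_sem = threading.BoundedSemaphore(normal_parallel)  # the usual 4-7

    def congestion_advisory(self, seconds=60):
        """Hypothetical hook: 'please open one session at a time for <seconds>'."""
        with self.state_lock:
            self.restricted_until = time.time() + seconds

    def open_connection(self, connect_fn):
        with self.state_lock:
            restricted = time.time() < self.restricted_until
        gate = self.single_open if restricted else self.normal_sem
        with gate:
            return connect_fn()

The application already has a natural knob - how many sessions it opens in parallel - and the advisory only has to turn that knob down temporarily.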

So plan B would be to rate-limit, in some way, the passage of TCP SYN/SYN-ACK and SCTP INIT such that the hosed links remain fully utilized but sessions that have become established get acceptable service (maybe not great service, but they eventually complete without failing).
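A back-of-the-envelope sketch of that, with invented rates and a toy packet representation: meter session-setup packets through a small token bucket so they trickle in, and let everything else bypass the meter entirely. "Hold" here means queue at low priority or delay, not drop - the initiating host will retransmit the SYN or INIT anyway if it waits too long.

import time

class TokenBucket:
    """Small meter for session-setup packets; the rates here are made up."""

    def __init__(self, rate_per_sec=10.0, burst=20.0):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

def is_session_setup(pkt):
    """TCP SYN or SYN-ACK, or SCTP INIT: the handshake traffic being metered."""
    if pkt["proto"] == "tcp" and "SYN" in pkt["flags"]:
        return True
    return pkt["proto"] == "sctp" and pkt.get("chunk") == "INIT"

setup_meter = TokenBucket(rate_per_sec=10.0, burst=20.0)

def enqueue(pkt):
    """Established traffic bypasses the meter; handshakes trickle through it."""
    if not is_session_setup(pkt):
        return "forward"
    return "forward" if setup_meter.allow() else "hold"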

On Aug 15, 2007, at 8:59 AM, Sean Donelan wrote:

On Wed, 15 Aug 2007, Fred Baker wrote:
On Aug 15, 2007, at 8:35 AM, Sean Donelan wrote:
Or should IP backbones have methods to predictably control which IP applications receive the remaining IP bandwidth? Similar to the telephone network special information tone -- All Circuits are Busy. Maybe we've found a new use for ICMP Source Quench.

Source Quench wouldn't be my favored solution here. What I might suggest is taking TCP SYN and SCTP INIT (or new sessions if they are encrypted or UDP) and putting them into a lower priority/rate queue. Delaying the start of new work would have a pretty strong effect on the congestive collapse of the existing work, I should think.

I was joking about Source Quench (missing :-); it's got a lot of problems.


But I think the fundamental issue is who is responsible for controlling the back-off process? The edge or the middle?

Using different queues implies the middle (i.e. routers). At best it might be the "near-edge," creating some type of shared knowledge between past, current and new sessions in the host stacks (and maybe middle-boxes like NAT gateways).

How fast do you need to signal large-scale back-off, and over what time period? Since major events in the real world also result in a lot of "new" traffic, how do you signal new sessions before they reach the affected region of the network? Can you use BGP to signal the far reaches of the Internet that I'm having problems, and that other ASNs should start slowing things down before traffic reaches my region (a security can-o-worms being opened)?