North American Network Operators Group


Re: Thoughts on increasing MTUs on the internet

  • From: Fred Baker
  • Date: Fri Apr 13 19:59:32 2007


I agree with many of your thoughts. This is essentially the same discussion we had upgrading from the 576 byte common MTU of the ARPANET to the 1500 byte MTU of Ethernet-based networks. Larger MTUs are a good thing, but are not a panacea. The biggest value in real practice is IMHO that the end systems deal with a lower interrupt rate when moving the same amount of data. That said, some who are asking about larger MTUs are asking for values so large that CRC schemes lose their value in error detection, and they find themselves looking at higher layer FEC technologies to make up for the issue. Given that there is an equipment cost related to larger MTUs, I believe that there is such a thing as an MTU that is impractical.


1500 byte MTUs in fact work. I'm all for 9K MTUs, and would recommend them. I don't see the point of 65K MTUs.

On Apr 14, 2007, at 7:39 AM, Simon Leinen wrote:


Ah, large MTUs. Like many other "academic" backbones, we implemented large MTUs (9192 bytes) on our backbone and 9000 bytes on some hosts. See [1] for an illustration. Here are *my* current thoughts on increasing the Internet MTU beyond its current value, 1500. (On the topic, see also [2] - a wiki page which is actually served from a host with a 9000-byte MTU :-)

Benefits of >1500-byte MTUs:

Several benefits of moving to larger MTUs, say in the 9000-byte range,
were cited.  I don't find them too convincing anymore.

1. Fewer packets reduce work for routers and hosts.

Routers:

   Most backbones seem to size their routers to sustain (near-)
   line-rate traffic even with small (64-byte) packets.  That's a good
   thing, because if networks were dimensioned to just work at average
   packet sizes, they would be pretty easy to DoS by sending floods of
   small packets.  So I don't see how raising the MTU helps much
   unless you also raise the minimum packet size - which might be
   interesting, but I haven't heard anybody suggest that.

   This should be true for routers and middleboxes in general,
   although there are certainly many places (especially firewalls)
   where pps limitations ARE an issue.  But again, raising the MTU
   doesn't help if you're worried about the worst case.  And I would
   like to see examples where it would help significantly even in the
   normal case.  In our network it certainly doesn't - we have Mpps to
   spare.

Hosts:

   For hosts, filling high-speed links at a 1500-byte MTU has
   repeatedly been difficult (with Fast Ethernet in the nineties,
   GigE 4-5 years ago, 10GE today), due to the high rate of
   interrupts/context switches and internal bus crossings.
   Fortunately tricks like polling-instead-of-interrupts (Saku Ytti
   mentioned this), Interrupt Coalescence and Large-Send Offload have
   become commonplace these days.  These give most of the end-system
   performance benefits of large packets without requiring any support
   from the network.
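
   As a rough illustration of both points, here is a back-of-the-envelope
   Python sketch (my numbers; I assume a 10 Gb/s link and standard
   Ethernet framing of 38 extra bytes per packet):

# Rates needed to keep an assumed 10 Gb/s link full.  Ethernet adds
# 14 B header + 4 B FCS + 8 B preamble + 12 B inter-frame gap around
# each IP packet, and pads frames to a 64-byte minimum.
LINK_BPS = 10e9

def wire_pps(ip_bytes):
    """Packets per second a router must forward at line rate."""
    wire_bytes = max(ip_bytes + 18, 64) + 20
    return LINK_BPS / (wire_bytes * 8)

def host_events_per_sec(bytes_per_event):
    """Interrupts/DMA completions per second if the host handles
    bytes_per_event of payload per event (framing ignored)."""
    return LINK_BPS / (bytes_per_event * 8)

print(f"router, 40 B packets (worst case) : {wire_pps(40)/1e6:6.2f} Mpps")
print(f"router, 1500 B packets            : {wire_pps(1500)/1e6:6.2f} Mpps")
print(f"router, 9000 B packets            : {wire_pps(9000)/1e6:6.2f} Mpps")
print(f"host, one event per 1500 B packet : {host_events_per_sec(1500)/1e3:4.0f} k/s")
print(f"host, one event per 9000 B packet : {host_events_per_sec(9000)/1e3:4.0f} k/s")
print(f"host, 64 KB coalesced per event   : {host_events_per_sec(64*1024)/1e3:4.0f} k/s")
# The ~14.9 Mpps worst case is untouched by a larger MTU - that's what
# routers are sized for - and interrupt coalescence / LSO already cuts
# the host's event rate further than a 9000-byte MTU alone would.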

2. Fewer bytes (saved header overhead) free up bandwidth.

   A bulk TCP transfer over Ethernet with a 1500-byte MTU is "only"
   94.2% efficient, while with a 9000-byte MTU it would be about 99%
   efficient.  While an improvement would certainly be nice, 94%
   already seems "good enough" to me.  (I'm ignoring the byte savings
   due to fewer ACKs.  On the other hand, not all packets will be able
   to grow sixfold - some transfers are small.)
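
   For reference, here is one way to arrive at figures in that range
   (a small sketch; I assume 52 bytes of TCP/IPv4 headers including
   the timestamp option, and 38 bytes of Ethernet framing per packet):

# Share of raw Ethernet bandwidth that carries TCP payload, assuming
# 52 bytes of TCP/IPv4 headers (incl. timestamps) and 38 bytes of
# Ethernet framing (header + FCS + preamble + inter-frame gap).
ETH_FRAMING = 14 + 4 + 8 + 12
TCPIP_HEADERS = 20 + 20 + 12

def efficiency(mtu):
    return (mtu - TCPIP_HEADERS) / (mtu + ETH_FRAMING)

for mtu in (1500, 9000):
    print(f"MTU {mtu}: {efficiency(mtu):.2%} payload efficiency")
# MTU 1500 -> roughly 94.2%, MTU 9000 -> roughly 99.0%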

3. TCP runs faster.

This boils down to two aspects (besides the effects of (1) and (2)):

a) TCP reaches its "cruising speed" faster.

      Especially with LFNs (Long Fat Networks, i.e. paths with a large
      bandwidth*RTT product), it can take quite a long time until TCP
      slow-start has increased the window so that the maximum
      achievable rate is reached.  Since the window increase happens
      in units of MSS (~MTU), TCPs with larger packets reach this
      point proportionally faster.

      This is significant, but there are alternative proposals to
      solve this issue of slow ramp-up, for example HighSpeed TCP [3].

b) You get a larger share of a congested link.

      I think this is true when a TCP-with-large-packets shares a
      congested link with TCPs-with-small-packets, and the packet loss
      probability isn't proportional to the size of the packet.  In
      fact the large-packet connection can get a MUCH larger share
      (sixfold for 9K vs. 1500) if the loss probability is the same
      for everybody (which it often will be, approximately).  Some
      people consider this a fairness issue, others think it's a good
      incentive for people to upgrade their MTUs.
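
To put rough numbers on (a) and (b), here is a small sketch using an
idealized Reno-style TCP and the well-known Mathis approximation
(throughput ~ C * MSS / (RTT * sqrt(p))).  The 10 Gb/s / 100 ms path
and the 1e-4 loss rate are illustrative assumptions of mine:

import math

# (a) Ramp-up on an assumed LFN: 10 Gb/s path, 100 ms RTT (BDP ~ 125 MB).
RATE_BPS, RTT = 10e9, 0.100
BDP = RATE_BPS * RTT / 8

for mss in (1460, 8960):                # typical MSS for 1500 / 9000 MTU
    segments = BDP / mss
    slow_start_rtts = math.log2(segments)      # window doubles per RTT
    cong_avoid_secs = (segments / 2) * RTT     # +1 MSS per RTT from W/2
    print(f"MSS {mss}: ~{slow_start_rtts:4.1f} RTTs of slow start, "
          f"~{cong_avoid_secs:5.0f} s to re-grow the window after one loss")

# (b) Share of a congested link: Mathis approximation with equal RTT and
# equal per-packet loss probability for both connections.
def mathis_bps(mss, rtt, p, c=math.sqrt(3 / 2)):
    return c * mss * 8 / (rtt * math.sqrt(p))

small, large = mathis_bps(1460, 0.1, 1e-4), mathis_bps(8960, 0.1, 1e-4)
print(f"1460-byte MSS gets {small/1e6:5.1f} Mb/s, 8960-byte MSS gets "
      f"{large/1e6:5.1f} Mb/s ({large/small:.1f}x the share)")
# Slow start itself only gains log2(6) ~ 2.6 RTTs from the larger MSS,
# but the linear phases - and the equilibrium share - scale with it.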

About the issues:

* Current Path MTU Discovery doesn't work reliably.

  Path MTU Discovery as specified in RFC 1191/1981 relies on ICMP
  messages to discover when a smaller MTU has to be used.  When these
  ICMP messages fail to arrive (or be sent), the sender will happily
  continue to send too-large packets into the blackhole.  This problem
  is very real.  As an experiment, try configuring an MTU < 1500 on a
  backbone link which has Ethernet-connected customers behind it.
  I bet that you'll receive LOUD complaints before long.

  Some other people mention that Path MTU Discovery has been refined
  with "blackhole detection" methods in some systems.  This is widely
  implemented, but not enabled by default (although it probably could
  be turned on with a "Service Pack").

  Note that a new Path MTU Discovery proposal was just published as
  RFC 4821 [4].  This is also supposed to solve the problem of relying
  on ICMP messages.

  Please, let's wait for these more robust PMTUD mechanisms to be
  universally deployed before trying to increase the Internet MTU.
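
  Roughly, the search in such a scheme could look like the sketch
  below (this is my simplification, not the RFC's exact algorithm;
  probe() is a hypothetical callback that sends a DF-marked probe of
  the given size and returns True if it was acknowledged end-to-end):

def search_pmtu(probe, lower=1280, upper=9000, resolution=8):
    """Binary-search the largest packet size confirmed to cross the path."""
    while upper - lower > resolution:
        candidate = (lower + upper) // 2
        if probe(candidate):
            lower = candidate     # got through; try something larger
        else:
            upper = candidate     # lost or blackholed; back off
    return lower

# Toy example: a path whose real PMTU is 1500 bytes.
print(search_pmtu(lambda size: size <= 1500))    # -> a value just below 1500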

* IP assumes a consistent MTU within a logical subnet.

  This seems to be a pretty fundamental assumption, and Iljitsch's
  original mail suggests that we "fix" this.  Umm, ok, I hope we don't
  miss anything important that makes use of this assumption.

  Seriously, I think it's illusory to try to change this for
  general networks, in particular large LANs.  It might work for
  exchange points or other controlled cases where the set of protocols
  is fairly well defined, but then exchange points have other options
  such as separate "jumbo" VLANs.

  For campus/datacenter networks, I agree that the consistent-MTU
  requirement is a big problem for deploying larger MTUs.  This is
  true within my organization - most servers that could use larger
  MTUs (NNTP servers for example) live on the same subnet as servers
  that will never be upgraded.  The obvious solution is to build
  smaller subnets - for our test servers I usually configure a
  separate point-to-point subnet for each of their Ethernet interfaces
  (I don't trust this bridging-magic anyway :-).
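
  (For what it's worth, a little Linux-only Python sketch to audit
  what MTU the interfaces on such a mixed subnet actually use - the
  interface names are just examples:)

import fcntl, socket, struct

SIOCGIFMTU = 0x8921            # Linux ioctl: read an interface's MTU

def interface_mtu(ifname):
    """Return the configured MTU of a local interface (Linux only)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    ifreq = struct.pack("16sI", ifname.encode(), 0)
    return struct.unpack("16sI", fcntl.ioctl(s.fileno(), SIOCGIFMTU, ifreq))[1]

for ifname in ("eth0", "eth1"):          # example interface names
    print(ifname, interface_mtu(ifname))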

* Most edges will not upgrade anyway.

  On the slow edges of the network (remaining modem users, exotic
  places, cellular data users etc.), people will NOT upgrade their MTU
  to 9000 bytes, because a single such packet would totally kill the
  VoIP experience (see the serialization-delay numbers sketched at the
  end of this point).  For medium-fast networks, large MTUs don't
  cause problems, but they don't help either.  So only a few
  super-fast edges have an incentive to do this at all.

  For the core networks that support large MTUs (like we do), this is
  frustrating because all our routers now probably carve their
  internal buffers for 9000-byte packets that never arrive.
  Maybe we're wasting lots of expensive linecard memory this way?
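
  Those serialization-delay numbers, as a quick sketch (the link
  speeds are just illustrative):

# Time one packet occupies a slow edge link, in milliseconds.
def serialization_ms(packet_bytes, link_bps):
    return packet_bytes * 8 / link_bps * 1e3

for name, bps in (("56 kb/s modem", 56e3), ("1 Mb/s DSL", 1e6),
                  ("100 Mb/s FE", 100e6)):
    print(f"{name:13s}: 1500 B = {serialization_ms(1500, bps):7.2f} ms, "
          f"9000 B = {serialization_ms(9000, bps):7.2f} ms")
# One 9000-byte packet ties up a 56 kb/s line for ~1.3 s and a 1 Mb/s
# line for ~72 ms; any VoIP packet queued behind it is toast.  At
# 100 Mb/s and above the difference stops mattering.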

* Chicken/egg

  As long as only a small minority of hosts supports >1500-byte MTUs,
  there is no incentive for anyone important to start supporting them.
  A public server supporting 9000-byte MTUs will be frustrated when it
  tries to use them.  The overhead (from attempted large packets that
  don't make it) and potential trouble will just not be worth it.
  This is a little similar to IPv6.

So I don't see large MTUs coming to the Internet at large soon.  They
probably make sense in special cases, maybe for "land-speed records"
and dumb high-speed video equipment, or for server-to-server stuff
such as USENET news.

(And if anybody out there manages to access [2] or http://ndt.switch.ch/
with 9000-byte MTUs, I'd like to hear about it :-)
--
Simon.


[1] Here are a few tracepaths (more or less traceroute with integrated
    PMTU discovery) from a host on our network in Switzerland.
    9000-byte packets make it across our national backbone (SWITCH),
    the European academic backbone (GEANT2), Abilene and CENIC in the
    US, as well as through AARnet in Australia (even over IPv6).  But
    the link from the last wide-area backbone to the receiving site
    inevitably has a 1500-byte MTU ("pmtu 1500").

: [email protected][leinen]; tracepath www.caida.org
1: mamp1-eth2.switch.ch (130.59.35.78) 0.110ms pmtu 9000
1: swiMA1-G2-6.switch.ch (130.59.35.77) 1.029ms
2: swiMA2-G2-5.switch.ch (130.59.36.194) 1.141ms
3: swiEL2-10GE-1-4.switch.ch (130.59.37.77) 4.127ms
4: swiCE3-10GE-1-3.switch.ch (130.59.37.65) 4.726ms
5: swiCE2-10GE-1-4.switch.ch (130.59.36.209) 4.901ms
6: switch.rt1.gen.ch.geant2.net (62.40.124.21) asymm 7 4.429ms
7: so-7-2-0.rt1.fra.de.geant2.net (62.40.112.22) asymm 8 12.551ms
8: abilene-wash-gw.rt1.fra.de.geant2.net (62.40.125.18) asymm 9 105.099ms
9: 64.57.28.12 (64.57.28.12) asymm 10 121.619ms
10: kscyng-iplsng.abilene.ucaid.edu (198.32.8.81) asymm 11 153.796ms
11: dnvrng-kscyng.abilene.ucaid.edu (198.32.8.13) asymm 12 158.520ms
12: snvang-dnvrng.abilene.ucaid.edu (198.32.8.1) asymm 13 180.784ms
13: losang-snvang.abilene.ucaid.edu (198.32.8.94) asymm 14 177.487ms
14: hpr-lax-gsr1--abilene-LA-10ge.cenic.net (137.164.25.2) asymm 20 179.106ms
15: riv-hpr--lax-hpr-10ge.cenic.net (137.164.25.5) asymm 21 185.183ms
16: hpr-sdsc-sdsc2--riv-hpr-ge.cenic.net (137.164.27.54) asymm 18 186.368ms
17: hpr-sdsc-sdsc2--riv-hpr-ge.cenic.net (137.164.27.54) asymm 18 185.861ms pmtu 1500
18: cider.caida.org (192.172.226.123) asymm 19 186.264ms reached
Resume: pmtu 1500 hops 18 back 19
: [email protected][leinen]; tracepath www.aarnet.edu.au
1: mamp1-eth2.switch.ch (130.59.35.78) 0.095ms pmtu 9000
1: swiMA1-G2-6.switch.ch (130.59.35.77) 1.024ms
2: swiMA2-G2-5.switch.ch (130.59.36.194) 1.115ms
3: swiEL2-10GE-1-4.switch.ch (130.59.37.77) 3.989ms
4: swiCE3-10GE-1-3.switch.ch (130.59.37.65) 4.731ms
5: swiCE2-10GE-1-4.switch.ch (130.59.36.209) 4.771ms
6: switch.rt1.gen.ch.geant2.net (62.40.124.21) asymm 7 4.424ms
7: so-7-2-0.rt1.fra.de.geant2.net (62.40.112.22) asymm 8 12.536ms
8: ge-3-3-0.bb1.a.fra.aarnet.net.au (202.158.204.249) asymm 9 13.207ms
9: so-0-1-0.bb1.a.sin.aarnet.net.au (202.158.194.145) asymm 10 217.846ms
10: so-3-3-0.bb1.a.per.aarnet.net.au (202.158.194.129) asymm 11 275.651ms
11: so-0-1-0.bb1.a.adl.aarnet.net.au (202.158.194.6) asymm 12 293.854ms
12: so-0-1-0.bb1.a.adl.aarnet.net.au (202.158.194.6) 297.989ms pmtu 1500
13: tiny-teddy.aarnet.edu.au (203.21.37.30) asymm 12 297.462ms reached
Resume: pmtu 1500 hops 13 back 12
: [email protected][leinen]; tracepath6 www.aarnet.edu.au
1?: [LOCALHOST] pmtu 9000
1: swiMA1-G2-6.switch.ch 1.328ms
2: swiMA2-G2-5.switch.ch 1.703ms
3: swiEL2-10GE-1-4.switch.ch 4.529ms
4: swiCE3-10GE-1-3.switch.ch 5.278ms
5: swiCE2-10GE-1-4.switch.ch 5.493ms
6: switch.rt1.gen.ch.geant2.net asymm 7 5. 99ms
7: so-7-2-0.rt1.fra.de.geant2.net asymm 8 13.239ms
8: ge-3-3-0.bb1.a.fra.aarnet.net.au asymm 9 13.970ms
9: so-0-1-0.bb1.a.sin.aarnet.net.au asymm 10 218.718ms
10: so-3-3-0.bb1.a.per.aarnet.net.au asymm 11 267.225ms
11: so-0-1-0.bb1.a.adl.aarnet.net.au asymm 12 299. 78ms
12: so-0-1-0.bb1.a.adl.aarnet.net.au 298.473ms pmtu 1500
12: www.ipv6.aarnet.edu.au 292.893ms reached
Resume: pmtu 1500 hops 12 back 12


[2] PERT Knowledgebase article: http://kb.pert.geant2.net/PERTKB/JumboMTU

[3] RFC 3649, HighSpeed TCP for Large Congestion Windows, S. Floyd,
    December 2003

[4] RFC 4821, Packetization Layer Path MTU Discovery. M. Mathis,
    J. Heffner, March 2007