Ah, large MTUs. Like many other "academic" backbones, we implemented
large (9192 bytes) MTUs on our backbone and 9000 bytes on some hosts.
See [1] for an illustration. Here are *my* current thoughts on
increasing the Internet MTU beyond its current value, 1500. (On the
topic, see also [2] - a wiki page which is actually served from a
host with a 9000-byte MTU :-)
Benefits of >1500-byte MTUs:
Several benefits of moving to larger MTUs, say in the 9000-byte range,
have been cited. I no longer find them very convincing.
1. Fewer packets reduce work for routers and hosts.
Routers:
Most backbones seem to size their routers to sustain (near-)
line-rate traffic even with small (64-byte) packets. That's a good
thing, because if networks were dimensioned to just work at average
packet sizes, they would be pretty easy to DoS by sending floods of
small packets. So I don't see how raising the MTU helps much
unless you also raise the minimum packet size - which might be
interesting, but I haven't heard anybody suggest that.
This should be true for routers and middleboxes in general,
although there are certainly many places (especially firewalls)
where pps limitations ARE an issue. But again, raising the MTU
doesn't help if you're worried about the worst case. And I would
like to see examples where it would help significantly even in the
normal case. In our network it certainly doesn't - we have Mpps to
spare.
Hosts:
For hosts, filling high-speed links at a 1500-byte MTU has
repeatedly been difficult at the cutting edge (Fast Ethernet in the
nineties, GigE 4-5 years ago, 10GE today), due to the high rate of
interrupts/context switches and internal bus crossings.
Fortunately tricks like polling-instead-of-interrupts (Saku Ytti
mentioned this), Interrupt Coalescence and Large-Send Offload have
become commonplace these days. These give most of the end-system
performance benefits of large packets without requiring any support
from the network.
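To put rough numbers on the "fewer packets" argument, here is a quick
back-of-the-envelope in Python for a (hypothetical) 10 Gbit/s Ethernet
link; the 64-byte worst case is what routers are typically dimensioned
for, regardless of MTU:

    # Packets per second needed to fill 10 Gbit/s Ethernet at various
    # frame sizes.  64 bytes is the minimum frame; 1518 and 9018 bytes
    # correspond to 1500- and 9000-byte MTUs plus the 18-byte Ethernet
    # header/FCS.  20 bytes of preamble, SFD and inter-frame gap are
    # spent per frame on the wire.
    LINK_BPS = 10e9
    WIRE_GAP = 7 + 1 + 12   # preamble, start-of-frame delimiter, IFG

    for frame in (64, 1518, 9018):
        pps = LINK_BPS / ((frame + WIRE_GAP) * 8)
        print(f"{frame:5d}-byte frames: {pps / 1e6:6.2f} Mpps")

    # ~14.88 Mpps at 64 bytes, ~0.81 Mpps at 1518, ~0.14 Mpps at 9018 -
    # jumbos only reduce the average-case load, not the worst case.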
2. Fewer bytes (saved header overhead) free up bandwidth.
Carrying TCP segments over Ethernet with a 1500-byte MTU is "only"
94.2% efficient, while with a 9000-byte MTU it would be roughly 99%
efficient.
While an improvement would certainly be nice, 94% already seems
"good enough" to me. (I'm ignoring the byte savings due to fewer
ACKs. On the other hand not all packets will be able to grow
sixfold - some transfers are small.)
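For reference, those efficiency figures can be reproduced with a few
lines of Python (assuming 38 bytes of Ethernet framing per packet and
52 bytes of IPv4/TCP headers including timestamps; the exact
percentages shift a bit with other header assumptions):

    # Link efficiency of bulk TCP over Ethernet at two MTUs.
    ETH_OVERHEAD = 7 + 1 + 14 + 4 + 12   # preamble+SFD, header, FCS, IFG
    IP_TCP_HEADERS = 20 + 20 + 12        # IPv4, TCP, TCP timestamp option

    for mtu in (1500, 9000):
        payload = mtu - IP_TCP_HEADERS   # TCP payload bytes per packet
        on_wire = mtu + ETH_OVERHEAD     # bytes the frame occupies on the link
        print(f"MTU {mtu}: {100.0 * payload / on_wire:.1f}% efficient")

    # MTU 1500: ~94.1% efficient; MTU 9000: ~99.0% efficient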
3. TCP runs faster.
This boils down to two aspects (besides the effects of (1) and
(2)):
a) TCP reaches its "cruising speed" faster.
Especially with LFNs (Long Fat Networks, i.e. paths with a large
bandwidth*RTT product), it can take quite a long time until TCP
slow-start has increased the window so that the maximum
achievable rate is reached. Since the window increase happens
in units of MSS (~MTU), TCPs with larger packets reach this
point proportionally faster.
This is significant, but there are alternative proposals to
solve this issue of slow ramp-up, for example HighSpeed TCP [3].
b) You get a larger share of a congested link.
I think this is true when a TCP-with-large-packets shares a
congested link with TCPs-with-small-packets, and the packet loss
probability isn't proportional to the size of the packet. In
fact the large-packet connection can get a MUCH larger share
(sixfold for 9K vs. 1500) if the loss probability is the same
for everybody (which it often will be, approximately). Some
people consider this a fairness issue, others think it's a good
incentive for people to upgrade their MTUs.
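Both effects are easy to sketch in a few lines of Python (the
1 Gbit/s / 100 ms path below is just an example): congestion avoidance
opens the window by one MSS per RTT, so ramping up - or recovering
after a loss - scales inversely with the MSS, and the standard
throughput approximation (rate ~ MSS / (RTT * sqrt(p))) gives the
large-MSS flow a proportionally larger share at equal loss probability:

    import math

    RATE_BPS = 1e9                   # example path: 1 Gbit/s ...
    RTT = 0.100                      # ... with a 100 ms round-trip time
    BDP_BYTES = RATE_BPS / 8 * RTT   # bandwidth*delay product (12.5 MB)

    for mss in (1460, 8960):         # typical MSS for 1500/9000-byte MTUs
        window = BDP_BYTES / mss                 # segments needed in flight
        ss_rtts = math.ceil(math.log2(window))   # slow start doubles per RTT
        ca_secs = window * RTT                   # cong. avoidance: +1 MSS per RTT
        print(f"MSS {mss}: ~{window:.0f}-segment window, "
              f"~{ss_rtts} RTTs of slow start, "
              f"~{ca_secs:.0f} s to open the full window at 1 MSS/RTT")

    # With equal loss probability and RTT, throughput ~ MSS/(RTT*sqrt(p)),
    # so the 8960-byte-MSS flow gets about 8960/1460 ~ 6x the share.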
About the issues:
* Current Path MTU Discovery doesn't work reliably.
Path MTU Discovery as specified in RFC 1191/1981 relies on ICMP
messages to discover when a smaller MTU has to be used. When these
ICMP messages fail to arrive (or be sent), the sender will happily
continue to send too-large packets into the blackhole. This problem
is very real. As an experiment, try configuring an MTU < 1500 on a
backbone link which has Ethernet-connected customers behind it.
I bet that you'll receive LOUD complaints before long.
Other people have mentioned that Path MTU Discovery has been refined
with "black hole detection" methods in some systems. This is widely
implemented, but usually not enabled by default (although it probably
could be with a "Service Pack").
Note that a new Path MTU Discovery proposal was just published as
RFC 4821 [4]. This is also supposed to solve the problem of relying
on ICMP messages.
Please, let's wait for these more robust PMTUD mechanisms to be
universally deployed before trying to increase the Internet MTU.
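(For the record, here is what the classic ICMP-driven mechanism looks
like from an application on Linux - a rough sketch with a hypothetical
target host and port, not a robust prober: ask the kernel to set DF on
a UDP socket, try to send a jumbo-sized datagram, and read back the
kernel's current path-MTU estimate, which of course only shrinks if
the "fragmentation needed" ICMP messages actually make it back.)

    import errno, socket

    IP_MTU_DISCOVER = 10   # Linux socket-option numbers from <linux/in.h>
    IP_PMTUDISC_DO = 2     # always set the Don't-Fragment bit
    IP_MTU = 14            # read the cached path MTU of a connected socket

    def probe(host, size=8972, port=33434):
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
            s.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, IP_PMTUDISC_DO)
            s.connect((host, port))
            try:
                s.send(b"\0" * size)           # may elicit ICMP "frag needed"
            except OSError as e:
                if e.errno != errno.EMSGSIZE:  # raised once a smaller PMTU is cached
                    raise
            return s.getsockopt(socket.IPPROTO_IP, IP_MTU)

    print(probe("www.caida.org"))   # example destination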
* IP assumes a consistent MTU within a logical subnet.
This seems to be a pretty fundamental assumption, and Iljitsch's
original mail suggests that we "fix" this. Umm, ok, I hope we don't
miss anything important that makes use of this assumption.
Seriously, I think it's illusionary to try to change this for
general networks, in particular large LANs. It might work for
exchange points or other controlled cases where the set of protocols
is fairly well defined, but then exchange points have other options
such as separate "jumbo" VLANs.
For campus/datacenter networks, I agree that the consistent-MTU
requirement is a big problem for deploying larger MTUs. This is
true within my organization - most servers that could use larger
MTUs (NNTP servers for example) live on the same subnet as servers
that will never bother to be upgraded. The obvious solution is to
build smaller subnets - for our test servers I usually configure a
separate point-to-point subnet for each of their Ethernet interfaces
(I don't trust this bridging magic anyway :-).
* Most edges will not upgrade anyway.
On the slow edges of the network (remaining modem users, exotic
places, cellular data users etc.), people will NOT upgrade their MTU
to 9000 bytes, because a single such packet would totally kill the
VoIP experience (see the serialization-delay numbers below). For
medium-fast networks, large MTUs don't cause problems, but they
don't help either. So only a few super-fast edges have an incentive
to do this at all.
For core networks that already support large MTUs (like ours), this
is frustrating because all our routers now probably carve their
internal buffers for 9000-byte packets that never arrive.
Maybe we're wasting lots of expensive linecard memory this way?
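The VoIP argument is easy to quantify - the serialization delay of a
single large packet on a slow link eats the entire delay/jitter
budget. A quick Python sketch (the link speeds are just examples):

    # Time a single packet blocks a slow access link (serialization delay).
    for name, bps in (("56 kbit/s modem", 56e3),
                      ("1 Mbit/s DSL uplink", 1e6),
                      ("10 Mbit/s Ethernet", 10e6)):
        for size in (1500, 9000):
            delay_ms = size * 8 / bps * 1000
            print(f"{name}: {size}-byte packet -> {delay_ms:.1f} ms")

    # 56 kbit/s: 214 ms vs. 1286 ms; 1 Mbit/s: 12 ms vs. 72 ms -
    # either way, far longer than a voice packet can afford to wait.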
* Chicken/egg
As long as only a small minority of hosts supports >1500-byte MTUs,
there is no incentive for anyone important to start supporting them.
A public server supporting 9000-byte MTUs will be frustrated when it
tries to use them. The overhead (from attempted large packets that
don't make it) and potential trouble will just not be worth it.
This is a little similar to IPv6.
So I don't see large MTUs coming to the Internet at large soon. They
probably make sense in special cases, maybe for "land-speed records"
and dumb high-speed video equipment, or for server-to-server stuff
such as USENET news.
(And if anybody out there manages to access [2] or
http://ndt.switch.ch/ with 9000-byte MTUs, I'd like to hear about
it :-)
--
Simon.
[1] Here are a few tracepaths (more or less traceroute with integrated
PMTU discovery) from a host on our network in Switzerland.
9000-byte packets make it across our national backbone (SWITCH),
the European academic backbone (GEANT2), Abilene and CENIC in the
US, as well as through AARnet in Australia (even over IPv6). But
the link from the last wide-area backbone to the receiving site
inevitably has a 1500-byte MTU ("pmtu 1500").
: [email protected][leinen]; tracepath www.caida.org
 1:  mamp1-eth2.switch.ch (130.59.35.78)  0.110ms pmtu 9000
 1:  swiMA1-G2-6.switch.ch (130.59.35.77)  1.029ms
 2:  swiMA2-G2-5.switch.ch (130.59.36.194)  1.141ms
 3:  swiEL2-10GE-1-4.switch.ch (130.59.37.77)  4.127ms
 4:  swiCE3-10GE-1-3.switch.ch (130.59.37.65)  4.726ms
 5:  swiCE2-10GE-1-4.switch.ch (130.59.36.209)  4.901ms
 6:  switch.rt1.gen.ch.geant2.net (62.40.124.21)  asymm 7  4.429ms
 7:  so-7-2-0.rt1.fra.de.geant2.net (62.40.112.22)  asymm 8  12.551ms
 8:  abilene-wash-gw.rt1.fra.de.geant2.net (62.40.125.18)  asymm 9  105.099ms
 9:  64.57.28.12 (64.57.28.12)  asymm 10  121.619ms
10:  kscyng-iplsng.abilene.ucaid.edu (198.32.8.81)  asymm 11  153.796ms
11:  dnvrng-kscyng.abilene.ucaid.edu (198.32.8.13)  asymm 12  158.520ms
12:  snvang-dnvrng.abilene.ucaid.edu (198.32.8.1)  asymm 13  180.784ms
13:  losang-snvang.abilene.ucaid.edu (198.32.8.94)  asymm 14  177.487ms
14:  hpr-lax-gsr1--abilene-LA-10ge.cenic.net (137.164.25.2)  asymm 20  179.106ms
15:  riv-hpr--lax-hpr-10ge.cenic.net (137.164.25.5)  asymm 21  185.183ms
16:  hpr-sdsc-sdsc2--riv-hpr-ge.cenic.net (137.164.27.54)  asymm 18  186.368ms
17:  hpr-sdsc-sdsc2--riv-hpr-ge.cenic.net (137.164.27.54)  asymm 18  185.861ms pmtu 1500
18:  cider.caida.org (192.172.226.123)  asymm 19  186.264ms reached
     Resume: pmtu 1500 hops 18 back 19
: [email protected][leinen]; tracepath www.aarnet.edu.au
 1:  mamp1-eth2.switch.ch (130.59.35.78)  0.095ms pmtu 9000
 1:  swiMA1-G2-6.switch.ch (130.59.35.77)  1.024ms
 2:  swiMA2-G2-5.switch.ch (130.59.36.194)  1.115ms
 3:  swiEL2-10GE-1-4.switch.ch (130.59.37.77)  3.989ms
 4:  swiCE3-10GE-1-3.switch.ch (130.59.37.65)  4.731ms
 5:  swiCE2-10GE-1-4.switch.ch (130.59.36.209)  4.771ms
 6:  switch.rt1.gen.ch.geant2.net (62.40.124.21)  asymm 7  4.424ms
 7:  so-7-2-0.rt1.fra.de.geant2.net (62.40.112.22)  asymm 8  12.536ms
 8:  ge-3-3-0.bb1.a.fra.aarnet.net.au (202.158.204.249)  asymm 9  13.207ms
 9:  so-0-1-0.bb1.a.sin.aarnet.net.au (202.158.194.145)  asymm 10  217.846ms
10:  so-3-3-0.bb1.a.per.aarnet.net.au (202.158.194.129)  asymm 11  275.651ms
11:  so-0-1-0.bb1.a.adl.aarnet.net.au (202.158.194.6)  asymm 12  293.854ms
12:  so-0-1-0.bb1.a.adl.aarnet.net.au (202.158.194.6)  297.989ms pmtu 1500
13:  tiny-teddy.aarnet.edu.au (203.21.37.30)  asymm 12  297.462ms reached
     Resume: pmtu 1500 hops 13 back 12
: [email protected][leinen]; tracepath6 www.aarnet.edu.au
1?: [LOCALHOST] pmtu 9000
1: swiMA1-G2-6.switch.ch 1.328ms
2: swiMA2-G2-5.switch.ch 1.703ms
3: swiEL2-10GE-1-4.switch.ch 4.529ms
4: swiCE3-10GE-1-3.switch.ch 5.278ms
5: swiCE2-10GE-1-4.switch.ch 5.493ms
6: switch.rt1.gen.ch.geant2.net asymm 7 5. 99ms
7: so-7-2-0.rt1.fra.de.geant2.net asymm 8 13.239ms
8: ge-3-3-0.bb1.a.fra.aarnet.net.au asymm 9 13.970ms
9: so-0-1-0.bb1.a.sin.aarnet.net.au asymm 10 218.718ms
10: so-3-3-0.bb1.a.per.aarnet.net.au asymm 11 267.225ms
11: so-0-1-0.bb1.a.adl.aarnet.net.au asymm 12 299. 78ms
12: so-0-1-0.bb1.a.adl.aarnet.net.au 298.473ms pmtu 1500
12: www.ipv6.aarnet.edu.au 292.893ms reached
Resume: pmtu 1500 hops 12 back 12
[2] PERT Knowledgebase article:
    http://kb.pert.geant2.net/PERTKB/JumboMTU
[3] RFC 3649, HighSpeed TCP for Large Congestion Windows, S. Floyd,
    December 2003.
[4] RFC 4821, Packetization Layer Path MTU Discovery, M. Mathis,
    J. Heffner, March 2007.