North American Network Operators Group


2006.06.07 NANOG-NOTES TCP Anycast--don't spread the FUD!

  • From: Matthew Petach
  • Date: Fri Jun 09 21:12:52 2006

(this was one of the coolest talks from the three days, actually,
and has gotten me *really* jazzed about some cool stuff we can
do internally.  Huge props to Matt, Barrett, and Todd for putting
this together!!  --Matt)


2006.06.07 TCP anycast, Matt Levine, Barrett Lyon
with thanks to Todd Underwood
TCP anycast, don't believe the FUD
Todd Underwood is in Chicago
Barrett Lyon starts off.
[slides may eventually be at:
http://www.nanog.org/mtg-0606/pdf/tcp-anycast.pdf ]

IPv4 anycast
from a network perspective, nothing special
just another route with multiple next-hops
services exist on each next-hop, and respond
from the anycast IP address.
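
(A rough illustration, not from the talk: hypothetical POPs and AS-path
lengths, a few lines of Python just to make the point.  The same prefix
is originated from several POPs, and ordinary BGP best-path selection --
approximated here by shortest AS path -- collapses it to one next-hop
per vantage point, which is all "anycast" means to the network.)

# documentation prefix standing in for the anycast service block
ANYCAST_PREFIX = "192.0.2.0/24"

# AS-path length toward the anycast prefix as seen from three vantage
# points; the numbers are invented for illustration
paths_seen = {
    "client-in-frankfurt": {"pop-ams": 2, "pop-sjc": 5, "pop-chi": 4},
    "client-in-seattle":   {"pop-ams": 5, "pop-sjc": 2, "pop-chi": 3},
    "client-in-toronto":   {"pop-ams": 4, "pop-sjc": 3, "pop-chi": 2},
}

for client, candidates in paths_seen.items():
    # each router independently picks its best (shortest) path
    chosen_pop = min(candidates, key=candidates.get)
    print(f"{client}: traffic for {ANYCAST_PREFIX} lands at {chosen_pop}")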

It's the packets, stupid
perceived problem: TCP and anycast don't play together
for long-lived flows.
e.g., high-def porn downloads
[do porn streams need to last more than 2 minutes?]
some claim it exists, and works...
yes, been in production for years now.

Anycast at CacheFly
deployed in 2002
prefix announced on 3 continents
3 POPs in US
5 common carriers (transit) + peering
 be sensible about who you peer with
Effective BGP communities from upstreams are key
 to keeping traffic where you want it.

Proxy Anycast
proxy traffic is easy to anycast!
move HTTP traffic through proxy servers.
customers are isolated on a VIP/virtual address, which
happens to exist in every datacenter.
Virtual address lives over common carriers, allowing
even distribution of traffic.
state is maintained with custom hardware that keeps
state information synchronized across proxies.
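
(A toy sketch of the idea, assuming nothing about their actual custom
hardware: each proxy replicates its connection table to its peers, so a
flow that lands on a different POP after a route change can still be
recognized.  The node names and the naive replication scheme are
invented for illustration.)

class ProxyNode:
    def __init__(self, name):
        self.name = name
        self.peers = []        # other ProxyNode objects
        self.sessions = {}     # (client_ip, client_port) -> backend

    def accept(self, client_ip, client_port, backend):
        # record the new connection locally and push it to every peer
        key = (client_ip, client_port)
        self.sessions[key] = backend
        for peer in self.peers:
            peer.sessions[key] = backend

    def handle_packet(self, client_ip, client_port):
        # a packet for an unknown flow would otherwise be reset
        backend = self.sessions.get((client_ip, client_port))
        return backend or "no state -- would reset without synchronization"

sjc, chi = ProxyNode("sjc"), ProxyNode("chi")
sjc.peers, chi.peers = [chi], [sjc]

sjc.accept("198.51.100.7", 40123, "origin-1")
# a route flap moves the client's packets to the Chicago node mid-session:
print(chi.handle_packet("198.51.100.7", 40123))   # -> origin-1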

Node geography
anycast nodes that do not keep state must be
 geographically separated
Coasts and countries work really well for keeping
route instability largely isolated.
Nodes that are nearby may require shared state
between them if local routes are unstable.

IP utilization
"Anycast is wasteful"
people use /24s as their service blocks, using only 1 /32 out
of a whole /24.
Really?  How much IP space do you need to advertise
from 4 sites via unicast?
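
(Back-of-the-envelope version of that question, assuming the usual rule
of thumb that a /24 is the smallest announcement most networks accept:)

SITES = 4
ADDRS_PER_SLASH24 = 256

unicast_total = SITES * ADDRS_PER_SLASH24   # one routable /24 per site
anycast_total = ADDRS_PER_SLASH24           # one /24 announced from every site

print(f"unicast from {SITES} sites: {unicast_total} addresses consumed")
print(f"anycast from {SITES} sites: {anycast_total} addresses consumed")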

Carriers and Peering
for content players, having even peering and carriers
 is key.
 you may cause EU eyeballs to go to CA if you're not
  careful about where you peer with people.
having an EU-centric transit provider in the US without
  having the same routes in the EU could cause EU traffic
   to home to the US.
 Use quality global providers to keep traffic balanced.

When peering...
keep in mind a peer may isolate traffic to a specific
  anycast node
Try to peer with networks where it makes sense; don't
 advertise your anycast to them where they don't have
 eyeballs!
Try to make sure your peers and transit providers know
 your communities and what you're trying to do, and
 make sure you understand their communities well!

Benefits of Anycast.
for content players
moving traffic without major impact or DNS lag
provides buffers for major failures
allows for simple traffic management, with a major
 (potential) performance upside.
it's BGP you don't control, though, so there's not much
 you can do to adjust which inbound path wins.
HTTP pays a significant cost when DNS is used to shift
traffic around; it can take six or more DNS lookups to
acquire content; anycast trims those DNS lookups down
significantly!
Ability to interface tools to traffic management.
No TTL issues!

Data, May 9, 2006
Renesys: monitored changes in atomic-aggregator for
a CacheFly anycast prefix
AS path changes and pop changes
Keynote: monitored availability/performance of a 30k file
Revision3: monitored behaviour of "long-lived" downloads
of the DiggNation videocast--over 7TB transferred.

Renesys data:
130 BGP updates for May 9th; a low-volume day
stable prefixes
34 distinct POP changes based on the atomic-aggregator
property on prefixes
130 updates is considered a stable prefix.
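
(The counting itself is simple; a sketch with synthetic update records,
not the May 9 data -- "aggregator" is used here as a stand-in for
whatever attribute identifies the originating POP:)

updates = [
    {"time": "00:14", "aggregator": "pop-sjc"},
    {"time": "03:40", "aggregator": "pop-sjc"},   # update, but no POP change
    {"time": "07:02", "aggregator": "pop-chi"},   # POP change
    {"time": "07:03", "aggregator": "pop-sjc"},   # POP change back
]

pop_changes = 0
last_pop = None
for upd in updates:           # walk the updates in time order
    if last_pop is not None and upd["aggregator"] != last_pop:
        pop_changes += 1
    last_pop = upd["aggregator"]

print(f"{len(updates)} updates, {pop_changes} distinct POP changes")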

SJC issue:
thirty-five minute window, 0700 to 0735 UTC, saw:
98 updates, 20 actual POP changes based on
 atomic-aggregator changes, all from one San Jose
 provider, failing over from SJC to CHI and back to SJC
unable to correlate these shifts with any traffic
changes; most likely we don't have a big enough
sample size.
possibly just not a lot of people using those routes.

BGP seems stable--what about TCP flows?

Average time from SJC to CHI and back again was about
20 seconds; very quick on the trigger to go back to
SJC; this would break all TCP sessions in flight at the
time.
For the most part, TCP seems stable.

Keynote: 30k file download from 31 locations every 5 minutes,
or an average of 1 poll every 9.6 seconds
compared against 'Keynote Business 40'
data collected on May 9, 2006
represents short-lived TCP flows, though.

Orange line is Keynote business 40
pegged 100% availability
load time was lower than the business 40.
(0.2s vs 0.7s for business 40)

Revision3 data
monitored IPTV downloads for 24 hours (thanks, jay!)
span port; analyzed packet captures
look for new TCP sessions not beginning with SYN
compare that against global active connection table.
looked for sessions that appeared out of nowhere.
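
(A simplified version of that check, over a hypothetical list of
already-decoded packets rather than a live span port -- any flow whose
first locally seen packet is not a SYN "appeared out of nowhere",
i.e. it most likely started life at another POP:)

packets = [
    # (src_ip, src_port, dst_ip, dst_port, tcp_flags) -- synthetic examples
    ("203.0.113.5", 51514, "192.0.2.10", 80, "S"),   # normal session start
    ("203.0.113.5", 51514, "192.0.2.10", 80, "A"),
    ("198.51.100.9", 44001, "192.0.2.10", 80, "A"),  # no SYN ever seen here
]

active = set()    # flows whose SYN was seen locally
orphans = set()   # flows that showed up mid-stream

for src, sport, dst, dport, flags in packets:
    flow = (src, sport, dst, dport)
    if "S" in flags:
        active.add(flow)
    elif flow not in active:
        orphans.add(flow)

print(f"{len(orphans)} session(s) appeared without a local SYN: {sorted(orphans)}")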

Long-lived data
683,204 TCP sessions.
anything less than 10 minutes thrown out
23,795 sessions lasted longer than 10 minutes.
average file size for the day was 300MB
4 TCP sessions moved between POPs.
0.0006% total POP switch 'failure' rate
0.017% for long-lived (more than 10 minutes) sessions.
only looking for sessions that start in A, then move to B;
they dropped 0 of them, due to their state preservation
mechanism.
without the state preservation mechanism, 4 out of
23,795 connections would have been dropped.
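
(Quick check of the quoted rates, using the session counts above:)

total_sessions = 683_204
long_lived = 23_795
moved = 4

print(f"all sessions : {moved / total_sessions:.4%}")   # ~0.0006%
print(f"long-lived   : {moved / long_lived:.4%}")       # ~0.017%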

Anycast gotchas.
large scale changes in provider policies can
impact your traffic, up to you to figure out
what happened.
"Things that are bad" become much, much worse,
notably per-packet load balancing across provider
or topologicl boundries.
eg customer with 2 T1s between Anaheim and Dallas
Sprint POPs.  per-packet load balancing across
the two, before the anycast nodes shared state
information between nodes was probably still
seeing performance issues.

conclusion:
stateful anycast is not inherently unstable, and
failure/disconnect rates are in line with offering
unicast services
this runs counter to some conclusions drawn from
previously published data.

some other company did work in this area; is there other
failure rate data available?  Verisign said not to
do TCP anycast, but they didn't publish failure
rates for TCP.

Need to see where TCP really, really breaks.

"Trust us, it works"

widespread failures cause havoc; however, the
internet doesn't go crazy *that* often.

Transitioning to IPv6:
as for the move to IPv6, there *is* a plan for it;
the plan is to hope they're dead by the time customers
actually demand v6.

what you can do
stop telling people TCP anycast doesn't work if you haven't
tried it yourself!  It just makes them mad to hear it.
if your application doesn't handle TCP/IP failures
gracefully, don't run anycast, in fact don't run it
on the internet at all.
SMTP and HTTP work well; browsers support reset and
retries, for example
Experiment with other applications
share your experiences--they want to know if their results
are anomalous, if they're crazy, or if this really
does just work.

Q: Mark Kosters, Verisign: did one presentation.
It depends on the client base; their client base was very
far-flung; many problems in the far reaches of the internet
and per-packet load balancing showed up in their
data.
How big a clientele spread do you want to reach,
do you hit core, or far edge?
A: they try to reach customers where they get money
for the material being served; not necessarily geared
for global connectivity.  Yes, unicast in outlying
cases will be more robust, but doesn't scale as well.

Q: Randy Bush, IIJ, 10th anniversary of much of the
Anycast UDP deployment.  Response of "it's good if
you engineer it" applies to most things.  They
wanted to narrow peering and topology; Randy says
there are cases where that is exactly NOT
true.  In 1996, they needed to shed traffic off
their backbone; they had to support streaming, and did
TCP from all anycast nodes to all peers, the exact opposite
of what the speakers were calling for.
A: This talk is specific to content they deal with
directly, was engineered for their customer's content
and needs.  Main point is that it's not as bad as
people claim, and benefits can be substantial.
And yes, you will need to engineer it to your
particular case.

Q: Bill Woodcock: over past 10-12 years, many have seen
long-lived TCP connections with 0.01% failure rates;
Mark's results with j-root were surprisingly different.
Methodologies were sound, so why might we be seeing
these bimodal results?
A: different results trying to be achieved.  Streaming
video vs universal reachability for DNS.  Perhaps worth
looking at the specific needs being aimed for?
DALnet runs with very long-lived TCP connections, some
lasting for months; they're not picky about who they get
transit from, and they're anycasting their IRC servers.

Q: Danny McPherson: 10 years ago people were using this
to try to pass stuff faster to make Keynote
numbers look better.
Announcing prefixes from multiple places under one origin AS
may trigger security/prefix-hijacking alerts; the
state sharing they do may have helped considerably.
A: Not as scary as people think it is.
some open source state synchronization tools would
be nice.  If someone can do some work in that area,
they'd love to support it.

Q: Michael ?, UCB: thanks, good to see studies like this!
Not all applications/protocols can handle this; it's hard to
then generalize this to "TCP" in general
like that.
A: they meant application more than protocol.
Q: TCP has methods of dealing with out-of-order packets.
not many people use per-packet load balancing; you don't
know what might change in future, though.

Q: Matt Peterson: can you give a description of the size of the
files being moved around?  Also, the state mechanisms--you
mentioned open source--is it something you might
consider releasing?
A: content, don't remember exactly, look at Revision3,
see their show.
Average file size was 350MB, ranging from 200-650MB.
For their state, they hacked stuff together themselves;
they'd be happier to support something community-based,
rather than release their own code.

any more questions, ask them offline; we HAVE to keep moving
to get out of the convention center on time!!