North American Network Operators Group


2006.06.06 NANOG-NOTES network-level spam behaviour

  • From: Matthew Petach
  • Date: Wed Jun 07 07:19:12 2006

2006.06.06 Nick Feamster, Network-level spam behaviour
[slides are at:
http://www.nanog.org/mtg-0606/pdf/nick-feamster.pdf]

Spam
unsolicited commercial email
feb 2005, 90% of all email is spam
common filtering techniques are
content based
DNS blacklist queries are a significant fraction
 of DNS traffic today.  (DNSBLs)

Using IP address based spam black lists isn't so
useful.
How spammers evade blacklists will be discussed
as well.

Problems with content-based filters
...uh oh, some technical glitches...

Content-based properties are malleable
low cost to evasion
altering content based on scripts is too easy
customized emails are easy to generate
content based filters need fuzzy hashes over
 content, etc.
high cost to filter maintainers
as content changes, filters need to be updated.
constantly tweaking SpamAssassin rules is a pain.
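
To illustrate the "fuzzy hashes over content" point: an exact hash breaks as
soon as one word of a customized message changes, while a similarity score over
overlapping word groups degrades gradually. The sketch below uses word shingles
and Jaccard similarity as one possible fuzzy measure; this technique, the
function names, and the sample messages are assumptions for illustration and
were not named in the talk.

```python
import re

def shingles(text, k=3):
    """Return the set of k-word shingles of a lowercased, punctuation-stripped text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def jaccard(a, b):
    """Jaccard similarity of two sets, in [0, 1]."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Two hypothetical "customized" copies of the same spam; only the name differs.
msg1 = "Dear Alice, buy cheap watches now at our online store today"
msg2 = "Dear Bob, buy cheap watches now at our online store today"

# The messages are not byte-identical, yet most shingles are shared.
similarity = jaccard(shingles(msg1), shingles(msg2))
```

A filter built this way still needs a threshold that someone has to tune, which
is the maintenance cost the notes describe.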

false positives are always an issue.

Content-based filters are applied at the destination
too little, too late -- wasted network bandwidth,
 storage, etc.; many users receive and store the
 same spam content.

Network level spam filtering is robust (hypothesis)
network-level properties are more fixed
hosting or upstream ISP (AS number)
botnet membership
location in the network
IP address block
country?

are there common ISPs that host the spammers, for
example?
Avoid receiving mail from machines that are part
of botnets.

Challenge--which properties are most useful for
distinguishing spam traffic from legitimate email?

very little if anything is known about these
characteristics yet!

Randy gave a lightning talk last NANOG about some
of this.

Some properties listed.

Spamming techniques
mostly botnets, of course
other techniques too
we're trying to quantify this
coordination
characteristics
how we're doing this
correlations with Bobax victims
 from georgia tech botnet sinkhole
other possibilities: heuristics
distance of client IP from the MX record
coordinated, low-bandwidth sending

looked at pcaps of bots trying to talk to a hijacked
command and control station; these are spamming bots
from the Bobax drone botnet, which is used exclusively
to send spam.

Collection
two domains instrumented with MailAvenger (both on
the same network)
sinkhole domain 1
 continuous spam collection since aug 2004
 no real email addresses--sink everything
 10 million + pieces of spam
sinkhole domain #2
 recently registered Nov 2005
 "clean control" domain posted at a few places
 not much spam yet--perhaps being too conservative
 contact page with random email contact, look at
  who crawls, and then who spams the unique email
  addresses

Monitoring BGP route advertisements from same network

Also capturing traceroutes, DNSBL results, passive
TCP host fingerprinting, simultaneous with spam arrival
(results in this talk focus on BGP+ spam only)

Mail Avenger is not an MTA; it sits in front of the
MTA and hands off to sendmail or postfix. It does
things like DNSBL lookups, adding headers, and passive
OS fingerprinting as the spam is arriving.
Also logged BGP routes from same network that got
the spam; see connectivity to the spamming machine
at the time.
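
For reference, a DNSBL lookup is an ordinary DNS query: the client IP's octets
are reversed and prepended to the blacklist's zone, and an A record in the
answer (conventionally in 127.0.0.0/8) means the address is listed. A minimal
sketch of the name construction follows; the zone dnsbl.example.org and the
function name are placeholders, not anything Mail Avenger itself uses.

```python
import ipaddress

def dnsbl_query_name(ip, zone):
    """Build the DNS name to query for an IPv4 address in a given DNSBL zone."""
    # Validate the address, then reverse its four octets.
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone

# Hypothetical zone; 192.0.2.99 is a documentation address.
name = dnsbl_query_name("192.0.2.99", "dnsbl.example.org")
```

Resolving that name with any DNS library and checking for an A record completes
the lookup.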

Picture of collection up at MIT network.

Mail Collection: MailAvenger
X-Avenger header.
best guess at operating system (p0f), DNSBL
lookups, traceroutes back to mail relay at the
time the mail was sent (used for debugging BGP)

distribution across IP space
plot /24 prefix vs how much spam coming from it.
steeper lines mean more spam from that part
of the IP space; you can see where spam is
coming from.  bunch comes from apnic, cable
modem space, etc.
few interesting things to note; still redoing
legitimate mail characteristics.
from georgia tech mail machines, it's legit plus
spam, need to split out better.
between 90.* and 180.*, legitimate mail mainly.

Is IP-based blacklisting enough?
Probably not: more than half of spamming client IPs
appear less than twice.

Roughly 50% of the IPs showed up less than twice;
but that's a single sinkhole domain, would help
more across multiple domains.

emphasizes need to collaborate across multiple
domains to build blacklists; any one domain
won't see repeated patterns of IPs.
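
A small sketch of the measurement behind this point: count how often each
client IP appears, take the fraction that appears less than twice, and compare
one domain alone with two domains pooled. The sightings below are invented
documentation addresses and the function name is hypothetical; only the
"less than twice" criterion comes from the talk.

```python
from collections import Counter

def fraction_seen_once(sightings):
    """Fraction of distinct IPs that appear less than twice in the sightings."""
    counts = Counter(sightings)
    if not counts:
        return 0.0
    once = sum(1 for n in counts.values() if n < 2)
    return once / len(counts)

# Invented per-domain logs, one entry per spam message received.
domain_a = ["192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.3"]
domain_b = ["192.0.2.1", "192.0.2.2", "192.0.2.4"]

alone = fraction_seen_once(domain_a)              # domain A by itself
pooled = fraction_seen_once(domain_a + domain_b)  # both domains sharing data
```

In this toy data the pooled view sees repeats that neither domain sees alone,
which is the argument for collaborative blacklists.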

Distribution across ASes
40% of spam coming from the US

BGP spectrum agility
Log IP addresses of SMTP relays
Join with BGP route advertisements seen at network
where spam trap is co-located.
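
One way to sketch this join: treat each BGP announcement as an epoch (prefix,
announce time, withdraw time), find the most specific epoch covering each
spamming IP at the moment the mail arrived, and flag spam whose covering route
was short-lived. The Epoch type, the function names, the 600-second cutoff, and
the sample data are assumptions for illustration, not the speaker's actual
tooling.

```python
import ipaddress
from dataclasses import dataclass

@dataclass
class Epoch:
    prefix: str       # announced prefix, e.g. "203.0.113.0/24"
    announced: float  # announcement time, seconds
    withdrawn: float  # withdrawal time, seconds

def covering_epoch(ip, when, epochs):
    """Most specific epoch whose prefix covers ip and is active at time when."""
    addr = ipaddress.ip_address(ip)
    live = [e for e in epochs
            if e.announced <= when <= e.withdrawn
            and addr in ipaddress.ip_network(e.prefix)]
    return max(live, key=lambda e: ipaddress.ip_network(e.prefix).prefixlen,
               default=None)

def short_lived_spam(spam_log, epochs, max_seconds=600):
    """Return (ip, prefix) for spam that arrived over a short-lived route."""
    hits = []
    for ip, when in spam_log:
        epoch = covering_epoch(ip, when, epochs)
        if epoch is not None and epoch.withdrawn - epoch.announced <= max_seconds:
            hits.append((ip, epoch.prefix))
    return hits

# Invented data using documentation prefixes.
epochs = [Epoch("198.51.100.0/24", 0, 86400), Epoch("203.0.113.0/24", 1000, 1500)]
spam_log = [("198.51.100.7", 500), ("203.0.113.9", 1200), ("203.0.113.10", 5000)]
hits = short_lived_spam(spam_log, epochs)
```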

A small club of persistent players appears to be using
this technique
61.0.0.0/8 AS4678
66.0.0.0/8 AS21562
82.0.0.0/8 AS8717
somewhere between 1-10% of all spam (some clearly
intentional, others might be flapping)

about 10 minute announcement time of the /8 while
spam is flooded out.
Might be interesting to couple this with route
hijacking alerting to filter out if this is
really a hijacking vs a flapping legitimate route.

A slightly different pattern;
announce-spam-withdraw on a minute-by-minute basis.
really really egregious!

Why such big prefixes?
flexibility: client IPs can be scattered throughout
dark space within a large /8
 same sender usually returns with different IP
  addresses
visibility: route typically won't be filtered (nice
and short prefix length)

Characteristics of IP-agile senders
IP addresses are widely distributed across the /8 space
IP addresses typically appear only once at the sinkhole
Depending on which /8, 60-80% of these IP addresses
were not reachable by traceroute when we spot-checked
some IP addresses were in allocated, albeit unannounced
space
Some AS paths associated with the routes contained
reserved AS numbers

Odd AS numbers injected, usually well-known to make
it look more legitimate.
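
Checking an AS path for reserved numbers is a simple range test. The sketch
below uses today's 16-bit special-purpose ranges (0, AS_TRANS 23456,
documentation 64496-64511, private use 64512-65534, and 65535); the talk did
not say which reserved numbers were observed, so the ranges chosen, the sample
paths, and the function name are illustrative assumptions.

```python
# 16-bit AS numbers that should not appear in paths on the public Internet.
RESERVED_RANGES = [
    (0, 0),          # reserved
    (23456, 23456),  # AS_TRANS
    (64496, 64511),  # documentation
    (64512, 65534),  # private use
    (65535, 65535),  # reserved
]

def reserved_asns(as_path):
    """Return the AS numbers in the path that fall in a reserved range."""
    return [asn for asn in as_path
            if any(low <= asn <= high for low, high in RESERVED_RANGES)]

# Invented example paths.
suspicious = reserved_asns([3356, 64512, 4678])
clean = reserved_asns([701, 1239])
```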

Length of short-lived BGP epochs
10% of spam coming from short-lived BGP events

Spam from Botnets
Example: Bobax
approximate size: 100k bots

one sinkhole domain--this is ONLY spam that is
verifiable as coming from bots: IPs seen at the
hijacked command and control, intersected with the
single sinkhole domain. A much smaller data subset,
but well correlated and verified.

Proportionally less spam from bots in 61-90
range; that tends to be where BGP route hijacks
happen instead.

Most Bot IP addresses do not return
65% of bots only send mail to a domain once over
18 months.
Some hang around for a *long* time.
About 20% stick around for several months.

collaborative spam filtering seems to be helping
track bot IP addresses.

Most bots send low volumes of spam
most bot IP addresses send very little spam regardless
of how long they have been spamming

Effectiveness of blacklisting:
only about half of the IPs spamming from short-lived
BGP are listed in any blacklist
spam from IP-agile senders tends to be listed in fewer
blacklists

Looking at 8 different spam blacklists, checking when
the spam arrives at the sinkhole.

Known Bobax drones listed in more DNSBLs than the
BGP agile senders.
About 90-95% of the Bobax bot drones are listed
in one or more DNSBLs.

Suggests some of the spamming bots are listed more
than other techniques--that is, bots are easier to
identify than BGP-agile spammers or spammers using
other techniques.
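
The coverage figures above come down to counting, per sending IP, how many
blacklists listed it when the spam arrived. A minimal sketch, assuming the
lookup results have already been recorded as a mapping from IP to per-list
booleans; the list names, addresses, and function names are invented.

```python
def listing_counts(results):
    """Map each IP to the number of blacklists that listed it."""
    return {ip: sum(1 for listed in per_list.values() if listed)
            for ip, per_list in results.items()}

def fraction_listed(results, at_least=1):
    """Fraction of IPs listed in at least the given number of blacklists."""
    counts = listing_counts(results)
    if not counts:
        return 0.0
    return sum(1 for n in counts.values() if n >= at_least) / len(counts)

# Invented lookup results recorded at spam arrival time.
results = {
    "192.0.2.1": {"bl-a": True, "bl-b": True},
    "192.0.2.2": {"bl-a": False, "bl-b": True},
    "192.0.2.3": {"bl-a": False, "bl-b": False},
    "192.0.2.4": {"bl-a": True, "bl-b": False},
}
any_list = fraction_listed(results)
two_lists = fraction_listed(results, at_least=2)
```

Running this separately over the bot IPs and the BGP-agile IPs gives the kind
of comparison reported here.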

Harvesting
tracking web-based harvesting
register domain, set up MX record
post, link to page with randomly generated email addresses
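
A sketch of how such a trap page could work: hand every visitor a fresh random
address, remember who was given it and when, and later look up the recipient
of any spam to learn which crawl harvested it. The class name, the domain
trap.example, and the sample values are hypothetical; the talk's actual setup
was not described in this detail.

```python
import secrets

class HarvestTrap:
    """Issue unique addresses per page visit and trace spam back to the crawl."""

    def __init__(self, domain):
        self.domain = domain.lower()
        self.issued = {}  # address -> (crawler IP, visit timestamp)

    def address_for(self, crawler_ip, timestamp):
        """Generate a random address for this visit and record who received it."""
        address = f"{secrets.token_hex(8)}@{self.domain}"
        self.issued[address] = (crawler_ip, timestamp)
        return address

    def who_harvested(self, rcpt_to):
        """Return (crawler IP, timestamp) for a spammed address, or None."""
        return self.issued.get(rcpt_to.lower())

# Hypothetical visit from a crawler at a documentation address.
trap = HarvestTrap("trap.example")
first = trap.address_for("192.0.2.10", 1137369600)
second = trap.address_for("192.0.2.10", 1137369601)
```

Comparing the recorded crawler IP with the IP that later delivers the spam
shows whether harvesting and sending are done by the same hosts.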

Example Phish:
a flood of email for a phishing attack for paypal.com
all to: addresses harvested in a single crawl on
 January 16th 2006
emails received from IPs different from those who
 crawl.
X-mailer headers totally different.

Lessons for better spam filters:
effective spam filtering requires a better notion of
end-host identity
distribution of spamming IP addresses is highly
skewed
detection based on network-wide, aggregate behavior
may be more fruitful than focusing on individual IPs
large, emergent properties.

two critical pieces of the puzzle
botnet detection
securing the internet's routing infrastructure

compare distributions of spam to legitimate mail,
see if certain spaces are more likely to send spam
than legitimate mail.

Questions:
Q: Steve Bellovin, columbia university
bots from strange ASes, is tunnelling taking
place from bots to BGP speakers?
A: Not sure if there's evidence or not; some data
from  TORS??
but TORS latency may be too high.

Q: Fingerprinting to try to identify who is doing
things, see how many hosts are actually doing
this?
Many addresses being used, how many hosts
does it actually represent?
A: Not sure, haven't checked that.
Haven't checked on aliasing, since not much
was seen from a single IP.
NAT'ing?
What about hosts hopping? (same host using multiple
IPs?)
Not sure, they didn't do that correlation.

Q: Randy Bush, IIJ, they did do OS fingerprinting,
so some of that is in the paper.
didn't do anything with the traceroutes, though.

Q: Matt asks what the difference between the two
domains was; was one of them a recognizable word
or name, or were they both random character strings?
A: they were both random character strings, but one
of them had been used to host a real website for a
while, which might explain why it gets such a huge
volume of spam compared to the other.
Q: Matt points out that for some networks, receiving
spam is actually a good thing, as it helps balance
out traffic ratios, which helps during peering
negotiations.

Q: Randy Bush, IIJ, responding to Matt about traffic
ratios: only backbones that are on ADSL should
care which way traffic goes.  :P

Curious to work with large networks, see if filters
could be installed to detect it, and possibly take
action.