North American Network Operators Group

Re: Research - Valid Data Gathering vs Annoying Others

  • From: Mike Leber
  • Date: Sat Aug 07 02:11:14 2004

On Fri, 6 Aug 2004, John K Lerchey wrote:
> Hi NANOG folks,
> 
> We have a situation (which has come up in the past) that I'd like some 
> opinions on.
> 
> Periodically, we have researchers who develop projects which will do 
> things like randomly port probe off-campus addresses...

Here are some observations based on an internal corporate R&D project we
ran about 4 years ago that crawled all the websites on the Internet for
use with a search engine.

* Lower your impact.  Limit the number of requests sent to a specific IP
within a time period.  Limit how fast you make requests.  Don't assume
adjacent IPs are different servers; they may be the same machine, so don't
make parallel requests to IPs within the same /24.  Limit the total number
of requests you make to a specific IP.  Limit the amount of data
transferred from each IP.

* Implement a block list so that you stop scanning people who ask you to
stop.

* Make your hostname something that helps explain what you are doing.

* Make sure that other people in your group know that you are running the
experiment and who to forward phone calls to.

* Run a webserver on the IP or IPs doing the scanning, with a page that
explains what you are doing.

* Honor robots.txt, and other "access denied" type responses or error
codes.

* Don't assume the data returned is valid or nonhostile.  Some people run
search engine traps (infinitely large programmatically generated websites)
to try to salt the search engines with their bogus advertising data.  
Some people want to crash any program that scans them.  Some people will
do things you didn't think of.

* Expect some people to send automated complaints without knowing that 
they are sending them and without understanding the contents of the
complaints they are sending.

* Expect some people to complain about you attacking them on port 53 when
you look up the address for their domain name, even if you never scan
their website or otherwise interact with any of their IPs.  (During the
experiment this was the largest source of complaints.)

* If you run the project 24 x 7, you need to respond 24 x 7.
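
To make the rate limit, block list, and robots.txt points above a bit more
concrete, here is a minimal Python sketch.  It is not the code we ran.  The
names PolitenessGate and robots_allows and the default thresholds are made
up for illustration, and it assumes IPv4 targets.  It spaces out requests
into the same /24 rather than tracking in-flight connections, which is only
an approximation of "no parallel requests".

```python
import ipaddress
import time
from collections import defaultdict
from urllib.robotparser import RobotFileParser


class PolitenessGate:
    """Hypothetical helper: decides whether a request to an IP may go out now."""

    def __init__(self, min_interval=10.0, max_requests=100, blocked=()):
        self.min_interval = min_interval    # seconds between requests into one /24
        self.max_requests = max_requests    # total cap per individual IP
        self.blocked = [ipaddress.ip_network(b, strict=False) for b in blocked]
        self.last_seen = {}                 # /24 network -> time of last request
        self.count = defaultdict(int)       # IP -> requests sent so far

    def block(self, cidr):
        """Add an address or prefix after someone asks you to stop."""
        self.blocked.append(ipaddress.ip_network(cidr, strict=False))

    def allow(self, ip, now=None):
        """Return True and record the request, or False if it must be skipped."""
        now = time.monotonic() if now is None else now
        addr = ipaddress.ip_address(ip)
        if any(addr in net for net in self.blocked):
            return False                    # on the block list
        if self.count[addr] >= self.max_requests:
            return False                    # per-IP total reached
        # Treat the whole /24 as possibly one server (IPv4 assumption).
        subnet = ipaddress.ip_network(f"{addr}/24", strict=False)
        last = self.last_seen.get(subnet)
        if last is not None and now - last < self.min_interval:
            return False                    # too soon after a neighbour
        self.last_seen[subnet] = now
        self.count[addr] += 1
        return True


def robots_allows(robots_lines, user_agent, url):
    """Check a URL against robots.txt content that was already fetched."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)
```

A caller would check gate.allow(ip) before every request and simply requeue
the target when it returns False.  Block list entries should also be saved
somewhere persistent so they survive a restart; that part is left out here.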

Mike.

+----------------- H U R R I C A N E - E L E C T R I C -----------------+
| Mike Leber           Direct Internet Connections   Voice 510 580 4100 |
| Hurricane Electric     Web Hosting  Colocation       Fax 510 580 4151 |
| [email protected]                                       http://www.he.net |
+-----------------------------------------------------------------------+