North American Network Operators Group
Re: STILL Paging Google...
- From: MH
- Date: Tue Nov 15 21:40:13 2005
Hi there,
Looking at your robots.txt... are you sure that is correct?
On the sites I host.. robots.txt always has:
User-Agent: *
Disallow: /
In /htdocs or wherever the httpd root lives. Thus far it keeps the
spiders away.
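
One way to sanity-check a blanket rule like the one above is Python's standard urllib.robotparser module. This is only a minimal sketch: the example.com URL and the is_allowed helper are placeholders for illustration and are not taken from the thread.

```python
from urllib.robotparser import RobotFileParser

# The blanket rule quoted above, supplied as text instead of fetched over HTTP.
ROBOTS_TXT = """\
User-Agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Hypothetical helper: parse robots.txt text and ask whether
    user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.com is a placeholder host.
    print(is_allowed(ROBOTS_TXT, "Googlebot", "http://example.com/index.html"))
```

Note that this module implements the original robots.txt conventions only, so it is not a way to test wildcard rules.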
Googlebot will also obey NOARCHIVE, NOFOLLOW, and NOINDEX placed within
the robots meta tag inside the HTML header.
-M.
With the above for robots.txt I've had no problems th
Still no word from google, or indication that there's anything wrong with the
robots.txt. Google's estimated hit count is going slightly up, instead of
way down.
Why am I bugging NANOG with this? Well, I'm sure if Googlebot keeps ignoring
my robots.txt file, thereby hammering the server and facilitating spam,
they're doing the same with a google other sites. (Well, ok, not a google,
but you get my point.)
The above page says that
User-agent: Googlebot
Disallow: /*?
will block all standard-looking dynamic content, i.e. URLs with "?" in them.
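
To make the intended effect of that rule concrete, here is a rough sketch of wildcard matching. It assumes that "*" matches any run of characters and that a rule matches as a prefix of the path plus query string; that is how this extension is usually described, but it is an assumption here, not something confirmed in this thread. The rule_matches name is made up for illustration.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Hypothetical helper: True if a wildcard Disallow pattern matches
    the start of path (path plus query string).

    Assumed semantics: '*' matches any run of characters, every other
    character is literal, and the rule only has to match a prefix.
    """
    # Escape the literal pieces, then join them with ".*" where each '*' was.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    # re.match anchors at the start only, which gives prefix matching.
    return re.match(regex, path) is not None


if __name__ == "__main__":
    for candidate in ("/index.php?id=3", "/about.html"):
        print(candidate, rule_matches("/*?", candidate))
```

Under these assumptions "/*?" and "/*?*" block exactly the same URLs, since the trailing "*" adds nothing to a prefix match.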
On Mon, 14 Nov 2005, Matthew Elvey wrote:
Doh! I had no idea my thread would require login/be hidden from general
view! (A robots.txt info site had directed me there...) It seems I fell
for an SEO scam... how ironic. I guess that's why I haven't heard from
google...
Anyway, here's the page content (with some editing and paraphrasing):
Subject: paging google! robots.txt being ignored!
Hi. My robots.txt was put in place in August!
But google still has tons of results that violate the file.
http://www.searchengineworld.com/cgi-bin/robotcheck.cgi
doesn't complain (other than about the use of google's nonstandard
extensions described at
http://www.google.com/webmasters/remove.html )
The above page says that it's OK that
#per [[AdminRequests]]
User-agent: Googlebot
Disallow: /*?*
is last (after User-agent: *)
and seems to suggest that the syntax is OK.
I also tried
User-agent: Googlebot
Disallow: /*?
but it hasn't helped.
I asked google to review it via the automatic URL removal system
(http://services.google.com/urlconsole/controller).
Result:
URLs cannot have wild cards in them (e.g. "*"). The following line
contains a wild card:
DISALLOW: /*?
How insane is that?
Oh, and while /*?* wasn't per their example, it was legal, per their
syntax, same as /*? !
The site has around 35,000 pages, and I don't think a small robots.txt to
do what I want is possible without using the wildcard extension.