North American Network Operators Group
Re: STILL Paging Google...
Ok, the bug is still there. I received replies from helpful folks who missed various parts of my posts. I'll stop posting about this now; it is indeed a bit OT.

As I said in my initial post: I'm looking for a fix, not a workaround. And again: see http://www.google.com/webmasters/remove.html

The above page says that

    User-agent: Googlebot
    Disallow: /*?

will block all standard-looking dynamic content, i.e. URLs with "?" in them.

On 11/16/05 11:44 AM, Michael Loftis sent forth electrons to convey:
I think that maybe googlebot parses robots.txt in order, so it's seeing your User-Agent: * line before its more specific line and matching that.

Could be, but their documentation, as I mentioned, specifically says otherwise.

Michael Dillon wrote:
[put dynamic content in cgi-bin and have robots.txt block it]

AGAIN, I'm just asking Google to comply with the documentation they provide! In other words, Googlebot is broken; it doesn't do what its documentation claims it will do. The correct operational response is for Google to fix it. Whether they change the code or the documentation is their choice. I'd say allowing * to be special is a change worth making, despite the robustness principle. (FYI, IETF does from time to time knowingly make changes that are not backwards-compatible.)

Oh, and a ? in a URL has been a near-certain sign of dynamic content for a decade.

Oh, and I'm not a MediaWiki developer...

Niels Bakker wrote:
robots.txt is about explicitly spidering your site; Google will still follow links from outside towards your website and index pages linked that way. [...]

No, the robots.txt is being violated. There aren't ~40,000 links to the site. Only around 130, according to http://www.google.com/search?q=link%3Awiki.fastmail.fm

On 11/16/05 8:49 AM, Mike Damm sent forth electrons to convey:
Could you please give me the URL to your robots.txt?

It was implied, below. (Oh, and they removed it from my webmasterworld forum post; it was in there initially.)
On 11/15/05, Matthew Elvey <[email protected]> wrote:
(http://www.google.com/search?q=site%3Awiki.fastmail.fm)
http://wiki.fastmail.fm/robots.txt

On 11/16/05 7:44 AM, Bill Weiss sent forth electrons to convey:

Bill: you're right, except that Google has defined and documented an extension, as I mentioned.

I attempted to respond on Nanog, but I don't have posting privs there it seems. What I tried to send then:

http://www.robotstxt.org/wc/norobots.html
Specifically, http://www.robotstxt.org/wc/faq.html#robotstxt covers the problem you're having. To paraphrase: you don't get wildcards in the Disallow section. Fall back on using the META tags that do that sort of thing, or reorg your website to make it possible without wildcards.

If you would forward this to the list for me, I would appreciate it.

On 11/15/05 5:23 PM, William Yardley sent forth electrons to convey:
On Tue, Nov 15, 2005 at 04:56:12PM -0800, Matthew Elvey wrote:

Yup. Emailed 'em on my last post. Thanks. I'll hit some google folks directly. I just know someone in the gmail area, pretty far removed.

Also, there were some folks from Google at the last NANOG meeting - look near the top of the attendee list, and there is someone who I believe works on security stuff - googling should turn up her email address pretty quickly.
In particular http://www.searchengineworldDOTcom/robots/robots_tutorial.htm

It seems I fell for an SEO scam... how ironic. I guess that's why I haven't heard from google...

Anyway, here's the page content (with some editing and paraphrasing):

Subject: paging google! robots.txt being ignored!

Hi. My robots.txt was put in place in August! But google still has tons of results that violate the file.

http://www.searchengineworld.com/cgi-bin/robotcheck.cgi doesn't complain (other than about the use of google's nonstandard extensions described at http://www.google.com/webmasters/remove.html ).

The above page says that it's OK that

    # per [[AdminRequests]]
    User-agent: Googlebot
    Disallow: /*?*

is last (after User-agent: *) and seems to suggest that the syntax is OK.

I also tried

    User-agent: Googlebot
    Disallow: /*?

but it hasn't helped.

I asked google to review it via the automatic URL removal system (http://services.google.com/urlconsole/controller). Result:

    URLs cannot have wild cards in them (e.g. "*"). The following line contains a wild card:
    DISALLOW: /*?

How insane is that?

Oh, and while /*?* wasn't per their example, it was legal per their syntax, just as /*? is!

The site has around 35,000 pages, and I don't think a small robots.txt to do what I want is possible without using the wildcard extension.
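For anyone trying to follow the disagreement, here is a rough Python sketch of the two readings of `Disallow: /*?` discussed in this thread: the wildcard extension, where `*` matches any run of characters, and the original convention, where the pattern is a literal path prefix. It also sketches the section-selection behavior the documentation is said to describe, where a Googlebot-specific section is used even when it comes after `User-agent: *`. The function names, the sample paths, and the `/cgi-bin/` rule in the `*` section are assumptions made up for illustration. This is not Googlebot's actual implementation.

```python
import re


def wildcard_blocked(path, pattern):
    """Match with the '*' wildcard extension: '*' stands for any run of characters."""
    if not pattern:
        return False  # an empty Disallow value blocks nothing
    # Escape each literal piece, then join the pieces with '.*' for every '*'.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    # re.match anchors at the start of the path only, so the rule acts as a prefix pattern.
    return re.match(regex, path) is not None


def literal_blocked(path, pattern):
    """Match under the original convention: the pattern is a plain path prefix."""
    return bool(pattern) and path.startswith(pattern)


def select_rules(groups, agent):
    """Pick the Disallow rules for a crawler, preferring its own section over '*'.

    groups is a list of (user_agent_token, patterns) in file order.
    """
    fallback = []
    for token, patterns in groups:
        if token == "*":
            fallback = patterns
        elif token.lower() in agent.lower():
            return patterns  # a specific section wins wherever it appears in the file
    return fallback


# Hypothetical file contents; only the Googlebot section comes from the post.
GROUPS = [
    ("*", ["/cgi-bin/"]),      # placeholder rule, an assumption for illustration
    ("Googlebot", ["/*?"]),
]

if __name__ == "__main__":
    rules = select_rules(GROUPS, "Googlebot/2.1")
    for path in ["/index.php?title=Main_Page&action=edit", "/Main_Page"]:
        print(path,
              any(wildcard_blocked(path, r) for r in rules),
              any(literal_blocked(path, r) for r in rules))
```

Under the wildcard reading, any path containing "?" is blocked and plain page paths are not. Under the literal reading, the same rule only blocks paths that begin with the three characters `/*?`, which is effectively nothing on a real site. That gap would be consistent with the symptoms described here if the crawler or the removal tool treats `*` literally.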