This week in Bot Oddities
ShopWikiBot still forgets file extensions
The first time ShopWikiBot hit a site of mine, I was surprised to see it requesting URLs on our IIS servers for files with no extension. The requests would otherwise have returned valid 200 responses for real documents, but ShopWikiBot was repeatedly leaving the “.asp” out of the URL.
I asked the ShopWiki people what the heck they were up to, and they provided me with a prompt, personalized and uninteresting response.
Hi Corey,
Thank you for letting us know about our crawler’s behavior. Our crawler tries to find the most efficient path to crawl your site, and occasionally tries invalid paths. It does quickly detect the error and corrects itself, so you should see invalid urls like this only very rarely. We apologize for any inconvenience this has caused and please let us know if this problem persists for an unreasonable length of time.
Regards,
Lauren
The requests have not stopped, and I remain intrigued. I am always guilty of thinking too hard about problems, so I cannot resist. What crawling strategy might this be? Is the ShopWikiBot simply drunk? Could a directory matching every file name on a website indicate something more about its structure?
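For anyone who wants to check their own logs for the same pattern, here is a minimal Python sketch. It assumes you have already pulled the requested paths out of your logs and have a list of the documents that really exist; the function name and the sample paths are hypothetical, not anything ShopWiki or IIS provides.

```python
def missing_extension_hits(requested_paths, known_paths, ext=".asp"):
    """Return requested paths that match a real document once `ext` is appended."""
    # IIS treats paths case-insensitively, so compare in lower case.
    known = {p.lower() for p in known_paths}
    hits = []
    for path in requested_paths:
        last_segment = path.rsplit("/", 1)[-1]
        # Skip directory requests and anything that already has an extension.
        if not last_segment or "." in last_segment:
            continue
        if (path + ext).lower() in known:
            hits.append(path)
    return hits


# Hypothetical sample data, for illustration only.
requested = ["/products/widget", "/about.asp", "/nope", "/catalog/"]
known = ["/products/widget.asp", "/about.asp"]
print(missing_extension_hits(requested, known))
```

Matching against a list of real documents, rather than flagging every extensionless request, keeps ordinary typos and random probes out of the result.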
404s at fdfdkll.html
Do you know where to find fdfdkll.html? Googlebot thought he knew where to look, but he was wrong! Perhaps a Googlebot imposter is testing his 404 crawling technique.
What is fdfdkll.html?
If you are reading this, perhaps you also have pondered this question and searched for any mention of fdfdkll.html on the web. I am writing only because I have exhausted that search. Your guess is as good as mine.
Here’s the user agent and IP of the alleged Googlebot that requested fdfdkll.html from two of my sites this week:
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
66.249.71.206
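Since a user agent string is trivial to fake, one way to test the imposter theory is the reverse-then-forward DNS check that Google describes for verifying its crawler. The sketch below is a rough version of that idea, not a definitive implementation: the function name is my own, and the accepted host suffixes are an assumption based on Google's published guidance. The two lookups can be swapped out so the logic can be tried without touching the network.

```python
import socket

# Assumed suffixes for genuine Googlebot hosts, per Google's published guidance.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_real_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """True if `ip` reverse-resolves to a Google host that resolves back to `ip`."""
    # Default to real DNS lookups; callers may inject stand-ins for testing.
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]
    try:
        host = reverse_lookup(ip)
    except OSError:  # no PTR record, or the lookup failed
        return False
    if not host.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
        return False
    try:
        # The forward lookup must include the original IP, otherwise the
        # PTR record could have been set by whoever controls the address.
        return ip in forward_lookup(host)
    except OSError:
        return False


if __name__ == "__main__":
    # Performs live DNS lookups; the result depends on current DNS records.
    print(is_real_googlebot("66.249.71.206"))
```

The forward lookup is the step that matters: anyone who controls an IP block can point its reverse record at a googlebot.com name, but they cannot make that name resolve back to their address.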


Yes, I’ve seen it also on my website.
Here’s a quote from the WebmasterWorld forum (you need to register to view incrediBILL’s post):
“Google does often test sites with page names designed to deliver a 404 response to see how your server handle the 404 response.
Some sites return a 200 OK with some other page, typically the home page, to stop visitor bounce and Google identifies this so it doesn’t index the same page multiple times and possible make the wrong page supplemental.”
Hope this helps!
Manu, thanks for your comments. I am familiar with Google’s 404 tests, but they usually look like this:
http://www.domain.com/noexist_c5a1437d8dcd9bcf.html
The user agent is typically “Google-Sitemaps/1.0”, which is why I thought the fdfdkll.html requests were strange.
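To see how your own server answers this kind of test, you can send it a similar probe yourself. The following Python sketch builds a random URL in the same shape as the example above and labels the response. The function names and the labels are my own invention, and I am only guessing that sixteen random hex characters is a fair imitation of what Google generates.

```python
import secrets
import urllib.error
import urllib.request


def make_probe_url(base_url):
    """Build a URL that almost certainly does not exist on the site."""
    # token_hex(8) yields 16 hex characters, like the example probe above.
    return f"{base_url.rstrip('/')}/noexist_{secrets.token_hex(8)}.html"


def classify_status(status):
    """Label the final HTTP status returned for a nonexistent page."""
    if status in (404, 410):
        return "hard 404"
    if 200 <= status < 300:
        return "soft 404"  # the server served a page for a missing URL
    return "other"


def probe(base_url, timeout=10):
    """Request a random nonexistent page and classify the final response."""
    url = make_probe_url(base_url)
    try:
        # urlopen follows redirects, so a redirect to the home page ends as 200.
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as err:
        status = err.code  # 4xx and 5xx responses arrive as exceptions
    return url, classify_status(status)
```

Calling probe("http://www.domain.com") makes a real request, so point it only at a site you run.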
I have seen it too: fdfdkll.html. This is weird. UA was Googlebot, not the sitemaps bot, IP from Google Inc.