Archive for August, 2008

This week in Bot Oddities

ShopWikiBot still forgets file extensions

The first time ShopWikiBot hit a site of mine, I was surprised to see it requesting URLs on our IIS servers with no file extension. The requests would have returned valid 200 responses for real documents, except that ShopWikiBot was repeatedly leaving the “.asp” off the URL.

I asked the ShopWiki people what the heck they were up to, and they provided me with a prompt, personalized and uninteresting response.

Hi Corey,

Thank you for letting us know about our crawler’s behavior. Our crawler tries to find the most efficient path to crawl your site, and occasionally tries invalid paths. It does quickly detect the error and corrects itself, so you should see invalid urls like this only very rarely. We apologize for any inconvenience this has caused and please let us know if this problem persists for an unreasonable length of time.

Regards,

Lauren

The requests have not stopped, and I remain intrigued. I am always guilty of thinking too hard about problems, so I cannot resist. What crawling strategy might this be? Is the ShopWikiBot simply drunk? Could a directory matching every file name on a website indicate something more about its structure?


404s at fdfdkll.html

Do you know where to find fdfdkll.html? Googlebot thought he knew where to look, but he was wrong! Perhaps a Googlebot imposter is testing his 404 crawling technique.

What is fdfdkll.html?

If you are reading this, perhaps you also have pondered this question and searched for any mention of fdfdkll.html on the web. I am writing only because I have exhausted that search. Your guess is as good as mine.

Here’s the user agent and IP of the alleged Googlebot that requested fdfdkll.html from two of my sites this week:


Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
66.249.71.206

Blacklisting via Ionic’s Isapi Rewrite Filter

In IIS, banning IP addresses from accessing a website is fairly easy. I rarely do this, however, because I prefer to use a combination of an IP address and a user agent string to identify bad bots that are likely scraping my content or attempting to harvest email addresses.

I try to avoid blocking on an IP address alone at all costs. IP addresses can be forged and changed, so I prefer to rely on an IP address and user agent string combination to identify the culprit that I want to exile. This approach is not foolproof, but I find it to be much more reliable.

Scalability is also an issue. Using an ISAPI filter to process requests for every website on the server from a single file sure makes life easy. The Microsoft IIS configuration console is a mouse-click nightmare on a server with a couple hundred websites.

I use Ionic’s Isapi Rewrite Filter to change the URL structure of websites to be more search engine friendly. This filter uses the PCRE library, and the use of regular expressions is always a huge plus. The rewriting rules are maintained inside one .ini file, so tweaks and updates are a breeze.

Here is a rewrite rule for Ionic’s filter that will let you block access to every site on your server based on an IP address and user agent string match. In this particular case, I am blocking an email address harvester with IP 24.132.226.94 and user agent Java/1.6.0-oem.


RewriteCond %{REMOTE_ADDR} ^24\.132\.226\.94$
RewriteCond %{HTTP_USER_AGENT} Java/1\.6\.0-oem
RewriteRule ^/(.*)$ /$1 [F]

The two conditions use server variables to test the visitor’s IP address and user agent string against regular expressions, and both must match for the rule to apply. The final line is the rewrite rule, whose pattern matches any URL on any website. The [F] flag tells Ionic’s filter to return an HTTP status code of 403 Forbidden.
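To make the "both conditions must match" logic concrete, here is a minimal Python sketch of the same decision. It is not how Ionic's filter is implemented; the names BLACKLIST, is_forbidden, and status_for are my own, and the patterns assume the dots are escaped and the IP is anchored so that only the exact address matches.

```python
import re

# Hypothetical blacklist: each entry pairs an IP pattern with a user agent pattern.
# Dots are escaped so they match a literal "." rather than any character.
BLACKLIST = [
    (re.compile(r"^24\.132\.226\.94$"), re.compile(r"Java/1\.6\.0-oem")),
]


def is_forbidden(remote_addr: str, user_agent: str) -> bool:
    """Return True only when BOTH the IP and the user agent match one entry."""
    return any(
        ip_re.search(remote_addr) and ua_re.search(user_agent)
        for ip_re, ua_re in BLACKLIST
    )


def status_for(remote_addr: str, user_agent: str) -> int:
    """Mimic the effect of the [F] flag: 403 for a blacklisted pair, 200 otherwise."""
    return 403 if is_forbidden(remote_addr, user_agent) else 200
```

A request from 24.132.226.94 with a browser user agent still gets through, as does the same Java user agent from a different address. Only the combination is exiled.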

Regular expressions provide the capability to block a range of IP addresses and partial user agent matches. If I wanted to match on any version of this Java-based robot, I could expand the second condition to something like this:


RewriteCond %{HTTP_USER_AGENT} Java/\d\.\S*

Similarly, wildcard matches on IP addresses can be used to block ranges of IPs instead of a single address.
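Patterns like these are easy to get subtly wrong, so I like to try them against sample strings before they go anywhere near a server. The sketch below uses Python's re module, which agrees with PCRE for expressions this simple. The IP range (every address in 24.132.226.x) and the function names are hypothetical examples of mine, not rules I actually run.

```python
import re

# Any version of the Java-based robot: "Java/", a digit, a literal dot, then the rest.
JAVA_UA = re.compile(r"Java/\d\.\S*")

# Hypothetical range: 24.132.226.0 through 24.132.226.255.
# Anchored so that a longer address such as 124.132.226.94 does not slip in.
IP_RANGE = re.compile(r"^24\.132\.226\.\d{1,3}$")


def matches_java_bot(user_agent: str) -> bool:
    """True when the user agent contains a Java/<version> token."""
    return JAVA_UA.search(user_agent) is not None


def in_blocked_range(remote_addr: str) -> bool:
    """True when the address falls inside the hypothetical blocked range."""
    return IP_RANGE.search(remote_addr) is not None
```

In principle the same range pattern could follow RewriteCond %{REMOTE_ADDR} in the .ini file, though I have only exercised it here in Python.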

The Microsoft vs *NIX server debate will never die. I use both every day, and I find that the biggest advantage that the open source server environment has over Microsoft is the interface. Using the Ionic’s ISAPI filter allows me to control the URL structure and blacklist for all of my websites easily and efficiently.

I see this method of blocking IPs or blacklisting bots based on IP address and user agent as a great way to simulate an .htaccess approach to the same problem on a Microsoft server.

How to report an online scam

If you think you are dealing with or have sent money to an online scammer, here is how to report the incident to state and federal authorities.

Where to report online scams to US authorities

  1. File a complaint at the Internet Crime Complaint Center, ic3.gov
  2. File a complaint using the FTC’s Complaint Assistant at ftccomplaintassistant.gov
  3. Find the website of your state Attorney General and file a report there if possible. A phone call may be required.

Alert website owners via email

  • abuse@craigslist.org
  • spoof@ebay.com
  • fraudwatch@autotrader.com

CatchBot lies

From catchbot.com:

Will my servers or site be affected by CatchBot?

It is unlikely that your servers or site will be affected by CatchBot. As Catchbot typically only crawls a few pages of any website slowly, CatchBot has no material negative impact on the load on your servers or the availability of your website.

…and by slowly they mean only 2 requests per second, even if the target URL is a 404 every time.