User-agent: PRCrawler/Nutch-0.9

This obscure bot popped up on my radar earlier this month. The complete user-agent string is:

PRCrawler/Nutch-0.9 (data mining development project; crawler@projectrialto.com)

The description in the string contains several clues that this bot is a waste of my bandwidth. First, Nutch is an open source search engine written in Java, so anyone can run a crawler built on it. Second, ‘data mining’ is not an exercise I am interested in assisting, especially with my server resources. Third, ‘development’ and ‘project’ both hint that this crawler is experimental and may do the world no good at all. Here is how the creators of this bot explain its purpose:

Corey,

Project Rialto is a new online security services solution provider that monetizes its infrastructure investment via relevant advertising for its users. We accomplish this in a very unobtrusive and anonymous method. Our bot is crawling in order to understand the contents of web sites our users visit to assist in serving more relevant content.

We are currently in our initial development phases. As Project Rialto approaches its market launch we’ll provide more information about our offering.
We hope this addresses your concerns; please let us know if you have any other questions.

Regards,

Kelvin Edmison
Software Architect
Project Rialto

This loosely translates to, “we scraped your site to serve someone advertisements based on its content.” I found traces of this bot in one of my error database tables, so we are certainly seeing evidence of a development phase. IncrediBILL agrees that this bot will do no good for your site, and has compiled an IP address list in his usual “get lost” fashion.

Here is some robots.txt love from me to you that will block the bot user-agent that hit me:

User-agent: PRCrawler/Nutch-0.9 (data mining development project; crawler@projectrialto.com)
Disallow: /
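
Before relying on a rule like this, it can help to see how a parser actually reads it. Below is a minimal Python sketch using the standard library's urllib.robotparser. The helper name is_allowed, the example.com URL, and the short PRCrawler token in the sample rule are assumptions for illustration. It only shows how Python's parser interprets the file; the bot itself may match the User-agent line differently, so paste in the exact rules you deploy and compare.

```python
from urllib.robotparser import RobotFileParser

# Sample rules to test. The short "PRCrawler" token is an assumption;
# replace this with the exact contents of your own robots.txt.
SAMPLE_RULES = """\
User-agent: PRCrawler
Disallow: /
"""

# The full user-agent string the bot sent, as quoted in the post.
FULL_UA = (
    "PRCrawler/Nutch-0.9 "
    "(data mining development project; crawler@projectrialto.com)"
)


def is_allowed(robots_txt, user_agent, url="http://example.com/page.html"):
    """Return True if Python's parser would let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print("PRCrawler allowed:", is_allowed(SAMPLE_RULES, FULL_UA))
    print("Other bot allowed:", is_allowed(SAMPLE_RULES, "SomeOtherBot/1.0"))
```

With the sample rules, the first check should report the bot as blocked and the second should report an unrelated bot as allowed.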

Robots.txt exclusion only works for bots that behave properly. Bad bots do not care whether you want them, and the only reliable way to keep them off your site is to block the IP addresses they use.
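
As a rough illustration of the IP approach, here is a minimal Python sketch that checks a visitor's address against a blocklist. The addresses are placeholders from the reserved documentation ranges, not the bot's real addresses, and BLOCKED_NETWORKS and is_blocked are hypothetical names. In practice this check would more likely live in your firewall or web server configuration.

```python
import ipaddress

# Placeholder blocklist: reserved documentation ranges, NOT the bot's
# real addresses. Substitute the addresses you see in your own logs.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),      # a whole range
    ipaddress.ip_network("198.51.100.17/32"),  # a single address
]


def is_blocked(remote_addr, networks=BLOCKED_NETWORKS):
    """Return True if remote_addr falls inside any blocked network."""
    addr = ipaddress.ip_address(remote_addr)
    return any(addr in network for network in networks)


if __name__ == "__main__":
    for visitor in ("192.0.2.44", "198.51.100.17", "203.0.113.9"):
        print(visitor, "blocked" if is_blocked(visitor) else "allowed")
```

Listing networks rather than single addresses keeps the list short when a bot crawls from a contiguous range.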

2 Comments so far

  1. Karl on April 15th, 2008

    Just spotted this one in my robots.txt

    Won’t the following just do for blocking it in my robots.txt?

    User-agent: PRCrawler/Nutch-0.9

  2. Corey on April 15th, 2008

    It depends. Each of the major search engines treats robots files differently, and other bots like this one may not obey your robots rules at all.

    According to the unofficial official authority, robotstxt.org, wildcards in the user-agent string like this are not recommended:

    User-agent: PRCrawler/Nutch-0.9*
    Disallow: /

    Here I quote from robotstxt.org:

    “Specifically, you cannot have lines like ‘User-agent: *bot*’, ‘Disallow: /tmp/*’ or ‘Disallow: *.gif’.”

    I deviate from this recommendation and use lines like “Disallow: details.asp*” to block access to any URL at details.asp on sites that use querystring variables, like details.asp?something=something&this=that, and those directives work.

    Like any web standard, robots.txt is never honored universally. I recommend testing your robots files for accuracy when you can, and blocking the IP addresses of bots that don’t obey your rules.

    I have only been hit by this particular user agent, so I used the exact, longer rule from my post above.

    I have exchanged emails with Kelvin, so more information may come to light, but this bot is still in its development phase, so who knows whether the user agent will be the same tomorrow as it is today.
