<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: User-agent: PRCrawler/Nutch-0.9</title>
	<atom:link href="http://www.tacticaltechnique.com/search-engines/user-agent-prcrawlernutch-09/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.tacticaltechnique.com/search-engines/user-agent-prcrawlernutch-09/</link>
	<description>Web development with Corey Salzano</description>
	<lastBuildDate>Fri, 03 Feb 2012 02:45:21 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
	<item>
		<title>By: Corey</title>
		<link>http://www.tacticaltechnique.com/search-engines/user-agent-prcrawlernutch-09/#comment-294</link>
		<dc:creator>Corey</dc:creator>
		<pubDate>Tue, 15 Apr 2008 21:19:50 +0000</pubDate>
		<guid isPermaLink="false">http://www.tacticaltechnique.com/search-engines/user-agent-prcrawlernutch-09/#comment-294</guid>
		<description>It depends. Each of the major search engines treats robots files differently, and other bots like this one may not obey your robots rules at all.

According to robotstxt.org, the de facto authority on the subject, wildcards in the user-agent string like this are not recommended:

User-agent: PRCrawler/Nutch-0.9*
Disallow: /

Here I quote from robotstxt.org:

&quot;Specifically, you cannot have lines like &quot;User-agent: *bot*&quot;, &quot;Disallow: /tmp/*&quot; or &quot;Disallow: *.gif&quot;.&quot;

I deviate from this recommendation and use lines like &quot;Disallow: details.asp*&quot; to block access to any URL at details.asp on sites that use query string variables, such as details.asp?something=something&amp;this=that.

Those directives work for me. Like any web standard, though, robots rules are never universally honored. I recommend testing your robots files for accuracy when you can, and blocking the IP addresses of bots that don&#039;t obey your rules.

I have only been hit by this particular user agent, so I used the exact, longer rule shown in my post above.

I have exchanged emails with Kelvin, so more information may come to light, but this bot is still in development, so who knows whether the user agent will be the same tomorrow as it is today.</description>
		<content:encoded><![CDATA[<p>It depends. Each of the major search engines treats robots files differently, and other bots like this one may not obey your robots rules at all.</p>
<p>According to robotstxt.org, the de facto authority on the subject, wildcards in the user-agent string like this are not recommended:</p>
<p>User-agent: PRCrawler/Nutch-0.9*<br />
Disallow: /</p>
<p>Here I quote from robotstxt.org:</p>
<p>&#8220;Specifically, you cannot have lines like &#8220;User-agent: *bot*&#8221;, &#8220;Disallow: /tmp/*&#8221; or &#8220;Disallow: *.gif&#8221;.&#8221;</p>
<p>I deviate from this recommendation and use lines like &#8220;Disallow: details.asp*&#8221; to block access to any URL at details.asp on sites that use query string variables, such as details.asp?something=something&#038;this=that.</p>
<p>Those directives work for me. Like any web standard, though, robots rules are never universally honored. I recommend testing your robots files for accuracy when you can, and blocking the IP addresses of bots that don&#8217;t obey your rules.</p>
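<p>As a rough sketch of what testing a robots file can look like, the snippet below feeds a rule set to Python's standard-library urllib.robotparser and asks whether a given user agent may fetch a URL. The helper is_allowed, the ROBOTS_TXT sample, and the example.com URL are made up for illustration. The sample uses the short token PRCrawler rather than the full string because, as far as I know, this parser compares rules against the product name before the slash. It only shows how this one parser reads the rules, not how the bot itself behaves.</p>

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule set for illustration; the short "PRCrawler" token is an
# assumption chosen to suit how this particular parser matches user agents.
ROBOTS_TXT = """\
User-agent: PRCrawler
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if this parser thinks user_agent may fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "http://www.example.com/details.asp"  # placeholder URL
    for agent in ("PRCrawler/Nutch-0.9", "SomeOtherBot/1.0"):
        print(agent, is_allowed(ROBOTS_TXT, agent, url))
```

<p>Other parsers, and the bots themselves, may read the same file differently, so a check like this is a sanity test rather than a guarantee.</p>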
<p>I have only been hit by this particular user agent, so I used the exact, longer rule shown in my post above.</p>
<p>I have exchanged emails with Kelvin, so more information may come to light, but this bot is still in development, so who knows whether the user agent will be the same tomorrow as it is today.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Karl</title>
		<link>http://www.tacticaltechnique.com/search-engines/user-agent-prcrawlernutch-09/#comment-293</link>
		<dc:creator>Karl</dc:creator>
		<pubDate>Tue, 15 Apr 2008 18:46:24 +0000</pubDate>
		<guid isPermaLink="false">http://www.tacticaltechnique.com/search-engines/user-agent-prcrawlernutch-09/#comment-293</guid>
		<description>Just spotted this one in my robots.txt

Won&#039;t the following just do for blocking it in my robots.txt?

User-agent: PRCrawler/Nutch-0.9</description>
		<content:encoded><![CDATA[<p>Just spotted this one in my robots.txt</p>
<p>Won&#8217;t the following just do for blocking it in my robots.txt?</p>
<p>User-agent: PRCrawler/Nutch-0.9</p>
]]></content:encoded>
	</item>
</channel>
</rss>

