<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Blacklisting via Ionic&#8217;s Isapi Rewrite Filter</title>
	<atom:link href="http://www.tacticaltechnique.com/bots/blacklisting-via-iirf/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.tacticaltechnique.com/bots/blacklisting-via-iirf/</link>
	<description>Web Development Observations and Asides by Corey Salzano</description>
	<lastBuildDate>Mon, 30 Aug 2010 07:07:34 -0400</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: Corey</title>
		<link>http://www.tacticaltechnique.com/bots/blacklisting-via-iirf/comment-page-1/#comment-32697</link>
		<dc:creator>Corey</dc:creator>
		<pubDate>Mon, 26 Jul 2010 14:35:18 +0000</pubDate>
		<guid isPermaLink="false">http://www.tacticaltechnique.com/robots/blacklisting-via-ionics-isapi-rewrite-filter/#comment-32697</guid>
		<description>debi:

You can definitely use IIRF to block ranges with exclusions. Here&#039;s an example I wrote for the Wget bot:

#wget 
RewriteCond %{HTTP_USER_AGENT} Wget.*
#allow ip range 162.69.226.0 to 162.69.226.24
RewriteCond %{REMOTE_ADDR} ^(?!162\.69\.226\.([0-9]&#124;1[0-9]&#124;2[0-4])$)([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})(.*)$
#allow 64.170.133.110
RewriteCond %{REMOTE_ADDR} ^(?!64\.170\.133\.110$)([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})(.*)$
RewriteRule ^/(.*)$ /$1 [F]

These rules will block all Wget bots except when the IP address matches the exceptions.</description>
		<content:encoded><![CDATA[<p>debi:</p>
<p>You can definitely use IIRF to block ranges with exclusions. Here&#8217;s an example I wrote for the Wget bot:</p>
<p>#wget<br />
RewriteCond %{HTTP_USER_AGENT} Wget.*<br />
#allow ip range 162.69.226.0 to 162.69.226.24<br />
RewriteCond %{REMOTE_ADDR} ^(?!162\.69\.226\.([0-9]|1[0-9]|2[0-4])$)([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})(.*)$<br />
#allow 64.170.133.110<br />
RewriteCond %{REMOTE_ADDR} ^(?!64\.170\.133\.110$)([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})(.*)$<br />
RewriteRule ^/(.*)$ /$1 [F]</p>
<p>These rules will block all Wget bots except when the IP address matches the exceptions.</p>
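<p>For the case in the question (block 127.0.10.* but let 127.0.10.11 through), a minimal, untested sketch using the same negative-lookahead approach might look like the following. It assumes the whole range should be blocked regardless of user-agent, and the dots are escaped so they only match a literal dot:</p>
<p>#block 127.0.10.* except 127.0.10.11<br />
RewriteCond %{REMOTE_ADDR} ^(?!127\.0\.10\.11$)127\.0\.10\.[0-9]{1,3}$<br />
RewriteRule ^/(.*)$ /$1 [F]</p>
<p>The $ inside the lookahead keeps the exception from also matching longer addresses such as 127.0.10.110.</p>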
]]></content:encoded>
	</item>
	<item>
		<title>By: debi</title>
		<link>http://www.tacticaltechnique.com/bots/blacklisting-via-iirf/comment-page-1/#comment-32683</link>
		<dc:creator>debi</dc:creator>
		<pubDate>Mon, 26 Jul 2010 01:57:05 +0000</pubDate>
		<guid isPermaLink="false">http://www.tacticaltechnique.com/robots/blacklisting-via-ionics-isapi-rewrite-filter/#comment-32683</guid>
		<description>Hi, I identify robots by looking at my daily logfiles and seeing what they are doing. My major problem is double-edged: I use Twitter to post crime news every minute (during peak periods on a busy crime day), which makes me a prime target, but it is also how I get the most traffic to the landing page (a dynamic URL).

At first it is painful to go through the logfiles every day, but it gets very easy once you have blocked enough of the bots. The patterns are easy to find by sorting in Excel. I did write a log analyzer because the free ones were not specific enough for my kind of traffic, but I would rather use Excel and sort on columns to see the patterns, because the patterns may change and my script isn&#039;t smart enough to notice.

What I see mostly from my logfiles are:

- same IP address hitting at minute intervals to the same URL.
- several different IP addresses in the range of xxx.xxx.xx.* or xxx.xxx.*.* hitting the server at staggered intervals, so in the log file it looks like a human is hitting the URL. But when you look at the range of IP addresses sorted by time, you can tell that it&#039;s a robot doing the equivalent of a per-minute hit to the same URL.

Honeypotting your pages so the log file reflects how the person is hitting the page is important. That way it supports your guess that what you are seeing is more than likely a robot.

I also wanted to caution that not all robots are Java. There are a lot of languages out there; that&#039;s why I filter by behaviour.

My site is relatively new and so is my use of IIRF. I would like to let one IP address out of a range of addresses use my site, like 127.0.10.11 when I have blocked the range 127.0.10.*. The reason is that I block a range of IP addresses when I see that the robot is using more than three IP addresses. Does anyone know how to do that? Thank you.</description>
		<content:encoded><![CDATA[<p>Hi, I identify robots by looking at my daily logfiles and seeing what they are doing. My major problem is double-edged: I use Twitter to post crime news every minute (during peak periods on a busy crime day), which makes me a prime target, but it is also how I get the most traffic to the landing page (a dynamic URL).</p>
<p>At first it is painful to go through the logfiles every day, but it gets very easy once you have blocked enough of the bots. The patterns are easy to find by sorting in Excel. I did write a log analyzer because the free ones were not specific enough for my kind of traffic, but I would rather use Excel and sort on columns to see the patterns, because the patterns may change and my script isn&#8217;t smart enough to notice.</p>
<p>What I see mostly from my logfiles are:</p>
<p>- same IP address hitting at minute intervals to the same URL.<br />
- several different IP addresses in the range of xxx.xxx.xx.* or xxx.xxx.*.* hitting the server at staggered intervals, so in the log file it looks like a human is hitting the URL. But when you look at the range of IP addresses sorted by time, you can tell that it&#8217;s a robot doing the equivalent of a per-minute hit to the same URL.</p>
<p>Honeypotting your pages so the log file reflects how the person is hitting the page is important. That way it supports your guess that what you are seeing is more than likely a robot.</p>
<p>I also wanted to caution that not all robots are Java. There are a lot of languages out there; that&#8217;s why I filter by behaviour.</p>
<p>My site is relatively new and so is my use of IIRF. I would like to let one IP address out of a range of addresses use my site, like 127.0.10.11 when I have blocked the range 127.0.10.*. The reason is that I block a range of IP addresses when I see that the robot is using more than three IP addresses. Does anyone know how to do that? Thank you.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Cheeso</title>
		<link>http://www.tacticaltechnique.com/bots/blacklisting-via-iirf/comment-page-1/#comment-5466</link>
		<dc:creator>Cheeso</dc:creator>
		<pubDate>Tue, 30 Jun 2009 01:19:11 +0000</pubDate>
		<guid isPermaLink="false">http://www.tacticaltechnique.com/robots/blacklisting-via-ionics-isapi-rewrite-filter/#comment-5466</guid>
		<description>@daniel, yes it&#039;s possible to do what you want.  You need a RedirectRule, to redirect from /default/page.htm (in case anyone types it in), and then you need a RewriteRule to rewrite mysite.com to /default/page.htm or whatever it was you wanted to be your default page. 

This is all in the readme.</description>
		<content:encoded><![CDATA[<p>@daniel, yes it&#8217;s possible to do what you want.  You need a RedirectRule, to redirect from /default/page.htm (in case anyone types it in), and then you need a RewriteRule to rewrite mysite.com to /default/page.htm or whatever it was you wanted to be your default page. </p>
<p>This is all in the readme.</p>
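<p>As a rough sketch of those two rules (the paths are the placeholders from above, and the flags are assumptions to check against the readme for your IIRF version):</p>
<p>#send anyone who requests the page directly back to the site root<br />
RedirectRule ^/default/page\.htm$ / [R=301]<br />
#serve the default page for the site root without changing the visible URL<br />
RewriteRule ^/$ /default/page.htm [L]</p>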
]]></content:encoded>
	</item>
	<item>
		<title>By: Corey</title>
		<link>http://www.tacticaltechnique.com/bots/blacklisting-via-iirf/comment-page-1/#comment-4262</link>
		<dc:creator>Corey</dc:creator>
		<pubDate>Wed, 13 May 2009 13:53:14 +0000</pubDate>
		<guid isPermaLink="false">http://www.tacticaltechnique.com/robots/blacklisting-via-ionics-isapi-rewrite-filter/#comment-4262</guid>
		<description>Jim,

These bots were cluttering up my error logging database table for 404 and 500 errors on a server holding about 250 websites.

Like you, I took small steps initially in case any of the bots turned out to be legit. I have found no trace of any negative consequences.

I am due to write a new post about all the different useless user-agents that I am blocking without using an IP address match as well. The amount of crap out there is astounding.</description>
		<content:encoded><![CDATA[<p>Jim,</p>
<p>These bots were cluttering up my error logging database table for 404 and 500 errors on a server holding about 250 websites.</p>
<p>Like you, I took small steps initially in case any of the bots turned out to be legit. I have found no trace of any negative consequences.</p>
<p>I am due to write a new post about all the different useless user-agents that I am blocking without using an IP address match as well. The amount of crap out there is astounding.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: jim</title>
		<link>http://www.tacticaltechnique.com/bots/blacklisting-via-iirf/comment-page-1/#comment-4257</link>
		<dc:creator>jim</dc:creator>
		<pubDate>Wed, 13 May 2009 08:53:24 +0000</pubDate>
		<guid isPermaLink="false">http://www.tacticaltechnique.com/robots/blacklisting-via-ionics-isapi-rewrite-filter/#comment-4257</guid>
		<description>Corey, thanks for the response.

I found a site &quot;Project Honey Pot&quot; that tracks these things and confirms them by seeing if they spam or not, it might be of use (it&#039;s possible I found the link on this blog somewhere).
http://www.projecthoneypot.org/harvester_useragents.php

At this point I&#039;m not getting hit that hard by these &quot;bots&quot;, but if it becomes abusive, to the point it starts taking a lot of CPU then I will have to do something about it and appreciate the info.

Bandwidth for the few K that they take isn&#039;t a problem (yet). What I am way more worried about is losing some possible indexing that could send traffic to my site.

All they seem to do so far is load the first level pages and go no deeper, they load no images.

Yes, they should properly identify themselves, but maybe they don&#039;t want to because people might use that ID to &quot;game&quot; their system.

If I ever do start to block, I&#039;m probably going to start with the IP list from that &quot;honey pot&quot; place since those are confirmed and see how that goes.

Plus, if these guys who do this get a clue, they can just change the agent string randomly to any number of known browser types. After that the only way you could tell is by their behavior, which may not be the best indication, or by the honey pot method.</description>
		<content:encoded><![CDATA[<p>Corey, thanks for the response.</p>
<p>I found a site &#8220;Project Honey Pot&#8221; that tracks these things and confirms them by seeing if they spam or not, it might be of use (it&#8217;s possible I found the link on this blog somewhere).<br />
<a href="http://www.projecthoneypot.org/harvester_useragents.php" rel="nofollow">http://www.projecthoneypot.org/harvester_useragents.php</a></p>
<p>At this point I&#8217;m not getting hit that hard by these &#8220;bots&#8221;, but if it becomes abusive, to the point it starts taking a lot of CPU then I will have to do something about it and appreciate the info.</p>
<p>Bandwidth for the few K that they take isn&#8217;t a problem (yet). What I am way more worried about is losing some possible indexing that could send traffic to my site.</p>
<p>All they seem to do so far is load the first level pages and go no deeper, they load no images.</p>
<p>Yes, they should properly identify themselves, but maybe they don&#8217;t want to because people might use that ID to &#8220;game&#8221; their system.</p>
<p>If I ever do start to block, I&#8217;m probably going to start with the IP list from that &#8220;honey pot&#8221; place since those are confirmed and see how that goes.</p>
<p>Plus, if these guys who do this get a clue, they can just change the agent string randomly to any number of known browser types. After that the only way you could tell is by their behavior, which may not be the best indication, or by the honey pot method.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Corey</title>
		<link>http://www.tacticaltechnique.com/bots/blacklisting-via-iirf/comment-page-1/#comment-4228</link>
		<dc:creator>Corey</dc:creator>
		<pubDate>Tue, 12 May 2009 13:01:56 +0000</pubDate>
		<guid isPermaLink="false">http://www.tacticaltechnique.com/robots/blacklisting-via-ionics-isapi-rewrite-filter/#comment-4228</guid>
		<description>Jim, it is impossible to predict why Java bots are hitting your sites. Many Java developers do not specify a custom user-agent when designing their crawlers, so their requests reach your server with these default user-agents.

I think we agree that the extra load on our web servers caused by Java bots is useless, and blocking these robots is so easy that it simply makes sense.

I am about to update the post with some simplified rewrite rules that I have been using for a couple of weeks.</description>
		<content:encoded><![CDATA[<p>Jim, it is impossible to predict why Java bots are hitting your sites. Many Java developers do not specify a custom user-agent when designing their crawlers, so their requests reach your server with these default user-agents.</p>
<p>I think we agree that the extra load on our web servers caused by Java bots is useless, and blocking these robots is so easy that it simply makes sense.</p>
<p>I am about to update the post with some simplified rewrite rules that I have been using for a couple of weeks.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: jim</title>
		<link>http://www.tacticaltechnique.com/bots/blacklisting-via-iirf/comment-page-1/#comment-4221</link>
		<dc:creator>jim</dc:creator>
		<pubDate>Tue, 12 May 2009 05:55:47 +0000</pubDate>
		<guid isPermaLink="false">http://www.tacticaltechnique.com/robots/blacklisting-via-ionics-isapi-rewrite-filter/#comment-4221</guid>
		<description>I don&#039;t have any e-mails for them to find on the site, but they keep hitting. Is it really a problem? I&#039;m more worried about the CPU load a lot of extra checking might cause.
And is that all these Java bots are doing? Maybe they are part of some sort of blog site that lists stuff, or other search/sort type of programs.
At first I thought maybe they were some sort of cache system for ISPs so they could keep a local copy and save bandwidth on their network.</description>
		<content:encoded><![CDATA[<p>I don&#8217;t have any e-mails for them to find on the site, but they keep hitting. Is it really a problem? I&#8217;m more worried about the CPU load a lot of extra checking might cause.<br />
And is that all these Java bots are doing? Maybe they are part of some sort of blog site that lists stuff, or other search/sort type of programs.<br />
At first I thought maybe they were some sort of cache system for ISPs so they could keep a local copy and save bandwidth on their network.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Corey</title>
		<link>http://www.tacticaltechnique.com/bots/blacklisting-via-iirf/comment-page-1/#comment-2496</link>
		<dc:creator>Corey</dc:creator>
		<pubDate>Sat, 14 Feb 2009 16:33:50 +0000</pubDate>
		<guid isPermaLink="false">http://www.tacticaltechnique.com/robots/blacklisting-via-ionics-isapi-rewrite-filter/#comment-2496</guid>
		<description>Daniel, I have not had time to examine MCMS. You may want to post at the IIRF home page on Microsoft&#039;s Codeplex to see if anyone is running the two together. http://www.codeplex.com/IIRF</description>
		<content:encoded><![CDATA[<p>Daniel, I have not had time to examine MCMS. You may want to post at the IIRF home page on Microsoft&#8217;s Codeplex to see if anyone is running the two together. <a href="http://www.codeplex.com/IIRF" rel="nofollow">http://www.codeplex.com/IIRF</a></p>
]]></content:encoded>
	</item>
	<item>
		<title>By: daniel</title>
		<link>http://www.tacticaltechnique.com/bots/blacklisting-via-iirf/comment-page-1/#comment-2494</link>
		<dc:creator>daniel</dc:creator>
		<pubDate>Sat, 14 Feb 2009 07:43:56 +0000</pubDate>
		<guid isPermaLink="false">http://www.tacticaltechnique.com/robots/blacklisting-via-ionics-isapi-rewrite-filter/#comment-2494</guid>
		<description>Any ideas? It would be great if this is possible, if not, I will have to find some other method to accomplish this. Thanks!</description>
		<content:encoded><![CDATA[<p>Any ideas? It would be great if this is possible, if not, I will have to find some other method to accomplish this. Thanks!</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: daniel</title>
		<link>http://www.tacticaltechnique.com/bots/blacklisting-via-iirf/comment-page-1/#comment-2484</link>
		<dc:creator>daniel</dc:creator>
		<pubDate>Fri, 13 Feb 2009 03:48:09 +0000</pubDate>
		<guid isPermaLink="false">http://www.tacticaltechnique.com/robots/blacklisting-via-ionics-isapi-rewrite-filter/#comment-2484</guid>
		<description>Hi Corey, 

Thanks again for your reply. Yes, that&#039;s what I want to accomplish: to show users only my domain. However, the home/default.html path is not a &#039;physical&#039; file on my server; it is based on MCMS. I think your approach will only work if there is an existing physical file at home/default.html. I saw the generated logs and it is clearly looking for the physical file. Is my goal still possible with my current setup? Kindly advise. Thanks.</description>
		<content:encoded><![CDATA[<p>Hi Corey, </p>
<p>Thanks again for your reply. Yes, that&#8217;s what I want to accomplish: to show users only my domain. However, the home/default.html path is not a &#8216;physical&#8217; file on my server; it is based on MCMS. I think your approach will only work if there is an existing physical file at home/default.html. I saw the generated logs and it is clearly looking for the physical file. Is my goal still possible with my current setup? Kindly advise. Thanks.</p>
]]></content:encoded>
	</item>
</channel>
</rss>
