How to Block Java user-agents
A variety of user-agents that begin with “Java” are likely visiting your website. Requests carrying this type of user-agent come from programs written in Java whose developers did not change the default user-agent string. Here is a list of the Java user-agents I have encountered:
Java/1.4.1_04
Java/1.5.0_02
Java/1.5.0_06
Java/1.5.0_14
Java/1.6.0_02
Java/1.6.0_03
Java/1.6.0_04
Java/1.6.0_07
Java/1.6.0_11
Java/1.6.0_12
Java/1.6.0-oem
I will maintain this list simply for kicks. There is no need to collect an exhaustive list of these user-agent strings in order to block them. As I have mentioned before, I prefer to ban non-human visitors based on a combination of an IP address and a user-agent string.
Here are some URL rewriting conditions and rules that will match a list of IP addresses and any user-agent that begins with “Java” and deliver a 403 Forbidden response for any HTTP request to your server:
RewriteCond %{REMOTE_ADDR} ^24\.132\.226\.94$ [OR]
RewriteCond %{REMOTE_ADDR} ^24\.182\.45\.28$ [OR]
RewriteCond %{REMOTE_ADDR} ^62\.163\.80\.226$ [OR]
RewriteCond %{REMOTE_ADDR} ^72\.167\.115\.65$ [OR]
RewriteCond %{REMOTE_ADDR} ^72\.167\.249\.81$ [OR]
RewriteCond %{REMOTE_ADDR} ^72\.167\.251\.32$ [OR]
RewriteCond %{REMOTE_ADDR} ^80\.57\.190\.67$ [OR]
RewriteCond %{REMOTE_ADDR} ^89\.122\.29\.31$ [OR]
RewriteCond %{REMOTE_ADDR} ^89\.122\.29\.32$ [OR]
RewriteCond %{REMOTE_ADDR} ^89\.122\.29\.61$ [OR]
RewriteCond %{REMOTE_ADDR} ^89\.122\.29\.76$ [OR]
RewriteCond %{REMOTE_ADDR} ^89\.122\.29\.77$ [OR]
RewriteCond %{REMOTE_ADDR} ^89\.122\.29\.79$ [OR]
RewriteCond %{REMOTE_ADDR} ^89\.122\.29\.81$ [OR]
RewriteCond %{REMOTE_ADDR} ^89\.122\.29\.82$ [OR]
RewriteCond %{REMOTE_ADDR} ^89\.122\.224\.52$ [OR]
RewriteCond %{REMOTE_ADDR} ^89\.123\.0\.95$ [OR]
RewriteCond %{REMOTE_ADDR} ^208\.69\.125\.202$ [OR]
RewriteCond %{REMOTE_ADDR} ^208\.109\.127\.182$ [OR]
RewriteCond %{REMOTE_ADDR} ^213\.93\.196\.155$
RewriteCond %{HTTP_USER_AGENT} ^Java
RewriteRule ^ - [F]
The first twenty rewrite conditions are joined with [OR], so the request must originate from any one of the listed IP addresses. Because the last IP condition carries no [OR] flag, that whole group is then combined (AND) with the final condition, which matches any user-agent string that begins with “Java”, no matter what comes later.
Finally, the rewrite rule answers the request, whatever location was asked for, with a 403 Forbidden response code. The URL is not changed and no document is delivered.
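When several blocked addresses share a prefix, the chain of conditions can be shortened with regular-expression alternation. The following is only a sketch of that idea, using a subset of the addresses listed above. It assumes mod_rewrite is loaded and that the directives sit in a context where RewriteEngine may be switched on; the ^ and $ anchors and the `RewriteRule ^ - [F]` form are choices made for this illustration.

```apache
RewriteEngine On

# Eight addresses in 89.122.29.x collapsed into one condition via alternation.
RewriteCond %{REMOTE_ADDR} ^89\.122\.29\.(31|32|61|76|77|79|81|82)$ [OR]
# Three addresses in 72.167.x.x, grouped the same way.
RewriteCond %{REMOTE_ADDR} ^72\.167\.(115\.65|249\.81|251\.32)$ [OR]
# The last IP condition carries no [OR] flag ...
RewriteCond %{REMOTE_ADDR} ^213\.93\.196\.155$
# ... so the whole OR group is ANDed with the user-agent test.
RewriteCond %{HTTP_USER_AGENT} ^Java
# "-" means no substitution; [F] sends 403 Forbidden.
RewriteRule ^ - [F]
```

The trade-off is readability: one line per address is easier to scan and to diff, while grouped patterns keep the rule set short.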
To maintain this blacklist as new Java bots are encountered, simply add another rewrite condition with the [OR] flag at the top of the list to match the new IP address of the latest Java bot hitting your server.
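As a sketch of that maintenance step, suppose a new Java bot shows up from 203.0.113.7. That address is hypothetical, taken from a range reserved for documentation, and stands in for whatever address appears in your own logs. The new condition needs the [OR] flag so that it joins the existing group rather than becoming an extra AND requirement:

```apache
# Hypothetical new entry; 203.0.113.7 is a placeholder address.
RewriteCond %{REMOTE_ADDR} ^203\.0\.113\.7$ [OR]
# ... the existing IP conditions follow here, unchanged ...
RewriteCond %{REMOTE_ADDR} ^213\.93\.196\.155$
RewriteCond %{HTTP_USER_AGENT} ^Java
RewriteRule ^ - [F]
```

Adding the line at the top avoids having to touch the final IP condition, which must stay without [OR].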
Why block Java bots?
Bots with a well-defined purpose will typically identify themselves with a unique name. These Java user-agents are either not interested in identifying their purpose or not ready to publish their name and take ownership of the crawling activities. Both cases are a waste of bandwidth. Test your new application on someone else’s website. Play with your shady crawler on someone else’s website. Come back when you are willing to identify yourself.
Comments(3)
Thanks for the Java bot explanation. Recently I too have found these and other robots on a site.
Many robots steal the contents of a site's pages, so the decision to block them is correct.
But I do it a little differently, because rewrite engine rules are not convenient for blocking ranges of IP addresses. Instead I use a PHP script, something like:
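$block = array(
    "84.120.0.0-84.123.255.255",
    "122.198.0.0-122.198.255.255",
    "205.209.128.0-205.209.191.255"
);
// Return true if $ip falls inside any of the blocked ranges.
function checkIP($ip) {
    global $block;
    $IP = ip2long($ip);
    for ($i = 0; $i < count($block); $i++) {
        // Split "start-end" and compare the addresses as integers.
        list($b, $e) = explode("-", $block[$i]);
        $b_IP = ip2long($b);
        $e_IP = ip2long($e);
        if ($IP >= $b_IP && $IP <= $e_IP) return true;
    }
    return false;
}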
“Manually” blocking by IP address and user-agent is not the best practice, so I detect robots by checking whether a visitor loads a pseudo-picture and evaluates JavaScript. But the Java bots loaded all the pseudo-pictures and evaluated the JavaScript!
One way to detect Java bots is by the User-Agent field, but it is not so difficult to change this field.
What to do in this case?
Yuri, rewrite rules can be implemented to block IP address ranges:
RewriteCond %{REMOTE_ADDR} ^213\.93\.196\.\d\d?\d?$
RewriteCond %{HTTP_USER_AGENT} ^Java
RewriteRule ^ - [F]
\d represents a single digit in regular expressions, and a question mark ? makes the preceding character optional, so \d\d?\d? matches one to three digits.
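On Apache 2.4 or later, a range can also be written in CIDR notation instead of as a regular expression. This is a sketch under that version assumption, not something the rules above rely on. It uses the `expr` form of RewriteCond with the `-ipmatch` operator, and the three CIDR blocks are a translation of the ranges in the PHP array above:

```apache
RewriteEngine On

# 84.120.0.0 - 84.123.255.255
RewriteCond expr "%{REMOTE_ADDR} -ipmatch '84.120.0.0/14'" [OR]
# 122.198.0.0 - 122.198.255.255
RewriteCond expr "%{REMOTE_ADDR} -ipmatch '122.198.0.0/16'" [OR]
# 205.209.128.0 - 205.209.191.255
RewriteCond expr "%{REMOTE_ADDR} -ipmatch '205.209.128.0/18'"
# Still require the Java user-agent before refusing the request.
RewriteCond %{HTTP_USER_AGENT} ^Java
RewriteRule ^ - [F]
```

Ranges that do not fall on a CIDR boundary would need to be split into several blocks, in which case the regular-expression form may be simpler.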
[...] access my web site? I’ve also decided to block access to my web site by Java user agents. See How To Block Java User-Agents for someone else’s similar approach to the Java [...]