How to Block Java user-agents
A variety of user-agents that begin with “Java” are likely visiting your website. Visits with this type of user-agent come from programs written in Java by developers who chose not to change the default user-agent string. Here is a list of the Java user-agents I have encountered:
Java/1.4.1_04
Java/1.5.0_02
Java/1.5.0_06
Java/1.6.0_02
Java/1.6.0_03
Java/1.6.0_04
Java/1.6.0_07
Java/1.6.0-oem
I will maintain this list simply for kicks. There is no need to collect an exhaustive list of these user-agent strings in order to block them. As I have mentioned before, I prefer to ban non-human visitors based on a combination of an IP address and a user-agent string.
Here are some mod_rewrite conditions and a rule that deliver a 403 Forbidden response to any HTTP request that comes from one of a list of IP addresses and carries a user-agent beginning with “Java”:
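RewriteEngine On
# Match a request from any one of these IP addresses (dots escaped, patterns anchored)...
RewriteCond %{REMOTE_ADDR} ^24\.132\.226\.94$ [OR]
RewriteCond %{REMOTE_ADDR} ^24\.182\.45\.28$ [OR]
RewriteCond %{REMOTE_ADDR} ^62\.163\.80\.226$ [OR]
RewriteCond %{REMOTE_ADDR} ^89\.122\.29\.82$ [OR]
RewriteCond %{REMOTE_ADDR} ^208\.69\.125\.202$
# ...whose user-agent begins with "Java".
RewriteCond %{HTTP_USER_AGENT} ^Java
# Answer with 403 Forbidden; "-" leaves the URL unchanged.
RewriteRule ^ - [F]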
The first five rewrite conditions are joined with OR, so the request must originate from any one of the five IP addresses. The last condition has no OR flag, so it must hold in addition to the IP address match: the user-agent string has to begin with “Java”, no matter what comes later.
Finally, the rewrite rule answers the request with a 403 Forbidden response code, whatever location was requested. There will be no change made to the URL and no document delivered.
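To maintain this blacklist as new Java bots are encountered, add another rewrite condition for the IP address of the latest Java bot hitting your server. Place it with the other IP address conditions, and make sure every IP address condition except the last one carries the [OR] flag.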
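As a rough sketch of that maintenance step, here is how the rule set might look after adding one more bot. The address 192.0.2.10 is hypothetical: it comes from a range reserved for documentation, and it is not an address I have seen in my logs. The patterns are anchored and the dots escaped so that each condition matches exactly one address.

```apache
RewriteEngine On
RewriteCond %{REMOTE_ADDR} ^24\.132\.226\.94$ [OR]
RewriteCond %{REMOTE_ADDR} ^24\.182\.45\.28$ [OR]
RewriteCond %{REMOTE_ADDR} ^62\.163\.80\.226$ [OR]
RewriteCond %{REMOTE_ADDR} ^89\.122\.29\.82$ [OR]
# This condition was previously the last IP address, so it gains [OR]
RewriteCond %{REMOTE_ADDR} ^208\.69\.125\.202$ [OR]
# Hypothetical new Java bot (documentation-only address, replace with the real one)
RewriteCond %{REMOTE_ADDR} ^192\.0\.2\.10$
RewriteCond %{HTTP_USER_AGENT} ^Java
RewriteRule ^ - [F]
```

The detail to watch is the [OR] flag. If the new condition is appended without giving the previous last IP address an [OR], the two conditions are combined with AND, no single request can match both addresses, and the rule stops blocking anything.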
Why block Java bots?
Bots with a well-defined purpose will typically identify themselves with a unique name. These Java user-agents are either not interested in identifying their purpose or not ready to publish their name and take ownership of the crawling activities. Both cases are a waste of bandwidth. Test your new application on someone else’s website. Play with your shady crawler on someone else’s website. Come back when you are willing to identify yourself.

