Bot-trap - A Bad Web-Robot Blocker

This package lets your web site automatically ban bad web robots (aka web spiders) that ignore the robots.txt file. Well-behaved robots such as Googlebot are not affected. It requires Apache and PHP, but the principles would work with any web server setup.

Three of the most common bad robots are:

Email harvesters: they want to spam you.
Corporate tattletales: they report back to corporations if you use their trademark, criticize them, violate their copyrights, and so on.
Scrapers: they copy your whole site, then set it up somewhere else and put ads on it. Bot-trap can't protect against these bots because they usually follow robots.txt.

This package will protect against email-harvesting robots whether they follow robots.txt or not:

Exclude your contact page in robots.txt.
Email harvesting bots that follow robots.txt won't get your email.
Email harvesting bots that don't follow robots.txt will quickly get banned and won't get your email.

Update, December 2006:
So many people have started using bot-trap and other bad-robot blockers that many email-harvesting robots now appear to follow robots.txt! This means that simply listing your contact page in robots.txt, as I do, will drastically cut the number of spammers that get your email, even if you don't run bot-trap, and even if you use a direct mailto: link. A partial victory!

If you speak German, there is a nice repackaged version of bot-trap with a few additions at www.spider-trap.de.

Demo

To see bot-trap in action, go to the page on my site where bad robots go: http://danielwebb.us/bot-trap/index.php. You'll be banned for going where you weren't supposed to go (didn't you read the robots.txt file?!). Then return to this page with the back button, or type the main page URL into your browser's URL bar. Reload the page, or you'll probably get the old copy cached by your browser. The 403 page will allow you to unban yourself. Bad robots shouldn't be going to that link, because my robots.txt forbids it. You were a very bad robot indeed.

How It Works

You place a small "web-bug" strategically in your web pages. This bug is just a tiny image link that points to /bot-trap/index.php. Normal people don't see this link, but web bots do.
You create a /robots.txt file that tells web bots not to go to the /bot-trap directory.
When a bad robot visits /bot-trap/index.php anyway, that script adds the robot's IP address to a block list in /.htaccess. The robot is blocked from the whole site from then on. You can also be emailed when this happens.
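The ban step can be sketched roughly as follows. The package itself is PHP; this is a Python illustration only, and the function name add_ban is hypothetical. It assumes bans are stored as "Deny from <ip>" lines, which is the form the FAQ's shell one-liner searches for.

```python
import ipaddress


def add_ban(htaccess_text: str, ip: str) -> str:
    """Return htaccess_text with a 'Deny from <ip>' rule appended.

    Hypothetical helper: the real bot-trap script is PHP and may differ.
    """
    # Validate first so a forged value can never inject extra directives.
    addr = ipaddress.ip_address(ip)  # raises ValueError on anything else
    rule = f"Deny from {addr}"

    lines = htaccess_text.splitlines()
    if rule in (line.strip() for line in lines):
        return htaccess_text  # already banned, leave the file alone

    lines.append(rule)
    return "\n".join(lines) + "\n"
```

Calling add_ban twice with the same address leaves the text unchanged the second time, so a bot that hits the trap repeatedly does not bloat the file.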

Safeguards

It is possible that someone is banned who shouldn't be. Perhaps a previous user of an IP address in a DHCP pool was a naughty user and ran a bad bot, but now the new user is banned. Not to worry, the custom "403 Forbidden" page allows any user to unban themselves by typing a requested word into a form box. Real people can easily do this, but bots can't!
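As a rough illustration of the unban safeguard (again in Python rather than the package's PHP), the sketch below removes a ban only when the visitor types the requested word. The names remove_ban and try_unban, and the case-insensitive comparison, are assumptions; forbid.php may work differently.

```python
def remove_ban(htaccess_text: str, ip: str) -> str:
    """Return htaccess_text without the 'Deny from <ip>' rule (hypothetical helper)."""
    rule = f"Deny from {ip}"
    kept = [line for line in htaccess_text.splitlines() if line.strip() != rule]
    return "\n".join(kept) + ("\n" if kept else "")


def try_unban(htaccess_text: str, ip: str, typed_word: str, expected_word: str):
    """Unban ip only if the visitor typed the requested word.

    Returns (new_text, unbanned). The comparison ignores case and
    surrounding whitespace; that leniency is an assumption of this sketch.
    """
    if typed_word.strip().lower() != expected_word.lower():
        return htaccess_text, False
    return remove_ban(htaccess_text, ip), True
```

Only the visitor's own rule is removed; every other line in the file, including other bans, is kept as it was.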

Installation

Unpack the tarball in your web page root directory:
# tar -xzf bot-trap-x.x.tar.gz
Make the bot-trap directory in your web root directory owned by the same user or group as the web server (www-data on Debian GNU/Linux). Either way, the web server user needs read access to the bot-trap directory, but it doesn't have to have write access to it.
Either add a line to your root .htaccess file like:
ErrorDocument 403 /bot-trap/forbid.php
or copy the premade one (bot-trap/htaccess-root-example). Note that once an IP is banned, it can't access anything in /, so the 403 page must live in /bot-trap, and /bot-trap/.htaccess should contain only "Allow from all". Look at the forbid.php file in the distribution to see how to do this, or just use it as-is.
Create the empty file blacklist.dat in your web root directory. The bot-trap system stores a log of bans here.
Make blacklist.dat and .htaccess writable by the web server user.
Make sure .htaccess controls are allowed in your Apache configuration (especially the "AllowOverride" directive). This allows bot-trap to ban IP addresses using the htaccess mechanism.
Edit bot-trap/settings.php to hold the correct email addresses to send alerts to.
Add "web-bugs" to your main web page to catch the bad bots. This is the XHTML code:

<!-- Bad robot trap: Don't go here or your IP will be banned! -->
<a href="/bot-trap/index.php"><img src="/bot-trap/pixel.gif" border="0"
alt=" " width="1" height="1"/></a>

Add the bot-trap directory to your robots.txt file, or copy the example robots.txt file (bot-trap/robots.txt.example) to the root directory.
Make sure /.htaccess and all other files have the correct permissions and ownership for your site.
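Before going live, it may be worth confirming that your robots.txt really forbids the trap URL, since otherwise well-behaved robots would be banned too. The Python sketch below uses the standard library's robots.txt parser. The function name and the EXAMPLE text are assumptions for illustration; the example is not necessarily what bot-trap/robots.txt.example contains.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content for illustration only.
EXAMPLE = """User-agent: *
Disallow: /bot-trap/
"""


def trap_is_disallowed(robots_txt: str,
                       trap_url: str = "/bot-trap/index.php",
                       agent: str = "*") -> bool:
    """Return True if robots_txt forbids `agent` from fetching trap_url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, trap_url)
```

If trap_is_disallowed returns False for your own robots.txt, fix the file before adding the web-bugs to your pages.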

WARNING

Warning! Don't mess with this if you don't have the ability to fix things if this breaks them! If you mess up /.htaccess, your whole site could go down.

BUGS

The directory for bot-trap is hard-coded as /bot-trap. To change this, you have to change all the instances of '/bot-trap' to your new directory.

I used PHP's file-locking function flock(). According to the user comments on the PHP manual page for flock(), you're bound to hit a race condition eventually when two processes trip the bot trap at the same time, or two processes unban at the same time. Unfortunately, if the comments there are to be believed (and I don't know one way or the other), there is no way around this. I figure that if you have so many bots tripping the trap that you get race collisions, you've got bigger problems than race collisions.
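For readers unfamiliar with the pattern, here is a minimal Python sketch of a lock-then-read-modify-write update using an advisory flock() lock on Unix. It is not the package's PHP code, the name locked_update is hypothetical, and it does not settle whether the race described in the PHP comments can be fully avoided. Advisory locks only help when every writer takes the same lock.

```python
import fcntl
import os


def locked_update(path: str, transform) -> str:
    """Rewrite the file at `path` as transform(old_text) under an exclusive lock.

    Hypothetical helper for illustration (Unix only).
    """
    # O_CREAT without O_TRUNC: create the file if missing, never wipe it on open.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    with os.fdopen(fd, "r+", encoding="utf-8") as fh:
        fcntl.flock(fh, fcntl.LOCK_EX)  # blocks until no other process holds it
        old_text = fh.read()
        new_text = transform(old_text)
        fh.seek(0)
        fh.truncate()  # drop the old contents before writing the new ones
        fh.write(new_text)
        fh.flush()
    # Closing the file releases the lock.
    return new_text
```

The read, the change, and the write all happen while the lock is held, so two cooperating processes cannot interleave their updates.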

LICENSE

bot-trap is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

bot-trap is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

AUTHOR

I can be contacted at danielwebb.us.

This package was inspired by http://www.kloth.net/internet/bottrap.php.

FAQ

Bot-trap looks old. Is it being updated?

Yes.

Even good bots such as Googlebot are sometimes temporarily broken and ignore robots.txt. How do I keep them from getting banned?

Add IP addresses of known good bots to the top of the .htaccess file with the "Allow" keyword:
Allow from 72.14.192.0/18
Allow from 64.233.160.0/19
These lines whitelist some of the Google IP address space. Unfortunately, some webmasters practice something called "cloaking", where they give Google their real content but show you a signup or subscription screen. This is fraud, and Google sometimes tries to detect it by searching from secret IP addresses, so there's really no way to be sure whether a hit is really Google!
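To see which addresses the two ranges above actually cover, a small Python check like the following can help. It is a sketch only: the names WHITELIST and is_whitelisted are hypothetical, and nothing here says that bot-trap itself performs this check.

```python
import ipaddress

# The two ranges from the Allow lines above.
WHITELIST = [
    ipaddress.ip_network("72.14.192.0/18"),
    ipaddress.ip_network("64.233.160.0/19"),
]


def is_whitelisted(ip: str, networks=WHITELIST) -> bool:
    """Return True if ip falls inside any of the whitelisted networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)
```

For example, is_whitelisted("72.14.200.1") is True, while is_whitelisted("72.14.191.255") is False because that address sits just below the /18 block.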
WARNING! Google has a service called Google Wireless Transcoder (GWT) that may allow spammers to get around bot-trap via GWT if you whitelist Google!
Bot-trap now shows the hostname in the alert email, so you can skip whitelisting Google and instead check your email alerts occasionally to make sure nobody important, like Google or Yahoo, has been banned. If you want to see the hostnames of all the IPs banned in your .htaccess, this one-line shell script will do it (Unix only):
for ip in $(grep 'Deny from' .htaccess | sed 's/Deny from //'); do dig -x "$ip" | grep --after-context=2 "ANSWER SECTION"; done

Why does my local machine keep getting banned when I use link-checkers?

Link-checkers may ignore robots.txt since they want to check the whole site. See the previous question and add:
Allow from 127.0.0.1
to whitelist your local machine.

Can't a bad bot get around bot-trap by just following robots.txt?

Yes. Bot-trap was originally designed to stop email harvesters, and for this task it is extremely good. Another kind of bad bot is the scraper, which copies all your pages and content to another site to earn ad revenue. If these scum follow robots.txt, bot-trap is powerless against them, and you'll need a different method to defend yourself.

What's to stop a bot on a rotating IP from filling your .htaccess with thousands or millions of "banned" IP addresses just to get back at you?

I'm not sure. I think bot-trap is probably underpowered for this kind of determined scumbag. A bot-trapping implementation that would be more immune to this type of attack is one built on mod_perl. Instead of banning IP addresses by altering .htaccess, it has the Apache access routine call a Perl handler which holds the banned IP addresses in memory. Each time the Perl handler is restarted, the ban list is cleared.
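The in-memory idea can be illustrated with a short Python sketch. The class name MemoryBanList is hypothetical, and the size cap with oldest-first eviction is an assumption added here to show one way to bound growth; it is not a description of the mod_perl implementation.

```python
from collections import OrderedDict


class MemoryBanList:
    """Ban list held only in memory; it vanishes when the process restarts."""

    def __init__(self, max_entries: int = 10000):
        self._banned = OrderedDict()  # ip -> True, oldest ban first
        self._max = max_entries

    def ban(self, ip: str) -> None:
        self._banned.pop(ip, None)  # re-banning moves the ip to the newest slot
        self._banned[ip] = True
        while len(self._banned) > self._max:
            self._banned.popitem(last=False)  # evict the oldest ban

    def is_banned(self, ip: str) -> bool:
        return ip in self._banned

    def __len__(self) -> int:
        return len(self._banned)
```

With a cap in place, a bot on rotating IPs can push out older bans but cannot make the list grow without limit.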

Bot-trap uses a hidden link behind a 1x1 pixel gif. Won't this incur some hidden link penalty with search engines?

Probably. Bot-trap uses 1x1 image links, which could theoretically hurt your site's rank in search engines because link-farmers often use this technique. A workaround is to create a 20x10 image with "Bot trap do not click", or create a text link that indicates it should not be clicked. Curious users will just get banned and then unban themselves. I have been using bot-trap since 2004 and I currently (December 2006) have the 4th result for the 3 terms "bad", "web", and "bot" in Google, so make of that what you will. All of my pages have two hidden links per page and most also have a PageRank of 4, which I think is respectable for a very low-traffic site. Search engine ranking algorithms are secret for obvious reasons, so I can't fully answer this question.