Ignorant Web Crawlers

Thursday, February 17th, 2005 at 2:10 am

I’m very pedantic about maintaining a clean, well-tuned working environment on the server side. This means making sure things are running at top speed, optimized, without anything that could slow down or harm the user experience. After all, I don’t have the bandwidth that Google has at their disposal, so speed and responsiveness are important. With over 50,000 total hits a day across all of the domains we host, it really is important to keep every site quick and snappy.

For example, the Plucker website used to have a little icon image next to each horizontal menu option. That’s 18 little icons sent to every client that visits the page. In most cases these are already in the local cache, so it’s not a problem, but that’s also 18 extra round-trips to the webserver per client. No longer sending those images improved the server’s response time by a LARGE amount, and reduced the number of trips clients have to make to fetch the full site. It was just a small tweak, and there are more coming in other parts of the sites and domains I host.

But one thing that has been really irritating me is the overly-abusive spiders, crawlers, harvesters, and robots that slam the site daily, nightly, and at all hours. I have scripts set up that parse the logs and find people running spiders that don’t read robots.txt, or spiders that slam me too fast… and I check the logs every single day.
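The “too fast” check boils down to something like the rough sketch below. This is not the actual script; it just illustrates the idea, scanning an Apache combined-format access log and printing any client that makes more than 20 requests inside a 3-second window. The log path and both thresholds are placeholders.

    #!/usr/bin/perl
    # Rough sketch only: flag clients that request too many pages too quickly.
    # Assumes an Apache combined-format log; path and thresholds are placeholders.
    use strict;
    use warnings;
    use Time::Local qw(timelocal);

    my $log      = '/var/log/apache/access.log';   # placeholder path
    my $max_hits = 20;   # more than this many hits...
    my $window   = 3;    # ...inside this many seconds looks like a bot

    my %mon = (Jan=>0, Feb=>1, Mar=>2, Apr=>3,  May=>4,  Jun=>5,
               Jul=>6, Aug=>7, Sep=>8, Oct=>9, Nov=>10, Dec=>11);
    my (%recent, %flagged);

    open my $fh, '<', $log or die "Can't open $log: $!";
    while (my $line = <$fh>) {
        # e.g.  1.2.3.4 - - [17/Feb/2005:02:10:00 -0700] "GET / HTTP/1.1" 200 ...
        next unless $line =~ m{^(\S+) \S+ \S+ \[(\d+)/(\w+)/(\d+):(\d+):(\d+):(\d+)};
        my ($ip, $d, $m, $y, $H, $M, $S) = ($1, $2, $3, $4, $5, $6, $7);
        next unless exists $mon{$m};
        my $t = timelocal($S, $M, $H, $d, $mon{$m}, $y);

        # Keep a sliding window of recent request times per client.
        push @{ $recent{$ip} }, $t;
        shift @{ $recent{$ip} } while $recent{$ip}[0] < $t - $window;
        $flagged{$ip} = 1 if @{ $recent{$ip} } > $max_hits;
    }
    close $fh;

    print "$_\n" for sort keys %flagged;

Anything that shows up in that list gets a closer look before I decide whether it earns a block.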

A lot of people think they’re smart by forging their UserAgent string to match that of a “real” browser, but hits to 20 pages in 3 seconds clearly prove that it’s not a real human reading that content. Those people get a nice big fat firewall rule to block them… only for a few hours or days, until they shape up or fix their broken spider.
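The temporary block itself is nothing exotic. Something along these lines would do it, assuming a Linux box with iptables (any firewall that can drop by source address works just as well); this is only an illustration, not the exact script I run:

    #!/usr/bin/perl
    # Rough sketch of a temporary block.  Assumes Linux iptables; the
    # script itself and its timeout handling are illustrative only.
    use strict;
    use warnings;

    my ($addr, $hours) = @ARGV;
    die "usage: $0 <ip-or-cidr> <hours>\n" unless defined $addr && defined $hours;

    # Drop everything from the offender...
    system('iptables', '-I', 'INPUT', '-s', $addr, '-j', 'DROP') == 0
        or die "iptables insert failed\n";

    # ...and lift the block later.  In practice this would be an at(1) or
    # cron job rather than a process that sleeps for hours.
    sleep $hours * 3600;
    system('iptables', '-D', 'INPUT', '-s', $addr, '-j', 'DROP');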

I found another one today, apparently from Korea, calling itself “W3CRobot/5.4.0 libwww/5.4.0”. Over 3,577 hits in one day from that one beast. Obviously I blocked that entire /24 CIDR as well: 221.148.44.0/24, gone. Plonk!


But two of the biggest “commercial” abusers are Yahoo!’s “Slurp” crawler, and Microsoft’s own “msnbot” crawler.

I specifically restrict Yahoo!’s crawler from reaching many parts of the sites I host (such as the online CVS repositories, deep links into the mailing list archives, and other places). Of course, it reads, parses, and then ignores my robots.txt file for each of these domains. Nice.
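The relevant robots.txt entries look roughly like this; the paths are illustrative rather than the exact ones from my files:

# Keep Yahoo!'s crawler out of the CVS web interface and the
# mailing list archives (illustrative paths)
User-agent: Slurp
Disallow: /cgi-bin/cvsweb/
Disallow: /pipermail/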

msnbot is an even worse abuser. It reads, parses, and then follows the restricted links in robots.txt anyway. I’ve heard a rumor that they do this so they can get “more pages” than Google has in their index, so they can claim they index “more” of the Web. I have a trap set up in robots.txt specifically to catch them:

# Do NOT visit the following pathname or your host will be
# blocked from this site. This is a trap for mal-configured
# bots which do not follow RFCs.
User-agent: *
Disallow: /cgi-bin/block_crawler.pl

If they decide to ignore that and follow the link anyway, they trigger a script that adds a “Deny from” rule to .htaccess, logging the date and time of the block in the .htaccess file so I can check later and see exactly who and what caused the denial rule.
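I won’t paste the real trap script here, but a minimal version of the idea looks something like this (the .htaccess path is a placeholder):

    #!/usr/bin/perl
    # Minimal version of the trap idea: append a dated "Deny from" rule for
    # the visiting client to .htaccess.  The path is a placeholder; the real
    # block_crawler.pl may differ.
    use strict;
    use warnings;
    use Fcntl qw(:flock);

    my $htaccess = '/var/www/html/.htaccess';    # placeholder path
    my $ip       = $ENV{REMOTE_ADDR} || 'unknown';
    my $when     = scalar localtime;

    open my $fh, '>>', $htaccess or die "Can't append to $htaccess: $!";
    flock $fh, LOCK_EX;
    print $fh "# blocked $ip on $when (followed a disallowed robots.txt link)\n";
    print $fh "Deny from $ip\n";
    close $fh;

    # Send the bot something so the request completes.
    print "Content-type: text/plain\n\n";
    print "You were warned.\n";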

It’s pathetic.

Between those two spiders (and it’s not just two instances; each of them runs several dozen parallel instances that hit the server simultaneously), they hit me about every 2 seconds, 24 hours a day, every day.

Until today…

I blocked them both at the firewall, the whole /24 CIDR for each. Now they can’t even get to port 80. I’ll let them stew on that for a couple of weeks, then remove the block. They were hitting me so hard that they accounted for over 80% of my total traffic… and all of the domains I host get a LOT of hits.

There are over 400,000 posts on msnbot’s abusive behavior, as returned by Google. I fully expect, and get, ineptitude from Microsoft, but I would have thought Yahoo!’s staff was more clued-in than that… I guess not. I may just end up throttling both of them with some UserAgent detection and some sleep() calls on various pages, to keep them from hitting me so fast.
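If I go that route, the throttle would just be a few lines at the top of each page script, something like this (the 5-second delay is an arbitrary example):

    #!/usr/bin/perl
    # Sketch of the throttle idea: make greedy crawlers wait before the page
    # is generated.  The 5-second delay is arbitrary.
    use strict;
    use warnings;

    my $ua = $ENV{HTTP_USER_AGENT} || '';

    # Yahoo!'s Slurp and Microsoft's msnbot get a nap before any output.
    sleep 5 if $ua =~ /Slurp|msnbot/i;

    print "Content-type: text/html\n\n";
    # ... normal page output continues here ...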

The fun never stops. I hope the users appreciate all of the extra work I go through to keep their browsing experience fresh, fast, and snappy.
