More abuse from msnbot and ignoring robots.txt

Saturday, November 24th, 2007 at 12:43 pm

For years, I’ve been watching and throttling the many spiders and crawlers that abuse the services on the hundreds of domains and subdomains I host.

One of these crawlers is Microsoft’s msnbot. This particular crawler parses the robots.txt on these sites, ignores it entirely, and then indexes and follows the links forbidden within it anyway. Apparently 686 other people are having the exact same problem.

To try to combat the msnbot abuse, I added the following rules to robots.txt over a year ago:

User-agent: msnbot
Crawl-delay: 3000
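
For reference, here is a minimal sketch of what a well-behaved crawler would read from those two lines. It uses Python's standard `urllib.robotparser`, which is only a stand-in here: I have no idea how msnbot parses the file internally, and `Crawl-delay` is a widely used extension rather than part of the original robots.txt convention. The helper name `crawl_delay_for` is made up for this example.

```python
from urllib.robotparser import RobotFileParser

# The exact rules from this post.
ROBOTS_TXT = """\
User-agent: msnbot
Crawl-delay: 3000
"""

def crawl_delay_for(robots_txt, user_agent):
    """Return the Crawl-delay (in seconds) that applies to user_agent, if any."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)

delay = crawl_delay_for(ROBOTS_TXT, "msnbot")
print("seconds between requests:", delay)
# Upper bound on requests per day if the delay were honored.
print("max requests per day:", 86400 // delay)
```

A crawler that honored this value would wait 3,000 seconds between requests, which caps it at a few dozen fetches per day.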

And here’s a snippet of some of their crawling from today’s logs:

65.55.209.99 lists.plkr.org - - [24/Nov/2007:11:18:35 -0500] "GET /pipermail/plucker-list/2003-September/003245.html HTTP/1.0" 200 3739 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"

65.55.209.99 lists.plkr.org - - [24/Nov/2007:11:18:35 -0500] "GET /pipermail/plucker-list/2003-November/003813.html HTTP/1.0" 200 6633 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"

65.55.209.99 lists.plkr.org - - [24/Nov/2007:11:18:35 -0500] "GET /pipermail/plucker-list/2003-November/003746.html HTTP/1.0" 200 11074 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"

65.55.209.99 lists.plkr.org - - [24/Nov/2007:11:18:35 -0500] "GET /pipermail/plucker-list/2003-December/003824.html HTTP/1.0" 200 3811 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"

65.55.209.99 lists.plkr.org - - [24/Nov/2007:11:18:35 -0500] "GET /pipermail/plucker-list/2004-August/005386.html HTTP/1.0" 200 3884 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"

65.55.209.99 lists.plkr.org - - [24/Nov/2007:11:18:35 -0500] "GET /pipermail/plucker-list/2004-January/004013.html HTTP/1.0" 200 4432 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"

65.55.209.99 lists.plkr.org - - [24/Nov/2007:11:18:36 -0500] "GET /pipermail/plucker-list/2003-May/002431.html HTTP/1.0" 200 6210 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"

65.55.209.99 lists.plkr.org - - [24/Nov/2007:11:18:36 -0500] "GET /pipermail/plucker-list/2003-May/002412.html HTTP/1.0" 200 4020 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"

65.55.209.99 lists.plkr.org - - [24/Nov/2007:11:18:36 -0500] "GET /pipermail/plucker-list/2003-June/002814.html HTTP/1.0" 200 5505 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"

65.55.209.99 lists.plkr.org - - [24/Nov/2007:11:18:36 -0500] "GET /pipermail/plucker-list/2003-June/002687.html HTTP/1.0" 200 4033 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"

65.55.209.99 lists.plkr.org - - [24/Nov/2007:11:18:36 -0500] "GET /pipermail/plucker-list/2003-June/002741.html HTTP/1.0" 200 4594 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"

They’ve hit that same site 333 times today, even though the crawl delay is set to 3,000 seconds (50 minutes per request), which should allow at most about 29 requests per day. Instead, they’re hitting the site above at 19 requests/second.
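
If you want to check this sort of thing against your own logs, the rough sketch below counts a crawler's requests per second and tallies how many arrived sooner than the crawl delay allows. It assumes each log entry sits on a single line in the format shown above (timestamp in square brackets, user agent as the last quoted field) and that month names parse in an English locale. The function names are invented for the example, and the sample lines are three entries copied from the excerpt.

```python
import re
from collections import Counter
from datetime import datetime

# Captures the bracketed timestamp and the last quoted field (the user agent).
LOG_RE = re.compile(r'\[(?P<ts>[^\]]+)\].*"(?P<agent>[^"]*)"\s*$')

def requests_per_second(lines, agent_token="msnbot"):
    """Map each one-second timestamp to the number of matching requests."""
    counts = Counter()
    for line in lines:
        m = LOG_RE.search(line)
        if m and agent_token in m.group("agent").lower():
            ts = datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
            counts[ts] += 1
    return counts

def delay_violations(counts, crawl_delay):
    """Count requests that came less than crawl_delay seconds after the previous one."""
    times = sorted(counts.elements())
    return sum(1 for a, b in zip(times, times[1:])
               if (b - a).total_seconds() < crawl_delay)

SAMPLE = [
    '65.55.209.99 lists.plkr.org - - [24/Nov/2007:11:18:35 -0500] "GET /pipermail/plucker-list/2003-September/003245.html HTTP/1.0" 200 3739 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"',
    '65.55.209.99 lists.plkr.org - - [24/Nov/2007:11:18:35 -0500] "GET /pipermail/plucker-list/2003-November/003813.html HTTP/1.0" 200 6633 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"',
    '65.55.209.99 lists.plkr.org - - [24/Nov/2007:11:18:36 -0500] "GET /pipermail/plucker-list/2003-May/002431.html HTTP/1.0" 200 6210 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"',
]

counts = requests_per_second(SAMPLE)
print("peak requests in one second:", max(counts.values()))
print("requests violating a 3000s delay:", delay_violations(counts, 3000))
```

Run over a full day's log instead of `SAMPLE`, the same two numbers show the peak rate and how far off the crawler is from the delay it was given.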

I’m about to block them outright if they can’t even adhere to their own exclusion declarations. There are only 233 uniques for msnbot, and I’m happy to block them all.
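
As a starting point for a block list, here is a small sketch that collects the distinct client addresses that identified themselves as msnbot. It reads "uniques" as unique client addresses, which is an assumption, and it assumes the address is the first field of a single-line log entry and the user agent is the last quoted field. The function name `crawler_addresses` is hypothetical. Keep in mind that a user agent string can be spoofed, so this is a list of candidates rather than proof of origin.

```python
def crawler_addresses(lines, agent_token="msnbot"):
    """Return the sorted set of client addresses whose user agent contains agent_token."""
    addresses = set()
    for line in lines:
        # The user agent is the last double-quoted field on the line.
        parts = line.rstrip().rsplit('"', 2)
        if len(parts) == 3 and agent_token in parts[1].lower():
            addresses.add(line.split()[0])  # first field is the client address
    return sorted(addresses)

sample = [
    '65.55.209.99 lists.plkr.org - - [24/Nov/2007:11:18:35 -0500] "GET /pipermail/plucker-list/2003-September/003245.html HTTP/1.0" 200 3739 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"',
    '65.55.209.99 lists.plkr.org - - [24/Nov/2007:11:18:36 -0500] "GET /pipermail/plucker-list/2003-May/002431.html HTTP/1.0" 200 6210 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"',
]

for address in crawler_addresses(sample):
    print(address)
```

The resulting list can then be fed to whatever blocking mechanism the server or firewall provides.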

Last Modified: Saturday, November 24th, 2007 @ 12:43
