Rulebreaker Bingbot and MSN


Bingbot and MSN bots are Rulebreakers

I’m tired of watching Bingbot and msnbot break the rules and crawl disallowed files, folders and paths. Microsoft says its bots obey robots.txt rules. They don’t. Bingbot and msnbot occasionally read the robots.txt file, then immediately go on to crawl items it specifically disallows, for example:

1st Rule Breaker Example – Comments

Comment paths are disallowed:

  • Disallow: /comment/
  • Disallow: /*/comment/
  • Disallow: /comment/reply/

And the result: Bing crawls these paths anyway.

  • 157.56.93.219   /comment/169   2004/07/13  14:24
  • 157.56.93.219   /comment/179   2004/07/13  14:25
  • 157.56.93.219   /comment/201   2004/07/13  14:26
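To double-check that logged paths like these really fall under the Disallow rules, a small matcher helps. The following is a minimal sketch, assuming the common wildcard reading of robots.txt patterns (`*` matches any run of characters, a trailing `$` anchors the end, everything else is a prefix match). The function names `rule_to_regex` and `is_disallowed` are made up for illustration, and the sketch ignores Allow rules and rule precedence.

```python
import re


def rule_to_regex(rule: str) -> re.Pattern:
    """Translate one robots.txt path pattern into an anchored regex.

    Assumption: '*' matches any run of characters, a trailing '$'
    anchors the end of the path, and anything else is a literal prefix.
    """
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_disallowed(path: str, disallow_rules: list[str]) -> bool:
    """Return True if any non-empty Disallow pattern matches the path."""
    return any(rule_to_regex(r).match(path) for r in disallow_rules if r)


# Rules and paths copied from the comment example above.
COMMENT_RULES = ["/comment/", "/*/comment/", "/comment/reply/"]
CRAWLED = ["/comment/169", "/comment/179", "/comment/201"]

for path in CRAWLED:
    print(path, is_disallowed(path, COMMENT_RULES))
```

Under this prefix reading a trailing slash in a rule is significant: a rule ending in `/feed/` does not match a path that ends in `/feed`. It is worth running every logged path through a check like this before deciding which rule a bot has broken.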

2nd Rule Breaker Example – Feeds

Feed paths are disallowed:

  • Disallow: /feed/
  • Disallow: /*/feed/
  • Disallow: /*/feed/*

And the result: Bing crawls these paths anyway.

  • 157.56.93.211   /taxonomy/term/39/feed   2004/07/13  14:03
  • 157.55.35.99   /taxonomy/term/90/feed   2004/07/13  14:25

Ignores HTML nofollow, noindex Markup

Whether “nofollow, noindex” robots markup is set in the page headers, or links in the content are marked rel=nofollow, the SearchMSN bot and msnbot ignore the rules and crawl these pages anyway.

By ignoring the rules, Microsoft bots (mainly msnbot) have found the Project Honeypot trap files, even though the path to these files is disallowed by both filename and folder.

Persists in looking for PHP Functions

Some have even looked for PHP functions, e.g. URL/function.require and URL/function.parse-url

We are seeing the SearchMSN bot append PHP function names to the end of existing URLs more and more often. This sort of behaviour is not acceptable.
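To see how often this happens, you can filter your access log for paths whose last segment looks like function.something. This is only a rough sketch: it assumes you have already parsed the log into (ip, path) pairs, the helper name `find_php_function_probes` is made up, and the sample entries are hypothetical rather than taken from my logs.

```python
import re

# Matches a final path segment such as "function.require" or
# "function.parse-url", with an optional trailing slash.
PHP_FUNCTION_SUFFIX = re.compile(r"/function\.[a-z0-9][a-z0-9_-]*/?$", re.IGNORECASE)


def find_php_function_probes(entries):
    """Return the (ip, path) pairs whose path ends in a function.* segment."""
    return [(ip, path) for ip, path in entries if PHP_FUNCTION_SUFFIX.search(path)]


# Hypothetical sample entries (documentation-range IPs, invented paths).
sample = [
    ("192.0.2.10", "/blog/some-post/function.require"),
    ("192.0.2.10", "/blog/some-post"),
    ("192.0.2.11", "/about/function.parse-url"),
]

for ip, path in find_php_function_probes(sample):
    print(ip, path)
```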

Ignores robots.txt, What to Do?

What can you do if Bing/MSN ignores your robots.txt rules? Try going to the Bing Webmaster control panel and setting folder and file exclusions there.

There are however a few problems here. First of all, it’s a time-consuming process to add most of your robots file field by field. Then you discover the instructions only last for 90 days.

On the brighter side, the information does say you will be notified 4 days before your rules expire and given the option to renew. No doubt line by line again. And if, like me, you have many sites, this can take ages. What a waste of time!

At least Bing and MSN seem to honour these settings – well, I managed to reduce the crawl rate to acceptable levels. Instead of 10 to 20 bots hitting my sites at the same time, they now come to the site at a lower rate.

The last resort is to deny access to all Bing/MSN IPs using .htaccess, but the bots seem to use thousands of addresses, so it may take some time to discover all of them (there’s a partial list below for MSNbot). You could also block all Microsoft IPs; there’s a lot of comment spam (bots) and other bad behaviour coming from their IPs recently.
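If you go down the .htaccess route, typing deny lines by hand gets old quickly. Here is a minimal sketch that turns a list of offending IPs from your logs into a deny block. It assumes the Apache 2.2 style `Order`/`Deny from` directives (Apache 2.4 prefers `Require not ip`), and it widens each address to its /24 network, which is my own assumption and will block more addresses than you actually saw. The function name `htaccess_deny_block` is made up for illustration.

```python
import ipaddress


def htaccess_deny_block(ips, prefix_len=24):
    """Build Apache 2.2 style deny rules, one per distinct network.

    Each IP is widened to a network of the given prefix length
    (an assumption, not something the bots' real ranges guarantee).
    """
    networks = {
        ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False) for ip in ips
    }
    lines = ["Order Allow,Deny", "Allow from all"]
    lines += [f"Deny from {net}" for net in sorted(networks)]
    return "\n".join(lines)


# IPs taken from the crawl examples earlier in this post.
SEEN = ["157.56.93.219", "157.56.93.211", "157.55.35.99"]
print(htaccess_deny_block(SEEN))
```

Review the generated lines before pasting them into .htaccess, since a wide prefix can easily shut out legitimate visitors that share a range with the bots.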

Final Opinion on Bing/MSN Bots

A set of really badly scripted bots. These search engine spiders use too many IP addresses spread over too many ranges. Their behaviour is more consistent with commercial spy bots than with a quality public search engine.

About Mike

Web Developer and Techno-geek, Saltwater fishing nut, Blogger

Posted on April 8, 2013, in Microsoft.