Go Away Baidu and Yandex
Baidu and Yandex Bots Forbidden Access
That’s it, folks: I have denied access to the Baidu and Yandex web spiders. I don’t want them crawling my sites, and I don’t want them crawling my clients’ sites (unless a client wants them to, of course). Neither bot follows advanced robots.txt disallow rules, and both crawl areas of my sites I don’t want indexed. In particular, I don’t want them continually requesting non-existent RSS feeds and /trackback URLs, generating excessive page-not-found errors.
I am becoming stricter with web bots that do not comply with the more advanced robots.txt rules, e.g. “Disallow: /feed” and wildcards. Google obeys these rules, Bing obeys these rules, and any other worthwhile search engine should obey them too.
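For the record, the sort of rules I expect a bot to honour looks something like this (an illustrative robots.txt fragment; the paths are examples of the kind of rule meant, not my actual file):

```
User-agent: *
Disallow: /feed
Disallow: /trackback
Disallow: /*/feed
Disallow: /*/trackback
```

The last two lines use the wildcard syntax that Googlebot and Bingbot understand but that these non-compliant spiders ignore.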
Baiduspider is the bot of Baidu, the main Chinese search engine. It crawls from multiple IPs, so it has been blocked by user agent, by known IP range (e.g. 200.*.*.*) and by domain. Overkill perhaps, but I really don’t want these bots on my sites.
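Blocking by user agent can be done at the server level. A minimal sketch for Apache, assuming mod_rewrite is available and .htaccess overrides are allowed (your server setup may differ):

```apache
# Return 403 Forbidden to Baiduspider, matched on the user agent string.
# [NC] makes the match case-insensitive; [F] sends the Forbidden response.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Baiduspider [NC]
RewriteRule .* - [F,L]
```

Yandex can be caught by the same rule by adding a second RewriteCond with the [OR] flag.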
Baiduspider has been implicated in numerous instances of content plagiarism, where full articles are posted on Chinese sites without credit given to the author or site of origin. In any case, all my sites are English language, and I can say with confidence that the continual crawling of my sites by Baidu bot has resulted in no worthwhile traffic ever. So the bot is nothing but an irritant and a resource waster.
The Yandex spider belongs to Yandex, the major Russian Federation search engine.
The Yandex bot is another resource waster. In common with the Baidu bot it is continually spidering my sites, yet the only traffic I have ever received from this region of the world is comment spammers, trackback spammers and hacking attempts. And once again, my sites are in English, not Russian or Eastern European languages.
Non-English IP Ranges Getting Excluded
No, I’m not biased against internet users who don’t speak English. I have found that some Russian Federation and Eastern European (e.g. Ukraine and Estonia) IP ranges are used mainly by spammers and hackers – at least in the traffic I get from those regions. So when I find an IP range consistently used for spamming and hacking, the entire range gets denied access.
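Denying a whole range is a short server-level rule. A sketch in Apache 2.4 syntax, using the reserved documentation range 192.0.2.0/24 as a stand-in for whichever range is being banned:

```apache
# Allow everyone except the offending range (Apache 2.4, mod_authz_core).
<RequireAll>
    Require all granted
    Require not ip 192.0.2.0/24
</RequireAll>
```

CIDR notation makes it easy to widen or narrow the ban as the spam pattern shifts within a provider’s allocation.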
There’s also a UK range and a local South African sub-range denied access, as well as a small handful of US-based IP ranges on my watchlist…
Local Search Engines On Shortlist
Before I am accused of being biased against these nations too, note that a new South African search engine is included on my shortlist of bots to block from crawling my sites. This one uses the zenbot user agent and also disobeys robots.txt rules. The webmaster has been notified (at least the bot has a well-identified home page with information, contact forms, etc.). I have allowed it access for now so I can monitor the bot for a few weeks or months and see whether the webmaster brings his spider into compliance with modern robots.txt rules. If not – goodbye, zenbot!
Search Engine Webmasters Get Your Houses in Order
That’s it in a nutshell. Either get your spiders to comply fully with the standards set by Googlebot, or stay out of my websites; I don’t need your traffic. I am definitely not starved for web traffic. With so many pages on my websites (even my e-commerce site) on Google’s Page One SERPs, the minuscule additional traffic these rule breakers might bring has no value whatsoever.
I can only ask why these ‘major’ search engines are appending /feed and /trackback to URLs that have no RSS feed or trackback links. The answer is simple: PLAGIARISM and SPAMMING.
At best they are an annoyance: a waste of server resources and bandwidth, and a waste of time spent going through page-not-found errors. At worst they simply drive more hackers and spammers to my sites – and with an average of 50 spam comment bots being blocked by CAPTCHA modules every day, enough is enough, thank you…