How to block unwanted sites from crawling your page
Have you ever had a site crawl your page without permission? Have you noticed in your Analytics that your data is skewed by a website that keeps hitting your pages and throwing off all of your performance indicators? We were recently hit by one of these sites, and it can be a real pain in the backside to deal with. We wrote to the site several times asking them to remove us from their list of sites to crawl, and we tried filtering their visits out of our Analytics reports, yet they kept showing up under different names. It got to the point where their crawler had completely skewed our analytics and was giving us false readings for our blog.
It took a lot of searching to find a way to block this specific site (Semalt, semalt.com), but we finally found some code on the WordPress support forums that stopped the Semalt spider from causing any further harm to our page. For those interested, the code must be placed in your .htaccess file, at the end of the file, after the line that says # END WordPress. If done correctly it should read:
# Flag requests that arrive through spam proxies or carry spam referrers
SetEnvIfNoCase Via evil-spam-proxy spammer=yes
SetEnvIfNoCase Referer evil-spam-domain\.com spammer=yes
SetEnvIfNoCase Referer evil-spam-keyword spammer=yes
SetEnvIfNoCase Via pinappleproxy spammer=yes
SetEnvIfNoCase Referer semalt\.com spammer=yes
SetEnvIfNoCase Referer poker spammer=yes
# Allow everyone first, then deny any request flagged above
Order allow,deny
Allow from all
Deny from env=spammer
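The Allow and Deny directives come from Apache 2.2; on Apache 2.4 they only work if the compatibility module mod_access_compat is loaded. As a rough sketch, assuming your host runs Apache 2.4 and lets .htaccess files set access rules, the same idea could be written with the newer Require syntax:

```apache
# Flag the unwanted referrer, same as before
SetEnvIfNoCase Referer semalt\.com spammer=yes

# Apache 2.4 style: let everyone in unless the spammer flag is set
<RequireAll>
    Require all granted
    Require not env spammer
</RequireAll>
```

Use one style or the other rather than mixing both in the same file. If you are not sure which Apache version your host runs, ask them before switching.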
There are several other sites that you may want to block as well. You can reuse this code and just change the domain in this line: SetEnvIfNoCase Referer your-unwanted-site.com spammer=yes.
The same approach can help with international search engine crawlers too, although crawlers usually do not send a referrer, so for those you would match on the User-Agent header instead of Referer. If you are only targeting a market in a specific country, why waste your server’s resources on requests from hundreds or thousands of international crawlers that want to go through all of your pages? Enough of them can overload your site and leave your real viewers with unusually long loading times.
If you guys have any questions or comments, let us know and we will see what we can do to help. Remember, if you want to know how to block unwanted sites from crawling your page, this is a handy tool for WordPress websites that run on Apache. We will research other methods and post them here too, so you can see how to block unwanted sites from crawling your page if it isn’t hosted on WordPress.
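To make the two cases above concrete, here is a small sketch of what extra rules might look like. The domain example-spam-site.com and the crawler name ExampleBot are made-up placeholders, not real offenders; check your own access logs for the exact referrer or user agent strings you want to block. These lines would go above the Allow and Deny lines in your .htaccess file:

```apache
# Hypothetical referrer to block (replace with the real domain)
SetEnvIfNoCase Referer example-spam-site\.com spammer=yes

# Hypothetical crawler to block, matched on its User-Agent string
SetEnvIfNoCase User-Agent ExampleBot spammer=yes
```

Both rules set the same spammer flag, so the existing Deny from env=spammer line picks them up without any further changes. The backslash before the dot makes it match a literal dot, since the pattern is treated as a regular expression.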
Resources: WordPress support forum topic: https://wordpress.org/support/topic/how-to-block-semaltcom-from-visiting-your-wordpress-website