I spend most of my time telling people how best to let spiders into a website; sometimes, though, it's just as important to know how to keep them out.
Why would you want your content kept out of the search results? There are a number of reasons:
- Unwanted duplication – if you provide printer-friendly versions of pages for people who need a hard copy without extraneous images or navigation, you don't want those spidered: they would appear as duplicate content. The search engines don't like showing their customers two versions of the same material, so make it easy for them to tell which one to use by excluding the printer version.
- If you keep sensitive data on your site, such as wholesale prices for trade customers, you want to make sure it isn't made public.
- If you serve large images, or a large number of moderately sized ones, you may wish to keep bandwidth usage and server load down by stopping search engines from indexing them.
- If you find bandwidth is being eaten up by spiders that are no good to you – link-checking or academic plagiarism spiders, for instance – you may want to keep them out altogether.
- If you're troubled by scraper spiders that come simply to steal your content, you'll want to shut them out entirely.
The remedies depend on the situation and your site structure. An individual page can be kept out of the indexes with a simple robots meta tag set to noindex (the major search spiders obey this, but more specialist ones may not). Larger sets of files can be isolated by directory using the robots.txt file, and specific spiders can be excluded from part or all of a site by the same method; sketches of both follow.
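For the page-level case, the meta tag is one line in the page's head, and all the major engines recognise it:

```html
<head>
  <!-- Keep this page out of the indexes; compliant spiders will obey -->
  <meta name="robots" content="noindex">
</head>
```

For directories and for named spiders, a robots.txt file at the site root does the job. This is a minimal sketch assuming your printer-friendly pages live under /print/ and your images under /images/ – adjust the paths to your own structure – and "ExampleLinkChecker" is a placeholder for whatever user-agent name appears in your logs:

```text
# Keep all compliant spiders out of the duplicate pages and the image directory
User-agent: *
Disallow: /print/
Disallow: /images/

# Shut one named spider out of the entire site
User-agent: ExampleLinkChecker
Disallow: /
```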
Rogue spiders can be more of a problem since, by their very nature, they usually don't abide by robots.txt instructions, so you may need to detect them in your log files, identify their IP addresses, and then ban them in your server settings. This is a very specialist area and we recommend that you research it thoroughly before taking action. It's all too easy to ban a wider range of people than you intended, and just as easy to spend an enormous amount of time chasing rogues who simply change IP address and reappear.
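If you do decide to ban at the server level, the change itself is small. Here's a minimal sketch for Apache 2.4, assuming you've already pulled the offending addresses from your logs (the IPs below are documentation placeholders, not real rogues):

```apache
# .htaccess (or the relevant <Directory> block in the main server config)
<RequireAll>
    Require all granted
    # Placeholder addresses -- substitute the rogue IPs found in your logs
    Require not ip 203.0.113.45
    Require not ip 198.51.100.0/24
</RequireAll>
```

On older Apache 2.2 installations the equivalent is the Deny from directive, and nginx has its own deny directive; the principle is the same whichever server you run. Keep the ranges as narrow as you can, for exactly the over-blocking reason above.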