A Tale of Edinburgh University Press and search engines
This weekend I was doing some research for additional content for my Scottish Books site and had occasion to do a Google search for Edinburgh University Press. To my surprise, their site didn't appear on the first page of results, or the second, or the third.
Intrigued, I found a link to it on one of the sites that did rank (it's http://www.euppublishing.com/) and then viewed the source code (always my first action when I want to check a site's setup and quality). The first thing I noticed (apart from acres of whitespace) was lang="en-US" in the html tag – not the best indication, especially for a .com. That gave me an idea, so I went back to Google and clicked on the "web results" link (I had searched UK-only results, as I usually do for UK-based queries). Lo and behold, up came the site in the number 1 spot.
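For reference, a UK-targeted site would normally declare its language in the html tag along these lines (a hypothetical corrected version, not what's actually on the page):

<html lang="en-GB">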
So Google thinks that Edinburgh University Press isn't a UK site. Could it just be that language setting? Let's dig a little further. I next activated my Netcraft toolbar – aha, they are on IBM servers in the USA, another poor signal and almost certainly a rather more important one. (I've seen many, many .com sites fail to rank in the UK simply because they are hosted elsewhere.)
Since it doesn't look as if any SEO has been done on the site – poor and duplicate title tags and no meta descriptions – it's a fair bet that they haven't got a Webmaster Tools account where they could have told Google that the site targets the UK, although that isn't the whole solution by any means.
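For anyone wondering what I mean, every page ought to carry its own descriptive title tag and meta description, something along these lines (the wording here is invented purely as an illustration):

<title>Scottish History Journals | Edinburgh University Press</title>
<meta name="description" content="Journals and books on Scottish history, literature and law from Edinburgh University Press.">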
Robots.txt puzzle
While musing on this situation of one of Scotland's most important academic publishers not showing up in UK searches, and wondering whether I should try to contact the webmaster about it, I cast around pretty much on SEO autopilot checking various data. Having seen that there is a robots noarchive setting on the home page, I checked the robots.txt file:
User-agent: *
Disallow: /
Oops! Seems they either don't want to be indexed or are being somewhat badly advised!
Hang on, they were listed in the Google worldwide results…
So how are the search engines handling that? Let's run a few site: commands.
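The query itself is just the operator followed by the domain:

site:euppublishing.com

Here's what came back: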
Bing only lists 1 page with no details (although as usual they can’t count their totals – 2/2 of 150??)
Yahoo only lists 1 page with no details.
Blekko says there are 250 pages but doesn’t list any of them.
Google lists 901!! (and gives another nonsensical total of 45,000) and includes page content in the short descriptions. (At least they aren't caching it.)
Hmmmmm!!!
So much for Google obeying robots.txt – it seems they make their own minds up (not the first time I've seen this).
So the moral of this story is, be careful about your domain name suffix, be careful where you host your site, don’t tell people you speak American when you’re British, and don’t expect Google to follow standards or stay out of your website when you tell it to.
There are a few situations where robots.txt "fails" in some way. Most of the time it's the overriding link factor: links pointing to pages within sections that are disallowed in robots.txt. That said, if there is a result, it's normally just the page URL, without any snippet data.
Interesting, though, that not many people know about using noindex within the robots.txt… since we know crawling and indexing are two different functions.
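For anyone who hasn't come across it, the directive looks something like this (Google has never officially documented it, so treat it as experimental, and the path here is just an illustration):

User-agent: Googlebot
Noindex: /members/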
As you said, Google has the final call in most cases.
Hi Brett, yes I was surprised that there was snippet data in this case. Clearly this site should have been indexed and the robots.txt entry was a mistake, so in a way I almost agree with Google deciding to index, but it’s still worrying that they feel they can disregard an unambiguous message to stay out.
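Presumably what they actually meant was the allow-everything form, i.e. an empty Disallow:

User-agent: *
Disallow: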
It really means that if you do have a genuine reason for not wanting a site indexed then you have to use stronger methods such as .htaccess.
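On an Apache server, for example, a rough sketch would be to password-protect the whole site via .htaccess (the AuthUserFile path is just a placeholder):

AuthType Basic
AuthName "Restricted"
AuthUserFile /path/to/.htpasswd
Require valid-user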