Tuesday, August 25, 2009

I, the Web Robot, help you search the internet

Probably all of you have already heard about web robots. If you imagined them as in sci-fi movies, forget that. A robot is a program that automatically traverses the Web's hypertext structure by retrieving a document and then recursively retrieving all the documents it references. Normal Web browsers are not robots, because they are operated by a human and don't automatically retrieve referenced documents (other than inline images). Web robots are sometimes referred to as Web Wanderers, Web Crawlers, or Spiders. These names are a bit misleading, as they give the impression that the software itself moves between sites like a virus; this is not the case. A robot simply visits sites by requesting documents from them.

Are web robots good or bad?

There are a few reasons people believe robots are bad for the Web:

  • Certain robot implementations can overload networks and servers (and have done so in the past). This happens especially with people who are just starting to write a robot; these days there is enough information on robots to prevent some of these mistakes.
  • Robots are operated by humans, who make mistakes in configuration or simply don't consider the implications of their actions. This means operators need to be careful, and robot authors need to make it difficult for people to make mistakes with bad effects.
  • Web-wide indexing robots build a central database of documents, which doesn't scale well to millions of documents on millions of sites.

If professionally designed and operated, robots are good: they make search possible, and they bring relevant web pages to the attention of readers who are looking for exactly the kind of information those pages offer.

Nevertheless, there may be situations in which the author of a web page does not want it indexed. In such cases, robots are not good, because they will index that page against the author's will.

Can we prevent web robots from visiting a web page?

No, we can't keep them away from our pages, but there is a weapon that can prevent them from indexing our pages. This weapon is a file called robots.txt, which specifies an access policy for robots. The file must be accessible via HTTP at the local URL "/robots.txt".
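As a quick sketch of what such a policy looks like (the directory names here are made up for illustration), a robots.txt that keeps all robots out of two directories while leaving the rest of the site open might be:

```
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
```

Each `User-agent` line names the robots a record applies to (`*` means all robots), and each `Disallow` line gives a URL prefix those robots are asked not to retrieve.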

This approach was chosen because it can be easily implemented on any existing WWW server, and a robot can find the access policy with only a single document retrieval. Yet, the protocol is purely advisory. It relies on the cooperation of the web robot, so that marking an area of your site out of bounds with robots.txt does not guarantee privacy.

Making a robots.txt file for your website is very simple once you know the syntax. There are also so-called robots.txt checkers, which parse your file and let you know if it is malformed.
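In fact, if you know a little Python, you can check a policy yourself with the standard library's `urllib.robotparser` module. The sketch below parses a hypothetical robots.txt (the rules and URLs are made up for illustration) and asks whether a robot may fetch two pages:

```python
# Minimal sketch: checking a robots.txt policy with Python's
# standard urllib.robotparser module. The rules and example.com
# URLs below are hypothetical.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())  # parse the policy line by line

# can_fetch(useragent, url) answers: may this robot retrieve this URL?
print(parser.can_fetch("*", "http://example.com/public/page.html"))   # True
print(parser.can_fetch("*", "http://example.com/private/page.html"))  # False
```

A well-behaved robot does exactly this check before requesting each document, which is why the protocol works despite being purely advisory.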

If you want to know what a robots meta tag is, take a quick look at this page. You’ll also find there a link to a list of web robots and crawlers.
