Spiders and Robots Exclusion
Web Robots
are programs that automatically traverse the Web's hypertext
structure by retrieving a document, and recursively
retrieving all documents that are referenced.
This page explains how you can control what these robots do
when visiting your site.
The Robots Exclusion Protocol is a method that allows Web site
administrators to indicate
to visiting robots which parts of their site should not be visited by
the robot.
When a robot visits a web site, it first checks for the file
robots.txt in the root directory, e.g.
http://Stars.com/robots.txt.
If it can find this file, it will analyze its contents
to see if it may retrieve further documents (files).
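As a rough illustration of the lookup step described above, the following Python sketch derives the robots.txt location from any page URL on a site. The helper name robots_txt_url is hypothetical, and the sketch assumes the file always sits at the root of the same scheme and host as the page.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the URL of robots.txt for the site that serves page_url.

    Hypothetical helper: assumes robots.txt lives at the root of the
    same scheme and host as the page.
    """
    parts = urlsplit(page_url)
    # Keep scheme and host; replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    print(robots_txt_url("http://Stars.com/some/dir/page.html"))
```

A robot would fetch this URL once per site before requesting any other document from it.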
You can customize the robots.txt file to apply only to specific robots,
and to disallow access to specific directories or files.
Here is a sample robots.txt file that prevents all robots from visiting
the entire site:
# Tells Scanning Robots Where They Are And Are Not Welcome
# User-agent: can also specify by name; "*" is for everyone
# Disallow: if this matches first part of requested path,
# forget it
User-agent: * # applies to all robots
Disallow: / # disallow indexing of all pages
The record starts with one or more User-agent lines, specifying
which robots the record applies to, followed by "Disallow" and
"Allow" instructions to that robot.
To evaluate if access to a URL is allowed, a robot must attempt to
match the paths in Allow and Disallow lines against the URL, in the
order they occur in the record. The first match found is used. If no
match is found, the default assumption is that the URL is allowed.
For example:
User-agent: webcrawler
User-agent: infoseek
Allow: /tmp/ok.html
Disallow: /tmp
WebCrawler and InfoSeek are not allowed access to any URL whose path
begins with /tmp (which includes the /tmp/ directory), except
/tmp/ok.html. No record applies to other robots, so they are allowed
unrestricted access.
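The first-match rule can be sketched in a few lines of Python. This is only an illustration of the evaluation order described above, not a full robots.txt parser: the function name is_allowed and the (directive, prefix) tuple representation are assumptions made for the example, and the rules are taken to be already selected for the robot in question.

```python
def is_allowed(rules, path):
    """Apply Allow/Disallow rules to a URL path using first-match.

    rules is a list of (directive, prefix) pairs in the order they
    appear in the record; directive is "allow" or "disallow".
    (Hypothetical representation chosen for this sketch.)
    """
    for directive, prefix in rules:
        # A rule matches when its path is a prefix of the requested path.
        if path.startswith(prefix):
            return directive == "allow"
    # No rule matched: the URL is allowed by default.
    return True


# The record for webcrawler and infoseek from the example above.
RULES = [
    ("allow", "/tmp/ok.html"),
    ("disallow", "/tmp"),
]

if __name__ == "__main__":
    for p in ("/tmp/ok.html", "/tmp/secret.html", "/index.html"):
        print(p, is_allowed(RULES, p))
```

Because the first match wins, the order of the lines matters: if the Disallow line came first, /tmp/ok.html would be blocked too.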
Sometimes robots get stuck in CGI programs, trying to evoke all
possible outputs. The following record keeps robots out of the cgi-bin
directory, and disallows execution of a specific CGI program in the
/Ads/ directory:
User-agent: *
Disallow: /cgi-bin/
Disallow: /Ads/banner.cgi
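One way to check what such a record permits is Python's standard urllib.robotparser module. The sketch below feeds it the three lines above and asks about a few paths; the host example.com, the agent name ExampleBot, and the helper allowed are placeholders chosen for the illustration.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /Ads/banner.cgi
"""

# Parse the rules from memory instead of fetching them over the network.
parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(path, agent="ExampleBot"):
    """Return True if the (placeholder) agent may fetch the given path."""
    return parser.can_fetch(agent, "http://example.com" + path)


if __name__ == "__main__":
    for p in ("/cgi-bin/search", "/Ads/banner.cgi", "/Ads/index.html", "/index.html"):
        print(p, allowed(p))
```

Only the paths that begin with one of the Disallow values are refused; the rest of the /Ads/ directory stays open to robots.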
The Robots META tag allows HTML authors to indicate to visiting robots
if a document may be indexed, or used to harvest more links. No server
administrator action is required.
Note that currently only a few robots implement this.
<HTML>
<Head> <Title>Rossum's Universal Robots</Title>
<META name="robots" content="noindex,nofollow">
<META name="description" content="This page ....">
</Head>
<Body> ...
A robot should neither index this document, nor analyse it for links.
The terms allowed in the content attribute are:
ALL       = INDEX, FOLLOW
NONE      = NOINDEX, NOFOLLOW
INDEX     Index this page.
FOLLOW    Follow links from this page.
NOINDEX   Don't index this page.
NOFOLLOW  Don't follow links from this page.
The Robots META tag name and values are case-insensitive.
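A robot author might read the tag along the following lines. This is a minimal sketch using Python's standard html.parser; the class name RobotsMetaParser and the helper robots_meta are hypothetical, and two behaviours are assumptions rather than part of the description above: a page without the tag is treated as INDEX, FOLLOW, and when terms conflict the later one wins.

```python
from html.parser import HTMLParser

# Effect of each term on (index, follow); None means "leave unchanged".
EFFECTS = {
    "all": (True, True),
    "none": (False, False),
    "index": (True, None),
    "noindex": (False, None),
    "follow": (None, True),
    "nofollow": (None, False),
}


class RobotsMetaParser(HTMLParser):
    """Collect index/follow permissions from a Robots META tag."""

    def __init__(self):
        super().__init__()
        # Assumption: with no tag present, indexing and following are allowed.
        self.index = True
        self.follow = True

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, but not values.
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        for term in (attr.get("content") or "").split(","):
            effect = EFFECTS.get(term.strip().lower())
            if effect is None:
                continue  # ignore unknown terms
            index, follow = effect
            if index is not None:
                self.index = index
            if follow is not None:
                self.follow = follow


def robots_meta(html_text):
    """Return (may_index, may_follow) for the given HTML text."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    return parser.index, parser.follow


if __name__ == "__main__":
    page = '<HTML><Head><META name="robots" content="noindex,nofollow"></Head>'
    print(robots_meta(page))
```

Lowercasing the name and each term before comparing reflects the case-insensitivity noted above.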
Note: In early 1997 only a few robots implemented this, but this is
expected to change as more public attention is given to controlling
indexing robots.
Making Your Pages Easy for Search Engines to Index
Achieving high rankings for Web sites on search engines has
become a black art. But there is an alternative approach. How
about creating pages that present search engine spiders with the
information they want? Manipulating meta tags and image Alt
descriptions can saddle you with a ranking penalty. It's the text
near the top of the page that you need to spend your time on.
A Standard for Robot Exclusion
The original 1994 protocol description, as currently deployed.
It is not an official standard backed by a standards body, or owned by
any commercial organisation. It is not enforced by anybody,
and there is no guarantee that all current and future robots will use it.
Consider it a common facility the majority of robot authors offer
the WWW community to protect WWW servers against unwanted accesses by
their robots.
The robots.txt file
in the HTML 4.0 draft specification.
The Web Robots FAQ
Frequently Asked Questions about Web Robots,
from Web users, Web authors, and Robot implementors.
A List of Active Robots
A database of currently known robots,
with descriptions and contact details.
Robots Mailing List
An archived mailing list for discussion of technical aspects of
designing, building, and operating Web Robots.
The robots@webcrawler.com mailing-list is intended as a technical forum
for authors, maintainers and administrators of WWW robots.
Its aim is to maximise the benefits WWW robots can offer while
minimising drawbacks and duplication of effort. It is intended to
address both development and operational aspects of WWW robots.
Watching the Robots
With the rapid increase of search engines and their deployment of
robots which roam the World Wide Web finding and indexing content to
add to their databases, there has developed a certain amount of
concern amongst web site administrators about exactly how much of
their precious server time and bandwidth is being used to service
requests from these engines.
This site features Botwatch, a Perl utility to analyse log files for
bot activity, and a robots.txt syntax checker.
BotSpot
"The Spot for all Bots on the Net" is designed to be a search index
for those interested in all aspects of Bots and Internet Agents and
to categorize Bots by potential utilization.
BotSpot will also have the latest FAQs, articles and related
information about the field of Bots and Internet Agents.