Web Developer's Virtual Library: Encyclopedia of Web Design Tutorials, Articles and Discussions



Spiders and Robots Exclusion

Web Robots are programs that automatically traverse the Web's hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced. This page explains how you can control what these robots do when visiting your site.

Robot Exclusion

The Robots Exclusion Protocol lets Web site administrators tell visiting robots which parts of their site should not be visited. When a robot visits a Web site, it first checks for the file robots.txt in the root directory, e.g. http://Stars.com/robots.txt. If the file exists, the robot analyzes its contents to determine which documents (files) it may retrieve. You can customize the robots.txt file to apply only to specific robots, and to disallow access to specific directories or files. Here is a sample robots.txt file that prevents all robots from visiting the entire site:
# Tells Scanning Robots Where They Are And Are Not Welcome
# User-agent:	can also specify by name; "*" is for everyone
# Disallow:	if this matches first part of requested path,
#			forget it

User-agent: *    # applies to all robots
Disallow: /      # disallow indexing of all pages
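The lookup described above, in which a robot checks the root directory of a site for robots.txt, can be sketched in Python. This is only an illustration: the function name robots_txt_url and the example.com URLs are assumptions, not part of the protocol.

```python
from urllib.parse import urljoin


def robots_txt_url(page_url):
    """Return the robots.txt URL for the site that serves page_url.

    Hypothetical helper for illustration only.
    """
    # A reference that starts with "/" replaces the whole path, so the
    # result always points at the root directory of the same scheme and
    # host, whatever page the robot started from.
    return urljoin(page_url, "/robots.txt")


print(robots_txt_url("http://example.com/docs/page.html?lang=en"))
# prints http://example.com/robots.txt
```

A robot would fetch this URL once per site before requesting any other document from it.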
Each record in robots.txt starts with one or more User-agent lines, specifying which robots the record applies to, followed by "Disallow" and "Allow" instructions for those robots. To evaluate whether access to a URL is allowed, a robot must attempt to match the paths in the Allow and Disallow lines against the URL, in the order they occur in the record. The first match found is used. If no match is found, the URL is assumed to be allowed. For example:
User-agent: webcrawler
User-agent: infoseek
Allow:    /tmp/ok.html
Disallow: /tmp
WebCrawler and InfoSeek are not allowed access to any path beginning with /tmp, except /tmp/ok.html, because the Allow line matches first. All other robots are allowed unrestricted access, since no record applies to them.
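The first-match rule can be sketched in a few lines of Python. This is a minimal illustration rather than a full robots.txt parser: it assumes the record has already been split into (directive, path) pairs in file order, and the names is_allowed and RULES are made up for this example.

```python
def is_allowed(rules, path):
    """Apply the first-match rule to a request path.

    rules is a list of (directive, prefix) pairs in the order they appear
    in the record; directive is "allow" or "disallow".
    """
    for directive, prefix in rules:
        # An empty value matches nothing in this sketch (an empty
        # Disallow line conventionally means "no restriction").
        if prefix and path.startswith(prefix):
            return directive == "allow"  # the first match decides
    return True  # no line matched: the URL is allowed by default


# The record shown above for webcrawler and infoseek.
RULES = [
    ("allow", "/tmp/ok.html"),
    ("disallow", "/tmp"),
]

for path in ("/tmp/ok.html", "/tmp/private.html", "/index.html"):
    print(path, is_allowed(RULES, path))
```

Because matching is by prefix and stops at the first hit, the order of the lines matters: if the Disallow line came first, /tmp/ok.html would be refused as well.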

Sometimes robots get stuck in CGI programs, trying to evoke all possible outputs. The following record keeps all robots out of the cgi-bin directory, and disallows requests for a specific CGI program in the /Ads/ directory:

User-agent: *
Disallow: /cgi-bin/
Disallow: /Ads/banner.cgi
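One way to check how a record like this is interpreted is to feed it to the robots.txt parser in Python's standard library, urllib.robotparser. The sketch below is illustrative only: the robot name ExampleBot, the helper build_parser, and the test paths are assumptions.

```python
from urllib.robotparser import RobotFileParser

# The record shown above, as it would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /Ads/banner.cgi
"""


def build_parser(text):
    """Parse robots.txt content held in a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
for path in ("/cgi-bin/search.cgi", "/Ads/banner.cgi", "/Ads/index.html"):
    # can_fetch() returns True when the named robot may retrieve the path.
    print(path, parser.can_fetch("ExampleBot", path))
```

Only the banner program is blocked inside /Ads/; other files in that directory remain available to robots.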

The Robots META tag

The Robots META tag allows HTML authors to indicate to visiting robots whether a document may be indexed or used to harvest more links. No server administrator action is required.

<HTML>
<Head>	<Title>Rossum's Universal Robots</Title>
	<META name="robots" content="noindex,nofollow">
	<META name="description" content="This page ....">
</Head>
<Body>	...
A robot should neither index this document nor analyse it for links. The terms allowed in the content attribute are:
ALL = INDEX, FOLLOW
NONE = NOINDEX, NOFOLLOW
INDEX: index this page.
FOLLOW: follow links from this page.
NOINDEX: don't index this page.
NOFOLLOW: don't follow links from this page.
The Robots META tag name and values are case-insensitive. Note: as of early 1997 only a few robots implemented this, but that is expected to change as more public attention is given to controlling indexing robots.
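A robot might interpret the tag along the following lines. This is a rough Python sketch, not a reference implementation: the names parse_robots_content, RobotsMetaFinder and robots_policy are invented for the example, and it assumes that a page without a Robots META tag, or a term missing from the content, defaults to INDEX, FOLLOW.

```python
from html.parser import HTMLParser


def parse_robots_content(content):
    """Return (may_index, may_follow) for a Robots META content string."""
    may_index, may_follow = True, True  # assumed default: INDEX, FOLLOW
    for term in content.split(","):
        term = term.strip().lower()  # values are case-insensitive
        if term == "all":
            may_index, may_follow = True, True
        elif term == "none":
            may_index, may_follow = False, False
        elif term in ("index", "noindex"):
            may_index = (term == "index")
        elif term in ("follow", "nofollow"):
            may_follow = (term == "follow")
    return may_index, may_follow


class RobotsMetaFinder(HTMLParser):
    """Remember the content of the last <META name="robots"> tag seen."""

    def __init__(self):
        super().__init__()
        self.content = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            self.content = attributes.get("content") or ""


def robots_policy(html):
    """Return (may_index, may_follow) for an HTML document."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    if finder.content is None:
        return True, True  # no tag: assume the page may be indexed
    return parse_robots_content(finder.content)


page = '<HTML><Head><META name="robots" content="noindex,nofollow"></Head>'
print(robots_policy(page))  # prints (False, False)
```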

Making Your Pages Easy for Search Engines to Index
Achieving high rankings for Web sites on search engines has become a black art. But there is an alternative approach. How about creating pages that present search engine spiders with the information they want? Manipulating meta tags and image Alt descriptions can saddle you with a ranking penalty. It's the text near the top of the page that you need to spend your time on.

A Standard for Robot Exclusion
The original 1994 protocol description, as currently deployed. It is not an official standard backed by a standards body, or owned by any commercial organisation. It is not enforced by anybody, and there is no guarantee that all current and future robots will use it. Consider it a common facility the majority of robot authors offer the WWW community to protect WWW servers against unwanted accesses by their robots.

The robots.txt file
in HTML 4.0 draft spec.

The Web Robots FAQ
Frequently Asked Questions about Web Robots, from Web users, Web authors, and Robot implementors.

A List of Active Robots
A database of currently known robots, with descriptions and contact details.

Robots Mailing List
An archived mailing list for discussion of technical aspects of designing, building, and operating Web Robots. The robots@webcrawler.com mailing-list is intended as a technical forum for authors, maintainers and administrators of WWW robots. Its aim is to maximise the benefits WWW robots can offer while minimising drawbacks and duplication of effort. It is intended to address both development and operational aspects of WWW robots.

Watching the Robots
With the rapid increase in search engines and the robots they deploy to roam the World Wide Web, finding and indexing content for their databases, web site administrators have grown concerned about how much of their server time and bandwidth goes to servicing requests from these engines. This site features Botwatch, a Perl utility to analyse log files for bot activity, and a robots.txt syntax checker.

BotSpot
"The Spot for all Bots on the Net" is designed to be a search index for those interested in all aspects of Bots and Internet Agents and to categorize Bots by potential utilization. BotSpot will also have the latest FAQs, articles and related information about the field of Bots and Internet Agents.


