Google webmaster guidelines > Technical guidelines > Guideline four of five in this category states...
"Make use of the robots.txt file on your web server. This file tells crawlers which directories can or cannot be crawled. Make sure it's current for your site so that you don't accidentally block the Googlebot crawler. Visit http://www.robotstxt.org/wc/faq.html to learn how to instruct robots when they visit your site. You can test your robots.txt file to make sure you're using it correctly with the robots.txt analysis tool available in Google Sitemaps."
Definitions
robots.txt file - the file that instructs robots how to behave on your site.
web server - the computer (and the software on that computer) that hosts your website.
crawlers / robots / Googlebot - Also known as a "bot" or a "spider", a search engine crawler follows links to web pages and then reads and retains the information it finds. This information eventually becomes the "copy" of a website in a search engine index. This process is often referred to as "crawling" the web. "Googlebot" is the name of Google's main crawler.
Examples and Explanations
The content of your robots.txt file tells search engine crawlers how they should visit your site.
If there are files and directories you do not want indexed by search engines, you can use a robots.txt file to define where the robots should not go. These files are very simple text files that are placed on your web server. The file must be placed in the root folder of your site, for example:
www.yourwebsite.com/robots.txt
If you want to see any website's robots.txt file, just add "/robots.txt" to its domain name.
Here for example is the robots.txt file I use on this site - http://www.feedthebot.com/robots.txt
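Because the file always sits in the root folder under the same name, its address can be worked out from any page on a site. Here is a minimal Python sketch of that rule. The helper name robots_txt_url is made up for this example and is not part of any library.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt address for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, swap in /robots.txt, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://www.yourwebsite.com/photos/mycar.jpg"))
# prints http://www.yourwebsite.com/robots.txt
```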
What do they do exactly?
A robots.txt file passes your instructions to a search engine robot.
The first thing a search engine spider looks for when it visits your site is the robots.txt file. It looks for it because it wants to know what it should do. If you have instructions for a search engine robot, this file is where you must put them.
The most common problem people have with robots.txt files is that they don't know how to make them.
If you can make web pages, you can also make a robots.txt file. The file is a plain text file, which means that you can use Notepad or any other plain text editor (in WordPad, save the file as plain text). You can also make one in FrontPage or Dreamweaver by using the "code" view. You can even copy and paste one.
So instead of thinking "I am making a robots.txt file", just think "I am writing a note". They are the exact same process: however you would write a note or a letter on your computer will work for the robots.txt file, as long as you save it as plain text.
What should the robots.txt file say?
That depends on what you want it to do.
Most people want robots to visit everything on their website. If this is the case with you, and you want the robot to index all parts of your site, there are three ways to let the robots know that they are welcome.
1) Do not have a robots.txt file
If your website does not have a robots.txt file then this is what happens -
A robot comes to visit. It looks for the robots.txt file. It does not find it because it isn't there. The robot then feels free to visit all your web pages and content because this is what it is programmed to do in this situation.
2) Make an empty file and call it robots.txt
If your website has a robots.txt file that has nothing in it then this is what happens -
A robot comes to visit. It looks for the robots.txt file. It finds the file and reads it. There is nothing to read, so the robot then feels free to visit all your web pages and content because this is what it is programmed to do in this situation.
3) Make a file called robots.txt and write the following two lines in it... (these are "instructions" for the robot to follow)
User-agent: *
Disallow:
If your website has a robots.txt file with these instructions in it then this is what happens -
A robot comes to visit. It looks for the robots.txt file. It finds the file and reads it. It reads the first line. Then it reads the second line. The robot then feels free to visit all your web pages and content because this is what you told it to do.
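Options 2 and 3 can be checked without putting anything online. The sketch below uses Python's standard urllib.robotparser module as a stand-in for a robot. That is an assumption made for illustration, since real search engine robots use their own parsers. The bot name "AnyBot", the helper allows_everything and the sample paths are made up. Option 1 is left out because it needs a live request to a server.

```python
from urllib.robotparser import RobotFileParser


def allows_everything(robots_txt: str, sample_paths) -> bool:
    """Return True if every sample path may be crawled under robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return all(parser.can_fetch("AnyBot", path) for path in sample_paths)


SAMPLE_PATHS = ["/", "/index.html", "/photos/mycar.jpg"]  # hypothetical pages
EMPTY_FILE = ""                                # option 2: an empty file
ALLOW_ALL_FILE = "User-agent: *\nDisallow:\n"  # option 3: the two lines above

print(allows_everything(EMPTY_FILE, SAMPLE_PATHS))      # True
print(allows_everything(ALLOW_ALL_FILE, SAMPLE_PATHS))  # True
```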
What do the robot instructions mean?
Here is an explanation of what the different words mean in a robots.txt file.
User-agent:
The "User-agent" part is there to specify directions to a specific robot if needed. There are two ways to use this in your file.
If you want to tell all robots the same thing, you put a "*" after "User-agent:". It would look like this...
User-agent: * (This line is saying "these directions apply to all robots")
If you want to tell a specific robot something (in this example Googlebot) it would look like this...
User-agent: Googlebot (this line is saying "these directions apply to just Googlebot")
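To see the two forms working side by side, here is a small Python sketch using the standard urllib.robotparser module as a stand-in for a real robot (an assumption for illustration). The folder "/private" and the bot name "OtherBot" are made up. The file gives Googlebot its own instructions and lets every other robot go anywhere.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Googlebot follows the group that names it.
print(parser.can_fetch("Googlebot", "/private/notes.html"))  # False
# Any other robot falls back to the "*" group, which blocks nothing.
print(parser.can_fetch("OtherBot", "/private/notes.html"))   # True
```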
Disallow:
The "Disallow" part is there to tell the robots what folders they should not look at.
This means that if, for example, you do not want search engines to index the photos on your site, you can place those photos into one folder and exclude it.
Let's say that you have put all these photos into a folder called "photos". Now you want to tell search engines not to index that folder.
Here is what your robots.txt file should look like:
User-agent: *
Disallow: /photos
The above two lines of text in your robots.txt file would keep robots from visiting your photos folder. The "User-agent: *" part is saying "this applies to all robots". The "Disallow: /photos" part is saying "don't visit or index my photos folder".
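A quick way to see what these two lines block is to run a few addresses through a parser. The sketch below uses Python's standard urllib.robotparser as a stand-in for a robot (an assumption for illustration). The file names and the bot name "AnyBot" are made up.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = "User-agent: *\nDisallow: /photos\n"

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical addresses on the site.
for path in ["/photos/beach.jpg", "/photos", "/photos-old/a.jpg", "/index.html"]:
    verdict = "allowed" if parser.can_fetch("AnyBot", path) else "blocked"
    print(f"{path}: {verdict}")
```

The rule is matched against the start of the address, so "Disallow: /photos" also blocks anything else that begins with "/photos", such as a "photos-old" folder. Writing "Disallow: /photos/" with a trailing slash limits the rule to the contents of the folder itself.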
Googlebot specific instructions
The robot that Google uses to build its search index is called Googlebot. It understands a few more instructions than other robots. The instructions it follows are well defined in the Google help pages (see resources below).
In addition to "User-agent" and "Disallow", Googlebot also uses the...
Allow:
The "Allow:" instruction lets you tell a robot that it is okay to see a file in a folder that has been "Disallowed" by other instructions.
To illustrate this, let's take the above example of telling the robot not to visit or index your photos. We put all the photos into one folder called "photos" and we made a robots.txt file that looked like this...
User-agent: *
Disallow: /photos
Now let's say there is a photo called mycar.jpg in that folder that you want Googlebot to index. With the Allow: instruction, we can tell Googlebot to do so. It would look like this...
User-agent: *
Disallow: /photos
Allow: /photos/mycar.jpg
This would tell Googlebot that it can visit "mycar.jpg" in the "photos" folder, even though the "photos" folder is otherwise excluded.
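Why does the Allow line win here? As far as Google's published documentation describes it, Googlebot picks the most specific matching rule, meaning the one with the longest path, and prefers Allow when two rules tie. The Python sketch below is a simplified illustration of that precedence idea, not Google's actual code. The function name is_allowed is made up, and wildcards are not handled.

```python
def is_allowed(rules, path):
    """Decide access with longest-matching-rule precedence.

    rules is a list of (directive, prefix) pairs for one User-agent group,
    for example ("disallow", "/photos"). Wildcards are not handled here.
    """
    best_len = -1
    allowed = True  # no matching rule means the path may be crawled
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # an empty prefix matches nothing in this sketch
        is_allow = directive.lower() == "allow"
        # A longer prefix wins; on a tie, the less restrictive rule wins.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


RULES = [("disallow", "/photos"), ("allow", "/photos/mycar.jpg")]

print(is_allowed(RULES, "/photos/mycar.jpg"))  # True
print(is_allowed(RULES, "/photos/other.jpg"))  # False
print(is_allowed(RULES, "/index.html"))        # True
```

Not every robot works this way. Some parsers, including Python's own urllib.robotparser, apply the first rule that matches, so they would read the file above as blocking mycar.jpg. Putting the Allow line before the Disallow line gives the same result under both approaches.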
Testing your robots.txt file
If you are using Google Sitemaps as part of Google's webmaster tools, you can log in and see whether Google is having any issues crawling your site. There is also a robots.txt tool that lets you experiment a little and tells you whether there are any problems with your file before you put it online.
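You can also do a rough check of a draft file on your own computer before uploading it. The Python sketch below is one way to do that, using the standard urllib.robotparser module. The helper name blocked_paths, the draft file and the list of pages are all made up for the example. This parser is not Google's, so its verdicts can differ from Google's tool, especially for files that mix Allow and Disallow lines.

```python
from urllib.robotparser import RobotFileParser


def blocked_paths(robots_txt: str, user_agent: str, paths):
    """Return the paths that user_agent may not crawl under robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [p for p in paths if not parser.can_fetch(user_agent, p)]


DRAFT = "User-agent: *\nDisallow: /photos\n"           # the draft to check
MUST_STAY_CRAWLABLE = ["/", "/index.html", "/about.html"]  # hypothetical pages

problems = blocked_paths(DRAFT, "Googlebot", MUST_STAY_CRAWLABLE)
print(problems if problems else "no important pages are blocked")
```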
Key Concept:
- If you use a robots.txt file, make sure it is correctly written because an incorrect robots.txt file can block the bots that index your website.
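A single character can make the difference. "Disallow:" with nothing after it blocks nothing, while "Disallow: /" blocks the whole site. The Python sketch below shows the contrast, using the standard urllib.robotparser module as a stand-in for a robot (an assumption for illustration). The helper name can_crawl is made up.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt: str, path: str) -> bool:
    """Return True if Googlebot may crawl path under robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("Googlebot", path)


ALLOW_ALL = "User-agent: *\nDisallow:\n"    # empty value: nothing is blocked
BLOCK_ALL = "User-agent: *\nDisallow: /\n"  # a lone slash: everything is blocked

print(can_crawl(ALLOW_ALL, "/index.html"))  # True
print(can_crawl(BLOCK_ALL, "/index.html"))  # False
```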
Resources
From Google:
Google help pages -
Article from Vanessa Fox detailing how Google interprets a robots.txt file
From Google employees:
Matt Cutts shows an example of how to use the Google robots.txt analyzer
From other sources:
More examples of different ways to use the robots.txt file
An in depth reference:
Managing Robot's Access To Your Website
A good article from Brown that includes details on meta tag instructions for robots
Excluding search engine robots from your site
A tool to create robots.txt files:
A tool that checks your robots.txt file and gives input and suggestions, neat tool: