Robots.txt Best Practices For Beginners

A robots.txt file is a file on your web server used to control bots like Googlebot, Google’s web crawler. You can use it to block Google and Bing from crawling parts of your site.

My friend Sebastian was also nice enough to help me create an idiot’s guide to Robots.txt. Q&A below:

Well, the “idiot’s version” will lack interesting details, but it will get you started. Robots.txt is a plain text file. You must not edit it with HTML editors, word processors, or any application other than a plain text editor like vi (Ok, notepad.exe is allowed too). You shouldn’t embed images and such, and any other HTML code is strictly forbidden too.

Why shouldn’t I edit the robots.txt file with my HTML editor or FTP client, for instance?

Because all those fancy apps insert useless crap like formatting, HTML code and whatnot. Most probably search engines aren’t capable of interpreting a robots.txt file like:
<!DOCTYPE text/plain PUBLIC 
"-//W3C//DTD TEXT 1.0 Transitional//Swahili" 
"http://www.w3.org/TR/text/DTD/plain1-transitional.dtd"> 
{\b\lang2057\langfe1031\langnp2057\insrsid6911344\charrsid11089941 
User-agent: Googlebot}
{ \lang2057\langfe1031\langnp2057\insrsid6911344\charrsid11089941 \line 
Disallow: / \line Allow: }{\cs15\i\lang2057\langfe1031\langnp2057\insrsid6911344\charrsid2903095
{\i\lang2057\langfe1031\langnp2057\insrsid6911344\charrsid2903095 content}{ \cs15\i\lang2057\langfe1031\langnp2057\insrsid6911344\charrsid2903095 /} ...
(Ok Ok, I’ve made up this example, but it represents the raw contents of text files saved with HTML editors and word processors.)

Where do I put the robots.txt file on a website?

Robots.txt resides in the root directory of your Web space, that’s either a domain or a subdomain, for example
"/web/user/htdocs/example.com/robots.txt"
resolving to
http://example.com/robots.txt.

Can I use Robots.txt in subdirectories?

Of course you’re free to create robots.txt files in all your subdirectories, but you shouldn’t expect search engines to request or obey those. If for some weird reason you use subdomains like crap.example.com, then example.com/robots.txt is not a suitable instrument to steer crawling of those subdomains, so make sure each subdomain serves its own robots.txt.

When you upload your robots.txt, do it in ASCII mode. Your FTP client usually offers “ASCII|Auto|Binary”; choose “ASCII” even when you’ve used an ANSI editor to create the file.
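The location rule above (one robots.txt per domain or subdomain, always in the root) can be sketched in a few lines of Python. The function name robots_url is made up for this illustration; it only shows which URL a crawler would request for a given page.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL that governs page_url (same scheme and host)."""
    parts = urlsplit(page_url)
    # Keep scheme and host; replace the path and drop query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# A page in a subdirectory is still governed by the file in the root.
print(robots_url("http://example.com/content/page.html"))
# A subdomain is a different host, so it needs its own file.
print(robots_url("http://crap.example.com/some/dir/"))
```

Note that the subdirectory never shows up in the result, which is why a robots.txt placed there is ignored.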

Why should I create my robots.txt file in ASCII content only?

Because plain text files contain ASCII content only. Sometimes standards that say “upload *.htm *.php *.txt .htaccess *.xml files in ASCII mode to prevent them from inadvertent corruption during the transfer, storing with invalid EOL codes, etc.” do make sense. (You asked for the idiot version, didn’t you?)
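If you want to double check a file before uploading it, a rough sanity check could look like the following Python sketch. The function name plain_text_problems and the list of markup markers are assumptions made up for this example, not part of any standard, and passing the check doesn’t prove the file is a valid robots.txt.

```python
def plain_text_problems(data: bytes) -> list:
    """Return reasons why data does not look like a plain ASCII robots.txt."""
    problems = []
    if data.startswith(b"\xef\xbb\xbf"):
        problems.append("starts with a UTF-8 byte order mark")
    if not data.isascii():
        problems.append("contains non-ASCII bytes")
    lowered = data.lower()
    # Hypothetical markers for output saved by word processors and HTML editors.
    for marker in (b"{\\rtf", b"<html", b"<!doctype"):
        if marker in lowered:
            problems.append("contains markup: " + marker.decode("ascii"))
    # Any CR left after removing CRLF pairs is a bare (old Mac style) line ending.
    if b"\r" in data.replace(b"\r\n", b""):
        problems.append("contains bare CR line endings")
    return problems


print(plain_text_problems(b"User-agent: *\nDisallow:\n"))
print(plain_text_problems(b"{\\rtf1 User-agent: Googlebot}"))
```

An empty list means none of these particular problems were found.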

Can I use the Robots.txt file if I am on a Free Host?

If you’re on a free host, robots.txt is not for you. Your hosting service will create a read-only robots.txt “file” that’s suitable to steal even more traffic than its ads that you can’t remove from your headers and footers. Now, if you’re still interested in the topic, you must learn how search engines work to understand what you can achieve with a robots.txt file and what’s just a myth posted on your favorite forum.

What do I put in a robots.txt file?

Your robots.txt file contains useful but pretty much ignored statements like
 # Please don't crawl this site during our business hours!
(the crawler is not aware of your time zone and doesn’t grab your office hours from your site), as well as actual crawler directives. In other words, everything you write in your robots.txt is a directive for crawlers (dumb Web robots that can fetch your contents but nothing more), not indexers (highly sophisticated algorithms that rank only brain farts from Matt and me).

Currently, there are only three statements you can use in robots.txt:
Disallow: /path
Allow: /path
Sitemap: http://example.com/sitemap.xml
Some search engines support other directives like “crawl-delay”, but that’s utter nonsense, hence you can safely ignore those.

The content of a robots.txt file consists of sections dedicated to particular crawlers. If you’ve nothing to hide, then your robots.txt file looks like:
 User-agent: *
 Disallow:
 Allow: /
 Sitemap: http://example.com/sitemap.xml 
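To see how a parser reads this “nothing to hide” file, here is a small sketch using Python’s standard urllib.robotparser module. The crawler name AnyBot is made up, and site_maps() needs Python 3.8 or newer. This shows how one particular parser behaves, not how any search engine is implemented.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow:
Allow: /
Sitemap: http://example.com/sitemap.xml
"""

parser = RobotFileParser()
# parse() takes the file as a list of lines, so no network access is needed.
parser.parse(ROBOTS_TXT.splitlines())

# An empty Disallow: blocks nothing, so every URL may be fetched.
print(parser.can_fetch("AnyBot", "http://example.com/any/page.html"))
# The Sitemap: line is collected separately from the crawl rules.
print(parser.site_maps())
```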
If you’re comfortable with Google but MSN scares you, then write:
 User-agent: *
 Disallow:

 User-agent: Googlebot
 Disallow:

 User-agent: msnbot
 Disallow: /
Please note that you must terminate every crawler section with an empty line. You can gather the names of crawlers by visiting a search engine’s Webmaster section.

From the examples above you’ve learned that each search engine has its own section (at least if you want to hide anything from a particular SE), that each section starts with a
 User-agent: [crawler name]
line, and that each section is terminated with a blank line. The user agent name “*” stands for the universal Web robot, that means that if your robots.txt lacks a section for a particular crawler, it will use the “*” directives, and that when you’ve a section for a particular crawler, it will ignore the “*” section. In other words, if you create a section for a crawler, you must duplicate all statements from the “all crawlers” (“User-agent: *”) section before you edit the code.
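The “a specific section replaces the * section” rule is easy to get wrong, so here is a sketch that demonstrates it with Python’s standard urllib.robotparser module. The paths and the crawler name SomeOtherBot are invented for this example.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A crawler without its own section falls back to the "*" section.
print(parser.can_fetch("SomeOtherBot", "/private/report.html"))
# Googlebot has its own section, so the "*" rules no longer apply to it.
print(parser.can_fetch("Googlebot", "/private/report.html"))
print(parser.can_fetch("Googlebot", "/tmp/cache.html"))
```

In this file Googlebot may fetch /private/ because the Googlebot section doesn’t repeat that Disallow: line. To keep it out, the statement has to be duplicated in the Googlebot section.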

Now to the directives. The most important crawler directive is
 Disallow: /path
“Disallow” means that a crawler must not fetch contents from URIs that match “/path”. “/path” is either a relative URI or a URI pattern (“*” matches any string and “$” marks the end of a URI). Not all search engines support wildcards, for example MSN lacks any wildcard support (they might grow up some day).
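The pattern syntax just described (“*” for any string, “$” for the end of the URI, everything else a plain prefix) can be sketched by translating a pattern into a regular expression. The function names are made up for this example, and real crawlers may differ in details.

```python
import re


def pattern_to_regex(pattern: str):
    """Translate a robots.txt path pattern into a compiled regular expression."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape every literal piece, then join the pieces with "any string".
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def path_matches(pattern: str, path: str) -> bool:
    """True if the URI path is matched by the pattern."""
    return pattern_to_regex(pattern).match(path) is not None


print(path_matches("/path", "/path/page.html"))            # plain prefix match
print(path_matches("/*.pdf$", "/docs/file.pdf"))           # wildcard plus end anchor
print(path_matches("/*.pdf$", "/docs/file.pdf?download"))  # "$" rejects the query string
```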

URIs are always relative to the Web space’s root, so if you copy and paste URLs then remove the http://example.com part but not the leading slash.
 Allow: /path/
refines Disallow: statements, for example
 User-agent: Googlebot 
 Disallow: / 
 Allow: /content/
allows crawling only within http://example.com/content/
Sitemap: http://example.com/sitemap.xml
points search engines that support the sitemaps protocol to the submission files.
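How Allow: refines Disallow: depends on which rule wins when both match. The sketch below assumes the precedence Google documents for its crawler: the longest matching path wins, and Allow wins a tie. The function is_allowed and its rule format are made up for this illustration, and wildcards are left out to keep it short.

```python
def is_allowed(rules, path):
    """rules: list of (directive, path_prefix) tuples from one crawler section."""
    best_len = -1
    allowed = True  # no matching rule means crawling is allowed
    for directive, prefix in rules:
        if prefix == "" or not path.startswith(prefix):
            continue  # an empty value matches nothing
        is_allow = directive.lower() == "allow"
        # Assumption: the longer prefix wins, and Allow wins on equal length.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


# The Googlebot section from the example above.
GOOGLEBOT_RULES = [("Disallow", "/"), ("Allow", "/content/")]

print(is_allowed(GOOGLEBOT_RULES, "/content/article.html"))
print(is_allowed(GOOGLEBOT_RULES, "/admin/login.html"))
```

A parser that simply applies the first matching rule in file order would block /content/ here, so don’t assume every crawler reads this section the same way.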

Please note that all robots.txt directives are crawler directives that don’t affect indexing. Search engines do index disallow’ed URLs pulling title and snippet from foreign sources, for example ODP (DMOZ – The Open Directory) listings or the Yahoo directory. Some search engines provide a method to remove disallow’ed contents from their SERPs on request.

Say I want to keep a file/folder out of Google. Exactly what would I need to do?

You’d check each HTTP request for Googlebot and serve it a 403 or 410 HTTP response code. Or put a “noindex,noarchive” Googlebot meta tag in the page
(<meta name="Googlebot" content="noindex,noarchive" />). Robots.txt blocks with Disallow: don’t prevent indexing. Don’t block crawling of pages that you want to have deindexed, because a crawler that isn’t allowed to fetch a page never sees its meta tag or response code; otherwise you’ll be using Google’s robots.txt based URL terminator every six months.
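The first option (answer Googlebot with a 403 or 410) could be sketched like this in Python. The blocked paths and the function name status_for are hypothetical, and the check only looks at the User-Agent string, which anybody can fake. A real setup would also verify that the request really comes from Google.

```python
# Hypothetical list of paths to keep out of Google.
BLOCKED_PREFIXES = ("/private/", "/old-catalog.html")


def status_for(user_agent: str, path: str) -> int:
    """Pick an HTTP status code for a request (no crawler verification here)."""
    is_googlebot = "googlebot" in user_agent.lower()
    if is_googlebot and path.startswith(BLOCKED_PREFIXES):
        return 410  # Gone; 403 (Forbidden) is the other option named above
    return 200


GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

print(status_for(GOOGLEBOT_UA, "/private/report.html"))
print(status_for(GOOGLEBOT_UA, "/content/article.html"))
print(status_for("Mozilla/5.0 (X11; Linux x86_64)", "/private/report.html"))
```

This only works if the URLs are not disallowed in robots.txt, since the crawler has to request them to receive the status code.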

If someone wants to know more about robots.txt, where do they go?

Honestly, I don’t know a better resource than my brain, partly dumped here. I even developed a few new robots.txt directives and posted a request for comments a few days ago. I hope that Google, the one and only search engine that seriously invests in the evolution of the REP (Robots Exclusion Protocol), will not ignore this post because of the sneakily embedded “Google bashing”. I plan to write a few more posts, not that technical and with real world examples.

Can I auto-generate and mask a robots.txt?

Of course you can ask, and yes, it’s for everybody and 100% ethical. It’s a very simple task, in fact it’s plain cloaking. The trick is to make the robots.txt file a server-side script. Then check all requests for verified crawlers and serve the right contents to each search engine. A smart robots.txt even maintains crawler IP lists and stores raw data for reports. I recently wrote a manual on cloaked robots.txt files at the request of a loyal reader.
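The core of such a script could look like the following Python sketch. It is not taken from the manual mentioned above. The per-crawler bodies, the function name robots_txt_for and the is_verified flag are assumptions for illustration, and the actual crawler verification, IP lists and reporting are left out.

```python
# Hypothetical per-crawler robots.txt bodies; names and rules are made up.
BODIES = {
    "googlebot": "User-agent: Googlebot\nDisallow: /tmp/\n",
    "msnbot": "User-agent: msnbot\nDisallow: /\n",
}
# Everybody else, including unverified requests, gets the restrictive default.
DEFAULT_BODY = "User-agent: *\nDisallow: /\n"


def robots_txt_for(user_agent: str, is_verified: bool) -> str:
    """Choose the robots.txt body for one request.

    is_verified must come from a real crawler check done elsewhere;
    this sketch does not perform that check itself.
    """
    if is_verified:
        ua = user_agent.lower()
        for name, body in BODIES.items():
            if name in ua:
                return body
    return DEFAULT_BODY


print(robots_txt_for("Googlebot/2.1", is_verified=True))
print(robots_txt_for("Googlebot/2.1", is_verified=False))
```

The script would be wired to the /robots.txt URL and must answer with content type text/plain, so crawlers still see an ordinary text file.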

If you enjoyed this step by step guide for beginners – you can take your knowledge to the next level at http://sebastians-pamphlets.com/

What Google says about robots.txt files

A robots.txt file restricts access to your site by search engine robots that crawl the web. These bots are automated, and before they access pages of a site, they check to see if a robots.txt file exists that prevents them from accessing certain pages. (All respectable robots will respect the directives in a robots.txt file, although some may interpret them differently. However, a robots.txt is not enforceable, and some spammers and other troublemakers may ignore it. For this reason, we recommend password protecting confidential information.)

If you want search engines to index everything in your site, you don’t need a robots.txt file (not even an empty one). While Google won’t crawl or index the content of pages blocked by robots.txt, we may still index the URLs if we find them on other pages on the web.

As a result, the URL of the page and, potentially, other publicly available information such as anchor text in links to the site … can appear in Google search results.