Robots.txt Generator - McAnerin International Inc.

Robot Control Code Generation Tool

If you know of a robot that should be added to this list, please contact us and we will verify and add it.

Newest Additions: XML Sitemap Auto Discovery directive
Recent Additions: Baidu, image and blog crawlers, Alexa/Wayback, and crawl-delay directives.

Default - All Robots are: Allowed or Refused
Crawl-Delay: Default (No Delay), 5, 10, 20, 60 or 120 seconds
Sitemap: (leave blank for none)

Specific Search Robots (each can be set to Same as Default, Allowed or Refused):

    Robot              User-agent
    Google             googlebot
    MSN Search         msnbot
    Yahoo              yahoo-slurp
    Ask/Teoma          teoma
    GigaBlast          gigabot
    Scrub The Web      scrubby
    DMOZ Checker       robozilla
    Nutch              nutch
    Alexa/Wayback      ia_archiver
    Baidu              baiduspider

Specific Special Bots (each can be set to Same as Default, Allowed or Refused):

    Robot              User-agent
    Google Image       googlebot-image
    Yahoo MM           yahoo-mmcrawler
    MSN PicSearch      psbot
    SingingFish        asterias
    Yahoo Blogs        yahoo-blogs/v3.9

Restricted Directories: The path is relative to root and must contain a trailing "/".

    # robots.txt generated at http://www.mcanerin.com
    User-agent: *
    Disallow:
    Disallow: /cgi-bin/
    Sitemap: http://www.ruskingtongardencentre.co.uk/sitemap.xml

Now, copy and paste this text into a blank text file called "robots.txt" (don't forget the "s" on the end of "robots") and put it in your root directory.
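Before uploading, a file like the one above can be sanity-checked locally. The following is a minimal sketch using Python's standard urllib.robotparser module. The helper name is_allowed and the sample paths are illustrative assumptions, and the rules are a trimmed version of the generator output (the empty Disallow line and the Sitemap line are left out to keep the check unambiguous).

```python
from urllib.robotparser import RobotFileParser

# Trimmed version of the generated file (assumption: one Disallow rule only).
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
"""


def is_allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """Return True if user_agent may fetch path under the given rules."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # Hypothetical paths, chosen only to exercise the rule above.
    for path in ("/index.html", "/cgi-bin/form.pl"):
        print(path, is_allowed(ROBOTS_TXT, "googlebot", path))
```

This only shows how one parser reads the rules. Individual search engines may interpret edge cases differently, so a validator from the search engine itself remains the better final check.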
Like all other files on your server, make sure the robots.txt file's permissions are set so that visitors (such as search engines) can read it.

This Generator is Copyright © 2004-2007 Ian McAnerin & McAnerin Networks Inc.

Introduction to Robots.txt

The robots.txt is a very simple text file that is placed in your root directory. An example would be www.yourdomain.com/robots.txt. This file tells search engines and other robots which areas of your site they are allowed to visit and index.

You can ONLY have one robots.txt on your site, and ONLY in the root directory (where your home page is):

    OK: www.yourdomain.com/robots.txt
    BAD - Won't work: www.yourdomain.com/subdirectory/robots.txt

All major search engine spiders respect this file, and naturally most spambots (email collectors for spammers) do not. If you truly want security on your site, you will have to actually put the files in a protected directory, rather than trusting the robots.txt file to do the job. It's guidance for robots, not security from prying eyes.

What does a Robots.txt look like?

At its most simple, a robots.txt file looks like this:

    User-agent: *
    Disallow:

This one tells all robots (user agents) to go anywhere they want (disallow nothing).

This one, on the other hand, keeps out all compliant robots:

    User-agent: *
    Disallow: /

As you can see, the only difference between them is a single slash ("/"). But if you accidentally use that slash when you didn't mean to, you could find your search engine rankings disappear. Be very careful.

One important thing to know if you are creating your own robots.txt file is that although the wildcard (*) is used in the user-agent line, it is not allowed in the disallow line.
For example, you can't have something like:

    # Broken robots.txt - can't use the * symbol in the disallow line, even if you
    # really want to and it makes sense to have one (Google and MSN are an
    # exception to this - more information below)
    User-agent: *
    Disallow: /presentations/*.ppt

Here is the official information on the subject: RobotsTxt.org

You may also be interested in: robots.txt validator and Robot Cop (server module that enforces bot behaviour).

UPDATE: If you use Google Sitemaps (and you should), it now includes a robots.txt validator, which will make certain that your robots.txt file is understood properly by Google.

Pre-Made Robots.txt Files

If you want a simple file already pre-made and ready to drop into your website root, you can get them here (right click and choose "save as"):

    Allow All Robots
    Refuse All Robots
    Allow All Robots everywhere EXCEPT the cgi-bin and the images directory
    Only Allow Known Major Search Engines (note: this will disallow some good robots used by some directories to check your listings - be careful)

After you upload these to your server, make sure you set the permissions on the file so that visitors (like search engines) can read it.

If you need more control than this, there is a free robots.txt generator at the top of this page that should help you out. Commercial software is also available for people with very complicated robots.txt needs: RoboGen

Major Known Spiders

Googlebot (Google), Googlebot-Image (Google Image Search), MSNBot (MSN), Slurp (Yahoo), Yahoo-Blogs, Mozilla/2.0 (compatible; Ask Jeeves/Teoma), Gigabot (Gigablast), Scrubby (Scrub The Web), Robozilla (DMOZ)

Search Engine Specific Commands

Google

Google allows the use of asterisks. Disallow patterns may include "*" to match any sequence of characters, and patterns may end in "$" to indicate the end of a name.
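The matching behaviour just described can be sketched in a few lines of Python. This is only an illustration of the stated rule ("*" matches any sequence of characters, a trailing "$" marks the end of the name), not Google's actual implementation. The function names and sample paths are hypothetical.

```python
import re


def disallow_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a wildcard Disallow pattern into a compiled regex."""
    # A trailing "$" anchors the pattern to the end of the path.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and let each "*" match any run of characters.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        regex += "$"
    return re.compile(regex)


def is_blocked(pattern: str, path: str) -> bool:
    """Return True if the path is matched by the Disallow pattern."""
    # match() anchors at the start, so an unanchored pattern acts as a prefix.
    return disallow_pattern_to_regex(pattern).match(path) is not None


if __name__ == "__main__":
    for path in ("/images/logo.gif", "/images/photo.jpg"):
        print(path, is_blocked("/*.gif$", path))
```

Without the trailing "$", a pattern behaves as a prefix match, so in this sketch "/presentations/*.ppt" would also match a path ending in ".pptx".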
To remove all files of a specific file type (for example, to include .jpg but not .gif images), you'd use the following robots.txt entry:

    User-agent: Googlebot-Image
    Disallow: /*.gif$

This applies to both the Googlebot and Googlebot-Image spiders.
Source: http://www.google.com/webmasters/remove.html

Google apparently does NOT support the crawl-delay command.

Yahoo

Yahoo also has a few specific commands, including the instruction:

    Crawl-delay: xx

where "xx" is the minimum delay in seconds between successive crawler accesses. Yahoo's default crawl-delay value is 1 second. If the crawler rate is a problem for your server, you can set the delay to 5 or 20 seconds, or whatever value is comfortable for your server.

Setting a crawl-delay of 20 seconds for Yahoo-Blogs/v3.9 would look something like:

    User-agent: Yahoo-Blogs/v3.9
    Crawl-delay: 20

Source: http://help.yahoo.com/help/us/ysearch/crawling/crawling-02.html

Ask / Teoma

Supports the crawl-delay command.

MSN Search

Supports the crawl-delay command.
Also allows wildcard behavior:

    User-agent: msnbot
    Disallow: /*.[file extension]$

(the "$" is required, in order to declare the end of the file)

Examples:

    User-agent: msnbot
    Disallow: /*.PDF$
    Disallow: /*.jpeg$
    Disallow: /*.exe$

Source: http://search.msn.com/docs/siteowner.aspx?t=SEARCH_WEBMASTER_REF_RestrictAccessToSite.htm

Why do I want a Robots.txt?

There are several reasons you would want to control a robot's visit to your site:

It saves your bandwidth - the spider won't visit areas where there is no useful information (your cgi-bin, images, etc).

It gives you a very basic level of protection - although it's not very good security, it will keep people from easily finding stuff you don't want easily accessible via search engines. They actually have to visit your site and go to the directory instead of finding it on Google, MSN, Yahoo or Teoma.

It cleans up your logs - every time a search engine visits your site it requests the robots.txt, which can happen several times a day.
If you don't have one, each request generates a "404 Not Found" error. It's hard to wade through all of these to find genuine errors at the end of the month.

It can prevent spam and penalties associated with duplicate content. Let's say you have a high speed and low speed version of your site, or a landing page intended for use with advertising campaigns. If this content duplicates other content on your site, you can find yourself in ill-favor with some search engines. You can use the robots.txt file to prevent the content from being indexed, and therefore avoid issues. Some webmasters also use it to exclude "test" or "development" areas of a website that are not ready for public viewing yet.

It's good programming policy. Pros have a robots.txt. Amateurs don't. What group do you want your site to be in? This is more of an ego/image thing than a "real" reason, but in competitive areas or when applying for a job it can make a difference. Some employers may consider not hiring a webmaster who didn't know how to use one, on the assumption that they may not know other, more critical things as well. Many feel it's sloppy and unprofessional not to use one.

Robots.txt FAQ - Issues, Facts and Fiction

By itself, a robots.txt file is harmless and actually beneficial. However, its job is to tell a search engine to keep away from parts of your website. If you misconfigure it, you can accidentally prevent your site from being spidered and indexed. This has happened to people both due to an error in the robots.txt file and after a site redesign where the directory structure of the site changed and the robots.txt was not updated. Always check the robots.txt after a major site redesign.

A robots.txt file and, for that matter, the robots metatag (related: free robots meta tag generator) have NO EFFECT on speeding up the spidering and indexing of a website, and no effect on the depth or breadth of the spidering of a site.
You cannot issue a search engine spider a command to do something - you can only tell it not to do something.

Security Issue: A robots.txt is not intended to provide security for your website - humans ignore it. There is also a possible security issue with the file itself. Let's say you have a secret directory in your site called "secretsauce". You don't want it spidered, so you add this directory to your robots.txt. The problem now is that anyone can look up your robots.txt file and see that you don't want people looking at that directory. Obviously, if you were a hacker, this would be your first stop. Additionally, if the path you were excluding was "/secretfiles/secretsauce/", the same hacker now knows that you have another directory called "secretfiles" as well. It's never a good idea to tell a hacker details about your site structure and design.

If you are trying to keep people away from information, you need to use real file and folder level security on your site, which will prevent robots from visiting just like people, even if the robots.txt file says it's ok. I recommend you set your robots.txt to only deal with non-critical and normal directories, such as images, cgi-bin, etc, and then use file security for the rest. That way, even though the robots are not specifically excluded from the folders and files, they are effectively excluded by the file permissions. Only use robots.txt (and robots metatags) to exclude files, pages and directories that are intended to be available to people but not to robots, such as duplicate pages, test pages and demos.

Rule of thumb: If you want to restrict robots from entire websites and directories, use the robots.txt file. If you want to restrict robots from a single page, use the robots metatag. If you are looking to restrict the spidering of a single link, use the link "nofollow" attribute.

    Granularity                Best Method
    Websites or Directories    robots.txt
    Single Pages               robots metatag
    Single Links               nofollow attribute

Unless otherwise noted, all articles written by Ian McAnerin, BASc, LLB. Copyright © 2002-2004 All Rights Reserved. Permission must be specifically granted in writing for use or reprinting anywhere but on this site, but we do allow it and don't charge for it, other than a backlink. Contact Us for more information.

Content Copyright © 1994 - 2007 McAnerin International Inc. & McAnerin Networks Inc. All Rights Reserved.