What is Robots.txt and How to Create a Robots.txt File
Search engines use programs called robots to crawl your web pages and parse their content for indexing and ranking. These robots are also called bots or spiders, and they visit your website on a regular basis (or at least they should) to look for new content and pages. They will usually crawl every page they can find on a website unless you tell them not to.
This is where the robots.txt file comes in. What is robots.txt? It’s a plain text file you put on your server to tell some or all search engine robots not to crawl particular files or directories. Why would you want to do this? Maybe you don’t want the admin directory of your CMS to show up in search engine results, or maybe you have a development area meant only for internal eyes. Whatever the case, a robots.txt file tells the search engines to avoid these specific files and directories, and it is the accepted standard for controlling robots on your own site.
So how do you create a robots.txt file? It’s fairly simple once you understand the syntax. First, you have to state which user agents (the standard’s term for search engine bots) a set of rules applies to. This could look like this:
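A minimal sketch of such a line, assuming the standard syntax in which an asterisk acts as a wildcard for every crawler:

```text
# Applies the rules that follow to every crawler
User-agent: *
```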
This says that this particular portion of the file applies to all robots. You can also specify which bots to address, such as Google in this example:
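As an illustration, assuming Googlebot as the user agent name of Google’s main crawler:

```text
# Applies the rules that follow only to Google's main crawler
User-agent: Googlebot
```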
Once you have specified your user agents, you then use the “Disallow” directive to tell the robots which files or directories to avoid. For example:
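A minimal sketch, assuming a section addressed to all robots, with nothing after the Disallow directive:

```text
User-agent: *
# Empty value: nothing is disallowed
Disallow:
```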
A Disallow line with nothing after it tells the search engine spiders that the entire website is open to them. Let’s say, however, you want a search engine spider not to crawl your admin directory. Then your file would look like this:
User-agent: *
Disallow: /admin/
This tells search engine robots not to crawl that directory. The same works for individual files, such as a temporary page you might not want engines to see, like temp.html. Then your file would look like this:
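A minimal sketch, assuming temp.html sits in the root directory of the site and the rule is meant for all robots:

```text
User-agent: *
# Blocks URLs whose path begins with /temp.html; paths are relative to the site root
Disallow: /temp.html
```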
You can also mix and match these rules in one robots.txt file, with separate sections for different engines, and so on. Once you have created your robots.txt file, it needs to be placed in the root directory of your website so the search engines can find it, at an address like www.example.com/robots.txt (with your own domain in place of example.com). This gives you control over what well-behaved search engines crawl. Keep in mind that the file itself is publicly readable and does not block access to anything, so treat it as guidance for crawlers rather than a way to hide files and directories from the general public.
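The mix-and-match idea mentioned above can be sketched as a single file with one section per audience. The /dev/ path here is a hypothetical name for a development area. The sketch also assumes that a crawler follows only the most specific section that names it, as Google documents for Googlebot, which is why /admin/ is repeated in both sections.

```text
# Section for Google's main crawler only
User-agent: Googlebot
Disallow: /admin/
Disallow: /dev/

# Section for every other crawler
User-agent: *
Disallow: /admin/
Disallow: /temp.html
```

A blank line between sections keeps the groups easy to tell apart, and lines starting with # are comments that crawlers ignore.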