Correct use of robots.txt to allow all crawlers

A robots.txt file is placed on a web server to give instructions about the website to web robots (also called bots, spiders, or crawlers), such as those from search engines. The complete list of available robots can be found here.

Web robots scan a website's content and index it accordingly. If you do not want web robots (or crawlers) to index your website, you can disallow everything in the robots.txt file. Similarly, if you want all of your content to be indexed, you can allow all web robots in the robots.txt file. Generally, it is a good idea to allow indexing of your publicly visible pages and to disallow it for files, folders, or pages that are not meant for the public (such as those inside your control panel or the pages you reach after logging in). The robots.txt file should be placed in the root directory of your website, for example www.yourwebsite.com/robots.txt
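To make the two cases concrete, here is a minimal sketch that checks a block-everything rule set and a selective one using Python's standard urllib.robotparser module. The /admin/ path, the page URLs, and the helper name is_allowed are hypothetical and chosen only for illustration. The result reflects how Python's parser reads the rules, which may differ in edge cases from what a particular search engine does.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Blocks every crawler from the whole site.
DISALLOW_ALL = """\
User-agent: *
Disallow: /
"""

# Hypothetical selective rules: public pages stay crawlable, /admin/ does not.
SELECTIVE = """\
User-agent: *
Disallow: /admin/
"""

if __name__ == "__main__":
    site = "https://www.yourwebsite.com"
    print(is_allowed(DISALLOW_ALL, "Googlebot", site + "/index.html"))
    print(is_allowed(SELECTIVE, "Googlebot", site + "/index.html"))
    print(is_allowed(SELECTIVE, "Googlebot", site + "/admin/settings"))
```

Running the script prints one line per check, so you can see at a glance which URLs a given rule set would block for a given crawler name.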

Search engines such as Google read the robots.txt file to determine which parts of your website they may visit, and then crawl and index the site accordingly. If you do not want to block any web robots from crawling your site, there are three ways to let the search engines know:

Not having a robots.txt file
Having a blank robots.txt file
Adding the following lines to the robots.txt file:

User-agent: *
Disallow:

While all three should allow all crawlers, the safest way to ensure that your website is indexed correctly is to use the third option. That is, create a robots.txt file and put the above lines in it, so that all user agents (or robots) are explicitly allowed.
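As a rough sanity check, the sketch below feeds the explicit allow-all rules and a blank file to Python's standard urllib.robotparser module and confirms that both let every tested crawler fetch every tested URL. The helper name allows_everything, the crawler names, and the sample URLs are assumptions made for illustration. The "no robots.txt file" case is not covered here because it depends on how the crawler handles the missing file on the server.

```python
from urllib.robotparser import RobotFileParser

# The explicit allow-all rules, and a blank file for comparison.
ALLOW_ALL = "User-agent: *\nDisallow:\n"
BLANK = ""


def allows_everything(robots_txt, user_agents, urls):
    """Return True if every user agent may fetch every URL under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return all(parser.can_fetch(ua, url) for ua in user_agents for url in urls)


if __name__ == "__main__":
    agents = ["Googlebot", "Bingbot", "SomeOtherBot"]  # illustrative names
    urls = [
        "https://www.yourwebsite.com/",
        "https://www.yourwebsite.com/blog/post.html",
    ]
    print(allows_everything(ALLOW_ALL, agents, urls))
    print(allows_everything(BLANK, agents, urls))
```

This only shows how one parser interprets the two files. It does not prove that every crawler behaves the same way, which is the reason for preferring the explicit rules.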

There is also a good explanation of this in the Google Webmaster Central video.
