Robots.txt

What Is Robots.txt?

Robots.txt is a text file containing instructions that search engine crawlers can read. These instructions define which areas of a website the crawlers are allowed to crawl. The robots.txt file is part of the robots exclusion standard, also known as the robots exclusion protocol (REP), a group of web standards that regulates how robots crawl the web and how they access and index content. The REP also includes other directives such as meta robots.

The robots.txt file doesn’t only control crawling; you can also include a link to your sitemap, which gives search engine crawlers an overview of all existing URLs on your domain. In practice, robots.txt files indicate whether certain user agents can or can’t crawl parts of the website.


How Does Robots.txt Work?

Search engines have two primary jobs: crawling the web to discover content, and indexing that content so it can be served up to searchers looking for information. To crawl sites, search engines follow links to get from one site to another; like spiders, they end up crawling across many billions of links and websites.

After arriving at a website, the crawler will try to locate a robots.txt file. If it finds one, it will read that file before continuing through the site. The robots.txt file is crucial because it tells the search engine how the site should be crawled, and its directives govern any further crawler activity on that specific website.

If the robots.txt file doesn’t contain any directives that disallow the crawler’s activity, the crawler will proceed to crawl the rest of the site.

In practice, robots.txt can be used to manage more than a single type of file. For instance, if you use it for image files, it prevents those files from appearing in Google search results. Unimportant files such as style files, script files, and image files can also be disallowed with robots.txt.
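As a brief sketch of what that could look like, assuming a hypothetical /images/ directory on your site (Googlebot-Image is Google’s image crawler):

    # Keep a hypothetical image folder out of Google Images
    User-agent: Googlebot-Image
    Disallow: /images/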

How to Create a Robots.txt File

If you don’t have a robots.txt file, don’t worry; creating one is easy. Open a blank text document and begin typing directives, building them up until you are satisfied with what you have. Alternatively, you can use a robots.txt generator that minimizes syntax errors, which is useful because a single mistake could result in an SEO catastrophe for your site.
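A minimal starter file might look something like the sketch below; the blocked directory and the sitemap URL are placeholders you would replace with your own:

    # Rules for all crawlers
    User-agent: *
    # Block a hypothetical admin area
    Disallow: /admin/

    # Placeholder sitemap location
    Sitemap: https://www.example.com/sitemap.xml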

Robots.txt Syntax

If you have never seen a robots.txt file before, it may seem daunting, but the syntax is actually quite simple. You give directives to bots by stating their user-agent followed by the directive itself. Let’s explore this in more detail.

User-agent is the specific web crawler to which you’re giving crawl instructions. There are many user-agents, but the ones most useful for SEO are:

  • Google: Googlebot
  • Google Images: Googlebot-Image
  • Yahoo: Slurp
  • Bing: Bingbot
  • Baidu: Baiduspider
  • DuckDuckGo: DuckDuckBot

Disallow is the command used to tell a user-agent not to crawl a particular URL. Don’t forget that only one “Disallow:” line is allowed for each URL.

Allow is the command that’s used to tell crawlers they can access a page or subfolder.

Crawl-delay is used to tell a crawler how many seconds it should wait before loading and crawling page content. You should know that Googlebot doesn’t acknowledge this command; for Google, the crawl rate can instead be managed in Google Search Console.

Sitemap is used to point out the location of an XML sitemap associated with the domain. This command is supported by Google, Ask, Yahoo, and Bing.
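Putting these directives together, a robots.txt file using them might look like the following sketch; the crawler names are real, but the paths, the delay value, and the sitemap URL are placeholders:

    # Ask Bingbot to wait 10 seconds between requests
    User-agent: Bingbot
    Crawl-delay: 10

    # Block all crawlers from a hypothetical private folder,
    # but allow one specific page inside it
    User-agent: *
    Disallow: /private/
    Allow: /private/public-page.html

    # Placeholder sitemap location
    Sitemap: https://www.example.com/sitemap.xml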

There is also something called pattern-matching syntax. When it comes to blocking or allowing URLs, robots.txt files are a handy tool because they support pattern matching to cover a variety of URL options. Pattern matching makes it easy to identify the pages or subfolders an SEO wants excluded. There are two regular expression characters you can use to do just that: an asterisk (*), which matches any sequence of characters, and a dollar sign ($), which matches the end of a URL.
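As a sketch, assuming hypothetical URL structures, the two characters might be used like this:

    User-agent: *
    # The * wildcard matches any sequence of characters,
    # so this blocks every URL that contains a question mark
    Disallow: /*?
    # The $ anchors the match to the end of the URL,
    # so this blocks only URLs that end in .pdf
    Disallow: /*.pdf$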

Where Does Robots.txt Go On-Site?

We clarified that whenever search engines and other web-crawling robots come to a site, they look for the robots.txt file. But where exactly will they look?

They will always look in one specific place: the main directory. If the crawler doesn’t find it there, it will assume that the site doesn’t have a robots.txt file and will treat the site accordingly. Always place your robots.txt in your main directory or root domain.
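To illustrate with a placeholder domain, crawlers only check the root of the host:

    https://www.example.com/robots.txt         <- found and read by crawlers
    https://www.example.com/blog/robots.txt    <- never checked, effectively ignored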

Why Do You Need a Robots.txt File?

A robots.txt file controls access to certain areas of your site and helps prevent duplicate content issues from emerging. It keeps entire sections of a website private, prevents search engines from indexing certain files on your website, and keeps internal search results pages from showing up on public SERPs.
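For example, a sketch of keeping internal search results pages from showing up in SERPs, assuming your site uses a hypothetical ?s= search parameter and a /search/ path:

    User-agent: *
    # Block hypothetical internal search result URLs
    Disallow: /?s=
    Disallow: /search/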

All in all, it gives you more control over where search engines can and can’t go on your site.

On the other hand, be very careful not to accidentally disallow Googlebot from crawling your entire site, as this can be very dangerous.

Make sure you’re not blocking any content of your website you want to be crawled.

Don’t use robots.txt to prevent sensitive data such as your personal information from appearing in SERPs. It may still get indexed.

Conclusion

As you can see, there are three different types of robots directives, and they all serve different functions. Robots.txt controls crawl behavior, while meta robots and x-robots tags influence indexation behavior at the individual page level.
