Robots.txt files are elements of your website that you don’t want to overlook. They grant or block access to bot visitors trying to “snoop” through your website content.
That, more or less, is the simplest way to define robots.txt files.
In this post, I’ll get into the basics of robots.txt for SEO.
You’ll learn:
- When you should use them
- How to implement them
- Mistakes to avoid
Bots used by search engines are spiders that crawl the web to index website content from all over the internet. This information lets search engines learn what’s on each web page so that it can be retrieved when someone searches for it.
Once you understand the process of web crawling, you’ll also understand why robots.txt files are beneficial for your website. They keep crawlers from snooping through areas you’d rather keep out of search, so crawlers only pick up the information you wish to show about your site.
To better understand robots.txt files, let’s take a closer look at what they are and how they fit into the crawling process.
What Are robots.txt Files?
Robots.txt files, also known as the Robots Exclusion Protocol, are files read by search engines that contain rules granting or denying access to all or certain parts of your website. Search engines like Google or Bing send web crawlers to access your website and collect information they can use so your content can appear in search results.
To picture how robots.txt files work, imagine bots or small spiders crawling through your website in search of information. Think of those sci-fi movies where a million robot spiders crawl the place, snooping around for even the slightest trace of an intruder’s presence.
These simple text files are used for SEO by telling search engines’ indexing bots which pages may or may not be crawled. Robots.txt files are primarily used to manage crawl budget and come in handy when you don’t want crawlers to access a part of your site.
Robots.txt files are very important because they let search engines know where they are allowed to crawl. Basically, they either block crawlers from part or all of your website or leave it open to be crawled and indexed. In other words, they shape how your website gets discovered by search engines.
The Crawling Process at Work
The process of crawling websites for content is known as spidering. The main task of search engines is to crawl the web to discover and index content by following millions of links. When a robot accesses a site, the first thing it does is look for the robots.txt file to find out how much “snooping” it is allowed to do.
Reputable search engines abide by the rules set in your robots.txt file. If there is no robots.txt file, or the file doesn’t prohibit anything, the bots will crawl everything they can find. However, some search engines, like Google, don’t support all directives, and we will elaborate on this further down.
Why Use robots.txt Files?
Robots.txt files allow websites to do several things like:
- Block access to the entire site
- Block access to a portion of the site
- Block access to one URL or specific URL parameters (see the example after this list)
- Block access to a whole directory
- Set up wildcards to match URL patterns
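For instance, here is a minimal sketch of how the URL and parameter options might look in practice; the /thank-you page and the sessionid parameter are placeholders, not recommendations for your site:
User-agent: *
Disallow: /thank-you
Disallow: /*?sessionid=
The first rule blocks one specific URL path, and the second uses a wildcard to block any URL containing the sessionid parameter.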
Robots.txt files control the crawler’s activity on your site by allowing it access only to certain areas. There are plenty of reasons why you wouldn’t grant Google or other search engines access to certain parts of your website: you may still be developing a section, or you may wish to keep confidential information out of search results.
Although websites can function without a robots.txt file, it’s important to remember a few benefits of using them:
- Prevent search engines from crawling through private folders or subdomains
- Prevent crawling of duplicate content and visiting pages you consider insignificant
- Prevent indexing of some images on your site (see the example after this list)
- Prevent and manage server overload
- Prevent slowing down of the website
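To illustrate the image example above, you could address Google’s image crawler directly; the /images/private/ directory here is just a placeholder:
User-agent: Googlebot-Image
Disallow: /images/private/
This keeps images in that folder out of Google Images without affecting how the rest of your site is crawled.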
Note that telling bots not to crawl a page doesn’t mean it won’t get indexed. If other pages link to it, the URL can still appear in search results, just without a meta description. To keep a page out of the index entirely, use a noindex meta tag or header instead (more on this below).
How to Find, Create and Test robots.txt Files?
The robots.txt file always lives at the root of the domain. For example, you can find it at https://www.example.com/robots.txt. If you wish to edit it, you can access the File Manager in your host’s cPanel.
If your website doesn’t have a robots.txt file, creating one is rather straightforward because it is a basic text file created in a text editor. Simply open a blank .txt document and insert your directives. When you are finished, just save the file as “robots.txt” and there you have it.
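As a minimal sketch, a starter file might look something like this; the /admin/ path and the sitemap URL are placeholders you would swap for your own:
User-agent: *
Disallow: /admin/
Sitemap: https://www.example.com/sitemap.xml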
If you’re prone to typos, it may be wise to use a robots.txt generator to avoid SEO disasters and minimize syntax errors. Remember that even a single missing or extra character can bring about trouble.
Once the robots.txt file is created, put it in the appropriate domain root directory. Make sure to test the file before going live to be certain that it’s valid. To do this, you need to go to the Google Support page and click the button “open robots.txt tester”. Unfortunately, this testing option is available only on the old version of Google Search Console.
Select the property you wish to test, remove anything that may be in the box, and paste your robots.txt file. If your file receives the OK then you have a fully functional robots.txt file. If not, you need to go back and look for the mistake.
Implementing Crawl Directives
Each robots.txt file is made up of directives that tell search engines what they may access. Each group of rules begins by specifying the user-agent and then sets the rules for that user-agent. Below we have compiled two lists: one contains directives supported by the major search engines, and the other contains directives they do not support.
Supported Directives
- User-agent – a directive used to target specific bots. Each search engine crawler identifies itself with a user-agent name and follows the block of rules that applies to it, so make sure you enter user-agent names in their correct form (a combined example follows this list).
For example:
User-agent: Googlebot
User-agent: Bingbot
- Disallow – use this directive if you want to keep search engines from crawling certain areas of the website. You can do the following:
Block access to the entire site for all user-agents:
User-agent: *
Disallow: /
Block a particular directory for all user-agents:
User-agent: *
Disallow: /portfolio
Block access to PDFs or any other file type for all user-agents, using the appropriate file extension:
User-agent: *
Disallow: /*.pdf$
- Allow – this directive lets search engines crawl a page or directory, and it can override a Disallow directive. Let’s say you don’t want search engines to crawl your portfolio directory, but you do want to give them access to one specific portfolio inside it:
User-agent: *
Disallow: /portfolio
Allow: /portfolio/allowed-portfolio
- Sitemap – pointing search engines to your sitemap’s location makes it easier for them to find and crawl your pages.
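Putting the supported directives together, a sketch of a file that gives Googlebot its own rules, sets a catch-all group, and lists a sitemap could look like this; the paths and sitemap URL are placeholders:
User-agent: Googlebot
Disallow: /drafts/

User-agent: *
Disallow: /portfolio
Allow: /portfolio/allowed-portfolio

Sitemap: https://www.example.com/sitemap.xml
Each crawler follows the most specific group that matches its user-agent, so Googlebot uses the first block while every other bot falls back to the catch-all group.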
Unsupported Directives
- Crawl-delay – this directive asks bots to slow down and wait between requests so they don’t overwhelm your server. It is more helpful for small websites than big ones. Note that the crawl-delay directive is no longer supported by Google and Baidu, but Yandex and Bing do still support it (see the sketch after this list).
- Noindex – a directive intended to exclude a page or file from search results. It was never supported by Google, so if you want to keep content out of the index, use the X-Robots-Tag HTTP header or the robots meta tag instead.
- Nofollow – another directive never supported by Google, meant to tell search engines not to follow the links on a page. Use the X-Robots-Tag header or the robots meta tag to apply nofollow to all links on a page.
- Host – a directive used to specify whether you want the www. prefix shown in your URLs (example.com or www.example.com). It is currently supported only by Yandex, so it’s advised not to rely on it.
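For reference, here is a sketch of what these alternatives look like in practice (the 10-second delay is an arbitrary placeholder value). A crawl delay for Bing would sit in robots.txt:
User-agent: Bingbot
Crawl-delay: 10
A noindex and nofollow rule, by contrast, belongs in the page’s <head> as a robots meta tag, or in the HTTP response as an X-Robots-Tag header:
<meta name="robots" content="noindex, nofollow">
X-Robots-Tag: noindex, nofollow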
Use of Wildcards
Wildcards are characters used to simplify robots.txt instructions. They let you apply directives to all user-agents at once or match whole groups of URLs with a single rule. Here are the two wildcards commonly used:
- Asterisk (*) – in a user-agent line it means “apply to all user-agents”; in a path it matches any sequence of characters. If you have URLs that follow the same pattern, this will make your life much easier.
- A dollar sign ($) – is used to mark the end of a URL.
Let’s see how this will look in an example. If you decide that all search engines should not have access to your PDF files, then the robots.txt should look like this:
User-agent: *
Disallow: /*.pdf$
So URLs that end with .pdf will not be crawled. But take note that if a URL has additional text after the .pdf ending, such as a query string, the rule no longer matches and that URL remains accessible. Thus, when writing your robots.txt rules, make sure you have considered all the variations.
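To make that concrete, here is how the rule above would treat a few hypothetical URLs:
- /guides/pricing.pdf – blocked, because the URL ends in .pdf
- /guides/pricing.pdf?download=true – not blocked, because the query string means the URL no longer ends in .pdf
- /guides/pricing-brochure – not blocked, because it isn’t a .pdf file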
Mistakes to Avoid
Using robots.txt files is useful and there are many ways to put them to work. But let’s dive deeper and go through the mistakes that need to be avoided when using the robots.txt file.
The benefits are immense, but a lot of damage can be done if robots.txt files are not used in the right way.
- New line – use a new line for each directive in order not to confuse search engines
- Pay attention to case sensitivity – the file must be named robots.txt in lowercase, and the paths inside it are case sensitive. Pay close attention to this or your rules won’t work
- Avoid blocking good content – go over your disallow rules and noindex tags several times, because they might be hurting your SEO results. Be careful not to block content that should be presented publicly
- Protect private data – robots.txt is not a security mechanism, so to secure private information it’s wise to ask visitors to log in. This way you’ll be sure that PDFs and other files stay private
- Overuse of crawl delay – a bit of good advice is not to overuse any directive, especially crawl delay. If you are running a large website, this directive can be counterproductive: a delay of 10 seconds, for example, caps a crawler at roughly 8,640 URLs per day (86,400 seconds in a day divided by 10), which is far too few for a large site.
Duplicate Content
There are several reasons why your site may contain duplicate content. It may be a printer-friendly version, a page accessible from multiple URLs, or different pages with very similar content. Search engines can’t always tell which version is the original.
In cases like these, you need to mark the preferred URL as canonical. The canonical tag tells the search engine which URL is the original location of the duplicated content. If you don’t do this, the search engine will choose the canonical version itself or, even worse, it might treat both versions as canonical. Another way to avoid the problem is to rewrite the content so the pages are no longer duplicates.
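A canonical tag is a single line placed in the <head> of the duplicate page; the URL below is just a placeholder for whichever version you want treated as the original:
<link rel="canonical" href="https://www.example.com/original-page/">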
Let Crawling Eyes Index
When search engines crawl or spider your website, they go through all the content on the website to index it. This process allows crawled websites to appear in search engine results.
By using robots.txt, you tell search engines where they do and don’t have access, limiting them with the rules you set. Robots.txt is simple to use and genuinely useful; once you learn how to assign directives, there are many things you can do with your website.
It’s recommended that you keep an eye on your robots.txt files to make sure that they are set up correctly and performing as coded. If you notice any malfunction, react quickly to avoid disasters.
Consider robots.txt files to be an essential tool for successfully controlling the indexing of your website.