Noindex vs Disallow

What’s the difference between the noindex meta tag and the disallow directive in robots.txt?

When it comes to SEO, you have to make some critical decisions about how your content gets indexed and presented in search results. You will usually want your posts and important pages indexed by Google, but not your unimportant pages. So you have to decide which pages should be indexed and which should not.

Another related decision is whether certain pages should be crawled by the search engine bot at all. There may be pages you want to hide from search bots not just because you do not want them indexed, but because they are private pages that should neither appear in search results nor be accessible to search engines.

You have two options in this regard. The first is to mark a page as noindex, follow, which allows search engine bots to crawl the page but tells them not to index it, so it never appears in search results. Users may still come across the page while browsing your website, but they will not find it in search results.

For example, say you noindex your archive pages. A user browsing your website can still reach those archive pages, but will not see them in search results, because search engines like Google keep noindexed pages out of their index. The robots.txt file also has a disallow rule that can prevent search bots from crawling certain pages. However, the disallow directive is fundamentally different from the noindex directive, and the reasons for using each are very different as well.

In this post, we are going to discuss the difference between the noindex tag and the disallow directive in robots.txt, the role and use of each, and the circumstances under which you might use one or the other.

What is the Noindex tag?

The noindex directive can be applied as a meta tag in the page’s head or as an HTTP response header (X-Robots-Tag). Either way, its role is to direct search engines not to index the webpage. Search engine bots will still be able to crawl the page, but they will not index it.

The most common method of implementing a noindex tag is to apply the meta tag in the page’s head.

It looks like the following and gets added to the head section of your page’s HTML.

<meta name="robots" content="noindex">

When search crawlers see the above tag, they avoid indexing that page. Another, server-side method is to apply the directive via the X-Robots-Tag HTTP header, which on Apache can also be done through the .htaccess file. Both the meta tag and the HTTP header have the same effect, so you can pick whichever is more convenient for you.
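
For instance, here is a minimal .htaccess sketch, assuming Apache with mod_headers enabled, that adds a noindex header to every PDF on the site (the file pattern is only an illustration):

<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex"
</FilesMatch>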

Instead of a meta tag, you can return an X-Robots-Tag HTTP header with a value of noindex in the response. A response header is useful for non-HTML resources, such as PDFs, video files, and image files. Any of the robots meta directives can also be applied as an X-Robots-Tag header, and you can combine multiple directives within the HTTP response. Here’s an example of an HTTP response with an X-Robots-Tag header instructing search engines not to index a page:

HTTP/1.1 200 OK
(...)
X-Robots-Tag: noindex
(...)
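
And here is a sketch of combining directives; Google accepts them comma-separated in a single X-Robots-Tag header (or spread across several headers):

HTTP/1.1 200 OK
(...)
X-Robots-Tag: noindex, nofollow
(...)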

What is Robots.txt?

The robots.txt file is placed at the root of a website and is used to tell search engines which pages to crawl and which not to. You can use allow and disallow directives in the robots.txt file to tell search engines whether or not to crawl certain pages.

For example, a directive like the following in the robots.txt file will allow the search bot to crawl everything on the website:

User-Agent: *
Disallow: 

Whereas the following directive in the robots.txt file will block the entire website for search bots:

User-Agent: *
Disallow: /

Therefore, it is important to be cautious with the directives you put in robots.txt, since a single syntax error can prevent your entire website from being crawled and indexed. Robots.txt mistakes can prove costly, so double-check every disallow rule you add.
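
One common pitfall is that disallow paths are prefix matches. In the hypothetical sketch below, a rule like Disallow: /private would block /private/, /private-notes/, and /private.html alike, because they all start with the same prefix; if you only want to block the directory, keep the trailing slash:

User-Agent: *
# Blocks only URLs inside the /private/ directory, not /private-notes/
Disallow: /private/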

When to use the noindex tag

By now it should be clear that the scenarios in which you use the noindex tag and those in which you use the robots.txt disallow directive are generally different. Use the noindex tag in the following scenarios:

  • To prevent duplicate results from appearing in searches.

  • To prevent your author or other archive pages from appearing in searches and competing with post URLs.

  • To prevent useless pages from appearing in searches.

  • To prevent certain private pages or pages with sensitive information from appearing in searches.

  • To keep temporary pages that will be removed in the future from appearing in searches.

So, these are the main scenarios where you should use the noindex tag to inform search engines that a page must not be indexed. You can also keep track of the pages you have marked as noindex from inside Google Search Console. Go to the Pages section in Google Search Console, where you can check the URLs that are not indexed because of the noindex rule.

Additionally, you can use the URL Inspection tool to check whether a particular page is available to Google and whether it is marked as noindex. If you are using WordPress, it is pretty easy to apply the noindex tag using a dedicated SEO plugin like Yoast, Rank Math, AIOSEO, or SEO Framework.
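
For reference, the noindex, follow variant mentioned earlier looks like this in the page’s head; SEO plugins typically output something along these lines for archive pages you have chosen not to index:

<meta name="robots" content="noindex, follow">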

When to use the disallow directive in robots.txt

It is good practice to keep certain pages out of the reach of search bots, since it helps protect sensitive information from being exposed to the public. It also serves another important purpose: managing your crawl budget, which you do not want wasted on unnecessary URLs.

You can use allow and disallow directives inside robots.txt to allow or prevent the search bot from crawling specific pages. For example, you can prevent an entire directory from being crawled by search engines using the disallow rule, while using the allow directive to let them crawl an important page inside the same directory:

User-Agent: *
Disallow: /important-directory/
Allow: /important-directory/important-page/

So, while we block an entire directory on our website from being crawled with the disallow directive, we allow a page inside the same directory to be crawled using the allow directive.

In WordPress, the robots.txt rules generally look like the following:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

You can see that the wp-admin folder is blocked but the admin-ajax.php file inside the same directory is allowed to be crawled. The rest of the directory is off limits to crawlers.
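
You can extend these rules to save crawl budget on other unnecessary URLs. As a sketch, and assuming your theme uses WordPress’s default ?s= search parameter, the following also keeps crawlers out of internal search result pages:

User-agent: *
Disallow: /wp-admin/
# Internal search result pages (default ?s= parameter)
Disallow: /?s=
Allow: /wp-admin/admin-ajax.php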

So, by now you should have understood when to use the robots.txt disallow rule as opposed to the noindex tag.

You can manually add a robots.txt file to your WordPress root folder, or you can use SEO plugins like Yoast or Rank Math to generate a virtual robots.txt file.

Main differences between noindex and disallow

  • Noindex prevents indexing, whereas the disallow directive prevents crawling.

  • Noindex is the more reliable way to keep a page out of search results.

  • Noindex is a meta tag (or HTTP header) applied to the page itself, whereas the disallow directive goes in the robots.txt file in the root folder.

  • The disallow rule prevents crawling but may not prevent indexing, because external pages linking to the blocked URL can still get it indexed.

Please note that the noindex tag and the disallow rule play different roles in terms of SEO and should not be mistaken for one another. They must not be used together on the same page to prevent indexing: if you block a page from crawling, the bots cannot see the noindex tag, and the page might still end up getting indexed because some external page links to it.

Use the noindex meta tag to prevent indexing and the disallow directive to prevent crawling, but not both on the same page. If you want a page out of Google’s index, apply the noindex rule, make sure the page is not blocked in robots.txt, and wait for Google to crawl the page again.

Suggested Reading

How to fix Discovered Currently not Indexed

Learn about Google Policies Regarding Spam

How to Host Google Fonts Locally in Wordpress

Fix Crawled Currently Not Indexed

How to Diagnose and Recover from a Drop in Website Traffic/Impressions

How to enable mod_headers on Apache server Ubuntu 22.04

How to enable mod_expires in Apache Server Ubuntu 22.04