Google’s John Mueller answered a question about the curious circumstance of Search Console reporting thousands of URLs as indexed despite being blocked by robots.txt. Mueller helped explain how this happens and what to do about it.
Content Indexed Despite Being Blocked By Robots.txt
A Redditor asked for advice because Google Search Console was reporting more than 51,000 pages under the status “Indexed, though blocked by robots.txt.” The affected URLs were primarily WooCommerce product URLs containing add-to-cart URL parameters like “?add-to-cart=”.
Because the issue appeared suddenly, the site owner questioned whether the robots.txt rules themselves were responsible for creating the problem. They also wanted to know whether removing the rules would help Google process the canonical signals and eliminate the reported URLs from Search Console.
The person asked:
“I have WooCommerce site and suddenly since past month we are facing this issue: “Indexed, though blocked by robots.txt”
there are total “Affected pages 51K pages”
in the end of url I see mostly ?page&post_type=product&product=slug&add-to-cart=98063,
After inspecting those urls I found they have index tag setup and robots.txt had
* Disallow: /*?add-to-cart=
* Disallow: /*?*add-to-cart=
I removed those two rules from robots.txt and hoping those pages fixed cause they have canonical set to correct product, will that fix issue?
or should I also setup noindex rules? will that cause us our crawl budget? it is pretty big woocommerce site, let me know guys your thoughts if someone has experience fixing such issue? and what will be the right method without preventing our SEO or functionality loss.”
Google Says Add-To-Cart URLs Don’t Need To Be Indexed
Mueller responded that the add-to-cart URLs do not need to be indexed and that blocking them through robots.txt is an acceptable approach.
He explained that even when Google reports those URLs as indexed, they are unlikely to appear in normal search results because they are blocked by robots.txt. According to Mueller, users generally do not search for those URLs directly, making them poor candidates for search visibility.
John Mueller responded:
“You don’t need the add-to-cart URLs indexed. Blocking them with robots.txt is fine. Even if they get “indexed” since they’re blocked by robots.txt, it’s unlikely that they’ll be shown in search (unless you do specific queries for those URLs, which users don’t do).”
I’m kind of on the fence about Mueller’s statement that the robots.txt block makes it “unlikely” that the URLs will be shown in Search. The reason is that robots.txt does not prevent a web page from showing in Google Search; it only prevents Googlebot from crawling those pages. So technically, that’s not quite correct, and I’m a little surprised Mueller would say that.
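To make the crawl-blocking part concrete, here is a minimal Python sketch of how the two Disallow rules quoted in the question match URLs. It assumes Google-style wildcard matching, where “*” matches any run of characters and a trailing “$” anchors the end. The function names and the leading “/” on the sample URL are my own assumptions for illustration, and the sketch ignores Allow rules and rule precedence, so treat it as a rough check rather than a full robots.txt parser.

```python
import re

# The two rules quoted in the Reddit post.
DISALLOW_RULES = ["/*?add-to-cart=", "/*?*add-to-cart="]


def rule_to_regex(rule: str) -> re.Pattern:
    """Translate a wildcard robots.txt path rule into a compiled regex.

    '*' matches any run of characters; a trailing '$' anchors the end.
    """
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces and join them with '.*' for each wildcard.
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_blocked(path_and_query: str) -> bool:
    """Return True if any Disallow rule matches the path plus query string."""
    return any(rule_to_regex(rule).match(path_and_query) for rule in DISALLOW_RULES)


if __name__ == "__main__":
    samples = [
        "/?page&post_type=product&product=slug&add-to-cart=98063",
        "/product/slug/?add-to-cart=98063",
        "/product/slug/",
    ]
    for url in samples:
        print(url, "blocked" if is_blocked(url) else "crawlable")
```

A match here only means Googlebot is asked not to fetch the URL. It says nothing about whether the URL can still be indexed from links pointing to it, which is the distinction at issue above.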
Noindex Is Probably Not A Solution
One of the Redditors who responded to that question suggested adding a noindex robots meta tag to the parameterized URLs. But that may not be a viable solution, because the pages with and without the URL parameters are essentially the same thing: they’re rendered using the same template for a specific page. So unless WooCommerce treats them differently and can render the parameterized URLs with a noindex and the regular page without it, that’s not a real solution. A noindex also only works if Googlebot is able to crawl the page and see it, so it cannot be combined with a robots.txt block on the same URLs.
Why Google Reports Indexed URLs That It Can’t Crawl
Another Redditor offered a possible explanation for why so many URLs appeared in Search Console. They suggested that Google likely discovered links containing the add-to-cart parameters somewhere on the site and added those URLs to its systems.
My suggestion for the person who originally asked that question is to crawl the website with Screaming Frog, review the internal linking to identify where those pages are being linked from, and then take some action, like removing those links or adding a rel="nofollow" link attribute to them.
Likely, the best solution is to use the robots.txt block to prevent crawling, as long as it’s understood that this is all it does. If the person wants to be extra sure, they can also identify where those links exist and then add the nofollow link attribute as an extra layer, a hint to Google. Nofollow is not a directive, but it is a strong hint.
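As a rough illustration of that link audit, the Python sketch below scans a chunk of HTML for anchor tags whose href carries an add-to-cart parameter and notes whether each one already has a nofollow. It uses only the standard library. The class and function names and the sample markup are hypothetical, and this is not a replacement for a full site crawl with a dedicated tool.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlparse


class AddToCartLinkFinder(HTMLParser):
    """Collect <a> tags whose href contains an add-to-cart query parameter."""

    def __init__(self):
        super().__init__()
        self.found = []  # list of (href, has_nofollow) tuples

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href") or ""
        # keep_blank_values keeps parameters that have no value, such as "?page".
        query = parse_qs(urlparse(href).query, keep_blank_values=True)
        if "add-to-cart" in query:
            rel_tokens = (attr.get("rel") or "").lower().split()
            self.found.append((href, "nofollow" in rel_tokens))


def find_add_to_cart_links(html: str):
    """Return (href, has_nofollow) for every add-to-cart link in the HTML."""
    finder = AddToCartLinkFinder()
    finder.feed(html)
    return finder.found


if __name__ == "__main__":
    # Hypothetical markup standing in for a product listing page.
    sample = (
        '<a href="/shop/?add-to-cart=98063">Add to cart</a>'
        '<a rel="nofollow" href="/product/slug/?add-to-cart=5">Add to cart</a>'
        '<a href="/product/slug/">View product</a>'
    )
    for href, has_nofollow in find_add_to_cart_links(sample):
        print(href, "nofollow" if has_nofollow else "followed")
```

Running something like this over the templates that output product listings would show which add-to-cart links are still exposed without a nofollow.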
Search Console Warnings Don’t Always Indicate A Search Problem
One of the recurring challenges with Search Console reports is that they can expose technical conditions that look distressing but actually have little to no effect on search performance. For example, the 404 error reports are useful for a variety of reasons, but many times a 404 server response is the right response, and it’s not really an “error” that needs fixing.
Takeaway
Mueller’s response reinforces the takeaway that not every Search Console warning requires action. In this specific case, though, there may be something to fix in the form of internal links to webpages that use the shopping cart URL parameters. If those links are absolutely necessary, then using a rel="nofollow" link attribute will give Google a strong hint not to follow them. The joy of technical SEO!
Featured Image by Shutterstock/Orange Line Media