Google’s John Mueller answered a question about Search Console and 404 error reporting, suggesting that repeated crawling of pages with a 404 status code is a positive signal.
404 Status Code
The 404 status code, often referred to as an error code, has long confused many site owners and SEOs because the word “error” implies that something is broken and needs to be fixed. But that is not the case.
404 is simply a status code that a server sends in response to a browser’s request for a page. 404 is a message that communicates that the requested page was not found. The only thing in error is the request itself because the page does not exist.
Although typically referred to as a 404 Error, technically the formal name is 404 Not Found. That name accurately reflects the meaning of the 404 status code: the requested page was not found.
Screenshot Of The Official Web Standard For The 404 Status Code
Google Keeps Crawling 404 Pages
Someone on Reddit posted that Google Search Console keeps reporting pages that no longer exist as discovered via their sitemap, even though the sitemap no longer lists those pages.
The person claims that Search Console is crawling the missing pages, but it’s really Googlebot that’s crawling them; Search Console is merely reporting the failed crawls.
They’re concerned about wasted crawl budget and want to know if they should send a 410 response code instead.
They wrote:
“Google Search Console is still crawling a bunch of non-existent pages that return 404. In the Page Inspection tool and Crawl Stats, it says they are “discovered via” my page-sitemap.xml.
The problem:
When I open the actual page-sitemap.xml in the browser right now, none of those 404 URLs are in it.
The sitemap only contains 21 good, live pages.
…I don’t want to delete or stop submitting the sitemap because it’s clean and only points to good pages. But these repeated crawls are wasting crawl budget.
Has anyone run into this before?
Does Google eventually stop on its own?
Should I switch the 404s to 410 Gone?
Or is there another way to tell GSC “hey, these are gone forever”?”
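The manual check described in the post (opening the sitemap and confirming that none of the reported 404 URLs are listed) can also be done with a short script. The following is only a sketch: the sample sitemap, the example.com URLs, and the function names are hypothetical and do not come from the Reddit thread. It assumes a standard sitemap that uses the sitemaps.org namespace.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard XML sitemaps (sitemaps.org protocol).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content, for illustration only.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""


def urls_in_sitemap(sitemap_xml):
    """Return the set of URLs listed in <loc> elements of a sitemap."""
    root = ET.fromstring(sitemap_xml)
    return {
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    }


def still_listed(sitemap_xml, reported_404s):
    """Return the reported 404 URLs that the sitemap still lists."""
    listed = urls_in_sitemap(sitemap_xml)
    return [url for url in reported_404s if url in listed]
```

If `still_listed` returns an empty list for the URLs shown in the Search Console report, the sitemap is clean, and the "discovered via" label most likely reflects where Googlebot first found those URLs rather than the sitemap's current contents.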
About Google’s 404 Page Crawls
Google has a longstanding practice of crawling 404 pages just in case those pages were removed by accident and have been restored. As you’ll see in a moment, Google’s John Mueller suggests that repeated crawling of 404 pages may mean that Google’s systems regard the site’s content in a positive light.
About 404 Page Not Found Response
The official web standard definition of the 404 status code is that the requested resource was not found, and that is it, nothing more. This response does not indicate that the page is never returning. It simply means that the requested page was not found.
About 410 Gone Response
The official web standard for the 410 status code says that the resource is gone and that this condition is likely permanent. The purpose of the response is to communicate that the resource is intentionally gone and that any links to it should be removed.
Google Essentially Handles 404 And 410 The Same
Technically, if a web page is permanently gone and never coming back, 410 is the correct server message to send in response to requests for the missing page. In practice, Google treats the 410 response virtually the same as it does the 404 server response. As with 404 responses, Google’s crawlers may still return to check whether a page that returned a 410 is still gone.
Googlers have consistently said that the 410 server response is slightly faster at purging a page from Google’s index.
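To make the difference between the two responses concrete, here is a minimal sketch of a server that returns 410 for pages that were removed on purpose and 404 for everything else it does not recognize. It uses only Python’s standard library. The paths in `LIVE_PAGES` and `REMOVED_FOR_GOOD`, and the names `status_for`, `Handler`, and `run`, are assumptions made up for this illustration; a real site would normally configure this in its web server or CMS.

```python
from http import HTTPStatus
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical URL sets, for illustration only.
LIVE_PAGES = {"/", "/about", "/contact"}
REMOVED_FOR_GOOD = {"/old-promo", "/discontinued-product"}


def status_for(path):
    """Pick the HTTP status for a requested path."""
    if path in LIVE_PAGES:
        return HTTPStatus.OK
    if path in REMOVED_FOR_GOOD:
        # 410 Gone: removed on purpose and likely permanently.
        return HTTPStatus.GONE
    # 404 Not Found: says nothing about temporary versus permanent.
    return HTTPStatus.NOT_FOUND


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Ignore any query string when matching the path.
        path = self.path.split("?", 1)[0]
        status = status_for(path)
        body = f"{status.value} {status.phrase}\n".encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def run(port=8000):
    """Start the sketch server on localhost (blocks until interrupted)."""
    HTTPServer(("127.0.0.1", port), Handler).serve_forever()
```

Calling `run()` starts the server locally. Requesting `/old-promo` would return 410 Gone, while a path the server has never known about would return 404 Not Found. Based on Mueller’s answer, either response is fine, and switching from one to the other should not be expected to stop the recrawling.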
Google Confirms Facts About 404 And 410 Response Codes
Google’s Mueller responded with a short but information-packed answer that explained that 404s reported in Search Console aren’t an issue that needs to be fixed, that sending a 410 response won’t make a difference in Search Console 404 reporting, and that an abundance of URLs in that report can be seen in a positive light.
Mueller responded:
“These don’t cause problems, so I’d just let them be. They’ll be recrawled for potentially a long time, a 410 won’t change that. In a way, this means Google would be ok with picking up more content from your site.”
Misunderstandings About 4XX Server Responses
The discussion on Reddit continued. The moderator of the r/SEO subreddit suggested that Search Console reports the URL as discovered in the sitemap because that is where Googlebot originally found it, which sounds reasonable.
Where the moderator got it wrong is in explaining what the 404 response code means.
The moderator incorrectly explained:
“404 essentially means – page broken, we’ll fix it soon, check back: and that’s what Google is doing – checking back to see if you fixed it.”
The moderator makes two errors in their response.
1. 404 Means Page Not Found
The 404 status code only means that the page was not found, period. Don’t believe me? Here is the official web standard for the 404 status code:
“The 404 (Not Found) status code indicates that the origin server did not find a current representation for the target resource or is not willing to disclose that one exists. A 404 status code does not indicate whether this lack of representation is temporary or permanent…”
2. 404 Is Not An Error That Needs Fixing
People commonly refer to the 404 status code as an error response. It is called an error because the browser or crawler requested a URL that does not exist, which means that the request was the error, not that the page needs fixing. The moderator’s statement that “404 essentially means – page broken” is 100% incorrect.
Furthermore, the Reddit moderator was incorrect to insist that Google is “checking back to see if you fixed it.” Google is checking back to see if the page went missing by accident, but that does not mean that the 404 is something that needs fixing. Most of the time, a page is supposed to be gone for a reason, and Google recommends serving a 404 response code for those times.
This Is Not New
This isn’t a matter of the Reddit moderator’s information being out of date. This has always been the case with Google, which generally follows the official web standards.
Google’s Matt Cutts explained how Google handles 404s and why in a 2014 video:
“It turns out webmasters shoot themselves in the foot pretty often. Pages go missing, people misconfigure sites, sites go down, people block Googlebot by accident, people block regular users by accident. So if you look at the entire web, the crawl team has to design to be robust against that.
So with 404s… we are going to protect that page for twenty four hours in the crawling system. So we sort of wait, and we say, well, maybe that was a transient 404. Maybe it wasn’t really intended to be a page not found. And so in the crawling system it’ll be protected for twenty four hours.
…Now, don’t take this too much the wrong way, we’ll still go back and recheck and make sure, are those pages really gone or maybe the pages have come back alive again.
…And so if a page is gone, it’s fine to serve a 404. If you know it’s gone for real, it’s fine to serve a 410.
But we’ll design our crawling system to try to be robust. But if your site goes down, or if you get hacked or whatever, that we try to make sure that we can still find the good content whenever it’s available.”
The Takeaways
- Googlebot repeatedly crawling 404 pages can be seen as a positive signal that Google likes your content.
- A 404 status code does not mean that a page is in error; it means that a page was not found.
- A 404 status code does not mean that something needs fixing. It only means that a requested page was not found.
- There’s nothing wrong with serving a 404 response code; Google recommends it.
- Search Console shows 404 responses so that a site owner can decide whether or not those pages are intentionally gone.
Featured Image by Shutterstock/Jack_the_sparow