Google’s John Mueller answered a question on Reddit about why Google picks one web page over another when multiple pages have duplicate content, also explaining why Google sometimes appears to pick the wrong URL as the canonical.
Canonical URLs
The word canonical was historically used mostly in a religious sense, describing which writings or beliefs were recognized as authoritative. In the SEO community, it refers to the URL that is the true version of a web page when multiple pages share the same or similar content.
Google enables site owners and SEOs to provide a hint about which URL is the canonical through the use of an HTML attribute called rel=canonical. SEOs often refer to rel=canonical as an HTML element, but it's not. Rel=canonical is an attribute of the link element. An HTML element is a building block of a web page; an attribute is markup that modifies an element.
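As a simple illustration of that distinction (the URL below is a hypothetical example), the canonical hint is expressed as rel and href attributes on a link element in the page's head:

```html
<!-- The link element is the building block; rel="canonical" is the
     attribute that tells Google which URL to treat as the true page. -->
<link rel="canonical" href="https://www.example.com/original-page/" />
```

Because it is only a hint, Google may still choose a different canonical if other signals point elsewhere.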
Why Google Picks One URL Over Another
A person on Reddit asked Mueller to provide a deeper dive on the reasons why Google picks one URL over another.
They asked:
“Hey John, can I please ask you to go a little deeper on this? Let’s say I want to understand why Google thinks two pages are duplicate and it chooses one over the other and the reason is not really in plain sight. What can one do to better understand why a page is chosen over another if they cover different topics? Like, IDK, red panda and “regular” panda 🐼. TY!!”
Mueller answered with about nine different reasons why Google chooses one page over another, including the technical reasons why Google appears to get it wrong when in reality it's sometimes due to something that the site owner or SEO overlooked.
Here are the nine reasons he cited for canonical choices:
- Exact duplicate content: The pages are fully identical, leaving no meaningful signal to distinguish one URL from another.
- Substantial duplication in main content: A large portion of the primary content overlaps across pages, such as the same article appearing in multiple places.
- Too little unique main content relative to template content: The page's unique content is minimal, so repeated elements like navigation, menus, or layout dominate and make pages appear effectively the same.
- URL parameter patterns inferred as duplicates: When multiple parameterized URLs are known to return the same content, Google may generalize that pattern and treat similar parameter variations as duplicates.
- Mobile version used for comparison: Google may evaluate the mobile version instead of the desktop version, which can lead to duplication assessments that differ from what is manually checked.
- Googlebot-visible version used for evaluation: Canonical decisions are based on what Googlebot actually receives, not necessarily what users see.
- Serving Googlebot alternate or non-content pages: If Googlebot is shown bot challenges, pseudo-error pages, or other generic responses, those may match previously seen content and be treated as duplicates.
- Failure to render JavaScript content: When Google cannot render the page, it may rely on the base HTML shell, which can be identical across pages and trigger duplication.
- Ambiguity or misclassification in the system: In some cases, a URL may be treated as a duplicate simply because it appears "misplaced" or due to limitations in how the system interprets similarity.
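The parameter-pattern point can be made concrete with a small sketch. This is not Google's algorithm, just a hypothetical illustration of the generalization Mueller describes: if a site's `tmp` parameter is known never to change content, URLs differing only in `tmp` collapse to one key, while a parameter like `city` that does change content keeps the URLs distinct.

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

def normalize_url(url, ignorable_params=frozenset({"tmp"})):
    """Drop parameters known (for this hypothetical site) not to change
    content, so URLs that differ only in those parameters map to the
    same deduplication key."""
    parts = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in ignorable_params]
    return urlunparse(parts._replace(query=urlencode(sorted(kept))))

# /page?tmp=1234 and /page?tmp=9339 collapse to the same key,
# but the city parameter keeps Detroit and Chicago pages apart.
print(normalize_url("https://example.com/page?tmp=1234"))
print(normalize_url("https://example.com/page?tmp=1234&city=detroit"))
print(normalize_url("https://example.com/page?tmp=2123&city=chicago"))
```

The tricky part Mueller flags is exactly the choice of which parameters are safe to ignore: guess wrong about a parameter like `city` and genuinely different pages get folded together.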
Here’s Mueller’s complete answer:
“There is no tool that tells you why something was considered duplicate – over the years people often get a feel for it, but it’s not always obvious. Matt’s video “How does Google handle duplicate content?” is a good starter, even now.
Some of the reasons why things are considered duplicate are (these have all been mentioned in various places – duplicate content about duplicate content if you will :-)): exact duplicate (everything is duplicate), partial match (a large part is duplicate, for example, when you have the same post on two blogs; sometimes there’s also just not a lot of content to go on, for example if you have a giant menu and a tiny blog post), or – this is harder – when the URL looks like it would be duplicate based on the duplicates found elsewhere on the site (for example, if /page?tmp=1234 and /page?tmp=3458 are the same, probably /page?tmp=9339 is too — this can be tricky & end up wrong with multiple parameters, is /page?tmp=1234&city=detroit the same too? how about /page?tmp=2123&city=chicago ?).
Two reasons I’ve seen people get thrown off are: we use the mobile version (people generally check on desktop), and we use the version Googlebot sees (and if you show Googlebot a bot-challenge or some other pseudo-error-page, chances are we’ve seen that before and might consider it a duplicate). Also, we use the rendered version – but this means we need to be able to render your page if it’s using a JS framework for the content (if we can’t render it, we might take the bootstrap HTML page and, chances are it’ll be duplicate).
It happens that these systems aren’t perfect in picking duplicate content, sometimes it’s also just that the alternative URL feels obviously misplaced. Sometimes that settles down over time (as our systems recognize that things are really different), sometimes it doesn’t.
If it’s similar content then users can still find their way to it, so it’s generally not that terrible. It’s pretty rare that we end up escalating a wrong duplicate – over the years the teams have done a fantastic job with these systems; most of the weird ones are unproblematic, often it’s just some weird error page that’s hard to spot.”
Takeaway
Mueller offered a deep dive into the reasons why Google chooses canonicals. He described the process of choosing canonicals as a fuzzy sorting system built from overlapping signals, with Google comparing content, URL patterns, rendered output, and crawler-visible versions, while borderline classifications ("weird ones") are given a pass because they rarely pose a problem.
Featured Image by Shutterstock/Garun .Prdt