Duplicate Content Issues: Exact Duplicates & Near Duplicates

When faced with any type of duplicate content, search engines get confused about which version of a page (URL) to crawl, index and rank for a query in the SERPs. Although both issues usually stem from accidental technical errors and are not penalized, exact duplicates and near duplicates can cause a lot of damage if not handled appropriately.

To avoid tanking in search engine rankings and wasting your crawl budget and link equity, detect and fix these types of duplicate content issues as soon as possible. Acting early prevents negative consequences and boosts your organic traffic by pointing it the right way.

Here’s how to swiftly find and resolve them.

Exact Duplicates vs Near Duplicates

The two main categories duplicate content typically falls into are pretty much self-explanatory. Exact duplicates are two or more URLs serving identical content, while near duplicates are pages that are “nearly identical”: multiple versions of the same piece of content with only minor differences.

Contrary to popular belief, content doesn’t need to be an exact match to be perceived as duplicate — if it’s similar enough, it will be considered as such, even though some things may differ.

Duplicate Content Detection: Finding Exact & Near Duplicates

While both exact duplicate and near-duplicate content can cause search ranking and visibility issues, each requires its own detection and handling approach.

Detecting Exact Duplicates

Pages that are exact duplicates (often the result of plagiarism, syndicated or scraped content, or mirroring) are easy to identify with standard checksumming techniques. You can use the free versions of various SEO audit tools such as Screaming Frog and Siteliner to crawl your website and detect any exact duplicate pages. The number of URLs or results available in the free mode is usually limited, but you can always sign up for the premium versions if needed.
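If you'd rather script a quick check yourself, here is a minimal sketch of the same checksumming idea in Python, assuming a handful of hypothetical URLs: each page body is downloaded and hashed, and any URLs that share a checksum are exact duplicates.

```python
# Minimal exact-duplicate check via checksumming (illustrative sketch).
# The URLs below are hypothetical placeholders; swap in pages from your own crawl.
import hashlib
from collections import defaultdict
from urllib.request import urlopen

urls = [
    "https://example.com/page",
    "https://example.com/page?ref=newsletter",
    "https://example.com/other-page",
]

def checksum(url: str) -> str:
    """Download the page and return a SHA-256 digest of its raw body."""
    with urlopen(url) as response:
        body = response.read()
    return hashlib.sha256(body).hexdigest()

# Group URLs by checksum; any group with more than one URL is an exact duplicate set.
groups = defaultdict(list)
for url in urls:
    groups[checksum(url)].append(url)

for digest, members in groups.items():
    if len(members) > 1:
        print(f"Exact duplicates ({digest[:12]}): {members}")
```

A full audit tool typically normalizes the markup (whitespace, tracking parameters, boilerplate) before hashing; this sketch hashes the raw response for simplicity.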

Detecting Near Duplicates

Identifying near duplicates is a bit trickier. Most tools, such as Screaming Frog, look only for exact duplicates by default, so you’ll probably need to enable near-duplicate checks manually. You can lower the similarity threshold (usually set to 90% by default) if you want to catch content with a lower percentage of similarity. Detecting near-duplicate content also requires running a crawl analysis before the reports are populated with usable data.
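To make the similarity threshold concrete, here is a rough sketch using Python’s built-in difflib, with two made-up text snippets and a 0.90 cut-off mirroring the 90% default mentioned above; it illustrates the idea rather than replacing a crawler’s near-duplicate report.

```python
# Rough near-duplicate check using a text similarity ratio (illustrative sketch).
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.90  # mirrors the 90% default used by many audit tools

def similarity(text_a: str, text_b: str) -> float:
    """Return a 0.0-1.0 similarity ratio between two extracted page texts."""
    return SequenceMatcher(None, text_a, text_b).ratio()

# Made-up page copy with only minor wording differences.
page_a = "Our blue widget ships worldwide and includes a two-year warranty."
page_b = "Our blue widget ships worldwide and includes a 2-year warranty."

score = similarity(page_a, page_b)
if score >= SIMILARITY_THRESHOLD:
    print(f"Near duplicates (similarity {score:.0%})")
else:
    print(f"Distinct enough (similarity {score:.0%})")
```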

Another thing to keep in mind is that data is only pulled from indexable URLs, so pages that are canonicalized to another URL won’t be included in the reports, even if they are exact or near duplicates.

Resolving Exact Duplicate & Near Duplicate Content Issues

The first step to fixing these types of duplicate content issues is to decide which version of the page you want to keep – opting for the better-performing one is considered best practice.

  • 301 Redirects

You can retire the duplicate pages and still have them boost the SEO of the primary page you chose to keep: consolidate exact or near duplicates into a single URL with permanent 301 redirects so that their link equity is passed on to it.

  • Rel Canonical

Another way to consolidate duplicate content is to place a rel=canonical link inside the <head> of the lower-performing exact or near duplicate page. It tells search engines that all of its link juice and ranking power should be attributed to the original (higher-performing) page, which carries a self-referential canonical tag. It’s similar to a 301 redirect, but easier to implement.

  • De-Indexing URLs

You can remove URLs that you don’t want indexed from your XML sitemap altogether, or manually set the meta robots tag of the pages you wish to exclude from search results to “noindex, follow”. Another helpful short-term strategy is to mark duplicate URLs as passive in Google Search Console’s parameter settings. Crawlers will ignore them and they won’t show up in search; however, the URL that you are keeping also won’t receive any SEO benefits from them. Whichever option you choose, it’s worth spot-checking that the fix is actually in place; one way to do that is sketched below.
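Here is a minimal verification sketch, assuming made-up URLs and the third-party requests package: for each duplicate it reports whether the URL 301-redirects to the primary page, declares a canonical pointing at it, or carries a noindex robots meta tag. The HTML patterns are deliberately naive and only meant for quick spot checks.

```python
# Sketch: audit which consolidation remedy is in place for each duplicate URL.
# URLs are hypothetical placeholders; requires the third-party "requests" package.
import re
import requests

PRIMARY = "https://example.com/primary-page"
DUPLICATES = [
    "https://example.com/primary-page?utm_source=newsletter",
    "https://example.com/old-copy-of-primary-page",
]

# Naive patterns for a sketch; a real audit should parse the HTML properly.
CANONICAL_RE = re.compile(
    r'<link[^>]+rel=["\']canonical["\'][^>]+href=["\']([^"\']+)["\']', re.I)
ROBOTS_RE = re.compile(
    r'<meta[^>]+name=["\']robots["\'][^>]+content=["\']([^"\']+)["\']', re.I)

for url in DUPLICATES:
    response = requests.get(url, allow_redirects=False, timeout=10)

    # 1. Permanent redirect pointing at the primary page?
    if response.status_code == 301 and response.headers.get("Location") == PRIMARY:
        print(f"{url} -> 301 to primary")
        continue

    html = response.text

    # 2. rel=canonical in the <head> pointing at the primary page?
    canonical = CANONICAL_RE.search(html)
    if canonical and canonical.group(1) == PRIMARY:
        print(f"{url} -> canonicalized to primary")
        continue

    # 3. Meta robots noindex keeping the page out of search results?
    robots = ROBOTS_RE.search(html)
    if robots and "noindex" in robots.group(1).lower():
        print(f"{url} -> noindex")
        continue

    print(f"{url} -> no remedy detected")
```

The audit tools mentioned earlier surface the same information at scale; a short script like this is simply a convenient way to confirm a handful of URLs after a fix goes live.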
