One of the most talked about and disputed topics in search engine optimization is how search engines handle duplicate content. It’s exciting when we find new research from a Google or Yahoo that discusses the topic.
If you have a Web page, and someone scrapes your site and copies your page, but with ads added to it, how does a search engine find out? Does the search engine even care? If you write an article, and syndicate it, with links back to your web page, will there be problems?
News wire services send out copies of the same story to hundreds of newspapers, many of which are online. The titles of those stories may change from paper to paper, and the stories may be edited – most commonly shortened rather than added to, but some papers will include more information in some stories after doing some investigative reporting. Is there a good reason to keep some of those near duplicates and index them?
Frankly, we don’t know the answers to these questions for certain. We don’t even know how well search engines can identify near duplicate content. Or maybe we do:
For web crawling, issues like freshness and efficient resource usage have previously been addressed. However, the problem of elimination of near-duplicate web documents in a generic crawl has not received attention.
Near Duplicate Content Penalties?
Exact copies of Web pages are easy for search engines to identify (at least, that’s what they tell us). It’s more difficult for them to recognize Web pages that are almost duplicates of each other – for instance, pages where an advertisement appears on one version, and not on another that contains the same main content.
If a search engine could identify pages that are extremely close near duplicates of each other, it might take steps such as no longer crawling one version and ignoring the links from it to other pages.
Most algorithms for near-duplicate detection run in batchmode over the entire collection of documents. For web crawling, an online algorithm is necessary because the decision to ignore the hyper-links in a recently-crawled page has to be made quickly.
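One way to picture the online decision described in that quote is a crawler that keeps the fingerprints of pages it has already seen and checks each newly crawled page against them before following its links. The sketch below is only an illustration under stated assumptions, not the method from the paper: the class name `SeenFingerprints`, the method names, and the default threshold of 3 differing bits are all hypothetical, fingerprints are assumed to be plain integers computed elsewhere, and the linear scan stands in for whatever faster lookup a real crawler would need.

```python
class SeenFingerprints:
    """Hypothetical store of fingerprints for pages crawled so far."""

    def __init__(self, max_distance=3):
        # Assumption: two pages count as near duplicates when their
        # fingerprints differ in at most `max_distance` bit positions.
        self.max_distance = max_distance
        self._seen = []

    def is_near_duplicate(self, fingerprint):
        # Linear scan over everything seen so far. Simple, but far too
        # slow at Web scale; it is here only to show the decision.
        for old in self._seen:
            differing_bits = bin(fingerprint ^ old).count("1")
            if differing_bits <= self.max_distance:
                return True
        return False

    def should_follow_links(self, fingerprint):
        # Decide as soon as the page is crawled: skip the links of a
        # near duplicate, otherwise remember the page and follow them.
        if self.is_near_duplicate(fingerprint):
            return False
        self._seen.append(fingerprint)
        return True
```

A crawler loop would call `should_follow_links` once per fetched page, which is what makes the check "online": the answer is needed before the next page is requested, not after the whole collection has been gathered.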
Testing Near Duplicate Content Detection
A new paper from Google researchers (the one where the quotes above come from) explores the concept of near duplicate content, and describes a study which attempted to identify near duplicates amongst 8 billion pages.
The paper is Detecting Near Duplicates for Web Crawling (PDF), and it uses a fingerprinting technique to locate duplicates on the Web. The fingerprinting technique used in the study was developed by Moses Samson Charikar, who is a researcher at Princeton University.
Dr. Charikar was a Google researcher, and a patent he worked on for Google on identifying duplicates and near duplicates was granted in early January of this year – Methods and apparatus for estimating similarity. One of Dr. Charikar’s methods used by the Google researchers is described in a paper he wrote at Princeton – Similarity Estimation Techniques from Rounding Algorithms (PDF).
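The fingerprinting idea from Dr. Charikar’s paper is commonly called simhash, and a rough sketch may help show why it suits near duplicates. Every token in a page casts a vote on each bit of the fingerprint, so two pages that share most of their tokens tend to end up with fingerprints that differ in only a few bits. What follows is a minimal illustration, not Google’s implementation: the word-level tokenizer, the use of MD5 as the per-token hash, the unweighted votes, and the 64-bit fingerprint size are all assumptions made for the example.

```python
import hashlib
import re

FINGERPRINT_BITS = 64  # assumed size, chosen for illustration


def tokens(text):
    # Assumption: lowercase word tokens are the features of a page.
    return re.findall(r"\w+", text.lower())


def hash64(token):
    # Assumption: the first 8 bytes of MD5 serve as a 64-bit token hash.
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(text):
    # Each token votes +1 on the bits set in its hash and -1 on the rest.
    votes = [0] * FINGERPRINT_BITS
    for tok in tokens(text):
        h = hash64(tok)
        for bit in range(FINGERPRINT_BITS):
            votes[bit] += 1 if (h >> bit) & 1 else -1
    # The fingerprint keeps a 1 wherever the positive votes won.
    fingerprint = 0
    for bit, total in enumerate(votes):
        if total > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a, b):
    # Number of bit positions in which two fingerprints differ.
    return bin(a ^ b).count("1")
```

To compare two pages, compute `simhash` for each and take the `hamming_distance` between the results. Identical text always gives a distance of 0; how small the distance must be before two pages are treated as near duplicates is a tuning decision that this sketch leaves open.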
Why Detect Duplicates and Near Duplicates?
The authors of this paper state that they are looking for near duplicates in order to make crawling faster: getting rid of duplicates during the crawling process saves bandwidth and reduces the need for storage space.
Along the way, they provide some interesting discussion of issues around duplicate content. One topic they tackle is the value of detecting duplicates and near duplicates in different computer systems. They also list a number of different methods that have been used to identify duplicate documents, as well as a number of possible future steps to make the detection of near duplicate documents easier.
Conclusion
A post at Search Engine Land from a couple of days ago, The Duplicate Content Penalty Myth, attacks the idea that search engines might penalize pages that contain duplicate content. One of the issues that the article doesn’t address is that finding documents that are near duplicates of each other may be a more difficult task than we realize.
What neither that article nor this new Google paper addresses is how a search engine would determine which page is the duplicate or near duplicate and which page is the original if it were to decide to stop crawling one and stop following links from it.





If there is such a thing as a duplicate content penalty or filter or whatsoever, then how can a huge site like, for example, “booking.com” be indexed with more than 20 extensions like: bookings.nl, bookings.net, bookings.org, bookings.de etc etc.
All these sites are indexed and are ranking high while having 100% exactly the same content.
So if this filter/penalty were used, then this is one of the sites that for sure would be penalized.
Please give your comment
Comment by Kees J — March 19, 2007 @ 10:34 pm
First of all, this post doesn’t actually say that there is a duplicate content penalty or filter – it simply talks about the ways in which search engines attempt to identify duplicate content.
Second, those sites you mentioned are different in far more ways than just having different top level domains. They are in different languages: although their essential substance may be the same, they are significantly different content because of the language issue.
I’d suggest going into the forums to search for our past discussions on duplicate content or start a new one!
Comment by Joe Dolson — March 20, 2007 @ 1:28 am
Hi Kees,
I think those are excellent questions.
What I was trying to point out with my post is that maybe we give too much credit to the search engines when pages are near duplicates as opposed to exact duplicates, in thinking that they might penalize them for duplicate content. The paper does explore different ways to identify duplicate content, but notes how difficult detection of near duplicates can be.
You probably know the difference between a penalty and a filter. A penalty would be when one site wouldn’t rank as high as it should because there was another site that duplicated its content. A filter is when a duplicate doesn’t appear in search results because another copy is showing instead.
It’s possible that if there were two sites, one with a .com, and another with a .co.uk, that if I searched in the US I might see the .com version, and if I searched in the UK, I might see the .co.uk version. But if their content was the same, I wouldn’t see both appearing in the search results. Neither has been penalized in that instance, but both have been filtered.
With booking.com and bookings.net, when I search for “booking” or “bookings” I’m seeing the .com version, and the other is being filtered.
As for the same content in different languages, I believe that Vanessa Fox and Adam Lasnik of Google have both said that if the content is in different languages, then Google doesn’t perceive the sites as duplicates.
On my search for “bookings” I get the .com version first, and then I see the .nl version right after it. It isn’t being filtered or penalized, just as Vanessa and Adam suggested.
I’d be happy to discuss duplicate content in a new thread at the forums if you would like. I have more new material to discuss on the topic that I think is pretty interesting, including how search engines might treat sites that use templates.
Comment by Bill — March 20, 2007 @ 8:02 am