One of the most talked-about and disputed topics in search engine optimization is how search engines handle duplicate content. It’s exciting when we find new research from Google or Yahoo that discusses the topic.

If you have a Web page, and someone scrapes your site and copies that page, but with ads added to it, how does a search engine find out? Does the search engine even care? If you write an article, and syndicate it, with links back to your web page, will there be problems?

News wire services send out copies of the same story to hundreds of newspapers, many of which are online. The titles of those stories may change from paper to paper, and the stories may be edited. Most commonly they are shortened rather than added to, but some papers will include more information in some stories after doing some investigative reporting. Is there a good reason to keep some of those near duplicates and index them?

Frankly, we don’t know the answers to these questions for certain. We don’t even know how well search engines can identify near duplicate content. Or maybe we do:

For web crawling, issues like freshness and efficient resource usage have previously been addressed. However, the problem of elimination of near-duplicate web documents in a generic crawl has not received attention.

Near Duplicate Content Penalties?

Exact copies of Web pages are easy for search engines to identify (at least, that’s what they tell us). It’s more difficult for them to recognize Web pages that are almost duplicates of each other – for instance, pages where an advertisement appears on one version, and not on another that contains the same main content.

If a search engine could identify pages that are very close near duplicates of each other, it might take steps such as no longer crawling one version and ignoring the links from it to other pages.

Most algorithms for near-duplicate detection run in batch-mode over the entire collection of documents. For web crawling, an online algorithm is necessary because the decision to ignore the hyper-links in a recently-crawled page has to be made quickly.

Testing Near Duplicate Content Detection

A new paper from Google researchers (the one where the quotes above come from) explores the concept of near duplicate content, and describes a study which attempted to identify near duplicates amongst 8 billion pages.

The paper is Detecting Near Duplicates for Web Crawling (pdf), and it uses a fingerprinting technique to locate duplicates on the Web. The fingerprinting technique used in the paper’s study was developed by Moses Samson Charikar, a researcher at Princeton University.

Dr. Charikar was a Google researcher, and a patent he worked on for Google on identifying duplicates and near duplicates was granted in early January of this year: Methods and apparatus for estimating similarity. One of Dr. Charikar’s methods used by the Google researchers is described in a paper he wrote at Princeton: Similarity Estimation Techniques from Rounding Algorithms (pdf).
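
To make the idea of a fingerprint a little more concrete, here is a rough Python sketch in the spirit of the bit-voting hash that Dr. Charikar’s paper describes (often called simhash). It is not the code from the Google paper. The function names, the plain word tokenizer, the use of MD5, the 64-bit width, and the equal weight given to every word are all assumptions made for illustration.

```python
import hashlib
import re

FINGERPRINT_BITS = 64  # assumed fingerprint width for this sketch


def tokenize(text):
    """Lowercase the text and split it into word tokens (an assumed feature choice)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_hash(token):
    """Map a token to a stable 64-bit integer using MD5 (an arbitrary choice here)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def fingerprint(text):
    """Build a fingerprint in which each bit is a majority vote over the token hashes."""
    votes = [0] * FINGERPRINT_BITS
    for token in tokenize(text):
        h = token_hash(token)
        for bit in range(FINGERPRINT_BITS):
            # A set bit in the token hash votes +1 for that position, a clear bit votes -1.
            votes[bit] += 1 if (h >> bit) & 1 else -1
    result = 0
    for bit, vote in enumerate(votes):
        if vote > 0:
            result |= 1 << bit
    return result


def hamming_distance(a, b):
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    original = "Wire services send the same story to hundreds of papers."
    with_ad = original + " Advertisement: buy our widgets today!"
    print(hamming_distance(fingerprint(original), fingerprint(with_ad)))
```

Because every word votes on every bit, two pages that share most of their words tend to agree on most bits, so a small Hamming distance between fingerprints suggests a near duplicate. How small is small enough is a tuning decision that this sketch does not settle.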

Why Detect Duplicates and Near Duplicates?

The authors of this paper state that they are looking for near duplicates in order to reduce the time it takes to crawl pages: getting rid of duplicates during the crawling process saves bandwidth and reduces the need for storage space.
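
Dropping duplicates during a crawl might look something like the following Python sketch. It assumes each page has already been reduced to an integer fingerprint, and that two pages count as near duplicates when their fingerprints differ in only a few bits. The class name, the method names, and the threshold of 3 bits are hypothetical choices for illustration, and the simple scan over every stored fingerprint would be far too slow at the scale of billions of pages.

```python
class NearDuplicateFilter:
    """Remembers fingerprints seen so far and rejects new ones that are too close."""

    def __init__(self, max_distance=3):
        self.max_distance = max_distance  # assumed threshold, in differing bits
        self.seen = []

    def is_near_duplicate(self, fp):
        """Return True if fp is within max_distance bits of any stored fingerprint."""
        return any(
            bin(fp ^ old).count("1") <= self.max_distance for old in self.seen
        )

    def admit(self, fp):
        """Store fp and return True if it is new; return False if it is a near duplicate."""
        if self.is_near_duplicate(fp):
            return False
        self.seen.append(fp)
        return True


if __name__ == "__main__":
    crawl_filter = NearDuplicateFilter()
    for page_fp in (0b0000, 0b0111, 0b1111):
        action = "keep" if crawl_filter.admit(page_fp) else "skip"
        print(bin(page_fp), action)
```

A crawler using a filter like this would only store a page, and follow its links, when admit returns True. Note that whichever copy is crawled first is the one that gets kept.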

Along the way, they provide some interesting discussion of issues around duplicate content. One topic they tackle is the value of detecting duplicates and near duplicates in different kinds of computer systems. They also list a number of different ways that have been used to identify duplicate documents, and a number of possible future steps to make the detection of near duplicate documents easier.

Conclusion

A post at Search Engine Land from a couple of days ago, The Duplicate Content Penalty Myth, attacks the idea that search engines might penalize pages that contain duplicate content. One of the issues that the article doesn’t address is that finding documents that are near duplicates of each other may be a more difficult task than we realize.

What neither that article nor this new Google paper addresses is how a search engine would determine which page is the duplicate or near duplicate and which page is the original, if it were to decide to stop crawling one and stop following links from it.