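Detecting Near-Duplicates for Web Crawling. Problem background: Many web pages on the Internet have the same content, yet their page elements are not exactly identical, because the pages under each domain always carry some material of their own, such as advertisements, navigation bars, and site copyright notices. For a search engine, however, only the content part is meaningful, and although the rest ...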
Manku GS, Jain A, Das Sarma A. Detecting near-duplicates for web crawling. In: Proceedings of the 16th International Conference on World Wide Web. ACM; 2007. p. 141-50. Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma. Detecting near-duplicates for web crawling. In Proceedings of ...
Performance and comparative analysis of the two contrary approaches for detecting near duplicate web documents in web crawling. Int J Electr Comput Eng (IJECE) 2012;2(6):819-30. V. A. Narayana, P. Premchand, and A. Govardhan. "Performance and comparative analysis of the two contrary ...
Efficient techniques to detect documents that are exact duplicates exist. Detecting whether or not documents are near-duplicates is more difficult, particularly in large collections of documents. For example, the Internet, collectively, includes literally billions of “Web site” documents. ...
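... advertisements, navigation bars, and site copyright notices. For a search engine, however, only the content part is meaningful; the rest differs from page to page but has no effect on search results. When deciding whether content is duplicated, those parts should therefore be ignored: when the content of a newly crawled page is the same as the content of a page already in the database, the two are called Near-Duplicates. This is smarter than traditional page comparison ...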
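The idea in the snippet above can be illustrated with a minimal Python sketch. It assumes the main content has already been separated from the advertisements and navigation; the `page` dictionaries, their keys, and the function names are hypothetical and do not come from any of the cited works. It also uses an exact hash of the normalized content, which is a simplification: it only catches pages whose content matches after case and whitespace normalization.

```python
import hashlib


def content_fingerprint(content: str) -> str:
    """Hash the main content only, after lowercasing and collapsing whitespace."""
    normalized = " ".join(content.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def is_near_duplicate(page: dict, seen: set) -> bool:
    """Return True if this page's content was seen before; otherwise record it."""
    fp = content_fingerprint(page["content"])  # "nav" and "ads" are ignored on purpose
    if fp in seen:
        return True
    seen.add(fp)
    return False


# Hypothetical pages: same article, different site-specific boilerplate.
seen_fingerprints = set()
page_a = {
    "content": "The bridge reopened to traffic today.",
    "nav": "Home | About",
    "ads": "Buy now!",
}
page_b = {
    "content": "The bridge  reopened to traffic today.",
    "nav": "Start | Contact",
    "ads": "50% off",
}

first = is_near_duplicate(page_a, seen_fingerprints)
second = is_near_duplicate(page_b, seen_fingerprints)
print(first, second)
```

Because only the `content` field is fingerprinted, the second page is flagged even though its navigation and advertisement text differ from the first.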
Detecting query-specific duplicate documents - Gomes, Smith - 2003. Citation Context: ...ably because search engines automatically and progressively filter out some matching results because they are duplicates, near duplicates, or come from the same web site as too many previous matches ...