DOI: 10.1145/1076034.1076066
Article

Detecting phrase-level duplication on the world wide web

Published: 15 August 2005

Abstract

Two years ago, we conducted a study on the evolution of web pages over time. In the course of that study, we discovered a large number of machine-generated "spam" web pages emanating from a handful of web servers in Germany. These spam web pages were dynamically assembled by stitching together grammatically well-formed German sentences drawn from a large collection of sentences. This discovery motivated us to develop techniques for finding other instances of such "slice and dice" generation of web pages, where pages are automatically generated by stitching together phrases drawn from a limited corpus. We applied these techniques to two data sets, a set of 151 million web pages collected in December 2002 and a set of 96 million web pages collected in June 2004. We found a number of other instances of large-scale phrase-level replication within the two data sets. This paper describes the algorithms we used to discover this type of replication, and highlights the results of our data mining.
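The abstract does not spell out the detection algorithms, so the following is only a minimal sketch of the general idea of phrase-level replication, not the authors' method. It assumes a phrase is a run of `k` consecutive words and scores each page by the fraction of its phrases that also appear on other pages. The function names `phrases` and `replicated_fraction`, the parameters `k` and `min_pages`, and the sample pages are all hypothetical.

```python
from collections import defaultdict


def phrases(text, k=5):
    """Return the set of k-word phrases (word shingles) found in text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def replicated_fraction(pages, k=5, min_pages=2):
    """Map each page id to the fraction of its phrases seen on >= min_pages pages.

    pages: dict of page id -> page text (an assumed input format).
    """
    page_phrases = {pid: phrases(text, k) for pid, text in pages.items()}

    # Count, for every phrase, the number of distinct pages containing it.
    counts = defaultdict(int)
    for phrase_set in page_phrases.values():
        for phrase in phrase_set:
            counts[phrase] += 1

    result = {}
    for pid, phrase_set in page_phrases.items():
        if not phrase_set:
            # Pages shorter than k words have no phrases to compare.
            result[pid] = 0.0
            continue
        shared = sum(1 for phrase in phrase_set if counts[phrase] >= min_pages)
        result[pid] = shared / len(phrase_set)
    return result


# Hypothetical toy corpus: pages "a" and "b" share a run of words, "c" does not.
pages = {
    "a": "the quick brown fox jumps over the lazy dog near the river bank",
    "b": "yesterday the quick brown fox jumps over a fence",
    "c": "completely unrelated text about database index structures and query plans",
}
scores = replicated_fraction(pages, k=3)
```

A page stitched together from a limited phrase corpus would be expected to score close to 1.0 under this measure, while an independently written page would score close to 0.0. At the scale of the data sets mentioned above, storing raw word tuples would be impractical, and a compact hash of each phrase would likely be needed instead; that substitution is left out here for clarity.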

Published In

SIGIR '05: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
August 2005
708 pages
ISBN:1595930345
DOI:10.1145/1076034
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. content duplication
  2. data mining
  3. web characterization
  4. web pages
  5. web spam

Qualifiers

  • Article

Conference

SIGIR05
Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Cited By

  • (2022) Efficient and Privacy Preserving Approximation of Distributed Statistical Queries. IEEE Transactions on Big Data 8:5 (1399-1413). DOI: 10.1109/TBDATA.2021.3052516. Online publication date: 1-Oct-2022
  • (2022) Towards Forecasting Internet Financial Frauds based on Advertising. 2022 8th International Conference on Big Data and Information Analytics (BigDIA) (5-11). DOI: 10.1109/BigDIA56350.2022.9874049. Online publication date: 24-Aug-2022
  • (2021) An Improved Framework for Content- and Link-Based Web-Spam Detection. Complexity 2021. DOI: 10.1155/2021/6625739. Online publication date: 1-Jan-2021
  • (2020) A Survey of Fake News. ACM Computing Surveys 53:5 (1-40). DOI: 10.1145/3395046. Online publication date: 28-Sep-2020
  • (2020) GT2FS-SMOTE: An Intelligent Oversampling Approach Based Upon General Type-2 Fuzzy Sets to Detect Web Spam. Arabian Journal for Science and Engineering 46:4 (3033-3050). DOI: 10.1007/s13369-020-04995-5. Online publication date: 15-Oct-2020
  • (2019) Detecting Cyberbullying and Cyberaggression in Social Media. ACM Transactions on the Web 13:3 (1-51). DOI: 10.1145/3343484. Online publication date: 14-Oct-2019
  • (2018) Statistical Approach for Combating Web Spamming Using Fisher Technique. 2018 International Conference on Inventive Research in Computing Applications (ICIRCA) (63-66). DOI: 10.1109/ICIRCA.2018.8597230. Online publication date: Jul-2018
  • (2017) Detecting Negative Deceptive Opinion from Tweets. Mobile and Wireless Technologies 2017 (329-339). DOI: 10.1007/978-981-10-5281-1_36. Online publication date: 17-Jun-2017
  • (2016) Analysis of Web Spam for Non-English Content: Toward More Effective Language-Based Classifiers. PLOS ONE 11:11 (e0164383). DOI: 10.1371/journal.pone.0164383. Online publication date: 17-Nov-2016
  • (2016) WSF2. Scientific Programming 2016 (1-1). DOI: 10.1155/2016/6091385. Online publication date: 1-Jan-2016
