DOI: 10.1145/2389936.2389950
research-article

TitleFinder: extracting the headline of news web pages based on cosine similarity and overlap scoring similarity

Published: 02 November 2012

    Abstract

    Automatically extracting the headline of online web articles has many applications in web mining and information retrieval. In this paper, we present TitleFinder, a content-based, domain- and language-independent approach for unsupervised extraction of the headline of web articles. TitleFinder starts by using a heuristic to select a candidate headline. In a second step, the content of each text fragment in the HTML file is compared to the candidate headline. We implemented four types of similarity for this comparison: two variations of the cosine similarity based on tf and tf-idf weighting schemata, an overlap scoring similarity, and an aggregated metric combining the scores of the previous three similarities. Our method achieves high performance in terms of effectiveness and efficiency, and it outperforms approaches operating on structural and visual features on a test set of 11,218 news web pages from 15 different domains.
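
    The comparison step described in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the abstract does not give the exact formulas, so the whitespace tokenizer, the smoothed idf, the overlap definition (share of distinct candidate terms found in a fragment) and the aggregation by plain averaging are all assumptions, and every function name here (tokenize, cosine, idf_table, overlap, score_fragments, best_fragment) is hypothetical. The candidate headline is taken as given, since the selection heuristic is not specified in the abstract.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def cosine(u, v):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def idf_table(token_lists):
    """Assumed smoothed idf: log((1 + N) / (1 + df)) + 1 over the given texts."""
    n = len(token_lists)
    df = Counter(t for tokens in token_lists for t in set(tokens))
    return {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}


def overlap(candidate_tokens, fragment_tokens):
    """Assumed overlap score: share of distinct candidate terms in the fragment."""
    cand = set(candidate_tokens)
    if not cand:
        return 0.0
    return len(cand & set(fragment_tokens)) / len(cand)


def score_fragments(candidate, fragments):
    """Return (fragment, tf cosine, tf-idf cosine, overlap, mean) per fragment."""
    cand_tokens = tokenize(candidate)
    frag_tokens = [tokenize(f) for f in fragments]
    # idf is computed over the page's fragments plus the candidate itself.
    idf = idf_table(frag_tokens + [cand_tokens])
    cand_tf = Counter(cand_tokens)
    cand_tfidf = {t: c * idf[t] for t, c in cand_tf.items()}
    results = []
    for text, tokens in zip(fragments, frag_tokens):
        tf = Counter(tokens)
        tfidf = {t: c * idf[t] for t, c in tf.items()}
        s_tf = cosine(cand_tf, tf)
        s_tfidf = cosine(cand_tfidf, tfidf)
        s_ov = overlap(cand_tokens, tokens)
        # Aggregated metric: unweighted mean of the three scores (an assumption).
        results.append((text, s_tf, s_tfidf, s_ov, (s_tf + s_tfidf + s_ov) / 3.0))
    return results


def best_fragment(candidate, fragments):
    """Pick the fragment with the highest aggregated score."""
    return max(score_fragments(candidate, fragments), key=lambda r: r[4])[0]
```

    For example, with a hypothetical candidate such as "Storm hits coast - Example News" and the fragments "Home World Sport", "Storm hits coast" and a body sentence about the storm, best_fragment would select the fragment that shares the most weighted terms with the candidate. A real pipeline would also need HTML parsing and a more careful tokenizer.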

    Cited By

    • (2022) Methods for Subheading Recognition in Recruitment Information. 2022 International Conference on Management Engineering, Software Engineering and Service Sciences (ICMSS), pages 28-35. DOI: 10.1109/ICMSS55574.2022.00012. Online publication date: Jan-2022.
    • (2020) Identification of network behavioral characteristics of high-expertise users in interactive innovation: The case of forum autohome. Asia Pacific Management Review. DOI: 10.1016/j.apmrv.2020.06.002. Online publication date: Sep-2020.
    • (2017) Using linguistic features to automatically extract web page title. Expert Systems with Applications: An International Journal, 79:C, pages 296-312. DOI: 10.1016/j.eswa.2017.02.045. Online publication date: 15-Aug-2017.

    Published In

    WIDM '12: Proceedings of the twelfth international workshop on Web information and data management
    November 2012
    90 pages
    ISBN: 9781450317207
    DOI: 10.1145/2389936

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. cosine similarity
    2. headline extraction
    3. html web pages
    4. information retrieval
    5. overlap scoring similarity
    6. title extraction
    7. vector space model

    Qualifiers

    • Research-article

    Conference

    CIKM'12

    Article Metrics

    • Downloads (Last 12 months): 7
    • Downloads (Last 6 weeks): 2
    Reflects downloads up to 27 Jul 2024