research-article

Combining URL and HTML Features for Entity Discovery in the Web

Published: 04 December 2019

    Abstract

    The web is a large repository of entity-pages. An entity-page is a page that publishes data representing an entity of a particular type, for example, a page that describes a driver on a website about a car racing championship. The attribute values published in entity-pages can be used by many data-driven companies, such as insurers, retailers, and search engines. In this article, we define a novel method, called SSUP, which discovers the entity-pages on websites. The novelty of our method is that it combines URL and HTML features in a way that allows the URL terms to have different weights depending on their capacity to distinguish entity-pages from other pages, which increases the efficacy of the entity-page discovery task. SSUP determines the similarity thresholds on each website without human intervention. We carried out experiments on a dataset with different real-world websites and a wide range of entity types. SSUP achieved 95% precision and 85% recall. Our method was compared with two state-of-the-art methods and outperformed them with a precision gain between 51% and 66%.
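
The idea of weighting URL terms by how well they separate entity-pages from other pages can be illustrated with a small sketch. This is not the SSUP algorithm itself, which the abstract does not spell out. The weighting rule (difference in document frequency), the Jaccard comparison of HTML tag sets, the mixing parameter `alpha`, and all function names are assumptions made for illustration only.

```python
import re


def url_terms(url):
    """Split a URL into a set of lowercase alphabetic terms."""
    return set(re.findall(r"[a-z]+", url.lower()))


def term_weights(entity_urls, other_urls):
    """Weight each URL term by how much more often it occurs in known
    entity-page URLs than in other URLs (an assumed heuristic)."""
    def doc_freq(urls):
        counts = {}
        for url in urls:
            for term in url_terms(url):
                counts[term] = counts.get(term, 0) + 1
        return {term: n / len(urls) for term, n in counts.items()}

    df_entity = doc_freq(entity_urls)
    df_other = doc_freq(other_urls)
    # Terms that are equally common in both groups get a weight of zero.
    return {t: max(0.0, f - df_other.get(t, 0.0)) for t, f in df_entity.items()}


def url_score(url, weights):
    """Share of the total term weight that is present in this URL."""
    total = sum(weights.values())
    if total == 0:
        return 0.0
    return sum(weights.get(t, 0.0) for t in url_terms(url)) / total


def html_score(tags, sample_tags):
    """Jaccard similarity between the tag sets of two pages."""
    a, b = set(tags), set(sample_tags)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def combined_score(url, tags, weights, sample_tags, alpha=0.5):
    """Linear mix of URL and HTML evidence; alpha is a hypothetical knob."""
    return alpha * url_score(url, weights) + (1 - alpha) * html_score(tags, sample_tags)
```

With sample entity-page URLs such as `http://example.com/drivers/lewis-hamilton` and other URLs such as `http://example.com/news/race-report`, the term `drivers` receives a high weight while `http`, `example`, and `com` receive zero, so an unseen `/drivers/...` URL scores higher than a `/news/...` URL. A real system would also need a per-site threshold on the combined score, which this sketch leaves out.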

    Published In

    ACM Transactions on the Web  Volume 13, Issue 4
    November 2019
    139 pages
    ISSN:1559-1131
    EISSN:1559-114X
    DOI:10.1145/3372405
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 04 December 2019
    Accepted: 01 September 2019
    Revised: 01 February 2019
    Received: 01 February 2017
    Published in TWEB Volume 13, Issue 4

    Author Tags

    1. URL and HTML features
    2. crawler
    3. entity-pages
    4. web structure mining

    Qualifiers

    • Research-article
    • Research
    • Refereed
