DOI: 10.1145/3299869.3319899
research-article

Progressive Deep Web Crawling Through Keyword Queries For Data Enrichment

Published: 25 June 2019

Abstract

    Data enrichment is the act of extending a local database with new attributes from external data sources. In this paper, we study a novel problem: how to progressively crawl the deep web (i.e., a hidden database) through a keyword-search API to enrich a local database in an effective way. This is challenging because these interfaces often limit data access by enforcing the top-k constraint or limiting the number of queries that can be issued within a time window. In response, we propose SmartCrawl, a new framework to collect results effectively. Given a query budget b, SmartCrawl first constructs a query pool based on the local database, and then iteratively issues a set of most beneficial queries to the hidden database such that the union of the query results can cover the maximum number of local records. The key technical challenge is how to estimate query benefit, i.e., the number of local records that can be covered by a given query. A simple approach is to estimate it as the query frequency in the local database. We find that this is ineffective due to i) the impact of |ΔD|, where |ΔD| represents the number of local records that cannot be found in the hidden database, and ii) the top-k constraint enforced by the hidden database. We study how to mitigate the negative impacts of the two factors and propose effective optimization techniques to improve performance. The experimental results show that on both simulated and real-world hidden databases, SmartCrawl significantly increases coverage over the local database as compared to the baselines.
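
The loop described in the abstract can be illustrated with a minimal sketch. It is not SmartCrawl itself: it uses the simple frequency-based benefit estimate that the abstract calls ineffective, and it assumes single-keyword queries, exact string matching between local and hidden records, and a simulated hidden database. All names (`build_query_pool`, `make_top_k_search`, `greedy_crawl`) and the sample records are hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple


def build_query_pool(local_db: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map each keyword to the ids of the local records that contain it."""
    pool: Dict[str, Set[int]] = {}
    for rid, text in local_db.items():
        for word in set(text.lower().split()):
            pool.setdefault(word, set()).add(rid)
    return pool


def make_top_k_search(hidden_db: List[str], k: int) -> Callable[[str], List[str]]:
    """Simulate a keyword-search API that returns at most k matching records."""
    def search(keyword: str) -> List[str]:
        hits = [rec for rec in hidden_db if keyword in rec.lower().split()]
        return hits[:k]  # top-k constraint (ranking is just list order here)
    return search


def greedy_crawl(
    local_db: Dict[int, str],
    search: Callable[[str], List[str]],
    budget: int,
) -> Tuple[List[str], Set[int]]:
    """Issue up to `budget` queries, picking the highest estimated benefit each time."""
    pool = build_query_pool(local_db)
    covered: Set[int] = set()
    issued: List[str] = []
    for _ in range(budget):
        # Naive estimate: number of still-uncovered local records containing the
        # keyword. Ties are broken by keyword so the run is deterministic.
        best = max(pool, key=lambda q: (len(pool[q] - covered), q), default=None)
        if best is None or not (pool[best] - covered):
            break  # no remaining query can cover anything new
        returned = {rec.lower() for rec in search(best)}
        issued.append(best)
        # Assumption: a local record counts as covered on an exact text match.
        covered |= {rid for rid in pool[best] if local_db[rid].lower() in returned}
        del pool[best]
    return issued, covered


if __name__ == "__main__":
    local_db = {1: "thai house", 2: "thai garden", 3: "noodle house", 4: "lost diner"}
    hidden_db = ["thai house", "thai garden", "noodle house", "thai palace", "steak house"]
    issued, covered = greedy_crawl(local_db, make_top_k_search(hidden_db, k=2), budget=2)
    print(issued)           # ['thai', 'noodle']
    print(sorted(covered))  # [1, 2, 3]
```

In this toy data, record 4 ("lost diner") does not exist in the hidden database, so it belongs to ΔD. Raising the budget to 3 makes the naive estimate spend a query on "lost" that covers nothing, which is the first of the two failure modes the abstract names. Lowering k to 1 truncates the result of "thai" and shows the second.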

      Published In

      SIGMOD '19: Proceedings of the 2019 International Conference on Management of Data
      June 2019
      2106 pages
      ISBN:9781450356435
      DOI:10.1145/3299869
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. data extraction
      2. data integration
      3. deep web

      Qualifiers

      • Research-article

      Funding Sources

      • Natural Sciences and Engineering Research Council of Canada (NSERC)

      Conference

      SIGMOD/PODS '19: International Conference on Management of Data
      June 30 - July 5, 2019
      Amsterdam, Netherlands

      Acceptance Rates

      SIGMOD '19 Paper Acceptance Rate 88 of 430 submissions, 20%;
      Overall Acceptance Rate 785 of 4,003 submissions, 20%

      Article Metrics

      • Downloads (Last 12 months): 23
      • Downloads (Last 6 weeks): 1

      Cited By

      • (2024) LakeBench: A Benchmark for Discovering Joinable and Unionable Tables in Data Lakes. Proceedings of the VLDB Endowment 17:8 (1925-1938). DOI: 10.14778/3659437.3659448. Online publication date: 1-Apr-2024
      • (2024) FeatAug: Automatic Feature Augmentation From One-to-Many Relationship Tables. 2024 IEEE 40th International Conference on Data Engineering (ICDE) (1805-1818). DOI: 10.1109/ICDE60146.2024.00146. Online publication date: 13-May-2024
      • (2024) In Situ Neural Relational Schema Matcher. 2024 IEEE 40th International Conference on Data Engineering (ICDE) (138-150). DOI: 10.1109/ICDE60146.2024.00018. Online publication date: 13-May-2024
      • (2023) Effective Entity Augmentation by Querying External Data Sources. Proceedings of the VLDB Endowment 16:11 (3404-3417). DOI: 10.14778/3611479.3611535. Online publication date: 24-Aug-2023
      • (2022) PoWareMatch: A Quality-aware Deep Learning Approach to Improve Human Schema Matching. Journal of Data and Information Quality 14:3 (1-27). DOI: 10.1145/3483423. Online publication date: 23-May-2022
      • (2022) Data Management for Machine Learning: A Survey. IEEE Transactions on Knowledge and Data Engineering (1-1). DOI: 10.1109/TKDE.2022.3148237. Online publication date: 2022
      • (2020) ActiveDeeper. Proceedings of the VLDB Endowment 13:12 (2885-2888). DOI: 10.14778/3415478.3415500. Online publication date: 14-Sep-2020
