Svoboda | Graniru | BBC Russia | Golosameriki | Facebook
skip to main content
10.1145/1060745.1060761acmconferencesArticle/Chapter ViewAbstractPublication PageswebconfConference Proceedingsconference-collections
Article

Web data extraction based on partial tree alignment

Published: 10 May 2005 Publication History
  • Get Citation Alerts
  • Abstract

    This paper studies the problem of extracting data from a Web page that contains several structured data records. The objective is to segment these data records, extract data items/fields from them and put the data in a database table. This problem has been studied by several researchers. However, existing methods still have some serious limitations. The first class of methods is based on machine learning, which requires human labeling of many examples from each Web site that one is interested in extracting data from. The process is time consuming due to the large number of sites and pages on the Web. The second class of algorithms is based on automatic pattern discovery. These methods are either inaccurate or make many assumptions. This paper proposes a new method to perform the task automatically. It consists of two steps, (1) identifying individual data records in a page, and (2) aligning and extracting data items from the identified data records. For step 1, we propose a method based on visual information to segment data records, which is more accurate than existing methods. For step 2, we propose a novel partial alignment technique based on tree matching. Partial alignment means that we align only those data fields in a pair of data records that can be aligned (or matched) with certainty, and make no commitment on the rest of the data fields. This approach enables very accurate alignment of multiple data records. Experimental results using a large number of Web pages from diverse domains show that the proposed two-step technique is able to segment data records, align and extract data from them very accurately.

    References

    [1]
    Arasu, A. and Garcia-Molina, H. Extracting Structured Data from Web Pages. SIGMOD-03, 2003.
    [2]
    Baeza-Yates, R. Algorithms for string matching: A survey. ACM SIGIR Forum, 23(3-4):34--58, 1989.
    [3]
    Barton, G., Sternberg, M. A strategy for the rapid multiple alignment of protein sequences: confidence levels from tertiary structure comparisons. J. Mol. Biol. 1987, 327--337.
    [4]
    Bar-Yossef, Z. and Rajagopalan, S. Template Detection via Data Mining and its Applications, WWW 2002, 2002.
    [5]
    Buttler, D., Liu, L., Pu, C. A fully automated extraction system for the World Wide Web. IEEE ICDCS-21, 2001.
    [6]
    Carrillo, H., Lipman, D. The multiple sequence alignment problem in biology. SIAM J. Applied Math., 1988;48(5).
    [7]
    Chakrabarti, S. Mining the Web: Discovering Knowledge from Hypertext Data. Morgan Kaufmann Publishers, 2002.
    [8]
    Chang, C. and Lui, S-L. IEPAD: Information extraction based on pattern discovery. WWW-10, 2001.
    [9]
    Chen, H.-H., Tsai, S.-C., and Tsai, J.-H. Mining tables from large scale html texts. COLING-00, 2000.
    [10]
    Chen, W. New algorithm for ordered tree-to-tree correction problem. Journal of Algorithms, 40:135-158, 2001.
    [11]
    Cohen, W., Hurst, M., and Jensen, L. A flexible learning system for wrapping tables and lists in HTML documents. WWW-2002, 2002.
    [12]
    Crescenzi, V., Mecca, G. and Merialdo, P. Roadrunner: Towards automatic data extraction from large web sites. VLDB-01, 2001.
    [13]
    Doorenbos, R., Etzioni, O., Weld, D. A scalable comparison shopping agent for the World Wide Web. Agents-97, 1997.
    [14]
    Embley, D., Jiang, Y and Ng, Y. "Record-boundary discovery in Web documents." SIGMOD-99, 1999.
    [15]
    Gusfield, D. Algorithms on strings, tree, and sequence, Cambridge. 1997.
    [16]
    Hammer, J., Garcia-Molina, H., Cho, J., Aranha, R., and Crespo, A. Extracting semi-structured information from the Web. Workshop on Manag. of Semi-structured Data, 1997.
    [17]
    Hogeweg, P., Hesper, B. The alignment of sets of sequences and the construction of phylogenetic trees: An integrated method. J. Mol. Evol., 20, 175--186 (1984).
    [18]
    Hsu, C.-N. and Dung, M.-T. Generating finite-state transducers for semi-structured data extraction from the Web. Information Systems. 23(8): 521--538, 1998.
    [19]
    Kushmerick, N. Wrapper induction: efficiency and expressiveness. Artificial Intelligence, 118:15--68, 2000.
    [20]
    Lerman, K., Getoor L., Minton, S. and Knoblock, C. "Using the Structure of Web Sites for Automatic Segmentation of Tables." SIGMOD-04, 2004.
    [21]
    Liu, B., Grossman, R. and Zhai, Y. "Mining data records from Web pages." KDD-03, 2003.
    [22]
    Meng, X., Lu, H., Wang, H., and Gu, M. Schema-Guided Wrapper Generator. ICDE-02, 2002.
    [23]
    Muslea, I., Minton, S. and Knoblock, C. "A hierarchical approach to wrapper induction." Agents-99, 1999.
    [24]
    Notredame, C. Recent progresses in multiple sequence alignment: a survey. Technical Report. 2002.
    [25]
    Pinto, D., McCallum, A., Wei, X. and Bruce, W. Table Extraction Using Conditional Random Fields. SIGIR-03.
    [26]
    Ramaswamy, L., Ivengar, A., Liu, L., and Douglis, F. Automatic detection of fragments in dynamically generated Web pages. WWW-04, 2004.
    [27]
    Reis, D. Golgher, P., Silva, A., Laender, A. Automatic Web news extraction using tree edit distance, WWW-04, 2004.
    [28]
    Rosenfeld, B., Feldman, R., Aumann, Y. Structural extraction from visual layout of documents. CIKM-02, 2002.
    [29]
    Song, R., Liu, H., Wen, J.-R., Ma, W.-Y. Learning block importance models for Web pages. WWW-04, 2004.
    [30]
    Tai, K. The tree-to-tree correction problem. J. ACM, 26(3):422--433, 1979
    [31]
    Valiente, G. Tree edit distance and common subtrees. Research Report LSI-02-20-R, Universitat Politecnica de Catalunya, Barcelona, Spain, 2002.
    [32]
    Wang, J., Shapiro, J., Shasha, D., Zhang, K., Currey, K. An algorithm for finding the largest approximately common substructures of two trees. IEEE PAMI, 20(8), 1998.
    [33]
    Wang, Y., Hu, J. A machine learning based approach for table detection on the Web. WWW-2002.
    [34]
    Wang, J.-Y., and Lochovsky, F. Data extraction and label assignment for Web databases. WWW-03, 2003.
    [35]
    Yang, W. Identifying syntactic differences between two programs. Softw. Pract. Exper., 21(7):739--755, 1991.
    [36]
    Zhang, K., Statman, R., Shasha, D. On the editing distance between unordered labeled trees. Information Processing Letters, 42(3):133-139, 1992.

    Cited By

    View all
    • (2023)Self-Training for Label-Efficient Information Extraction from Semi-Structured Web-PagesProceedings of the VLDB Endowment10.14778/3611479.361151116:11(3098-3110)Online publication date: 24-Aug-2023
    • (2023)EDREW - Enhanced Data Representation for Extraction in WebProceedings of the 29th Brazilian Symposium on Multimedia and the Web10.1145/3617023.3617055(230-237)Online publication date: 23-Oct-2023
    • (2023)AutoDesc: Facilitating Convenient Perusal of Web Data Items for Blind UsersProceedings of the 28th International Conference on Intelligent User Interfaces10.1145/3581641.3584049(32-45)Online publication date: 27-Mar-2023
    • Show More Cited By

    Index Terms

    1. Web data extraction based on partial tree alignment

        Recommendations

        Comments

        Information & Contributors

        Information

        Published In

        WWW '05: Proceedings of the 14th international conference on World Wide Web
        May 2005
        781 pages
        ISBN:1595930469
        DOI:10.1145/1060745
        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Sponsors

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        Published: 10 May 2005

        Permissions

        Request permissions for this article.

        Check for updates

        Author Tags

        1. data extraction
        2. data record extraction
        3. wrapper

        Qualifiers

        • Article

        Acceptance Rates

        Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

        Contributors

        Other Metrics

        Bibliometrics & Citations

        Bibliometrics

        Article Metrics

        • Downloads (Last 12 months)25
        • Downloads (Last 6 weeks)0

        Other Metrics

        Citations

        Cited By

        View all
        • (2023)Self-Training for Label-Efficient Information Extraction from Semi-Structured Web-PagesProceedings of the VLDB Endowment10.14778/3611479.361151116:11(3098-3110)Online publication date: 24-Aug-2023
        • (2023)EDREW - Enhanced Data Representation for Extraction in WebProceedings of the 29th Brazilian Symposium on Multimedia and the Web10.1145/3617023.3617055(230-237)Online publication date: 23-Oct-2023
        • (2023)AutoDesc: Facilitating Convenient Perusal of Web Data Items for Blind UsersProceedings of the 28th International Conference on Intelligent User Interfaces10.1145/3581641.3584049(32-45)Online publication date: 27-Mar-2023
        • (2023)BAGEL: An Approach to Automatically Detect Navigation-Based Web Accessibility Barriers for Keyboard UsersProceedings of the 2023 CHI Conference on Human Factors in Computing Systems10.1145/3544548.3580749(1-17)Online publication date: 19-Apr-2023
        • (2022)Web Record Extraction with InvariantsProceedings of the VLDB Endowment10.14778/3574245.357427616:4(959-972)Online publication date: 1-Dec-2022
        • (2022)Customizable Tabular Access to Web Data Records for Convenient Low-vision Screen Magnifier InteractionACM Transactions on Accessible Computing10.1145/351704415:2(1-22)Online publication date: 19-May-2022
        • (2022)On validating web information extraction proposalsExpert Systems with Applications: An International Journal10.1016/j.eswa.2022.116700199:COnline publication date: 1-Aug-2022
        • (2022)Text Preparation and Similarity ComputationMachine Learning for Text10.1007/978-3-030-96623-2_2(19-32)Online publication date: 10-Feb-2022
        • (2021)The smallest extraction problemProceedings of the VLDB Endowment10.14778/3476249.347629314:11(2445-2458)Online publication date: 27-Oct-2021
        • (2021)Tabular Functional Block Detection with Embedding-based Agglomerative Cell ClusteringProceedings of the 30th ACM International Conference on Information & Knowledge Management10.1145/3459637.3482484(1744-1753)Online publication date: 26-Oct-2021
        • Show More Cited By

        View Options

        Get Access

        Login options

        View options

        PDF

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader

        Media

        Figures

        Other

        Tables

        Share

        Share

        Share this Publication link

        Share on social media