Svoboda | Graniru | BBC Russia | Golosameriki | Facebook

Article

Web data extraction based on partial tree alignment

Authors:

Yanhong Zhai and

Bing LiuAuthors Info & Claims

WWW '05: Proceedings of the 14th international conference on World Wide Web

May 2005

Pages 76 - 85

https://doi.org/10.1145/1060745.1060761

Published: 10 May 2005 Publication History

Abstract

This paper studies the problem of extracting data from a Web page that contains several structured data records. The objective is to segment these data records, extract data items/fields from them and put the data in a database table. This problem has been studied by several researchers. However, existing methods still have some serious limitations. The first class of methods is based on machine learning, which requires human labeling of many examples from each Web site that one is interested in extracting data from. The process is time consuming due to the large number of sites and pages on the Web. The second class of algorithms is based on automatic pattern discovery. These methods are either inaccurate or make many assumptions. This paper proposes a new method to perform the task automatically. It consists of two steps, (1) identifying individual data records in a page, and (2) aligning and extracting data items from the identified data records. For step 1, we propose a method based on visual information to segment data records, which is more accurate than existing methods. For step 2, we propose a novel partial alignment technique based on tree matching. Partial alignment means that we align only those data fields in a pair of data records that can be aligned (or matched) with certainty, and make no commitment on the rest of the data fields. This approach enables very accurate alignment of multiple data records. Experimental results using a large number of Web pages from diverse domains show that the proposed two-step technique is able to segment data records, align and extract data from them very accurately.

References

[1]

Arasu, A. and Garcia-Molina, H. Extracting Structured Data from Web Pages. SIGMOD-03, 2003.

Digital Library

[2]

Baeza-Yates, R. Algorithms for string matching: A survey. ACM SIGIR Forum, 23(3-4):34--58, 1989.

Digital Library

[3]

Barton, G., Sternberg, M. A strategy for the rapid multiple alignment of protein sequences: confidence levels from tertiary structure comparisons. J. Mol. Biol. 1987, 327--337.

[4]

Bar-Yossef, Z. and Rajagopalan, S. Template Detection via Data Mining and its Applications, WWW 2002, 2002.

Digital Library

[5]

Buttler, D., Liu, L., Pu, C. A fully automated extraction system for the World Wide Web. IEEE ICDCS-21, 2001.

[6]

Carrillo, H., Lipman, D. The multiple sequence alignment problem in biology. SIAM J. Applied Math., 1988;48(5).

Digital Library

[7]

Chakrabarti, S. Mining the Web: Discovering Knowledge from Hypertext Data. Morgan Kaufmann Publishers, 2002.

Digital Library

[8]

Chang, C. and Lui, S-L. IEPAD: Information extraction based on pattern discovery. WWW-10, 2001.

Digital Library

[9]

Chen, H.-H., Tsai, S.-C., and Tsai, J.-H. Mining tables from large scale html texts. COLING-00, 2000.

Digital Library

[10]

Chen, W. New algorithm for ordered tree-to-tree correction problem. Journal of Algorithms, 40:135-158, 2001.

Digital Library

[11]

Cohen, W., Hurst, M., and Jensen, L. A flexible learning system for wrapping tables and lists in HTML documents. WWW-2002, 2002.

Digital Library

[12]

Crescenzi, V., Mecca, G. and Merialdo, P. Roadrunner: Towards automatic data extraction from large web sites. VLDB-01, 2001.

Digital Library

[13]

Doorenbos, R., Etzioni, O., Weld, D. A scalable comparison shopping agent for the World Wide Web. Agents-97, 1997.

Digital Library

[14]

Embley, D., Jiang, Y and Ng, Y. "Record-boundary discovery in Web documents." SIGMOD-99, 1999.

Digital Library

[15]

Gusfield, D. Algorithms on strings, tree, and sequence, Cambridge. 1997.

Digital Library

[16]

Hammer, J., Garcia-Molina, H., Cho, J., Aranha, R., and Crespo, A. Extracting semi-structured information from the Web. Workshop on Manag. of Semi-structured Data, 1997.

[17]

Hogeweg, P., Hesper, B. The alignment of sets of sequences and the construction of phylogenetic trees: An integrated method. J. Mol. Evol., 20, 175--186 (1984).

[18]

Hsu, C.-N. and Dung, M.-T. Generating finite-state transducers for semi-structured data extraction from the Web. Information Systems. 23(8): 521--538, 1998.

Digital Library

[19]

Kushmerick, N. Wrapper induction: efficiency and expressiveness. Artificial Intelligence, 118:15--68, 2000.

Digital Library

[20]

Lerman, K., Getoor L., Minton, S. and Knoblock, C. "Using the Structure of Web Sites for Automatic Segmentation of Tables." SIGMOD-04, 2004.

Digital Library

[21]

Liu, B., Grossman, R. and Zhai, Y. "Mining data records from Web pages." KDD-03, 2003.

Digital Library

[22]

Meng, X., Lu, H., Wang, H., and Gu, M. Schema-Guided Wrapper Generator. ICDE-02, 2002.

[23]

Muslea, I., Minton, S. and Knoblock, C. "A hierarchical approach to wrapper induction." Agents-99, 1999.

Digital Library

[24]

Notredame, C. Recent progresses in multiple sequence alignment: a survey. Technical Report. 2002.

[25]

Pinto, D., McCallum, A., Wei, X. and Bruce, W. Table Extraction Using Conditional Random Fields. SIGIR-03.

Digital Library

[26]

Ramaswamy, L., Ivengar, A., Liu, L., and Douglis, F. Automatic detection of fragments in dynamically generated Web pages. WWW-04, 2004.

Digital Library

[27]

Reis, D. Golgher, P., Silva, A., Laender, A. Automatic Web news extraction using tree edit distance, WWW-04, 2004.

Digital Library

[28]

Rosenfeld, B., Feldman, R., Aumann, Y. Structural extraction from visual layout of documents. CIKM-02, 2002.

Digital Library

[29]

Song, R., Liu, H., Wen, J.-R., Ma, W.-Y. Learning block importance models for Web pages. WWW-04, 2004.

Digital Library

[30]

Tai, K. The tree-to-tree correction problem. J. ACM, 26(3):422--433, 1979

Digital Library

[31]

Valiente, G. Tree edit distance and common subtrees. Research Report LSI-02-20-R, Universitat Politecnica de Catalunya, Barcelona, Spain, 2002.

[32]

Wang, J., Shapiro, J., Shasha, D., Zhang, K., Currey, K. An algorithm for finding the largest approximately common substructures of two trees. IEEE PAMI, 20(8), 1998.

Digital Library

[33]

Wang, Y., Hu, J. A machine learning based approach for table detection on the Web. WWW-2002.

Digital Library

[34]

Wang, J.-Y., and Lochovsky, F. Data extraction and label assignment for Web databases. WWW-03, 2003.

Digital Library

[35]

Yang, W. Identifying syntactic differences between two programs. Softw. Pract. Exper., 21(7):739--755, 1991.

Digital Library

[36]

Zhang, K., Statman, R., Shasha, D. On the editing distance between unordered labeled trees. Information Processing Letters, 42(3):133-139, 1992.

Digital Library

Cited By

Sarkhel RHuang BLockard CShiralkar P(2023)Self-Training for Label-Efficient Information Extraction from Semi-Structured Web-PagesProceedings of the VLDB Endowment10.14778/3611479.361151116:11(3098-3110)Online publication date: 24-Aug-2023
https://dl.acm.org/doi/10.14778/3611479.3611511
Nunes MDorneles C(2023)EDREW - Enhanced Data Representation for Extraction in WebProceedings of the 29th Brazilian Symposium on Multimedia and the Web10.1145/3617023.3617055(230-237)Online publication date: 23-Oct-2023
https://dl.acm.org/doi/10.1145/3617023.3617055
Prakash YSunkara MLee HJayarathna SAshok V(2023)AutoDesc: Facilitating Convenient Perusal of Web Data Items for Blind UsersProceedings of the 28th International Conference on Intelligent User Interfaces10.1145/3581641.3584049(32-45)Online publication date: 27-Mar-2023
https://dl.acm.org/doi/10.1145/3581641.3584049
Show More Cited By

Index Terms

Web data extraction based on partial tree alignment
1. Information systems
  1. Information retrieval
  2. Information storage systems

Recommendations

Automatic web data extraction using tree alignment
CIKM '09: Proceedings of the 18th ACM conference on Information and knowledge management

This paper investigates the automatic extraction of data from forums, blogs and news web sites. Web pages are increasingly dynamically generated using a common template populated with data from databases. This paper proposes a novel method that uses ...
Read More
Analysis of Data Extraction and Data Cleaning in Web Usage Mining
ICARCSET '15: Proceedings of the 2015 International Conference on Advanced Research in Computer Science Engineering & Technology (ICARCSET 2015)

Data preprocessing is considered as an important phase of Web usage mining due to unstructured, heterogeneous and noisy nature of log data. Complete and effective data preprocessing insures the efficiency and scalability of algorithms used in pattern ...
Read More
Effective Web Data Extraction with Ducky
IDEAS '15: Proceedings of the 19th International Database Engineering & Applications Symposium

The World Wide Web has become an invaluable source of data. However, extracting useful information from the vastness of the web can become a challenge as depending on the amount of data needed, manual extraction or creation of web scraping programs may ...
Read More

Comments

Information & Contributors

Information

Published In

WWW '05: Proceedings of the 14th international conference on World Wide Web

May 2005

781 pages

ISBN:1595930469

DOI:10.1145/1060745

Conference Chairs:
Allan Ellis
Southern Cross University
,
Tatsuya Hagino
Keio University

Copyright © 2005 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

ACM: Association for Computing Machinery

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 10 May 2005

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Article

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

337
Total Citations
View Citations
3,013
Total Downloads

Downloads (Last 12 months)25
Downloads (Last 6 weeks)0

Other Metrics

View Author Metrics

Citations

Cited By

Sarkhel RHuang BLockard CShiralkar P(2023)Self-Training for Label-Efficient Information Extraction from Semi-Structured Web-PagesProceedings of the VLDB Endowment10.14778/3611479.361151116:11(3098-3110)Online publication date: 24-Aug-2023
https://dl.acm.org/doi/10.14778/3611479.3611511
Nunes MDorneles C(2023)EDREW - Enhanced Data Representation for Extraction in WebProceedings of the 29th Brazilian Symposium on Multimedia and the Web10.1145/3617023.3617055(230-237)Online publication date: 23-Oct-2023
https://dl.acm.org/doi/10.1145/3617023.3617055
Prakash YSunkara MLee HJayarathna SAshok V(2023)AutoDesc: Facilitating Convenient Perusal of Web Data Items for Blind UsersProceedings of the 28th International Conference on Intelligent User Interfaces10.1145/3581641.3584049(32-45)Online publication date: 27-Mar-2023
https://dl.acm.org/doi/10.1145/3581641.3584049
Chiou PAlotaibi AHalfond W(2023)BAGEL: An Approach to Automatically Detect Navigation-Based Web Accessibility Barriers for Keyboard UsersProceedings of the 2023 CHI Conference on Human Factors in Computing Systems10.1145/3544548.3580749(1-17)Online publication date: 19-Apr-2023
https://dl.acm.org/doi/10.1145/3544548.3580749
Chen ZMeng WDragut E(2022)Web Record Extraction with InvariantsProceedings of the VLDB Endowment10.14778/3574245.357427616:4(959-972)Online publication date: 1-Dec-2022
https://dl.acm.org/doi/10.14778/3574245.3574276
Lee HAshok V(2022)Customizable Tabular Access to Web Data Records for Convenient Low-vision Screen Magnifier InteractionACM Transactions on Accessible Computing10.1145/351704415:2(1-22)Online publication date: 19-May-2022
https://dl.acm.org/doi/10.1145/3517044
Jiménez PCorchuelo R(2022)On validating web information extraction proposalsExpert Systems with Applications: An International Journal10.1016/j.eswa.2022.116700199:COnline publication date: 1-Aug-2022
https://dl.acm.org/doi/10.1016/j.eswa.2022.116700
Aggarwal CAggarwal C(2022)Text Preparation and Similarity ComputationMachine Learning for Text10.1007/978-3-030-96623-2_2(19-32)Online publication date: 10-Feb-2022
https://doi.org/10.1007/978-3-030-96623-2_2
Cetorelli VAtzeni PCrescenzi VMilicchio F(2021)The smallest extraction problemProceedings of the VLDB Endowment10.14778/3476249.347629314:11(2445-2458)Online publication date: 27-Oct-2021
https://dl.acm.org/doi/10.14778/3476249.3476293
Sun KWang FChen MPujara JDemartini GZuccon GCulpepper JHuang ZTong H(2021)Tabular Functional Block Detection with Embedding-based Agglomerative Cell ClusteringProceedings of the 30th ACM International Conference on Information & Knowledge Management10.1145/3459637.3482484(1744-1753)Online publication date: 26-Oct-2021
https://dl.acm.org/doi/10.1145/3459637.3482484
Show More Cited By

View Options

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Table of Contents