ABSTRACT
As the Web grows rapidly, more and more data become available in the Deep Web, but users must key in a set of keywords in order to access such pages on many web sites. Traditional search engines only index and retrieve Surface Web pages through static URL links, because Deep Web pages are hidden behind query forms. However, the Deep Web not only contains far more information than the Surface Web; its information is also often more valuable. Because Deep Web pages change rapidly, keeping previously crawled pages fresh and discovering newly appearing pages is a challenge. A framework for an incremental Deep Web crawler based on URL classification is proposed. According to whether a page is a list page or a leaf page, the URLs associated with it are divided into two classes: list URLs and leaf URLs. The framework not only crawls the latest Deep Web pages according to the change frequency of list pages, but also re-crawls leaf pages that change frequently.
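The following is a minimal sketch of the scheduling idea the abstract describes, not the authors' implementation: URLs are heuristically split into list URLs and leaf URLs, and each page's re-crawl time is adjusted from its observed change frequency. All names (CrawlTask, classify, reschedule) and the heuristics inside them are hypothetical illustrations.

```python
# Hypothetical sketch: URL classification plus change-frequency-driven re-crawl scheduling.
import heapq
import time
from dataclasses import dataclass, field

LIST_URL, LEAF_URL = "list", "leaf"

@dataclass(order=True)
class CrawlTask:
    next_visit: float                      # scheduled re-crawl time (epoch seconds)
    url: str = field(compare=False)
    kind: str = field(compare=False)       # LIST_URL or LEAF_URL
    change_interval: float = field(compare=False, default=3600.0)

def classify(url: str) -> str:
    """Assumed heuristic: list pages carry query/paging parameters,
    leaf pages point at a single record."""
    return LIST_URL if ("page=" in url or "query=" in url) else LEAF_URL

def reschedule(task: CrawlTask, changed: bool) -> CrawlTask:
    """Shrink the re-visit interval for pages that changed, grow it otherwise,
    so frequently changing list and leaf pages are crawled more often."""
    factor = 0.5 if changed else 2.0
    task.change_interval = min(max(task.change_interval * factor, 600.0), 7 * 86400.0)
    task.next_visit = time.time() + task.change_interval
    return task

# Usage: push initial tasks, pop the most urgent one, fetch it (fetching not
# shown), detect whether it changed, then push the rescheduled task back.
frontier: list[CrawlTask] = []
seed = "http://example.org/search?query=db&page=1"
heapq.heappush(frontier, CrawlTask(time.time(), seed, classify(seed)))
task = heapq.heappop(frontier)
heapq.heappush(frontier, reschedule(task, changed=True))
```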