DOI: 10.5555/2045753.2045797
Article

A framework for incremental deep web crawler based on URL classification

Published: 24 September 2011

ABSTRACT

As the Web grows rapidly, more and more data become available in the Deep Web, but users must key in a set of keywords to access these pages on some web sites. Traditional search engines index and retrieve only Surface Web pages through static URL links, because Deep Web pages are hidden behind forms. However, the Deep Web not only contains far more information than the Surface Web, its information is also more valuable. Because Deep Web pages change rapidly, keeping the already crawled Deep Web pages fresh while also crawling new Deep Web pages is a challenge. A framework for an incremental Deep Web crawler based on URL classification is proposed. Corresponding to list pages and leaf pages, the URLs related to a page are divided into two classes: list URLs and leaf URLs. The framework not only crawls the latest Deep Web pages according to the change frequency of list pages, but also crawls the leaf pages that change often.
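
The abstract does not say how URLs are classified or how change frequency is measured, so the following Python sketch is only an illustration of the general idea, not the authors' method. It assumes a hypothetical heuristic (query keys such as `page` or `keyword` mark a list URL, everything else is a leaf URL) and estimates change frequency as the fraction of visits on which a page was found changed. The names `LIST_KEYS`, `classify_url`, `PageStats`, and `revisit_order` are invented for this example.

```python
from dataclasses import dataclass
from urllib.parse import urlparse, parse_qs

# Assumed heuristic, not from the paper: query keys that suggest a
# result/list page generated by a form submission.
LIST_KEYS = {"page", "q", "query", "keyword", "offset"}


def classify_url(url: str) -> str:
    """Return "list" or "leaf" based on the URL's query parameter names."""
    keys = set(parse_qs(urlparse(url).query, keep_blank_values=True))
    return "list" if keys & LIST_KEYS else "leaf"


@dataclass
class PageStats:
    visits: int = 0   # how many times the page was fetched
    changes: int = 0  # how many of those fetches found new content

    def record(self, changed: bool) -> None:
        self.visits += 1
        self.changes += int(changed)

    @property
    def change_rate(self) -> float:
        # Fraction of visits that saw a change; unseen pages get 1.0
        # so that they are crawled first.
        return self.changes / self.visits if self.visits else 1.0


def revisit_order(stats: dict[str, PageStats]) -> list[str]:
    """Sort URLs by descending change rate; list URLs win ties."""
    return sorted(
        stats,
        key=lambda u: (-stats[u].change_rate, classify_url(u) != "list", u),
    )
```

Under these assumptions, a list URL that changed on three of four visits would be revisited before a leaf URL that changed on one of four, and a URL that has never been fetched would come before both. A real crawler would likely replace the query-key heuristic with site-specific rules or a learned classifier.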


Published in

WISM'11: Proceedings of the 2011 International Conference on Web Information Systems and Mining - Volume Part II
September 2011, 471 pages
ISBN: 9783642239816
Editors: Zhiguo Gong, Xiangfeng Luo, Junjie Chen, Jingsheng Lei, Fu Lee Wang
Publisher: Springer-Verlag, Berlin, Heidelberg
