DOI: 10.1145/2389936.2389949
research-article

Web crawler middleware for search engine digital libraries: a case study for citeseerX

Published: 02 November 2012

ABSTRACT

Middleware is an important part of many search engine web crawling processes. We developed a middleware, the Crawl Document Importer (CDI), which selectively imports documents and their associated metadata into the crawl repository and database of the digital library CiteSeerX. The middleware is designed to be extensible: it provides a universal interface to the crawl database and supports input from multiple open source crawlers and archival formats, e.g., ARC and WARC. It can also import files downloaded via FTP. To use this middleware with another crawler, the user only needs to write a new log parser that returns a resource object with the standard metadata attributes and tells the middleware how to access the downloaded files. When importing documents, users can specify document MIME types and obtain text extracted from PDF/PostScript documents. The middleware can adaptively identify academic research papers based on document context features. We developed a web user interface through which users can submit import jobs. The middleware package can also handle supplemental jobs related to the crawl database and repository. Though designed for the CiteSeerX search engine, we feel this design would be appropriate for many search engine web crawling systems.
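
The log-parser extension point described in the abstract can be pictured with a small sketch. This is not the CDI code: the `Resource` fields, the tab-separated log layout, and the function names `parse_log` and `select_by_mime` are all assumptions made for illustration, since the abstract does not specify the actual interface.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional, Set


@dataclass
class Resource:
    """Hypothetical resource object; the field names are illustrative only."""
    url: str
    mime_type: str
    local_path: str                    # where the downloaded file can be read
    parent_url: Optional[str] = None   # page that linked to this document


def parse_log(lines: Iterable[str]) -> Iterator[Resource]:
    """Parse an invented tab-separated crawl log.

    Assumed columns: url, MIME type, local path, optional parent URL.
    A real crawler log would need its own parser.
    """
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip malformed entries
        parent = fields[3] if len(fields) > 3 and fields[3] else None
        yield Resource(fields[0], fields[1].lower(), fields[2], parent)


def select_by_mime(resources: Iterable[Resource],
                   accepted: Set[str]) -> Iterator[Resource]:
    """Keep only resources whose MIME type the user asked to import."""
    for resource in resources:
        if resource.mime_type in accepted:
            yield resource
```

Under these assumptions, supporting a different crawler would mean replacing only `parse_log`; the MIME-type selection step works on any stream of `Resource` objects.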


Published in

WIDM '12: Proceedings of the twelfth international workshop on Web information and data management
November 2012
90 pages
ISBN: 9781450317207
DOI: 10.1145/2389936

Copyright © 2012 ACM


Publisher: Association for Computing Machinery, New York, NY, United States
