research-article

Web crawler middleware for search engine digital libraries: a case study for citeseerX

Authors:
Jian Wu, Pradeep Teregowda, Madian Khabsa, Stephen Carman, Douglas Jordan, Jose San Pedro Wandelmer, Xin Lu, Prasenjit Mitra, C. Lee Giles

All authors: Pennsylvania State University, University Park, PA, USA

WIDM '12: Proceedings of the twelfth international workshop on Web information and data management, November 2012, Pages 57–64
https://doi.org/10.1145/2389936.2389949

Published: 2 November 2012

ABSTRACT

Middleware is an important part of many search engine web crawling processes. We developed a middleware, the Crawl Document Importer (CDI), which selectively imports documents and their associated metadata into the crawl repository and database of the CiteSeerX digital library. The middleware is designed to be extensible: it provides a universal interface to the crawl database and supports input from multiple open source crawlers and archival formats, e.g., ARC and WARC. It can also import files downloaded via FTP. To use the middleware with another crawler, the user only needs to write a new log parser that returns a resource object with the standard metadata attributes and tells the middleware how to access the downloaded files. When importing documents, users can specify document MIME types and obtain the text extracted from PDF and PostScript documents. The middleware can adaptively identify academic research papers based on document context features. We developed a web user interface through which the user can submit import jobs. The middleware package can also run supplemental jobs related to the crawl database and repository. Although the middleware was designed for the CiteSeerX search engine, we believe the design would be appropriate for many search engine web crawling systems.
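As an illustration of the log-parser idea described in the abstract, the following is a minimal sketch, not the CDI implementation. The `Resource` class, the `parse_log_line` and `select_resources` functions, and the tab-separated log layout are all hypothetical names and formats invented for this example; a real crawler log (for instance, one written by Heritrix) has a different layout and would need its own parser.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional, Set


@dataclass(frozen=True)
class Resource:
    """Hypothetical resource object carrying standard metadata attributes."""
    url: str
    mime_type: str
    local_path: str  # tells the importer where the downloaded file lives


def parse_log_line(line: str) -> Optional[Resource]:
    """Parse one line of an invented tab-separated crawl log.

    Assumed layout: <url> TAB <mime type> TAB <path to downloaded file>.
    Returns None for blank or malformed lines.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3 or not all(fields):
        return None
    url, mime_type, local_path = fields
    return Resource(url=url, mime_type=mime_type.lower(), local_path=local_path)


def select_resources(lines: Iterable[str], accepted: Set[str]) -> Iterator[Resource]:
    """Yield only resources whose MIME type the user asked to import."""
    for line in lines:
        resource = parse_log_line(line)
        if resource is not None and resource.mime_type in accepted:
            yield resource


# Made-up sample log lines for demonstration only.
SAMPLE_LOG = [
    "http://example.org/paper.pdf\tapplication/pdf\t/crawl/0001.pdf\n",
    "http://example.org/index.html\ttext/html\t/crawl/0002.html\n",
    "malformed line without tabs\n",
    "http://example.org/thesis.ps\tapplication/postscript\t/crawl/0003.ps\n",
]
ACCEPTED = {"application/pdf", "application/postscript"}
selected = list(select_resources(SAMPLE_LOG, ACCEPTED))
```

Under these assumptions, supporting a different crawler would mean replacing only `parse_log_line`; the MIME-type selection step works on `Resource` objects and does not depend on the log format.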



Published in
WIDM '12: Proceedings of the twelfth international workshop on Web information and data management
November 2012
90 pages
ISBN: 9781450317207
DOI: 10.1145/2389936
Program Chairs:
George H.L. Fletcher, Eindhoven University of Technology, The Netherlands
Prasenjit Mitra, The Pennsylvania State University, USA
Copyright © 2012 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
information retrieval
ingestion
middleware
search engine
web crawling