DOI: 10.5555/2045005.2045051
Article

A vertical search engine for school information based on Heritrix and Lucene

Published: 22 September 2011

Abstract

Web content is growing exponentially as Internet applications and services continue to expand. While people enjoy the convenience of the Internet, finding useful information in this vast content quickly and accurately has become a problem. The immediate response to this problem is the Web search engine. We developed a vertical search engine for a specific domain, in this case a university. The search engine consists of a Crawler, an Indexer, and a Searcher. The crawler component is implemented with the Heritrix crawler, which is based on the mechanism of recursion and archiving. In the indexer component, a reusable, extensible index creation and management subsystem is designed and implemented with the open-source package Lucene. In an experiment on the Chungbuk National University web sites, the system retrieved, on average over a typical keyword set, more than four hundred times as many documents as Google or the university's own search engine.
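The three components named in the abstract can be illustrated with a toy pipeline. The sketch below is not the authors' system and uses neither Heritrix nor Lucene: a host-restricted breadth-first crawl over a hypothetical in-memory page table stands in for the crawler, a plain inverted index stands in for the indexer, and a conjunctive term lookup stands in for the searcher. The page table `PAGES`, the host `www.example-univ.edu`, and all function names are assumptions made for illustration only.

```python
from collections import defaultdict, deque
from urllib.parse import urlparse
import re

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
PAGES = {
    "http://www.example-univ.edu/": (
        "Welcome to the university admissions office",
        ["http://www.example-univ.edu/cs", "http://other.example.com/"],
    ),
    "http://www.example-univ.edu/cs": (
        "Computer science admissions and courses",
        ["http://www.example-univ.edu/"],
    ),
    "http://other.example.com/": ("Unrelated admissions news", []),
}


def crawl(seed, allowed_host, pages):
    """Crawler stand-in: breadth-first traversal limited to one host."""
    seen, queue, fetched = {seed}, deque([seed]), {}
    while queue:
        url = queue.popleft()
        # Skip anything outside the vertical scope or missing from the table.
        if urlparse(url).hostname != allowed_host or url not in pages:
            continue
        text, links = pages[url]
        fetched[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return fetched


def tokenize(text):
    """Lowercase and split into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Indexer stand-in: inverted index mapping term -> set of URLs."""
    index = defaultdict(set)
    for url, text in docs.items():
        for term in tokenize(text):
            index[term].add(url)
    return index


def search(index, query):
    """Searcher stand-in: URLs containing every query term, sorted."""
    terms = tokenize(query)
    if not terms:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in terms))
    return sorted(hits)


docs = crawl("http://www.example-univ.edu/", "www.example-univ.edu", PAGES)
index = build_index(docs)
print(search(index, "computer admissions"))
```

Restricting the crawl to a single host is what makes the toy engine "vertical": the off-site page is never fetched, so it can never appear in the results, even though it contains a matching term.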

Published In

ICHIT'11: Proceedings of the 5th International Conference on Convergence and Hybrid Information Technology
September 2011
789 pages
ISBN: 9783642240812
Editors: Geuk Lee, Daniel Howard, Dominik Ślęzak

Publisher

Springer-Verlag, Berlin, Heidelberg

Author Tags

1. indexing
2. information retrieval
3. search engine
4. web crawling
