Web crawler middleware for search engine digital libraries: A case study for CiteSeerX

Jian Wu, Pradeep Teregowda, Madian Khabsa, Stephen Carman, Douglas Jordan, Jose San Pedro Wandelmer, Xin Lu, Prasenjit Mitra, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

4 Citations (Scopus)

Abstract

Middleware is an important part of many search engine web crawling processes. We developed a middleware, the Crawl Document Importer (CDI), which selectively imports documents and the associated metadata to the digital library CiteSeerX crawl repository and database. This middleware is designed to be extensible as it provides a universal interface to the crawl database. It is designed to support input from multiple open source crawlers and archival formats, e.g., ARC, WARC. It can also import files downloaded via FTP. To use this middleware for another crawler, the user only needs to write a new log parser which returns a resource object with the standard metadata attributes and tells the middleware how to access downloaded files. When importing documents, users can specify document MIME types and obtain text extracted from PDF/PostScript documents. The middleware can adaptively identify academic research papers based on document context features. We developed a web user interface where the user can submit importing jobs. The middleware package can also work on supplemental jobs related to the crawl database and repository. Though designed for the CiteSeerX search engine, we feel this design would be appropriate for many search engine web crawling systems.
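The abstract describes the extension point only in words: a per-crawler log parser that returns resource objects carrying standard metadata and the location of the downloaded file, with the import filtered by document MIME type. The record does not give CDI's actual interface, so the following is a minimal sketch under stated assumptions. The names `Resource`, `LogParser`, `TabLogParser`, and `select_resources`, the three metadata fields, and the tab-separated log format are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Protocol


@dataclass
class Resource:
    """Hypothetical resource object; the field names are assumptions."""
    url: str
    mime_type: str
    local_path: str  # where the crawler stored the downloaded file


class LogParser(Protocol):
    """Hypothetical per-crawler interface: turn log lines into resources."""

    def parse(self, lines: Iterable[str]) -> Iterator[Resource]:
        ...


class TabLogParser:
    """Parses an invented 'url<TAB>mime<TAB>path' crawl log format."""

    def parse(self, lines: Iterable[str]) -> Iterator[Resource]:
        for line in lines:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            parts = line.split("\t")
            if len(parts) != 3:
                continue  # skip malformed entries
            yield Resource(url=parts[0], mime_type=parts[1], local_path=parts[2])


def select_resources(
    parser: LogParser, lines: Iterable[str], accepted_mime_types: Iterable[str]
) -> List[Resource]:
    """Keep only the resources whose MIME type the user asked to import."""
    accepted = set(accepted_mime_types)
    return [r for r in parser.parse(lines) if r.mime_type in accepted]
```

Under these assumptions, supporting another crawler would mean writing one more class with a `parse` method; the filtering step stays unchanged because it depends only on the `Resource` fields.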

Original language: English
Title of host publication: International Conference on Information and Knowledge Management, Proceedings
Pages: 57-64
Number of pages: 8
DOIs: https://doi.org/10.1145/2389936.2389949
Publication status: Published - 2012
Externally published: Yes
Event: 12th ACM International Workshop on Web Information and Data Management, WIDM 2012 - Co-located with CIKM 2012 - Maui, HI
Duration: 2 Nov 2012 → 2 Nov 2012

Other

Other: 12th ACM International Workshop on Web Information and Data Management, WIDM 2012 - Co-located with CIKM 2012
City: Maui, HI
Period: 2/11/12 → 2/11/12

Fingerprint

Middleware
Search engine
Digital libraries
World Wide Web
Data base
Metadata
Importing
Import
Resources
Academic research
Repository
Open source
User interface
Importer

Keywords

  • Information retrieval
  • Ingestion
  • Middleware
  • Search engine
  • Web crawling

ASJC Scopus subject areas

  • Business, Management and Accounting (all)
  • Decision Sciences (all)

Cite this

Wu, J., Teregowda, P., Khabsa, M., Carman, S., Jordan, D., Wandelmer, J. S. P., ... Giles, C. L. (2012). Web crawler middleware for search engine digital libraries: A case study for CiteSeerX. In International Conference on Information and Knowledge Management, Proceedings (pp. 57-64). https://doi.org/10.1145/2389936.2389949

@inproceedings{afa0e4d545e749f59497d7000f7d94c5,
title = "Web crawler middleware for search engine digital libraries: A case study for {CiteSeerX}",
abstract = "Middleware is an important part of many search engine web crawling processes. We developed a middleware, the Crawl Document Importer (CDI), which selectively imports documents and the associated metadata to the digital library CiteSeerX crawl repository and database. This middleware is designed to be extensible as it provides a universal interface to the crawl database. It is designed to support input from multiple open source crawlers and archival formats, e.g., ARC, WARC. It can also import files downloaded via FTP. To use this middleware for another crawler, the user only needs to write a new log parser which returns a resource object with the standard metadata attributes and tells the middleware how to access downloaded files. When importing documents, users can specify document MIME types and obtain text extracted from PDF/PostScript documents. The middleware can adaptively identify academic research papers based on document context features. We developed a web user interface where the user can submit importing jobs. The middleware package can also work on supplemental jobs related to the crawl database and repository. Though designed for the CiteSeerX search engine, we feel this design would be appropriate for many search engine web crawling systems.",
keywords = "Information retrieval, Ingestion, Middleware, Search engine, Web crawling",
author = "Jian Wu and Pradeep Teregowda and Madian Khabsa and Stephen Carman and Douglas Jordan and Wandelmer, {Jose San Pedro} and Xin Lu and Prasenjit Mitra and Giles, {C. Lee}",
year = "2012",
doi = "10.1145/2389936.2389949",
language = "English",
isbn = "9781450317207",
pages = "57--64",
booktitle = "International Conference on Information and Knowledge Management, Proceedings",

}

TY - GEN
T1 - Web crawler middleware for search engine digital libraries
T2 - A case study for CiteSeerX
AU - Wu, Jian
AU - Teregowda, Pradeep
AU - Khabsa, Madian
AU - Carman, Stephen
AU - Jordan, Douglas
AU - Wandelmer, Jose San Pedro
AU - Lu, Xin
AU - Mitra, Prasenjit
AU - Giles, C. Lee
PY - 2012
Y1 - 2012
AB - Middleware is an important part of many search engine web crawling processes. We developed a middleware, the Crawl Document Importer (CDI), which selectively imports documents and the associated metadata to the digital library CiteSeerX crawl repository and database. This middleware is designed to be extensible as it provides a universal interface to the crawl database. It is designed to support input from multiple open source crawlers and archival formats, e.g., ARC, WARC. It can also import files downloaded via FTP. To use this middleware for another crawler, the user only needs to write a new log parser which returns a resource object with the standard metadata attributes and tells the middleware how to access downloaded files. When importing documents, users can specify document MIME types and obtain text extracted from PDF/PostScript documents. The middleware can adaptively identify academic research papers based on document context features. We developed a web user interface where the user can submit importing jobs. The middleware package can also work on supplemental jobs related to the crawl database and repository. Though designed for the CiteSeerX search engine, we feel this design would be appropriate for many search engine web crawling systems.
KW - Information retrieval
KW - Ingestion
KW - Middleware
KW - Search engine
KW - Web crawling
UR - http://www.scopus.com/inward/record.url?scp=84870493887&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84870493887&partnerID=8YFLogxK
U2 - 10.1145/2389936.2389949
DO - 10.1145/2389936.2389949
M3 - Conference contribution
AN - SCOPUS:84870493887
SN - 9781450317207
SP - 57
EP - 64
BT - International Conference on Information and Knowledge Management, Proceedings
ER -