The evolution of a crawling strategy for an academic document search engine

Whitelists and blacklists

Jian Wu, Pradeep Teregowda, Juan Pablo Fernández Ramírez, Prasenjit Mitra, Shuyi Zheng, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

15 Citations (Scopus)

Abstract

We present a preliminary study of the evolution of a crawling strategy for an academic document search engine, in particular CiteSeerX. CiteSeerX actively crawls the web for academic and research documents, primarily in computer and information sciences, and then performs unique information extraction and indexing, extracting information such as OAI metadata, citations, tables and others. As such, CiteSeerX could be considered a specialty or vertical search engine. To improve precision in resources expended, we replace a blacklist with a whitelist and compare the crawling efficiencies before and after this change. A blacklist means the crawl is forbidden from a certain list of URLs, such as publisher domains, but is otherwise unlimited. A whitelist means only certain domains are considered and others are not crawled. The whitelist is generated based on domain ranking scores of approximately five million parent URLs harvested by the CiteSeerX crawler in the past four years. We calculate the F1 scores for each domain by applying equal weights to document numbers and citation rates. The whitelist is then generated by re-ordering parent URLs based on their domain ranking scores. We found that crawling the whitelist significantly increases the crawl precision by reducing a large number of irrelevant requests and downloads.
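
The two crawl policies contrasted in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the CiteSeerX implementation: the function names and the example.* domains are hypothetical, and the harmonic-mean form of the score is an assumption, since the abstract only says that the F1 score applies equal weights to document numbers and citation rates.

```python
from __future__ import annotations

from urllib.parse import urlparse


def domain_of(url: str) -> str:
    """Return the lowercase host name of a URL ('' if it has none)."""
    return (urlparse(url).hostname or "").lower()


def allowed_by_blacklist(url: str, blacklist: set[str]) -> bool:
    """Blacklist policy: crawl everything except the listed domains."""
    return domain_of(url) not in blacklist


def allowed_by_whitelist(url: str, whitelist: set[str]) -> bool:
    """Whitelist policy: crawl only the listed domains."""
    return domain_of(url) in whitelist


def f1_score(doc_score: float, citation_score: float) -> float:
    """Equal-weight harmonic mean of two scores in [0, 1] (assumed form)."""
    total = doc_score + citation_score
    return 0.0 if total == 0 else 2 * doc_score * citation_score / total


if __name__ == "__main__":
    # Hypothetical domains, used only to show the difference in behaviour.
    url = "http://news.example.org/story.html"
    print(allowed_by_blacklist(url, {"publisher.example.com"}))  # not listed, so crawled
    print(allowed_by_whitelist(url, {"cs.example.edu"}))         # not listed, so skipped
```

The same unlisted URL is accepted under the blacklist and rejected under the whitelist, which is the mechanism by which a whitelist cuts down irrelevant requests.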

Original language: English
Title of host publication: Proceedings of the 3rd Annual ACM Web Science Conference, WebSci'12
Pages: 340-343
Number of pages: 4
DOIs: https://doi.org/10.1145/2380718.2380762
Publication status: Published - 2012
Externally published: Yes
Event: 3rd Annual ACM Web Science Conference, WebSci 2012 - Evanston, IL, United States
Duration: 22 Jun 2012 to 24 Jun 2012

Other

Other: 3rd Annual ACM Web Science Conference, WebSci 2012
Country: United States
City: Evanston, IL
Period: 22/6/12 to 24/6/12

Fingerprint

Search engines
Websites
Information science
Metadata
Computer science

Keywords

  • Information retrieval
  • Search engine
  • Web crawling

ASJC Scopus subject areas

  • Computer Networks and Communications

Cite this

Wu, J., Teregowda, P., Ramírez, J. P. F., Mitra, P., Zheng, S., & Giles, C. L. (2012). The evolution of a crawling strategy for an academic document search engine: Whitelists and blacklists. In Proceedings of the 3rd Annual ACM Web Science Conference, WebSci'12 (pp. 340-343) https://doi.org/10.1145/2380718.2380762

The evolution of a crawling strategy for an academic document search engine: Whitelists and blacklists. / Wu, Jian; Teregowda, Pradeep; Ramírez, Juan Pablo Fernández; Mitra, Prasenjit; Zheng, Shuyi; Giles, C. Lee.

Proceedings of the 3rd Annual ACM Web Science Conference, WebSci'12. 2012. p. 340-343.

Wu, J, Teregowda, P, Ramírez, JPF, Mitra, P, Zheng, S & Giles, CL 2012, The evolution of a crawling strategy for an academic document search engine: Whitelists and blacklists. in Proceedings of the 3rd Annual ACM Web Science Conference, WebSci'12. pp. 340-343, 3rd Annual ACM Web Science Conference, WebSci 2012, Evanston, IL, United States, 22/6/12. https://doi.org/10.1145/2380718.2380762
Wu J, Teregowda P, Ramírez JPF, Mitra P, Zheng S, Giles CL. The evolution of a crawling strategy for an academic document search engine: Whitelists and blacklists. In Proceedings of the 3rd Annual ACM Web Science Conference, WebSci'12. 2012. p. 340-343 https://doi.org/10.1145/2380718.2380762
Wu, Jian; Teregowda, Pradeep; Ramírez, Juan Pablo Fernández; Mitra, Prasenjit; Zheng, Shuyi; Giles, C. Lee. / The evolution of a crawling strategy for an academic document search engine: Whitelists and blacklists. Proceedings of the 3rd Annual ACM Web Science Conference, WebSci'12. 2012. pp. 340-343
@inproceedings{36e07dcef27a4d55a80afa388136ecaf,
title = "The evolution of a crawling strategy for an academic document search engine: Whitelists and blacklists",
abstract = "We present a preliminary study of the evolution of a crawling strategy for an academic document search engine, in particular CiteSeerX. CiteSeerX actively crawls the web for academic and research documents, primarily in computer and information sciences, and then performs unique information extraction and indexing, extracting information such as OAI metadata, citations, tables and others. As such, CiteSeerX could be considered a specialty or vertical search engine. To improve precision in resources expended, we replace a blacklist with a whitelist and compare the crawling efficiencies before and after this change. A blacklist means the crawl is forbidden from a certain list of URLs, such as publisher domains, but is otherwise unlimited. A whitelist means only certain domains are considered and others are not crawled. The whitelist is generated based on domain ranking scores of approximately five million parent URLs harvested by the CiteSeerX crawler in the past four years. We calculate the F1 scores for each domain by applying equal weights to document numbers and citation rates. The whitelist is then generated by re-ordering parent URLs based on their domain ranking scores. We found that crawling the whitelist significantly increases the crawl precision by reducing a large number of irrelevant requests and downloads.",
keywords = "Information retrieval, Search engine, Web crawling",
author = "Jian Wu and Pradeep Teregowda and Ram{\'i}rez, {Juan Pablo Fern{\'a}ndez} and Prasenjit Mitra and Shuyi Zheng and Giles, {C. Lee}",
year = "2012",
doi = "10.1145/2380718.2380762",
language = "English",
isbn = "9781450312288",
pages = "340--343",
booktitle = "Proceedings of the 3rd Annual ACM Web Science Conference, WebSci'12",
}

TY - GEN
T1 - The evolution of a crawling strategy for an academic document search engine
T2 - Whitelists and blacklists
AU - Wu, Jian
AU - Teregowda, Pradeep
AU - Ramírez, Juan Pablo Fernández
AU - Mitra, Prasenjit
AU - Zheng, Shuyi
AU - Giles, C. Lee
PY - 2012
Y1 - 2012

AB - We present a preliminary study of the evolution of a crawling strategy for an academic document search engine, in particular CiteSeerX. CiteSeerX actively crawls the web for academic and research documents, primarily in computer and information sciences, and then performs unique information extraction and indexing, extracting information such as OAI metadata, citations, tables and others. As such, CiteSeerX could be considered a specialty or vertical search engine. To improve precision in resources expended, we replace a blacklist with a whitelist and compare the crawling efficiencies before and after this change. A blacklist means the crawl is forbidden from a certain list of URLs, such as publisher domains, but is otherwise unlimited. A whitelist means only certain domains are considered and others are not crawled. The whitelist is generated based on domain ranking scores of approximately five million parent URLs harvested by the CiteSeerX crawler in the past four years. We calculate the F1 scores for each domain by applying equal weights to document numbers and citation rates. The whitelist is then generated by re-ordering parent URLs based on their domain ranking scores. We found that crawling the whitelist significantly increases the crawl precision by reducing a large number of irrelevant requests and downloads.

KW - Information retrieval
KW - Search engine
KW - Web crawling
UR - http://www.scopus.com/inward/record.url?scp=84869071720&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84869071720&partnerID=8YFLogxK
U2 - 10.1145/2380718.2380762
DO - 10.1145/2380718.2380762
M3 - Conference contribution
SN - 9781450312288
SP - 340
EP - 343
BT - Proceedings of the 3rd Annual ACM Web Science Conference, WebSci'12
ER -