Scheduling algorithms for Web crawling

Carlos Castillo, Mauricio Marin, Ricardo Baeza-Yates, Andrea Rodriguez

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

30 Citations (Scopus)

Abstract

This article presents a comparative study of strategies for Web crawling. We show that combining breadth-first ordering with a largest-sites-first policy is a practical alternative: it is fast, simple to implement, and able to retrieve the best-ranked pages at a rate closer to the optimal than the other alternatives. Our study was performed on a large sample of the Chilean Web, using simulators so that all strategies were compared under the same conditions, and actual crawls to validate our conclusions. We also explored the effects of large-scale parallelism in the page retrieval task and of multiple-page requests in a single connection for effective amortization of latency times.
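The abstract does not spell out the exact queueing rules, so the following is only a minimal sketch of one plausible reading of "breadth-first ordering with the largest sites first": the next URL is taken from the site that currently has the most pending pages, and within a site URLs are served in discovery (FIFO) order. The class name `LargestSiteFirstFrontier`, its methods, and the example URLs are hypothetical and are not taken from the paper.

```python
from collections import defaultdict, deque
from urllib.parse import urlsplit


class LargestSiteFirstFrontier:
    """Hypothetical crawl frontier (an assumption, not the authors' code).

    Picks the site with the most pending URLs, then serves that site's
    URLs in the order they were discovered (FIFO, i.e. breadth-first).
    """

    def __init__(self):
        self.pending = defaultdict(deque)  # site -> URLs in discovery order
        self.seen = set()                  # URLs already queued or served

    def add(self, url):
        """Queue a newly discovered URL unless it was seen before."""
        if url in self.seen:
            return
        self.seen.add(url)
        self.pending[urlsplit(url).netloc].append(url)

    def next_url(self):
        """Return the next URL to fetch, or None when nothing is pending."""
        if not self.pending:
            return None
        # Largest site first: the site with the longest pending queue.
        # Ties go to the site that entered the frontier earliest.
        site = max(self.pending, key=lambda s: len(self.pending[s]))
        url = self.pending[site].popleft()
        if not self.pending[site]:
            del self.pending[site]  # drop empty queues so sizes stay accurate
        return url


frontier = LargestSiteFirstFrontier()
for u in [
    "http://a.example/1",
    "http://b.example/1",
    "http://b.example/2",
    "http://a.example/1",  # duplicate, ignored
]:
    frontier.add(u)

order = [frontier.next_url() for _ in range(3)]
print(order)
```

A real crawler would also need politeness delays per site and a linear scan over sites would be replaced by a priority structure at scale; both are left out here to keep the ordering rule itself visible.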

Original language: English
Title of host publication: Proceedings - WebMedia and LA-Web 2004
Editors: M.G.C. Pimentel, E.V. Munson
Pages: 10-17
Number of pages: 8
DOIs: https://doi.org/10.1109/WEBMED.2004.1348139
Publication status: Published - 1 Dec 2004
Event: Proceedings - WebMedia and LA-Web 2004 Joint Conference (10th Brazilian Symposium on Multimedia and the Web, 2nd Latin American Web Congress) - Ribeirao Preto-SP, Brazil
Duration: 12 Oct 2004 to 15 Oct 2004

Publication series

Name: Proceedings - WebMedia and LA-Web 2004

Other

Other: Proceedings - WebMedia and LA-Web 2004 Joint Conference (10th Brazilian Symposium on Multimedia and the Web, 2nd Latin American Web Congress)
Country: Brazil
City: Ribeirao Preto-SP
Period: 12/10/04 to 15/10/04

ASJC Scopus subject areas

  • Engineering(all)

Cite this

Castillo, C., Marin, M., Baeza-Yates, R., & Rodriguez, A. (2004). Scheduling algorithms for Web crawling. In M. G. C. Pimentel, & E. V. Munson (Eds.), Proceedings - WebMedia and LA-Web 2004 (pp. 10-17). (Proceedings - WebMedia and LA-Web 2004). https://doi.org/10.1109/WEBMED.2004.1348139