Scheduling algorithms for Web crawling

Carlos Castillo, Mauricio Marin, Ricardo Baeza-Yates, Andrea Rodriguez

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

28 Citations (Scopus)

Abstract

This article presents a comparative study of strategies for Web crawling. We show that a combination of breadth-first ordering with the largest sites first is a practical alternative since it is fast, simple to implement, and able to retrieve the best ranked pages at a rate that is closer to the optimal than other alternatives. Our study was performed on a large sample of the Chilean Web which was crawled by using simulators, so that all strategies were compared under the same conditions, and actual crawls to validate our conclusions. We also explored the effects of large scale parallelism in the page retrieval task and multiple-page requests in a single connection for effective amortization of latency times.
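The strategy named in the abstract can be illustrated with a small simulated frontier. The following is a minimal sketch, not the authors' implementation: it assumes that "largest site" means the site with the most pending pages in the frontier, that pages within a site are served in first-in-first-out (breadth-first) order, and that ties are broken by site name. The names `LargestSitesFirstScheduler`, `simulate_crawl`, `links`, and `site_of` are hypothetical and chosen for illustration only.

```python
from collections import deque


class LargestSitesFirstScheduler:
    """Toy crawl frontier: one FIFO queue per site (assumed design).

    The next URL always comes from the site that currently has the most
    pending pages; ties are broken alphabetically by site name.
    """

    def __init__(self):
        self.queues = {}   # site -> deque of pending URLs
        self.seen = set()  # URLs already scheduled at least once

    def add(self, site, url):
        # Schedule each URL at most once.
        if url in self.seen:
            return
        self.seen.add(url)
        self.queues.setdefault(site, deque()).append(url)

    def next_url(self):
        if not self.queues:
            return None
        # Largest pending queue first; site name breaks ties deterministically.
        site = min(self.queues, key=lambda s: (-len(self.queues[s]), s))
        url = self.queues[site].popleft()
        if not self.queues[site]:
            del self.queues[site]
        return url


def simulate_crawl(links, site_of, seeds):
    """Return the visit order over an in-memory link graph (no network I/O)."""
    scheduler = LargestSitesFirstScheduler()
    for url in seeds:
        scheduler.add(site_of[url], url)
    order = []
    while True:
        url = scheduler.next_url()
        if url is None:
            break
        order.append(url)
        # "Downloading" a page reveals its out-links, which join the frontier.
        for target in links.get(url, []):
            scheduler.add(site_of[target], target)
    return order
```

Calling `simulate_crawl(links, site_of, seeds)` with a dictionary of out-links and a page-to-site mapping returns the order in which pages would be fetched. Running on a simulated graph rather than the live Web mirrors the abstract's point that a simulator lets different strategies be compared under the same conditions.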

Original language: English
Title of host publication: Proceedings - WebMedia and LA-Web 2004
Editors: M.G.C. Pimentel, E.V. Munson
Pages: 10-17
Number of pages: 8
DOIs: https://doi.org/10.1109/WEBMED.2004.1348139
Publication status: Published - 1 Dec 2004
Externally published: Yes
Event: Proceedings - WebMedia and LA-Web 2004 Joint Conference (10th Brazilian Symposium on Multimedia and the Web, 2nd Latin American Web Congress) - Ribeirao Preto-SP, Brazil
Duration: 12 Oct 2004 to 15 Oct 2004

Other

Other: Proceedings - WebMedia and LA-Web 2004 Joint Conference (10th Brazilian Symposium on Multimedia and the Web, 2nd Latin American Web Congress)
Country: Brazil
City: Ribeirao Preto-SP
Period: 12/10/04 to 15/10/04

Fingerprint

Scheduling algorithms
Simulators

ASJC Scopus subject areas

  • Engineering(all)

Cite this

Castillo, C., Marin, M., Baeza-Yates, R., & Rodriguez, A. (2004). Scheduling algorithms for Web crawling. In M. G. C. Pimentel, & E. V. Munson (Eds.), Proceedings - WebMedia and LA-Web 2004 (pp. 10-17) https://doi.org/10.1109/WEBMED.2004.1348139

Scheduling algorithms for Web crawling. / Castillo, Carlos; Marin, Mauricio; Baeza-Yates, Ricardo; Rodriguez, Andrea.

Proceedings - WebMedia and LA-Web 2004. ed. / M.G.C. Pimentel; E.V. Munson. 2004. p. 10-17.

Castillo, C, Marin, M, Baeza-Yates, R & Rodriguez, A 2004, Scheduling algorithms for Web crawling. in MGC Pimentel & EV Munson (eds), Proceedings - WebMedia and LA-Web 2004. pp. 10-17, Proceedings - WebMedia and LA-Web 2004 Joint Conference (10th Brazilian Symposium on Multimedia and the Web, 2nd Latin American Web Congress), Ribeirao Preto-SP, Brazil, 12/10/04. https://doi.org/10.1109/WEBMED.2004.1348139

Castillo C, Marin M, Baeza-Yates R, Rodriguez A. Scheduling algorithms for Web crawling. In Pimentel MGC, Munson EV, editors, Proceedings - WebMedia and LA-Web 2004. 2004. p. 10-17 https://doi.org/10.1109/WEBMED.2004.1348139

Castillo, Carlos ; Marin, Mauricio ; Baeza-Yates, Ricardo ; Rodriguez, Andrea. / Scheduling algorithms for Web crawling. Proceedings - WebMedia and LA-Web 2004. editor / M.G.C. Pimentel ; E.V. Munson. 2004. pp. 10-17

@inproceedings{4f0db1920dd94e639895c3b73911d1bd,
title = "Scheduling algorithms for Web crawling",
abstract = "This article presents a comparative study of strategies for Web crawling. We show that a combination of breadth-first ordering with the largest sites first is a practical alternative since it is fast, simple to implement, and able to retrieve the best ranked pages at a rate that is closer to the optimal than other alternatives. Our study was performed on a large sample of the Chilean Web which was crawled by using simulators, so that all strategies were compared under the same conditions, and actual crawls to validate our conclusions. We also explored the effects of large scale parallelism in the page retrieval task and multiple-page requests in a single connection for effective amortization of latency times.",
author = "Carlos Castillo and Mauricio Marin and Ricardo Baeza-Yates and Andrea Rodriguez",
year = "2004",
month = "12",
day = "1",
doi = "10.1109/WEBMED.2004.1348139",
language = "English",
isbn = "0769522378",
pages = "10--17",
editor = "M.G.C. Pimentel and E.V. Munson",
booktitle = "Proceedings - WebMedia and LA-Web 2004",
}

TY - GEN
T1 - Scheduling algorithms for Web crawling
AU - Castillo, Carlos
AU - Marin, Mauricio
AU - Baeza-Yates, Ricardo
AU - Rodriguez, Andrea
PY - 2004/12/1
Y1 - 2004/12/1
AB - This article presents a comparative study of strategies for Web crawling. We show that a combination of breadth-first ordering with the largest sites first is a practical alternative since it is fast, simple to implement, and able to retrieve the best ranked pages at a rate that is closer to the optimal than other alternatives. Our study was performed on a large sample of the Chilean Web which was crawled by using simulators, so that all strategies were compared under the same conditions, and actual crawls to validate our conclusions. We also explored the effects of large scale parallelism in the page retrieval task and multiple-page requests in a single connection for effective amortization of latency times.
UR - http://www.scopus.com/inward/record.url?scp=15844394068&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=15844394068&partnerID=8YFLogxK
U2 - 10.1109/WEBMED.2004.1348139
DO - 10.1109/WEBMED.2004.1348139
M3 - Conference contribution
SN - 0769522378
SN - 9780769522371
SP - 10
EP - 17
BT - Proceedings - WebMedia and LA-Web 2004
A2 - Pimentel, M.G.C.
A2 - Munson, E.V.
ER -