Crawling a country

Better strategies than breadth-first for web page ordering

Ricardo Baeza-Yates, Carlos Castillo, Mauricio Marin, Andrea Rodriguez

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

76 Citations (Scopus)

Abstract

This article compares several page ordering strategies for Web crawling under several metrics. The objective of these strategies is to download the most "important" pages "early" during the crawl. As the coverage of modern search engines is small compared to the size of the Web, and it is impossible to index all of the Web for both theoretical and practical reasons, it is relevant to index at least the most important pages. We use data from actual Web pages to build Web graphs and execute a crawler simulator on those graphs. As the Web is very dynamic, crawling simulation is the only way to ensure that all the strategies considered are compared under the same conditions. We propose several page ordering strategies that are more efficient than breadth-first search and strategies based on partial Pagerank calculations.
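The comparison described in the abstract can be illustrated with a very small simulation. The sketch below is not the authors' simulator or any of their proposed strategies: the toy graph, the function names, the use of full-graph in-degree as a stand-in for page "importance", and the backlink-count ordering are all assumptions made for illustration. It only shows the general idea of running two orderings over the same fixed graph and measuring how much importance each has collected after a given number of downloads.

```python
from collections import deque

# Hypothetical toy Web graph (an assumption for illustration): page -> outlinks.
GRAPH = {
    "home": ["a", "b", "c"],
    "a": ["hub"],
    "b": ["hub"],
    "c": ["hub", "d"],
    "d": [],
    "hub": ["home"],
}


def in_degrees(graph):
    """Inbound link count per page over the full graph (stand-in for importance)."""
    deg = {page: 0 for page in graph}
    for targets in graph.values():
        for target in targets:
            deg[target] += 1
    return deg


def crawl_breadth_first(graph, seed):
    """Download pages in FIFO order of discovery."""
    order, seen, queue = [], {seed}, deque([seed])
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in graph[page]:
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order


def crawl_by_backlinks(graph, seed):
    """Download next the frontier page with the most links seen so far."""
    order, done = [], set()
    backlinks = {seed: 0}  # frontier page -> links seen from downloaded pages
    while backlinks:
        # Highest count first; ties broken by name so the run is deterministic.
        page = min(backlinks, key=lambda p: (-backlinks[p], p))
        del backlinks[page]
        done.add(page)
        order.append(page)
        for target in graph[page]:
            if target not in done:
                backlinks[target] = backlinks.get(target, 0) + 1
    return order


def cumulative_importance(order, importance):
    """Fraction of total importance collected after each download."""
    total = sum(importance.values())
    running, curve = 0, []
    for page in order:
        running += importance[page]
        curve.append(running / total)
    return curve


if __name__ == "__main__":
    importance = in_degrees(GRAPH)
    for name, crawl in [("breadth-first", crawl_breadth_first),
                        ("backlink-count", crawl_by_backlinks)]:
        order = crawl(GRAPH, "home")
        print(name, order, cumulative_importance(order, importance))
```

Because both orderings run over the same frozen graph, any difference in the curves comes from the ordering alone, which is the property the abstract attributes to simulation. On this toy graph the backlink-count ordering reaches the heavily linked page one download earlier than breadth-first.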

Original language: English
Title of host publication: 14th International World Wide Web Conference, WWW2005
Pages: 864-872
Number of pages: 9
DOI: https://doi.org/10.1145/1062745.1062768
Publication status: Published - 1 Dec 2005
Externally published: Yes
Event: 14th International World Wide Web Conference, WWW2005 - Chiba, Japan
Duration: 10 May 2005 to 14 May 2005

Other

Other: 14th International World Wide Web Conference, WWW2005
Country: Japan
City: Chiba
Period: 10/5/05 to 14/5/05

Fingerprint

Search engines
World Wide Web
Websites
Simulators
Computer simulation

Keywords

  • Scheduling policy
  • Web crawler
  • Web page importance

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Software

Cite this

Baeza-Yates, R., Castillo, C., Marin, M., & Rodriguez, A. (2005). Crawling a country: Better strategies than breadth-first for web page ordering. In 14th International World Wide Web Conference, WWW2005 (pp. 864-872) https://doi.org/10.1145/1062745.1062768

Crawling a country: Better strategies than breadth-first for web page ordering. / Baeza-Yates, Ricardo; Castillo, Carlos; Marin, Mauricio; Rodriguez, Andrea.

14th International World Wide Web Conference, WWW2005. 2005. p. 864-872.

Baeza-Yates, R, Castillo, C, Marin, M & Rodriguez, A 2005, Crawling a country: Better strategies than breadth-first for web page ordering. in 14th International World Wide Web Conference, WWW2005. pp. 864-872, 14th International World Wide Web Conference, WWW2005, Chiba, Japan, 10/5/05. https://doi.org/10.1145/1062745.1062768
Baeza-Yates R, Castillo C, Marin M, Rodriguez A. Crawling a country: Better strategies than breadth-first for web page ordering. In 14th International World Wide Web Conference, WWW2005. 2005. p. 864-872 https://doi.org/10.1145/1062745.1062768
Baeza-Yates, Ricardo; Castillo, Carlos; Marin, Mauricio; Rodriguez, Andrea. / Crawling a country: Better strategies than breadth-first for web page ordering. 14th International World Wide Web Conference, WWW2005. 2005. pp. 864-872
@inproceedings{3a8c598aed0f43919feb0389016bf1ea,
title = "Crawling a country: Better strategies than breadth-first for web page ordering",
abstract = "This article compares several page ordering strategies for Web crawling under several metrics. The objective of these strategies is to download the most {"}important{"} pages {"}early{"} during the crawl. As the coverage of modern search engines is small compared to the size of the Web, and it is impossible to index all of the Web for both theoretical and practical reasons, it is relevant to index at least the most important pages. We use data from actual Web pages to build Web graphs and execute a crawler simulator on those graphs. As the Web is very dynamic, crawling simulation is the only way to ensure that all the strategies considered are compared under the same conditions. We propose several page ordering strategies that are more efficient than breadth-first search and strategies based on partial Pagerank calculations.",
keywords = "Scheduling policy, Web crawler, Web page importance",
author = "Ricardo Baeza-Yates and Carlos Castillo and Mauricio Marin and Andrea Rodriguez",
year = "2005",
month = "12",
day = "1",
doi = "10.1145/1062745.1062768",
language = "English",
isbn = "1595930515",
pages = "864--872",
booktitle = "14th International World Wide Web Conference, WWW2005",

}

TY - GEN

T1 - Crawling a country

T2 - Better strategies than breadth-first for web page ordering

AU - Baeza-Yates, Ricardo

AU - Castillo, Carlos

AU - Marin, Mauricio

AU - Rodriguez, Andrea

PY - 2005/12/1

Y1 - 2005/12/1

N2 - This article compares several page ordering strategies for Web crawling under several metrics. The objective of these strategies is to download the most "important" pages "early" during the crawl. As the coverage of modern search engines is small compared to the size of the Web, and it is impossible to index all of the Web for both theoretical and practical reasons, it is relevant to index at least the most important pages. We use data from actual Web pages to build Web graphs and execute a crawler simulator on those graphs. As the Web is very dynamic, crawling simulation is the only way to ensure that all the strategies considered are compared under the same conditions. We propose several page ordering strategies that are more efficient than breadth-first search and strategies based on partial Pagerank calculations.

AB - This article compares several page ordering strategies for Web crawling under several metrics. The objective of these strategies is to download the most "important" pages "early" during the crawl. As the coverage of modern search engines is small compared to the size of the Web, and it is impossible to index all of the Web for both theoretical and practical reasons, it is relevant to index at least the most important pages. We use data from actual Web pages to build Web graphs and execute a crawler simulator on those graphs. As the Web is very dynamic, crawling simulation is the only way to ensure that all the strategies considered are compared under the same conditions. We propose several page ordering strategies that are more efficient than breadth-first search and strategies based on partial Pagerank calculations.

KW - Scheduling policy

KW - Web crawler

KW - Web page importance

UR - http://www.scopus.com/inward/record.url?scp=77953053635&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=77953053635&partnerID=8YFLogxK

U2 - 10.1145/1062745.1062768

DO - 10.1145/1062745.1062768

M3 - Conference contribution

SN - 1595930515

SN - 9781595930514

SP - 864

EP - 872

BT - 14th International World Wide Web Conference, WWW2005

ER -