A memory-efficient strategy for exploring the Web

Carlos Castillo, Alberto Nelli, Alessandro Panconesi

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Search engines rely on Web crawlers to create an index of the Web. Web crawlers explore the Web downloading pages and finding links to new pages to be explored. At any given moment, there are a number of pages waiting to be downloaded in the crawler queue. We study the growth of this queue of pending pages during a crawl of a large subset of the Web. In a normal breadth-first crawler, the queue quickly grows very large. We present a strategy for managing the pending queue that reduces its maximum size by 50% while preserving the coverage and quality of the pages visited. This can be applied to general purpose Web crawlers as well as topic-specific crawling, peer-to-peer search, on-demand Web crawling, and other environments in which memory usage has to be kept to a minimum.
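
The queue growth described in the abstract can be illustrated with a small simulation. What follows is a minimal sketch and not the authors' method: it runs a plain breadth-first crawl over a hypothetical in-memory link graph and records the peak size of the pending queue. The function name `bfs_crawl`, the `links` dictionary, and the toy graph are assumptions made for illustration. The paper's queue-management strategy is not described in this record and is not reproduced here.

```python
from collections import deque


def bfs_crawl(links, seed):
    """Breadth-first crawl over an in-memory link graph.

    `links` maps each page to the list of pages it links to. It is a
    hypothetical stand-in for downloading a page and extracting its links.
    Returns the visit order and the peak size of the pending queue.
    """
    pending = deque([seed])   # pages waiting to be downloaded
    seen = {seed}             # pages already queued or visited
    order = []
    max_pending = len(pending)

    while pending:
        page = pending.popleft()          # "download" the oldest pending page
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:        # enqueue each new page only once
                seen.add(target)
                pending.append(target)
        max_pending = max(max_pending, len(pending))

    return order, max_pending


# Toy graph (made-up data): one hub page that links to several others.
links = {
    "home": ["a", "b", "c", "d"],
    "a": ["e"],
    "b": ["e", "f"],
}

order, max_pending = bfs_crawl(links, "home")
print(order)
print(max_pending)
```

In this toy graph the pending queue peaks at four entries right after the first page is processed, because every out-link of the hub page is enqueued before any of them is downloaded. On a real crawl the same effect compounds across many pages, which is consistent with the abstract's observation that the queue of a breadth-first crawler quickly grows very large.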

Original language: English
Title of host publication: Proceedings - 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2006 Main Conference Proceedings), WI'06
Pages: 680-686
Number of pages: 7
DOIs: https://doi.org/10.1109/WI.2006.18
Publication status: Published - 1 Dec 2007
Externally published: Yes
Event: 2006 IEEE/WIC/ACM International Conference on Web Intelligence, WI'06 - Hong Kong, China
Duration: 18 Dec 2006 - 22 Dec 2006

Other

Other: 2006 IEEE/WIC/ACM International Conference on Web Intelligence, WI'06
Country: China
City: Hong Kong
Period: 18/12/06 - 22/12/06

Fingerprint

World Wide Web
Data storage equipment
Search engines
Websites

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Software

Cite this

Castillo, C., Nelli, A., & Panconesi, A. (2007). A memory-efficient strategy for exploring the Web. In Proceedings - 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2006 Main Conference Proceedings), WI'06 (pp. 680-686). [4061453] https://doi.org/10.1109/WI.2006.18

A memory-efficient strategy for exploring the Web. / Castillo, Carlos; Nelli, Alberto; Panconesi, Alessandro.

Proceedings - 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2006 Main Conference Proceedings), WI'06. 2007. p. 680-686 4061453.


Castillo, C, Nelli, A & Panconesi, A 2007, A memory-efficient strategy for exploring the Web. in Proceedings - 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2006 Main Conference Proceedings), WI'06, 4061453, pp. 680-686, 2006 IEEE/WIC/ACM International Conference on Web Intelligence, WI'06, Hong Kong, China, 18/12/06. https://doi.org/10.1109/WI.2006.18

Castillo C, Nelli A, Panconesi A. A memory-efficient strategy for exploring the Web. In Proceedings - 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2006 Main Conference Proceedings), WI'06. 2007. p. 680-686. 4061453 https://doi.org/10.1109/WI.2006.18

Castillo, Carlos ; Nelli, Alberto ; Panconesi, Alessandro. / A memory-efficient strategy for exploring the Web. Proceedings - 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2006 Main Conference Proceedings), WI'06. 2007. pp. 680-686

@inproceedings{071e23332810475db3ca37d335df0878,
title = "A memory-efficient strategy for exploring the Web",
abstract = "Search engines rely on Web crawlers to create an index of the Web. Web crawlers explore the Web downloading pages and finding links to new pages to be explored. At any given moment, there are a number of pages waiting to be downloaded in the crawler queue. We study the growth of this queue of pending pages during a crawl of a large subset of the Web. In a normal breadth-first crawler, the queue quickly grows very large. We present a strategy for managing the pending queue that reduces its maximum size by 50{\%} while preserving the coverage and quality of the pages visited. This can be applied to general purpose Web crawlers as well as topic-specific crawling, peer-to-peer search, on-demand Web crawling, and other environments in which memory usage has to be kept to a minimum.",
author = "Carlos Castillo and Alberto Nelli and Alessandro Panconesi",
year = "2007",
month = "12",
day = "1",
doi = "10.1109/WI.2006.18",
language = "English",
isbn = "0769527477",
pages = "680--686",
booktitle = "Proceedings - 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2006 Main Conference Proceedings), WI'06",

}

TY - GEN

T1 - A memory-efficient strategy for exploring the Web

AU - Castillo, Carlos

AU - Nelli, Alberto

AU - Panconesi, Alessandro

PY - 2007/12/1

Y1 - 2007/12/1

N2 - Search engines rely on Web crawlers to create an index of the Web. Web crawlers explore the Web downloading pages and finding links to new pages to be explored. At any given moment, there are a number of pages waiting to be downloaded in the crawler queue. We study the growth of this queue of pending pages during a crawl of a large subset of the Web. In a normal breadth-first crawler, the queue quickly grows very large. We present a strategy for managing the pending queue that reduces its maximum size by 50% while preserving the coverage and quality of the pages visited. This can be applied to general purpose Web crawlers as well as topic-specific crawling, peer-to-peer search, on-demand Web crawling, and other environments in which memory usage has to be kept to a minimum.

UR - http://www.scopus.com/inward/record.url?scp=42549123573&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=42549123573&partnerID=8YFLogxK

U2 - 10.1109/WI.2006.18

DO - 10.1109/WI.2006.18

M3 - Conference contribution

SN - 0769527477

SN - 9780769527475

SP - 680

EP - 686

BT - Proceedings - 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2006 Main Conference Proceedings), WI'06

ER -