A memory-efficient strategy for exploring the Web

Carlos Castillo, Alberto Nelli, Alessandro Panconesi

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Search engines rely on Web crawlers to create an index of the Web. Web crawlers explore the Web downloading pages and finding links to new pages to be explored. At any given moment, there are a number of pages waiting to be downloaded in the crawler queue. We study the growth of this queue of pending pages during a crawl of a large subset of the Web. In a normal breadth-first crawler, the queue quickly grows very large. We present a strategy for managing the pending queue that reduces its maximum size by 50% while preserving the coverage and quality of the pages visited. This can be applied to general purpose Web crawlers as well as topic-specific crawling, peer-to-peer search, on-demand Web crawling, and other environments in which memory usage has to be kept to a minimum.
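
To make the queue-growth problem concrete, the following is a minimal sketch of a plain breadth-first crawl that records the largest size the pending queue reaches. It is not the authors' queue-management strategy, which the abstract does not describe. The link graph `toy_web`, the function `bfs_crawl`, and the page names are hypothetical, and the "download" step is simulated by a dictionary lookup.

```python
from collections import deque


def bfs_crawl(links, seed):
    """Breadth-first crawl over a toy link graph.

    links: dict mapping a page to the list of pages it links to (assumed input).
    Returns the visit order and the peak size of the pending queue.
    """
    queue = deque([seed])   # pages waiting to be downloaded
    seen = {seed}           # pages already queued or downloaded
    order = []
    peak = len(queue)
    while queue:
        page = queue.popleft()              # "download" the oldest pending page
        order.append(page)
        for target in links.get(page, []):  # links found on that page
            if target not in seen:
                seen.add(target)
                queue.append(target)
        peak = max(peak, len(queue))        # track the queue's maximum size
    return order, peak


# Hypothetical eight-page Web used only for illustration.
toy_web = {
    "a": ["b", "c", "d"],
    "b": ["e", "f"],
    "c": ["f", "g"],
    "d": ["g", "h"],
}
order, peak = bfs_crawl(toy_web, "a")
```

In this toy graph the queue holds up to half of all pages at once. A strategy like the one the abstract describes would aim to lower that peak while still visiting the same pages.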

Original language: English
Title of host publication: Proceedings - 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2006 Main Conference Proceedings), WI'06
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 680-686
Number of pages: 7
ISBN (Print): 0769527477, 9780769527475
DOIs: https://doi.org/10.1109/WI.2006.18
Publication status: Published - 1 Jan 2006
Event: 2006 IEEE/WIC/ACM International Conference on Web Intelligence, WI'06 - Hong Kong, China
Duration: 18 Dec 2006 - 22 Dec 2006

Publication series

Name: Proceedings - 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2006 Main Conference Proceedings), WI'06

Other

Other: 2006 IEEE/WIC/ACM International Conference on Web Intelligence, WI'06
Country: China
City: Hong Kong
Period: 18/12/06 - 22/12/06

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Software

Cite this

Castillo, C., Nelli, A., & Panconesi, A. (2006). A memory-efficient strategy for exploring the Web. In Proceedings - 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2006 Main Conference Proceedings), WI'06 (pp. 680-686). [4061453] (Proceedings - 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2006 Main Conference Proceedings), WI'06). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/WI.2006.18