Crawling the Infinite Web: Five Levels Are Enough

Ricardo Baeza-Yates, Carlos Castillo

Research output: Contribution to journal › Article

25 Citations (Scopus)

Abstract

A large number of publicly available Web pages are generated dynamically upon request, and contain links to other dynamically generated pages. This usually produces Web sites which can create arbitrarily many pages. In this article, several probabilistic models for browsing "infinite" Web sites are proposed and studied. We use these models to estimate how deep a crawler must go to download a significant portion of the Web site content that is actually visited. The proposed models are validated against real data on page views in several Web sites, showing that, in both theory and practice, a crawler needs to download just a few levels, no more than 3 to 5 "clicks" away from the start page, to reach 90% of the pages that users actually visit.
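
The headline number lends itself to a quick back-of-the-envelope check. The sketch below is not one of the paper's models; it assumes a simple geometric depth model in which a user follows a link one level deeper with fixed probability q, so the share of page views at depth d is (1 - q) * q^d and the share within the first L levels is 1 - q^(L+1). The value q = 0.6 is purely illustrative, not estimated from the paper's data.

# Illustrative sketch only (geometric depth assumption, not the paper's
# fitted models): a user goes one level deeper with probability q, so the
# share of page views within depths 0..L is 1 - q**(L + 1).

def coverage_within(levels: int, q: float = 0.6) -> float:
    """Fraction of page views reached by crawling `levels` levels
    below the start page, under the geometric assumption."""
    return 1.0 - q ** (levels + 1)

if __name__ == "__main__":
    for levels in range(8):
        print(f"within {levels} levels: {coverage_within(levels):.1%}")

With q = 0.6, coverage passes 90% at four levels and 95% at five, the same fast saturation the abstract reports; changing q shifts the exact crossover but not the shape of the curve.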

Original language: English
Pages (from-to): 156-167
Number of pages: 12
Journal: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 3243
Publication status: Published - 1 Dec 2004
Externally published: Yes

Fingerprint

  • Statistical Models
  • Websites
  • Browsing
  • Probabilistic Model
  • Model
  • Estimate

ASJC Scopus subject areas

  • Computer Science(all)
  • Biochemistry, Genetics and Molecular Biology(all)
  • Theoretical Computer Science

Cite this

Crawling the Infinite Web: Five Levels Are Enough. / Baeza-Yates, Ricardo; Castillo, Carlos.

In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 3243, 01.12.2004, p. 156-167.

Research output: Contribution to journal › Article

@article{779fd5b681664d10b0c1c00e78abab69,
  title     = "Crawling the Infinite Web: Five Levels Are Enough",
  abstract  = "A large number of publicly available Web pages are generated dynamically upon request, and contain links to other dynamically generated pages. This usually produces Web sites which can create arbitrarily many pages. In this article, several probabilistic models for browsing {"}infinite{"} Web sites are proposed and studied. We use these models to estimate how deep a crawler must go to download a significant portion of the Web site content that is actually visited. The proposed models are validated against real data on page views in several Web sites, showing that, in both theory and practice, a crawler needs to download just a few levels, no more than 3 to 5 {"}clicks{"} away from the start page, to reach 90\% of the pages that users actually visit.",
  author    = "Ricardo Baeza-Yates and Carlos Castillo",
  year      = "2004",
  month     = dec,
  day       = "1",
  language  = "English",
  volume    = "3243",
  pages     = "156--167",
  journal   = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
  issn      = "0302-9743",
  publisher = "Springer Verlag",
}

TY  - JOUR
T1  - Crawling the Infinite Web
T2  - Five Levels Are Enough
AU  - Baeza-Yates, Ricardo
AU  - Castillo, Carlos
PY  - 2004/12/1
Y1  - 2004/12/1
N2  - A large number of publicly available Web pages are generated dynamically upon request, and contain links to other dynamically generated pages. This usually produces Web sites which can create arbitrarily many pages. In this article, several probabilistic models for browsing "infinite" Web sites are proposed and studied. We use these models to estimate how deep a crawler must go to download a significant portion of the Web site content that is actually visited. The proposed models are validated against real data on page views in several Web sites, showing that, in both theory and practice, a crawler needs to download just a few levels, no more than 3 to 5 "clicks" away from the start page, to reach 90% of the pages that users actually visit.
AB  - A large number of publicly available Web pages are generated dynamically upon request, and contain links to other dynamically generated pages. This usually produces Web sites which can create arbitrarily many pages. In this article, several probabilistic models for browsing "infinite" Web sites are proposed and studied. We use these models to estimate how deep a crawler must go to download a significant portion of the Web site content that is actually visited. The proposed models are validated against real data on page views in several Web sites, showing that, in both theory and practice, a crawler needs to download just a few levels, no more than 3 to 5 "clicks" away from the start page, to reach 90% of the pages that users actually visit.
UR  - http://www.scopus.com/inward/record.url?scp=35048834240&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=35048834240&partnerID=8YFLogxK
M3  - Article
AN  - SCOPUS:35048834240
VL  - 3243
SP  - 156
EP  - 167
JO  - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
JF  - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SN  - 0302-9743
ER  -