Clustering-based incremental web crawling

Qingzhao Tan, Prasenjit Mitra

Research output: Contribution to journalArticle

31 Citations (Scopus)

Abstract

When crawling resources, for example, number of machines, crawl-time, and so on, are limited, so a crawler has to decide an optimal order in which to crawl and recrawl Web pages. Ideally, crawlers should request only those Web pages that have changed since the last crawl; in practice, a crawler may not know whether a Web page has changed before downloading it. In this article, we identify features of Web pages that are correlated to their change frequency. We design a crawling algorithm that clusters Web pages based on features that correlate to their change frequencies obtained by examining past history. The crawler downloads a sample of Web pages from each cluster, and depending upon whether a significant number of these Web pages have changed in the last crawl cycle, it decides whether to recrawl the entire cluster. To evaluate the performance of our incremental crawler, we develop an evaluation framework that measures which crawling policy results in the best search results for the end-user. We run experiments on a real Web data set of about 300,000 distinct URLs distributed among 210 Web sites. The results demonstrate that the clustering-based sampling algorithm effectively clusters the pages with similar change patterns, and our clustering-based crawling algorithm outperforms existing algorithms in that it can improve the quality of the user experience for those who query the search engine.

Original languageEnglish
Article number17
JournalACM Transactions on Information Systems
Volume28
Issue number4
DOIs
Publication statusPublished - Nov 2010
Externally publishedYes

Fingerprint

Websites
Clustering
World Wide Web
Incremental
Search engines
Sampling

Keywords

  • Clustering
  • Refresh policy
  • Sampling
  • Search engine
  • Web crawler

ASJC Scopus subject areas

  • Information Systems
  • Business, Management and Accounting(all)
  • Computer Science Applications

Cite this

Clustering-based incremental web crawling. / Tan, Qingzhao; Mitra, Prasenjit.

In: ACM Transactions on Information Systems, Vol. 28, No. 4, 17, 11.2010.

Research output: Contribution to journalArticle

Tan, Qingzhao ; Mitra, Prasenjit. / Clustering-based incremental web crawling. In: ACM Transactions on Information Systems. 2010 ; Vol. 28, No. 4.
@article{87ff894933c64882943ffed89fdc94d9,
title = "Clustering-based incremental web crawling",
abstract = "When crawling resources, for example, number of machines, crawl-time, and so on, are limited, so a crawler has to decide an optimal order in which to crawl and recrawl Web pages. Ideally, crawlers should request only those Web pages that have changed since the last crawl; in practice, a crawler may not know whether a Web page has changed before downloading it. In this article, we identify features of Web pages that are correlated to their change frequency. We design a crawling algorithm that clusters Web pages based on features that correlate to their change frequencies obtained by examining past history. The crawler downloads a sample of Web pages from each cluster, and depending upon whether a significant number of these Web pages have changed in the last crawl cycle, it decides whether to recrawl the entire cluster. To evaluate the performance of our incremental crawler, we develop an evaluation framework that measures which crawling policy results in the best search results for the end-user. We run experiments on a real Web data set of about 300,000 distinct URLs distributed among 210 Web sites. The results demonstrate that the clustering-based sampling algorithm effectively clusters the pages with similar change patterns, and our clustering-based crawling algorithm outperforms existing algorithms in that it can improve the quality of the user experience for those who query the search engine.",
keywords = "Clustering, Refresh policy, Sampling, Search engine, Web crawler",
author = "Qingzhao Tan and Prasenjit Mitra",
year = "2010",
month = "11",
doi = "10.1145/1852102.1852103",
language = "English",
volume = "28",
journal = "ACM Transactions on Information Systems",
issn = "1046-8188",
publisher = "Association for Computing Machinery (ACM)",
number = "4",

}

TY - JOUR

T1 - Clustering-based incremental web crawling

AU - Tan, Qingzhao

AU - Mitra, Prasenjit

PY - 2010/11

Y1 - 2010/11

N2 - When crawling resources, for example, number of machines, crawl-time, and so on, are limited, so a crawler has to decide an optimal order in which to crawl and recrawl Web pages. Ideally, crawlers should request only those Web pages that have changed since the last crawl; in practice, a crawler may not know whether a Web page has changed before downloading it. In this article, we identify features of Web pages that are correlated to their change frequency. We design a crawling algorithm that clusters Web pages based on features that correlate to their change frequencies obtained by examining past history. The crawler downloads a sample of Web pages from each cluster, and depending upon whether a significant number of these Web pages have changed in the last crawl cycle, it decides whether to recrawl the entire cluster. To evaluate the performance of our incremental crawler, we develop an evaluation framework that measures which crawling policy results in the best search results for the end-user. We run experiments on a real Web data set of about 300,000 distinct URLs distributed among 210 Web sites. The results demonstrate that the clustering-based sampling algorithm effectively clusters the pages with similar change patterns, and our clustering-based crawling algorithm outperforms existing algorithms in that it can improve the quality of the user experience for those who query the search engine.

AB - When crawling resources, for example, number of machines, crawl-time, and so on, are limited, so a crawler has to decide an optimal order in which to crawl and recrawl Web pages. Ideally, crawlers should request only those Web pages that have changed since the last crawl; in practice, a crawler may not know whether a Web page has changed before downloading it. In this article, we identify features of Web pages that are correlated to their change frequency. We design a crawling algorithm that clusters Web pages based on features that correlate to their change frequencies obtained by examining past history. The crawler downloads a sample of Web pages from each cluster, and depending upon whether a significant number of these Web pages have changed in the last crawl cycle, it decides whether to recrawl the entire cluster. To evaluate the performance of our incremental crawler, we develop an evaluation framework that measures which crawling policy results in the best search results for the end-user. We run experiments on a real Web data set of about 300,000 distinct URLs distributed among 210 Web sites. The results demonstrate that the clustering-based sampling algorithm effectively clusters the pages with similar change patterns, and our clustering-based crawling algorithm outperforms existing algorithms in that it can improve the quality of the user experience for those who query the search engine.

KW - Clustering

KW - Refresh policy

KW - Sampling

KW - Search engine

KW - Web crawler

UR - http://www.scopus.com/inward/record.url?scp=80051485415&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=80051485415&partnerID=8YFLogxK

U2 - 10.1145/1852102.1852103

DO - 10.1145/1852102.1852103

M3 - Article

VL - 28

JO - ACM Transactions on Information Systems

JF - ACM Transactions on Information Systems

SN - 1046-8188

IS - 4

M1 - 17

ER -