Designing clustering-based web crawling policies for search engine crawlers

Qingzhao Tan, Prasenjit Mitra, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

8 Citations (Scopus)

Abstract

The World Wide Web is growing and changing at an astonishing rate. Web information systems such as search engines have to keep up with the growth and change of the Web. Due to resource constraints, search engines usually have difficulty keeping the local database completely synchronized with the Web. In this paper, we study how to make good use of the limited system resources and detect as many changes as possible. Towards this goal, a crawler for a Web search engine should be able to predict the change behavior of webpages. We propose a clustering-based sampling approach. Specifically, we first group all the local webpages into clusters such that each cluster contains webpages with similar change patterns. We then sample webpages from each cluster to estimate the change frequency of all the webpages in that cluster. Finally, we let the crawler re-visit clusters containing webpages with higher change frequency with a higher probability. To evaluate the performance of an incremental crawler for a Web search engine, we measure both the freshness and the quality of the query results provided by the search engine. We run extensive experiments on a real Web data set of about 300,000 distinct URLs distributed among 210 websites. The results demonstrate that our clustering algorithm effectively groups pages with similar change patterns, and that our solution significantly outperforms existing methods: it detects more changed webpages and improves the quality of the user experience for those who query the search engine.
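The sample-then-revisit steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes the clustering step has already produced a mapping from cluster id to page URLs, and the function names, the `has_changed` callback, and the proportional-weight revisit rule are all hypothetical choices made for illustration.

```python
import random


def estimate_cluster_change_rates(clusters, has_changed, sample_size, rng):
    """Estimate, per cluster, the fraction of sampled pages that changed.

    clusters: dict mapping a cluster id to a list of page URLs (assumed to be
    the output of an earlier clustering step, which is not shown here).
    has_changed: hypothetical callback returning True if a page changed.
    """
    rates = {}
    for cluster_id, pages in clusters.items():
        if not pages:
            continue  # nothing to sample from an empty cluster
        sample = rng.sample(pages, min(sample_size, len(pages)))
        changed = sum(1 for page in sample if has_changed(page))
        rates[cluster_id] = changed / len(sample)
    return rates


def pick_clusters_to_revisit(rates, budget, rng):
    """Draw `budget` cluster ids, weighting each by its estimated change rate."""
    ids = list(rates)
    weights = [rates[c] for c in ids]
    if sum(weights) == 0:
        # No sampled page changed anywhere: fall back to uniform choice.
        return [rng.choice(ids) for _ in range(budget)]
    return rng.choices(ids, weights=weights, k=budget)


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy data: every "news" page changed, no "archive" page did.
    clusters = {
        "news": ["n1", "n2", "n3", "n4"],
        "archive": ["a1", "a2", "a3", "a4"],
    }
    changed_pages = {"n1", "n2", "n3", "n4"}
    rates = estimate_cluster_change_rates(
        clusters, lambda page: page in changed_pages, sample_size=2, rng=rng
    )
    print(rates)
    print(pick_clusters_to_revisit(rates, budget=5, rng=rng))
```

Weighting revisits in proportion to the sampled change rate is only one way to realize "higher change frequency, higher probability"; the paper itself may use a different rule.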

Original language: English
Title of host publication: International Conference on Information and Knowledge Management, Proceedings
Pages: 535-544
Number of pages: 10
DOIs: https://doi.org/10.1145/1321440.1321516
Publication status: Published - 2007
Externally published: Yes
Event: 16th ACM Conference on Information and Knowledge Management, CIKM 2007 - Lisboa
Duration: 6 Nov 2007 to 9 Nov 2007

Other

Other: 16th ACM Conference on Information and Knowledge Management, CIKM 2007
City: Lisboa
Period: 6/11/07 to 9/11/07

Fingerprint

Clustering
Search engine
World Wide Web
Web search
Query
Resources
Sampling
Web sites
Information systems
Incremental
Data base
Resource constraints
Behavior change
Clustering algorithm
Experiment
User experience

Keywords

  • Clustering
  • Incremental crawler
  • Refresh policy
  • Sampling
  • Web search engine

ASJC Scopus subject areas

  • Business, Management and Accounting (all)
  • Decision Sciences (all)

Cite this

Tan, Q., Mitra, P., & Giles, C. L. (2007). Designing clustering-based web crawling policies for search engine crawlers. In International Conference on Information and Knowledge Management, Proceedings (pp. 535-544). https://doi.org/10.1145/1321440.1321516

Tan, Q, Mitra, P & Giles, CL 2007, Designing clustering-based web crawling policies for search engine crawlers. in International Conference on Information and Knowledge Management, Proceedings. pp. 535-544, 16th ACM Conference on Information and Knowledge Management, CIKM 2007, Lisboa, 6/11/07. https://doi.org/10.1145/1321440.1321516
Tan Q, Mitra P, Giles CL. Designing clustering-based web crawling policies for search engine crawlers. In International Conference on Information and Knowledge Management, Proceedings. 2007. p. 535-544. https://doi.org/10.1145/1321440.1321516
Tan, Qingzhao ; Mitra, Prasenjit ; Giles, C. Lee / Designing clustering-based web crawling policies for search engine crawlers. International Conference on Information and Knowledge Management, Proceedings. 2007. pp. 535-544
@inproceedings{5c9c4a721cef47bba83a2cbebe7f6290,
title = "Designing clustering-based web crawling policies for search engine crawlers",
abstract = "The World Wide Web is growing and changing at an astonishing rate. Web information systems such as search engines have to keep up with the growth and change of the Web. Due to resource constraints, search engines usually have difficulty keeping the local database completely synchronized with the Web. In this paper, we study how to make good use of the limited system resources and detect as many changes as possible. Towards this goal, a crawler for a Web search engine should be able to predict the change behavior of webpages. We propose a clustering-based sampling approach. Specifically, we first group all the local webpages into clusters such that each cluster contains webpages with similar change patterns. We then sample webpages from each cluster to estimate the change frequency of all the webpages in that cluster. Finally, we let the crawler re-visit clusters containing webpages with higher change frequency with a higher probability. To evaluate the performance of an incremental crawler for a Web search engine, we measure both the freshness and the quality of the query results provided by the search engine. We run extensive experiments on a real Web data set of about 300,000 distinct URLs distributed among 210 websites. The results demonstrate that our clustering algorithm effectively groups pages with similar change patterns, and that our solution significantly outperforms existing methods: it detects more changed webpages and improves the quality of the user experience for those who query the search engine.",
keywords = "Clustering, Incremental crawler, Refresh policy, Sampling, Web search engine",
author = "Qingzhao Tan and Prasenjit Mitra and C. Lee Giles",
year = "2007",
doi = "10.1145/1321440.1321516",
language = "English",
isbn = "9781595938039",
pages = "535--544",
booktitle = "International Conference on Information and Knowledge Management, Proceedings",

}

TY - GEN

T1 - Designing clustering-based web crawling policies for search engine crawlers

AU - Tan, Qingzhao

AU - Mitra, Prasenjit

AU - Giles, C. Lee

PY - 2007

Y1 - 2007

AB - The World Wide Web is growing and changing at an astonishing rate. Web information systems such as search engines have to keep up with the growth and change of the Web. Due to resource constraints, search engines usually have difficulty keeping the local database completely synchronized with the Web. In this paper, we study how to make good use of the limited system resources and detect as many changes as possible. Towards this goal, a crawler for a Web search engine should be able to predict the change behavior of webpages. We propose a clustering-based sampling approach. Specifically, we first group all the local webpages into clusters such that each cluster contains webpages with similar change patterns. We then sample webpages from each cluster to estimate the change frequency of all the webpages in that cluster. Finally, we let the crawler re-visit clusters containing webpages with higher change frequency with a higher probability. To evaluate the performance of an incremental crawler for a Web search engine, we measure both the freshness and the quality of the query results provided by the search engine. We run extensive experiments on a real Web data set of about 300,000 distinct URLs distributed among 210 websites. The results demonstrate that our clustering algorithm effectively groups pages with similar change patterns, and that our solution significantly outperforms existing methods: it detects more changed webpages and improves the quality of the user experience for those who query the search engine.

KW - Clustering

KW - Incremental crawler

KW - Refresh policy

KW - Sampling

KW - Web search engine

UR - http://www.scopus.com/inward/record.url?scp=63449100809&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=63449100809&partnerID=8YFLogxK

U2 - 10.1145/1321440.1321516

DO - 10.1145/1321440.1321516

M3 - Conference contribution

AN - SCOPUS:63449100809

SN - 9781595938039

SP - 535

EP - 544

BT - International Conference on Information and Knowledge Management, Proceedings

ER -