Designing clustering-based web crawling policies for search engine crawlers

Qingzhao Tan, Prasenjit Mitra, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

8 Citations (Scopus)

Abstract

The World Wide Web is growing and changing at an astonishing rate. Web information systems such as search engines have to keep up with the growth and change of the Web. Due to resource constraints, search engines usually have difficulty keeping the local database completely synchronized with the Web. In this paper, we study how to make good use of limited system resources and detect as many changes as possible. Towards this goal, a crawler for the Web search engine should be able to predict the change behavior of the webpages. We propose a clustering-based sampling approach. Specifically, we first group all the local webpages into different clusters such that each cluster contains webpages with similar change patterns. We then sample webpages from each cluster to estimate the change frequency of all the webpages in that cluster. Finally, we let the crawler re-visit the clusters containing webpages with higher change frequency with a higher probability. To evaluate the performance of an incremental crawler for a Web search engine, we measure both the freshness and the quality of the query results provided by the search engine. We run extensive experiments on a real Web data set of about 300,000 distinct URLs distributed among 210 websites. The results demonstrate that our clustering algorithm effectively clusters the pages with similar change patterns, and our solution significantly outperforms the existing methods in that it can detect more changed webpages and improve the quality of the user experience for those who query the search engine.
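The sample-then-revisit steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes the clusters have already been formed and are given as a dictionary of URL lists, that a caller-supplied `has_changed` function reports whether a page differs from the local copy, and that revisit probability is simply proportional to the estimated change rate. The function names, the `floor` parameter, and the example URLs are all hypothetical.

```python
import random


def estimate_cluster_change_rates(clusters, has_changed, sample_size, rng):
    """Estimate each cluster's change frequency from a random sample of its pages.

    clusters:    dict mapping a cluster id to a list of URLs (assumed precomputed)
    has_changed: hypothetical callable, url -> bool
    """
    rates = {}
    for cluster_id, pages in clusters.items():
        if not pages:
            continue  # nothing to sample from an empty cluster
        sample = rng.sample(pages, min(sample_size, len(pages)))
        changed = sum(1 for url in sample if has_changed(url))
        rates[cluster_id] = changed / len(sample)
    return rates


def pick_clusters_to_revisit(rates, budget, rng, floor=0.01):
    """Draw `budget` cluster ids with probability proportional to estimated rate.

    `floor` is an assumption of this sketch: it keeps a small chance of
    revisiting clusters whose sample showed no change at all.
    """
    ids = list(rates)
    weights = [max(rates[c], floor) for c in ids]
    return rng.choices(ids, weights=weights, k=budget)


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy data: two hand-made clusters standing in for the clustering step.
    clusters = {
        "news": [f"https://example.org/news/{i}" for i in range(50)],
        "archive": [f"https://example.org/archive/{i}" for i in range(50)],
    }

    def has_changed(url):
        # Stand-in change detector: pretend only news pages change.
        return "/news/" in url

    rates = estimate_cluster_change_rates(clusters, has_changed, 10, rng)
    visits = pick_clusters_to_revisit(rates, 100, rng)
    print(rates)
    print({c: visits.count(c) for c in clusters})
```

In this sketch the crawl budget is spent at the cluster level; choosing which pages to fetch inside a selected cluster, and how the clusters are built in the first place, are left out.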

Original language: English
Title of host publication: International Conference on Information and Knowledge Management, Proceedings
Pages: 535-544
Number of pages: 10
DOIs: https://doi.org/10.1145/1321440.1321516
Publication status: Published - 2007
Externally published: Yes
Event: 16th ACM Conference on Information and Knowledge Management, CIKM 2007 - Lisboa
Duration: 6 Nov 2007 - 9 Nov 2007

Other

Other: 16th ACM Conference on Information and Knowledge Management, CIKM 2007
City: Lisboa
Period: 6/11/07 - 9/11/07

Keywords

  • Clustering
  • Incremental crawler
  • Refresh policy
  • Sampling
  • Web search engine

ASJC Scopus subject areas

  • Business, Management and Accounting (all)
  • Decision Sciences (all)

Cite this

Tan, Q., Mitra, P., & Giles, C. L. (2007). Designing clustering-based web crawling policies for search engine crawlers. In International Conference on Information and Knowledge Management, Proceedings (pp. 535-544). https://doi.org/10.1145/1321440.1321516