Designing efficient sampling techniques to detect webpage updates

Qingzhao Tan, Ziming Zhuang, Prasenjit Mitra, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

4 Citations (Scopus)

Abstract

Due to resource constraints, Web archiving systems and search engines usually have difficulty keeping their entire local repository synchronized with the Web. We advance the state of the art in sampling-based synchronization techniques by answering a challenging question: given a sampled webpage and its change status, which other webpages are also likely to change? We present a study of various downloading granularities and policies, and propose an adaptive model based on the update history and popularity of the webpages. Extensive experiments on a large dataset of approximately 300,000 webpages demonstrate that updated webpages are most likely to be found in the current or upper directories of the changed samples. Moreover, the adaptive strategies outperform the non-adaptive one in detecting important changes.
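The directory-level finding in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes one random sample per directory, treats "upper directory" as the immediate parent only, and leaves out the adaptive model based on update history and popularity. The names `directory_of`, `parent_of`, `pick_pages_to_refresh`, and the `has_changed` callback are hypothetical and exist only for this illustration.

```python
import posixpath
import random
from collections import defaultdict
from urllib.parse import urlsplit


def directory_of(url):
    """Return (host, directory path) for a URL, e.g. ('a.org', '/news')."""
    parts = urlsplit(url)
    return parts.netloc, posixpath.dirname(parts.path) or "/"


def parent_of(directory):
    """Return the parent of a (host, path) directory; the root is its own parent."""
    host, path = directory
    return host, posixpath.dirname(path.rstrip("/")) or "/"


def pick_pages_to_refresh(urls, has_changed, rng=random):
    """Sample one page per directory (an assumption of this sketch).

    Where the sampled page has changed, select every known page in the same
    directory and in its parent directory for re-download.
    """
    by_dir = defaultdict(list)
    for url in urls:
        by_dir[directory_of(url)].append(url)

    selected = set()
    for directory, pages in by_dir.items():
        sample = rng.choice(pages)
        if has_changed(sample):
            selected.update(pages)
            # .get avoids creating new keys while iterating over by_dir
            selected.update(by_dir.get(parent_of(directory), []))
    return selected
```

Here `has_changed` stands in for whatever change test a crawler uses, for example comparing a stored checksum with a fresh download. A fuller version would also weight directories by past update frequency and page popularity, as the abstract's adaptive model suggests.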

Original language: English
Title of host publication: 16th International World Wide Web Conference, WWW2007
Pages: 1147-1148
Number of pages: 2
DOIs: https://doi.org/10.1145/1242572.1242738
Publication status: Published - 2007
Externally published: Yes
Event: 16th International World Wide Web Conference, WWW2007 - Banff, AB
Duration: 8 May 2007 to 12 May 2007

Other

Other: 16th International World Wide Web Conference, WWW2007
City: Banff, AB
Period: 8/5/07 to 12/5/07

Fingerprint

Search engines
World Wide Web
Synchronization
Sampling
Experiments

Keywords

  • Sampling
  • Search engine
  • Web crawler

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Software

Cite this

Tan, Q., Zhuang, Z., Mitra, P., & Giles, C. L. (2007). Designing efficient sampling techniques to detect webpage updates. In 16th International World Wide Web Conference, WWW2007 (pp. 1147-1148) https://doi.org/10.1145/1242572.1242738

@inproceedings{c105034362b349b69a96bb4365afe5c7,
title = "Designing efficient sampling techniques to detect webpage updates",
abstract = "Due to resource constraints, Web archiving systems and search engines usually have difficulty keeping their entire local repository synchronized with the Web. We advance the state of the art in sampling-based synchronization techniques by answering a challenging question: given a sampled webpage and its change status, which other webpages are also likely to change? We present a study of various downloading granularities and policies, and propose an adaptive model based on the update history and popularity of the webpages. Extensive experiments on a large dataset of approximately 300,000 webpages demonstrate that updated webpages are most likely to be found in the current or upper directories of the changed samples. Moreover, the adaptive strategies outperform the non-adaptive one in detecting important changes.",
keywords = "Sampling, Search engine, Web crawler",
author = "Qingzhao Tan and Ziming Zhuang and Prasenjit Mitra and Giles, {C. Lee}",
year = "2007",
doi = "10.1145/1242572.1242738",
language = "English",
isbn = "1595936548",
pages = "1147--1148",
booktitle = "16th International World Wide Web Conference, WWW2007",

}

TY - GEN

T1 - Designing efficient sampling techniques to detect webpage updates

AU - Tan, Qingzhao

AU - Zhuang, Ziming

AU - Mitra, Prasenjit

AU - Giles, C. Lee

PY - 2007

Y1 - 2007

AB - Due to resource constraints, Web archiving systems and search engines usually have difficulty keeping their entire local repository synchronized with the Web. We advance the state of the art in sampling-based synchronization techniques by answering a challenging question: given a sampled webpage and its change status, which other webpages are also likely to change? We present a study of various downloading granularities and policies, and propose an adaptive model based on the update history and popularity of the webpages. Extensive experiments on a large dataset of approximately 300,000 webpages demonstrate that updated webpages are most likely to be found in the current or upper directories of the changed samples. Moreover, the adaptive strategies outperform the non-adaptive one in detecting important changes.

KW - Sampling

KW - Search engine

KW - Web crawler

UR - http://www.scopus.com/inward/record.url?scp=35348878593&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=35348878593&partnerID=8YFLogxK

U2 - 10.1145/1242572.1242738

DO - 10.1145/1242572.1242738

M3 - Conference contribution

SN - 1595936548

SN - 9781595936547

SP - 1147

EP - 1148

BT - 16th International World Wide Web Conference, WWW2007

ER -