A scalable algorithm for high-quality clustering of Web snippets

Filippo Geraci, Marco Pellegrini, Paolo Pisati, Fabrizio Sebastiani

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

19 Citations (Scopus)

Abstract

We consider the problem of partitioning, in a highly accurate and highly efficient way, a set of n documents lying in a metric space into k non-overlapping clusters. We augment the well-known furthest-point-first algorithm for k-center clustering in metric spaces with a filtering scheme based on the triangular inequality. We apply this algorithm to Web snippet clustering, comparing it against strong baselines consisting of recent, fast variants of the classical k-means iterative algorithm. Our main conclusion is that our method attains solutions of better or comparable accuracy, and does this within a fraction of the time required by the baselines. Our algorithm is thus valuable when, as in Web snippet clustering, either the real-time nature of the task or the large amount of data make the poorly scalable, traditional clustering methods unsuitable.
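The abstract names the two ingredients (furthest-point-first selection of k centers, plus a triangle-inequality filter) but does not spell out the filter itself. The following is a minimal sketch under stated assumptions, not the authors' implementation: it assumes Euclidean distance, picks the first point as the initial center, and uses one common filtering rule, namely that a point p at distance d(p, c) from its current center c cannot be closer to a new center c' whenever d(c, c') >= 2 d(p, c). The names `fpf_with_filter` and `euclidean` are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]


def euclidean(a: Point, b: Point) -> float:
    """Euclidean distance; any metric could be substituted."""
    return math.dist(a, b)


def fpf_with_filter(
    points: Sequence[Point],
    k: int,
    dist: Callable[[Point, Point], float] = euclidean,
) -> Tuple[List[int], List[int], int]:
    """Furthest-point-first k-center sketch with a triangle-inequality filter.

    Returns (center indices, center index assigned to each point,
    number of distance evaluations performed).
    """
    n = len(points)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= len(points)")

    centers = [0]                 # assumption: the first point seeds the centers
    assign = [0] * n              # assign[i] = index of the center closest to point i
    d_to_center = [dist(p, points[0]) for p in points]
    evals = n

    while len(centers) < k:
        # The point furthest from its current center becomes the next center.
        new = max(range(n), key=d_to_center.__getitem__)

        # Distances from the new center to every existing center.
        d_cc = {c: dist(points[new], points[c]) for c in centers}
        evals += len(centers)

        centers.append(new)
        assign[new], d_to_center[new] = new, 0.0

        for i in range(n):
            if i == new:
                continue
            c = assign[i]
            # Filter: d(i, new) >= d(c, new) - d(i, c) by the triangle inequality,
            # so if d(c, new) >= 2 * d(i, c) the new center cannot be closer.
            if d_cc[c] >= 2 * d_to_center[i]:
                continue
            d = dist(points[i], points[new])
            evals += 1
            if d < d_to_center[i]:
                d_to_center[i] = d
                assign[i] = new

    return centers, assign, evals
```

On two well-separated groups of three points each with k = 2, the filter skips the distance computation for every point in the group that already contains the first center, which illustrates where the time savings reported in the abstract could come from.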

Original language: English
Title of host publication: Proceedings of the ACM Symposium on Applied Computing
Pages: 1058-1062
Number of pages: 5
Volume: 2
Publication status: Published - 2006
Externally published: Yes
Event: 2006 ACM Symposium on Applied Computing - Dijon
Duration: 23 Apr 2006 to 27 Apr 2006

Other

Other: 2006 ACM Symposium on Applied Computing
City: Dijon
Period: 23/4/06 to 27/4/06

Keywords

  • Clustering
  • Meta search engines
  • Metric spaces
  • Web snippets

ASJC Scopus subject areas

  • Computer Science(all)

Cite this

Geraci, F., Pellegrini, M., Pisati, P., & Sebastiani, F. (2006). A scalable algorithm for high-quality clustering of Web snippets. In Proceedings of the ACM Symposium on Applied Computing (Vol. 2, pp. 1058-1062).

@inproceedings{2a80b2af594b43c58d424cfa62622929,
title = "A scalable algorithm for high-quality clustering of Web snippets",
keywords = "Clustering, Meta search engines, Metric spaces, Web snippets",
author = "Filippo Geraci and Marco Pellegrini and Paolo Pisati and Fabrizio Sebastiani",
year = "2006",
language = "English",
isbn = "1595931082",
volume = "2",
pages = "1058--1062",
booktitle = "Proceedings of the ACM Symposium on Applied Computing",

}

TY - GEN

T1 - A scalable algorithm for high-quality clustering of Web snippets

AU - Geraci, Filippo

AU - Pellegrini, Marco

AU - Pisati, Paolo

AU - Sebastiani, Fabrizio

PY - 2006

Y1 - 2006

KW - Clustering

KW - Meta search engines

KW - Metric spaces

KW - Web snippets

UR - http://www.scopus.com/inward/record.url?scp=33750377487&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=33750377487&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:33750377487

SN - 1595931082

SN - 9781595931085

VL - 2

SP - 1058

EP - 1062

BT - Proceedings of the ACM Symposium on Applied Computing

ER -