### Abstract

We consider the problem of partitioning, in a highly accurate and highly efficient way, a set of n documents lying in a metric space into k non-overlapping clusters. We augment the well-known furthest-point-first algorithm for k-center clustering in metric spaces with a filtering scheme based on the triangular inequality. We apply this algorithm to Web snippet clustering, comparing it against strong baselines consisting of recent, fast variants of the classical k-means iterative algorithm. Our main conclusion is that our method attains solutions of better or comparable accuracy, and does this within a fraction of the time required by the baselines. Our algorithm is thus valuable when, as in Web snippet clustering, either the real-time nature of the task or the large amount of data make the poorly scalable, traditional clustering methods unsuitable.
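The abstract does not spell out the filtering scheme, so the following is only a minimal sketch of one common way to combine furthest-point-first (FPF) k-center clustering with a triangle-inequality filter. It is not the authors' implementation. The function name `fpf_with_filter`, the choice of index 0 as the first center, and the Euclidean metric are illustrative assumptions. The filter rests on this observation: if a point `p` is currently assigned to center `c`, and the new center `c_new` satisfies `d(c, c_new) >= 2 * d(p, c)`, then `d(p, c_new) >= d(c, c_new) - d(p, c) >= d(p, c)`, so `p` cannot move and its distance to `c_new` need not be computed.

```python
from math import dist  # Euclidean distance between two sequences (Python 3.8+)


def fpf_with_filter(points, k, metric=dist):
    """Furthest-point-first k-center with a triangle-inequality filter.

    Returns (centers, assign, evals): the indices of the chosen centers,
    the position in `centers` that each point is assigned to, and the
    number of distance evaluations performed.
    """
    n = len(points)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= len(points)")

    evals = 0

    def d(i, j):
        nonlocal evals
        evals += 1
        return metric(points[i], points[j])

    centers = [0]                        # assumption: arbitrary first center
    assign = [0] * n                     # every point starts in cluster 0
    dmin = [d(i, 0) for i in range(n)]   # distance to the closest center so far

    while len(centers) < k:
        # FPF step: the next center is the point furthest from its center.
        new = max(range(n), key=dmin.__getitem__)
        # Distances from each existing center to the new one.
        center_to_new = [d(c, new) for c in centers]
        centers.append(new)
        new_pos = len(centers) - 1

        for i in range(n):
            # Filter: the new center cannot be closer than the current one.
            if center_to_new[assign[i]] >= 2 * dmin[i]:
                continue
            d_new = d(i, new)
            if d_new < dmin[i]:
                dmin[i] = d_new
                assign[i] = new_pos

    return centers, assign, evals


# Usage on three well-separated toy groups (made-up data, not Web snippets).
points = [(0, 0), (0, 1), (1, 0),
          (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]
centers, assign, evals = fpf_with_filter(points, k=3)
```

An unfiltered FPF run computes one distance per point per center, that is `len(points) * k` evaluations. The filter skips points that provably stay with their current center, at the cost of a few extra center-to-center distances.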

Original language | English |
---|---|
Title of host publication | Proceedings of the ACM Symposium on Applied Computing |
Pages | 1058-1062 |
Number of pages | 5 |
Volume | 2 |
Publication status | Published - 2006 |
Externally published | Yes |
Event | 2006 ACM Symposium on Applied Computing - Dijon. Duration: 23 Apr 2006 → 27 Apr 2006 |

### Other

Other | 2006 ACM Symposium on Applied Computing |
---|---|
City | Dijon |
Period | 23/4/06 → 27/4/06 |

### Keywords

- Clustering
- Meta search engines
- Metric spaces
- Web snippets

### ASJC Scopus subject areas

- Computer Science (all)

### Cite this

**A scalable algorithm for high-quality clustering of Web snippets.** / Geraci, Filippo; Pellegrini, Marco; Pisati, Paolo; Sebastiani, Fabrizio.

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

*Proceedings of the ACM Symposium on Applied Computing.* vol. 2, pp. 1058-1062, 2006 ACM Symposium on Applied Computing, Dijon, 23/4/06.

TY - GEN

T1 - A scalable algorithm for high-quality clustering of Web snippets

AU - Geraci, Filippo

AU - Pellegrini, Marco

AU - Pisati, Paolo

AU - Sebastiani, Fabrizio

PY - 2006

Y1 - 2006

AB - We consider the problem of partitioning, in a highly accurate and highly efficient way, a set of n documents lying in a metric space into k non-overlapping clusters. We augment the well-known furthest-point-first algorithm for k-center clustering in metric spaces with a filtering scheme based on the triangular inequality. We apply this algorithm to Web snippet clustering, comparing it against strong baselines consisting of recent, fast variants of the classical k-means iterative algorithm. Our main conclusion is that our method attains solutions of better or comparable accuracy, and does this within a fraction of the time required by the baselines. Our algorithm is thus valuable when, as in Web snippet clustering, either the real-time nature of the task or the large amount of data make the poorly scalable, traditional clustering methods unsuitable.

KW - Clustering

KW - Meta search engines

KW - Metric spaces

KW - Web snippets

UR - http://www.scopus.com/inward/record.url?scp=33750377487&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=33750377487&partnerID=8YFLogxK

M3 - Conference contribution

SN - 1595931082

SN - 9781595931085

VL - 2

SP - 1058

EP - 1062

BT - Proceedings of the ACM Symposium on Applied Computing

ER -