### Abstract

We consider the problem of partitioning, accurately and efficiently, a set of n documents lying in a metric space into k non-overlapping clusters. We augment the well-known furthest-point-first algorithm for k-center clustering in metric spaces with a filtering scheme based on the triangle inequality. We apply this algorithm to Web snippet clustering, comparing it against strong baselines consisting of recent, fast variants of the classical iterative k-means algorithm. Our main conclusion is that our method attains solutions of better or comparable accuracy, and does so in a fraction of the time required by the baselines. Our algorithm is thus valuable when, as in Web snippet clustering, either the real-time nature of the task or the large amount of data makes traditional, poorly scalable clustering methods unsuitable.
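The idea of combining furthest-point-first with a triangle-inequality filter can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `fpf_with_filter`, the Euclidean metric, the choice of the first point as the initial center, and the per-point form of the filter are all assumptions made for illustration. The filter relies on the following fact: if a point p lies at distance d from its current center c, and a new center c' satisfies dist(c, c') ≥ 2d, then dist(p, c') ≥ dist(c, c') − d ≥ d, so p cannot move to c' and the distance dist(p, c') need not be computed.

```python
import math
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]


def euclidean(a: Point, b: Point) -> float:
    """Euclidean distance; any metric obeying the triangle inequality works."""
    return math.dist(a, b)


def fpf_with_filter(
    points: Sequence[Point],
    k: int,
    dist: Callable[[Point, Point], float] = euclidean,
) -> Tuple[List[int], List[int], int]:
    """Hypothetical sketch of furthest-point-first k-center clustering.

    Returns (center indices, cluster id per point, number of distance calls).
    """
    n = len(points)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= len(points)")

    centers = [0]                 # assumption: start from the first point
    assign = [0] * n              # cluster id (position in `centers`) per point
    d_near = [dist(p, points[0]) for p in points]  # distance to own center
    calls = n

    while len(centers) < k:
        # Next center: the point furthest from its current center.
        new = max(range(n), key=d_near.__getitem__)
        # Distances from the new center to every existing center.
        cc = [dist(points[new], points[c]) for c in centers]
        calls += len(centers)
        new_id = len(centers)
        centers.append(new)

        for i in range(n):
            if i == new:
                assign[i], d_near[i] = new_id, 0.0
                continue
            # Triangle-inequality filter: the new center cannot be closer
            # than the current one, so skip the distance computation.
            if cc[assign[i]] >= 2.0 * d_near[i]:
                continue
            d = dist(points[i], points[new])
            calls += 1
            if d < d_near[i]:
                assign[i], d_near[i] = new_id, d

    return centers, assign, calls
```

A small usage example, with made-up two-dimensional points standing in for document vectors:

```python
pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0), (5.0, 8.0)]
centers, assign, calls = fpf_with_filter(pts, 3)
print(centers, assign, calls)
```

The sketch applies the filter point by point, so its savings come only from skipped distance computations; how the filter is organized in the published algorithm is not described in this record.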

Original language | English |
---|---|
Title of host publication | Proceedings of the ACM Symposium on Applied Computing |
Pages | 1058-1062 |
Number of pages | 5 |
Volume | 2 |
Publication status | Published - 2006 |
Externally published | Yes |
Event | 2006 ACM Symposium on Applied Computing, Dijon (23 Apr 2006 → 27 Apr 2006) |

### Other

Other | 2006 ACM Symposium on Applied Computing |
---|---|
City | Dijon |
Period | 23/4/06 → 27/4/06 |

### Keywords

- Clustering
- Meta search engines
- Metric spaces
- Web snippets

### ASJC Scopus subject areas

- Computer Science(all)


