SimClus

An effective algorithm for clustering with a lower bound on similarity

Mohammad Al Hasan, Saeed Salem, Mohammed J. Zaki

Research output: Contribution to journal › Article

16 Citations (Scopus)

Abstract

Clustering algorithms generally accept a parameter k from the user, which determines the number of clusters sought. However, in many application domains, such as document categorization, social network clustering, and frequent pattern summarization, the proper value of k is difficult to guess. An alternative clustering formulation that does not require k is to impose a lower bound on the similarity between an object and its corresponding cluster representative. Such a formulation chooses exactly one representative for every cluster and minimizes the representative count. It has many additional benefits. For instance, it supports overlapping clusters in a natural way. Moreover, for every cluster, it selects a representative object, which can be effectively used in summarization or semi-supervised classification tasks. In this work, we propose an algorithm, SimClus, for clustering with a lower bound on similarity. It achieves an O(log n) approximation bound on the number of clusters, whereas for the best previous algorithm the bound can be as poor as O(n). Experiments on real and synthetic data sets show that our algorithm produces more than 40% fewer representative objects, yet offers the same or better clustering quality. We also propose a dynamic variant of the algorithm, which can be effectively used in an on-line setting.
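
The formulation in the abstract can be read as a set cover (dominating set) problem: each object "covers" every object whose similarity to it meets the lower bound, and the goal is to cover all objects with as few representatives as possible. The sketch below is not the SimClus algorithm from the paper; it is the textbook greedy set cover heuristic, which is known to give a logarithmic approximation, applied to that reading. The function name `greedy_representatives`, the parameter `beta` for the similarity lower bound, and the toy similarity function are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Sequence, TypeVar

T = TypeVar("T")


def greedy_representatives(
    objects: Sequence[T],
    sim: Callable[[T, T], float],
    beta: float,
) -> Dict[int, List[int]]:
    """Greedy set cover over similarity neighbourhoods (illustrative only).

    Returns a mapping from representative index to the sorted indices of all
    objects whose similarity to that representative is at least `beta`.
    Clusters may overlap, because an object is listed under every chosen
    representative that covers it.
    """
    n = len(objects)
    # neighbourhood[i]: every index j that object i can represent (i included).
    neighbourhood = [
        {j for j in range(n) if j == i or sim(objects[i], objects[j]) >= beta}
        for i in range(n)
    ]

    uncovered = set(range(n))
    clusters: Dict[int, List[int]] = {}
    while uncovered:
        # Pick the object covering the most still-uncovered objects;
        # ties go to the lowest index so the result is deterministic.
        best = max(
            range(n),
            key=lambda i: (len(neighbourhood[i] & uncovered), -i),
        )
        clusters[best] = sorted(neighbourhood[best])
        uncovered -= neighbourhood[best]
    return clusters


def toy_sim(a: float, b: float) -> float:
    """Hypothetical similarity on numbers: 1 at distance 0, decaying with distance."""
    return 1.0 / (1.0 + abs(a - b))


if __name__ == "__main__":
    points = [0.0, 1.0, 2.0, 10.0, 11.0]
    # With beta = 0.5, two points are similar enough when they differ by at most 1.
    print(greedy_representatives(points, toy_sim, beta=0.5))
```

No value of k is supplied anywhere: the number of clusters falls out of the threshold `beta`. Building the neighbourhoods compares every pair of objects, so this sketch is only practical for small inputs.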

Original language: English
Pages (from-to): 665-685
Number of pages: 21
Journal: Knowledge and Information Systems
Volume: 28
Issue number: 3
DOI: 10.1007/s10115-010-0360-6
Publication status: Published - 1 Sep 2011
Externally published: Yes

Fingerprint

Clustering algorithms
Experiments

Keywords

  • Dominating set
  • Overlapping clustering
  • Set cover
  • Star clustering

ASJC Scopus subject areas

  • Artificial Intelligence
  • Software
  • Information Systems
  • Hardware and Architecture
  • Human-Computer Interaction

Cite this

SimClus: An effective algorithm for clustering with a lower bound on similarity. / Hasan, Mohammad Al; Salem, Saeed; Zaki, Mohammed J.

In: Knowledge and Information Systems, Vol. 28, No. 3, 01.09.2011, p. 665-685.

@article{2a80360dedb5409981d2d93cf4c9e4d1,
title = "SimClus: An effective algorithm for clustering with a lower bound on similarity",
keywords = "Dominating set, Overlapping clustering, Set cover, Star clustering",
author = "Hasan, {Mohammad Al} and Saeed Salem and Zaki, {Mohammed J.}",
year = "2011",
month = "9",
day = "1",
doi = "10.1007/s10115-010-0360-6",
language = "English",
volume = "28",
pages = "665--685",
journal = "Knowledge and Information Systems",
issn = "0219-1377",
publisher = "Springer London",
number = "3",

}

TY - JOUR

T1 - SimClus

T2 - An effective algorithm for clustering with a lower bound on similarity

AU - Hasan, Mohammad Al

AU - Salem, Saeed

AU - Zaki, Mohammed J.

PY - 2011/9/1

Y1 - 2011/9/1

KW - Dominating set

KW - Overlapping clustering

KW - Set cover

KW - Star clustering

UR - http://www.scopus.com/inward/record.url?scp=80052022735&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=80052022735&partnerID=8YFLogxK

U2 - 10.1007/s10115-010-0360-6

DO - 10.1007/s10115-010-0360-6

M3 - Article

VL - 28

SP - 665

EP - 685

JO - Knowledge and Information Systems

JF - Knowledge and Information Systems

SN - 0219-1377

IS - 3

ER -