SPARCL: An effective and efficient algorithm for mining arbitrary shape-based clusters

Vineet Chaoji, Mohammad Al Hasan, Saeed Salem, Mohammed J. Zaki

Research output: Contribution to journal (Article)

21 Citations (Scopus)

Abstract

Clustering is one of the fundamental data mining tasks. Many different clustering paradigms have been developed over the years, including partitional, hierarchical, mixture-model-based, density-based, spectral, and subspace methods. The focus of this paper is on full-dimensional, arbitrarily shaped clusters. Existing methods for this problem suffer in either memory or time complexity (quadratic or even cubic), which has restricted them to datasets of moderate size. In this paper we propose SPARCL, a simple and scalable algorithm for finding clusters with arbitrary shapes and sizes, with linear space and time complexity. SPARCL consists of two stages: the first stage runs a carefully initialized version of the Kmeans algorithm to generate many small seed clusters, and the second stage iteratively merges the generated clusters to obtain the final shape-based clusters. Experiments were conducted on a variety of datasets to highlight the effectiveness, efficiency, and scalability of our approach. On large datasets, SPARCL is an order of magnitude faster than the best existing approaches.
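The two-stage idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the initialization or the merge criterion, so the sketch assumes farthest-first seeding as a stand-in for the "careful" initialization and single-link merging on seed-center distance as a stand-in for the merge step. All function names and parameters (`two_stage_cluster`, `n_seeds`, `n_clusters`) are hypothetical, and the merge loop here is not tuned for the linear-time bound the paper claims.

```python
import math
import random


def farthest_first_seeds(points, k, rng):
    """Pick k spread-out initial centers (assumed stand-in for the paper's initialization)."""
    seeds = [points[rng.randrange(len(points))]]
    # nearest[i] = distance from point i to its closest chosen seed so far
    nearest = [math.dist(p, seeds[0]) for p in points]
    while len(seeds) < k:
        i = max(range(len(points)), key=nearest.__getitem__)
        seeds.append(points[i])
        nearest = [min(d, math.dist(p, points[i])) for d, p in zip(nearest, points)]
    return seeds


def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points]


def kmeans(points, k, rng, iters=20):
    """Stage 1: plain Lloyd iterations producing k small seed clusters."""
    centers = farthest_first_seeds(points, k, rng)
    for _ in range(iters):
        labels = assign(points, centers)
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # an empty seed cluster keeps its previous center
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centers, assign(points, centers)


def merge_seeds(centers, n_clusters):
    """Stage 2: repeatedly merge the two closest groups of seed clusters.

    Closeness is the smallest center-to-center distance between two groups
    (single link), an assumption made for this sketch only.
    """
    groups = [[i] for i in range(len(centers))]
    while len(groups) > n_clusters:
        best = None
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                gap = min(math.dist(centers[i], centers[j])
                          for i in groups[a] for j in groups[b])
                if best is None or gap < best[0]:
                    best = (gap, a, b)
        _, a, b = best
        groups[a].extend(groups.pop(b))  # b > a, so index a stays valid
    return {s: g for g, members in enumerate(groups) for s in members}


def two_stage_cluster(points, n_seeds, n_clusters, seed=0):
    """Run both stages and return one final cluster label per input point."""
    rng = random.Random(seed)
    centers, seed_labels = kmeans(points, n_seeds, rng)
    seed_to_cluster = merge_seeds(centers, n_clusters)
    return [seed_to_cluster[l] for l in seed_labels]
```

For example, calling `two_stage_cluster(points, n_seeds=8, n_clusters=2)` on two long, well-separated line segments first splits each segment into a few compact seed clusters and then chains the seeds on the same segment back together, which a single Kmeans run with two centers is not guaranteed to do for elongated shapes.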

Original language: English
Pages (from-to): 201-229
Number of pages: 29
Journal: Knowledge and Information Systems
Volume: 21
Issue number: 2
DOI: 10.1007/s10115-009-0216-0
Publication status: Published - 8 Jun 2009
Externally published: Yes

Fingerprint

Data mining
Seed
Scalability
Data storage equipment
Experiments

Keywords

  • Clustering
  • Hierarchical
  • Kmeans
  • Linear time
  • Spatial

ASJC Scopus subject areas

  • Artificial Intelligence
  • Software
  • Information Systems
  • Hardware and Architecture
  • Human-Computer Interaction

Cite this

SPARCL: An effective and efficient algorithm for mining arbitrary shape-based clusters. / Chaoji, Vineet; Al Hasan, Mohammad; Salem, Saeed; Zaki, Mohammed J.

In: Knowledge and Information Systems, Vol. 21, No. 2, 08.06.2009, p. 201-229.

Chaoji, Vineet; Al Hasan, Mohammad; Salem, Saeed; Zaki, Mohammed J. / SPARCL: An effective and efficient algorithm for mining arbitrary shape-based clusters. In: Knowledge and Information Systems. 2009; Vol. 21, No. 2. pp. 201-229.

@article{fd687063657a4c438d065ecf1004e586,
title = "SPARCL: An effective and efficient algorithm for mining arbitrary shape-based clusters",
abstract = "Clustering is one of the fundamental data mining tasks. Many different clustering paradigms have been developed over the years, which include partitional, hierarchical, mixture model based, density-based, spectral, subspace, and so on. The focus of this paper is on full-dimensional, arbitrary shaped clusters. Existing methods for this problem suffer either in terms of the memory or time complexity (quadratic or even cubic). This shortcoming has restricted these algorithms to datasets of moderate sizes. In this paper we propose SPARCL, a simple and scalable algorithm for finding clusters with arbitrary shapes and sizes, and it has linear space and time complexity. SPARCL consists of two stages-the first stage runs a carefully initialized version of the Kmeans algorithm to generate many small seed clusters. The second stage iteratively merges the generated clusters to obtain the final shape-based clusters. Experiments were conducted on a variety of datasets to highlight the effectiveness, efficiency, and scalability of our approach. On the large datasets SPARCL is an order of magnitude faster than the best existing approaches.",
keywords = "Clustering, Hierarchical, Kmeans, Linear time, Spatial",
author = "Chaoji, Vineet and {Al Hasan}, Mohammad and Salem, Saeed and Zaki, {Mohammed J.}",
year = "2009",
month = "6",
day = "8",
doi = "10.1007/s10115-009-0216-0",
language = "English",
volume = "21",
pages = "201--229",
journal = "Knowledge and Information Systems",
issn = "0219-1377",
publisher = "Springer London",
number = "2",

}

TY - JOUR

T1 - SPARCL

T2 - An effective and efficient algorithm for mining arbitrary shape-based clusters

AU - Chaoji, Vineet

AU - Al Hasan, Mohammad

AU - Salem, Saeed

AU - Zaki, Mohammed J.

PY - 2009/6/8

Y1 - 2009/6/8

N2 - Clustering is one of the fundamental data mining tasks. Many different clustering paradigms have been developed over the years, which include partitional, hierarchical, mixture model based, density-based, spectral, subspace, and so on. The focus of this paper is on full-dimensional, arbitrary shaped clusters. Existing methods for this problem suffer either in terms of the memory or time complexity (quadratic or even cubic). This shortcoming has restricted these algorithms to datasets of moderate sizes. In this paper we propose SPARCL, a simple and scalable algorithm for finding clusters with arbitrary shapes and sizes, and it has linear space and time complexity. SPARCL consists of two stages-the first stage runs a carefully initialized version of the Kmeans algorithm to generate many small seed clusters. The second stage iteratively merges the generated clusters to obtain the final shape-based clusters. Experiments were conducted on a variety of datasets to highlight the effectiveness, efficiency, and scalability of our approach. On the large datasets SPARCL is an order of magnitude faster than the best existing approaches.

KW - Clustering

KW - Hierarchical

KW - Kmeans

KW - Linear time

KW - Spatial

UR - http://www.scopus.com/inward/record.url?scp=70350536514&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=70350536514&partnerID=8YFLogxK

U2 - 10.1007/s10115-009-0216-0

DO - 10.1007/s10115-009-0216-0

M3 - Article

AN - SCOPUS:70350536514

VL - 21

SP - 201

EP - 229

JO - Knowledge and Information Systems

JF - Knowledge and Information Systems

SN - 0219-1377

IS - 2

ER -