k-means--: A unified approach to clustering and outlier detection

Sanjay Chawla, Aristides Gionis

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

71 Citations (Scopus)

Abstract

We present a unified approach for simultaneously clustering and discovering outliers in data. Our approach is formalized as a generalization of the k-means problem. We prove that the problem is NP-hard and then present a practical polynomial-time algorithm, which is guaranteed to converge to a local optimum. Furthermore, we extend our approach to all distance measures that can be expressed in the form of a Bregman divergence. Experiments on synthetic and real datasets demonstrate the effectiveness of our approach and the utility of carrying out both clustering and outlier detection in a concurrent manner. In particular, on the famous KDD Cup network-intrusion dataset, we were able to increase the precision of the outlier detection task by nearly 100% compared to the classical nearest-neighbor approach.
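
The abstract does not spell out the algorithm, so the following is only a minimal sketch of one way the idea could be realized, not the authors' implementation. It assumes two user-chosen parameters, k (number of clusters) and l (number of outliers), and alternates between setting aside the l points farthest from their nearest center and recomputing each center as the mean of its remaining members. The function name, the `divergence` and `init_centers` parameters, and the defaults are all hypothetical. Squared Euclidean distance is used as the default because it is the simplest Bregman divergence.

```python
import random


def sq_euclidean(x, c):
    """Squared Euclidean distance, the simplest Bregman divergence."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def kmeans_with_outliers(points, k, l, divergence=sq_euclidean,
                         init_centers=None, iters=100, seed=0):
    """Cluster `points` into k groups while setting aside l outliers.

    Hypothetical sketch: names and defaults are illustrative assumptions.
    Returns (centers, labels, outliers); labels[i] is None for outliers.
    """
    if init_centers is None:
        rng = random.Random(seed)
        init_centers = rng.sample(points, k)
    centers = [list(c) for c in init_centers]
    labels, outliers = None, set()
    for _ in range(iters):
        # Divergence to, and index of, the nearest center for every point.
        nearest = [min((divergence(p, c), j) for j, c in enumerate(centers))
                   for p in points]
        # The l points farthest from their nearest center become outliers.
        order = sorted(range(len(points)),
                       key=lambda i: nearest[i][0], reverse=True)
        new_outliers = set(order[:l])
        new_labels = [None if i in new_outliers else nearest[i][1]
                      for i in range(len(points))]
        if new_labels == labels:
            break  # assignments are stable, so the procedure has converged
        labels, outliers = new_labels, new_outliers
        # Recompute each center as the mean of its non-outlier members.
        for j in range(k):
            members = [points[i] for i, lab in enumerate(labels) if lab == j]
            if members:  # keep the old center if a cluster ends up empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels, sorted(outliers)
```

For example, calling `kmeans_with_outliers(points, k=2, l=1, init_centers=[(0, 0), (10, 10)])` on two tight groups plus one distant point should flag the distant point rather than letting it drag a center away. As with ordinary k-means, the result depends on the initial centers, which is consistent with the abstract's guarantee of a local rather than global optimum.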

Original language: English
Title of host publication: SIAM International Conference on Data Mining 2013, SDM 2013
Publisher: Society for Industrial and Applied Mathematics Publications
Pages: 189-197
Number of pages: 9
ISBN (Print): 9781627487245
Publication status: Published - 2013
Externally published: Yes
Event: 13th SIAM International Conference on Data Mining, SDM 2013 - Austin, United States
Duration: 2 May 2013 to 4 May 2013

Other

Other: 13th SIAM International Conference on Data Mining, SDM 2013
Country: United States
City: Austin
Period: 2/5/13 to 4/5/13

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Information Systems
  • Signal Processing
  • Software

Cite this

Chawla, S., & Gionis, A. (2013). k-means--: A unified approach to clustering and outlier detection. In SIAM International Conference on Data Mining 2013, SDM 2013 (pp. 189-197). Society for Industrial and Applied Mathematics Publications.

@inproceedings{e4bb9fc329824df18d67db6c1b671ebb,
title = "k-means--: A unified approach to clustering and outlier detection",
abstract = "We present a unified approach for simultaneously clustering and discovering outliers in data. Our approach is formalized as a generalization of the k-means problem. We prove that the problem is NP-hard and then present a practical polynomial-time algorithm, which is guaranteed to converge to a local optimum. Furthermore, we extend our approach to all distance measures that can be expressed in the form of a Bregman divergence. Experiments on synthetic and real datasets demonstrate the effectiveness of our approach and the utility of carrying out both clustering and outlier detection in a concurrent manner. In particular, on the famous KDD Cup network-intrusion dataset, we were able to increase the precision of the outlier detection task by nearly 100{\%} compared to the classical nearest-neighbor approach.",
author = "Sanjay Chawla and Aristides Gionis",
year = "2013",
language = "English",
isbn = "9781627487245",
pages = "189--197",
booktitle = "SIAM International Conference on Data Mining 2013, SDM 2013",
publisher = "Society for Industrial and Applied Mathematics Publications",
}

TY - GEN
T1 - k-means--
T2 - A unified approach to clustering and outlier detection
AU - Chawla, Sanjay
AU - Gionis, Aristides
PY - 2013
Y1 - 2013
AB - We present a unified approach for simultaneously clustering and discovering outliers in data. Our approach is formalized as a generalization of the k-means problem. We prove that the problem is NP-hard and then present a practical polynomial-time algorithm, which is guaranteed to converge to a local optimum. Furthermore, we extend our approach to all distance measures that can be expressed in the form of a Bregman divergence. Experiments on synthetic and real datasets demonstrate the effectiveness of our approach and the utility of carrying out both clustering and outlier detection in a concurrent manner. In particular, on the famous KDD Cup network-intrusion dataset, we were able to increase the precision of the outlier detection task by nearly 100% compared to the classical nearest-neighbor approach.
UR - http://www.scopus.com/inward/record.url?scp=84960498671&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84960498671&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:84960498671
SN - 9781627487245
SP - 189
EP - 197
BT - SIAM International Conference on Data Mining 2013, SDM 2013
PB - Society for Industrial and Applied Mathematics Publications
ER -