Distance-based outlier detection: Consolidation and renewed bearing

Gustavo H. Orair, Carlos H C Teixeira, Wagner Meira, Ye Wang, Srinivasan Parthasarathy

Research output: Chapter in Book/Report/Conference proceeding > Chapter

64 Citations (Scopus)

Abstract

Detecting outliers in data is an important problem with interesting applications in a myriad of domains ranging from data cleaning to financial fraud detection and from network intrusion detection to clinical diagnosis of diseases. Over the last decade of research, distance-based outlier detection algorithms have emerged as a viable, scalable, parameter-free alternative to the more traditional statistical approaches. In this paper we assess several distance-based outlier detection approaches and evaluate them. We begin by surveying and examining the design landscape of extant approaches, while identifying key design decisions of such approaches. We then implement an outlier detection framework and conduct a factorial design experiment to understand the pros and cons of various optimizations proposed by us as well as those proposed in the literature, both independently and in conjunction with one another, on a diverse set of real-life datasets. To the best of our knowledge this is the first such study in the literature. The outcome of this study is a family of state of the art distance-based outlier detection algorithms. Our detailed empirical study supports the following observations. The combination of optimization strategies enables significant efficiency gains. Our factorial design study highlights the important fact that no single optimization or combination of optimizations (factors) always dominates on all types of data. Our study also allows us to characterize when a certain combination of optimizations is likely to prevail and helps provide interesting and useful insights for moving forward in this domain.

Original language: English
Title of host publication: Proceedings of the VLDB Endowment
Pages: 1469-1480
Number of pages: 12
Volume: 3
Edition: 2
Publication status: Published - Sep 2010
Externally published: Yes

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • Computer Science (all)

Cite this

Orair, G. H., Teixeira, C. H. C., Meira, W., Wang, Y., & Parthasarathy, S. (2010). Distance-based outlier detection: Consolidation and renewed bearing. In Proceedings of the VLDB Endowment (2 ed., Vol. 3, pp. 1469-1480).

Distance-based outlier detection: Consolidation and renewed bearing. / Orair, Gustavo H.; Teixeira, Carlos H C; Meira, Wagner; Wang, Ye; Parthasarathy, Srinivasan.

Proceedings of the VLDB Endowment. Vol. 3 2. ed. 2010. p. 1469-1480.

Orair, GH, Teixeira, CHC, Meira, W, Wang, Y & Parthasarathy, S 2010, Distance-based outlier detection: Consolidation and renewed bearing. in Proceedings of the VLDB Endowment. 2 edn, vol. 3, pp. 1469-1480.

Orair GH, Teixeira CHC, Meira W, Wang Y, Parthasarathy S. Distance-based outlier detection: Consolidation and renewed bearing. In Proceedings of the VLDB Endowment. 2 ed. Vol. 3. 2010. p. 1469-1480.

Orair, Gustavo H.; Teixeira, Carlos H C; Meira, Wagner; Wang, Ye; Parthasarathy, Srinivasan. / Distance-based outlier detection: Consolidation and renewed bearing. Proceedings of the VLDB Endowment. Vol. 3 2. ed. 2010. pp. 1469-1480.

@inbook{3d71603bc58d418b984e6271676cc1d0,
title = "Distance-based outlier detection: Consolidation and renewed bearing",
abstract = "Detecting outliers in data is an important problem with interesting applications in a myriad of domains ranging from data cleaning to financial fraud detection and from network intrusion detection to clinical diagnosis of diseases. Over the last decade of research, distance-based outlier detection algorithms have emerged as a viable, scalable, parameter-free alternative to the more traditional statistical approaches. In this paper we assess several distance-based outlier detection approaches and evaluate them. We begin by surveying and examining the design landscape of extant approaches, while identifying key design decisions of such approaches. We then implement an outlier detection framework and conduct a factorial design experiment to understand the pros and cons of various optimizations proposed by us as well as those proposed in the literature, both independently and in conjunction with one another, on a diverse set of real-life datasets. To the best of our knowledge this is the first such study in the literature. The outcome of this study is a family of state of the art distance-based outlier detection algorithms. Our detailed empirical study supports the following observations. The combination of optimization strategies enables significant efficiency gains. Our factorial design study highlights the important fact that no single optimization or combination of optimizations (factors) always dominates on all types of data. Our study also allows us to characterize when a certain combination of optimizations is likely to prevail and helps provide interesting and useful insights for moving forward in this domain.",
author = "Orair, {Gustavo H.} and Teixeira, {Carlos H C} and Meira, Wagner and Wang, Ye and Parthasarathy, Srinivasan",
year = "2010",
month = "9",
language = "English",
volume = "3",
pages = "1469--1480",
booktitle = "Proceedings of the VLDB Endowment",
edition = "2",
}

TY - CHAP
T1 - Distance-based outlier detection
T2 - Consolidation and renewed bearing
AU - Orair, Gustavo H.
AU - Teixeira, Carlos H C
AU - Meira, Wagner
AU - Wang, Ye
AU - Parthasarathy, Srinivasan
PY - 2010/9
Y1 - 2010/9

N2 - Detecting outliers in data is an important problem with interesting applications in a myriad of domains ranging from data cleaning to financial fraud detection and from network intrusion detection to clinical diagnosis of diseases. Over the last decade of research, distance-based outlier detection algorithms have emerged as a viable, scalable, parameter-free alternative to the more traditional statistical approaches. In this paper we assess several distance-based outlier detection approaches and evaluate them. We begin by surveying and examining the design landscape of extant approaches, while identifying key design decisions of such approaches. We then implement an outlier detection framework and conduct a factorial design experiment to understand the pros and cons of various optimizations proposed by us as well as those proposed in the literature, both independently and in conjunction with one another, on a diverse set of real-life datasets. To the best of our knowledge this is the first such study in the literature. The outcome of this study is a family of state of the art distance-based outlier detection algorithms. Our detailed empirical study supports the following observations. The combination of optimization strategies enables significant efficiency gains. Our factorial design study highlights the important fact that no single optimization or combination of optimizations (factors) always dominates on all types of data. Our study also allows us to characterize when a certain combination of optimizations is likely to prevail and helps provide interesting and useful insights for moving forward in this domain.

UR - http://www.scopus.com/inward/record.url?scp=80053412531&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=80053412531&partnerID=8YFLogxK
M3 - Chapter
AN - SCOPUS:80053412531
VL - 3
SP - 1469
EP - 1480
BT - Proceedings of the VLDB Endowment
ER -