Mining for outliers in sequential databases

Pei Sun, Sanjay Chawla, Bavani Arunasalam

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

53 Citations (Scopus)

Abstract

The mining of outliers (or anomaly detection) in large databases remains an active area of research with many potential applications. Over the last several years, many novel methods have been proposed to mine for outliers efficiently and accurately. In this paper we propose a unique approach to mining sequential outliers using Probabilistic Suffix Trees (PST). The key insight that underpins our work is that we can distinguish outliers from non-outliers by examining only the nodes close to the root of the PST. Thus, if the goal is just to mine outliers, we can drastically reduce the size of the PST and, with it, its construction and query time. In our experiments, we show that on a real data set consisting of protein sequences, by retaining less than 5% of the original PST we can retrieve all the outliers that were reported by the full-sized PST. We also carry out a detailed comparison between two measures of sequence similarity, the normalized probability and the odds, and show that while the current research literature on PSTs favours the odds, for outlier detection it is the normalized probability that gives far superior results. We provide an information-theoretic argument based on entropy to explain the success of the normalized probability measure. Finally, we describe a more efficient implementation of the PST algorithm, which dramatically reduces its construction time compared to the implementation of Bejerano [3].
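
The idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it stands in for a PST with plain context counts capped at a hypothetical `max_depth` (mimicking "nodes close to the root"), and it assumes that the normalized probability is the log-probability of a sequence divided by its length. The function names, the add-one smoothing, and the back-off rule are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def build_counts(sequences, max_depth):
    """Count next-symbol frequencies for every context of length 0..max_depth."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i, sym in enumerate(seq):
            for d in range(max_depth + 1):
                if i - d < 0:
                    break
                # The d symbols preceding position i form the context.
                counts[seq[i - d:i]][sym] += 1
    return counts


def next_prob(counts, context, sym, alphabet_size, max_depth):
    """P(sym | longest stored suffix of context), with add-one smoothing."""
    ctx = context[-max_depth:] if max_depth > 0 else ""
    while ctx and ctx not in counts:
        ctx = ctx[1:]  # back off to a shorter suffix (closer to the root)
    node = counts.get(ctx, {})
    total = sum(node.values())
    return (node.get(sym, 0) + 1) / (total + alphabet_size)


def normalized_log_prob(counts, seq, alphabet_size, max_depth):
    """Length-normalized log-probability (assumed form of the measure)."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    log_p = sum(
        math.log(next_prob(counts, seq[:i], sym, alphabet_size, max_depth))
        for i, sym in enumerate(seq)
    )
    return log_p / len(seq)


def rank_outliers(counts, sequences, alphabet_size, max_depth):
    """Return sequences ordered from most to least outlying (lowest score first)."""
    return sorted(
        sequences,
        key=lambda s: normalized_log_prob(counts, s, alphabet_size, max_depth),
    )
```

With counts built from a training set, `rank_outliers(counts, candidates, alphabet_size, max_depth)` lists the least likely sequences first. Lowering `max_depth` shrinks the stored model, which is the rough analogue here of keeping only the upper part of the tree.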

Original language: English
Title of host publication: Proceedings of the Sixth SIAM International Conference on Data Mining
Pages: 94-105
Number of pages: 12
Volume: 2006
Publication status: Published - 2006
Externally published: Yes
Event: Sixth SIAM International Conference on Data Mining - Bethesda, MD
Duration: 20 Apr 2006 to 22 Apr 2006

Other

Other: Sixth SIAM International Conference on Data Mining
City: Bethesda, MD
Period: 20/4/06 to 22/4/06

Fingerprint

Entropy
Proteins
Experiments

ASJC Scopus subject areas

  • Engineering(all)

Cite this

Sun, P., Chawla, S., & Arunasalam, B. (2006). Mining for outliers in sequential databases. In Proceedings of the Sixth SIAM International Conference on Data Mining (Vol. 2006, pp. 94-105)

Mining for outliers in sequential databases. / Sun, Pei; Chawla, Sanjay; Arunasalam, Bavani.

Proceedings of the Sixth SIAM International Conference on Data Mining. Vol. 2006 2006. p. 94-105.

Sun, P, Chawla, S & Arunasalam, B 2006, Mining for outliers in sequential databases. in Proceedings of the Sixth SIAM International Conference on Data Mining. vol. 2006, pp. 94-105, Sixth SIAM International Conference on Data Mining, Bethesda, MD, 20/4/06.
Sun P, Chawla S, Arunasalam B. Mining for outliers in sequential databases. In Proceedings of the Sixth SIAM International Conference on Data Mining. Vol. 2006. 2006. p. 94-105
Sun, Pei ; Chawla, Sanjay ; Arunasalam, Bavani. / Mining for outliers in sequential databases. Proceedings of the Sixth SIAM International Conference on Data Mining. Vol. 2006 2006. pp. 94-105
@inproceedings{6a90557601354538925fcbc8b643769f,
title = "Mining for outliers in sequential databases",
abstract = "The mining of outliers (or anomaly detection) in large databases continues to remain an active area of research with many potential applications. Over the last several years many novel methods have been proposed to efficiently and accurately mine for outliers. In this paper we propose a unique approach to mine for sequential outliers using Probabilistic Suffix Trees (PST). The key insight that underpins our work is that we can distinguish outliers from non-outliers by only examining the nodes close to the root of the PST. Thus, if the goal is to just mine outliers, then we can drastically reduce the size of the PST and reduce its construction and query time. In our experiments, we show that on a real data set consisting of protein sequences, by retaining less than 5{\%} of the original PST we can retrieve all the outliers that were reported by the full-sized PST. We also carry out a detailed comparison between two measures of sequence similarity: the normalized probability and the odds and show that while the current research literature in PST favours the odds, for outlier detection it is normalized probability which gives far superior results. We provide an information theoretic argument based on entropy to explain the success of the normalized probability measure. Finally, we describe a more efficient implementation of the PST algorithm, which dramatically reduces its construction time compared to the implementation of Bejerano [3].",
author = "Pei Sun and Sanjay Chawla and Bavani Arunasalam",
year = "2006",
language = "English",
isbn = "089871611X",
volume = "2006",
pages = "94--105",
booktitle = "Proceedings of the Sixth SIAM International Conference on Data Mining",

}

TY - GEN

T1 - Mining for outliers in sequential databases

AU - Sun, Pei

AU - Chawla, Sanjay

AU - Arunasalam, Bavani

PY - 2006

Y1 - 2006

AB - The mining of outliers (or anomaly detection) in large databases continues to remain an active area of research with many potential applications. Over the last several years many novel methods have been proposed to efficiently and accurately mine for outliers. In this paper we propose a unique approach to mine for sequential outliers using Probabilistic Suffix Trees (PST). The key insight that underpins our work is that we can distinguish outliers from non-outliers by only examining the nodes close to the root of the PST. Thus, if the goal is to just mine outliers, then we can drastically reduce the size of the PST and reduce its construction and query time. In our experiments, we show that on a real data set consisting of protein sequences, by retaining less than 5% of the original PST we can retrieve all the outliers that were reported by the full-sized PST. We also carry out a detailed comparison between two measures of sequence similarity: the normalized probability and the odds and show that while the current research literature in PST favours the odds, for outlier detection it is normalized probability which gives far superior results. We provide an information theoretic argument based on entropy to explain the success of the normalized probability measure. Finally, we describe a more efficient implementation of the PST algorithm, which dramatically reduces its construction time compared to the implementation of Bejerano [3].

UR - http://www.scopus.com/inward/record.url?scp=33745447341&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=33745447341&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:33745447341

SN - 089871611X

SN - 9780898716115

VL - 2006

SP - 94

EP - 105

BT - Proceedings of the Sixth SIAM International Conference on Data Mining

ER -