Finding local anomalies in very high dimensional space

Timothy De Vries, Sanjay Chawla, Michael E. Houle

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

40 Citations (Scopus)

Abstract

Time, cost and energy efficiency are critical factors for many data analysis techniques when the size and dimensionality of the data are very large. We investigate the use of Local Outlier Factor (LOF) for data of this type, providing a motivating example from real-world data. We propose Projection-Indexed Nearest-Neighbours (PINN), a novel technique that exploits extended nearest-neighbour sets in a reduced-dimensional space to create an accurate approximation for k-nearest-neighbour distances, which is used as the core density measurement within LOF. The reduced dimensionality allows for efficient sub-quadratic indexing in the number of items in the data set, where previously only quadratic performance was possible. A detailed theoretical analysis of Random Projection (RP) and PINN shows that we are able to preserve the density of the intrinsic manifold of the data set after projection. Experimental results show that PINN outperforms the standard projection methods RP and PCA when measuring LOF for many high-dimensional real-world data sets of up to 300000 elements and 102600 dimensions.
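
The pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a Gaussian random projection, uses a brute-force scan where the paper would use an index in the reduced space, and every function name and parameter (`random_projection`, `approx_knn`, `lof_scores`, `extended`) is hypothetical. It only shows the idea of collecting an extended candidate set in the projected space, re-ranking the candidates by distance in the original space, and feeding the resulting k-nearest-neighbour distances into LOF.

```python
import math
import random


def random_projection(points, target_dim, rng):
    """Project points with a Gaussian random matrix scaled by 1/sqrt(target_dim)."""
    dim = len(points[0])
    scale = 1.0 / math.sqrt(target_dim)
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(dim)]
              for _ in range(target_dim)]
    return [[sum(r * x for r, x in zip(row, p)) for row in matrix]
            for p in points]


def nearest(points, i, candidates, count):
    """Return the `count` candidates closest to points[i] as (distance, index)."""
    scored = sorted((math.dist(points[i], points[j]), j)
                    for j in candidates if j != i)
    return scored[:count]


def approx_knn(points, projected, k, extended):
    """Take `extended` >= k neighbours in the projected space, then keep the
    k closest of those candidates by distance in the original space.
    Assumption: the brute-force scan stands in for a spatial index."""
    n = len(points)
    result = []
    for i in range(n):
        candidates = [j for _, j in nearest(projected, i, range(n), extended)]
        result.append(nearest(points, i, candidates, k))
    return result


def lof_scores(knn):
    """Local Outlier Factor from per-point (distance, index) neighbour lists.
    Assumes no duplicate points, so reachability sums are non-zero."""
    k_dist = [nbrs[-1][0] for nbrs in knn]
    lrd = []
    for nbrs in knn:
        # Reachability distance of the point from each of its neighbours.
        reach = [max(k_dist[j], d) for d, j in nbrs]
        lrd.append(len(reach) / sum(reach))
    return [sum(lrd[j] for _, j in nbrs) / (len(nbrs) * lrd[i])
            for i, nbrs in enumerate(knn)]


# Toy data (made up for illustration): one Gaussian cluster in 50 dimensions
# plus a single planted point far from it, at index 100.
rng = random.Random(0)
data = [[rng.gauss(0.0, 1.0) for _ in range(50)] for _ in range(100)]
data.append([6.0] * 50)

projected = random_projection(data, target_dim=10, rng=rng)
scores = lof_scores(approx_knn(data, projected, k=5, extended=20))
top = max(range(len(scores)), key=scores.__getitem__)
print(top, round(scores[top], 2))
```

In this toy setting the planted point should receive the largest LOF score, while points inside the cluster score close to 1. Choosing `extended` larger than `k` is what lets the re-ranking step recover neighbours whose order was distorted by the projection.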

Original language: English
Title of host publication: Proceedings - IEEE International Conference on Data Mining, ICDM
Pages: 128-137
Number of pages: 10
DOIs: https://doi.org/10.1109/ICDM.2010.151
Publication status: Published - 2010
Externally published: Yes
Event: 10th IEEE International Conference on Data Mining, ICDM 2010 - Sydney, NSW
Duration: 14 Dec 2010 to 17 Dec 2010

Fingerprint

Energy efficiency
Costs

Keywords

  • Anomaly detection
  • Dimensionality reduction

ASJC Scopus subject areas

  • Engineering (all)

Cite this

De Vries, T., Chawla, S., & Houle, M. E. (2010). Finding local anomalies in very high dimensional space. In Proceedings - IEEE International Conference on Data Mining, ICDM (pp. 128-137). [5693966] https://doi.org/10.1109/ICDM.2010.151

Finding local anomalies in very high dimensional space. / De Vries, Timothy; Chawla, Sanjay; Houle, Michael E.

Proceedings - IEEE International Conference on Data Mining, ICDM. 2010. p. 128-137 5693966.

De Vries, T, Chawla, S & Houle, ME 2010, Finding local anomalies in very high dimensional space. in Proceedings - IEEE International Conference on Data Mining, ICDM., 5693966, pp. 128-137, 10th IEEE International Conference on Data Mining, ICDM 2010, Sydney, NSW, 14/12/10. https://doi.org/10.1109/ICDM.2010.151
De Vries T, Chawla S, Houle ME. Finding local anomalies in very high dimensional space. In Proceedings - IEEE International Conference on Data Mining, ICDM. 2010. p. 128-137. 5693966 https://doi.org/10.1109/ICDM.2010.151
De Vries, Timothy ; Chawla, Sanjay ; Houle, Michael E. / Finding local anomalies in very high dimensional space. Proceedings - IEEE International Conference on Data Mining, ICDM. 2010. pp. 128-137
@inproceedings{b91a8f4453524c2d941baf5f44d042c7,
title = "Finding local anomalies in very high dimensional space",
abstract = "Time, cost and energy efficiency are critical factors for many data analysis techniques when the size and dimensionality of the data are very large. We investigate the use of Local Outlier Factor (LOF) for data of this type, providing a motivating example from real-world data. We propose Projection-Indexed Nearest-Neighbours (PINN), a novel technique that exploits extended nearest-neighbour sets in a reduced-dimensional space to create an accurate approximation for k-nearest-neighbour distances, which is used as the core density measurement within LOF. The reduced dimensionality allows for efficient sub-quadratic indexing in the number of items in the data set, where previously only quadratic performance was possible. A detailed theoretical analysis of Random Projection (RP) and PINN shows that we are able to preserve the density of the intrinsic manifold of the data set after projection. Experimental results show that PINN outperforms the standard projection methods RP and PCA when measuring LOF for many high-dimensional real-world data sets of up to 300000 elements and 102600 dimensions.",
keywords = "Anomaly detection, Dimensionality reduction",
author = "{De Vries}, Timothy and Chawla, Sanjay and Houle, {Michael E.}",
year = "2010",
doi = "10.1109/ICDM.2010.151",
language = "English",
isbn = "9780769542560",
pages = "128--137",
booktitle = "Proceedings - IEEE International Conference on Data Mining, ICDM",

}

TY - GEN

T1 - Finding local anomalies in very high dimensional space

AU - De Vries, Timothy

AU - Chawla, Sanjay

AU - Houle, Michael E.

PY - 2010

Y1 - 2010

AB - Time, cost and energy efficiency are critical factors for many data analysis techniques when the size and dimensionality of the data are very large. We investigate the use of Local Outlier Factor (LOF) for data of this type, providing a motivating example from real-world data. We propose Projection-Indexed Nearest-Neighbours (PINN), a novel technique that exploits extended nearest-neighbour sets in a reduced-dimensional space to create an accurate approximation for k-nearest-neighbour distances, which is used as the core density measurement within LOF. The reduced dimensionality allows for efficient sub-quadratic indexing in the number of items in the data set, where previously only quadratic performance was possible. A detailed theoretical analysis of Random Projection (RP) and PINN shows that we are able to preserve the density of the intrinsic manifold of the data set after projection. Experimental results show that PINN outperforms the standard projection methods RP and PCA when measuring LOF for many high-dimensional real-world data sets of up to 300000 elements and 102600 dimensions.

KW - Anomaly detection

KW - Dimensionality reduction

UR - http://www.scopus.com/inward/record.url?scp=79951739637&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=79951739637&partnerID=8YFLogxK

U2 - 10.1109/ICDM.2010.151

DO - 10.1109/ICDM.2010.151

M3 - Conference contribution

SN - 9780769542560

SP - 128

EP - 137

BT - Proceedings - IEEE International Conference on Data Mining, ICDM

ER -