A masking index for quantifying hidden glitches

Laure Berti-Equille, Ji Meng Loh, Tamraparni Dasu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

1 Citation (Scopus)

Abstract

Data glitches are errors in a data set; they are complex entities that often span multiple attributes and records. When glitches co-occur, the presence of one type can hinder the detection of another. This phenomenon is called masking. In this paper, we define two important types of masking and propose a novel, statistically rigorous indicator, the masking index, for quantifying hidden glitches in four cases of masking: outliers masked by missing values, outliers masked by duplicates, duplicates masked by missing values, and duplicates masked by outliers. The masking index is critical for data quality profiling and data exploration because it enables a user to measure the extent of masking and hence the confidence in the data. In this sense, it is a valuable data quality index for measuring the true cleanliness of the data. It also provides an objective, quantitative basis for choosing the anomaly detection method best suited to the glitches present in a given data set. We demonstrate the utility and effectiveness of the masking index through extensive experiments on synthetic and real-world datasets.
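
As an illustration of the first masking case named in the abstract (outliers masked by missing values), the following minimal Python sketch may help. It is not the masking index defined in the paper, whose formula is not given on this page. It assumes a toy detector (a median/MAD rule on one attribute `x`) and a pipeline that discards any record whose second attribute `y` is missing before detection. The function names, the threshold of 3.5, the sample records, and the "hidden fraction" ratio are all hypothetical choices made for this example.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return positions of values flagged by a median/MAD rule.

    This is a common robust heuristic, used here only as a stand-in
    detector; the paper may use entirely different detection methods.
    """
    med = median(values)
    deviations = [abs(v - med) for v in values]
    mad = median(deviations)
    if mad == 0:
        return set()
    return {i for i, d in enumerate(deviations) if 0.6745 * d / mad > threshold}


def hidden_outlier_fraction(records, threshold=3.5):
    """Toy measure: share of x-outliers lost when incomplete records are dropped.

    records is a list of (x, y) pairs where y may be None (missing).
    Assumes x itself is never missing. This ratio is an illustrative
    assumption, NOT the masking index proposed by the authors.
    """
    all_idx = list(range(len(records)))
    complete_idx = [i for i, (_, y) in enumerate(records) if y is not None]

    # Outliers in x when every record is considered.
    flagged_all = {
        all_idx[j]
        for j in mad_outliers([records[i][0] for i in all_idx], threshold)
    }
    # Outliers in x after dropping records with a missing y.
    flagged_complete = {
        complete_idx[j]
        for j in mad_outliers([records[i][0] for i in complete_idx], threshold)
    }

    hidden = flagged_all - flagged_complete
    fraction = len(hidden) / len(flagged_all) if flagged_all else 0.0
    return hidden, fraction


# Hypothetical sample: records 4 and 7 have extreme x values,
# but record 4 also has a missing y.
records = [
    (10.0, 1.0), (11.0, 2.0), (9.0, None), (10.5, 1.5),
    (95.0, None), (10.2, 1.1), (9.8, 0.9), (120.0, 1.3),
]
hidden, fraction = hidden_outlier_fraction(records)
print(hidden, fraction)
```

In this sample, the extreme value in record 4 is never examined because that record is discarded for its missing `y`, so only one of the two extreme values is reported. A ratio of this kind only hints at what a masking measure might capture; the paper's own definition should be consulted for the actual index.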

Original language: English
Title of host publication: Proceedings - IEEE International Conference on Data Mining, ICDM
Pages: 21-30
Number of pages: 10
DOI: https://doi.org/10.1109/ICDM.2013.16
Publication status: Published - 2013
Event: 13th IEEE International Conference on Data Mining, ICDM 2013 - Dallas, TX, United States
Duration: 7 Dec 2013 to 10 Dec 2013

Keywords

  • Anomaly detection
  • data cleaning
  • duplicate record identification
  • masking
  • missing values
  • outlier detection

ASJC Scopus subject areas

  • Engineering (all)

Cite this

Berti-Equille, L., Loh, J. M., & Dasu, T. (2013). A masking index for quantifying hidden glitches. In Proceedings - IEEE International Conference on Data Mining, ICDM (pp. 21-30). [6729486] https://doi.org/10.1109/ICDM.2013.16

@inproceedings{0da7245d607347e69bd57c1e2cb6fac0,
title = "A masking index for quantifying hidden glitches",
keywords = "Anomaly detection, data cleaning, duplicate record identification, masking, missing values, outlier detection",
author = "Laure Berti-Equille and Loh, {Ji Meng} and Tamraparni Dasu",
year = "2013",
doi = "10.1109/ICDM.2013.16",
language = "English",
pages = "21--30",
booktitle = "Proceedings - IEEE International Conference on Data Mining, ICDM",

}

TY - GEN

T1 - A masking index for quantifying hidden glitches

AU - Berti-Equille, Laure

AU - Loh, Ji Meng

AU - Dasu, Tamraparni

PY - 2013

Y1 - 2013

KW - Anomaly detection

KW - data cleaning

KW - duplicate record identification

KW - masking

KW - missing values

KW - outlier detection

UR - http://www.scopus.com/inward/record.url?scp=84894647567&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84894647567&partnerID=8YFLogxK

U2 - 10.1109/ICDM.2013.16

DO - 10.1109/ICDM.2013.16

M3 - Conference contribution

SP - 21

EP - 30

BT - Proceedings - IEEE International Conference on Data Mining, ICDM

ER -