A masking index for quantifying hidden glitches

Laure Berti-Equille, Ji Meng Loh, Tamraparni Dasu

Research output: Contribution to journal (Article)

4 Citations (Scopus)

Abstract

Data glitches are errors in a dataset. They are complex entities that often span multiple attributes and records. When they co-occur in data, the presence of one type of glitch can hinder the detection of another type of glitch. This phenomenon is called masking. In this paper, we define two important types of masking and propose a novel, statistically rigorous indicator called masking index for quantifying the hidden glitches. We outline four cases of masking: outliers masked by missing values, outliers masked by duplicates, duplicates masked by missing values, and duplicates masked by outliers. The masking index is critical for data quality profiling and data exploration. It enables a user to measure the extent of masking and hence the confidence in the data. In this sense, it is a valuable data quality index for choosing an anomaly detection method that is best suited for the glitches that are present in any given dataset. We demonstrate the utility and effectiveness of the masking index by intensive experiments on synthetic and real-world datasets.
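To make the first of the four cases concrete (outliers masked by missing values), here is a minimal sketch. It is not the paper's masking index, whose definition is not given in this abstract. It only illustrates the phenomenon under assumptions chosen for the example: a plain z-score detector with a threshold of 2.0, toy records of the form `(x, y)`, and a cleaning step that drops any record with a missing `y` before outlier detection runs. The helper name `zscore_outliers` and the data are hypothetical.

```python
import math

def zscore_outliers(values, threshold=2.0):
    """Return the indices whose absolute z-score exceeds the threshold.

    Assumption for illustration only: a simple z-score rule using the
    population standard deviation. The paper may use other detectors.
    """
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mean) / std > threshold]

# Hypothetical toy records (x, y); None marks a missing value.
records = [
    (10.0, 1.0), (11.0, 1.2), (9.0, 0.9), (10.0, 1.1),
    (12.0, 1.0), (10.0, 0.8), (11.0, 1.1), (50.0, None),
]

# Detection on every x value: the extreme record is visible.
flagged_all = zscore_outliers([x for x, _ in records])

# Detection after dropping records with a missing y (complete-case
# filtering): the extreme record never reaches the detector.
complete = [(i, x) for i, (x, y) in enumerate(records) if y is not None]
flagged_complete = [
    complete[j][0] for j in zscore_outliers([x for _, x in complete])
]

# Records flagged on the full data but not after filtering are the
# outliers hidden by the missing value.
masked = sorted(set(flagged_all) - set(flagged_complete))
print("flagged on all records:", flagged_all)
print("flagged after dropping incomplete records:", flagged_complete)
print("masked outliers:", masked)
```

In this toy setup the record with `x = 50.0` is flagged when all records are inspected, but it disappears from the outlier report once incomplete records are removed first. That is the kind of hidden glitch the abstract says the masking index is meant to quantify.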

Original language: English
Pages (from-to): 253-277
Number of pages: 25
Journal: Knowledge and Information Systems
Volume: 44
Issue number: 2
DOI: 10.1007/s10115-014-0760-0
Publication status: Published - 1 Jul 2014

Keywords

  • Anomaly detection
  • Duplicate record identification
  • Masking
  • Missing values
  • Outlier detection

ASJC Scopus subject areas

  • Artificial Intelligence
  • Software
  • Information Systems
  • Hardware and Architecture
  • Human-Computer Interaction

Cite this

A masking index for quantifying hidden glitches. / Berti-Equille, Laure; Loh, Ji Meng; Dasu, Tamraparni.

In: Knowledge and Information Systems, Vol. 44, No. 2, 01.07.2014, p. 253-277.

@article{310c4e2ffa8b4777a1bbe95aa11eef6f,
title = "A masking index for quantifying hidden glitches",
abstract = "Data glitches are errors in a dataset. They are complex entities that often span multiple attributes and records. When they co-occur in data, the presence of one type of glitch can hinder the detection of another type of glitch. This phenomenon is called masking. In this paper, we define two important types of masking and propose a novel, statistically rigorous indicator called masking index for quantifying the hidden glitches. We outline four cases of masking: outliers masked by missing values, outliers masked by duplicates, duplicates masked by missing values, and duplicates masked by outliers. The masking index is critical for data quality profiling and data exploration. It enables a user to measure the extent of masking and hence the confidence in the data. In this sense, it is a valuable data quality index for choosing an anomaly detection method that is best suited for the glitches that are present in any given dataset. We demonstrate the utility and effectiveness of the masking index by intensive experiments on synthetic and real-world datasets.",
keywords = "Anomaly detection, Duplicate record identification, Masking, Missing values, Outlier detection",
author = "Laure Berti-Equille and Loh, {Ji Meng} and Tamraparni Dasu",
year = "2014",
month = "7",
day = "1",
doi = "10.1007/s10115-014-0760-0",
language = "English",
volume = "44",
pages = "253--277",
journal = "Knowledge and Information Systems",
issn = "0219-1377",
publisher = "Springer London",
number = "2",
}

TY - JOUR
T1 - A masking index for quantifying hidden glitches
AU - Berti-Equille, Laure
AU - Loh, Ji Meng
AU - Dasu, Tamraparni
PY - 2014/7/1
Y1 - 2014/7/1
AB - Data glitches are errors in a dataset. They are complex entities that often span multiple attributes and records. When they co-occur in data, the presence of one type of glitch can hinder the detection of another type of glitch. This phenomenon is called masking. In this paper, we define two important types of masking and propose a novel, statistically rigorous indicator called masking index for quantifying the hidden glitches. We outline four cases of masking: outliers masked by missing values, outliers masked by duplicates, duplicates masked by missing values, and duplicates masked by outliers. The masking index is critical for data quality profiling and data exploration. It enables a user to measure the extent of masking and hence the confidence in the data. In this sense, it is a valuable data quality index for choosing an anomaly detection method that is best suited for the glitches that are present in any given dataset. We demonstrate the utility and effectiveness of the masking index by intensive experiments on synthetic and real-world datasets.
KW - Anomaly detection
KW - Duplicate record identification
KW - Masking
KW - Missing values
KW - Outlier detection
UR - http://www.scopus.com/inward/record.url?scp=84937634235&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84937634235&partnerID=8YFLogxK
U2 - 10.1007/s10115-014-0760-0
DO - 10.1007/s10115-014-0760-0
M3 - Article
VL - 44
SP - 253
EP - 277
JO - Knowledge and Information Systems
JF - Knowledge and Information Systems
SN - 0219-1377
IS - 2
ER -