Discovery of complex glitch patterns: A novel approach to Quantitative Data Cleaning

Laure Berti-Equille, Tamraparni Dasu, Divesh Srivastava

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

36 Citations (Scopus)

Abstract

Quantitative Data Cleaning (QDC) is the use of statistical and other analytical techniques to detect, quantify, and correct data quality problems (or glitches). Current QDC approaches focus on addressing each category of data glitch individually. However, in real-world data, different types of data glitches co-occur in complex patterns. These patterns and interactions between glitches offer valuable clues for developing effective domain-specific quantitative cleaning strategies. In this paper, we address the shortcomings of the extant QDC methods by proposing a novel framework, the DEC (Detect-Explore-Clean) framework. It is a comprehensive approach for the definition, detection and cleaning of complex, multi-type data glitches. We exploit the distributions and interactions of different types of glitches to develop data-driven cleaning strategies that may offer significant advantages over blind strategies. The DEC framework is a statistically rigorous methodology for evaluating and scoring glitches and selecting the quantitative cleaning strategies that result in cleaned data sets that are statistically proximal to user specifications. We demonstrate the efficacy and scalability of the DEC framework on very large real-world and synthetic data sets.
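The abstract's three stages can be illustrated with a small sketch. This is not the authors' implementation: the glitch types (missing, outlier, duplicate), the range-based outlier rule, the median imputation, and all function and field names (`detect_glitches`, `glitch_patterns`, `clean_by_pattern`, `id`, `value`) are assumptions chosen for illustration. The sketch tags each row with every glitch type it exhibits, counts which combinations co-occur, and then applies a cleaning rule that depends on the observed pattern rather than treating each glitch type in isolation.

```python
from collections import Counter
from statistics import median


def detect_glitches(rows, low, high):
    """Detect: return one frozenset of glitch labels per row."""
    seen = set()
    labels = []
    for row in rows:
        found = set()
        if row["value"] is None:
            found.add("missing")
        elif not (low <= row["value"] <= high):
            # Assumed rule: values outside user-given bounds are outliers.
            found.add("outlier")
        key = (row["id"], row["value"])
        if key in seen:
            found.add("duplicate")
        seen.add(key)
        labels.append(frozenset(found))
    return labels


def glitch_patterns(labels):
    """Explore: count how often each combination of glitch types occurs."""
    return Counter(label for label in labels if label)


def clean_by_pattern(rows, labels):
    """Clean: drop duplicates and outliers, impute rows that are only missing."""
    clean_values = [r["value"] for r, l in zip(rows, labels) if not l]
    fill = median(clean_values) if clean_values else None
    cleaned = []
    for row, label in zip(rows, labels):
        if "duplicate" in label or "outlier" in label:
            continue
        if "missing" in label:
            cleaned.append({**row, "value": fill})
        else:
            cleaned.append(row)
    return cleaned


# Hypothetical toy data, not taken from the paper.
rows = [
    {"id": 1, "value": 10.0},
    {"id": 2, "value": None},
    {"id": 3, "value": 500.0},
    {"id": 3, "value": 500.0},
    {"id": 4, "value": 12.0},
    {"id": 2, "value": None},
]

labels = detect_glitches(rows, low=0.0, high=100.0)
for pattern, count in glitch_patterns(labels).items():
    print(sorted(pattern), count)
print(clean_by_pattern(rows, labels))
```

In this toy data the repeated record for `id` 3 is both an outlier and a duplicate, which is the kind of co-occurrence the abstract argues a per-glitch approach would overlook. The paper's scoring of candidate strategies against user specifications is not modeled here.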

Original language: English
Title of host publication: Proceedings - International Conference on Data Engineering
Pages: 733-744
Number of pages: 12
DOI: https://doi.org/10.1109/ICDE.2011.5767864
Publication status: Published - 6 Jun 2011
Externally published: Yes
Event: 2011 IEEE 27th International Conference on Data Engineering, ICDE 2011 - Hannover, Germany
Duration: 11 Apr 2011 to 16 Apr 2011

Fingerprint

Cleaning
Scalability
Specifications

ASJC Scopus subject areas

  • Information Systems
  • Signal Processing
  • Software

Cite this

Berti-Equille, L., Dasu, T., & Srivastava, D. (2011). Discovery of complex glitch patterns: A novel approach to Quantitative Data Cleaning. In Proceedings - International Conference on Data Engineering (pp. 733-744). [5767864] https://doi.org/10.1109/ICDE.2011.5767864

@inproceedings{685f98b638ee43abba99fc85e7544dae,
title = "Discovery of complex glitch patterns: A novel approach to Quantitative Data Cleaning",
abstract = "Quantitative Data Cleaning (QDC) is the use of statistical and other analytical techniques to detect, quantify, and correct data quality problems (or glitches). Current QDC approaches focus on addressing each category of data glitch individually. However, in real-world data, different types of data glitches co-occur in complex patterns. These patterns and interactions between glitches offer valuable clues for developing effective domain-specific quantitative cleaning strategies. In this paper, we address the shortcomings of the extant QDC methods by proposing a novel framework, the DEC (Detect-Explore-Clean) framework. It is a comprehensive approach for the definition, detection and cleaning of complex, multi-type data glitches. We exploit the distributions and interactions of different types of glitches to develop data-driven cleaning strategies that may offer significant advantages over blind strategies. The DEC framework is a statistically rigorous methodology for evaluating and scoring glitches and selecting the quantitative cleaning strategies that result in cleaned data sets that are statistically proximal to user specifications. We demonstrate the efficacy and scalability of the DEC framework on very large real-world and synthetic data sets.",
author = "Laure Berti-Equille and Tamraparni Dasu and Divesh Srivastava",
year = "2011",
month = "6",
day = "6",
doi = "10.1109/ICDE.2011.5767864",
language = "English",
isbn = "9781424489589",
pages = "733--744",
booktitle = "Proceedings - International Conference on Data Engineering",

}

TY - GEN

T1 - Discovery of complex glitch patterns

T2 - A novel approach to Quantitative Data Cleaning

AU - Berti-Equille, Laure

AU - Dasu, Tamraparni

AU - Srivastava, Divesh

PY - 2011/6/6

Y1 - 2011/6/6

AB - Quantitative Data Cleaning (QDC) is the use of statistical and other analytical techniques to detect, quantify, and correct data quality problems (or glitches). Current QDC approaches focus on addressing each category of data glitch individually. However, in real-world data, different types of data glitches co-occur in complex patterns. These patterns and interactions between glitches offer valuable clues for developing effective domain-specific quantitative cleaning strategies. In this paper, we address the shortcomings of the extant QDC methods by proposing a novel framework, the DEC (Detect-Explore-Clean) framework. It is a comprehensive approach for the definition, detection and cleaning of complex, multi-type data glitches. We exploit the distributions and interactions of different types of glitches to develop data-driven cleaning strategies that may offer significant advantages over blind strategies. The DEC framework is a statistically rigorous methodology for evaluating and scoring glitches and selecting the quantitative cleaning strategies that result in cleaned data sets that are statistically proximal to user specifications. We demonstrate the efficacy and scalability of the DEC framework on very large real-world and synthetic data sets.

UR - http://www.scopus.com/inward/record.url?scp=79957860515&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=79957860515&partnerID=8YFLogxK

U2 - 10.1109/ICDE.2011.5767864

DO - 10.1109/ICDE.2011.5767864

M3 - Conference contribution

AN - SCOPUS:79957860515

SN - 9781424489589

SP - 733

EP - 744

BT - Proceedings - International Conference on Data Engineering

ER -