Discovery of complex glitch patterns: A novel approach to Quantitative Data Cleaning

Laure Berti-Équille, Tamraparni Dasu, Divesh Srivastava

Research output: Chapter in Book/Report/Conference proceedingConference contribution

37 Citations (Scopus)

Abstract

Quantitative Data Cleaning (QDC) is the use of statistical and other analytical techniques to detect, quantify, and correct data quality problems (or glitches). Current QDC approaches focus on addressing each category of data glitch individually. However, in real-world data, different types of data glitches co-occur in complex patterns. These patterns and interactions between glitches offer valuable clues for developing effective domain-specific quantitative cleaning strategies. In this paper, we address the shortcomings of the extant QDC methods by proposing a novel framework, the DEC (Detect-Explore-Clean) framework. It is a comprehensive approach for the definition, detection and cleaning of complex, multi-type data glitches. We exploit the distributions and interactions of different types of glitches to develop data-driven cleaning strategies that may offer significant advantages over blind strategies. The DEC framework is a statistically rigorous methodology for evaluating and scoring glitches and selecting the quantitative cleaning strategies that result in cleaned data sets that are statistically proximal to user specifications. We demonstrate the efficacy and scalability of the DEC framework on very large real-world and synthetic data sets.

Original languageEnglish
Title of host publication2011 IEEE 27th International Conference on Data Engineering, ICDE 2011
Pages733-744
Number of pages12
DOIs
Publication statusPublished - 6 Jun 2011
Event2011 IEEE 27th International Conference on Data Engineering, ICDE 2011 - Hannover, Germany
Duration: 11 Apr 201116 Apr 2011

Publication series

NameProceedings - International Conference on Data Engineering
ISSN (Print)1084-4627

Other

Other2011 IEEE 27th International Conference on Data Engineering, ICDE 2011
CountryGermany
CityHannover
Period11/4/1116/4/11

    Fingerprint

ASJC Scopus subject areas

  • Software
  • Signal Processing
  • Information Systems

Cite this

Berti-Équille, L., Dasu, T., & Srivastava, D. (2011). Discovery of complex glitch patterns: A novel approach to Quantitative Data Cleaning. In 2011 IEEE 27th International Conference on Data Engineering, ICDE 2011 (pp. 733-744). [5767864] (Proceedings - International Conference on Data Engineering). https://doi.org/10.1109/ICDE.2011.5767864