Towards an end-to-end human-centric data cleaning framework

El Kindi Rezig, Mourad Ouzzani, Ahmed Elmagarmid, Walid G. Aref, Michael Stonebraker

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Data cleaning refers to the process of detecting and fixing errors in data. Human involvement is instrumental at several stages of this process, such as providing rules or validating computed repairs. There is a plethora of data cleaning algorithms addressing a wide range of data errors (e.g., duplicates, violations of integrity constraints, and missing values). Many of these algorithms involve a human in the loop; however, that involvement is usually tightly coupled to the underlying cleaning algorithm. In a real data cleaning pipeline, several cleaning operations are performed using different tools. High-level reasoning about these tools, when they are combined to repair the data, has the potential to unlock useful ways to involve humans in the cleaning process. Additionally, we believe there is an opportunity to benefit from recent advances in active learning to minimize the effort humans must spend verifying data items produced by tools or by other humans. There is currently no end-to-end data cleaning framework that systematically involves humans in the cleaning pipeline regardless of the underlying cleaning algorithms. In this paper, we present the opportunities that such a framework could offer and highlight key challenges that need to be addressed to realize this vision. We present a design vision and discuss scenarios that motivate the need for this framework to judiciously assist humans in the cleaning process.

Original language: English
Title of host publication: Proceedings of the Workshop on Human-In-the-Loop Data Analytics, HILDA 2019
Publisher: Association for Computing Machinery
ISBN (Electronic): 9781450367912
DOIs: https://doi.org/10.1145/3328519.3329133
Publication status: Published - 5 Jul 2019
Event: 2019 Workshop on Human-In-the-Loop Data Analytics, HILDA 2019, co-located with SIGMOD 2019 - Amsterdam, Netherlands
Duration: 5 Jul 2019 → …

Publication series

Name: Proceedings of the ACM SIGMOD International Conference on Management of Data
ISSN (Print): 0730-8078

Conference

Conference: 2019 Workshop on Human-In-the-Loop Data Analytics, HILDA 2019, co-located with SIGMOD 2019
Country: Netherlands
City: Amsterdam
Period: 5/7/19 → …

ASJC Scopus subject areas

  • Software
  • Information Systems

Cite this

Rezig, E. K., Ouzzani, M., Elmagarmid, A., Aref, W. G., & Stonebraker, M. (2019). Towards an end-to-end human-centric data cleaning framework. In Proceedings of the Workshop on Human-In-the-Loop Data Analytics, HILDA 2019 [a1] (Proceedings of the ACM SIGMOD International Conference on Management of Data). Association for Computing Machinery. https://doi.org/10.1145/3328519.3329133