Towards an end-to-end human-centric data cleaning framework

El Kindi Rezig, Mourad Ouzzani, Ahmed Elmagarmid, Walid G. Aref, Michael Stonebraker

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Data cleaning refers to the process of detecting and fixing errors in data. Human involvement is instrumental at several stages of this process, such as providing rules or validating computed repairs. There is a plethora of data cleaning algorithms addressing a wide range of data errors (e.g., duplicates, violations of integrity constraints, and missing values). Many of these algorithms involve a human in the loop; however, that involvement is usually tightly coupled to the underlying cleaning algorithm. In a real data cleaning pipeline, several data cleaning operations are performed using different tools. High-level reasoning about these tools, when they are combined to repair the data, has the potential to unlock useful ways of involving humans in the cleaning process. Additionally, we believe there is an opportunity to benefit from recent advances in active learning methods to minimize the effort humans have to spend verifying data items produced by tools or by other humans. There is currently no end-to-end data cleaning framework that systematically involves humans in the cleaning pipeline regardless of the underlying cleaning algorithms. In this paper, we present opportunities that such a framework could offer and highlight key challenges that need to be addressed to realize this vision. We present a design vision and discuss scenarios that motivate the need for this framework to judiciously assist humans in the cleaning process.

Original language: English
Title of host publication: Proceedings of the Workshop on Human-In-the-Loop Data Analytics, HILDA 2019
Publisher: Association for Computing Machinery
ISBN (Electronic): 9781450367912
DOI: https://doi.org/10.1145/3328519.3329133
Publication status: Published - 5 Jul 2019
Event: 2019 Workshop on Human-In-the-Loop Data Analytics, HILDA 2019, co-located with SIGMOD 2019 - Amsterdam, Netherlands
Duration: 5 Jul 2019 → …

Publication series

Name: Proceedings of the ACM SIGMOD International Conference on Management of Data
ISSN (Print): 0730-8078

Conference

Conference: 2019 Workshop on Human-In-the-Loop Data Analytics, HILDA 2019, co-located with SIGMOD 2019
Country: Netherlands
City: Amsterdam
Period: 5/7/19 → …

Fingerprint

Cleaning
Repair
Pipelines

ASJC Scopus subject areas

  • Software
  • Information Systems

Cite this

Rezig, E. K., Ouzzani, M., Elmagarmid, A., Aref, W. G., & Stonebraker, M. (2019). Towards an end-to-end human-centric data cleaning framework. In Proceedings of the Workshop on Human-In-the-Loop Data Analytics, HILDA 2019 [a1] (Proceedings of the ACM SIGMOD International Conference on Management of Data). Association for Computing Machinery. https://doi.org/10.1145/3328519.3329133

@inproceedings{277112b2a312461eb9aa325d1d19cc76,
title = "Towards an end-to-end human-centric data cleaning framework",
author = "Rezig, {El Kindi} and Mourad Ouzzani and Ahmed Elmagarmid and Aref, {Walid G.} and Michael Stonebraker",
year = "2019",
month = "7",
day = "5",
doi = "10.1145/3328519.3329133",
language = "English",
series = "Proceedings of the ACM SIGMOD International Conference on Management of Data",
publisher = "Association for Computing Machinery",
booktitle = "Proceedings of the Workshop on Human-In-the-Loop Data Analytics, HILDA 2019",

}

TY - GEN

T1 - Towards an end-to-end human-centric data cleaning framework

AU - Rezig, El Kindi

AU - Ouzzani, Mourad

AU - Elmagarmid, Ahmed

AU - Aref, Walid G.

AU - Stonebraker, Michael

PY - 2019/7/5

Y1 - 2019/7/5

UR - http://www.scopus.com/inward/record.url?scp=85072811207&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85072811207&partnerID=8YFLogxK

U2 - 10.1145/3328519.3329133

DO - 10.1145/3328519.3329133

M3 - Conference contribution

AN - SCOPUS:85072811207

T3 - Proceedings of the ACM SIGMOD International Conference on Management of Data

BT - Proceedings of the Workshop on Human-In-the-Loop Data Analytics, HILDA 2019

PB - Association for Computing Machinery

ER -