NADEEF

A commodity data cleaning system

Michele Dallachiesa, Amr Ebaid, Ahmed Eldawy, Ahmed Elmagarmid, Ihab F. Ilyas, Mourad Ouzzani, Nan Tang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

123 Citations (Scopus)

Abstract

Despite the increasing importance of data quality and the rich theoretical and practical contributions in all aspects of data cleaning, there is no single end-to-end off-the-shelf solution to (semi-)automate the detection and the repairing of violations with respect to a set of heterogeneous and ad-hoc quality constraints. In short, there is no commodity platform similar to general-purpose DBMSs that can be easily customized and deployed to solve application-specific data quality problems. In this paper, we present NADEEF, an extensible, generalized and easy-to-deploy data cleaning platform. NADEEF distinguishes between a programming interface and a core to achieve generality and extensibility. The programming interface allows the users to specify multiple types of data quality rules, which uniformly define what is wrong with the data and (possibly) how to repair it through writing code that implements predefined classes. We show that the programming interface can be used to express many types of data quality rules beyond the well-known CFDs (FDs), MDs and ETL rules. Treating user-implemented interfaces as black boxes, the core provides algorithms to detect errors and to clean data. The core is designed to allow cleaning algorithms to cope with multiple rules holistically, i.e., detecting and repairing data errors without differentiating between various types of rules. We showcase two implementations for core repairing algorithms. These two implementations demonstrate the extensibility of our core, which can also be replaced by other user-provided algorithms. Using real-life data, we experimentally verify the generality, extensibility, and effectiveness of our system.
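
The split between a rule-programming interface and a rule-agnostic core can be illustrated with a minimal Python sketch. This is not NADEEF's actual API: the names `Rule`, `detect`, `repair`, `Fix`, `FunctionalDependency`, `NotEmpty` and `clean` are hypothetical, and the naive fixpoint loop merely stands in for the core repairing algorithms mentioned in the abstract. It only shows how a core might treat heterogeneous rules uniformly as black boxes that report violations and optionally propose fixes.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List, Tuple

Row = Dict[str, str]
Cell = Tuple[int, str]  # (row index, column name)


@dataclass(frozen=True)
class Fix:
    """A proposed repair: set one cell to a new value (hypothetical type)."""
    cell: Cell
    value: str


class Rule(ABC):
    """Hypothetical rule interface: what is wrong, and possibly how to fix it."""

    @abstractmethod
    def detect(self, table: List[Row]) -> List[List[Cell]]:
        """Return violations, each a group of cells that are jointly wrong."""

    def repair(self, table: List[Row], violation: List[Cell]) -> List[Fix]:
        """Repairing is optional; by default a rule only detects."""
        return []


class FunctionalDependency(Rule):
    """FD lhs -> rhs: rows that agree on lhs must agree on rhs."""

    def __init__(self, lhs: str, rhs: str) -> None:
        self.lhs, self.rhs = lhs, rhs

    def detect(self, table: List[Row]) -> List[List[Cell]]:
        violations = []
        for i in range(len(table)):
            for j in range(i + 1, len(table)):
                same_lhs = table[i][self.lhs] == table[j][self.lhs]
                if same_lhs and table[i][self.rhs] != table[j][self.rhs]:
                    violations.append([(i, self.rhs), (j, self.rhs)])
        return violations

    def repair(self, table: List[Row], violation: List[Cell]) -> List[Fix]:
        # Simplistic policy (an assumption): copy the earlier row's value.
        (i, col), (j, _) = violation
        return [Fix((j, col), table[i][col])]


class NotEmpty(Rule):
    """An ETL-style rule: a column must not be empty."""

    def __init__(self, column: str, default: str) -> None:
        self.column, self.default = column, default

    def detect(self, table: List[Row]) -> List[List[Cell]]:
        return [[(i, self.column)] for i, row in enumerate(table)
                if row[self.column] == ""]

    def repair(self, table: List[Row], violation: List[Cell]) -> List[Fix]:
        return [Fix(violation[0], self.default)]


def clean(table: List[Row], rules: List[Rule], max_rounds: int = 10) -> List[Row]:
    """Rule-agnostic core: detect and repair without inspecting rule types."""
    table = [dict(row) for row in table]  # work on a copy
    for _ in range(max_rounds):
        fixes = [fix for rule in rules
                 for violation in rule.detect(table)
                 for fix in rule.repair(table, violation)]
        if not fixes:
            break
        for fix in fixes:
            row_index, column = fix.cell
            table[row_index][column] = fix.value
    return table


records = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "NYC"},
    {"zip": "60601", "city": ""},
]
cleaned = clean(records, [FunctionalDependency("zip", "city"),
                          NotEmpty("city", "UNKNOWN")])
print(cleaned)
```

In this sketch the `clean` function never asks which kind of rule it is running, so a new rule type can be added by subclassing `Rule` alone, and the loop itself could be swapped for a different repair strategy without touching the rules.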

Original language: English
Title of host publication: Proceedings of the ACM SIGMOD International Conference on Management of Data
Pages: 541-552
Number of pages: 12
DOI: https://doi.org/10.1145/2463676.2465327
Publication status: Published - 29 Jul 2013
Event: 2013 ACM SIGMOD Conference on Management of Data, SIGMOD 2013 - New York, NY, United States
Duration: 22 Jun 2013 to 27 Jun 2013

Other

Other: 2013 ACM SIGMOD Conference on Management of Data, SIGMOD 2013
Country: United States
City: New York, NY
Period: 22/6/13 to 27/6/13

Fingerprint

Cleaning
User interfaces
Conditional functional dependencies
Repair

Keywords

  • Conditional functional dependency
  • Data cleaning
  • ETL
  • Matching dependency

ASJC Scopus subject areas

  • Software
  • Information Systems

Cite this

Dallachiesa, M., Ebaid, A., Eldawy, A., Elmagarmid, A., Ilyas, I. F., Ouzzani, M., & Tang, N. (2013). NADEEF: A commodity data cleaning system. In Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 541-552). https://doi.org/10.1145/2463676.2465327

@inproceedings{db8727017cee41e78c30865649c4846e,
title = "NADEEF: A commodity data cleaning system",
keywords = "Conditional functional dependency, Data cleaning, ETL, Matching dependency",
author = "Michele Dallachiesa and Amr Ebaid and Ahmed Eldawy and Ahmed Elmagarmid and Ilyas, {Ihab F.} and Mourad Ouzzani and Nan Tang",
year = "2013",
month = "7",
day = "29",
doi = "10.1145/2463676.2465327",
language = "English",
isbn = "9781450320375",
pages = "541--552",
booktitle = "Proceedings of the ACM SIGMOD International Conference on Management of Data",

}

TY - GEN

T1 - NADEEF

T2 - A commodity data cleaning system

AU - Dallachiesa, Michele

AU - Ebaid, Amr

AU - Eldawy, Ahmed

AU - Elmagarmid, Ahmed

AU - Ilyas, Ihab F.

AU - Ouzzani, Mourad

AU - Tang, Nan

PY - 2013/7/29

Y1 - 2013/7/29

KW - Conditional functional dependency

KW - Data cleaning

KW - ETL

KW - Matching dependency

UR - http://www.scopus.com/inward/record.url?scp=84880546390&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84880546390&partnerID=8YFLogxK

U2 - 10.1145/2463676.2465327

DO - 10.1145/2463676.2465327

M3 - Conference contribution

SN - 9781450320375

SP - 541

EP - 552

BT - Proceedings of the ACM SIGMOD International Conference on Management of Data

ER -