Don't be SCAREd: Use SCalable Automatic REpairing with maximal likelihood and bounded changes

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

55 Citations (Scopus)

Abstract

Various computational procedures or constraint-based methods for data repairing have been proposed over the last decades to identify errors and, when possible, correct them. However, these approaches have several limitations including the scalability and quality of the values to be used in replacement of the errors. In this paper, we propose a new data repairing approach that is based on maximizing the likelihood of replacement data given the data distribution, which can be modeled using statistical machine learning techniques. This is a novel approach combining machine learning and likelihood methods for cleaning dirty databases by value modification. We develop a quality measure of the repairing updates based on the likelihood benefit and the amount of changes applied to the database. We propose SCARE (SCalable Automatic REpairing), a systematic scalable framework that follows our approach. SCARE relies on a robust mechanism for horizontal data partitioning and a combination of machine learning techniques to predict the set of possible updates. Due to data partitioning, several updates can be predicted for a single record based on local views on each data partition. Therefore, we propose a mechanism to combine the local predictions and obtain accurate final predictions. Finally, we experimentally demonstrate the effectiveness, efficiency, and scalability of our approach on real-world datasets in comparison to recent data cleaning approaches.
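The abstract's pipeline — partition the data horizontally, learn a local statistical model per partition, combine the local predictions, and apply an update only when the likelihood benefit outweighs the cost of the change — can be illustrated with a minimal sketch. This is not the authors' implementation: the toy records, the simple frequency model standing in for the paper's machine-learning techniques, and the names `local_model`, `combined_prediction`, and `repair` are all illustrative assumptions.

```python
# Toy sketch of SCARE's idea: predict replacement values that maximize
# likelihood while bounding the amount of change to the database.
from collections import Counter, defaultdict

# Hypothetical (city, zip) records; one record has a suspicious zip code.
records = [
    ("New York", "10001"), ("New York", "10001"), ("New York", "10001"),
    ("New York", "99999"),                # dirty value to repair
    ("Chicago", "60601"), ("Chicago", "60601"),
]

# Horizontal partitioning: split the records into blocks (here, by parity).
partitions = [records[0::2], records[1::2]]

def local_model(partition):
    """Estimate P(zip | city) from one partition's local view."""
    counts = defaultdict(Counter)
    for city, zip_code in partition:
        counts[city][zip_code] += 1
    return counts

models = [local_model(p) for p in partitions]

def combined_prediction(city):
    """Combine local predictions by pooling evidence across partitions."""
    total = Counter()
    for m in models:
        total.update(m[city])
    return total

def repair(record, min_gain=1):
    """Replace the zip only if the likelihood benefit exceeds the (unit)
    cost of modifying the cell -- the bounded-changes criterion."""
    city, zip_code = record
    votes = combined_prediction(city)
    best, best_count = votes.most_common(1)[0]
    if best != zip_code and best_count - votes[zip_code] > min_gain:
        return (city, best)
    return record

cleaned = [repair(r) for r in records]
```

In this sketch the dirty record is repaired because the pooled vote for "10001" beats "99999" by more than the change threshold, while already-consistent records are left untouched; the real system replaces the frequency counts with learned statistical models.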

Original language: English
Title of host publication: Proceedings of the ACM SIGMOD International Conference on Management of Data
Pages: 553-564
Number of pages: 12
DOIs: 10.1145/2463676.2463706
Publication status: Published - 29 Jul 2013
Event: 2013 ACM SIGMOD Conference on Management of Data, SIGMOD 2013 - New York, NY, United States
Duration: 22 Jun 2013 - 27 Jun 2013

Other

Other: 2013 ACM SIGMOD Conference on Management of Data, SIGMOD 2013
Country: United States
City: New York, NY
Period: 22/6/13 - 27/6/13

Keywords

  • Data cleaning
  • Inconsistent data

ASJC Scopus subject areas

  • Software
  • Information Systems

Cite this

Yakout, M., Berti-Equille, L., & Elmagarmid, A. (2013). Don't be SCAREd: Use SCalable Automatic REpairing with maximal likelihood and bounded changes. In Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 553-564) https://doi.org/10.1145/2463676.2463706

@inproceedings{766d8cd71500439abeb44ed14883e4c1,
title = "Don't be SCAREd: Use SCalable Automatic REpairing with maximal likelihood and bounded changes",
abstract = "Various computational procedures or constraint-based methods for data repairing have been proposed over the last decades to identify errors and, when possible, correct them. However, these approaches have several limitations including the scalability and quality of the values to be used in replacement of the errors. In this paper, we propose a new data repairing approach that is based on maximizing the likelihood of replacement data given the data distribution, which can be modeled using statistical machine learning techniques. This is a novel approach combining machine learning and likelihood methods for cleaning dirty databases by value modification. We develop a quality measure of the repairing updates based on the likelihood benefit and the amount of changes applied to the database. We propose SCARE (SCalable Automatic REpairing), a systematic scalable framework that follows our approach. SCARE relies on a robust mechanism for horizontal data partitioning and a combination of machine learning techniques to predict the set of possible updates. Due to data partitioning, several updates can be predicted for a single record based on local views on each data partition. Therefore, we propose a mechanism to combine the local predictions and obtain accurate final predictions. Finally, we experimentally demonstrate the effectiveness, efficiency, and scalability of our approach on real-world datasets in comparison to recent data cleaning approaches.",
keywords = "Data cleaning, Inconsistent data",
author = "Mohamed Yakout and Laure Berti-Equille and Ahmed Elmagarmid",
year = "2013",
month = "7",
day = "29",
doi = "10.1145/2463676.2463706",
language = "English",
isbn = "9781450320375",
pages = "553--564",
booktitle = "Proceedings of the ACM SIGMOD International Conference on Management of Data",

}

TY - GEN

T1 - Don't be SCAREd

T2 - Use SCalable Automatic REpairing with maximal likelihood and bounded changes

AU - Yakout, Mohamed

AU - Berti-Equille, Laure

AU - Elmagarmid, Ahmed

PY - 2013/7/29

Y1 - 2013/7/29

N2 - Various computational procedures or constraint-based methods for data repairing have been proposed over the last decades to identify errors and, when possible, correct them. However, these approaches have several limitations including the scalability and quality of the values to be used in replacement of the errors. In this paper, we propose a new data repairing approach that is based on maximizing the likelihood of replacement data given the data distribution, which can be modeled using statistical machine learning techniques. This is a novel approach combining machine learning and likelihood methods for cleaning dirty databases by value modification. We develop a quality measure of the repairing updates based on the likelihood benefit and the amount of changes applied to the database. We propose SCARE (SCalable Automatic REpairing), a systematic scalable framework that follows our approach. SCARE relies on a robust mechanism for horizontal data partitioning and a combination of machine learning techniques to predict the set of possible updates. Due to data partitioning, several updates can be predicted for a single record based on local views on each data partition. Therefore, we propose a mechanism to combine the local predictions and obtain accurate final predictions. Finally, we experimentally demonstrate the effectiveness, efficiency, and scalability of our approach on real-world datasets in comparison to recent data cleaning approaches.

AB - Various computational procedures or constraint-based methods for data repairing have been proposed over the last decades to identify errors and, when possible, correct them. However, these approaches have several limitations including the scalability and quality of the values to be used in replacement of the errors. In this paper, we propose a new data repairing approach that is based on maximizing the likelihood of replacement data given the data distribution, which can be modeled using statistical machine learning techniques. This is a novel approach combining machine learning and likelihood methods for cleaning dirty databases by value modification. We develop a quality measure of the repairing updates based on the likelihood benefit and the amount of changes applied to the database. We propose SCARE (SCalable Automatic REpairing), a systematic scalable framework that follows our approach. SCARE relies on a robust mechanism for horizontal data partitioning and a combination of machine learning techniques to predict the set of possible updates. Due to data partitioning, several updates can be predicted for a single record based on local views on each data partition. Therefore, we propose a mechanism to combine the local predictions and obtain accurate final predictions. Finally, we experimentally demonstrate the effectiveness, efficiency, and scalability of our approach on real-world datasets in comparison to recent data cleaning approaches.

KW - Data cleaning

KW - Inconsistent data

UR - http://www.scopus.com/inward/record.url?scp=84880515658&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84880515658&partnerID=8YFLogxK

U2 - 10.1145/2463676.2463706

DO - 10.1145/2463676.2463706

M3 - Conference contribution

AN - SCOPUS:84880515658

SN - 9781450320375

SP - 553

EP - 564

BT - Proceedings of the ACM SIGMOD International Conference on Management of Data

ER -