Ranking for data repairs

Mohamed Yakout, Ahmed Elmagarmid, Jennifer Neville

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

3 Citations (Scopus)

Abstract

Improving data quality is a time-consuming, labor-intensive and often domain-specific operation. A recent principled approach to repairing a dirty database is to use data quality rules, expressed as database constraints, to identify dirty tuples and then to derive data repairs from the same rules. Most existing data repair approaches focus on fully automated solutions, which can be risky to rely on, especially for critical data. To guarantee that the repairs applied to the database are of the best quality, users should be involved to confirm each repair. This highlights the need for an interactive approach that combines the best of both: automatically generating repairs while efficiently employing the user's effort to verify them. In such an approach, the user guides an online repairing process that generates repairs incrementally. A key challenge is the response time within the user's interactive sessions, because generating repairs is time-consuming due to the large search space of possible repairs. To this end, we present a mechanism that continuously generates repairs only for the current top-k most important violated data quality rules. Moreover, the repairs are grouped and ranked so that those most beneficial to data quality are presented to the user first for verification and feedback. Our experiments on a real-world dataset demonstrate the effectiveness of our ranking mechanism in providing a fast response time for the user while improving data quality as quickly as possible.
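The abstract does not give the scoring functions, so the following Python sketch is only one possible reading of the top-k and ranking idea, not the authors' algorithm. All names (`Repair`, `top_k_rules`, `ranked_groups`, `next_batch`, the `generate` callback) are hypothetical. It assumes that rule importance is a rule weight multiplied by the number of violating tuples, that repairs are grouped by suggested attribute value, and that each repair carries an estimated benefit.

```python
import heapq
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Repair:
    rule: str        # violated rule this repair addresses
    tuple_id: int    # identifier of the dirty tuple
    attribute: str   # attribute to change
    new_value: str   # suggested value
    benefit: float   # assumed estimate of the quality gain if applied


def top_k_rules(violations, weights, k):
    """Pick the k rules with the highest (assumed) importance score.

    violations maps rule name -> list of violating tuple ids;
    weights maps rule name -> user-assigned weight (default 1.0).
    """
    def score(rule):
        return weights.get(rule, 1.0) * len(violations[rule])

    return heapq.nlargest(k, violations, key=score)


def ranked_groups(repairs):
    """Group repairs by (attribute, new_value); rank groups by total benefit."""
    groups = defaultdict(list)
    for rep in repairs:
        groups[(rep.attribute, rep.new_value)].append(rep)
    return sorted(
        groups.items(),
        key=lambda item: sum(r.benefit for r in item[1]),
        reverse=True,
    )


def next_batch(violations, weights, generate, k):
    """Generate repairs only for the current top-k rules, then rank them.

    generate(rule, tuple_ids) is a caller-supplied function returning
    candidate Repair objects; it is never called for rules outside the top k.
    """
    candidates = []
    for rule in top_k_rules(violations, weights, k):
        candidates.extend(generate(rule, violations[rule]))
    return ranked_groups(candidates)
```

In an interactive loop, the caller would show the first group to the user, apply or discard it according to the feedback, recompute `violations`, and call `next_batch` again. Restricting `generate` to k rules is what bounds the work done per round.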

Original language: English
Title of host publication: Proceedings - International Conference on Data Engineering
Pages: 23-28
Number of pages: 6
DOI: https://doi.org/10.1109/ICDEW.2010.5452767
Publication status: Published - 28 May 2010
Externally published: Yes
Event: 2010 IEEE 26th International Conference on Data Engineering Workshops, ICDEW 2010 - Long Beach, CA, United States
Duration: 1 Mar 2010 to 6 Mar 2010

ASJC Scopus subject areas

  • Information Systems
  • Signal Processing
  • Software

Cite this

Yakout, M., Elmagarmid, A., & Neville, J. (2010). Ranking for data repairs. In Proceedings - International Conference on Data Engineering (pp. 23-28). [5452767] https://doi.org/10.1109/ICDEW.2010.5452767

@inproceedings{3a724e8fe421489b9cd3d73cf72df07d,
title = "Ranking for data repairs",
author = "Mohamed Yakout and Ahmed Elmagarmid and Jennifer Neville",
year = "2010",
month = "5",
day = "28",
doi = "10.1109/ICDEW.2010.5452767",
language = "English",
isbn = "9781424465217",
pages = "23--28",
booktitle = "Proceedings - International Conference on Data Engineering",

}

TY - GEN

T1 - Ranking for data repairs

AU - Yakout, Mohamed

AU - Elmagarmid, Ahmed

AU - Neville, Jennifer

PY - 2010/5/28

Y1 - 2010/5/28

UR - http://www.scopus.com/inward/record.url?scp=77952647100&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=77952647100&partnerID=8YFLogxK

U2 - 10.1109/ICDEW.2010.5452767

DO - 10.1109/ICDEW.2010.5452767

M3 - Conference contribution

SN - 9781424465217

SP - 23

EP - 28

BT - Proceedings - International Conference on Data Engineering

ER -