Cleaning relations using knowledge bases

Shuang Hao, Nan Tang, Guoliang Li, Jian Li

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

8 Citations (Scopus)

Abstract

We study the data cleaning problem of detecting and repairing wrong relational data, as well as marking correct data, using well-curated knowledge bases (KBs). We propose detective rules (DRs), a new type of data cleaning rule that can make actionable decisions on relational data by building connections between a relation and a KB. The main innovation is that a DR simultaneously models two opposite semantics of a relation using types and relationships in a KB: the positive semantics, which explains how attribute values are linked to each other in correct tuples, and the negative semantics, which indicates how wrong attribute values are connected to other correct attribute values within the same tuples. Naturally, a DR can mark correct values in a tuple if the tuple matches the positive semantics. Meanwhile, a DR can detect and repair an error if the tuple matches the negative semantics. We study fundamental problems associated with DRs, e.g., rule generation and rule consistency. We present efficient algorithms to apply DRs to clean a relation, based on rule order selection and inverted indexes. Extensive experiments, using both real-world and synthetic datasets, verify the effectiveness and efficiency of applying DRs in practice.
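
The positive/negative matching idea in the abstract can be illustrated with a small Python sketch. This is not the authors' algorithm or rule language: the toy KB, the DetectiveRule class, the apply_rule function, and the relationship names worksAt and graduatedFrom are all hypothetical, and the choice to repair only when the KB offers exactly one positive candidate is an assumption made for the example.

```python
from dataclasses import dataclass

# Toy KB as (subject, relationship, object) triples. All facts are made up.
KB = {
    ("Alice", "worksAt", "MIT"),
    ("Alice", "graduatedFrom", "Stanford"),
    ("Bob", "worksAt", "Oxford"),
}


@dataclass(frozen=True)
class DetectiveRule:
    """Hypothetical, simplified rule over two attributes of a tuple."""
    evidence_attr: str  # attribute assumed correct, e.g. "name"
    target_attr: str    # attribute being checked, e.g. "institution"
    positive_rel: str   # KB relationship a correct value should satisfy
    negative_rel: str   # KB relationship a typical wrong value satisfies


def objects(kb, subject, rel):
    """Return all KB objects linked to `subject` through `rel`."""
    return {o for (s, r, o) in kb if s == subject and r == rel}


def apply_rule(rule, row, kb):
    """Return (status, row) with status 'correct', 'repaired', or 'unknown'."""
    subject = row[rule.evidence_attr]
    value = row[rule.target_attr]

    # Positive semantics: the value is linked as a correct tuple would be.
    positives = objects(kb, subject, rule.positive_rel)
    if value in positives:
        return "correct", dict(row)

    # Negative semantics: the value is linked through the "wrong" relationship.
    # Assumption of this sketch: repair only if the KB gives a unique fix.
    negatives = objects(kb, subject, rule.negative_rel)
    if value in negatives and len(positives) == 1:
        repaired = dict(row)
        repaired[rule.target_attr] = next(iter(positives))
        return "repaired", repaired

    # Neither semantics matched, so the rule makes no decision.
    return "unknown", dict(row)


rule = DetectiveRule("name", "institution", "worksAt", "graduatedFrom")
print(apply_rule(rule, {"name": "Alice", "institution": "MIT"}, KB))
print(apply_rule(rule, {"name": "Alice", "institution": "Stanford"}, KB))
print(apply_rule(rule, {"name": "Bob", "institution": "Cambridge"}, KB))
```

In this sketch the first tuple is marked correct, the second is repaired because its institution matches the negative relationship, and the third is left undecided because the KB supports neither reading.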

Original language: English
Title of host publication: Proceedings - 2017 IEEE 33rd International Conference on Data Engineering, ICDE 2017
Publisher: IEEE Computer Society
Pages: 933-944
Number of pages: 12
ISBN (Electronic): 9781509065431
DOI: https://doi.org/10.1109/ICDE.2017.141
Publication status: Published - 16 May 2017
Event: 33rd IEEE International Conference on Data Engineering, ICDE 2017 - San Diego, United States
Duration: 19 Apr 2017 to 22 Apr 2017

Fingerprint

Cleaning
Semantics
Patents and inventions
Repair
Experiments

ASJC Scopus subject areas

  • Software
  • Signal Processing
  • Information Systems

Cite this

Hao, S., Tang, N., Li, G., & Li, J. (2017). Cleaning relations using knowledge bases. In Proceedings - 2017 IEEE 33rd International Conference on Data Engineering, ICDE 2017 (pp. 933-944). [7930037] IEEE Computer Society. https://doi.org/10.1109/ICDE.2017.141

@inproceedings{2b310badd97243ea806515fde6674b6d,
title = "Cleaning relations using knowledge bases",
author = "Shuang Hao and Nan Tang and Guoliang Li and Jian Li",
year = "2017",
month = "5",
day = "16",
doi = "10.1109/ICDE.2017.141",
language = "English",
pages = "933--944",
booktitle = "Proceedings - 2017 IEEE 33rd International Conference on Data Engineering, ICDE 2017",
publisher = "IEEE Computer Society",

}

TY - GEN

T1 - Cleaning relations using knowledge bases

AU - Hao, Shuang

AU - Tang, Nan

AU - Li, Guoliang

AU - Li, Jian

PY - 2017/5/16

Y1 - 2017/5/16

UR - http://www.scopus.com/inward/record.url?scp=85021202567&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85021202567&partnerID=8YFLogxK

U2 - 10.1109/ICDE.2017.141

DO - 10.1109/ICDE.2017.141

M3 - Conference contribution

AN - SCOPUS:85021202567

SP - 933

EP - 944

BT - Proceedings - 2017 IEEE 33rd International Conference on Data Engineering, ICDE 2017

PB - IEEE Computer Society

ER -