Distilling relations using knowledge bases

Shuang Hao, Nan Tang, Guoliang Li, Jian Li, Jianhua Feng

Research output: Contribution to journal (article)

1 Citation (Scopus)

Abstract

Given a relational table, we study the problem of detecting and repairing erroneous data, as well as marking correct data, using well-curated knowledge bases (KBs). We propose detective rules (DRs), a new type of data cleaning rule that can make actionable decisions on relational data by building connections between a relation and a KB. The main invention is that a DR simultaneously models two opposite semantics of an attribute belonging to a relation using types and relationships in a KB: the positive semantics explains how its value should be linked to other attribute values in a correct tuple, and the negative semantics indicates how a wrong attribute value is connected to other correct attribute values within the same tuple. Naturally, a DR can mark correct values in a tuple if it matches the positive semantics. Meanwhile, a DR can detect/repair an error if it matches the negative semantics. We study fundamental problems associated with DRs, e.g., rule consistency and rule implication. We present efficient algorithms to apply DRs to clean a relation, based on rule order selection and inverted indexes. Moreover, we discuss approaches to generating DRs from examples. Extensive experiments, using both real-world and synthetic datasets, verify the effectiveness and efficiency of applying DRs in practice.
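The positive/negative matching described in the abstract can be illustrated with a small sketch. This is a toy reading of the idea, not the formal DR definition or the algorithms from the article: the KB is assumed to be a plain set of (subject, relationship, object) triples, and the class `DetectiveRule`, the function `apply_rule`, the relationship names, and all entities are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, relationship, object)

# Toy KB with made-up entities; a real KB would be far larger.
KB: Set[Triple] = {
    ("Example Institute", "locatedIn", "Northtown"),
    ("Alice", "worksAt", "Example Institute"),
    ("Alice", "bornIn", "Southport"),
}


@dataclass(frozen=True)
class DetectiveRule:
    """Hypothetical, simplified rule for one target attribute."""
    target_attr: str  # attribute being checked, e.g. "city"
    pos_attr: str     # evidence attribute for the positive semantics
    pos_rel: str      # KB relationship linking evidence to a correct value
    neg_attr: str     # evidence attribute for the negative semantics
    neg_rel: str      # KB relationship linking evidence to a wrong value


def apply_rule(rule: DetectiveRule, row: Dict[str, str],
               kb: Set[Triple]) -> Tuple[str, Dict[str, str]]:
    """Return (status, row) where status is correct/repaired/error/unknown."""
    target = row[rule.target_attr]
    # Positive semantics: the value is linked to the evidence as expected.
    if (row[rule.pos_attr], rule.pos_rel, target) in kb:
        return "correct", dict(row)
    # Negative semantics: the value is linked in the "wrong" way.
    if (row[rule.neg_attr], rule.neg_rel, target) in kb:
        candidates = sorted(o for s, r, o in kb
                            if s == row[rule.pos_attr] and r == rule.pos_rel)
        if len(candidates) == 1:  # repair only when the KB gives one answer
            fixed = dict(row)
            fixed[rule.target_attr] = candidates[0]
            return "repaired", fixed
        return "error", dict(row)
    # Neither semantics matched: the rule makes no decision.
    return "unknown", dict(row)


rule = DetectiveRule("city", "institution", "locatedIn", "name", "bornIn")
row = {"name": "Alice", "institution": "Example Institute", "city": "Southport"}
status, cleaned = apply_rule(rule, row, KB)
print(status, cleaned["city"])
```

In this sketch the rule stays silent when neither pattern matches, which reflects the abstract's point that a DR acts only when it can mark a value as correct or identify it as an error.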

Original language: English
Pages (from-to): 1-23
Number of pages: 23
Journal: VLDB Journal
DOI: 10.1007/s00778-018-0506-9
Publication status: Accepted/In press - 17 May 2018

Fingerprint

Semantics
Patents and inventions
Cleaning
Repair
Experiments

Keywords

  • Data cleaning
  • Detective rule
  • Knowledge base
  • Rule generation

ASJC Scopus subject areas

  • Information Systems
  • Hardware and Architecture

Cite this

Distilling relations using knowledge bases. / Hao, Shuang; Tang, Nan; Li, Guoliang; Li, Jian; Feng, Jianhua.

In: VLDB Journal, 17.05.2018, p. 1-23.

@article{3d10695fcd1a485bab8c5c9ff75acc64,
title = "Distilling relations using knowledge bases",
abstract = "Given a relational table, we study the problem of detecting and repairing erroneous data, as well as marking correct data, using well-curated knowledge bases (KBs). We propose detective rules (DRs), a new type of data cleaning rule that can make actionable decisions on relational data by building connections between a relation and a KB. The main invention is that a DR simultaneously models two opposite semantics of an attribute belonging to a relation using types and relationships in a KB: the positive semantics explains how its value should be linked to other attribute values in a correct tuple, and the negative semantics indicates how a wrong attribute value is connected to other correct attribute values within the same tuple. Naturally, a DR can mark correct values in a tuple if it matches the positive semantics. Meanwhile, a DR can detect/repair an error if it matches the negative semantics. We study fundamental problems associated with DRs, e.g., rule consistency and rule implication. We present efficient algorithms to apply DRs to clean a relation, based on rule order selection and inverted indexes. Moreover, we discuss approaches to generating DRs from examples. Extensive experiments, using both real-world and synthetic datasets, verify the effectiveness and efficiency of applying DRs in practice.",
keywords = "Data cleaning, Detective rule, Knowledge base, Rule generation",
author = "Shuang Hao and Nan Tang and Guoliang Li and Jian Li and Jianhua Feng",
year = "2018",
month = "5",
day = "17",
doi = "10.1007/s00778-018-0506-9",
language = "English",
pages = "1--23",
journal = "VLDB Journal",
issn = "1066-8888",
publisher = "Springer New York",
}

TY - JOUR

T1 - Distilling relations using knowledge bases

AU - Hao, Shuang

AU - Tang, Nan

AU - Li, Guoliang

AU - Li, Jian

AU - Feng, Jianhua

PY - 2018/5/17

Y1 - 2018/5/17

N2 - Given a relational table, we study the problem of detecting and repairing erroneous data, as well as marking correct data, using well-curated knowledge bases (KBs). We propose detective rules (DRs), a new type of data cleaning rule that can make actionable decisions on relational data by building connections between a relation and a KB. The main invention is that a DR simultaneously models two opposite semantics of an attribute belonging to a relation using types and relationships in a KB: the positive semantics explains how its value should be linked to other attribute values in a correct tuple, and the negative semantics indicates how a wrong attribute value is connected to other correct attribute values within the same tuple. Naturally, a DR can mark correct values in a tuple if it matches the positive semantics. Meanwhile, a DR can detect/repair an error if it matches the negative semantics. We study fundamental problems associated with DRs, e.g., rule consistency and rule implication. We present efficient algorithms to apply DRs to clean a relation, based on rule order selection and inverted indexes. Moreover, we discuss approaches to generating DRs from examples. Extensive experiments, using both real-world and synthetic datasets, verify the effectiveness and efficiency of applying DRs in practice.

KW - Data cleaning

KW - Detective rule

KW - Knowledge base

KW - Rule generation

UR - http://www.scopus.com/inward/record.url?scp=85047149990&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85047149990&partnerID=8YFLogxK

U2 - 10.1007/s00778-018-0506-9

DO - 10.1007/s00778-018-0506-9

M3 - Article

SP - 1

EP - 23

JO - VLDB Journal

JF - VLDB Journal

SN - 1066-8888

ER -