KATARA: Reliable data cleaning with knowledge bases and crowdsourcing

Xu Chu, John Morcos, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti, Nan Tang, Yin Ye

Research output: Chapter in Book/Report/Conference proceeding › Chapter

8 Citations (Scopus)

Abstract

Data cleaning with guaranteed reliability is hard to achieve without accessing external sources, since the truth is not necessarily discoverable from the data at hand. Furthermore, even when external sources are available, mainly knowledge bases (KBs) and humans, leveraging them effectively still poses many challenges, such as aligning heterogeneous data sources and decomposing a complex task into simpler units that humans can handle. We present KATARA, a novel end-to-end data cleaning system powered by knowledge bases and crowdsourcing. Given a table, a KB, and a crowd, KATARA (i) interprets the table semantics with respect to the given KB; (ii) identifies correct and wrong data; and (iii) generates top-k possible repairs for the wrong data. Users will have the opportunity to experience the following features of KATARA: (1) Easy specification: users can define a KATARA job with a browser-based specification; (2) Pattern validation: users can help the system resolve the ambiguity among the different table patterns (i.e., table semantics) discovered by KATARA; (3) Data annotation: users can play the role of internal crowd workers, helping KATARA annotate data. Moreover, KATARA will visualize the annotated data as correct data validated by the KB, correct data jointly validated by the KB and the crowd, or erroneous tuples along with their possible repairs.
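The three steps named in the abstract can be illustrated with a small sketch. This is a toy illustration under stated assumptions, not KATARA's actual algorithm: the knowledge base, the table, and the function names (`interpret_columns`, `annotate`, `top_k_repairs`) are all hypothetical, the KB is a pair of in-memory dictionaries, repairs are ranked simply by the number of changed cells, and the crowd is not modelled at all.

```python
from collections import Counter

# Hypothetical toy knowledge base: entity types and one relationship.
KB_TYPES = {
    "Italy": "country", "Spain": "country", "France": "country",
    "Rome": "city", "Madrid": "city", "Paris": "city",
}
KB_CAPITAL_OF = {"Italy": "Rome", "Spain": "Madrid", "France": "Paris"}

# Hypothetical input table of (country, capital) tuples; the last row is wrong.
TABLE = [
    ("Italy", "Rome"),
    ("Spain", "Madrid"),
    ("France", "Madrid"),
]


def interpret_columns(table):
    """Step (i): guess each column's KB type by majority vote over its cells."""
    types = []
    for c in range(len(table[0])):
        votes = Counter(KB_TYPES.get(row[c]) for row in table)
        votes.pop(None, None)  # ignore cells the KB does not know
        types.append(votes.most_common(1)[0][0] if votes else None)
    return types


def annotate(table):
    """Step (ii): split rows into KB-validated rows and suspect rows."""
    validated, suspect = [], []
    for country, city in table:
        target = validated if KB_CAPITAL_OF.get(country) == city else suspect
        target.append((country, city))
    return validated, suspect


def top_k_repairs(row, k=2):
    """Step (iii): rank KB facts by how few cells of `row` they would change."""
    candidates = []
    for country, city in KB_CAPITAL_OF.items():
        cost = (country != row[0]) + (city != row[1])
        candidates.append((cost, (country, city)))
    candidates.sort()  # lowest cost first, ties broken alphabetically
    return [fact for _, fact in candidates[:k]]


if __name__ == "__main__":
    print(interpret_columns(TABLE))
    validated, suspect = annotate(TABLE)
    for row in suspect:
        print(row, "->", top_k_repairs(row))
```

In this sketch a row the KB cannot confirm is simply treated as suspect; in the system described above, such rows could instead be passed to crowd workers before being marked erroneous.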

Original language: English
Title of host publication: Proceedings of the VLDB Endowment
Publisher: Association for Computing Machinery
Pages: 1952-1955
Number of pages: 4
Volume: 8
Edition: 12
Publication status: Published - 2015
Event: 3rd Workshop on Spatio-Temporal Database Management, STDBM 2006, Co-located with the 32nd International Conference on Very Large Data Bases, VLDB 2006 - Seoul, Korea, Republic of
Duration: 11 Sep 2006 to 11 Sep 2006

Fingerprint

Cleaning
Repair
Semantics
Specifications

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • Computer Science (all)

Cite this

Chu, X., Morcos, J., Ilyas, I. F., Ouzzani, M., Papotti, P., Tang, N., & Ye, Y. (2015). KATARA: Reliable data cleaning with knowledge bases and crowdsourcing. In Proceedings of the VLDB Endowment (12 ed., Vol. 8, pp. 1952-1955). Association for Computing Machinery.

@inbook{b83dab86159346cb817ec33ede1187ef,
title = "KATARA: Reliable data cleaning with knowledge bases and crowdsourcing",
author = "Xu Chu and John Morcos and Ilyas, {Ihab F.} and Mourad Ouzzani and Paolo Papotti and Nan Tang and Yin Ye",
year = "2015",
language = "English",
volume = "8",
pages = "1952--1955",
booktitle = "Proceedings of the VLDB Endowment",
publisher = "Association for Computing Machinery",
edition = "12",

}

TY - CHAP

T1 - KATARA

T2 - Reliable data cleaning with knowledge bases and crowdsourcing

AU - Chu, Xu

AU - Morcos, John

AU - Ilyas, Ihab F.

AU - Ouzzani, Mourad

AU - Papotti, Paolo

AU - Tang, Nan

AU - Ye, Yin

PY - 2015

Y1 - 2015

UR - http://www.scopus.com/inward/record.url?scp=84953868998&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84953868998&partnerID=8YFLogxK

M3 - Chapter

AN - SCOPUS:84953868998

VL - 8

SP - 1952

EP - 1955

BT - Proceedings of the VLDB Endowment

PB - Association for Computing Machinery

ER -