Divide and conquer-based inclusion dependency discovery

Thorsten Papenbrock, Sebastian Kruse, Jorge Arnulfo Quiane Ruiz, Felix Naumann

Research output: Contribution to journal (Article)

23 Citations (Scopus)

Abstract

The discovery of all inclusion dependencies (INDs) in a dataset is an important part of any data profiling effort. Apart from the detection of foreign key relationships, INDs can help to perform data integration, query optimization, integrity checking, or schema (re-)design. However, the detection of INDs gets harder as datasets grow in both the number of tuples and the number of attributes. To address this, we propose Binder, an IND detection system that is capable of detecting both unary and n-ary INDs. It is based on a divide & conquer approach, which allows it to handle very large datasets, an important property in the face of the ever-increasing size of today's data. In contrast to most related work, we neither rely on existing database functionality nor assume that inspected datasets fit into main memory. This renders Binder an efficient and scalable competitor. Our exhaustive experimental evaluation shows that Binder clearly outperforms the state of the art in both unary (Spider) and n-ary (Mind) IND discovery: Binder is up to 26x faster than Spider and more than 2500x faster than Mind.
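To make the terms concrete: a unary IND A ⊆ B holds when every value of column A also occurs in column B. The sketch below is only an illustration of the divide & conquer idea under simple assumptions, not Binder's actual implementation. It hash-partitions column values into buckets and then checks candidates one bucket at a time. The function name `unary_inds`, the `num_buckets` parameter, the dictionary input format, and the decision to ignore NULL values are all hypothetical choices made for this example, and everything is kept in memory here for brevity.

```python
from collections import defaultdict
from itertools import permutations


def unary_inds(columns, num_buckets=4):
    """Return all pairs (dep, ref) where every value of dep occurs in ref.

    `columns` maps a column name to an iterable of its values
    (a hypothetical input format chosen for this sketch).
    """
    # Divide: hash-partition the distinct values of every column into buckets.
    # Equal values always land in the same bucket, so each bucket can be
    # checked independently of the others.
    buckets = [defaultdict(set) for _ in range(num_buckets)]
    for name, values in columns.items():
        for value in values:
            if value is None:  # assumption: NULLs are ignored
                continue
            buckets[hash(value) % num_buckets][name].add(value)

    # Start with every ordered pair of distinct columns as a candidate IND.
    candidates = set(permutations(columns, 2))

    # Conquer: validate one bucket at a time and drop refuted candidates,
    # so later buckets have fewer candidates left to check.
    for bucket in buckets:
        for dep, ref in list(candidates):
            if not bucket[dep] <= bucket[ref]:
                candidates.discard((dep, ref))
    return candidates


columns = {
    "orders.customer_id": [1, 2, 2, 3],
    "customers.id": [1, 2, 3, 4],
    "customers.name": ["ann", "bob", "cy", "dee"],
}
print(unary_inds(columns))
```

In this toy input, only `orders.customer_id ⊆ customers.id` survives, which is the kind of foreign key candidate the abstract mentions. Because each bucket is processed on its own, a variant of this sketch could write buckets to disk and load them one at a time instead of holding the whole dataset in main memory.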

Original language: English
Pages (from-to): 774-785
Number of pages: 12
Journal: Proceedings of the VLDB Endowment
Volume: 8
Issue number: 7
Publication status: Published - 2015

Fingerprint

Binders
Data integration
Data storage equipment

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • Computer Science (all)

Cite this

Divide and conquer-based inclusion dependency discovery. / Papenbrock, Thorsten; Kruse, Sebastian; Quiane Ruiz, Jorge Arnulfo; Naumann, Felix.

In: Proceedings of the VLDB Endowment, Vol. 8, No. 7, 2015, p. 774-785.

Papenbrock, T, Kruse, S, Quiane Ruiz, JA & Naumann, F 2015, 'Divide and conquer-based inclusion dependency discovery', Proceedings of the VLDB Endowment, vol. 8, no. 7, pp. 774-785.
Papenbrock, Thorsten ; Kruse, Sebastian ; Quiane Ruiz, Jorge Arnulfo ; Naumann, Felix. / Divide and conquer-based inclusion dependency discovery. In: Proceedings of the VLDB Endowment. 2015 ; Vol. 8, No. 7. pp. 774-785.
@article{5502dce4d5c54323bae6d7c835a4a0dc,
title = "Divide and conquer-based inclusion dependency discovery",
author = "Thorsten Papenbrock and Sebastian Kruse and {Quiane Ruiz}, {Jorge Arnulfo} and Felix Naumann",
year = "2015",
language = "English",
volume = "8",
pages = "774--785",
journal = "Proceedings of the VLDB Endowment",
issn = "2150-8097",
publisher = "Very Large Data Base Endowment Inc.",
number = "7",

}

TY - JOUR

T1 - Divide and conquer-based inclusion dependency discovery

AU - Papenbrock, Thorsten

AU - Kruse, Sebastian

AU - Quiane Ruiz, Jorge Arnulfo

AU - Naumann, Felix

PY - 2015

Y1 - 2015

UR - http://www.scopus.com/inward/record.url?scp=85013638127&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85013638127&partnerID=8YFLogxK

M3 - Article

VL - 8

SP - 774

EP - 785

JO - Proceedings of the VLDB Endowment

JF - Proceedings of the VLDB Endowment

SN - 2150-8097

IS - 7

ER -