Scalable discovery of unique column combinations

Arvid Heise, Jorge Arnulfo Quiane Ruiz, Ziawasch Abedjan, Anja Jentzsch, Felix Naumann

Research output: Chapter in Book/Report/Conference proceeding › Chapter

42 Citations (Scopus)

Abstract

The discovery of all unique (and non-unique) column combinations in a given dataset is at the core of any data profiling effort. The results are useful for a large number of areas of data management, such as anomaly detection, data integration, data modeling, duplicate detection, indexing, and query optimization. However, discovering all unique and non-unique column combinations is an NP-hard problem, which in principle requires verifying an exponential number of column combinations for uniqueness on all data values. Thus, achieving efficiency and scalability in this context is a tremendous challenge by itself. In this paper, we devise Ducc, a scalable and efficient approach to the problem of finding all unique and non-unique column combinations in big datasets. We first model the problem as a graph coloring problem and analyze the pruning effect of individual combinations. We then present our hybrid column-based pruning technique, which traverses the lattice in a depth-first and random walk combination. This strategy allows Ducc to typically depend on the solution set size and hence to prune large swaths of the lattice. Ducc also incorporates row-based pruning to run uniqueness checks in just a few milliseconds. To achieve even higher scalability, Ducc runs on several CPU cores (scale-up) and compute nodes (scale-out) with a very low overhead. We exhaustively evaluate Ducc using three datasets (two real and one synthetic) with several million rows and hundreds of attributes. We compare Ducc with related work: Gordian and HCA. The results show that Ducc is up to more than 2 orders of magnitude faster than Gordian and HCA (631x faster than Gordian and 398x faster than HCA). Finally, a series of scalability experiments shows the efficiency of Ducc to scale up and out.
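
To make the notions of a uniqueness check and lattice pruning concrete, here is a minimal sketch. It is not Ducc's hybrid depth-first and random-walk traversal: it is a naive level-wise search that only skips supersets of combinations already known to be unique. The function names `is_unique` and `minimal_uniques` and the sample rows are assumptions made up for illustration.

```python
from itertools import combinations


def is_unique(rows, cols):
    """Return True if no two rows share the same values on the given columns."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in cols)
        if key in seen:
            return False
        seen.add(key)
    return True


def minimal_uniques(rows, num_cols):
    """Naive level-wise search of the column-combination lattice.

    Any superset of a unique combination is also unique (but not minimal),
    so such candidates are skipped without scanning the data.
    """
    found = []
    for size in range(1, num_cols + 1):
        for cols in combinations(range(num_cols), size):
            if any(set(u) <= set(cols) for u in found):
                continue  # pruned: a subset is already known to be unique
            if is_unique(rows, cols):
                found.append(cols)
    return found


# Hypothetical sample data: (name, city, age)
rows = [
    ("alice", "berlin", 30),
    ("bob", "berlin", 30),
    ("alice", "potsdam", 25),
    ("bob", "potsdam", 30),
]
print(minimal_uniques(rows, 3))
```

In this sample no single column is unique, and the search still has to test every pair before it can prune the three-column combination. That exponential growth in candidates is what the abstract's pruning techniques are meant to address.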

Original language: English
Title of host publication: Proceedings of the VLDB Endowment
Publisher: Association for Computing Machinery
Pages: 301-312
Number of pages: 12
Volume: 7
Edition: 4
Publication status: Published - 2013

Fingerprint

Scalability
Data integration
Coloring
Information management
Program processors
Data structures
Computational complexity
Experiments

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • Computer Science (all)

Cite this

Heise, A., Quiane Ruiz, J. A., Abedjan, Z., Jentzsch, A., & Naumann, F. (2013). Scalable discovery of unique column combinations. In Proceedings of the VLDB Endowment (4 ed., Vol. 7, pp. 301-312). Association for Computing Machinery.

Scalable discovery of unique column combinations. / Heise, Arvid; Quiane Ruiz, Jorge Arnulfo; Abedjan, Ziawasch; Jentzsch, Anja; Naumann, Felix.

Proceedings of the VLDB Endowment. Vol. 7 4. ed. Association for Computing Machinery, 2013. p. 301-312.

Heise, A, Quiane Ruiz, JA, Abedjan, Z, Jentzsch, A & Naumann, F 2013, Scalable discovery of unique column combinations. in Proceedings of the VLDB Endowment. 4 edn, vol. 7, Association for Computing Machinery, pp. 301-312.
Heise A, Quiane Ruiz JA, Abedjan Z, Jentzsch A, Naumann F. Scalable discovery of unique column combinations. In Proceedings of the VLDB Endowment. 4 ed. Vol. 7. Association for Computing Machinery. 2013. p. 301-312
Heise, Arvid ; Quiane Ruiz, Jorge Arnulfo ; Abedjan, Ziawasch ; Jentzsch, Anja ; Naumann, Felix. / Scalable discovery of unique column combinations. Proceedings of the VLDB Endowment. Vol. 7 4. ed. Association for Computing Machinery, 2013. pp. 301-312
@inbook{2e8b4b8101f447ef97b77f46484fccb8,
title = "Scalable discovery of unique column combinations",
abstract = "The discovery of all unique (and non-unique) column combinations in a given dataset is at the core of any data profiling effort. The results are useful for a large number of areas of data management, such as anomaly detection, data integration, data modeling, duplicate detection, indexing, and query optimization. However, discovering all unique and non-unique column combinations is an NP-hard problem, which in principle requires verifying an exponential number of column combinations for uniqueness on all data values. Thus, achieving efficiency and scalability in this context is a tremendous challenge by itself. In this paper, we devise Ducc, a scalable and efficient approach to the problem of finding all unique and non-unique column combinations in big datasets. We first model the problem as a graph coloring problem and analyze the pruning effect of individual combinations. We then present our hybrid column-based pruning technique, which traverses the lattice in a depth-first and random walk combination. This strategy allows Ducc to typically depend on the solution set size and hence to prune large swaths of the lattice. Ducc also incorporates row-based pruning to run uniqueness checks in just a few milliseconds. To achieve even higher scalability, Ducc runs on several CPU cores (scale-up) and compute nodes (scale-out) with a very low overhead. We exhaustively evaluate Ducc using three datasets (two real and one synthetic) with several million rows and hundreds of attributes. We compare Ducc with related work: Gordian and HCA. The results show that Ducc is up to more than 2 orders of magnitude faster than Gordian and HCA (631x faster than Gordian and 398x faster than HCA). Finally, a series of scalability experiments shows the efficiency of Ducc to scale up and out.",
author = "Arvid Heise and {Quiane Ruiz}, {Jorge Arnulfo} and Ziawasch Abedjan and Anja Jentzsch and Felix Naumann",
year = "2013",
language = "English",
volume = "7",
pages = "301--312",
booktitle = "Proceedings of the VLDB Endowment",
publisher = "Association for Computing Machinery",
edition = "4",

}

TY - CHAP

T1 - Scalable discovery of unique column combinations

AU - Heise, Arvid

AU - Quiane Ruiz, Jorge Arnulfo

AU - Abedjan, Ziawasch

AU - Jentzsch, Anja

AU - Naumann, Felix

PY - 2013

Y1 - 2013

N2 - The discovery of all unique (and non-unique) column combinations in a given dataset is at the core of any data profiling effort. The results are useful for a large number of areas of data management, such as anomaly detection, data integration, data modeling, duplicate detection, indexing, and query optimization. However, discovering all unique and non-unique column combinations is an NP-hard problem, which in principle requires verifying an exponential number of column combinations for uniqueness on all data values. Thus, achieving efficiency and scalability in this context is a tremendous challenge by itself. In this paper, we devise Ducc, a scalable and efficient approach to the problem of finding all unique and non-unique column combinations in big datasets. We first model the problem as a graph coloring problem and analyze the pruning effect of individual combinations. We then present our hybrid column-based pruning technique, which traverses the lattice in a depth-first and random walk combination. This strategy allows Ducc to typically depend on the solution set size and hence to prune large swaths of the lattice. Ducc also incorporates row-based pruning to run uniqueness checks in just a few milliseconds. To achieve even higher scalability, Ducc runs on several CPU cores (scale-up) and compute nodes (scale-out) with a very low overhead. We exhaustively evaluate Ducc using three datasets (two real and one synthetic) with several million rows and hundreds of attributes. We compare Ducc with related work: Gordian and HCA. The results show that Ducc is up to more than 2 orders of magnitude faster than Gordian and HCA (631x faster than Gordian and 398x faster than HCA). Finally, a series of scalability experiments shows the efficiency of Ducc to scale up and out.

AB - The discovery of all unique (and non-unique) column combinations in a given dataset is at the core of any data profiling effort. The results are useful for a large number of areas of data management, such as anomaly detection, data integration, data modeling, duplicate detection, indexing, and query optimization. However, discovering all unique and non-unique column combinations is an NP-hard problem, which in principle requires verifying an exponential number of column combinations for uniqueness on all data values. Thus, achieving efficiency and scalability in this context is a tremendous challenge by itself. In this paper, we devise Ducc, a scalable and efficient approach to the problem of finding all unique and non-unique column combinations in big datasets. We first model the problem as a graph coloring problem and analyze the pruning effect of individual combinations. We then present our hybrid column-based pruning technique, which traverses the lattice in a depth-first and random walk combination. This strategy allows Ducc to typically depend on the solution set size and hence to prune large swaths of the lattice. Ducc also incorporates row-based pruning to run uniqueness checks in just a few milliseconds. To achieve even higher scalability, Ducc runs on several CPU cores (scale-up) and compute nodes (scale-out) with a very low overhead. We exhaustively evaluate Ducc using three datasets (two real and one synthetic) with several million rows and hundreds of attributes. We compare Ducc with related work: Gordian and HCA. The results show that Ducc is up to more than 2 orders of magnitude faster than Gordian and HCA (631x faster than Gordian and 398x faster than HCA). Finally, a series of scalability experiments shows the efficiency of Ducc to scale up and out.

UR - http://www.scopus.com/inward/record.url?scp=84896995312&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84896995312&partnerID=8YFLogxK

M3 - Chapter

VL - 7

SP - 301

EP - 312

BT - Proceedings of the VLDB Endowment

PB - Association for Computing Machinery

ER -