### Abstract

The discovery of all unique (and non-unique) column combinations in a given dataset is at the core of any data profiling effort. The results are useful for a large number of areas of data management, such as anomaly detection, data integration, data modeling, duplicate detection, indexing, and query optimization. However, discovering all unique and non-unique column combinations is an NP-hard problem, which in principle requires verifying an exponential number of column combinations for uniqueness on all data values. Thus, achieving efficiency and scalability in this context is a tremendous challenge by itself. In this paper, we devise Ducc, a scalable and efficient approach to the problem of finding all unique and non-unique column combinations in big datasets. We first model the problem as a graph coloring problem and analyze the pruning effect of individual combinations. We then present our hybrid column-based pruning technique, which traverses the lattice using a combination of depth-first search and random walk. With this strategy, the work Ducc performs typically depends on the solution set size, so it can prune large swaths of the lattice. Ducc also incorporates row-based pruning to run uniqueness checks in just a few milliseconds. To achieve even higher scalability, Ducc runs on several CPU cores (scale-up) and compute nodes (scale-out) with very low overhead. We exhaustively evaluate Ducc using three datasets (two real and one synthetic) with several million rows and hundreds of attributes. We compare Ducc with related work: Gordian and HCA. The results show that Ducc is more than two orders of magnitude faster than Gordian and HCA (up to 631x faster than Gordian and 398x faster than HCA). Finally, a series of scalability experiments shows that Ducc scales up and out efficiently.
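To make the problem statement concrete, the following is a minimal sketch of a uniqueness check and a naive level-wise search for minimal unique column combinations. It is not Ducc's hybrid depth-first and random-walk traversal, and it omits row-based pruning and parallelism; it only illustrates one pruning rule that follows from the definition, namely that every superset of a unique combination is also unique and therefore cannot be minimal. The function names `is_unique` and `minimal_uniques` and the sample rows are assumptions made for illustration.

```python
from itertools import combinations


def is_unique(rows, cols):
    """Return True if no two rows agree on all of the given column indices."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in cols)
        if key in seen:
            return False
        seen.add(key)
    return True


def minimal_uniques(rows, num_cols):
    """Naive level-wise search (illustrative baseline, not Ducc)."""
    found = []
    for size in range(1, num_cols + 1):
        for cols in combinations(range(num_cols), size):
            # Prune: a superset of a known unique is unique but not minimal.
            if any(set(u) <= set(cols) for u in found):
                continue
            if is_unique(rows, cols):
                found.append(cols)
    return found


# Hypothetical sample data: (name, city, age)
rows = [
    ("alice", "berlin", 30),
    ("bob", "berlin", 30),
    ("alice", "potsdam", 25),
    ("bob", "potsdam", 30),
]
result = minimal_uniques(rows, 3)
print(result)
```

In this sample no single column is unique, and the only minimal unique combination is (name, city). The baseline still visits candidates level by level, which is exponential in the number of columns in the worst case; that cost is what the pruning strategies described in the abstract aim to avoid.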

Original language | English |
---|---|
Title of host publication | Proceedings of the VLDB Endowment |
Publisher | Association for Computing Machinery |
Pages | 301-312 |
Number of pages | 12 |
Volume | 7 |
Edition | 4 |
Publication status | Published - 2013 |

### ASJC Scopus subject areas

- Computer Science (miscellaneous)
- Computer Science (all)

### Cite this

**Scalable discovery of unique column combinations.** / Heise, Arvid; Quiane Ruiz, Jorge Arnulfo; Abedjan, Ziawasch; Jentzsch, Anja; Naumann, Felix.

*Proceedings of the VLDB Endowment* (4 ed., Vol. 7, pp. 301-312). Association for Computing Machinery.

Research output: Chapter in Book/Report/Conference proceeding › Chapter

TY - CHAP

T1 - Scalable discovery of unique column combinations

AU - Heise, Arvid

AU - Quiane Ruiz, Jorge Arnulfo

AU - Abedjan, Ziawasch

AU - Jentzsch, Anja

AU - Naumann, Felix

PY - 2013

Y1 - 2013

UR - http://www.scopus.com/inward/record.url?scp=84896995312&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84896995312&partnerID=8YFLogxK

M3 - Chapter

VL - 7

SP - 301

EP - 312

BT - Proceedings of the VLDB Endowment

PB - Association for Computing Machinery

ER -