Detecting unique column combinations on dynamic data

Ziawasch Abedjan, Jorge Arnulfo Quiane Ruiz, Felix Naumann

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

7 Citations (Scopus)

Abstract

The discovery of all unique (and non-unique) column combinations in an unknown dataset is at the core of any data profiling effort. Unique column combinations resemble candidate keys of a relational dataset. Several research approaches have focused on their efficient discovery in a given, static dataset. However, none of these approaches is suitable for applications on dynamic datasets, such as transactional databases, social networks, and scientific applications. In these cases, data profiling techniques should be able to efficiently discover new uniques and non-uniques (and validate old ones) after tuple inserts or deletes, without re-profiling the entire dataset. We present Swan, the first approach to efficiently discover unique and non-unique constraints on dynamic datasets that is independent of the initial dataset size. In particular, Swan makes use of intelligently chosen indices to minimize access to old data. We perform an exhaustive analysis of Swan and compare it with two state-of-the-art techniques for unique discovery: Gordian and Ducc. The results show that Swan significantly outperforms both, as well as their incremental adaptations. For inserts, Swan is more than 63x faster than Gordian and up to 50x faster than Ducc. For deletes, Swan is more than 15x faster than Gordian and up to one order of magnitude faster than Ducc. In fact, Swan even improves on the static case by dividing the dataset into a static part and a set of inserts.
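The idea of validating uniques after inserts and deletes without rescanning old data can be illustrated with a small sketch. This is not the Swan algorithm from the paper: it does not discover minimal uniques or choose indices intelligently. It only assumes one hash index per tracked column combination, and the class name `UniqueTracker` and its methods are hypothetical names introduced for this example.

```python
from typing import Dict, Iterable, Sequence, Tuple

Row = Tuple[object, ...]
Combo = Tuple[int, ...]  # column positions forming a column combination


class UniqueTracker:
    """Hypothetical helper: keeps one hash index per tracked combination.

    Each index maps a projected value to the number of rows carrying it,
    so inserts and deletes are validated by lookups, not by a full rescan.
    """

    def __init__(self, rows: Iterable[Row], combos: Sequence[Combo]):
        self.counts: Dict[Combo, Dict[Row, int]] = {c: {} for c in combos}
        # Number of projected values that occur more than once, per combination.
        self.dup_keys: Dict[Combo, int] = {c: 0 for c in combos}
        self.insert(rows)

    def insert(self, rows: Iterable[Row]) -> None:
        for row in rows:
            for combo, counts in self.counts.items():
                key = tuple(row[i] for i in combo)
                counts[key] = counts.get(key, 0) + 1
                if counts[key] == 2:  # value just became duplicated
                    self.dup_keys[combo] += 1

    def delete(self, rows: Iterable[Row]) -> None:
        # Assumes every deleted row was inserted before.
        for row in rows:
            for combo, counts in self.counts.items():
                key = tuple(row[i] for i in combo)
                counts[key] -= 1
                if counts[key] == 1:  # value is no longer duplicated
                    self.dup_keys[combo] -= 1
                if counts[key] == 0:
                    del counts[key]

    def is_unique(self, combo: Combo) -> bool:
        return self.dup_keys[combo] == 0


tracker = UniqueTracker(
    [("alice", "berlin", 30), ("bob", "berlin", 25)],
    combos=[(0,), (1,), (0, 1)],
)
tracker.insert([("alice", "paris", 41)])  # column 0 now holds a duplicate
```

After the insert in this example, column 0 alone is no longer unique, while the pair of columns 0 and 1 still is. Deleting a row can turn a non-unique combination back into a unique one, which is why the sketch tracks counts rather than a simple set of seen values.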

Original language: English
Title of host publication: Proceedings - International Conference on Data Engineering
Publisher: IEEE Computer Society
Pages: 1036-1047
Number of pages: 12
ISBN (Print): 9781479925544
DOI: https://doi.org/10.1109/ICDE.2014.6816721
Publication status: Published - 1 Jan 2014
Event: 30th IEEE International Conference on Data Engineering, ICDE 2014 - Chicago, IL, United States
Duration: 31 Mar 2014 to 4 Apr 2014

Other

Other: 30th IEEE International Conference on Data Engineering, ICDE 2014
Country: United States
City: Chicago, IL
Period: 31/3/14 to 4/4/14

ASJC Scopus subject areas

  • Information Systems
  • Signal Processing
  • Software

Cite this

Abedjan, Z., Quiane Ruiz, J. A., & Naumann, F. (2014). Detecting unique column combinations on dynamic data. In Proceedings - International Conference on Data Engineering (pp. 1036-1047). [6816721] IEEE Computer Society. https://doi.org/10.1109/ICDE.2014.6816721

@inproceedings{c351cc93bdfe4f5da2b137d9cdfd64a4,
title = "Detecting unique column combinations on dynamic data",
author = "Ziawasch Abedjan and {Quiane Ruiz}, {Jorge Arnulfo} and Felix Naumann",
year = "2014",
month = "1",
day = "1",
doi = "10.1109/ICDE.2014.6816721",
language = "English",
isbn = "9781479925544",
pages = "1036--1047",
booktitle = "Proceedings - International Conference on Data Engineering",
publisher = "IEEE Computer Society",

}

TY - GEN

T1 - Detecting unique column combinations on dynamic data

AU - Abedjan, Ziawasch

AU - Quiane Ruiz, Jorge Arnulfo

AU - Naumann, Felix

PY - 2014/1/1

Y1 - 2014/1/1

UR - http://www.scopus.com/inward/record.url?scp=84901781853&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84901781853&partnerID=8YFLogxK

U2 - 10.1109/ICDE.2014.6816721

DO - 10.1109/ICDE.2014.6816721

M3 - Conference contribution

SN - 9781479925544

SP - 1036

EP - 1047

BT - Proceedings - International Conference on Data Engineering

PB - IEEE Computer Society

ER -