RDFind: Scalable conditional inclusion dependency discovery in RDF datasets

Sebastian Kruse, Anja Jentzsch, Thorsten Papenbrock, Zoi Kaoudi, Jorge Arnulfo Quiane Ruiz, Felix Naumann

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

2 Citations (Scopus)

Abstract

Inclusion dependencies (INDs) form an important integrity constraint on relational databases, supporting data management tasks such as join path discovery and query optimization. Conditional inclusion dependencies (CINDs), which define including and included data in terms of conditions, make it possible to transfer these capabilities to RDF data. However, CIND discovery is computationally much more complex than IND discovery, and the number of CINDs even on small RDF datasets is intractably large. To cope with both problems, we first introduce the notion of pertinent CINDs with an adjustable relevance criterion to filter and rank CINDs based on their extent and implications among each other. Second, we present RDFind, a distributed system to efficiently discover all pertinent CINDs in RDF data. RDFind employs a lazy pruning strategy to drastically reduce the CIND search space. Also, its exhaustive parallelization strategy and robust data structures make it highly scalable. In our experimental evaluation, we show that RDFind is up to 419 times faster than the state of the art, while considering a more general class of CINDs. Furthermore, it is capable of processing a very large dataset of billions of triples, which was entirely infeasible before.
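
As an illustration of what a CIND over RDF data might look like, the following is a minimal sketch and not RDFind's implementation. It assumes that each side of a CIND selects one field of the triples (subject, predicate, or object) under an equality condition on the other fields, and that the CIND holds when the dependent side's values are a subset of the referenced side's values. The helper names `capture` and `holds` and the sample triples are hypothetical, and counting the dependent side's distinct values is only one plausible reading of a CIND's "extent".

```python
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]          # (subject, predicate, object)
Side = Tuple[str, Dict[str, str]]      # (projected field, equality conditions)
FIELDS = {"s": 0, "p": 1, "o": 2}


def capture(triples: Iterable[Triple], project: str, condition: Dict[str, str]) -> Set[str]:
    """Distinct values of field `project` among triples matching all conditions."""
    idx = FIELDS[project]
    return {
        t[idx]
        for t in triples
        if all(t[FIELDS[field]] == value for field, value in condition.items())
    }


def holds(triples: Iterable[Triple], dependent: Side, referenced: Side) -> bool:
    """True if every value selected by `dependent` is also selected by `referenced`."""
    triples = list(triples)  # allow two passes over any iterable
    return capture(triples, *dependent) <= capture(triples, *referenced)


# Hypothetical sample data, invented for this illustration.
triples = [
    ("patrick", "rdf:type", "gradStudent"),
    ("mike", "rdf:type", "gradStudent"),
    ("patrick", "undergradFrom", "hpi"),
    ("mike", "undergradFrom", "cmu"),
    ("tim", "undergradFrom", "mit"),
]

# "Every subject typed as gradStudent is also the subject of an undergradFrom triple."
grad_students: Side = ("s", {"p": "rdf:type", "o": "gradStudent"})
has_undergrad: Side = ("s", {"p": "undergradFrom"})

cind_holds = holds(triples, grad_students, has_undergrad)
reverse_holds = holds(triples, has_undergrad, grad_students)
extent = len(capture(triples, *grad_students))
```

On this toy data the inclusion holds in one direction only, because `tim` has an `undergradFrom` triple but no `gradStudent` type. Even this small example hints at why the search space grows quickly: every combination of projected field and condition values on either side is a candidate.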

Original language: English
Title of host publication: SIGMOD 2016 - Proceedings of the 2016 International Conference on Management of Data
Publisher: Association for Computing Machinery
Pages: 953-967
Number of pages: 15
Volume: 26-June-2016
ISBN (Electronic): 9781450335317
DOI: https://doi.org/10.1145/2882903.2915206
Publication status: Published - 26 Jun 2016
Event: 2016 ACM SIGMOD International Conference on Management of Data, SIGMOD 2016 - San Francisco, United States
Duration: 26 Jun 2016 to 1 Jul 2016

Other

Other: 2016 ACM SIGMOD International Conference on Management of Data, SIGMOD 2016
Country: United States
City: San Francisco
Period: 26/6/16 to 1/7/16

Fingerprint

Information management
Data structures
Processing

ASJC Scopus subject areas

  • Software
  • Information Systems

Cite this

Kruse, S., Jentzsch, A., Papenbrock, T., Kaoudi, Z., Quiane Ruiz, J. A., & Naumann, F. (2016). RDFind: Scalable conditional inclusion dependency discovery in RDF datasets. In SIGMOD 2016 - Proceedings of the 2016 International Conference on Management of Data (Vol. 26-June-2016, pp. 953-967). Association for Computing Machinery. https://doi.org/10.1145/2882903.2915206

RDFind: Scalable conditional inclusion dependency discovery in RDF datasets. / Kruse, Sebastian; Jentzsch, Anja; Papenbrock, Thorsten; Kaoudi, Zoi; Quiane Ruiz, Jorge Arnulfo; Naumann, Felix.

SIGMOD 2016 - Proceedings of the 2016 International Conference on Management of Data. Vol. 26-June-2016. Association for Computing Machinery, 2016. p. 953-967.

Kruse, S, Jentzsch, A, Papenbrock, T, Kaoudi, Z, Quiane Ruiz, JA & Naumann, F 2016, RDFind: Scalable conditional inclusion dependency discovery in RDF datasets. in SIGMOD 2016 - Proceedings of the 2016 International Conference on Management of Data. vol. 26-June-2016, Association for Computing Machinery, pp. 953-967, 2016 ACM SIGMOD International Conference on Management of Data, SIGMOD 2016, San Francisco, United States, 26/6/16. https://doi.org/10.1145/2882903.2915206
Kruse S, Jentzsch A, Papenbrock T, Kaoudi Z, Quiane Ruiz JA, Naumann F. RDFind: Scalable conditional inclusion dependency discovery in RDF datasets. In SIGMOD 2016 - Proceedings of the 2016 International Conference on Management of Data. Vol. 26-June-2016. Association for Computing Machinery. 2016. p. 953-967 https://doi.org/10.1145/2882903.2915206
Kruse, Sebastian; Jentzsch, Anja; Papenbrock, Thorsten; Kaoudi, Zoi; Quiane Ruiz, Jorge Arnulfo; Naumann, Felix. / RDFind: Scalable conditional inclusion dependency discovery in RDF datasets. SIGMOD 2016 - Proceedings of the 2016 International Conference on Management of Data. Vol. 26-June-2016. Association for Computing Machinery, 2016. pp. 953-967.
@inproceedings{08dc1a65a8e549e8bb81c11ae0631812,
title = "RDFind: Scalable conditional inclusion dependency discovery in RDF datasets",
abstract = "Inclusion dependencies (INDs) form an important integrity constraint on relational databases, supporting data management tasks, such as join path discovery and query optimization. Conditional inclusion dependencies (CINDs), which define including and included data in terms of conditions, allow to transfer these capabilities to RDF data. However, CIND discovery is computationally much more complex than IND discovery and the number of CINDs even on small RDF datasets is intractable. To cope with both problems, we first introduce the notion of pertinent CINDs with an adjustable relevance criterion to filter and rank CINDs based on their extent and implications among each other. Second, we present RDFind, a distributed system to efficiently discover all pertinent CINDs in RDF data. RDFind employs a lazy pruning strategy to drastically reduce the CIND search space. Also, its exhaustive parallelization strategy and robust data structures make it highly scalable. In our experimental evaluation, we show that RDFind is up to 419 times faster than the state-of-the-art, while considering a more general class of CINDs. Furthermore, it is capable of processing a very large dataset of billions of triples, which was entirely infeasible before.",
author = "Sebastian Kruse and Anja Jentzsch and Thorsten Papenbrock and Zoi Kaoudi and {Quiane Ruiz}, {Jorge Arnulfo} and Felix Naumann",
year = "2016",
month = "6",
day = "26",
doi = "10.1145/2882903.2915206",
language = "English",
volume = "26-June-2016",
pages = "953--967",
booktitle = "SIGMOD 2016 - Proceedings of the 2016 International Conference on Management of Data",
publisher = "Association for Computing Machinery",

}

TY - GEN

T1 - RDFind

T2 - Scalable conditional inclusion dependency discovery in RDF datasets

AU - Kruse, Sebastian

AU - Jentzsch, Anja

AU - Papenbrock, Thorsten

AU - Kaoudi, Zoi

AU - Quiane Ruiz, Jorge Arnulfo

AU - Naumann, Felix

PY - 2016/6/26

Y1 - 2016/6/26

N2 - Inclusion dependencies (INDs) form an important integrity constraint on relational databases, supporting data management tasks, such as join path discovery and query optimization. Conditional inclusion dependencies (CINDs), which define including and included data in terms of conditions, allow to transfer these capabilities to RDF data. However, CIND discovery is computationally much more complex than IND discovery and the number of CINDs even on small RDF datasets is intractable. To cope with both problems, we first introduce the notion of pertinent CINDs with an adjustable relevance criterion to filter and rank CINDs based on their extent and implications among each other. Second, we present RDFind, a distributed system to efficiently discover all pertinent CINDs in RDF data. RDFind employs a lazy pruning strategy to drastically reduce the CIND search space. Also, its exhaustive parallelization strategy and robust data structures make it highly scalable. In our experimental evaluation, we show that RDFind is up to 419 times faster than the state-of-the-art, while considering a more general class of CINDs. Furthermore, it is capable of processing a very large dataset of billions of triples, which was entirely infeasible before.

AB - Inclusion dependencies (INDs) form an important integrity constraint on relational databases, supporting data management tasks, such as join path discovery and query optimization. Conditional inclusion dependencies (CINDs), which define including and included data in terms of conditions, allow to transfer these capabilities to RDF data. However, CIND discovery is computationally much more complex than IND discovery and the number of CINDs even on small RDF datasets is intractable. To cope with both problems, we first introduce the notion of pertinent CINDs with an adjustable relevance criterion to filter and rank CINDs based on their extent and implications among each other. Second, we present RDFind, a distributed system to efficiently discover all pertinent CINDs in RDF data. RDFind employs a lazy pruning strategy to drastically reduce the CIND search space. Also, its exhaustive parallelization strategy and robust data structures make it highly scalable. In our experimental evaluation, we show that RDFind is up to 419 times faster than the state-of-the-art, while considering a more general class of CINDs. Furthermore, it is capable of processing a very large dataset of billions of triples, which was entirely infeasible before.

UR - http://www.scopus.com/inward/record.url?scp=84979659133&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84979659133&partnerID=8YFLogxK

U2 - 10.1145/2882903.2915206

DO - 10.1145/2882903.2915206

M3 - Conference contribution

AN - SCOPUS:84979659133

VL - 26-June-2016

SP - 953

EP - 967

BT - SIGMOD 2016 - Proceedings of the 2016 International Conference on Management of Data

PB - Association for Computing Machinery

ER -