Similarity Group-by Operators for Multi-Dimensional Relational Data

Mingjie Tang, Ruby Y. Tahboub, Walid G. Aref, Mikhail J. Atallah, Qutaibah M. Malluhi, Mourad Ouzzani, Yasin N. Silva

Research output: Contribution to journal › Article

9 Citations (Scopus)

Abstract

The SQL group-by operator plays an important role in summarizing and aggregating large datasets in a data analytics stack. While the standard group-by operator, which is based on equality, is useful in several applications, allowing similarity-aware grouping provides a more realistic view of real-world data that could lead to better insights. The Similarity SQL-based Group-By operator (SGB, for short) extends the semantics of the standard SQL Group-by by grouping data with similar but not necessarily equal values. Existing similarity-based grouping operators realize these approximate semantics efficiently, but they focus primarily on one-dimensional attributes and treat the attributes of multi-dimensional data independently. As a result, correlated attributes, such as those in spatial data, are processed independently, and groups in the multi-dimensional space are not detected properly. To address this problem, we introduce two new SGB operators for multi-dimensional data. The first operator is the clique (or distance-to-all) SGB, where all the tuples in a group are within some distance from each other. The second operator is the distance-to-any SGB, where a tuple belongs to a group if the tuple is within some distance from any other tuple in the group. Since a tuple may satisfy the membership criterion of multiple groups, we introduce three different semantics to deal with such a case: (i) eliminate the tuple, (ii) put the tuple in any one group, and (iii) create a new group for this tuple. We implement and test the new SGB operators and their algorithms inside PostgreSQL. The overhead introduced by these operators proves to be minimal, and the execution times are comparable to those of the standard Group-by. The experimental study, based on TPC-H and a social check-in dataset, demonstrates that the proposed algorithms can achieve up to three orders of magnitude improvement in performance over baseline methods developed to solve the same problem.
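The two grouping semantics described in the abstract can be illustrated with a small sketch. This is not the authors' PostgreSQL implementation or their algorithms; it is a naive, quadratic-time illustration under stated assumptions: tuples are numeric points, distance is Euclidean, and tuples are processed in input order. The function names `sgb_any` and `sgb_all` and the `on_overlap` option values are hypothetical names chosen for this example.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, ...]


def sgb_any(points: Sequence[Point], eps: float) -> List[List[Point]]:
    """Distance-to-any grouping (illustrative): a tuple joins a group if it
    is within eps of at least one member, so the groups are the connected
    components of the eps-neighbourhood graph."""
    parent = list(range(len(points)))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Naive all-pairs comparison; merge the components of close points.
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= eps:
                parent[find(i)] = find(j)

    groups: Dict[int, List[Point]] = {}
    for i, p in enumerate(points):
        groups.setdefault(find(i), []).append(p)
    return list(groups.values())


def sgb_all(points: Sequence[Point], eps: float,
            on_overlap: str = "join_any") -> List[List[Point]]:
    """Clique (distance-to-all) grouping (illustrative, order-dependent):
    a tuple may join a group only if it is within eps of every member.
    on_overlap decides what happens when several groups qualify."""
    if on_overlap not in ("eliminate", "join_any", "new_group"):
        raise ValueError(f"unknown on_overlap option: {on_overlap}")

    groups: List[List[Point]] = []
    for p in points:
        fits = [g for g in groups
                if all(math.dist(p, q) <= eps for q in g)]
        if len(fits) == 1 or (len(fits) > 1 and on_overlap == "join_any"):
            fits[0].append(p)      # unique fit, or pick the first candidate
        elif len(fits) > 1 and on_overlap == "eliminate":
            continue               # (i) drop the overlapping tuple
        else:
            groups.append([p])     # no fit, or (iii) form a new group
    return groups


if __name__ == "__main__":
    pts = [(0.0, 0.0), (2.0, 0.0), (1.0, 0.0)]
    print(sgb_any(pts, 1.5))
    for option in ("eliminate", "join_any", "new_group"):
        print(option, sgb_all(pts, 1.5, option))
```

With the three sample points and a threshold of 1.5, the point (1, 0) is within range of both (0, 0) and (2, 0), which are not within range of each other. Distance-to-any therefore chains all three points into one group, while the clique variant starts two singleton groups and must resolve the overlap for (1, 0) according to the chosen option.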

Original language: English
Article number: 7289415
Pages (from-to): 510-523
Number of pages: 14
Journal: IEEE Transactions on Knowledge and Data Engineering
Volume: 28
Issue number: 2
DOIs: https://doi.org/10.1109/TKDE.2015.2480400
Publication status: Published - 1 Feb 2016

Fingerprint

Semantics
Mathematical operators

Keywords

  • multidimensional data
  • query processing
  • relational database
  • similarity query
  • SQL operators

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Information Systems
  • Computer Science Applications

Cite this

Tang, M., Tahboub, R. Y., Aref, W. G., Atallah, M. J., Malluhi, Q. M., Ouzzani, M., & Silva, Y. N. (2016). Similarity Group-by Operators for Multi-Dimensional Relational Data. IEEE Transactions on Knowledge and Data Engineering, 28(2), 510-523. [7289415]. https://doi.org/10.1109/TKDE.2015.2480400

Similarity Group-by Operators for Multi-Dimensional Relational Data. / Tang, Mingjie; Tahboub, Ruby Y.; Aref, Walid G.; Atallah, Mikhail J.; Malluhi, Qutaibah M.; Ouzzani, Mourad; Silva, Yasin N.

In: IEEE Transactions on Knowledge and Data Engineering, Vol. 28, No. 2, 7289415, 01.02.2016, p. 510-523.

Tang, M, Tahboub, RY, Aref, WG, Atallah, MJ, Malluhi, QM, Ouzzani, M & Silva, YN 2016, 'Similarity Group-by Operators for Multi-Dimensional Relational Data', IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 2, 7289415, pp. 510-523. https://doi.org/10.1109/TKDE.2015.2480400

Tang, Mingjie ; Tahboub, Ruby Y. ; Aref, Walid G. ; Atallah, Mikhail J. ; Malluhi, Qutaibah M. ; Ouzzani, Mourad ; Silva, Yasin N. / Similarity Group-by Operators for Multi-Dimensional Relational Data. In: IEEE Transactions on Knowledge and Data Engineering. 2016 ; Vol. 28, No. 2. pp. 510-523.

@article{b4bfe0971ba64567828015aac96f4615,
title = "Similarity Group-by Operators for Multi-Dimensional Relational Data",
abstract = "The SQL group-by operator plays an important role in summarizing and aggregating large datasets in a data analytics stack. While the standard group-by operator, which is based on equality, is useful in several applications, allowing similarity aware grouping provides a more realistic view on real-world data that could lead to better insights. The Similarity SQL-based Group-By operator (SGB, for short) extends the semantics of the standard SQL Group-by by grouping data with similar but not necessarily equal values. While existing similarity-based grouping operators efficiently realize these approximate semantics, they primarily focus on one-dimensional attributes and treat multi-dimensional attributes independently. However, correlated attributes, such as in spatial data, are processed independently, and hence, groups in the multi-dimensional space are not detected properly. To address this problem, we introduce two new SGB operators for multi-dimensional data. The first operator is the clique (or distance-to-all) SGB, where all the tuples in a group are within some distance from each other. The second operator is the distance-to-any SGB, where a tuple belongs to a group if the tuple is within some distance from any other tuple in the group. Since a tuple may satisfy the membership criterion of multiple groups, we introduce three different semantics to deal with such a case: (i) eliminate the tuple, (ii) put the tuple in any one group, and (iii) create a new group for this tuple. We implement and test the new SGB operators and their algorithms inside PostgreSQL. The overhead introduced by these operators proves to be minimal and the execution times are comparable to those of the standard Group-by. The experimental study, based on TPC-H and a social check-in data, demonstrates that the proposed algorithms can achieve up to three orders of magnitude enhancement in performance over baseline methods developed to solve the same problem.",
keywords = "multidimensional data, query processing, relational database, similarity query, SQL operators",
author = "Mingjie Tang and Tahboub, {Ruby Y.} and Aref, {Walid G.} and Atallah, {Mikhail J.} and Malluhi, {Qutaibah M.} and Mourad Ouzzani and Silva, {Yasin N.}",
year = "2016",
month = "2",
day = "1",
doi = "10.1109/TKDE.2015.2480400",
language = "English",
volume = "28",
pages = "510--523",
journal = "IEEE Transactions on Knowledge and Data Engineering",
issn = "1041-4347",
publisher = "IEEE Computer Society",
number = "2",

}

TY - JOUR

T1 - Similarity Group-by Operators for Multi-Dimensional Relational Data

AU - Tang, Mingjie

AU - Tahboub, Ruby Y.

AU - Aref, Walid G.

AU - Atallah, Mikhail J.

AU - Malluhi, Qutaibah M.

AU - Ouzzani, Mourad

AU - Silva, Yasin N.

PY - 2016/2/1

Y1 - 2016/2/1

AB - The SQL group-by operator plays an important role in summarizing and aggregating large datasets in a data analytics stack. While the standard group-by operator, which is based on equality, is useful in several applications, allowing similarity aware grouping provides a more realistic view on real-world data that could lead to better insights. The Similarity SQL-based Group-By operator (SGB, for short) extends the semantics of the standard SQL Group-by by grouping data with similar but not necessarily equal values. While existing similarity-based grouping operators efficiently realize these approximate semantics, they primarily focus on one-dimensional attributes and treat multi-dimensional attributes independently. However, correlated attributes, such as in spatial data, are processed independently, and hence, groups in the multi-dimensional space are not detected properly. To address this problem, we introduce two new SGB operators for multi-dimensional data. The first operator is the clique (or distance-to-all) SGB, where all the tuples in a group are within some distance from each other. The second operator is the distance-to-any SGB, where a tuple belongs to a group if the tuple is within some distance from any other tuple in the group. Since a tuple may satisfy the membership criterion of multiple groups, we introduce three different semantics to deal with such a case: (i) eliminate the tuple, (ii) put the tuple in any one group, and (iii) create a new group for this tuple. We implement and test the new SGB operators and their algorithms inside PostgreSQL. The overhead introduced by these operators proves to be minimal and the execution times are comparable to those of the standard Group-by. The experimental study, based on TPC-H and a social check-in data, demonstrates that the proposed algorithms can achieve up to three orders of magnitude enhancement in performance over baseline methods developed to solve the same problem.

KW - multidimensional data

KW - query processing

KW - relational database

KW - similarity query

KW - SQL operators

UR - http://www.scopus.com/inward/record.url?scp=84962467098&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84962467098&partnerID=8YFLogxK

U2 - 10.1109/TKDE.2015.2480400

DO - 10.1109/TKDE.2015.2480400

M3 - Article

AN - SCOPUS:84962467098

VL - 28

SP - 510

EP - 523

JO - IEEE Transactions on Knowledge and Data Engineering

JF - IEEE Transactions on Knowledge and Data Engineering

SN - 1041-4347

IS - 2

M1 - 7289415

ER -