Efficient parallel skyline query processing for high-dimensional data

Mingjie Tang, Yongyang Yu, Walid G. Aref, Qutaibah M. Malluhi, Mourad Ouzzani

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Given a set of multidimensional data points, a skyline query retrieves the points that are not dominated by any other point in the set. Skyline queries are widely used, for example in preference-based query answering and decision making, and they must handle large amounts of data, so processing them scalably is critically important. However, several outstanding challenges have not been well addressed. Specifically, in this paper we tackle the data straggler and data skew challenges introduced by distributed skyline query processing, as well as the ensuing high computation cost of merging skyline candidates. We introduce a new, efficient three-phase approach for large-scale processing of skyline queries. In the first phase, a preprocessing step, the data is partitioned along the Z-order curve. We use a novel data partitioning approach that formulates data partitioning as an optimization problem to minimize the size of intermediate data. In the second phase, each compute node partitions the input data points into disjoint subsets and then computes the skyline of each subset, producing skyline candidates in parallel. In the final phase, we build an index and employ an efficient algorithm to merge the generated skyline candidates. Extensive experiments demonstrate that the proposed skyline algorithm improves performance by more than one order of magnitude over existing state-of-the-art approaches.
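The three phases described in the abstract can be illustrated with a small, single-process sketch. This is not the authors' algorithm: it assumes that smaller values are preferred in every dimension and that coordinates are non-negative integers, it splits the Z-ordered points into equal-sized chunks instead of solving the optimization problem the abstract mentions, and it merges candidates with a plain nested-loop skyline rather than an index. All function names (`dominates`, `skyline`, `z_order_key`, `partition_by_z_order`, `three_phase_skyline`) are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[int, ...]


def dominates(p: Point, q: Point) -> bool:
    """True if p is no worse than q in every dimension and strictly better
    in at least one (assumption: smaller values are better)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def skyline(points: Sequence[Point]) -> List[Point]:
    """Naive nested-loop skyline: keep a window of non-dominated points."""
    window: List[Point] = []
    for p in points:
        if any(dominates(q, p) for q in window):
            continue  # p is dominated, so it cannot be a skyline point
        # p survives; drop any window points that p dominates
        window = [q for q in window if not dominates(p, q)]
        window.append(p)
    return window


def z_order_key(p: Point, bits: int = 8) -> int:
    """Interleave the bits of all coordinates (Morton code),
    most significant bit first."""
    key = 0
    for bit in range(bits - 1, -1, -1):
        for coord in p:
            key = (key << 1) | ((coord >> bit) & 1)
    return key


def partition_by_z_order(points: Sequence[Point], num_parts: int,
                         bits: int = 8) -> List[List[Point]]:
    """Phase 1 (simplified): sort along the Z-order curve and cut the
    sequence into contiguous chunks of roughly equal size."""
    ordered = sorted(points, key=lambda p: z_order_key(p, bits))
    size = max(1, -(-len(ordered) // num_parts))  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def three_phase_skyline(points: Sequence[Point], num_parts: int = 4,
                        bits: int = 8) -> List[Point]:
    parts = partition_by_z_order(points, num_parts, bits)
    # Phase 2: local skylines (run sequentially here; each partition is
    # independent, so this step could be distributed across workers)
    candidates = [p for part in parts for p in skyline(part)]
    # Phase 3 (simplified): merge candidates with another skyline pass
    return skyline(candidates)


points = [(1, 9), (2, 7), (3, 8), (4, 4), (5, 6),
          (6, 2), (7, 5), (8, 1), (9, 9), (3, 3)]
result = sorted(three_phase_skyline(points, num_parts=3, bits=4))
print(result)
```

The sketch relies on the fact that every global skyline point is also a skyline point of its own partition, so computing the skyline of the union of the local candidates yields the same result as a skyline over the full input.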

Original language: English
Title of host publication: Proceedings - 2019 IEEE 35th International Conference on Data Engineering, ICDE 2019
Publisher: IEEE Computer Society
Pages: 2113-2114
Number of pages: 2
ISBN (Electronic): 9781538674741
DOI: https://doi.org/10.1109/ICDE.2019.00251
Publication status: Published - 1 Apr 2019
Event: 35th IEEE International Conference on Data Engineering, ICDE 2019 - Macau, China
Duration: 8 Apr 2019 to 11 Apr 2019

Publication series

Name: Proceedings - International Conference on Data Engineering
Volume: 2019-April
ISSN (Print): 1084-4627

Conference

Conference: 35th IEEE International Conference on Data Engineering, ICDE 2019
Country: China
City: Macau
Period: 8/4/19 to 11/4/19

Fingerprint

Query processing
Set theory
Processing
Merging
Decision making
Costs
Experiments

Keywords

  • Big data
  • Parallel computation
  • Skyline query

ASJC Scopus subject areas

  • Software
  • Signal Processing
  • Information Systems

Cite this

Tang, M., Yu, Y., Aref, W. G., Malluhi, Q. M., & Ouzzani, M. (2019). Efficient parallel skyline query processing for high-dimensional data. In Proceedings - 2019 IEEE 35th International Conference on Data Engineering, ICDE 2019 (pp. 2113-2114). [8731496] (Proceedings - International Conference on Data Engineering; Vol. 2019-April). IEEE Computer Society. https://doi.org/10.1109/ICDE.2019.00251

Efficient parallel skyline query processing for high-dimensional data. / Tang, Mingjie; Yu, Yongyang; Aref, Walid G.; Malluhi, Qutaibah M.; Ouzzani, Mourad.

Proceedings - 2019 IEEE 35th International Conference on Data Engineering, ICDE 2019. IEEE Computer Society, 2019. p. 2113-2114 8731496 (Proceedings - International Conference on Data Engineering; Vol. 2019-April).

Tang, M, Yu, Y, Aref, WG, Malluhi, QM & Ouzzani, M 2019, Efficient parallel skyline query processing for high-dimensional data. in Proceedings - 2019 IEEE 35th International Conference on Data Engineering, ICDE 2019., 8731496, Proceedings - International Conference on Data Engineering, vol. 2019-April, IEEE Computer Society, pp. 2113-2114, 35th IEEE International Conference on Data Engineering, ICDE 2019, Macau, China, 8/4/19. https://doi.org/10.1109/ICDE.2019.00251
Tang M, Yu Y, Aref WG, Malluhi QM, Ouzzani M. Efficient parallel skyline query processing for high-dimensional data. In Proceedings - 2019 IEEE 35th International Conference on Data Engineering, ICDE 2019. IEEE Computer Society. 2019. p. 2113-2114. 8731496. (Proceedings - International Conference on Data Engineering). https://doi.org/10.1109/ICDE.2019.00251
Tang, Mingjie ; Yu, Yongyang ; Aref, Walid G. ; Malluhi, Qutaibah M. ; Ouzzani, Mourad. / Efficient parallel skyline query processing for high-dimensional data. Proceedings - 2019 IEEE 35th International Conference on Data Engineering, ICDE 2019. IEEE Computer Society, 2019. pp. 2113-2114 (Proceedings - International Conference on Data Engineering).
@inproceedings{4d7ebe912c5c447888a52271d428157c,
title = "Efficient parallel skyline query processing for high-dimensional data",
keywords = "Big data, Parallel computation, Skyline query",
author = "Mingjie Tang and Yongyang Yu and Aref, {Walid G.} and Malluhi, {Qutaibah M.} and Mourad Ouzzani",
year = "2019",
month = "4",
day = "1",
doi = "10.1109/ICDE.2019.00251",
language = "English",
series = "Proceedings - International Conference on Data Engineering",
publisher = "IEEE Computer Society",
pages = "2113--2114",
booktitle = "Proceedings - 2019 IEEE 35th International Conference on Data Engineering, ICDE 2019",

}

TY - GEN
T1 - Efficient parallel skyline query processing for high-dimensional data
AU - Tang, Mingjie
AU - Yu, Yongyang
AU - Aref, Walid G.
AU - Malluhi, Qutaibah M.
AU - Ouzzani, Mourad
PY - 2019/4/1
Y1 - 2019/4/1
N2 - Given a set of multidimensional data points, skyline queries retrieve those points that are not dominated by any other points in the set. Due to the ubiquitous use of skyline queries, such as in preference-based query answering and decision making, and the large amount of data that these queries have to deal with, enabling their scalable processing is of critical importance. However, there are several outstanding challenges that have not been well addressed. More specifically, in this paper, we are tackling the data straggler and data skew challenges introduced by distributed skyline query processing, as well as the ensuing high computation cost of merging skyline candidates. We thus introduce a new efficient three-phase approach for large scale processing of skyline queries. In the first preprocessing phase, the data is partitioned along the Z-order curve. We utilize a novel data partitioning approach that formulates data partitioning as an optimization problem to minimize the size of intermediate data. In the second phase, each compute node partitions the input data points into disjoint subsets, and then performs the skyline computation on each subset to produce skyline candidates in parallel. In the final phase, we build an index and employ an efficient algorithm to merge the generated skyline candidates. Extensive experiments demonstrate that the proposed skyline algorithm achieves more than one order of magnitude enhancement in performance compared to existing state-of-the-art approaches.

KW - Big data
KW - Parallel computation
KW - Skyline query
UR - http://www.scopus.com/inward/record.url?scp=85067990764&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85067990764&partnerID=8YFLogxK
U2 - 10.1109/ICDE.2019.00251
DO - 10.1109/ICDE.2019.00251
M3 - Conference contribution
AN - SCOPUS:85067990764
T3 - Proceedings - International Conference on Data Engineering
SP - 2113
EP - 2114
BT - Proceedings - 2019 IEEE 35th International Conference on Data Engineering, ICDE 2019
PB - IEEE Computer Society
ER -