Efficient Parallel Skyline Query Processing for High-Dimensional Data

Mingjie Tang, Yongyang Yu, Walid G. Aref, Qutaibah Malluhi, Mourad Ouzzani

Research output: Contribution to journal › Article

4 Citations (Scopus)

Abstract

Given a set of multidimensional data points, a skyline query retrieves the points that are not dominated by any other point in the set. Although skyline queries are widely used, several outstanding challenges have not been well addressed. In this paper, we tackle the data straggler and data skew problems that arise in distributed skyline query processing, as well as the resulting high cost of merging skyline candidates. We introduce an efficient three-phase approach for large-scale processing of skyline queries. In the first (preprocessing) phase, the data is partitioned along the Z-order curve, using a novel approach that formulates data partitioning as an optimization problem to minimize the size of intermediate data. In the second phase, each computation node partitions its input data points into separate sets and then computes the skyline of each set in parallel to produce skyline candidates. In the final phase, we build an index and employ an efficient algorithm to merge the generated skyline candidates. Extensive experiments demonstrate that the proposed algorithm improves performance by more than one order of magnitude compared to existing state-of-the-art approaches.
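
The ideas named in the abstract (dominance, Z-order partitioning, local skyline candidates, and a final merge) can be illustrated with a minimal single-process sketch. This is not the authors' algorithm: it assumes smaller values are better in every dimension, splits the Z-ordered points into equal-sized chunks instead of solving the optimization problem the paper describes, and merges candidates with a naive quadratic scan instead of an index. All function names (`dominates`, `skyline`, `z_value`, `partitioned_skyline`) are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[int, ...]


def dominates(p: Point, q: Point) -> bool:
    """True if p is no worse than q in every dimension and strictly
    better in at least one (assumption: smaller values are better)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def skyline(points: Sequence[Point]) -> List[Point]:
    """Naive O(n^2) skyline: keep every point that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def z_value(p: Point, bits: int = 8) -> int:
    """Z-order (Morton) code: interleave the low `bits` bits of each
    non-negative integer coordinate."""
    d = len(p)
    z = 0
    for bit in range(bits):
        for i, c in enumerate(p):
            z |= ((c >> bit) & 1) << (bit * d + i)
    return z


def partitioned_skyline(points: Sequence[Point], num_partitions: int = 4) -> List[Point]:
    """Illustrative three-step pipeline (a simplification, see lead-in)."""
    # Step 1: order points along the Z-order curve and cut into chunks.
    ordered = sorted(points, key=z_value)
    size = max(1, -(-len(ordered) // num_partitions))  # ceiling division
    chunks = [ordered[i:i + size] for i in range(0, len(ordered), size)]
    # Step 2: compute a local skyline per chunk (would run in parallel).
    candidates = [p for chunk in chunks for p in skyline(chunk)]
    # Step 3: merge by computing the skyline of all candidates.
    return skyline(candidates)


if __name__ == "__main__":
    data = [(1, 9), (2, 7), (3, 8), (4, 4), (6, 5), (7, 2), (8, 3), (9, 1), (5, 6)]
    print(sorted(partitioned_skyline(data)))
```

The merge step is correct because every global skyline point survives in its own chunk's local skyline, and every other candidate is dominated by some global skyline point. The sketch does nothing to address the stragglers, skew, or merge cost that the paper targets.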

Original language: English
Journal: IEEE Transactions on Knowledge and Data Engineering
DOI: 10.1109/TKDE.2018.2809598
Publication status: Accepted/In press - 23 Feb 2018

Fingerprint

Query processing
Merging
Processing
Costs
Experiments

Keywords

  • Computer science
  • Distributed databases
  • high-dimensional data
  • Indexes
  • Merging
  • parallel computing
  • Partitioning algorithms
  • query processing
  • skyline query
  • Task analysis

ASJC Scopus subject areas

  • Information Systems
  • Computer Science Applications
  • Computational Theory and Mathematics

Cite this

Efficient Parallel Skyline Query Processing for High-Dimensional Data. / Tang, Mingjie; Yu, Yongyang; Aref, Walid G.; Malluhi, Qutaibah; Ouzzani, Mourad.

In: IEEE Transactions on Knowledge and Data Engineering, 23.02.2018.

@article{abf0bb9af8f541f3b0585d51af247de3,
title = "Efficient Parallel Skyline Query Processing for High-Dimensional Data",
abstract = "Given a set of multidimensional data points, skyline queries retrieve those points that are not dominated by any other points in the set. Due to the ubiquitous use of skyline queries, there are several outstanding challenges that have not been well addressed. More specifically, in this paper, we are tackling the data straggler and data skew challenges introduced by distributed skyline query processing as well as the ensuing high computing cost of merging skyline candidates. We thus introduce a new efficient three-phase approach for large scale processing of skyline queries. In the first preprocessing phase, the data is partitioned along the Z-order curve. We utilize a novel data partitioning approach that formulates data partitioning as an optimization problem to minimize the size of intermediate data. In the second phase, each computation node partitions the input data points into separate sets, and then performs the skyline computation on each set to produce skyline candidates in parallel. In the final phase, we build an index and employ an efficient algorithm to merge the generated skyline candidates. Extensive experiments demonstrate that the proposed skyline algorithm achieves more than one order of magnitude enhancement in performance compared to existing state-of-the-art approaches.",
keywords = "Computer science, Distributed databases, high-dimensional data, Indexes, Merging, parallel computing, Partitioning algorithms, query processing, skyline query, Task analysis",
author = "Tang, Mingjie and Yu, Yongyang and Aref, {Walid G.} and Malluhi, Qutaibah and Ouzzani, Mourad",
year = "2018",
month = "2",
day = "23",
doi = "10.1109/TKDE.2018.2809598",
language = "English",
journal = "IEEE Transactions on Knowledge and Data Engineering",
issn = "1041-4347",
publisher = "IEEE Computer Society",
}

TY - JOUR

T1 - Efficient Parallel Skyline Query Processing for High-Dimensional Data

AU - Tang, Mingjie

AU - Yu, Yongyang

AU - Aref, Walid G.

AU - Malluhi, Qutaibah

AU - Ouzzani, Mourad

PY - 2018/2/23

Y1 - 2018/2/23

AB - Given a set of multidimensional data points, skyline queries retrieve those points that are not dominated by any other points in the set. Due to the ubiquitous use of skyline queries, there are several outstanding challenges that have not been well addressed. More specifically, in this paper, we are tackling the data straggler and data skew challenges introduced by distributed skyline query processing as well as the ensuing high computing cost of merging skyline candidates. We thus introduce a new efficient three-phase approach for large scale processing of skyline queries. In the first preprocessing phase, the data is partitioned along the Z-order curve. We utilize a novel data partitioning approach that formulates data partitioning as an optimization problem to minimize the size of intermediate data. In the second phase, each computation node partitions the input data points into separate sets, and then performs the skyline computation on each set to produce skyline candidates in parallel. In the final phase, we build an index and employ an efficient algorithm to merge the generated skyline candidates. Extensive experiments demonstrate that the proposed skyline algorithm achieves more than one order of magnitude enhancement in performance compared to existing state-of-the-art approaches.

KW - Computer science

KW - Distributed databases

KW - high-dimensional data

KW - Indexes

KW - Merging

KW - parallel computing

KW - Partitioning algorithms

KW - query processing

KW - skyline query

KW - Task analysis

UR - http://www.scopus.com/inward/record.url?scp=85042717489&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85042717489&partnerID=8YFLogxK

U2 - 10.1109/TKDE.2018.2809598

DO - 10.1109/TKDE.2018.2809598

M3 - Article

JO - IEEE Transactions on Knowledge and Data Engineering

JF - IEEE Transactions on Knowledge and Data Engineering

SN - 1041-4347

ER -