Spatial coding-based approach for partitioning big spatial data in Hadoop

Xiaochuang Yao, Mohamed Mokbel, Louai Alarabi, Ahmed Eldawy, Jianyu Yang, Wenju Yun, Lin Li, Sijing Ye, Dehai Zhu

Research output: Contribution to journal (Article)

7 Citations (Scopus)

Abstract

Spatial data partitioning (SDP) plays an important role in distributed storage and parallel computing for spatial data. However, because spatial data are unevenly distributed and spatial vector objects vary in volume, it is a significant challenge to ensure both good performance of spatial operations and data balance across the cluster. To tackle this problem, we propose a spatial coding-based approach for partitioning big spatial data in Hadoop. The approach first compresses the whole dataset, using a spatial coding matrix, into a sensing information set (SIS) that records the spatial code, size, count and other information. The SIS is then used to build a spatial partitioning matrix, which in turn is used to split all spatial objects into different partitions in the cluster. With this approach, neighbouring spatial objects can be placed in the same block while data skew in the Hadoop distributed file system (HDFS) is minimized. In a case study, the approach is compared against random sampling-based partitioning on three measures: spatial index quality, data skew in HDFS, and range query performance. The experimental results show that the spatial coding-based method improves both the query performance of big spatial data and the data balance in HDFS. We implemented and deployed the approach in Hadoop, and it can also efficiently support other distributed big spatial data systems.
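
The pipeline described in the abstract (spatial coding, then an SIS of per-code size and count, then a partitioning matrix) can be illustrated with a minimal sketch. The abstract does not say which spatial code or packing rule the authors use, so everything below is an assumption made for illustration: a Z-order (Morton) code over a regular grid stands in for the spatial coding matrix, and a greedy walk over sorted codes stands in for building the partitioning matrix. The function names (`morton_code`, `cell_of`, `build_sis`, `build_partitions`) and the `block_size` parameter are hypothetical and do not come from the paper.

```python
from collections import defaultdict


def morton_code(col, row, bits=16):
    """Interleave the bits of (col, row) into one Z-order code (assumed coding scheme)."""
    code = 0
    for i in range(bits):
        code |= ((col >> i) & 1) << (2 * i)        # column bits go to even positions
        code |= ((row >> i) & 1) << (2 * i + 1)    # row bits go to odd positions
    return code


def cell_of(x, y, extent, n):
    """Map a point to its (col, row) cell in an n x n grid covering extent."""
    min_x, min_y, max_x, max_y = extent
    col = min(int((x - min_x) / (max_x - min_x) * n), n - 1)
    row = min(int((y - min_y) / (max_y - min_y) * n), n - 1)
    return col, row


def build_sis(objects, extent, n):
    """Summarise (x, y, size_bytes) objects as {code: [total_size, count]}."""
    sis = defaultdict(lambda: [0, 0])
    for x, y, size in objects:
        code = morton_code(*cell_of(x, y, extent, n))
        sis[code][0] += size
        sis[code][1] += 1
    return dict(sis)


def build_partitions(sis, block_size):
    """Greedily assign consecutive codes to partitions of roughly block_size bytes."""
    code_to_partition = {}
    pid, used = 0, 0
    for code in sorted(sis):
        size = sis[code][0]
        # Open a new partition when this cell would overflow the current one.
        if used > 0 and used + size > block_size:
            pid += 1
            used = 0
        code_to_partition[code] = pid
        used += size
    return code_to_partition


if __name__ == "__main__":
    objects = [(0.1, 0.1, 40), (0.2, 0.2, 40), (0.9, 0.9, 40), (0.8, 0.1, 40)]
    sis = build_sis(objects, extent=(0.0, 0.0, 1.0, 1.0), n=2)
    print(sis)
    print(build_partitions(sis, block_size=100))
```

Because the summary is built from every object rather than from a sample, the partition boundaries in this sketch reflect actual byte volumes per cell, and walking the codes in order keeps spatially close cells in the same partition. A single cell larger than `block_size` still gets its own oversized partition here; a real implementation would need to subdivide it.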

Original language: English
Pages (from-to): 60-67
Number of pages: 8
Journal: Computers and Geosciences
Volume: 106
DOIs: https://doi.org/10.1016/j.cageo.2017.05.014
Publication status: Published - 1 Sep 2017
Externally published: Yes

Keywords

  • Big spatial data
  • Hadoop
  • Spatial coding-based approach
  • Spatial data partitioning

ASJC Scopus subject areas

  • Information Systems
  • Computers in Earth Sciences

Cite this

Spatial coding-based approach for partitioning big spatial data in Hadoop. / Yao, Xiaochuang; Mokbel, Mohamed; Alarabi, Louai; Eldawy, Ahmed; Yang, Jianyu; Yun, Wenju; Li, Lin; Ye, Sijing; Zhu, Dehai.

In: Computers and Geosciences, Vol. 106, 01.09.2017, p. 60-67.

Yao, X, Mokbel, M, Alarabi, L, Eldawy, A, Yang, J, Yun, W, Li, L, Ye, S & Zhu, D 2017, 'Spatial coding-based approach for partitioning big spatial data in Hadoop', Computers and Geosciences, vol. 106, pp. 60-67. https://doi.org/10.1016/j.cageo.2017.05.014
@article{2847d0a7bd9642d684c9c51e823776a6,
title = "Spatial coding-based approach for partitioning big spatial data in Hadoop",
keywords = "Big spatial data, Hadoop, Spatial coding-based approach, Spatial data partitioning",
author = "Xiaochuang Yao and Mohamed Mokbel and Louai Alarabi and Ahmed Eldawy and Jianyu Yang and Wenju Yun and Lin Li and Sijing Ye and Dehai Zhu",
year = "2017",
month = "9",
day = "1",
doi = "10.1016/j.cageo.2017.05.014",
language = "English",
volume = "106",
pages = "60--67",
journal = "Computers and Geosciences",
issn = "0098-3004",
publisher = "Elsevier Limited",

}
