LocationSpark

A distributed in-memory data management system for big spatial data

Mingjie Tang, Yongyang Yu, Qutaibah M. Malluhi, Mourad Ouzzani, Walid G. Aref

Research output: Contribution to journal › Conference article

45 Citations (Scopus)

Abstract

We present LocationSpark, a spatial data processing system built on top of Apache Spark, a widely used distributed data processing system. LocationSpark offers a rich set of spatial query operators, e.g., range search, kNN, spatio-textual operation, spatial-join, and kNN-join. To achieve high performance, LocationSpark employs various spatial indexes for in-memory data, and guarantees that immutable spatial indexes have low overhead with fault tolerance. In addition, we build two new layers over Spark, namely a query scheduler and a query executor. The query scheduler is responsible for mitigating skew in spatial queries, while the query executor selects the best plan based on the indexes and the nature of the spatial queries. Furthermore, to avoid unnecessary network communication overhead when processing overlapped spatial data, we embed an efficient spatial Bloom filter into LocationSpark's indexes. Finally, LocationSpark tracks frequently accessed spatial data, and dynamically flushes less frequently accessed data to disk. We evaluate our system on real workloads and demonstrate that it achieves an order-of-magnitude performance gain over a baseline framework.

Original language: English
Pages (from-to): 1565-1568
Number of pages: 4
Journal: Proceedings of the VLDB Endowment
Volume: 9
Issue number: 13
Publication status: Published - 1 Jan 2015

Fingerprint

Electric sparks
Information management
Data storage equipment
Fault tolerance
Telecommunication networks
Processing

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • Computer Science (all)

Cite this

LocationSpark: A distributed in-memory data management system for big spatial data. / Tang, Mingjie; Yu, Yongyang; Malluhi, Qutaibah M.; Ouzzani, Mourad; Aref, Walid G.

In: Proceedings of the VLDB Endowment, Vol. 9, No. 13, 01.01.2015, p. 1565-1568.

Tang, M, Yu, Y, Malluhi, QM, Ouzzani, M & Aref, WG 2015, 'LocationSpark: A distributed in-memory data management system for big spatial data', Proceedings of the VLDB Endowment, vol. 9, no. 13, pp. 1565-1568.
Tang, Mingjie; Yu, Yongyang; Malluhi, Qutaibah M.; Ouzzani, Mourad; Aref, Walid G. / LocationSpark: A distributed in-memory data management system for big spatial data. In: Proceedings of the VLDB Endowment. 2015; Vol. 9, No. 13. pp. 1565-1568.
@article{876832758d4f44f68547836c01159722,
title = "LocationSpark: A distributed in-memory data management system for big spatial data",
abstract = "We present LocationSpark, a spatial data processing system built on top of Apache Spark, a widely used distributed data processing system. LocationSpark offers a rich set of spatial query operators, e.g., range search, kNN, spatio-textual operation, spatial-join, and kNN-join. To achieve high performance, LocationSpark employs various spatial indexes for in-memory data, and guarantees that immutable spatial indexes have low overhead with fault tolerance. In addition, we build two new layers over Spark, namely a query scheduler and a query executor. The query scheduler is responsible for mitigating skew in spatial queries, while the query executor selects the best plan based on the indexes and the nature of the spatial queries. Furthermore, to avoid unnecessary network communication overhead when processing overlapped spatial data, we embed an efficient spatial Bloom filter into LocationSpark's indexes. Finally, LocationSpark tracks frequently accessed spatial data, and dynamically flushes less frequently accessed data to disk. We evaluate our system on real workloads and demonstrate that it achieves an order-of-magnitude performance gain over a baseline framework.",
author = "Mingjie Tang and Yongyang Yu and Malluhi, {Qutaibah M.} and Mourad Ouzzani and Aref, {Walid G.}",
year = "2015",
month = "1",
day = "1",
language = "English",
volume = "9",
pages = "1565--1568",
journal = "Proceedings of the VLDB Endowment",
issn = "2150-8097",
publisher = "Very Large Data Base Endowment Inc.",
number = "13",
}

TY - JOUR

T1 - LocationSpark

T2 - A distributed in-memory data management system for big spatial data

AU - Tang, Mingjie

AU - Yu, Yongyang

AU - Malluhi, Qutaibah M.

AU - Ouzzani, Mourad

AU - Aref, Walid G.

PY - 2015/1/1

Y1 - 2015/1/1

N2 - We present LocationSpark, a spatial data processing system built on top of Apache Spark, a widely used distributed data processing system. LocationSpark offers a rich set of spatial query operators, e.g., range search, kNN, spatio-textual operation, spatial-join, and kNN-join. To achieve high performance, LocationSpark employs various spatial indexes for in-memory data, and guarantees that immutable spatial indexes have low overhead with fault tolerance. In addition, we build two new layers over Spark, namely a query scheduler and a query executor. The query scheduler is responsible for mitigating skew in spatial queries, while the query executor selects the best plan based on the indexes and the nature of the spatial queries. Furthermore, to avoid unnecessary network communication overhead when processing overlapped spatial data, we embed an efficient spatial Bloom filter into LocationSpark's indexes. Finally, LocationSpark tracks frequently accessed spatial data, and dynamically flushes less frequently accessed data to disk. We evaluate our system on real workloads and demonstrate that it achieves an order-of-magnitude performance gain over a baseline framework.

AB - We present LocationSpark, a spatial data processing system built on top of Apache Spark, a widely used distributed data processing system. LocationSpark offers a rich set of spatial query operators, e.g., range search, kNN, spatio-textual operation, spatial-join, and kNN-join. To achieve high performance, LocationSpark employs various spatial indexes for in-memory data, and guarantees that immutable spatial indexes have low overhead with fault tolerance. In addition, we build two new layers over Spark, namely a query scheduler and a query executor. The query scheduler is responsible for mitigating skew in spatial queries, while the query executor selects the best plan based on the indexes and the nature of the spatial queries. Furthermore, to avoid unnecessary network communication overhead when processing overlapped spatial data, we embed an efficient spatial Bloom filter into LocationSpark's indexes. Finally, LocationSpark tracks frequently accessed spatial data, and dynamically flushes less frequently accessed data to disk. We evaluate our system on real workloads and demonstrate that it achieves an order-of-magnitude performance gain over a baseline framework.

UR - http://www.scopus.com/inward/record.url?scp=85018667195&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85018667195&partnerID=8YFLogxK

M3 - Conference article

VL - 9

SP - 1565

EP - 1568

JO - Proceedings of the VLDB Endowment

JF - Proceedings of the VLDB Endowment

SN - 2150-8097

IS - 13

ER -