Fast and scalable inequality joins

Zuhair Khayyat, William Lucia, Meghna Singh, Mourad Ouzzani, Paolo Papotti, Jorge Arnulfo Quiane Ruiz, Nan Tang, Panos Kalnis

Research output: Contribution to journal › Article

5 Citations (Scopus)

Abstract

Inequality joins, which join relations using inequality conditions, are used in various applications. Optimizing joins has been the subject of intensive research, ranging from efficient join algorithms such as sort-merge join to the use of efficient indices such as the B+-tree, the R*-tree, and Bitmap. However, inequality joins have received little attention, and queries containing such joins are notably slow. In this paper, we introduce fast inequality join algorithms based on sorted arrays and space-efficient bit-arrays. We further introduce a simple method to estimate the selectivity of inequality joins, which is then used to optimize multi-predicate queries and multi-way joins. Moreover, we study an incremental inequality join algorithm to handle scenarios where data keeps changing. We have implemented a centralized version of these algorithms on top of PostgreSQL, a distributed version on top of Spark SQL, and a version on top of an existing data cleaning system, Nadeef. By comparing our algorithms against well-known optimization techniques for inequality joins, we show that our solution is more scalable and several orders of magnitude faster.
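The abstract names sorted arrays and bit-arrays as the building blocks but gives no algorithmic detail. The following is a minimal illustrative sketch of that general idea for a self-join with two inequality predicates (r.x < s.x and r.y > s.y). It is not the authors' implementation: the function name `inequality_self_join`, the tuple layout `(x, y)`, and the use of a `bytearray` in place of a packed bit-array are all assumptions made for illustration.

```python
from bisect import bisect_left
from itertools import groupby


def inequality_self_join(rows):
    """Return index pairs (i, j) with rows[i].x < rows[j].x and rows[i].y > rows[j].y.

    rows is a list of (x, y) tuples (a hypothetical layout chosen for this sketch).
    """
    n = len(rows)

    # Sorted array on x, plus each row's position inside that array.
    order_x = sorted(range(n), key=lambda i: rows[i][0])
    xs_sorted = [rows[i][0] for i in order_x]
    pos_x = [0] * n
    for p, i in enumerate(order_x):
        pos_x[i] = p

    # Visit rows from largest y to smallest y.
    order_y = sorted(range(n), key=lambda i: rows[i][1], reverse=True)

    # One flag per position in the x-sorted array. A bytearray keeps the
    # sketch short; a real bit-array would pack eight flags per byte.
    seen = bytearray(n)
    result = []

    # Rows with equal y are handled as one group so that the y predicate
    # stays strict: a group is queried first and marked afterwards.
    for _, group in groupby(order_y, key=lambda i: rows[i][1]):
        group = list(group)
        for j in group:
            # Positions [0, limit) hold x values strictly below rows[j].x.
            limit = bisect_left(xs_sorted, rows[j][0])
            for p in range(limit):
                if seen[p]:  # marked earlier, so its y is strictly larger
                    result.append((order_x[p], j))
        for j in group:
            seen[pos_x[j]] = 1
    return result


rows = [(1, 5), (2, 3), (3, 4), (2, 6), (4, 1)]
pairs = inequality_self_join(rows)
```

In this sketch, sorting fixes one predicate through the visiting order and the other through array position, so each probe reduces to scanning flags rather than comparing attribute values. The scan is still linear per row in the worst case; the sketch makes no claim about matching the performance reported in the paper.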

Original language: English
Pages (from-to): 1-26
Number of pages: 26
Journal: VLDB Journal
DOIs: 10.1007/s00778-016-0441-6
Publication status: Accepted/In press - 7 Sep 2016

Keywords

  • Incremental
  • Inequality join
  • PostgreSQL
  • Selectivity estimation
  • Spark SQL

ASJC Scopus subject areas

  • Information Systems
  • Hardware and Architecture

Cite this

Fast and scalable inequality joins. / Khayyat, Zuhair; Lucia, William; Singh, Meghna; Ouzzani, Mourad; Papotti, Paolo; Quiane Ruiz, Jorge Arnulfo; Tang, Nan; Kalnis, Panos.

In: VLDB Journal, 07.09.2016, p. 1-26.

Khayyat, Zuhair ; Lucia, William ; Singh, Meghna ; Ouzzani, Mourad ; Papotti, Paolo ; Quiane Ruiz, Jorge Arnulfo ; Tang, Nan ; Kalnis, Panos. / Fast and scalable inequality joins. In: VLDB Journal. 2016 ; pp. 1-26.
@article{622df0bbf1fa4e7d92769dc06b491ba3,
title = "Fast and scalable inequality joins",
abstract = "Inequality joins, which join relations using inequality conditions, are used in various applications. Optimizing joins has been the subject of intensive research, ranging from efficient join algorithms such as sort-merge join to the use of efficient indices such as the B+-tree, the R*-tree, and Bitmap. However, inequality joins have received little attention, and queries containing such joins are notably slow. In this paper, we introduce fast inequality join algorithms based on sorted arrays and space-efficient bit-arrays. We further introduce a simple method to estimate the selectivity of inequality joins, which is then used to optimize multi-predicate queries and multi-way joins. Moreover, we study an incremental inequality join algorithm to handle scenarios where data keeps changing. We have implemented a centralized version of these algorithms on top of PostgreSQL, a distributed version on top of Spark SQL, and a version on top of an existing data cleaning system, Nadeef. By comparing our algorithms against well-known optimization techniques for inequality joins, we show that our solution is more scalable and several orders of magnitude faster.",
keywords = "Incremental, Inequality join, PostgreSQL, Selectivity estimation, Spark SQL",
author = "Zuhair Khayyat and William Lucia and Meghna Singh and Mourad Ouzzani and Paolo Papotti and {Quiane Ruiz}, {Jorge Arnulfo} and Nan Tang and Panos Kalnis",
year = "2016",
month = "9",
day = "7",
doi = "10.1007/s00778-016-0441-6",
language = "English",
pages = "1--26",
journal = "VLDB Journal",
issn = "1066-8888",
publisher = "Springer New York",

}

TY - JOUR

T1 - Fast and scalable inequality joins

AU - Khayyat, Zuhair

AU - Lucia, William

AU - Singh, Meghna

AU - Ouzzani, Mourad

AU - Papotti, Paolo

AU - Quiane Ruiz, Jorge Arnulfo

AU - Tang, Nan

AU - Kalnis, Panos

PY - 2016/9/7

Y1 - 2016/9/7

N2 - Inequality joins, which join relations using inequality conditions, are used in various applications. Optimizing joins has been the subject of intensive research, ranging from efficient join algorithms such as sort-merge join to the use of efficient indices such as the B+-tree, the R*-tree, and Bitmap. However, inequality joins have received little attention, and queries containing such joins are notably slow. In this paper, we introduce fast inequality join algorithms based on sorted arrays and space-efficient bit-arrays. We further introduce a simple method to estimate the selectivity of inequality joins, which is then used to optimize multi-predicate queries and multi-way joins. Moreover, we study an incremental inequality join algorithm to handle scenarios where data keeps changing. We have implemented a centralized version of these algorithms on top of PostgreSQL, a distributed version on top of Spark SQL, and a version on top of an existing data cleaning system, Nadeef. By comparing our algorithms against well-known optimization techniques for inequality joins, we show that our solution is more scalable and several orders of magnitude faster.

AB - Inequality joins, which join relations using inequality conditions, are used in various applications. Optimizing joins has been the subject of intensive research, ranging from efficient join algorithms such as sort-merge join to the use of efficient indices such as the B+-tree, the R*-tree, and Bitmap. However, inequality joins have received little attention, and queries containing such joins are notably slow. In this paper, we introduce fast inequality join algorithms based on sorted arrays and space-efficient bit-arrays. We further introduce a simple method to estimate the selectivity of inequality joins, which is then used to optimize multi-predicate queries and multi-way joins. Moreover, we study an incremental inequality join algorithm to handle scenarios where data keeps changing. We have implemented a centralized version of these algorithms on top of PostgreSQL, a distributed version on top of Spark SQL, and a version on top of an existing data cleaning system, Nadeef. By comparing our algorithms against well-known optimization techniques for inequality joins, we show that our solution is more scalable and several orders of magnitude faster.

KW - Incremental

KW - Inequality join

KW - PostgreSQL

KW - Selectivity estimation

KW - Spark SQL

UR - http://www.scopus.com/inward/record.url?scp=84986275869&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84986275869&partnerID=8YFLogxK

U2 - 10.1007/s00778-016-0441-6

DO - 10.1007/s00778-016-0441-6

M3 - Article

SP - 1

EP - 26

JO - VLDB Journal

JF - VLDB Journal

SN - 1066-8888

ER -