Efficient processing of hamming-distance-based similarity-search queries over MapReduce

Mingjie Tang, Yongyang Yu, Walid G. Aref, Qutaibah M. Malluhi, Mourad Ouzzani

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

7 Citations (Scopus)

Abstract

Similarity search is crucial to many applications. Of particular interest are two flavors of the Hamming distance range query, namely, the Hamming select and the Hamming join (Hamming-select and Hamming-join, respectively). Hamming distance is widely used in approximate near-neighbor search for high-dimensional data, such as images and document collections. For example, using predefined similarity hash functions, high-dimensional data is mapped into one-dimensional binary codes that are then linearly scanned to perform Hamming-distance comparisons. These distance comparisons on the binary codes are usually costly and often involve excessive redundancies. This paper introduces a new index, termed the HA-Index, that speeds up distance comparisons and eliminates redundancies when performing the two flavors of Hamming distance range queries. An efficient search algorithm based on the HA-Index is presented. A distributed version of the HA-Index is introduced, and algorithms for realizing Hamming-select and Hamming-join operations on a MapReduce platform are prototyped. Extensive experiments using real datasets demonstrate that the HA-Index and the corresponding search algorithms achieve up to two orders of magnitude speedup over existing state-of-the-art approaches, while reducing memory usage by more than a factor of ten.
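
As a rough illustration of the two query types named in the abstract, the sketch below implements the linear-scan baseline that the abstract describes as costly. It is not the HA-Index, whose structure is not given in this record. The function names, the integer encoding of binary codes, and the sample values are assumptions made for the example.

```python
from typing import Iterable, List, Tuple


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two binary codes differ."""
    return bin(a ^ b).count("1")


def hamming_select(query: int, codes: Iterable[int], threshold: int) -> List[int]:
    """Linear scan: keep every code within `threshold` bits of `query`."""
    return [c for c in codes if hamming_distance(query, c) <= threshold]


def hamming_join(
    left: Iterable[int], right: Iterable[int], threshold: int
) -> List[Tuple[int, int]]:
    """Nested-loop join: every (l, r) pair within `threshold` bits."""
    right_codes = list(right)  # materialize so it can be scanned once per left code
    return [
        (l, r)
        for l in left
        for r in right_codes
        if hamming_distance(l, r) <= threshold
    ]


# Hypothetical 4-bit codes, as a similarity hash function might produce.
codes = [0b1011, 0b1000, 0b0111, 0b1111]

print(hamming_select(0b1010, codes, threshold=1))          # [11, 8]
print(hamming_join([0b1010, 0b0110], codes, threshold=1))  # [(10, 11), (10, 8), (6, 7)]
```

The select compares the query against every code, and the join compares every pair, so the cost grows with the product of the input sizes. This is the kind of repeated comparison an index is meant to avoid.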

Original language: English
Title of host publication: EDBT 2015 - 18th International Conference on Extending Database Technology, Proceedings
Publisher: OpenProceedings.org, University of Konstanz, University Library
Pages: 361-372
Number of pages: 12
ISBN (Electronic): 9783893180677
DOIs: https://doi.org/10.5441/002/edbt.2015.32
Publication status: Published - 2015
Event: 18th International Conference on Extending Database Technology, EDBT 2015 - Brussels, Belgium
Duration: 23 Mar 2015 - 27 Mar 2015

Other

Other: 18th International Conference on Extending Database Technology, EDBT 2015
Country: Belgium
City: Brussels
Period: 23/3/15 - 27/3/15

Fingerprint

Hamming distance
Processing
Binary codes
Flavors
Redundancy
Hash functions
Data storage equipment

ASJC Scopus subject areas

  • Information Systems
  • Software

Cite this

Tang, M., Yu, Y., Aref, W. G., Malluhi, Q. M., & Ouzzani, M. (2015). Efficient processing of hamming-distance-based similarity-search queries over MapReduce. In EDBT 2015 - 18th International Conference on Extending Database Technology, Proceedings (pp. 361-372). OpenProceedings.org, University of Konstanz, University Library. https://doi.org/10.5441/002/edbt.2015.32

Efficient processing of hamming-distance-based similarity-search queries over MapReduce. / Tang, Mingjie; Yu, Yongyang; Aref, Walid G.; Malluhi, Qutaibah M.; Ouzzani, Mourad.

EDBT 2015 - 18th International Conference on Extending Database Technology, Proceedings. OpenProceedings.org, University of Konstanz, University Library, 2015. p. 361-372.

Tang, M, Yu, Y, Aref, WG, Malluhi, QM & Ouzzani, M 2015, Efficient processing of hamming-distance-based similarity-search queries over MapReduce. in EDBT 2015 - 18th International Conference on Extending Database Technology, Proceedings. OpenProceedings.org, University of Konstanz, University Library, pp. 361-372, 18th International Conference on Extending Database Technology, EDBT 2015, Brussels, Belgium, 23/3/15. https://doi.org/10.5441/002/edbt.2015.32
Tang M, Yu Y, Aref WG, Malluhi QM, Ouzzani M. Efficient processing of hamming-distance-based similarity-search queries over MapReduce. In EDBT 2015 - 18th International Conference on Extending Database Technology, Proceedings. OpenProceedings.org, University of Konstanz, University Library. 2015. p. 361-372 https://doi.org/10.5441/002/edbt.2015.32
Tang, Mingjie ; Yu, Yongyang ; Aref, Walid G. ; Malluhi, Qutaibah M. ; Ouzzani, Mourad. / Efficient processing of hamming-distance-based similarity-search queries over MapReduce. EDBT 2015 - 18th International Conference on Extending Database Technology, Proceedings. OpenProceedings.org, University of Konstanz, University Library, 2015. pp. 361-372
@inproceedings{327bf88d80f340c69c584788371ed4de,
title = "Efficient processing of hamming-distance-based similarity-search queries over MapReduce",
abstract = "Similarity search is crucial to many applications. Of particular interest are two flavors of the Hamming distance range query, namely, the Hamming select and the Hamming join (Hamming-select and Hamming-join, respectively). Hamming distance is widely used in approximate near-neighbor search for high-dimensional data, such as images and document collections. For example, using predefined similarity hash functions, high-dimensional data is mapped into one-dimensional binary codes that are then linearly scanned to perform Hamming-distance comparisons. These distance comparisons on the binary codes are usually costly and often involve excessive redundancies. This paper introduces a new index, termed the HA-Index, that speeds up distance comparisons and eliminates redundancies when performing the two flavors of Hamming distance range queries. An efficient search algorithm based on the HA-Index is presented. A distributed version of the HA-Index is introduced, and algorithms for realizing Hamming-select and Hamming-join operations on a MapReduce platform are prototyped. Extensive experiments using real datasets demonstrate that the HA-Index and the corresponding search algorithms achieve up to two orders of magnitude speedup over existing state-of-the-art approaches, while reducing memory usage by more than a factor of ten.",
author = "Mingjie Tang and Yongyang Yu and Aref, {Walid G.} and Malluhi, {Qutaibah M.} and Mourad Ouzzani",
year = "2015",
doi = "10.5441/002/edbt.2015.32",
language = "English",
pages = "361--372",
booktitle = "EDBT 2015 - 18th International Conference on Extending Database Technology, Proceedings",
publisher = "OpenProceedings.org, University of Konstanz, University Library",

}

TY - GEN

T1 - Efficient processing of hamming-distance-based similarity-search queries over MapReduce

AU - Tang, Mingjie

AU - Yu, Yongyang

AU - Aref, Walid G.

AU - Malluhi, Qutaibah M.

AU - Ouzzani, Mourad

PY - 2015

Y1 - 2015

N2 - Similarity search is crucial to many applications. Of particular interest are two flavors of the Hamming distance range query, namely, the Hamming select and the Hamming join (Hamming-select and Hamming-join, respectively). Hamming distance is widely used in approximate near-neighbor search for high-dimensional data, such as images and document collections. For example, using predefined similarity hash functions, high-dimensional data is mapped into one-dimensional binary codes that are then linearly scanned to perform Hamming-distance comparisons. These distance comparisons on the binary codes are usually costly and often involve excessive redundancies. This paper introduces a new index, termed the HA-Index, that speeds up distance comparisons and eliminates redundancies when performing the two flavors of Hamming distance range queries. An efficient search algorithm based on the HA-Index is presented. A distributed version of the HA-Index is introduced, and algorithms for realizing Hamming-select and Hamming-join operations on a MapReduce platform are prototyped. Extensive experiments using real datasets demonstrate that the HA-Index and the corresponding search algorithms achieve up to two orders of magnitude speedup over existing state-of-the-art approaches, while reducing memory usage by more than a factor of ten.

UR - http://www.scopus.com/inward/record.url?scp=84976287044&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84976287044&partnerID=8YFLogxK

U2 - 10.5441/002/edbt.2015.32

DO - 10.5441/002/edbt.2015.32

M3 - Conference contribution

SP - 361

EP - 372

BT - EDBT 2015 - 18th International Conference on Extending Database Technology, Proceedings

PB - OpenProceedings.org, University of Konstanz, University Library

ER -