High dimensional nearest neighbor searching

Hakan Ferhatosmanoglu, Ertem Tuncel, Divyakant Agrawal, Amr El Abbadi

Research output: Contribution to journal › Article

22 Citations (Scopus)

Abstract

As databases increasingly integrate different types of information such as time-series, multimedia and scientific data, it becomes necessary to support efficient retrieval of multi-dimensional data. Both the dimensionality and the amount of data that needs to be processed are increasing rapidly. As a result of the scale and high dimensional nature, the traditional techniques have proven inadequate. In this paper, we propose search techniques that are effective especially for large high dimensional data sets. We first propose VA+-file technique which is based on scalar quantization of the data. VA+-file is especially useful for searching exact nearest neighbors (NN) in non-uniform high dimensional data sets. We then discuss how to improve the search and make it progressive by allowing some approximations in the query result. We develop a general framework for approximate NN queries, discuss various approaches for progressive processing of similarity queries, and develop a metric for evaluation of such techniques. Finally, a new technique based on clustering is proposed, which merges the benefits of various approaches for progressive similarity searching. Extensive experimental evaluation is performed on several real-life data sets. The evaluation establishes the superiority of the proposed techniques over the existing techniques for high dimensional similarity searching. The techniques proposed in this paper are effective for real-life data sets, which are typically non-uniform, and they are scalable with respect to both dimensionality and size of the data set.
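The abstract states only that the VA+-file is based on scalar quantization and targets exact NN search in non-uniform data. As a rough illustration of that general idea, the sketch below shows a generic quantize, filter, and refine search; it is not the authors' VA+-file. The quantile-based cell boundaries, the function names (`build_boundaries`, `encode`, `lower_bound`, `nearest_neighbor`), and the synthetic data are all assumptions made for this example.

```python
import bisect
import math
import random


def build_boundaries(points, bits):
    """Per-dimension cell edges at empirical quantiles (assumed scheme, non-uniform cells)."""
    cells = 1 << bits
    n = len(points)
    boundaries = []
    for d in range(len(points[0])):
        col = sorted(p[d] for p in points)
        # cells + 1 edges: minimum, interior quantiles, maximum
        edges = [col[(k * n) // cells] for k in range(cells)] + [col[-1]]
        boundaries.append(edges)
    return boundaries


def encode(point, boundaries):
    """Scalar quantization: one cell index per dimension."""
    code = []
    for x, edges in zip(point, boundaries):
        c = bisect.bisect_right(edges, x) - 1
        code.append(max(0, min(c, len(edges) - 2)))  # clamp the maximum into the last cell
    return code


def lower_bound(query, code, boundaries):
    """Smallest possible Euclidean distance from the query to any point in the cell."""
    total = 0.0
    for q, c, edges in zip(query, code, boundaries):
        lo, hi = edges[c], edges[c + 1]
        gap = lo - q if q < lo else (q - hi if q > hi else 0.0)
        total += gap * gap
    return math.sqrt(total)


def nearest_neighbor(query, points, codes, boundaries):
    """Exact NN: scan the compact codes first, then refine in lower-bound order."""
    order = sorted((lower_bound(query, c, boundaries), i) for i, c in enumerate(codes))
    best_dist, best_idx, visited = math.inf, -1, 0
    for lb, i in order:
        if lb >= best_dist:
            break  # no remaining candidate can be strictly closer
        visited += 1  # counts full vectors actually read
        d = math.dist(query, points[i])
        if d < best_dist:
            best_dist, best_idx = d, i
    return best_idx, best_dist, visited


# Synthetic clustered (non-uniform) data, for illustration only.
random.seed(0)
data = [[random.gauss(random.choice([0.0, 5.0]), 0.5) for _ in range(8)]
        for _ in range(500)]
edges = build_boundaries(data, bits=3)
codes = [encode(p, edges) for p in data]
idx, dist, visited = nearest_neighbor(data[0], data, codes, edges)
print(f"nearest index {idx}, distance {dist:.4f}, full vectors read: {visited} of {len(data)}")
```

Because each lower bound never exceeds the true distance, stopping once the next bound reaches the best distance found so far keeps the result exact while reading only part of the full vectors. Placing cell edges at quantiles is one simple way to adapt the cells to non-uniform data; the paper may use a different scheme.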

Original language: English
Pages (from-to): 512-540
Number of pages: 29
Journal: Information Systems
Volume: 31
Issue number: 6
DOIs: https://doi.org/10.1016/j.is.2005.01.001
Publication status: Published - 1 Sep 2006
Externally published: Yes

Fingerprint

Time series
Processing
Nearest neighbor search
Nearest neighbor
Evaluation
Query

Keywords

  • Approximate and progressive search
  • High dimensional data
  • Indexing
  • Nearest neighbor queries
  • Non-uniform data
  • Performance
  • Scalability
  • Similarity search

ASJC Scopus subject areas

  • Management Information Systems
  • Management of Technology and Innovation
  • Hardware and Architecture
  • Information Systems
  • Software

Cite this

Ferhatosmanoglu, H., Tuncel, E., Agrawal, D., & El Abbadi, A. (2006). High dimensional nearest neighbor searching. Information Systems, 31(6), 512-540. https://doi.org/10.1016/j.is.2005.01.001

@article{cd972fcedf2c4f44bbc1a5b7e6cd606f,
title = "High dimensional nearest neighbor searching",
abstract = "As databases increasingly integrate different types of information such as time-series, multimedia and scientific data, it becomes necessary to support efficient retrieval of multi-dimensional data. Both the dimensionality and the amount of data that needs to be processed are increasing rapidly. As a result of the scale and high dimensional nature, the traditional techniques have proven inadequate. In this paper, we propose search techniques that are effective especially for large high dimensional data sets. We first propose VA+-file technique which is based on scalar quantization of the data. VA+-file is especially useful for searching exact nearest neighbors (NN) in non-uniform high dimensional data sets. We then discuss how to improve the search and make it progressive by allowing some approximations in the query result. We develop a general framework for approximate NN queries, discuss various approaches for progressive processing of similarity queries, and develop a metric for evaluation of such techniques. Finally, a new technique based on clustering is proposed, which merges the benefits of various approaches for progressive similarity searching. Extensive experimental evaluation is performed on several real-life data sets. The evaluation establishes the superiority of the proposed techniques over the existing techniques for high dimensional similarity searching. The techniques proposed in this paper are effective for real-life data sets, which are typically non-uniform, and they are scalable with respect to both dimensionality and size of the data set.",
keywords = "Approximate and progressive search, High dimensional data, Indexing, Nearest neighbor queries, Non-uniform data, Performance, Scalability, Similarity search",
author = "Hakan Ferhatosmanoglu and Ertem Tuncel and Divyakant Agrawal and {El Abbadi}, Amr",
year = "2006",
month = "9",
day = "1",
doi = "10.1016/j.is.2005.01.001",
language = "English",
volume = "31",
pages = "512--540",
journal = "Information Systems",
issn = "0306-4379",
publisher = "Elsevier Limited",
number = "6",

}

TY - JOUR

T1 - High dimensional nearest neighbor searching

AU - Ferhatosmanoglu, Hakan

AU - Tuncel, Ertem

AU - Agrawal, Divyakant

AU - El Abbadi, Amr

PY - 2006/9/1

Y1 - 2006/9/1

AB - As databases increasingly integrate different types of information such as time-series, multimedia and scientific data, it becomes necessary to support efficient retrieval of multi-dimensional data. Both the dimensionality and the amount of data that needs to be processed are increasing rapidly. As a result of the scale and high dimensional nature, the traditional techniques have proven inadequate. In this paper, we propose search techniques that are effective especially for large high dimensional data sets. We first propose VA+-file technique which is based on scalar quantization of the data. VA+-file is especially useful for searching exact nearest neighbors (NN) in non-uniform high dimensional data sets. We then discuss how to improve the search and make it progressive by allowing some approximations in the query result. We develop a general framework for approximate NN queries, discuss various approaches for progressive processing of similarity queries, and develop a metric for evaluation of such techniques. Finally, a new technique based on clustering is proposed, which merges the benefits of various approaches for progressive similarity searching. Extensive experimental evaluation is performed on several real-life data sets. The evaluation establishes the superiority of the proposed techniques over the existing techniques for high dimensional similarity searching. The techniques proposed in this paper are effective for real-life data sets, which are typically non-uniform, and they are scalable with respect to both dimensionality and size of the data set.

KW - Approximate and progressive search

KW - High dimensional data

KW - Indexing

KW - Nearest neighbor queries

KW - Non-uniform data

KW - Performance

KW - Scalability

KW - Similarity search

UR - http://www.scopus.com/inward/record.url?scp=33646721550&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=33646721550&partnerID=8YFLogxK

U2 - 10.1016/j.is.2005.01.001

DO - 10.1016/j.is.2005.01.001

M3 - Article

VL - 31

SP - 512

EP - 540

JO - Information Systems

JF - Information Systems

SN - 0306-4379

IS - 6

ER -