HashFile

An efficient index structure for multimedia data

Dongxiang Zhang, Divyakant Agrawal, Gang Chen, Anthony K H Tung

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

17 Citations (Scopus)

Abstract

Nearest neighbor (NN) search in high-dimensional space is an essential query in many multimedia retrieval applications. Due to the curse of dimensionality, existing index structures might perform even worse than a simple sequential scan of the data when answering exact NN queries. To improve the efficiency of NN search, locality sensitive hashing (LSH) and its variants have been proposed to find approximate NNs. They adopt hash functions that preserve the Euclidean distance, so that similar objects have a high probability of colliding in the same bucket. Given a query object, candidates for the query result are obtained by accessing the points located in the same bucket. To improve precision, each hash table is associated with m hash functions that recursively hash the data points into smaller buckets and remove false positives. On the other hand, multiple hash tables are required to guarantee high retrieval recall. Thus, tuning a good tradeoff between precision and recall becomes the main challenge for LSH. Recently, the locality sensitive B-tree (LSB-tree) has been proposed to ensure both quality and efficiency. However, the index uses random I/O access. When the multimedia database is large, it requires considerable disk I/O cost to obtain an approximation ratio that works in practice. In this paper, we propose a novel index structure, named HashFile, for efficient retrieval of multimedia objects. It combines the advantages of random projection and linear scan. Unlike the LSH family, in which each bucket is associated with a concatenation of m hash values, we recursively partition only the dense buckets and organize them as a tree structure. Given a query point q, the search algorithm explores the buckets near the query object in a top-down manner. The candidate buckets in each node are stored sequentially in increasing order of hash value and can be efficiently loaded into memory for linear scan. HashFile can support both exact and approximate NN queries. Experimental results show that HashFile performs better than existing indexes in answering both types of NN queries.
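
The idea in the abstract (random projection, recursive partitioning of dense buckets only, top-down search over nearby buckets, then a linear scan) can be illustrated with a small in-memory sketch. This is not the authors' implementation: the function names, the `width`, `capacity`, `max_depth` and `radius` parameters, the choice to halve the bucket width at each level, and the use of Python lists instead of sequential disk pages are all assumptions made for illustration.

```python
import math
import random


def make_hash(dim, width, rng):
    """Random-projection hash h(v) = floor((a . v + b) / width).

    `a` is a Gaussian vector and `b` a uniform offset, as in the common
    Euclidean-distance LSH family (an assumption, not taken from the paper).
    """
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, width)

    def h(v):
        return math.floor((sum(x * y for x, y in zip(a, v)) + b) / width)

    return h


def build(points, dim, width, capacity, rng, depth=0, max_depth=8):
    """Hash points into buckets and re-partition only the dense ones."""
    h = make_hash(dim, width, rng)
    grouped = {}
    for p in points:
        grouped.setdefault(h(p), []).append(p)
    node = {"hash": h, "buckets": {}}
    for key in sorted(grouped):  # buckets kept in increasing hash order
        members = grouped[key]
        if len(members) > capacity and depth < max_depth:
            # Dense bucket: split it again with a fresh, finer projection.
            node["buckets"][key] = build(
                members, dim, width / 2, capacity, rng, depth + 1, max_depth
            )
        else:
            node["buckets"][key] = members  # sparse bucket stays a leaf list
    return node


def candidates(node, q, radius=1):
    """Collect points from buckets within `radius` of q's hash, top-down."""
    if isinstance(node, list):
        return list(node)
    key = node["hash"](q)
    found = []
    for k in range(key - radius, key + radius + 1):
        child = node["buckets"].get(k)
        if child is not None:
            found.extend(candidates(child, q, radius))
    return found


def approx_nn(node, q, radius=1):
    """Linear scan over the candidate points; returns None if there are none."""
    cands = candidates(node, q, radius)
    return min(cands, key=lambda p: math.dist(p, q)) if cands else None


rng = random.Random(0)
points = [tuple(rng.uniform(0.0, 1.0) for _ in range(8)) for _ in range(500)]
tree = build(points, dim=8, width=1.0, capacity=32, rng=rng)
nearest = approx_nn(tree, points[0])
```

In this sketch a larger `radius` scans more neighboring buckets and so trades query time for recall. An exact search would need a stopping rule based on distance bounds, which the sketch does not attempt.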

Original language: English
Title of host publication: Proceedings - International Conference on Data Engineering
Pages: 1103-1114
Number of pages: 12
DOIs: https://doi.org/10.1109/ICDE.2011.5767837
Publication status: Published - 6 Jun 2011
Externally published: Yes
Event: 2011 IEEE 27th International Conference on Data Engineering, ICDE 2011 - Hannover, Germany
Duration: 11 Apr 2011 → 16 Apr 2011

Other

Other: 2011 IEEE 27th International Conference on Data Engineering, ICDE 2011
Country: Germany
City: Hannover
Period: 11/4/11 → 16/4/11

Fingerprint

Hash functions
Tuning
Data storage equipment
Nearest neighbor search
Costs

ASJC Scopus subject areas

  • Information Systems
  • Signal Processing
  • Software

Cite this

Zhang, D., Agrawal, D., Chen, G., & Tung, A. K. H. (2011). HashFile: An efficient index structure for multimedia data. In Proceedings - International Conference on Data Engineering (pp. 1103-1114). [5767837] https://doi.org/10.1109/ICDE.2011.5767837

@inproceedings{24fac386d8c34ceabb78b5438e99c0a5,
title = "HashFile: An efficient index structure for multimedia data",
author = "Dongxiang Zhang and Divyakant Agrawal and Gang Chen and Tung, {Anthony K H}",
year = "2011",
month = "6",
day = "6",
doi = "10.1109/ICDE.2011.5767837",
language = "English",
isbn = "9781424489589",
pages = "1103--1114",
booktitle = "Proceedings - International Conference on Data Engineering",

}

TY - GEN

T1 - HashFile

T2 - An efficient index structure for multimedia data

AU - Zhang, Dongxiang

AU - Agrawal, Divyakant

AU - Chen, Gang

AU - Tung, Anthony K H

PY - 2011/6/6

Y1 - 2011/6/6

UR - http://www.scopus.com/inward/record.url?scp=79957804884&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=79957804884&partnerID=8YFLogxK

U2 - 10.1109/ICDE.2011.5767837

DO - 10.1109/ICDE.2011.5767837

M3 - Conference contribution

SN - 9781424489589

SP - 1103

EP - 1114

BT - Proceedings - International Conference on Data Engineering

ER -