Space-economical partial gram indices for exact substring matching

Nan Tang, Lefteris Sidirourgos, Peter Boncz

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

4 Citations (Scopus)

Abstract

Exact substring matching queries on large data collections can be answered using q-gram indices, which store, for each occurring q-byte pattern, an ordered posting list with the positions of all its occurrences. Such gram indices are known to provide fast query response times and can be built quickly even on huge disk-based datasets. Their main drawback is their relatively large storage footprint, which is a constant multiple (typically >2) of the original data size, even when compression is used. In this work, we study methods that preserve the scalable creation time and efficient exact substring query properties of gram indices while reducing storage space. To this end, we first propose a partial gram index based on a reduction from the problem of omitting indexed q-grams to the set cover problem. While this method succeeds in reducing the size of the index, it generates false positives at query time, which reduces efficiency. We then increase the accuracy of partial grams by splitting the posting lists of frequent grams into a frequency-tuned set of signatures that take the bytes surrounding the grams into account. The resulting qs-gram scheme is tested on huge collections (up to 426 GB) and is shown to achieve an index size almost equal to the data size (close to 1:1), with query performance even faster than that of normal gram methods, thanks to the reduced size and access cost.
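
As a rough illustration of the baseline the abstract starts from, the sketch below builds a full (not partial) q-gram index and answers an exact substring query by intersecting posting lists. It is a minimal in-memory example and not the authors' implementation: the function names `build_qgram_index` and `find` are hypothetical, and it includes neither the set-cover-based gram omission nor the signature splitting of the qs-gram scheme.

```python
from collections import defaultdict


def build_qgram_index(data: bytes, q: int) -> dict[bytes, list[int]]:
    """Map each q-byte pattern occurring in data to the ordered list of its start positions."""
    index = defaultdict(list)
    for pos in range(len(data) - q + 1):
        # Positions are visited in increasing order, so each posting list stays sorted.
        index[data[pos:pos + q]].append(pos)
    return dict(index)


def find(index: dict[bytes, list[int]], q: int, pattern: bytes) -> list[int]:
    """Return all start positions of pattern, using only the posting lists."""
    if len(pattern) < q:
        raise ValueError("pattern must be at least q bytes long")
    candidates = None
    for offset in range(len(pattern) - q + 1):
        gram = pattern[offset:offset + q]
        # A match starting at p requires this gram to occur at position p + offset.
        starts = {pos - offset for pos in index.get(gram, [])}
        candidates = starts if candidates is None else candidates & starts
        if not candidates:
            return []
    return sorted(candidates)


# Illustrative usage on a toy input (hypothetical data, not from the paper).
toy_index = build_qgram_index(b"abracadabra", 2)
print(find(toy_index, 2, b"abra"))  # [0, 7]
```

Because every q-gram of the pattern is checked at its expected offset, this full index yields no false positives. A partial index that omits some grams, as the abstract describes, can only narrow the candidates and therefore has to verify them against the data.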

Original language: English
Title of host publication: International Conference on Information and Knowledge Management, Proceedings
Pages: 285-294
Number of pages: 10
DOI: https://doi.org/10.1145/1645953.1645992
Publication status: Published - 1 Dec 2009
Externally published: Yes
Event: ACM 18th International Conference on Information and Knowledge Management, CIKM 2009 - Hong Kong, China
Duration: 2 Nov 2009 to 6 Nov 2009

Other

Other: ACM 18th International Conference on Information and Knowledge Management, CIKM 2009
Country: China
City: Hong Kong
Period: 2/11/09 to 6/11/09

Fingerprint

Query
Response time
Costs
Data collection
Compression

Keywords

  • Q-gram string matching

ASJC Scopus subject areas

  • Business, Management and Accounting (all)
  • Decision Sciences (all)

Cite this

Tang, N., Sidirourgos, L., & Boncz, P. (2009). Space-economical partial gram indices for exact substring matching. In International Conference on Information and Knowledge Management, Proceedings (pp. 285-294) https://doi.org/10.1145/1645953.1645992

@inproceedings{7fc2b45f425040ab901155b691e23434,
title = "Space-economical partial gram indices for exact substring matching",
abstract = "Exact substring matching queries on large data collections can be answered using q-gram indices, that store for each occurring q-byte pattern an (ordered) posting list with the positions of all occurrences. Such gram indices are known to provide fast query response time and to allow the index to be created quickly even on huge disk-based datasets. Their main drawback is relatively large storage space, that is a constant multiple (typically >2) of the original data size, even when compression is used. In this work, we study methods to conserve the scalable creation time and efficient exact substring query properties of gram indices, while reducing storage space. To this end, we first propose a partial gram index based on a reduction from the problem of omitting indexed q-grams to the set cover problem. While this method is successful in reducing the size of the index, it generates false positives at query time, reducing efficiency. We then increase the accuracy of partial grams by splitting posting lists of frequent grams in a frequency-tuned set of signatures that take the bytes surrounding the grams into account. The resulting qs-gram scheme is tested on huge collections (up to 426GB) and is shown to achieve an almost 1:1 data:index size, and query performance even faster than normal gram methods, thanks to the reduced size and access cost.",
keywords = "Q-gram string matching",
author = "Nan Tang and Lefteris Sidirourgos and Peter Boncz",
year = "2009",
month = "12",
day = "1",
doi = "10.1145/1645953.1645992",
language = "English",
isbn = "9781605585123",
pages = "285--294",
booktitle = "International Conference on Information and Knowledge Management, Proceedings",

}

TY - GEN

T1 - Space-economical partial gram indices for exact substring matching

AU - Tang, Nan

AU - Sidirourgos, Lefteris

AU - Boncz, Peter

PY - 2009/12/1

Y1 - 2009/12/1

AB - Exact substring matching queries on large data collections can be answered using q-gram indices, that store for each occurring q-byte pattern an (ordered) posting list with the positions of all occurrences. Such gram indices are known to provide fast query response time and to allow the index to be created quickly even on huge disk-based datasets. Their main drawback is relatively large storage space, that is a constant multiple (typically >2) of the original data size, even when compression is used. In this work, we study methods to conserve the scalable creation time and efficient exact substring query properties of gram indices, while reducing storage space. To this end, we first propose a partial gram index based on a reduction from the problem of omitting indexed q-grams to the set cover problem. While this method is successful in reducing the size of the index, it generates false positives at query time, reducing efficiency. We then increase the accuracy of partial grams by splitting posting lists of frequent grams in a frequency-tuned set of signatures that take the bytes surrounding the grams into account. The resulting qs-gram scheme is tested on huge collections (up to 426GB) and is shown to achieve an almost 1:1 data:index size, and query performance even faster than normal gram methods, thanks to the reduced size and access cost.

KW - Q-gram string matching

UR - http://www.scopus.com/inward/record.url?scp=74549215724&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=74549215724&partnerID=8YFLogxK

U2 - 10.1145/1645953.1645992

DO - 10.1145/1645953.1645992

M3 - Conference contribution

SN - 9781605585123

SP - 285

EP - 294

BT - International Conference on Information and Knowledge Management, Proceedings

ER -