A robust index for regular expression queries

Dominic Tsang, Sanjay Chawla

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

2 Citations (Scopus)

Abstract

The LIKE regular expression predicate has been part of the SQL standard since at least 1989. However, despite its popularity and wide usage, database vendors provide only limited indexing support for regular expression queries, which almost always require a full table scan. In this paper we propose a rigorous and robust approach for providing indexing support for regular expression queries. Our approach consists of formulating the indexing problem as a combinatorial optimization problem. We begin with a database, abstracted as a collection of strings. From this data set we generate a query workload. The input to the optimization problem is the database and the workload. The output is a set of multigrams (substrings) which can be used as keys to records which satisfy the query workload. The multigrams can then be integrated with a data structure (such as a B+ tree) to provide indexing support for the queries. We provide a deterministic and a randomized approximation algorithm (with provable guarantees) to solve the optimization problem. Extensive experiments on synthetic data sets demonstrate that our approach is accurate and efficient. We also present a case study on PROSITE patterns, which are complex regular expression signatures for classes of proteins. Again, we are able to demonstrate the utility of our indexing approach in terms of accuracy and efficiency. Thus, perhaps for the first time, there is a robust and practical indexing mechanism for an important class of database queries.
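
To make the idea of multigram keys concrete, the following is a minimal sketch and not the authors' algorithm. The abstract does not describe the optimization procedure, so a simple greedy, set-cover-style heuristic is swapped in here purely for illustration. All function names, the `budget` parameter, and the representation of each query as a pattern plus a list of literal substrings that every match must contain are assumptions of this sketch.

```python
import re
from collections import defaultdict


def multigrams(s, min_len=2, max_len=3):
    """Return every substring of s whose length lies in [min_len, max_len]."""
    return {s[i:i + n]
            for n in range(min_len, max_len + 1)
            for i in range(len(s) - n + 1)}


def build_postings(records, min_len=2, max_len=3):
    """Map each multigram to the set of record ids that contain it."""
    postings = defaultdict(set)
    for rid, rec in enumerate(records):
        for g in multigrams(rec, min_len, max_len):
            postings[g].add(rid)
    return postings


def select_keys(postings, workload, budget):
    """Greedy heuristic (an assumption, not the paper's method).

    workload is a list of (pattern, required_literals) pairs. A multigram
    covers a query when it is a substring of one of that query's required
    literals, so every matching record is guaranteed to contain it.
    """
    if not postings:
        return []
    uncovered = set(range(len(workload)))
    keys = []

    def covers(g, q):
        return any(g in lit for lit in workload[q][1])

    while uncovered and len(keys) < budget:
        def gain(g):
            return sum(1 for q in uncovered if covers(g, q))

        # Prefer keys covering more queries, then keys with shorter posting lists.
        best = max(sorted(postings), key=lambda g: (gain(g), -len(postings[g])))
        if gain(best) == 0:
            break
        keys.append(best)
        uncovered = {q for q in uncovered if not covers(best, q)}
    return keys


def search(records, postings, keys, pattern, literals):
    """Filter candidates with the selected keys, then verify with the regex."""
    candidates = set(range(len(records)))
    for g in keys:
        if any(g in lit for lit in literals):
            candidates &= postings[g]
    rx = re.compile(pattern)
    return sorted(rid for rid in candidates if rx.search(records[rid]))
```

Under these assumptions, the selected multigrams act only as a filter: the posting lists narrow the candidate set, and the regular expression is still run on the surviving records, so the result matches what a full scan would return.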

Original language: English
Title of host publication: International Conference on Information and Knowledge Management, Proceedings
Pages: 2365-2368
Number of pages: 4
DOI: https://doi.org/10.1145/2063576.2063968
Publication status: Published - 2011
Externally published: Yes
Event: 20th ACM Conference on Information and Knowledge Management, CIKM'11 - Glasgow, United Kingdom
Duration: 24 Oct 2011 to 28 Oct 2011

Other

Other: 20th ACM Conference on Information and Knowledge Management, CIKM'11
Country: United Kingdom
City: Glasgow
Period: 24/10/11 to 28/10/11

Keywords

  • index for regular expression queries

ASJC Scopus subject areas

  • Business, Management and Accounting(all)
  • Decision Sciences(all)

Cite this

Tsang, D., & Chawla, S. (2011). A robust index for regular expression queries. In International Conference on Information and Knowledge Management, Proceedings (pp. 2365-2368) https://doi.org/10.1145/2063576.2063968

A robust index for regular expression queries. / Tsang, Dominic; Chawla, Sanjay.

International Conference on Information and Knowledge Management, Proceedings. 2011. p. 2365-2368.

Tsang, D & Chawla, S 2011, A robust index for regular expression queries. in International Conference on Information and Knowledge Management, Proceedings. pp. 2365-2368, 20th ACM Conference on Information and Knowledge Management, CIKM'11, Glasgow, United Kingdom, 24/10/11. https://doi.org/10.1145/2063576.2063968
Tsang D, Chawla S. A robust index for regular expression queries. In International Conference on Information and Knowledge Management, Proceedings. 2011. p. 2365-2368 https://doi.org/10.1145/2063576.2063968
Tsang, Dominic ; Chawla, Sanjay. / A robust index for regular expression queries. International Conference on Information and Knowledge Management, Proceedings. 2011. pp. 2365-2368
@inproceedings{92e48011574b4dcf9570e9bf7135dea1,
title = "A robust index for regular expression queries",
abstract = "The like regular expression predicate has been part of the SQL standard since at least 1989. However, despite its popularity and wide usage, database vendors provide only limited indexing support for regular expression queries which almost always require a full table scan. In this paper we propose a rigorous and robust approach for providing indexing support for regular expression queries. Our approach consists of formulating the indexing problem as a combinatorial optimization problem. We begin with a database, abstracted as a collection of strings. From this data set we generate a query workload. The input to the optimization problem is the database and the workload. The output is a set of multigrams (substrings) which can be used as keys to records which satisfy the query workload. The multigrams can then be integrated with the data structure (like B+ trees) to provide indexing support for the queries. We provide a deterministic and a randomized approximation algorithm (with provable guarantees) to solve the optimization problem. Extensive experiments on synthetic data sets demonstrate that our approach is accurate and efficient. We also present a case study on PROSITE patterns - which are complex regular expression signatures for classes of proteins. Again, we are able to demonstrate the utility of our indexing approach in terms of accuracy and efficiency. Thus, perhaps for the first time, there is a robust and practical indexing mechanism for an important class of database queries.",
keywords = "index for regular expression queries",
author = "Dominic Tsang and Sanjay Chawla",
year = "2011",
doi = "10.1145/2063576.2063968",
language = "English",
isbn = "9781450307178",
pages = "2365--2368",
booktitle = "International Conference on Information and Knowledge Management, Proceedings",

}

TY - GEN

T1 - A robust index for regular expression queries

AU - Tsang, Dominic

AU - Chawla, Sanjay

PY - 2011

Y1 - 2011

N2 - The like regular expression predicate has been part of the SQL standard since at least 1989. However, despite its popularity and wide usage, database vendors provide only limited indexing support for regular expression queries which almost always require a full table scan. In this paper we propose a rigorous and robust approach for providing indexing support for regular expression queries. Our approach consists of formulating the indexing problem as a combinatorial optimization problem. We begin with a database, abstracted as a collection of strings. From this data set we generate a query workload. The input to the optimization problem is the database and the workload. The output is a set of multigrams (substrings) which can be used as keys to records which satisfy the query workload. The multigrams can then be integrated with the data structure (like B+ trees) to provide indexing support for the queries. We provide a deterministic and a randomized approximation algorithm (with provable guarantees) to solve the optimization problem. Extensive experiments on synthetic data sets demonstrate that our approach is accurate and efficient. We also present a case study on PROSITE patterns - which are complex regular expression signatures for classes of proteins. Again, we are able to demonstrate the utility of our indexing approach in terms of accuracy and efficiency. Thus, perhaps for the first time, there is a robust and practical indexing mechanism for an important class of database queries.

AB - The like regular expression predicate has been part of the SQL standard since at least 1989. However, despite its popularity and wide usage, database vendors provide only limited indexing support for regular expression queries which almost always require a full table scan. In this paper we propose a rigorous and robust approach for providing indexing support for regular expression queries. Our approach consists of formulating the indexing problem as a combinatorial optimization problem. We begin with a database, abstracted as a collection of strings. From this data set we generate a query workload. The input to the optimization problem is the database and the workload. The output is a set of multigrams (substrings) which can be used as keys to records which satisfy the query workload. The multigrams can then be integrated with the data structure (like B+ trees) to provide indexing support for the queries. We provide a deterministic and a randomized approximation algorithm (with provable guarantees) to solve the optimization problem. Extensive experiments on synthetic data sets demonstrate that our approach is accurate and efficient. We also present a case study on PROSITE patterns - which are complex regular expression signatures for classes of proteins. Again, we are able to demonstrate the utility of our indexing approach in terms of accuracy and efficiency. Thus, perhaps for the first time, there is a robust and practical indexing mechanism for an important class of database queries.

KW - index for regular expression queries

UR - http://www.scopus.com/inward/record.url?scp=83055165903&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=83055165903&partnerID=8YFLogxK

U2 - 10.1145/2063576.2063968

DO - 10.1145/2063576.2063968

M3 - Conference contribution

AN - SCOPUS:83055165903

SN - 9781450307178

SP - 2365

EP - 2368

BT - International Conference on Information and Knowledge Management, Proceedings

ER -