Querying and mining strings made easy

Majed Sahli, Essam Mansour, Panos Kalnis

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

With the advent of large string datasets in several scientific and business applications, there is a growing need to perform ad-hoc analysis on strings. Currently, strings are stored, managed, and queried using procedural codes. This limits users to certain operations supported by existing procedural applications and requires manual query planning with limited tuning opportunities. This paper presents StarQL, a generic and declarative query language for strings. StarQL is based on a native string data model that allows StarQL to support a large variety of string operations and provide semantic-based query optimization. String analytic queries are too intricate to be solved on one machine. Therefore, we propose a scalable and efficient data structure that allows StarQL implementations to handle large sets of strings and utilize large computing infrastructures. Our evaluation shows that StarQL is able to express workloads of application-specific tools, such as BLAST and KAT in bioinformatics, and to mine Wikipedia text for interesting patterns using declarative queries. Furthermore, the StarQL query optimizer shows an order of magnitude reduction in query execution time.

Original language: English
Title of host publication: Advanced Data Mining and Applications - 13th International Conference, ADMA 2017, Proceedings
Publisher: Springer Verlag
Pages: 3-17
Number of pages: 15
ISBN (Print): 9783319691787
DOI: https://doi.org/10.1007/978-3-319-69179-4_1
Publication status: Published - 1 Jan 2017
Event: 13th International Conference on Advanced Data Mining and Applications, ADMA 2017 - Singapore, Singapore
Duration: 5 Nov 2017 to 6 Nov 2017

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 10604 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 13th International Conference on Advanced Data Mining and Applications, ADMA 2017
Country: Singapore
City: Singapore
Period: 5/11/17 to 6/11/17

Fingerprint

Mining
Strings
Data structures
Query
Query languages
Bioinformatics
Tuning
Semantics
Planning
Query Optimization
Wikipedia
Industry
Query Language
Large Set
Data Model
Execution Time
Workload
Data Structures
Express
Infrastructure

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science (all)

Cite this

Sahli, M., Mansour, E., & Kalnis, P. (2017). Querying and mining strings made easy. In Advanced Data Mining and Applications - 13th International Conference, ADMA 2017, Proceedings (pp. 3-17). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 10604 LNAI). Springer Verlag. https://doi.org/10.1007/978-3-319-69179-4_1

@inproceedings{002810d848024e3a83369b7bcbafe83b,
title = "Querying and mining strings made easy",
author = "Majed Sahli and Essam Mansour and Panos Kalnis",
year = "2017",
month = "1",
day = "1",
doi = "10.1007/978-3-319-69179-4_1",
language = "English",
isbn = "9783319691787",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
publisher = "Springer Verlag",
pages = "3--17",
booktitle = "Advanced Data Mining and Applications - 13th International Conference, ADMA 2017, Proceedings",

}
