Genome-scale disk-based suffix tree indexing

Benjarath Phoophakdee, Mohammed J. Zaki

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

50 Citations (Scopus)

Abstract

With the exponential growth of biological sequence databases, it has become critical to develop effective techniques for storing, querying, and analyzing this massive volume of data. Suffix trees are widely used to solve many sequence-based problems, and they can be built in linear time and space, provided the resulting tree fits in main memory. To index larger sequences, several external suffix tree algorithms have been proposed in recent years. However, they suffer from several problems, such as susceptibility to data skew, failure to scale to genome-scale sequences, and the absence of suffix links, which are crucial in various suffix-tree-based algorithms. In this paper, we target DNA sequences and propose a novel disk-based suffix tree algorithm called TRELLIS, which effectively scales up to genome-scale sequences. Specifically, it can index the entire human genome using 2 GB of memory in about 4 hours, and can recover all its suffix links within 2 hours. TRELLIS was compared to various state-of-the-art persistent disk-based suffix tree construction algorithms and was shown to outperform the best previous methods in terms of both indexing time and querying time.
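As a rough illustration of the kind of index the abstract refers to, the sketch below builds a naive in-memory suffix trie and uses it to answer substring queries. It is not TRELLIS and not a linear-time construction: it takes quadratic time and space, has no suffix links, and keeps everything in main memory. The names `Node`, `build_suffix_trie`, and `find_all` are hypothetical and chosen only for this example.

```python
from typing import Dict, List


class Node:
    """One trie node: child edges keyed by character, plus suffix start offsets."""

    def __init__(self) -> None:
        self.children: Dict[str, "Node"] = {}
        # Start offsets of every suffix whose path passes through this node.
        self.starts: List[int] = []


def build_suffix_trie(text: str) -> Node:
    """Insert every suffix of text. O(n^2) time and space, for illustration only."""
    root = Node()
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.children.setdefault(ch, Node())
            node.starts.append(start)
    return root


def find_all(root: Node, pattern: str) -> List[int]:
    """Return sorted start offsets of all occurrences of a non-empty pattern."""
    node = root
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:
            return []  # pattern does not occur in the indexed text
    return sorted(node.starts)


if __name__ == "__main__":
    trie = build_suffix_trie("GATTACA")
    print(find_all(trie, "A"))
    print(find_all(trie, "TA"))
```

Once the index is built, a query costs time proportional to the pattern length plus the number of matches, which is why such structures are attractive for sequence search. The quadratic footprint of this naive version also shows why genome-scale inputs call for compact, disk-based constructions.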

Original language: English
Title of host publication: Proceedings of the ACM SIGMOD International Conference on Management of Data
Pages: 833-844
Number of pages: 12
DOI: https://doi.org/10.1145/1247480.1247572
Publication status: Published - 30 Oct 2007
Externally published: Yes
Event: SIGMOD 2007: ACM SIGMOD International Conference on Management of Data - Beijing, China
Duration: 12 Jun 2007 to 14 Jun 2007

Other

Other: SIGMOD 2007: ACM SIGMOD International Conference on Management of Data
Country: China
City: Beijing
Period: 12/6/07 to 14/6/07

Fingerprint

Genes
Data storage equipment
DNA sequences

Keywords

  • Disk-based
  • External memory
  • Genome-scale
  • Sequence indexing
  • Suffix tree

ASJC Scopus subject areas

  • Computer Science(all)

Cite this

Phoophakdee, B., & Zaki, M. J. (2007). Genome-scale disk-based suffix tree indexing. In Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 833-844) https://doi.org/10.1145/1247480.1247572

Genome-scale disk-based suffix tree indexing. / Phoophakdee, Benjarath; Zaki, Mohammed J.

Proceedings of the ACM SIGMOD International Conference on Management of Data. 2007. p. 833-844.

Phoophakdee, B & Zaki, MJ 2007, Genome-scale disk-based suffix tree indexing. in Proceedings of the ACM SIGMOD International Conference on Management of Data. pp. 833-844, SIGMOD 2007: ACM SIGMOD International Conference on Management of Data, Beijing, China, 12/6/07. https://doi.org/10.1145/1247480.1247572
Phoophakdee B, Zaki MJ. Genome-scale disk-based suffix tree indexing. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 2007. p. 833-844 https://doi.org/10.1145/1247480.1247572
Phoophakdee, Benjarath ; Zaki, Mohammed J. / Genome-scale disk-based suffix tree indexing. Proceedings of the ACM SIGMOD International Conference on Management of Data. 2007. pp. 833-844
@inproceedings{4342fc2c2ba848229a178c048e913d76,
title = "Genome-scale disk-based suffix tree indexing",
keywords = "Disk-based, External memory, Genome-scale, Sequence indexing, Suffix tree",
author = "Benjarath Phoophakdee and Zaki, {Mohammed J.}",
year = "2007",
month = "10",
day = "30",
doi = "10.1145/1247480.1247572",
language = "English",
isbn = "1595936866",
pages = "833--844",
booktitle = "Proceedings of the ACM SIGMOD International Conference on Management of Data",

}

TY - GEN

T1 - Genome-scale disk-based suffix tree indexing

AU - Phoophakdee, Benjarath

AU - Zaki, Mohammed J.

PY - 2007/10/30

Y1 - 2007/10/30

KW - Disk-based

KW - External memory

KW - Genome-scale

KW - Sequence indexing

KW - Suffix tree

UR - http://www.scopus.com/inward/record.url?scp=35448952633&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=35448952633&partnerID=8YFLogxK

U2 - 10.1145/1247480.1247572

DO - 10.1145/1247480.1247572

M3 - Conference contribution

SN - 1595936866

SN - 9781595936868

SP - 833

EP - 844

BT - Proceedings of the ACM SIGMOD International Conference on Management of Data

ER -