Efficient data access for parallel BLAST

Heshan Lin, Xiaosong Ma, Praveen Chandramohan, Al Geist, Nagiza Samatova

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

57 Citations (Scopus)

Abstract

Searching biological sequence databases is one of the most routine tasks in computational biology. This task is significantly hampered by the exponential growth in sequence database sizes. Recent advances in parallelization of biological sequence search applications have enabled bioinformatics researchers to utilize high-performance computing platforms and, as a result, greatly reduce the execution time of their sequence database searches. However, existing parallel sequence search tools have been focusing mostly on parallelizing the sequence alignment engine. While the computation-intensive alignment tasks become cheaper with larger machines, data-intensive initial preparation and result merging tasks become more expensive. Inefficient handling of input and output data can easily create performance bottlenecks even on supercomputers. It also causes a considerable data management overhead. In this paper, we present a set of techniques for efficient and flexible data handling in parallel sequence search applications. We demonstrate our optimizations through improving mpiBLAST, an open-source parallel BLAST tool rapidly gaining popularity. These optimization techniques aim at enabling flexible database partitioning, reducing I/O by caching small auxiliary files and results, enabling parallel I/O on shared files, and performing scalable result processing protocols. As a result, we reduce mpiBLAST users' operational overhead by removing the requirement of prepartitioning databases. Meanwhile, our experiments show that these techniques can bring an order-of-magnitude improvement to both the overall performance and scalability of mpiBLAST.
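
The idea of partitioning a shared database file at run time, rather than prepartitioning it, can be illustrated with a minimal sketch. This is not mpiBLAST's implementation: it assumes plain FASTA text as a stand-in for the database, and the function names `partition_offsets` and `read_records` are hypothetical. Each worker is given a byte range of the one shared file and reads only the records whose header line begins inside that range.

```python
import io


def partition_offsets(size, n_workers):
    """Split [0, size) into n_workers near-equal byte ranges (hypothetical helper)."""
    base, extra = divmod(size, n_workers)
    ranges, start = [], 0
    for i in range(n_workers):
        end = start + base + (1 if i < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges


def read_records(stream, start, end):
    """Yield (header, sequence) for FASTA records whose '>' line begins in [start, end)."""
    if start > 0:
        # Step back one byte and finish that line, so reading resumes
        # at the first line boundary at or after `start`.
        stream.seek(start - 1)
        stream.readline()
    else:
        stream.seek(0)
    header, seq = None, []
    while True:
        pos = stream.tell()
        line = stream.readline()
        # Stop at end of file, or at a header that belongs to the next range.
        if not line or (line.startswith(b">") and pos >= end):
            break
        if line.startswith(b">"):
            if header is not None:
                yield header, b"".join(seq)
            header, seq = line[1:].strip(), []
        elif header is not None:
            # Sequence lines seen before the first owned header belong to
            # the previous range's last record and are skipped.
            seq.append(line.strip())
    if header is not None:
        yield header, b"".join(seq)


if __name__ == "__main__":
    data = b">a\nACGT\nAC\n>b\nGG\n>c\nTTTT\nT\n"
    for start, end in partition_offsets(len(data), 3):
        print((start, end), list(read_records(io.BytesIO(data), start, end)))
```

Because ownership is decided by where a header line starts, every record is read by exactly one worker for any worker count, and no per-worker fragment files need to be created in advance.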

Original language: English
Title of host publication: Proceedings - 19th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2005
Volume: 2005
DOI: https://doi.org/10.1109/IPDPS.2005.190
Publication status: Published - 1 Dec 2005
Externally published: Yes
Event: 19th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2005 - Denver, CO, United States
Duration: 4 Apr 2005 to 8 Apr 2005

Other

Other: 19th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2005
Country: United States
City: Denver, CO
Period: 4/4/05 to 8/4/05

Fingerprint

Data handling
Supercomputers
Bioinformatics
Merging
Information management
Scalability
Engines
Processing
Experiments

ASJC Scopus subject areas

  • Engineering(all)

Cite this

Lin, H., Ma, X., Chandramohan, P., Geist, A., & Samatova, N. (2005). Efficient data access for parallel BLAST. In Proceedings - 19th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2005 (Vol. 2005). [1419898] https://doi.org/10.1109/IPDPS.2005.190

Efficient data access for parallel BLAST. / Lin, Heshan; Ma, Xiaosong; Chandramohan, Praveen; Geist, Al; Samatova, Nagiza.

Proceedings - 19th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2005. Vol. 2005 2005. 1419898.

Lin, H, Ma, X, Chandramohan, P, Geist, A & Samatova, N 2005, Efficient data access for parallel BLAST. in Proceedings - 19th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2005. vol. 2005, 1419898, 19th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2005, Denver, CO, United States, 4/4/05. https://doi.org/10.1109/IPDPS.2005.190
Lin H, Ma X, Chandramohan P, Geist A, Samatova N. Efficient data access for parallel BLAST. In Proceedings - 19th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2005. Vol. 2005. 2005. 1419898 https://doi.org/10.1109/IPDPS.2005.190
Lin, Heshan ; Ma, Xiaosong ; Chandramohan, Praveen ; Geist, Al ; Samatova, Nagiza. / Efficient data access for parallel BLAST. Proceedings - 19th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2005. Vol. 2005 2005.
@inproceedings{7f4c1b3e17c449e1891bea969de3db26,
title = "Efficient data access for parallel BLAST",
abstract = "Searching biological sequence databases is one of the most routine tasks in computational biology. This task is significantly hampered by the exponential growth in sequence database sizes. Recent advances in parallelization of biological sequence search applications have enabled bioinformatics researchers to utilize high-performance computing platforms and, as a result, greatly reduce the execution time of their sequence database searches. However, existing parallel sequence search tools have been focusing mostly on parallelizing the sequence alignment engine. While the computation-intensive alignment tasks become cheaper with larger machines, data-intensive initial preparation and result merging tasks become more expensive. Inefficient handling of input and output data can easily create performance bottlenecks even on supercomputers. It also causes a considerable data management overhead. In this paper, we present a set of techniques for efficient and flexible data handling in parallel sequence search applications. We demonstrate our optimizations through improving mpiBLAST, an open-source parallel BLAST tool rapidly gaining popularity. These optimization techniques aim at enabling flexible database partitioning, reducing I/O by caching small auxiliary files and results, enabling parallel I/O on shared files, and performing scalable result processing protocols. As a result, we reduce mpiBLAST users' operational overhead by removing the requirement of prepartitioning databases. Meanwhile, our experiments show that these techniques can bring an order-of-magnitude improvement to both the overall performance and scalability of mpiBLAST.",
author = "Heshan Lin and Xiaosong Ma and Praveen Chandramohan and Al Geist and Nagiza Samatova",
year = "2005",
month = "12",
day = "1",
doi = "10.1109/IPDPS.2005.190",
language = "English",
isbn = "0769523129",
volume = "2005",
booktitle = "Proceedings - 19th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2005",

}

TY - GEN

T1 - Efficient data access for parallel BLAST

AU - Lin, Heshan

AU - Ma, Xiaosong

AU - Chandramohan, Praveen

AU - Geist, Al

AU - Samatova, Nagiza

PY - 2005/12/1

Y1 - 2005/12/1

AB - Searching biological sequence databases is one of the most routine tasks in computational biology. This task is significantly hampered by the exponential growth in sequence database sizes. Recent advances in parallelization of biological sequence search applications have enabled bioinformatics researchers to utilize high-performance computing platforms and, as a result, greatly reduce the execution time of their sequence database searches. However, existing parallel sequence search tools have been focusing mostly on parallelizing the sequence alignment engine. While the computation-intensive alignment tasks become cheaper with larger machines, data-intensive initial preparation and result merging tasks become more expensive. Inefficient handling of input and output data can easily create performance bottlenecks even on supercomputers. It also causes a considerable data management overhead. In this paper, we present a set of techniques for efficient and flexible data handling in parallel sequence search applications. We demonstrate our optimizations through improving mpiBLAST, an open-source parallel BLAST tool rapidly gaining popularity. These optimization techniques aim at enabling flexible database partitioning, reducing I/O by caching small auxiliary files and results, enabling parallel I/O on shared files, and performing scalable result processing protocols. As a result, we reduce mpiBLAST users' operational overhead by removing the requirement of prepartitioning databases. Meanwhile, our experiments show that these techniques can bring an order-of-magnitude improvement to both the overall performance and scalability of mpiBLAST.

UR - http://www.scopus.com/inward/record.url?scp=33746293354&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=33746293354&partnerID=8YFLogxK

U2 - 10.1109/IPDPS.2005.190

DO - 10.1109/IPDPS.2005.190

M3 - Conference contribution

SN - 0769523129

SN - 9780769523125

VL - 2005

BT - Proceedings - 19th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2005

ER -