Semantics-based distributed I/O for mpiBLAST

P. Balaji, W. Feng, J. Archuleta, H. Lin, R. Kettimuthu, R. Thakur, Xiaosong Ma

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

6 Citations (Scopus)

Abstract

BLAST is a widely used software toolkit for genomic sequence search. mpiBLAST is a freely available, open-source parallelization of BLAST that uses database segmentation to allow different worker processes to search (in parallel) unique segments of the database. After searching, the workers write their output to a filesystem. While mpiBLAST has been shown to achieve high performance in clusters with fast local filesystems, its I/O processing remains a concern for scalability, especially in systems having limited I/O capabilities such as distributed filesystems spread across a wide-area network. Thus, we present ParaMEDIC - a novel environment that uses application-specific semantic information to compress I/O data and improve performance in distributed environments. Specifically, for mpiBLAST, ParaMEDIC partitions worker processes into compute and I/O workers. Instead of directly writing the output to the filesystem, compute workers process the output using semantic knowledge about the application to generate metadata and write the metadata to the filesystem. I/O workers, which physically reside closer to the actual storage, then process this metadata to re-create the actual output and write it to the filesystem. This approach allows ParaMEDIC to reduce I/O time, thus accelerating mpiBLAST by as much as 25-fold.
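
The compute/I/O worker split described in the abstract can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the four-record `DATABASE`, the substring-match `search`, the metadata layout (a query plus a list of matching record indices), and the premise that the I/O worker has its own copy of the database. None of it is taken from ParaMEDIC or mpiBLAST; it only shows the general idea of shipping compact metadata and regenerating the verbose output near the storage.

```python
# Toy illustration only: the database, the search rule, and the metadata
# format are hypothetical stand-ins, not the ParaMEDIC or mpiBLAST code.

DATABASE = [
    ("seq0", "ACGTACGTGA"),
    ("seq1", "TTGACCGTAA"),
    ("seq2", "GGACGTTTCA"),
    ("seq3", "CCCCAAAATT"),
]


def format_hit(query, name, sequence):
    """Render one verbose output record (stand-in for full search output)."""
    offset = sequence.find(query)
    return f"query={query} hit={name} offset={offset} sequence={sequence}\n"


def search(query, database):
    """Return the indices of records whose sequence contains the query."""
    return [i for i, (_, seq) in enumerate(database) if query in seq]


def compute_worker(query, database):
    """Search, but emit only compact metadata instead of the full output."""
    return {"query": query, "hits": search(query, database)}


def io_worker(metadata, database):
    """Near the storage: re-create the full output from the metadata."""
    query = metadata["query"]
    return "".join(format_hit(query, *database[i]) for i in metadata["hits"])


def direct_output(query, database):
    """Baseline: a worker that produces the full output itself."""
    return "".join(format_hit(query, *database[i]) for i in search(query, database))


metadata = compute_worker("ACGT", DATABASE)
recreated = io_worker(metadata, DATABASE)
```

In this sketch only `metadata` would need to cross the slow link; `recreated` is rebuilt locally by the I/O worker and is identical to what `direct_output` would have written.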

Original language: English
Title of host publication: Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP
Pages: 293-294
Number of pages: 2
Publication status: Published - 1 Dec 2008
Externally published: Yes
Event: 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP'08 - Salt Lake City, UT, United States
Duration: 20 Feb 2008 to 23 Feb 2008

Other

Other: 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP'08
Country: United States
City: Salt Lake City, UT
Period: 20/2/08 to 23/2/08

Fingerprint

Metadata
Semantics
Wide area networks
Scalability
Processing

Keywords

  • Distributed filesystem
  • I/O
  • mpiBLAST

ASJC Scopus subject areas

  • Software

Cite this

Balaji, P., Feng, W., Archuleta, J., Lin, H., Kettimuthu, R., Thakur, R., & Ma, X. (2008). Semantics-based distributed I/O for mpiBLAST. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP (pp. 293-294)

Semantics-based distributed I/O for mpiBLAST. / Balaji, P.; Feng, W.; Archuleta, J.; Lin, H.; Kettimuthu, R.; Thakur, R.; Ma, Xiaosong.

Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP. 2008. p. 293-294.

Balaji, P, Feng, W, Archuleta, J, Lin, H, Kettimuthu, R, Thakur, R & Ma, X 2008, Semantics-based distributed I/O for mpiBLAST. in Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP. pp. 293-294, 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP'08, Salt Lake City, UT, United States, 20/2/08.
Balaji P, Feng W, Archuleta J, Lin H, Kettimuthu R, Thakur R et al. Semantics-based distributed I/O for mpiBLAST. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP. 2008. p. 293-294
Balaji, P. ; Feng, W. ; Archuleta, J. ; Lin, H. ; Kettimuthu, R. ; Thakur, R. ; Ma, Xiaosong. / Semantics-based distributed I/O for mpiBLAST. Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP. 2008. pp. 293-294
@inproceedings{889bd08252214565a2bbb3d8fdcea6d0,
title = "Semantics-based distributed I/O for mpiBLAST",
abstract = "BLAST is a widely used software toolkit for genomic sequence search. mpiBLAST is a freely available, open-source parallelization of BLAST that uses database segmentation to allow different worker processes to search (in parallel) unique segments of the database. After searching, the workers write their output to a filesystem. While mpiBLAST has been shown to achieve high performance in clusters with fast local filesystems, its I/O processing remains a concern for scalability, especially in systems having limited I/O capabilities such as distributed filesystems spread across a wide-area network. Thus, we present ParaMEDIC - a novel environment that uses application-specific semantic information to compress I/O data and improve performance in distributed environments. Specifically, for mpiBLAST, ParaMEDIC partitions worker processes into compute and I/O workers. Instead of directly writing the output to the filesystem, compute workers process the output using semantic knowledge about the application to generate metadata and write the metadata to the filesystem. I/O workers, which physically reside closer to the actual storage, then process this metadata to re-create the actual output and write it to the filesystem. This approach allows ParaMEDIC to reduce I/O time, thus accelerating mpiBLAST by as much as 25-fold.",
keywords = "Distributed filesystem, I/O, mpiBLAST",
author = "P. Balaji and W. Feng and J. Archuleta and H. Lin and R. Kettimuthu and R. Thakur and Xiaosong Ma",
year = "2008",
month = "12",
day = "1",
language = "English",
isbn = "9781595939609",
pages = "293--294",
booktitle = "Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP",

}

TY - GEN

T1 - Semantics-based distributed I/O for mpiBLAST

AU - Balaji, P.

AU - Feng, W.

AU - Archuleta, J.

AU - Lin, H.

AU - Kettimuthu, R.

AU - Thakur, R.

AU - Ma, Xiaosong

PY - 2008/12/1

Y1 - 2008/12/1

N2 - BLAST is a widely used software toolkit for genomic sequence search. mpiBLAST is a freely available, open-source parallelization of BLAST that uses database segmentation to allow different worker processes to search (in parallel) unique segments of the database. After searching, the workers write their output to a filesystem. While mpiBLAST has been shown to achieve high performance in clusters with fast local filesystems, its I/O processing remains a concern for scalability, especially in systems having limited I/O capabilities such as distributed filesystems spread across a wide-area network. Thus, we present ParaMEDIC - a novel environment that uses application-specific semantic information to compress I/O data and improve performance in distributed environments. Specifically, for mpiBLAST, ParaMEDIC partitions worker processes into compute and I/O workers. Instead of directly writing the output to the filesystem, compute workers process the output using semantic knowledge about the application to generate metadata and write the metadata to the filesystem. I/O workers, which physically reside closer to the actual storage, then process this metadata to re-create the actual output and write it to the filesystem. This approach allows ParaMEDIC to reduce I/O time, thus accelerating mpiBLAST by as much as 25-fold.

AB - BLAST is a widely used software toolkit for genomic sequence search. mpiBLAST is a freely available, open-source parallelization of BLAST that uses database segmentation to allow different worker processes to search (in parallel) unique segments of the database. After searching, the workers write their output to a filesystem. While mpiBLAST has been shown to achieve high performance in clusters with fast local filesystems, its I/O processing remains a concern for scalability, especially in systems having limited I/O capabilities such as distributed filesystems spread across a wide-area network. Thus, we present ParaMEDIC - a novel environment that uses application-specific semantic information to compress I/O data and improve performance in distributed environments. Specifically, for mpiBLAST, ParaMEDIC partitions worker processes into compute and I/O workers. Instead of directly writing the output to the filesystem, compute workers process the output using semantic knowledge about the application to generate metadata and write the metadata to the filesystem. I/O workers, which physically reside closer to the actual storage, then process this metadata to re-create the actual output and write it to the filesystem. This approach allows ParaMEDIC to reduce I/O time, thus accelerating mpiBLAST by as much as 25-fold.

KW - Distributed filesystem

KW - I/O

KW - mpiBLAST

UR - http://www.scopus.com/inward/record.url?scp=69249246712&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=69249246712&partnerID=8YFLogxK

M3 - Conference contribution

SN - 9781595939609

SP - 293

EP - 294

BT - Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP

ER -