Optimization of data-intensive next generation sequencing in high performance computing

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

2 Citations (Scopus)

Abstract

Advances in Next Generation Sequencing (NGS) technology produce an ever-increasing volume of genomic data every year. These genomic data are processed efficiently through empirical parallelism on High Performance Computing (HPC) systems. The processed data can be used for genome-wide association studies, genetics, personalized medicine and many other areas. Different kinds of algorithms and implementations are used in the different phases of genome processing. In this paper, we used BWAKIT- and GATK-based software to process large volumes of genomic data; we refer to this pipeline as the "NGS workflow at SIDRA". Within the workflow, BWAKIT performs genome alignment, which is compute-intensive, and GATK performs variant discovery, which is memory-intensive. We observed that CPU utilization does not exceed 45% during variant discovery; hence, it is necessary to understand the optimal selection of resources (in terms of the number of threads or cores) when automating the NGS workflow. We analyzed the performance bottlenecks and application optimization in terms of "scalability" (using the maximum available CPUs and memory) and "multiple instances of NGS workflow with different genome data within a node" (processing a larger volume of genome data concurrently with a limited set of CPUs and memory). We observed performance improvements of 40%, 65%, 71% and 76% when processing 2, 4, 8 and 16 samples concurrently using our own scheduling heuristics. As a result, our proposed NGS workflow automation improves performance by up to 76% compared with workflows based on application scalability.
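
The abstract does not spell out the scheduling heuristics, so the following is only an illustrative sketch of the second strategy it names: running several workflow instances inside one node instead of giving every core to a single sample. The even split of cores, the 32-core node in the usage note, and the names `threads_per_instance`, `run_concurrently` and `fake_workflow` are all assumptions made for this example and are not taken from the paper. The placeholder `fake_workflow` stands in for launching the real alignment and variant-discovery steps.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def threads_per_instance(total_cores: int, n_samples: int) -> int:
    """Split a node's cores evenly across concurrent workflow instances.

    Assumption for illustration: an even split, with at least one thread
    per instance. The paper's own heuristics may differ.
    """
    if total_cores < 1 or n_samples < 1:
        raise ValueError("total_cores and n_samples must be positive")
    return max(1, total_cores // n_samples)


def run_concurrently(
    samples: List[str],
    total_cores: int,
    run_one: Callable[[str, int], str],
) -> Dict[str, str]:
    """Run one workflow instance per sample, sharing the node's cores."""
    threads = threads_per_instance(total_cores, len(samples))
    # Never start more instances at once than there are cores; any extra
    # samples wait in the pool's queue until a worker is free.
    max_workers = min(len(samples), total_cores)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {s: pool.submit(run_one, s, threads) for s in samples}
        return {s: f.result() for s, f in futures.items()}


def fake_workflow(sample: str, threads: int) -> str:
    """Hypothetical stand-in for alignment plus variant discovery."""
    return f"{sample}: {threads} threads"
```

For example, `run_concurrently(["s1", "s2", "s3", "s4"], 32, fake_workflow)` would give each of the four instances 8 threads on a hypothetical 32-core node. A real scheduler would also have to account for per-instance memory, which the abstract identifies as the limiting resource during variant discovery.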

Original language: English
Title of host publication: 2015 IEEE 15th International Conference on Bioinformatics and Bioengineering, BIBE 2015
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781467379830
DOI: https://doi.org/10.1109/BIBE.2015.7367654
Publication status: Published - 28 Dec 2015
Event: 15th IEEE International Conference on Bioinformatics and Bioengineering, BIBE 2015 - Belgrade, Serbia
Duration: 2 Nov 2015 to 4 Nov 2015

Fingerprint

Computing Methodologies
Workflow
Genes
Program processors
Genome
Data storage equipment
Automation
Scalability
Processing
Precision Medicine
Genome-Wide Association Study
Medicine
Scheduling
Software
Technology

Keywords

  • BWA
  • Data-Intensive Workload and Concurrent Parallelization
  • High Performance Computing
  • Human Genome Sequence
  • Next Generation Sequencing
  • Thread Scalability

ASJC Scopus subject areas

  • Biotechnology
  • Computer Science Applications
  • Biomedical Engineering
  • Health Informatics

Cite this

Kathiresan, N., Al-Ali, R. J., Jithesh, P. V., AbuZaid, T., Temanni, R., & Ptitsyn, A. (2015). Optimization of data-intensive next generation sequencing in high performance computing. In 2015 IEEE 15th International Conference on Bioinformatics and Bioengineering, BIBE 2015 [7367654]. Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/BIBE.2015.7367654

@inproceedings{1d0b30184f53427c818c226f6b727f86,
title = "Optimization of data-intensive next generation sequencing in high performance computing",
keywords = "BWA, Data-Intensive Workload and Concurrent Parallelization, High Performance Computing, Human Genome Sequence, Next Generation Sequencing, Thread Scalability",
author = "Nagarajan Kathiresan and Al-Ali, {Rashid J.} and Jithesh, {Puthen V.} and Tariq AbuZaid and Ramzi Temanni and Andrey Ptitsyn",
year = "2015",
month = "12",
day = "28",
doi = "10.1109/BIBE.2015.7367654",
language = "English",
booktitle = "2015 IEEE 15th International Conference on Bioinformatics and Bioengineering, BIBE 2015",
publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY  - GEN
T1  - Optimization of data-intensive next generation sequencing in high performance computing
AU  - Kathiresan, Nagarajan
AU  - Al-Ali, Rashid J.
AU  - Jithesh, Puthen V.
AU  - AbuZaid, Tariq
AU  - Temanni, Ramzi
AU  - Ptitsyn, Andrey
PY  - 2015/12/28
Y1  - 2015/12/28
KW  - BWA
KW  - Data-Intensive Workload and Concurrent Parallelization
KW  - High Performance Computing
KW  - Human Genome Sequence
KW  - Next Generation Sequencing
KW  - Thread Scalability
UR  - http://www.scopus.com/inward/record.url?scp=84962844476&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=84962844476&partnerID=8YFLogxK
U2  - 10.1109/BIBE.2015.7367654
DO  - 10.1109/BIBE.2015.7367654
M3  - Conference contribution
BT  - 2015 IEEE 15th International Conference on Bioinformatics and Bioengineering, BIBE 2015
PB  - Institute of Electrical and Electronics Engineers Inc.
ER  -