Optimization of data-intensive next generation sequencing in high performance computing

Research output: Chapter in Book/Report/Conference proceeding (Conference contribution)

2 Citations (Scopus)

Abstract

Advances in Next Generation Sequencing (NGS) technology are associated with an ever-increasing volume of genomic data every year. These genomic data can be processed efficiently by empirical parallelism using High Performance Computing (HPC). The processed data can be used for genome-wide association studies, genetics, personalized medicine and many other areas. Different kinds of algorithms and implementations are used in the different phases of genome processing. In this paper, we used BWAKIT- and GATK-based software to process large volumes of genomic data, referred to as the "NGS workflow at SIDRA". Within the workflow, we used BWAKIT for genome alignment and GATK for variant discovery, which require heavy computation and a large amount of memory, respectively. We observed that CPU utilization does not exceed 45% during variant discovery; hence, it is necessary to understand the optimal selection of resources (in terms of the number of threads or cores) during NGS workflow automation. We analyzed the performance bottlenecks and application optimization in terms of "scalability" (using the maximum available CPUs and memory) and "multiple instances of NGS workflow with different genome data within a node" (processing a larger volume of genome data concurrently with a limited set of CPUs and memory). We observed performance improvements of 40%, 65%, 71% and 76% when processing 2, 4, 8 and 16 samples concurrently, respectively, using our own scheduling heuristics. As a result, our proposed NGS workflow automation improves performance by up to 76% compared with workflows based on application scalability.
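The contrast between the "scalability" mode and the "multiple instances within a node" mode can be illustrated with a minimal Python sketch. It is not the authors' scheduling heuristic, which the abstract does not describe. The even split of cores, the 32-core node, and the `run_workflow` placeholder are all assumptions made for illustration; a real version would launch the alignment and variant discovery steps with the given thread count.

```python
from concurrent.futures import ThreadPoolExecutor


def threads_per_instance(total_cores: int, concurrent_samples: int) -> int:
    """Split a node's cores evenly across concurrent workflow instances.

    The even split is an illustrative assumption, not the paper's heuristic.
    """
    if total_cores < 1 or concurrent_samples < 1:
        raise ValueError("total_cores and concurrent_samples must be positive")
    # Never hand an instance fewer than one thread.
    return max(1, total_cores // concurrent_samples)


def run_workflow(sample: str, threads: int) -> str:
    """Hypothetical placeholder for one alignment + variant discovery run.

    It only reports the thread budget it was given instead of launching tools.
    """
    return f"{sample}: {threads} threads"


def run_concurrently(samples, total_cores, concurrent_samples):
    """Run up to `concurrent_samples` workflow instances at a time on one node."""
    threads = threads_per_instance(total_cores, concurrent_samples)
    with ThreadPoolExecutor(max_workers=concurrent_samples) as pool:
        # map() returns results in the same order as the input samples.
        return list(pool.map(lambda s: run_workflow(s, threads), samples))


if __name__ == "__main__":
    # Assumed node size of 32 cores, shared by 4 concurrent samples.
    for line in run_concurrently(["s1", "s2", "s3", "s4"],
                                 total_cores=32, concurrent_samples=4):
        print(line)
```

Setting `concurrent_samples=1` corresponds to the "scalability" mode, where a single sample receives every core; larger values trade per-sample threads for more samples in flight on the same node.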

Original language: English
Title of host publication: 2015 IEEE 15th International Conference on Bioinformatics and Bioengineering, BIBE 2015
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781467379830
DOIs: https://doi.org/10.1109/BIBE.2015.7367654
Publication status: Published - 28 Dec 2015
Event: 15th IEEE International Conference on Bioinformatics and Bioengineering, BIBE 2015 - Belgrade, Serbia
Duration: 2 Nov 2015 to 4 Nov 2015

Other

Other: 15th IEEE International Conference on Bioinformatics and Bioengineering, BIBE 2015
Country: Serbia
City: Belgrade
Period: 2/11/15 to 4/11/15

Keywords

  • BWA
  • Data-Intensive Workload and Concurrent Parallelization
  • High Performance Computing
  • Human Genome Sequence
  • Next Generation Sequencing
  • Thread Scalability

ASJC Scopus subject areas

  • Biotechnology
  • Computer Science Applications
  • Biomedical Engineering
  • Health Informatics

Cite this

Kathiresan, N., Al-Ali, R. J., Jithesh, P. V., AbuZaid, T., Temanni, R., & Ptitsyn, A. (2015). Optimization of data-intensive next generation sequencing in high performance computing. In 2015 IEEE 15th International Conference on Bioinformatics and Bioengineering, BIBE 2015 [7367654]. Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/BIBE.2015.7367654