Accelerating next generation sequencing data analysis with system level optimizations

Research output: Contribution to journal, Article

5 Citations (Scopus)

Abstract

Next generation sequencing (NGS) data analysis is highly compute intensive. In-memory computing, vectorization, bulk data transfer, and CPU frequency scaling are some of the hardware features in modern computing architectures. To get the best execution time and utilize these hardware features, it is necessary to tune the system level parameters before running the application. We studied the GATK-HaplotypeCaller, which is part of common NGS workflows and consumes more than 43% of the total execution time. Multiple GATK 3.x versions were benchmarked, and the execution time of HaplotypeCaller was optimized by tuning various system level parameters, which included: (i) tuning the parallel garbage collection and kernel shared memory to simulate in-memory computing; (ii) architecture-specific tuning in the PairHMM library for vectorization; (iii) including Java 1.8 features through GATK source code compilation and building a runtime environment for parallel sorting and bulk data transfer; and (iv) replacing the default 'on-demand' CPU frequency mode with 'performance-mode' to accelerate the Java multi-threads. As a result, the HaplotypeCaller execution time was reduced by 82.66% in GATK 3.3 and 42.61% in GATK 3.7. Overall, the execution time of the NGS pipeline was reduced to 70.60% and 34.14% for GATK 3.3 and GATK 3.7, respectively.
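The HaplotypeCaller figures above are stated as percentage reductions in execution time. As a reading aid, the short sketch below converts such a reduction into an equivalent speedup factor using speedup = 1 / (1 - reduction). The function name `speedup_from_reduction` is hypothetical and not taken from the article, and the conversion assumes "reduced by X%" means the optimized run takes (100 - X)% of the baseline time.

```python
def speedup_from_reduction(reduction_percent: float) -> float:
    """Convert a percentage reduction in run time into a speedup factor.

    Assumption: "reduced by X%" means the new run time is (100 - X)% of
    the baseline run time.
    """
    if not 0 <= reduction_percent < 100:
        raise ValueError("reduction_percent must be in the range [0, 100)")
    remaining_fraction = 1.0 - reduction_percent / 100.0
    return 1.0 / remaining_fraction


# HaplotypeCaller reductions quoted in the abstract.
reported_reductions = {"GATK 3.3": 82.66, "GATK 3.7": 42.61}

for version, reduction in reported_reductions.items():
    factor = speedup_from_reduction(reduction)
    print(f"{version}: {reduction}% less time is about {factor:.2f}x faster")
```

Under that assumption, the reported reductions correspond to roughly a 5.77x speedup for GATK 3.3 and a 1.74x speedup for GATK 3.7.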

Original language: English
Article number: 9058
Journal: Scientific Reports
Volume: 7
Issue number: 1
DOIs: 10.1038/s41598-017-09089-1
Publication status: Published - 1 Dec 2017

Fingerprint

Data transfer
Data storage equipment
Program processors
Tuning
Sorting
Computer hardware
Pipelines
Hardware

ASJC Scopus subject areas

  • General

Cite this

@article{1e59c5b57e7f4e6f85d492297434b615,
title = "Accelerating next generation sequencing data analysis with system level optimizations",
abstract = "Next generation sequencing (NGS) data analysis is highly compute intensive. In-memory computing, vectorization, bulk data transfer, and CPU frequency scaling are some of the hardware features in modern computing architectures. To get the best execution time and utilize these hardware features, it is necessary to tune the system level parameters before running the application. We studied the GATK-HaplotypeCaller, which is part of common NGS workflows and consumes more than 43{\%} of the total execution time. Multiple GATK 3.x versions were benchmarked, and the execution time of HaplotypeCaller was optimized by tuning various system level parameters, which included: (i) tuning the parallel garbage collection and kernel shared memory to simulate in-memory computing; (ii) architecture-specific tuning in the PairHMM library for vectorization; (iii) including Java 1.8 features through GATK source code compilation and building a runtime environment for parallel sorting and bulk data transfer; and (iv) replacing the default 'on-demand' CPU frequency mode with 'performance-mode' to accelerate the Java multi-threads. As a result, the HaplotypeCaller execution time was reduced by 82.66{\%} in GATK 3.3 and 42.61{\%} in GATK 3.7. Overall, the execution time of the NGS pipeline was reduced to 70.60{\%} and 34.14{\%} for GATK 3.3 and GATK 3.7, respectively.",
author = "Nagarajan Kathiresan and Ramzi Temanni and Hakeem Almabrazi and Najeeb Syed and Jithesh, {Puthen V.} and Al-Ali, {Rashid J.}",
year = "2017",
month = "12",
day = "1",
doi = "10.1038/s41598-017-09089-1",
language = "English",
volume = "7",
journal = "Scientific Reports",
issn = "2045-2322",
publisher = "Nature Publishing Group",
number = "1",

}

TY - JOUR

T1 - Accelerating next generation sequencing data analysis with system level optimizations

AU - Kathiresan, Nagarajan

AU - Temanni, Ramzi

AU - Almabrazi, Hakeem

AU - Syed, Najeeb

AU - Jithesh, Puthen V.

AU - Al-Ali, Rashid J.

PY - 2017/12/1

Y1 - 2017/12/1

N2 - Next generation sequencing (NGS) data analysis is highly compute intensive. In-memory computing, vectorization, bulk data transfer, and CPU frequency scaling are some of the hardware features in modern computing architectures. To get the best execution time and utilize these hardware features, it is necessary to tune the system level parameters before running the application. We studied the GATK-HaplotypeCaller, which is part of common NGS workflows and consumes more than 43% of the total execution time. Multiple GATK 3.x versions were benchmarked, and the execution time of HaplotypeCaller was optimized by tuning various system level parameters, which included: (i) tuning the parallel garbage collection and kernel shared memory to simulate in-memory computing; (ii) architecture-specific tuning in the PairHMM library for vectorization; (iii) including Java 1.8 features through GATK source code compilation and building a runtime environment for parallel sorting and bulk data transfer; and (iv) replacing the default 'on-demand' CPU frequency mode with 'performance-mode' to accelerate the Java multi-threads. As a result, the HaplotypeCaller execution time was reduced by 82.66% in GATK 3.3 and 42.61% in GATK 3.7. Overall, the execution time of the NGS pipeline was reduced to 70.60% and 34.14% for GATK 3.3 and GATK 3.7, respectively.

UR - http://www.scopus.com/inward/record.url?scp=85028064173&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85028064173&partnerID=8YFLogxK

U2 - 10.1038/s41598-017-09089-1

DO - 10.1038/s41598-017-09089-1

M3 - Article

VL - 7

JO - Scientific Reports

JF - Scientific Reports

SN - 2045-2322

IS - 1

M1 - 9058

ER -