Evaluation of SNP calling using single and multiple-sample calling algorithms by validation against array base genotyping and Mendelian inheritance

Pankaj Kumar, Mashael Al-Shafai, Wadha Ahmed Al Muftah, Nader Chalhoub, Mahmoud F. Elsaid, Alice Kamal Abd El Aleem, Karsten Suhre

Research output: Contribution to journal (article)

9 Citations (Scopus)

Abstract

Background: With diminishing costs of next generation sequencing (NGS), whole genome analysis becomes a standard tool for identifying genetic causes of inherited diseases. Commercial NGS service providers in general not only provide raw genomic reads, but further deliver SNP calls to their clients. However, the question for the user arises whether to use the SNP data as is, or process the raw sequencing data further through more sophisticated SNP calling pipelines with more advanced algorithms.

Results: Here we report a detailed comparison of SNPs called using the popular GATK multiple-sample calling protocol to SNPs delivered as part of a 40x whole genome sequencing project by Illumina Inc of 171 human genomes of Arab descent (108 unrelated Qatari genomes, 19 trios, and 2 families with rare diseases) and compare them to variants provided by the Illumina CASAVA pipeline. GATK multi-sample calling identifies more variants than the CASAVA pipeline. The additional variants from GATK are robust for Mendelian consistencies but weak in terms of statistical parameters such as TsTv ratio. However, these additional variants do not make a difference in detecting the causative variants in the studied phenotype.

Conclusion: Both pipelines, GATK multi-sample calling and Illumina CASAVA single sample calling, have highly similar performance in SNP calling at the level of putatively causative variants.
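
The two quality criteria named in the abstract, the TsTv (transition/transversion) ratio and Mendelian consistency in trios, can be illustrated with a minimal Python sketch. This is not the authors' pipeline and uses neither GATK nor CASAVA. The function names `tstv_ratio` and `is_mendelian_consistent` and the tuple-based genotype representation are assumptions made for illustration. The sketch covers only single-base substitutions and autosomal diploid genotypes, and it ignores missing calls and de novo mutations.

```python
# Illustrative sketch only: names and data layout are hypothetical,
# not taken from the paper or from any variant-calling tool.

# Transitions swap purine with purine (A<->G) or pyrimidine with pyrimidine (C<->T).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def tstv_ratio(snps):
    """Return the transition/transversion ratio for (ref, alt) base pairs."""
    snps = list(snps)  # allow generators, since we iterate twice
    ts = sum(1 for ref, alt in snps if (ref, alt) in TRANSITIONS)
    tv = sum(
        1 for ref, alt in snps
        if ref != alt and (ref, alt) not in TRANSITIONS
    )
    # With no transversions the ratio is undefined; report infinity.
    return ts / tv if tv else float("inf")


def is_mendelian_consistent(child, father, mother):
    """Check that one child allele can come from each parent.

    Each genotype is a 2-tuple of alleles, e.g. ("A", "G").
    """
    a, b = child
    return (a in father and b in mother) or (b in father and a in mother)


if __name__ == "__main__":
    calls = [("A", "G"), ("C", "T"), ("A", "C")]
    print(tstv_ratio(calls))
    print(is_mendelian_consistent(("A", "G"), ("A", "A"), ("G", "G")))
    print(is_mendelian_consistent(("G", "G"), ("A", "A"), ("G", "G")))
```

In a trio, a call set with many Mendelian inconsistencies suggests genotyping errors, while a TsTv ratio far from the value expected for the sequenced region suggests an excess of false-positive SNPs; the sketch shows only how each quantity is counted, not how thresholds should be chosen.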

Original language: English
Article number: 747
Journal: BMC Research Notes
Volume: 7
Issue number: 1
DOI: 10.1186/1756-0500-7-747
Publication status: Published - 2014

Fingerprint

Single Nucleotide Polymorphism
Pipelines
Genes
Genome
Human Genome
Rare Diseases
Phenotype
Costs
Costs and Cost Analysis

Keywords

  • CASAVA
  • GATK
  • Genotype calling
  • Illumina
  • Mendelian inheritance
  • Multi-sample calling
  • NGS
  • Qatari population
  • Trios
  • Variant
  • WGS pipeline

ASJC Scopus subject areas

  • Medicine (all)
  • Biochemistry, Genetics and Molecular Biology (all)

Cite this

Evaluation of SNP calling using single and multiple-sample calling algorithms by validation against array base genotyping and Mendelian inheritance. / Kumar, Pankaj; Al-Shafai, Mashael; Al Muftah, Wadha Ahmed; Chalhoub, Nader; Elsaid, Mahmoud F.; Kamal Abd El Aleem, Alice; Suhre, Karsten.

In: BMC Research Notes, Vol. 7, No. 1, 747, 2014.

@article{54dbfe47403240818983f3c8aacb910b,
title = "Evaluation of SNP calling using single and multiple-sample calling algorithms by validation against array base genotyping and Mendelian inheritance",
abstract = "Background: With diminishing costs of next generation sequencing (NGS), whole genome analysis becomes a standard tool for identifying genetic causes of inherited diseases. Commercial NGS service providers in general not only provide raw genomic reads, but further deliver SNP calls to their clients. However, the question for the user arises whether to use the SNP data as is, or process the raw sequencing data further through more sophisticated SNP calling pipelines with more advanced algorithms. Results: Here we report a detailed comparison of SNPs called using the popular GATK multiple-sample calling protocol to SNPs delivered as part of a 40x whole genome sequencing project by Illumina Inc of 171 human genomes of Arab descent (108 unrelated Qatari genomes, 19 trios, and 2 families with rare diseases) and compare them to variants provided by the Illumina CASAVA pipeline. GATK multi-sample calling identifies more variants than the CASAVA pipeline. The additional variants from GATK are robust for Mendelian consistencies but weak in terms of statistical parameters such as TsTv ratio. However, these additional variants do not make a difference in detecting the causative variants in the studied phenotype. Conclusion: Both pipelines, GATK multi-sample calling and Illumina CASAVA single sample calling, have highly similar performance in SNP calling at the level of putatively causative variants.",
keywords = "CASAVA, GATK, Genotype calling, Illumina, Mendelian inheritance, Multi-sample calling, NGS, Qatari population, Trios, Variant, WGS pipeline",
author = "Pankaj Kumar and Mashael Al-Shafai and {Al Muftah}, {Wadha Ahmed} and Nader Chalhoub and Elsaid, {Mahmoud F.} and {Kamal Abd El Aleem}, Alice and Karsten Suhre",
year = "2014",
doi = "10.1186/1756-0500-7-747",
language = "English",
volume = "7",
journal = "BMC Research Notes",
issn = "1756-0500",
publisher = "BioMed Central",
number = "1",

}

TY - JOUR
T1 - Evaluation of SNP calling using single and multiple-sample calling algorithms by validation against array base genotyping and Mendelian inheritance
AU - Kumar, Pankaj
AU - Al-Shafai, Mashael
AU - Al Muftah, Wadha Ahmed
AU - Chalhoub, Nader
AU - Elsaid, Mahmoud F.
AU - Kamal Abd El Aleem, Alice
AU - Suhre, Karsten
PY - 2014
Y1 - 2014
N2 - Background: With diminishing costs of next generation sequencing (NGS), whole genome analysis becomes a standard tool for identifying genetic causes of inherited diseases. Commercial NGS service providers in general not only provide raw genomic reads, but further deliver SNP calls to their clients. However, the question for the user arises whether to use the SNP data as is, or process the raw sequencing data further through more sophisticated SNP calling pipelines with more advanced algorithms. Results: Here we report a detailed comparison of SNPs called using the popular GATK multiple-sample calling protocol to SNPs delivered as part of a 40x whole genome sequencing project by Illumina Inc of 171 human genomes of Arab descent (108 unrelated Qatari genomes, 19 trios, and 2 families with rare diseases) and compare them to variants provided by the Illumina CASAVA pipeline. GATK multi-sample calling identifies more variants than the CASAVA pipeline. The additional variants from GATK are robust for Mendelian consistencies but weak in terms of statistical parameters such as TsTv ratio. However, these additional variants do not make a difference in detecting the causative variants in the studied phenotype. Conclusion: Both pipelines, GATK multi-sample calling and Illumina CASAVA single sample calling, have highly similar performance in SNP calling at the level of putatively causative variants.
KW - CASAVA
KW - GATK
KW - Genotype calling
KW - Illumina
KW - Mendelian inheritance
KW - Multi-sample calling
KW - NGS
KW - Qatari population
KW - Trios
KW - Variant
KW - WGS pipeline
UR - http://www.scopus.com/inward/record.url?scp=84932169705&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84932169705&partnerID=8YFLogxK
U2 - 10.1186/1756-0500-7-747
DO - 10.1186/1756-0500-7-747
M3 - Article
VL - 7
JO - BMC Research Notes
JF - BMC Research Notes
SN - 1756-0500
IS - 1
M1 - 747
ER -