NSIT: Novel sequence identification tool

Benjarath Pupacdi, Asif Javed, Mohammed J. Zaki, Mathuros Ruchirawat

Research output: Contribution to journal (article)

Abstract

Novel sequences are DNA sequences present in an individual's genome but absent from the human reference assembly. They are predicted to be biologically important, both individual and population specific, and consistent with the known human migration paths. Recent studies have shown that an average person harbors 2–5 Mb of such sequences and have estimated that the human pan-genome contains as much as 19–40 Mb of novel sequences. To identify them in a de novo genome assembly, some existing sequence aligners have been used, but no computational method has been specifically proposed for this task. In this work, we developed NSIT (Novel Sequence Identification Tool), software that can accurately and efficiently identify novel sequences in an individual's de novo whole genome assembly. We identified and characterized 1.1 Mb, 1.2 Mb, and 1.0 Mb of novel sequences in the NA18507 (African), YH (Asian), and NA12878 (European) de novo genome assemblies, respectively. Our results show very high concordance with the previous work using the respective reference assembly. In addition, our results using the latest human reference assembly suggest that the amount of novel sequence per individual may not be as high as previously reported. We additionally developed a graphical viewer for comparing novel sequence contents. The viewer also helped in identifying sequence contamination; we found 130 kb of Epstein-Barr virus sequence in the previously published NA18507 novel sequences as well as 287 kb of zebrafish repeats in the NA12878 de novo assembly. NSIT requires less than 2 GB of RAM and 1.5–2 hours on a commodity desktop. The program is applicable to input assemblies with varying contig/scaffold sizes, ranging from 100 bp to as high as 50 Mb. It works on both 32-bit and 64-bit systems and outperforms, by large margins, other fast sequence aligners previously applied to this task. To our knowledge, NSIT is the first software tool designed specifically for novel sequence identification in a de novo human genome assembly.
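The definition above (sequence present in an assembly but absent from the reference) can be illustrated with a minimal sketch. This is not NSIT's algorithm, which the abstract does not describe. It assumes that a contig has already been aligned to the reference by some aligner and that the aligned regions are available as half-open intervals in contig coordinates. The function names and the `min_len` threshold are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) in contig coordinates


def novel_intervals(contig_length: int,
                    aligned: List[Interval],
                    min_len: int = 100) -> List[Interval]:
    """Return stretches of a contig not covered by any alignment.

    `aligned` holds the regions of the contig that matched the reference
    (assumed to come from an external aligner). `min_len` is a hypothetical
    cutoff: uncovered stretches shorter than this are ignored.
    """
    novel: List[Interval] = []
    cursor = 0  # first position not yet known to be covered
    for start, end in sorted(aligned):
        # Clamp each interval to the contig boundaries.
        start = min(max(start, 0), contig_length)
        end = min(max(end, 0), contig_length)
        if start - cursor >= min_len:
            novel.append((cursor, start))
        cursor = max(cursor, end)
    # Trailing uncovered stretch, if any.
    if contig_length - cursor >= min_len:
        novel.append((cursor, contig_length))
    return novel


def total_novel_bp(intervals: List[Interval]) -> int:
    """Sum the lengths of the given intervals in base pairs."""
    return sum(end - start for start, end in intervals)
```

For example, a 1,000 bp contig with alignments covering positions 0 to 400 and 650 to 1,000 would yield a single candidate novel interval of 250 bp. Summing such intervals over all contigs gives a per-assembly total comparable in kind, though not in method, to the figures reported in the abstract.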

Original language: English
Article number: e108011
Journal: PLoS One
Volume: 9
Issue number: 9
DOI: https://doi.org/10.1371/journal.pone.0108011
Publication status: Published - 29 Sep 2014

Fingerprint

genome assembly
Genome
Human Genome
Genes
Human Migration
Software
Zebrafish
Human herpesvirus 4
Danio rerio
products and commodities
nucleotide sequences
DNA sequences
Random access storage
Population
Ports and harbors
Computational methods
Viruses
Scaffolds

ASJC Scopus subject areas

  • Agricultural and Biological Sciences (all)
  • Biochemistry, Genetics and Molecular Biology (all)
  • Medicine (all)

Cite this

Pupacdi, B., Javed, A., Zaki, M. J., & Ruchirawat, M. (2014). NSIT: Novel sequence identification tool. PLoS One, 9(9), [e108011]. https://doi.org/10.1371/journal.pone.0108011

@article{2e4fdffc08594d42906430a8cf73b9c9,
title = "NSIT: Novel sequence identification tool",
author = "Benjarath Pupacdi and Asif Javed and Zaki, {Mohammed J.} and Mathuros Ruchirawat",
year = "2014",
month = "9",
day = "29",
doi = "10.1371/journal.pone.0108011",
language = "English",
volume = "9",
journal = "PLoS One",
issn = "1932-6203",
publisher = "Public Library of Science",
number = "9",

}

TY - JOUR

T1 - NSIT

T2 - Novel sequence identification tool

AU - Pupacdi, Benjarath

AU - Javed, Asif

AU - Zaki, Mohammed J.

AU - Ruchirawat, Mathuros

PY - 2014/9/29

Y1 - 2014/9/29

UR - http://www.scopus.com/inward/record.url?scp=84907495414&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84907495414&partnerID=8YFLogxK

U2 - 10.1371/journal.pone.0108011

DO - 10.1371/journal.pone.0108011

M3 - Article

C2 - 25264906

AN - SCOPUS:84907495414

VL - 9

JO - PLoS One

JF - PLoS One

SN - 1932-6203

IS - 9

M1 - e108011

ER -