Semantic assembly and annotation of draft RNAseq transcripts without a reference genome

Andrey Ptitsyn, Ramzi Temanni, Christelle Bouchard, Peter A V Anderson

Research output: Contribution to journalArticle

3 Citations (Scopus)

Abstract

Transcriptomes are one of the first sources of high-throughput genomic data that have benefitted from the introduction of Next-Gen Sequencing. As sequencing technology becomes more accessible, transcriptome sequencing is applicable to multiple organisms for which genome sequences are unavailable. Currently all methods for de novo assembly are based on the concept of matching the nucleotide context overlapping between short fragments-reads. However, even short reads may still contain biologically relevant information which can be used as hints in guiding the assembly process. We propose a computational workflow for the reconstruction and functional annotation of expressed gene transcripts that does not require a reference genome sequence and can be tolerant to low coverage, high error rates and other issues that often lead to poor results of de novo assembly in studies of non-model organisms. We start with either raw sequences or the output of a context-based de novo transcriptome assembly. Instead of mapping reads to a reference genome or creating a completely unsupervised clustering of reads, we assemble the unknown transcriptome using nearest homologs from a public database as seeds. We consider even distant relations, indirectly linking protein-coding fragments to entire gene families in multiple distantly related genomes. The intended application of the proposed method is an additional step of semantic (based on relations between protein-coding fragments) scaffolding following traditional (i.e. based on sequence overlap) de novo assembly. The method we developed was effective in analysis of the jellyfish Cyanea capillata transcriptome and may be applicable in other studies of gene expression in species lacking a high quality reference genome sequence. Our algorithms are implemented in C and designed for parallel computation using a high-performance computer. The software is available free of charge via an open source license.

Original languageEnglish
Article numbere0138006
JournalPLoS One
Volume10
Issue number9
DOIs
Publication statusPublished - 22 Sep 2015

Fingerprint

Transcriptome
Semantics
transcriptome
Genes
Genome
genome
Molecular Sequence Annotation
application coverage
Scyphozoa
Workflow
organisms
Licensure
transcriptomics
application methods
Cluster Analysis
Seeds
Proteins
Software
genes
Nucleotides

ASJC Scopus subject areas

  • Medicine(all)
  • Biochemistry, Genetics and Molecular Biology(all)
  • Agricultural and Biological Sciences(all)

Cite this

Semantic assembly and annotation of draft RNAseq transcripts without a reference genome. / Ptitsyn, Andrey; Temanni, Ramzi; Bouchard, Christelle; Anderson, Peter A V.

In: PLoS One, Vol. 10, No. 9, e0138006, 22.09.2015.

Research output: Contribution to journalArticle

Ptitsyn, Andrey ; Temanni, Ramzi ; Bouchard, Christelle ; Anderson, Peter A V. / Semantic assembly and annotation of draft RNAseq transcripts without a reference genome. In: PLoS One. 2015 ; Vol. 10, No. 9.
@article{24ff5874131e460b8ee07f7b9868b58c,
title = "Semantic assembly and annotation of draft RNAseq transcripts without a reference genome",
abstract = "Transcriptomes are one of the first sources of high-throughput genomic data that have benefitted from the introduction of Next-Gen Sequencing. As sequencing technology becomes more accessible, transcriptome sequencing is applicable to multiple organisms for which genome sequences are unavailable. Currently all methods for de novo assembly are based on the concept of matching the nucleotide context overlapping between short fragments-reads. However, even short reads may still contain biologically relevant information which can be used as hints in guiding the assembly process. We propose a computational workflow for the reconstruction and functional annotation of expressed gene transcripts that does not require a reference genome sequence and can be tolerant to low coverage, high error rates and other issues that often lead to poor results of de novo assembly in studies of non-model organisms. We start with either raw sequences or the output of a context-based de novo transcriptome assembly. Instead of mapping reads to a reference genome or creating a completely unsupervised clustering of reads, we assemble the unknown transcriptome using nearest homologs from a public database as seeds. We consider even distant relations, indirectly linking protein-coding fragments to entire gene families in multiple distantly related genomes. The intended application of the proposed method is an additional step of semantic (based on relations between protein-coding fragments) scaffolding following traditional (i.e. based on sequence overlap) de novo assembly. The method we developed was effective in analysis of the jellyfish Cyanea capillata transcriptome and may be applicable in other studies of gene expression in species lacking a high quality reference genome sequence. Our algorithms are implemented in C and designed for parallel computation using a high-performance computer. The software is available free of charge via an open source license.",
author = "Andrey Ptitsyn and Ramzi Temanni and Christelle Bouchard and Anderson, {Peter A V}",
year = "2015",
month = "9",
day = "22",
doi = "10.1371/journal.pone.0138006",
language = "English",
volume = "10",
journal = "PLoS One",
issn = "1932-6203",
publisher = "Public Library of Science",
number = "9",

}

TY - JOUR

T1 - Semantic assembly and annotation of draft RNAseq transcripts without a reference genome

AU - Ptitsyn, Andrey

AU - Temanni, Ramzi

AU - Bouchard, Christelle

AU - Anderson, Peter A V

PY - 2015/9/22

Y1 - 2015/9/22

N2 - Transcriptomes are one of the first sources of high-throughput genomic data that have benefitted from the introduction of Next-Gen Sequencing. As sequencing technology becomes more accessible, transcriptome sequencing is applicable to multiple organisms for which genome sequences are unavailable. Currently all methods for de novo assembly are based on the concept of matching the nucleotide context overlapping between short fragments-reads. However, even short reads may still contain biologically relevant information which can be used as hints in guiding the assembly process. We propose a computational workflow for the reconstruction and functional annotation of expressed gene transcripts that does not require a reference genome sequence and can be tolerant to low coverage, high error rates and other issues that often lead to poor results of de novo assembly in studies of non-model organisms. We start with either raw sequences or the output of a context-based de novo transcriptome assembly. Instead of mapping reads to a reference genome or creating a completely unsupervised clustering of reads, we assemble the unknown transcriptome using nearest homologs from a public database as seeds. We consider even distant relations, indirectly linking protein-coding fragments to entire gene families in multiple distantly related genomes. The intended application of the proposed method is an additional step of semantic (based on relations between protein-coding fragments) scaffolding following traditional (i.e. based on sequence overlap) de novo assembly. The method we developed was effective in analysis of the jellyfish Cyanea capillata transcriptome and may be applicable in other studies of gene expression in species lacking a high quality reference genome sequence. Our algorithms are implemented in C and designed for parallel computation using a high-performance computer. The software is available free of charge via an open source license.

AB - Transcriptomes are one of the first sources of high-throughput genomic data that have benefitted from the introduction of Next-Gen Sequencing. As sequencing technology becomes more accessible, transcriptome sequencing is applicable to multiple organisms for which genome sequences are unavailable. Currently all methods for de novo assembly are based on the concept of matching the nucleotide context overlapping between short fragments-reads. However, even short reads may still contain biologically relevant information which can be used as hints in guiding the assembly process. We propose a computational workflow for the reconstruction and functional annotation of expressed gene transcripts that does not require a reference genome sequence and can be tolerant to low coverage, high error rates and other issues that often lead to poor results of de novo assembly in studies of non-model organisms. We start with either raw sequences or the output of a context-based de novo transcriptome assembly. Instead of mapping reads to a reference genome or creating a completely unsupervised clustering of reads, we assemble the unknown transcriptome using nearest homologs from a public database as seeds. We consider even distant relations, indirectly linking protein-coding fragments to entire gene families in multiple distantly related genomes. The intended application of the proposed method is an additional step of semantic (based on relations between protein-coding fragments) scaffolding following traditional (i.e. based on sequence overlap) de novo assembly. The method we developed was effective in analysis of the jellyfish Cyanea capillata transcriptome and may be applicable in other studies of gene expression in species lacking a high quality reference genome sequence. Our algorithms are implemented in C and designed for parallel computation using a high-performance computer. The software is available free of charge via an open source license.

UR - http://www.scopus.com/inward/record.url?scp=84947768230&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84947768230&partnerID=8YFLogxK

U2 - 10.1371/journal.pone.0138006

DO - 10.1371/journal.pone.0138006

M3 - Article

VL - 10

JO - PLoS One

JF - PLoS One

SN - 1932-6203

IS - 9

M1 - e0138006

ER -