Identification of long non-coding transcripts with feature selection: a comparative study

Giovanna M.M. Ventola, Teresa M.R. Noviello, Salvatore D'Aniello, Antonietta Spagnuolo, Michele Ceccarelli, Luigi Cerulo

Research output: Contribution to journal (Article)

5 Citations (Scopus)

Abstract

BACKGROUND: The unveiling of long non-coding RNAs as important gene regulators in many biological contexts has increased the demand for efficient and robust computational methods to identify novel long non-coding RNAs from transcripts assembled with high-throughput RNA-seq data. Several classes of sequence-based features have been proposed to distinguish between coding and non-coding transcripts. Among them, open reading frame, conservation scores, nucleotide arrangements, and RNA secondary structure have been used with success in the literature to recognize intergenic long non-coding RNAs, a particular subclass of non-coding RNAs.

RESULTS: In this paper we perform a systematic assessment of a wide collection of features extracted from sequence data. We use most of the features proposed in the literature, and we include, as a novel set of features, the occurrence of repeats contained in transposable elements. The aim is to detect signatures (groups of features) able to distinguish long non-coding transcripts from other classes, both protein-coding and non-coding. We evaluate different feature selection algorithms, test for signature stability, and evaluate the prediction ability of a signature with a machine learning algorithm. The study reveals different signatures in human, mouse, and zebrafish, highlighting that some features are shared among species, while others tend to be species-specific. Compared with coding potential tools and similar supervised approaches, a machine learning algorithm that includes novel signatures, such as those identified here, improves prediction performance, measured as the area under the precision-recall curve, by 1 to 24%, depending on the species and on the signature.

CONCLUSIONS: Understanding which features are best suited for the prediction of long non-coding RNAs allows for the development of more effective automatic annotation pipelines, which are especially relevant for poorly annotated genomes such as zebrafish. We provide a web tool that recognizes novel long non-coding RNAs with the obtained signatures from input in FASTA and GTF formats. The tool is available at the following URL: http://www.bioinformatics-sannio.org/software/ .
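
As a rough illustration of two of the sequence-based feature classes named in the abstract (open reading frame and nucleotide arrangements), the Python sketch below computes a longest-ORF length and k-mer frequencies for a single transcript. It is not the authors' pipeline: the function names, the ATG-to-stop ORF definition, the forward-strand-only scan, and the default k = 2 are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf_length(seq):
    """Length (nt) of the longest ATG-to-stop ORF in the three forward frames.

    Assumption: an ORF starts at the first ATG in a frame and ends at the
    next in-frame stop codon (stop included); unterminated ORFs are ignored.
    """
    seq = seq.upper().replace("U", "T")
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best


def kmer_frequencies(seq, k=2):
    """Relative frequency of every A/C/G/T k-mer (overlapping windows).

    Windows containing other symbols (for example N) are not counted.
    """
    seq = seq.upper().replace("U", "T")
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[kmer] for kmer in kmers)
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in kmers}


def sequence_features(seq, k=2):
    """Collect the hypothetical features into one flat dictionary."""
    features = {"length": len(seq), "longest_orf": longest_orf_length(seq)}
    for kmer, freq in kmer_frequencies(seq, k).items():
        features["freq_" + kmer] = freq
    return features
```

A feature dictionary of this kind, computed per transcript, is the sort of input on which a feature selection step could then operate; conservation scores, secondary structure, and transposable-element repeats would need external tools and are not sketched here.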

Original language: English
Number of pages: 1
Journal: BMC Bioinformatics
Volume: 18
Issue number: 1
DOI: 10.1186/s12859-017-1594-z
Publication status: Published - 23 Mar 2017
Externally published: Yes

Fingerprint

Long Noncoding RNA
RNA
Feature Selection
Comparative Study
Feature extraction
Signature
Zebrafish
Coding
Learning algorithms
Learning systems
Genes
Learning Algorithm
Machine Learning
Untranslated RNA
DNA Transposable Elements
Group Signature
RNA Secondary Structure
Regulator Genes
Bioinformatics
Computational methods

Keywords

  • Classification
  • Feature selection
  • lncRNA

ASJC Scopus subject areas

  • Structural Biology
  • Biochemistry
  • Molecular Biology
  • Computer Science Applications
  • Applied Mathematics

Cite this

Identification of long non-coding transcripts with feature selection: a comparative study. / Ventola, Giovanna M.M.; Noviello, Teresa M.R.; D'Aniello, Salvatore; Spagnuolo, Antonietta; Ceccarelli, Michele; Cerulo, Luigi.

In: BMC Bioinformatics, Vol. 18, No. 1, 23.03.2017.

Ventola, Giovanna M.M.; Noviello, Teresa M.R.; D'Aniello, Salvatore; Spagnuolo, Antonietta; Ceccarelli, Michele; Cerulo, Luigi. / Identification of long non-coding transcripts with feature selection: a comparative study. In: BMC Bioinformatics. 2017; Vol. 18, No. 1.

@article{da4b5beffa024e4eb0eb524caa5ce37e,
title = "Identification of long non-coding transcripts with feature selection: a comparative study",
abstract = "BACKGROUND: The unveiling of long non-coding RNAs as important gene regulators in many biological contexts has increased the demand for efficient and robust computational methods to identify novel long non-coding RNAs from transcripts assembled with high throughput RNA-seq data. Several classes of sequence-based features have been proposed to distinguish between coding and non-coding transcripts. Among them, open reading frame, conservation scores, nucleotide arrangements, and RNA secondary structure have been used with success in literature to recognize intergenic long non-coding RNAs, a particular subclass of non-coding RNAs. RESULTS: In this paper we perform a systematic assessment of a wide collection of features extracted from sequence data. We use most of the features proposed in the literature, and we include, as a novel set of features, the occurrence of repeats contained in transposable elements. The aim is to detect signatures (groups of features) able to distinguish long non-coding transcripts from other classes, both protein-coding and non-coding. We evaluate different feature selection algorithms, test for signature stability, and evaluate the prediction ability of a signature with a machine learning algorithm. The study reveals different signatures in human, mouse, and zebrafish, highlighting that some features are shared among species, while others tend to be species-specific. Compared to coding potential tools and similar supervised approaches, including novel signatures, such as those identified here, in a machine learning algorithm improves the prediction performance, in terms of area under precision and recall curve, by 1 to 24{\%}, depending on the species and on the signature. CONCLUSIONS: Understanding which features are best suited for the prediction of long non-coding RNAs allows for the development of more effective automatic annotation pipelines especially relevant for poorly annotated genomes, such as zebrafish. We provide a web tool that recognizes novel long non-coding RNAs with the obtained signatures from fasta and gtf formats. The tool is available at the following url: http://www.bioinformatics-sannio.org/software/ .",
keywords = "Classification, Feature selection, lncRNA",
author = "Ventola, {Giovanna M.M.} and Noviello, {Teresa M.R.} and Salvatore D'Aniello and Antonietta Spagnuolo and Michele Ceccarelli and Luigi Cerulo",
year = "2017",
month = "3",
day = "23",
doi = "10.1186/s12859-017-1594-z",
language = "English",
volume = "18",
journal = "BMC Bioinformatics",
issn = "1471-2105",
publisher = "BioMed Central",
number = "1",

}

TY - JOUR

T1 - Identification of long non-coding transcripts with feature selection

T2 - a comparative study

AU - Ventola, Giovanna M.M.

AU - Noviello, Teresa M.R.

AU - D'Aniello, Salvatore

AU - Spagnuolo, Antonietta

AU - Ceccarelli, Michele

AU - Cerulo, Luigi

PY - 2017/3/23

Y1 - 2017/3/23

AB - BACKGROUND: The unveiling of long non-coding RNAs as important gene regulators in many biological contexts has increased the demand for efficient and robust computational methods to identify novel long non-coding RNAs from transcripts assembled with high throughput RNA-seq data. Several classes of sequence-based features have been proposed to distinguish between coding and non-coding transcripts. Among them, open reading frame, conservation scores, nucleotide arrangements, and RNA secondary structure have been used with success in literature to recognize intergenic long non-coding RNAs, a particular subclass of non-coding RNAs. RESULTS: In this paper we perform a systematic assessment of a wide collection of features extracted from sequence data. We use most of the features proposed in the literature, and we include, as a novel set of features, the occurrence of repeats contained in transposable elements. The aim is to detect signatures (groups of features) able to distinguish long non-coding transcripts from other classes, both protein-coding and non-coding. We evaluate different feature selection algorithms, test for signature stability, and evaluate the prediction ability of a signature with a machine learning algorithm. The study reveals different signatures in human, mouse, and zebrafish, highlighting that some features are shared among species, while others tend to be species-specific. Compared to coding potential tools and similar supervised approaches, including novel signatures, such as those identified here, in a machine learning algorithm improves the prediction performance, in terms of area under precision and recall curve, by 1 to 24%, depending on the species and on the signature. CONCLUSIONS: Understanding which features are best suited for the prediction of long non-coding RNAs allows for the development of more effective automatic annotation pipelines especially relevant for poorly annotated genomes, such as zebrafish. We provide a web tool that recognizes novel long non-coding RNAs with the obtained signatures from fasta and gtf formats. The tool is available at the following url: http://www.bioinformatics-sannio.org/software/ .

KW - Classification

KW - Feature selection

KW - lncRNA

UR - http://www.scopus.com/inward/record.url?scp=85028639656&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85028639656&partnerID=8YFLogxK

U2 - 10.1186/s12859-017-1594-z

DO - 10.1186/s12859-017-1594-z

M3 - Article

VL - 18

JO - BMC Bioinformatics

JF - BMC Bioinformatics

SN - 1471-2105

IS - 1

ER -