Conserved codon composition of ribosomal protein coding genes in Escherichia coli, Mycobacterium tuberculosis and Saccharomyces cerevisiae

Lessons from supervised machine learning in functional genomics

Kui Lin, Yuyu Kuang, Jeremiah S. Joseph, Prasanna Kolatkar

Research output: Contribution to journal › Article

45 Citations (Scopus)

Abstract

Genomics projects have resulted in a flood of sequence data. Functional annotation currently relies almost exclusively on inter-species sequence comparison and is restricted in cases of limited data from related species and widely divergent sequences with no known homologs. Here, we demonstrate that codon composition, a fusion of codon usage bias and amino acid composition signals, can accurately discriminate, in the absence of sequence homology information, cytoplasmic ribosomal protein genes from all other genes of known function in Saccharomyces cerevisiae, Escherichia coli and Mycobacterium tuberculosis using an implementation of support vector machines, SVMlight. Analysis of these codon composition signals is instructive in determining features that confer individuality to ribosomal protein genes. Each of the sets of positively charged, negatively charged and small hydrophobic residues, as well as codon bias, contributes to the distinctive codon composition profile of these genes. The representation of all these signals is sensitively detected, combined and augmented by the SVMs to perform an accurate classification. Of special mention is an obvious outlier, yeast gene RPL22B, highly homologous to RPL22A but employing very different codon usage, perhaps indicating a non-ribosomal function. Finally, we propose that codon composition be used in combination with other attributes in gene/protein classification by supervised machine learning algorithms.
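As an illustration of the kind of feature vector the abstract describes, the sketch below computes a codon composition profile for one coding sequence and writes it as a line of SVMlight training input. It is a minimal sketch under stated assumptions, not the authors' pipeline: the abstract does not give the exact feature definition, so the code assumes codon composition means the relative frequency of each of the 64 codons, and the function names `codon_composition` and `to_svmlight_line` are hypothetical. The only detail taken from SVMlight itself is its sparse input format, `<label> <index>:<value> ...`, with feature indices starting at 1 and listed in increasing order.

```python
from itertools import product

# All 64 codons in a fixed lexicographic order (A < C < G < T).
# The order is an arbitrary choice for this sketch; it only has to be
# the same for every gene so that feature indices line up.
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]


def codon_composition(cds):
    """Return the relative frequency of each of the 64 codons in a coding sequence.

    Assumption: the sequence is read in frame from its first base.
    Trailing bases that do not fill a codon, and codons containing
    ambiguous letters such as N, are skipped.
    """
    seq = cds.upper().replace("U", "T")
    counts = dict.fromkeys(CODONS, 0)
    total = 0
    for i in range(0, len(seq) - len(seq) % 3, 3):
        codon = seq[i:i + 3]
        if codon in counts:
            counts[codon] += 1
            total += 1
    if total == 0:
        return [0.0] * len(CODONS)
    return [counts[c] / total for c in CODONS]


def to_svmlight_line(label, features):
    """Format one example in SVMlight's sparse input format.

    label is +1 (e.g. ribosomal protein gene) or -1 (any other gene);
    zero-valued features are omitted, and indices are 1-based.
    """
    parts = [f"{i}:{v:.6g}" for i, v in enumerate(features, start=1) if v != 0.0]
    return " ".join([f"{label:+d}"] + parts)


if __name__ == "__main__":
    # Toy sequence for demonstration only, not a real gene.
    vec = codon_composition("ATGAAAAAATAA")
    print(to_svmlight_line(+1, vec))
```

Because each codon maps to exactly one amino acid, a 64-dimensional codon frequency vector carries both amino acid composition and synonymous codon choice, which is the "fusion" of signals the abstract refers to.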

Original language: English
Pages (from-to): 2599-2607
Number of pages: 9
Journal: Nucleic Acids Research
Volume: 30
Issue number: 11
Publication status: Published - 1 Jun 2002
Externally published: Yes

Fingerprint

Ribosomal Proteins
Genomics
Mycobacterium tuberculosis
Codon
Saccharomyces cerevisiae
Escherichia coli
Genes
Sequence Homology
Supervised Machine Learning
Individuality
Yeasts
Amino Acids
Proteins

ASJC Scopus subject areas

  • Genetics

Cite this

@article{5f7e87d3bc3a40c3b00ba956e942578a,
title = "Conserved codon composition of ribosomal protein coding genes in Escherichia coli, Mycobacterium tuberculosis and Saccharomyces cerevisiae: Lessons from supervised machine learning in functional genomics",
abstract = "Genomics projects have resulted in a flood of sequence data. Functional annotation currently relies almost exclusively on inter-species sequence comparison and is restricted in cases of limited data from related species and widely divergent sequences with no known homologs. Here, we demonstrate that codon composition, a fusion of codon usage bias and amino acid composition signals, can accurately discriminate, in the absence of sequence homology information, cytoplasmic ribosomal protein genes from all other genes of known function in Saccharomyces cerevisiae, Escherichia coli and Mycobacterium tuberculosis using an implementation of support vector machines, SVMlight. Analysis of these codon composition signals is instructive in determining features that confer individuality to ribosomal protein genes. Each of the sets of positively charged, negatively charged and small hydrophobic residues, as well as codon bias, contributes to the distinctive codon composition profile of these genes. The representation of all these signals is sensitively detected, combined and augmented by the SVMs to perform an accurate classification. Of special mention is an obvious outlier, yeast gene RPL22B, highly homologous to RPL22A but employing very different codon usage, perhaps indicating a non-ribosomal function. Finally, we propose that codon composition be used in combination with other attributes in gene/protein classification by supervised machine learning algorithms.",
author = "Kui Lin and Yuyu Kuang and Joseph, {Jeremiah S.} and Prasanna Kolatkar",
year = "2002",
month = "6",
day = "1",
language = "English",
volume = "30",
pages = "2599--2607",
journal = "Nucleic Acids Research",
issn = "0305-1048",
publisher = "Oxford University Press",
number = "11",

}

TY - JOUR

T1 - Conserved codon composition of ribosomal protein coding genes in Escherichia coli, Mycobacterium tuberculosis and Saccharomyces cerevisiae

T2 - Lessons from supervised machine learning in functional genomics

AU - Lin, Kui

AU - Kuang, Yuyu

AU - Joseph, Jeremiah S.

AU - Kolatkar, Prasanna

PY - 2002/6/1

Y1 - 2002/6/1

AB - Genomics projects have resulted in a flood of sequence data. Functional annotation currently relies almost exclusively on inter-species sequence comparison and is restricted in cases of limited data from related species and widely divergent sequences with no known homologs. Here, we demonstrate that codon composition, a fusion of codon usage bias and amino acid composition signals, can accurately discriminate, in the absence of sequence homology information, cytoplasmic ribosomal protein genes from all other genes of known function in Saccharomyces cerevisiae, Escherichia coli and Mycobacterium tuberculosis using an implementation of support vector machines, SVMlight. Analysis of these codon composition signals is instructive in determining features that confer individuality to ribosomal protein genes. Each of the sets of positively charged, negatively charged and small hydrophobic residues, as well as codon bias, contributes to the distinctive codon composition profile of these genes. The representation of all these signals is sensitively detected, combined and augmented by the SVMs to perform an accurate classification. Of special mention is an obvious outlier, yeast gene RPL22B, highly homologous to RPL22A but employing very different codon usage, perhaps indicating a non-ribosomal function. Finally, we propose that codon composition be used in combination with other attributes in gene/protein classification by supervised machine learning algorithms.

UR - http://www.scopus.com/inward/record.url?scp=0036606072&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0036606072&partnerID=8YFLogxK

M3 - Article

VL - 30

SP - 2599

EP - 2607

JO - Nucleic Acids Research

JF - Nucleic Acids Research

SN - 0305-1048

IS - 11

ER -