Annotation of bacterial genomes using improved phylogenomic profiles

F. Enault, Karsten Suhre, C. Abergel, O. Poirot, J. M. Claverie

Research output: Contribution to journal › Article

37 Citations (Scopus)

Abstract

Motivation: Phylogenomic profiling is a large-scale comparative genomic method used to infer protein function from evolutionary information, first described in a binary form by Pellegrini et al. (1999). Here, we propose improvements to this approach, including the use of normalized Blastp bit scores, a normalization of the matrix of profiles to take into account the evolutionary distances between bacteria, the definition of a phylogenomic neighborhood based on continuous pairwise distances between genes, and an original annotation procedure that includes the computation of a p-value for each functional assignment.

Results: The method presented here increases the number of EcoCyc enzymes identified as being evolutionarily related by about 25% with respect to the original binary (absent/present) method. The fraction of 'false' positives is shown to be smaller than 20%. Based on their phylogenomic relationships, genes of unknown function can then be automatically related to annotated genes. Each predicted gene annotation is associated with a p-value, i.e. the probability of obtaining it by chance. The validity of this method was extensively tested on a large set of genes of known function using the MultiFun database. We find that 50% of the 3122 function attributions that can be made at a p-value level of 10⁻¹¹ correspond to the actual gene annotation. The method can be readily applied to any newly sequenced microbial genome. In contrast to earlier work on the same topic, our approach avoids the use of arbitrary cut-off values and provides a reliability estimate of the functional predictions in the form of p-values.
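The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. Everything concrete in it is an assumption made for illustration, not the authors' published procedure: dividing each bit score by the gene's self-hit score, using Euclidean distance between profiles, taking the k nearest genes as the neighborhood, and scoring a functional assignment with a hypergeometric tail probability. The function names and the toy scores are hypothetical, and the normalization of the profile matrix by evolutionary distance between bacteria is not modeled here.

```python
import math


def normalized_profile(bit_scores, self_score):
    """Scale one gene's Blastp bit scores (one per reference genome)
    by its self-hit score and clip to [0, 1]. Assumed normalization."""
    return [min(max(s / self_score, 0.0), 1.0) for s in bit_scores]


def profile_distance(p, q):
    """Euclidean distance between two profiles (an assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def neighborhood(profiles, gene, k):
    """Return the k genes whose profiles lie closest to that of `gene`."""
    others = [
        (profile_distance(profiles[gene], prof), name)
        for name, prof in profiles.items()
        if name != gene
    ]
    return [name for _, name in sorted(others)[:k]]


def enrichment_p_value(total, annotated, drawn, hits):
    """Hypergeometric tail P(X >= hits): chance of seeing at least `hits`
    genes with a given function among `drawn` neighbors, when `annotated`
    of the `total` genes carry that function. Assumed test statistic."""
    upper = min(annotated, drawn)
    tail = sum(
        math.comb(annotated, i) * math.comb(total - annotated, drawn - i)
        for i in range(hits, upper + 1)
    )
    return tail / math.comb(total, drawn)


# Toy data: three genes scored against four reference genomes.
profiles = {
    "geneA": normalized_profile([200, 0, 150, 90], self_score=400),
    "geneB": normalized_profile([210, 10, 140, 100], self_score=420),
    "geneC": normalized_profile([0, 300, 0, 0], self_score=350),
}

nearest = neighborhood(profiles, "geneA", k=1)
p_value = enrichment_p_value(total=10, annotated=3, drawn=2, hits=2)
```

With continuous scores, genes with similar conservation patterns end up close together without first thresholding hits into present/absent, and the tail probability gives each assignment a reliability estimate instead of a fixed cut-off.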

Original language: English
Pages (from-to): i105-i107
Journal: Bioinformatics
Volume: 19
DOIs: 10.1093/bioinformatics/btg1013
Publication status: Published - 2003
Externally published: Yes

Keywords

  • Automated annotation
  • Bacteria
  • Evolution
  • Functional prediction
  • Phylogenomics

ASJC Scopus subject areas

  • Statistics and Probability
  • Medicine(all)
  • Biochemistry
  • Molecular Biology
  • Computer Science Applications
  • Computational Theory and Mathematics
  • Computational Mathematics

Cite this

Annotation of bacterial genomes using improved phylogenomic profiles. / Enault, F.; Suhre, Karsten; Abergel, C.; Poirot, O.; Claverie, J. M.

In: Bioinformatics, Vol. 19, 2003, p. i105-i107.

@article{3edd42b131a8463da9af2249b96a739b,
title = "Annotation of bacterial genomes using improved phylogenomic profiles",
keywords = "Automated annotation, Bacteria, Evolution, Functional prediction, Phylogenomics",
author = "F. Enault and Karsten Suhre and C. Abergel and O. Poirot and Claverie, {J. M.}",
year = "2003",
doi = "10.1093/bioinformatics/btg1013",
language = "English",
volume = "19",
pages = "i105--i107",
journal = "Bioinformatics",
issn = "1367-4803",
publisher = "Oxford University Press",

}

TY - JOUR

T1 - Annotation of bacterial genomes using improved phylogenomic profiles

AU - Enault, F.

AU - Suhre, Karsten

AU - Abergel, C.

AU - Poirot, O.

AU - Claverie, J. M.

PY - 2003

Y1 - 2003

KW - Automated annotation

KW - Bacteria

KW - Evolution

KW - Functional prediction

KW - Phylogenomics

UR - http://www.scopus.com/inward/record.url?scp=3242888755&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=3242888755&partnerID=8YFLogxK

U2 - 10.1093/bioinformatics/btg1013

DO - 10.1093/bioinformatics/btg1013

M3 - Article

C2 - 12855445

AN - SCOPUS:3242888755

VL - 19

SP - i105-i107

JO - Bioinformatics

JF - Bioinformatics

SN - 1367-4803

ER -