### Abstract

Motivation: Phylogenomic profiling is a large-scale comparative genomic method that infers protein function from evolutionary information; it was first described in a binary form by Pellegrini et al. (1999). Here, we propose improvements to this approach, including the use of normalized Blastp bit scores, a normalization of the matrix of profiles to take into account the evolutionary distances between bacteria, the definition of a phylogenomic neighborhood based on continuous pairwise distances between genes, and an original annotation procedure that includes the computation of a p-value for each functional assignment.

Results: The method presented here increases the number of Ecocyc enzymes identified as being evolutionarily related by about 25% with respect to the original binary (absent/present) method. The fraction of 'false' positives is shown to be smaller than 20%. Based on their phylogenomic relationships, genes of unknown function can then be automatically related to annotated genes. Each predicted gene annotation is associated with a p-value, i.e. the probability of obtaining it by chance. The validity of this method was extensively tested on a large set of genes of known function using the MultiFun database. Of the 3122 function attributions that can be made at a p-value level of 10^{-11}, 50% correspond to the actual gene annotation. The method can be readily applied to any newly sequenced microbial genome. In contrast to earlier work on the same topic, our approach avoids the use of arbitrary cut-off values and provides a reliability estimate of the functional predictions in the form of p-values.
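As a rough illustration of the ingredients listed in the abstract, the Python sketch below builds a continuous profile from bit scores, measures the distance between two profiles, and attaches a p-value to a candidate function. The abstract gives no formulas, so everything specific here is an assumption made for illustration: dividing by a self-hit bit score, the Euclidean distance, the hypergeometric tail test, and all function names. The normalization for evolutionary distances between bacteria is not sketched.

```python
import math


def normalized_profile(bit_scores, self_score):
    """Continuous profile of one gene: one value per reference genome.

    Assumption: each best-hit bit score is divided by the gene's
    self-alignment bit score and clipped to the range [0, 1].
    """
    return [min(max(s / self_score, 0.0), 1.0) for s in bit_scores]


def binary_profile(profile, threshold=0.0):
    """Absent/present profile, for comparison with the binary form."""
    return [1 if value > threshold else 0 for value in profile]


def profile_distance(p, q):
    """Euclidean distance between two profiles (an assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def enrichment_p_value(k, n, K, N):
    """Hypergeometric tail P(X >= k).

    N: annotated genes in the genome, K: those carrying the function,
    n: genes in the phylogenomic neighborhood, k: neighbors carrying it.
    This is a stand-in test, not necessarily the one used in the paper.
    """
    total = math.comb(N, n)
    upper = min(n, K)
    hits = sum(math.comb(K, i) * math.comb(N - K, n - i)
               for i in range(k, upper + 1))
    return hits / total


if __name__ == "__main__":
    # Hypothetical bit scores of two genes against four genomes.
    gene_a = normalized_profile([120.0, 0.0, 95.0, 200.0], self_score=400.0)
    gene_b = normalized_profile([110.0, 0.0, 90.0, 180.0], self_score=380.0)
    print("continuous distance:", profile_distance(gene_a, gene_b))
    print("binary distance:",
          profile_distance(binary_profile(gene_a), binary_profile(gene_b)))
    print("p-value:", enrichment_p_value(k=4, n=5, K=20, N=1000))
```

In this toy setup the two genes have identical binary profiles, while the continuous profiles still differ slightly, which is the kind of extra resolution the continuous form is meant to provide.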

Original language | English |
---|---|
Pages (from-to) | i105-i107 |
Journal | Bioinformatics |
Volume | 19 |
DOIs | https://doi.org/10.1093/bioinformatics/btg1013 |
Publication status | Published - 2003 |
Externally published | Yes |

### Keywords

- Automated annotation
- Bacteria
- Evolution
- Functional prediction
- Phylogenomics

### ASJC Scopus subject areas

- Statistics and Probability
- Medicine (all)
- Biochemistry
- Molecular Biology
- Computer Science Applications
- Computational Theory and Mathematics
- Computational Mathematics

### Cite this

Enault, F., Suhre, K., Abergel, C., Poirot, O., & Claverie, J. M. (2003). Annotation of bacterial genomes using improved phylogenomic profiles. *Bioinformatics*, *19*, i105-i107. https://doi.org/10.1093/bioinformatics/btg1013

Research output: Contribution to journal › Article

TY - JOUR

T1 - Annotation of bacterial genomes using improved phylogenomic profiles

AU - Enault, F.

AU - Suhre, Karsten

AU - Abergel, C.

AU - Poirot, O.

AU - Claverie, J. M.

PY - 2003

Y1 - 2003

KW - Automated annotation

KW - Bacteria

KW - Evolution

KW - Functional prediction

KW - Phylogenomics

UR - http://www.scopus.com/inward/record.url?scp=3242888755&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=3242888755&partnerID=8YFLogxK

U2 - 10.1093/bioinformatics/btg1013

DO - 10.1093/bioinformatics/btg1013

M3 - Article

C2 - 12855445

AN - SCOPUS:3242888755

VL - 19

SP - i105-i107

JO - Bioinformatics

JF - Bioinformatics

SN - 1367-4803

ER -