SAFPred: synteny-aware gene function prediction for bacteria using protein embeddings.
Urhan, Aysun; Cosma, Bianca-Maria; Earl, Ashlee M; Manson, Abigail L; Abeel, Thomas.
Affiliation
  • Urhan A; Delft Bioinformatics Lab, Delft University of Technology Van Mourik, Delft XE 2628, The Netherlands.
  • Cosma BM; Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA 02142, United States.
  • Earl AM; Delft Bioinformatics Lab, Delft University of Technology Van Mourik, Delft XE 2628, The Netherlands.
  • Manson AL; Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA 02142, United States.
  • Abeel T; Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA 02142, United States.
Bioinformatics; 40(6); 2024 06 03.
Article in English | MEDLINE | ID: mdl-38775729
ABSTRACT
MOTIVATION:

Today, we know the function of only a small fraction of the protein sequences predicted from genomic data. This problem is even more salient for bacteria, which represent some of the most phylogenetically and metabolically diverse taxa on Earth. This low rate of bacterial gene annotation is compounded by the fact that most function prediction algorithms have focused on eukaryotes, and conventional annotation approaches rely on the presence of similar sequences in existing databases. However, such sequences often do not exist for novel bacterial proteins. Thus, we need improved gene function prediction methods tailored for bacteria. Recently, transformer-based language models, adopted from the natural language processing field, have been used to obtain new representations of proteins that replace amino acid sequences. These representations, referred to as protein embeddings, have shown promise for improving annotation of eukaryotes, but there have been only limited applications to bacterial genomes.

RESULTS:

To predict gene functions in bacteria, we developed SAFPred, a novel synteny-aware gene function prediction tool based on protein embeddings from state-of-the-art protein language models. SAFPred also leverages the unique operon structure of bacteria through conserved synteny. SAFPred outperformed both conventional sequence-based annotation methods and state-of-the-art methods on multiple bacterial species, including for distant homolog detection, where the sequence similarity to the proteins in the training set was as low as 40%. Using SAFPred to identify gene functions across diverse enterococci, of which some species are major clinical threats, we identified 11 previously unrecognized putative novel toxins, with potential significance to human and animal health.

AVAILABILITY AND IMPLEMENTATION:

https://github.com/AbeelLab/safpred.
Full text: 1 | Collection: 01-international | Database: MEDLINE | Main subject: Bacterial Proteins / Algorithms / Genome, Bacterial | Language: En | Journal: Bioinformatics | Journal subject: MEDICAL INFORMATICS | Year: 2024 | Document type: Article | Country of affiliation: Netherlands | Country of publication: United Kingdom
