Organizing the bacterial annotation space with amino acid sequence embeddings.

Grigson, Susanna R; McKerral, Jody C; Mitchell, James G; Edwards, Robert A

Affiliation

Grigson SR; Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Adelaide, South Australia, 5042, Australia. susie.grigson@flinders.edu.au.
McKerral JC; Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Adelaide, South Australia, 5042, Australia.
Mitchell JG; Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Adelaide, South Australia, 5042, Australia.
Edwards RA; Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Adelaide, South Australia, 5042, Australia.

BMC Bioinformatics; 23(1): 385, 2022 Sep 23.

Article in English | MEDLINE | ID: mdl-36151519

ABSTRACT

BACKGROUND: Because the gap between the number of proteins being discovered and the number that have been functionally characterized keeps widening, protein function inference remains a fundamental challenge in computational biology. Known protein annotations are currently organized in human-curated ontologies; however, these ontologies may not organize all possible protein functions accurately. Meanwhile, recent advances in natural language processing and machine learning have produced models that embed amino acid sequences as vectors in n-dimensional space. So far, these embeddings have primarily been used to classify protein sequences according to manually constructed protein classification schemes.

RESULTS: In this work, we describe the use of amino acid sequence embeddings as a systematic framework for studying protein ontologies. Using a sequence embedding, we show that the bacterial carbohydrate metabolism class within the SEED annotation system contains 48 clusters of embedded sequences, even though this class contains only 29 functional labels. Furthermore, by embedding Bacillus amino acid sequences with unknown functions, we show that these unknown sequences form clusters whose members are likely to have similar biological roles.

CONCLUSIONS: This study demonstrates that amino acid sequence embeddings may be a powerful tool for developing more robust ontologies for annotating protein sequence data. In addition, embeddings may be beneficial for clustering protein sequences with unknown functions and for selecting optimal candidate proteins to characterize experimentally.
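The comparison described in the RESULTS (number of embedding clusters versus number of functional labels within one class) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' method: the normalised k-mer frequency vector is a toy stand-in for a learned sequence embedding, single-linkage clustering with a cosine-distance threshold is one possible clustering choice, and the sequences, labels, and function names are hypothetical.

```python
from itertools import product
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_embedding(seq, k=2):
    """Toy stand-in for a learned embedding (assumption): unit-length k-mer count vector."""
    index = {"".join(p): i for i, p in enumerate(product(AMINO_ACIDS, repeat=k))}
    vec = [0.0] * len(index)
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:  # skip k-mers containing non-standard residues
            vec[j] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine_distance(a, b):
    """Cosine distance between two unit-length vectors."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def cluster(vectors, threshold=0.5):
    """Single-linkage clustering: merge any pair closer than `threshold` (union-find)."""
    parent = list(range(len(vectors)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine_distance(vectors[i], vectors[j]) < threshold:
                parent[find(i)] = find(j)

    roots = [find(i) for i in range(len(vectors))]
    relabel = {r: n for n, r in enumerate(dict.fromkeys(roots))}
    return [relabel[r] for r in roots]


# Hypothetical sequences and functional labels, invented for this example.
sequences = ["AAAAAAAA", "AAAAAAAC", "WWWWWWWW", "GGGGGGGG", "GGGGGGGH"]
labels = ["label_A", "label_A", "label_A", "label_B", "label_B"]

cluster_ids = cluster([kmer_embedding(s) for s in sequences])
n_clusters = len(set(cluster_ids))
n_labels = len(set(labels))
print(f"{n_clusters} clusters vs {n_labels} functional labels")
```

On this toy input the third sequence shares a label with the first two but falls into its own cluster, so the sketch finds three clusters for two labels. This is the kind of mismatch between clusters and curated labels that the abstract reports at much larger scale.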

Subjects

Computational Biology; Proteins; Amino Acid Sequence; Bacteria; Computational Biology/methods; Humans; Machine Learning; Molecular Sequence Annotation; Proteins/chemistry

Keywords

Bacteria; Function prediction; Machine learning; Protein ontology; Sequence embedding

Full text: 1 Database: MEDLINE Main subject: Proteins / Computational Biology Study type: Prognostic_studies Limit: Humans Language: En Journal: BMC Bioinformatics Journal subject: Medical Informatics Year of publication: 2022 Document type: Article Country of affiliation: Australia
