Improvements in viral gene annotation using large language models and soft alignments.
Harrigan, William L; Ferrell, Barbra D; Wommack, K Eric; Polson, Shawn W; Schreiber, Zachary D; Belcaid, Mahdi.
Affiliation
  • Harrigan WL; Hawai'i Institute of Marine Biology, University of Hawai'i at Manoa, Honolulu, HI, 96822, USA.
  • Ferrell BD; Department of Plant & Soil Sciences, University of Delaware, Newark, DE, 19713, USA.
  • Wommack KE; Department of Plant & Soil Sciences, University of Delaware, Newark, DE, 19713, USA.
  • Polson SW; Department of Computer and Information Sciences, University of Delaware, Newark, DE, 19713, USA.
  • Schreiber ZD; Department of Plant & Soil Sciences, University of Delaware, Newark, DE, 19713, USA.
  • Belcaid M; Department of Computer Science, University of Hawai'i at Manoa, Honolulu, HI, 96822, USA. mahdi@hawaii.edu.
BMC Bioinformatics ; 25(1): 165, 2024 Apr 25.
Article in English | MEDLINE | ID: mdl-38664627
ABSTRACT

BACKGROUND:

The annotation of protein sequences in public databases has long posed a challenge in molecular biology. This issue is particularly acute for viral proteins, which demonstrate limited homology to known proteins when using alignment, k-mer, or profile-based homology search approaches. A novel methodology employing Large Language Models (LLMs) addresses this methodological challenge by annotating protein sequences based on embeddings.

RESULTS:

Central to our contribution is the soft alignment algorithm, which draws on traditional protein alignment but uses embedding similarity at the amino acid level, bypassing the need for conventional scoring matrices. This method surpasses pooled embedding-based models not only in efficiency but also in interpretability, enabling users to trace homologous amino acids and inspect the alignments in detail. Far from being a black box, our approach provides transparent, BLAST-like alignment visualizations, combining traditional biological research with AI advancements to improve protein annotation through embedding-based analysis while preserving interpretability. Tests using the Virus Orthologous Groups and ViralZone protein databases indicated that the novel soft alignment approach recognized and annotated sequences that both blastp and pooling-based methods, which are commonly used for sequence annotation, failed to detect.
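
The abstract does not give the algorithm's details, so the following is only a minimal sketch of the general idea: a Smith-Waterman-style local alignment in which the substitution score for a residue pair is the cosine similarity of their embeddings rather than a scoring-matrix lookup. The function names, the `gap` penalty, and the `offset` subtracted from the similarity are all assumptions made for illustration, and the toy one-hot vectors stand in for per-residue embeddings that would in practice come from a protein language model.

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between rows of a (m x d) and b (n x d)."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T

def soft_local_align(emb_a, emb_b, gap=0.5, offset=0.5):
    """Local alignment scored by embedding similarity (hypothetical sketch).

    gap and offset are illustrative values, not parameters from the paper.
    Returns (best_score, list of aligned (i, j) residue index pairs).
    """
    # Shift similarities so that unrelated residues score below zero.
    sim = cosine_similarity_matrix(emb_a, emb_b) - offset
    m, n = sim.shape
    score = np.zeros((m + 1, n + 1))
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            score[i, j] = max(
                0.0,
                score[i - 1, j - 1] + sim[i - 1, j - 1],  # align residues
                score[i - 1, j] - gap,                    # gap in sequence b
                score[i, j - 1] - gap,                    # gap in sequence a
            )
    # Trace back from the highest-scoring cell until the score drops to zero.
    i, j = (int(k) for k in np.unravel_index(np.argmax(score), score.shape))
    best = float(score[i, j])
    pairs = []
    while i > 0 and j > 0 and score[i, j] > 0:
        if np.isclose(score[i, j], score[i - 1, j - 1] + sim[i - 1, j - 1]):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif np.isclose(score[i, j], score[i - 1, j] - gap):
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best, pairs

# Toy embeddings: each "residue" is a one-hot vector, so identical residues
# have similarity 1 and all others have similarity 0.
basis = np.eye(10)
emb_a = basis[[0, 1, 2, 3, 4, 5]]
emb_b = basis[[6, 7, 1, 2, 3, 4, 8]]
best, pairs = soft_local_align(emb_a, emb_b)
print(best, pairs)
```

Because the traceback yields explicit residue pairs, the result can be rendered as a BLAST-like alignment, which is the interpretability property the abstract contrasts with pooled embeddings, where each sequence is collapsed to a single vector before comparison.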

CONCLUSION:

The embedding-based approach demonstrates the potential of LLMs for enhancing protein sequence annotation, especially in viral genomics. These findings present a promising avenue for more efficient and accurate protein function inference in molecular biology.

Full text: 1 Collections: 01-international Database: MEDLINE Main subject: Algorithms / Sequence Alignment / Molecular Sequence Annotation Language: En Journal: BMC Bioinformatics Journal subject: MEDICAL INFORMATICS Year of publication: 2024 Document type: Article Country of affiliation: United States