Scoring alignments by embedding vector similarity.
Ashrafzadeh, Sepehr; Golding, G Brian; Ilie, Silvana; Ilie, Lucian.
Affiliations
  • Ashrafzadeh S; Department of Computer Science, University of Western Ontario, London, N6A 5B7, Ontario, Canada.
  • Golding GB; Department of Biology, McMaster University, Hamilton, L8S 4K1, Ontario, Canada.
  • Ilie S; Department of Mathematics, Toronto Metropolitan University, Toronto, M5B 2K3, Ontario, Canada.
  • Ilie L; Department of Computer Science, University of Western Ontario, London, N6A 5B7, Ontario, Canada.
Brief Bioinform; 25(3), 2024 Mar 27.
Article in En | MEDLINE | ID: mdl-38695119
ABSTRACT
Sequence similarity is of paramount importance in biology, as similar sequences tend to have similar function and share common ancestry. Scoring matrices, such as PAM or BLOSUM, play a crucial role in all bioinformatics algorithms for identifying similarities, but have the drawback that they are fixed, independent of context. We propose a new scoring method for amino acid similarity that remedies this weakness by being context-dependent. It relies on recent advances in deep learning architectures that employ self-supervised learning in order to leverage the power of enormous amounts of unlabelled data to generate contextual embeddings, which are vector representations for words. These ideas have been applied to protein sequences, producing embedding vectors for protein residues. We propose the E-score between two residues as the cosine similarity between their embedding vector representations. Thorough testing on a wide variety of reference multiple sequence alignments indicates that the alignments produced using the new E-score method, especially ProtT5-score, are significantly better than those obtained using BLOSUM matrices. The new method proposes to change the way alignments are computed, with far-reaching implications in all areas of textual data that use sequence similarity. The program to compute alignments based on various E-scores is available as a web server at e-score.csd.uwo.ca. The source code is freely available for download from github.com/lucian-ilie/E-score.
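
The E-score defined in the abstract, the cosine similarity between two residue embedding vectors, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the two-dimensional toy vectors, and the linear gap penalty of -1.0 are assumptions made for illustration only. In practice the per-residue embeddings would come from a protein language model such as ProtT5, and the published program may use different gap handling.

```python
import math


def e_score(u, v):
    """Cosine similarity between two residue embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("embedding vectors must be non-zero")
    return dot / (norm_u * norm_v)


def e_score_matrix(emb_a, emb_b):
    """Score every residue of sequence A against every residue of sequence B.

    emb_a and emb_b are lists of per-residue embedding vectors. Because the
    embeddings are contextual, the same amino acid at two different positions
    can receive different scores, unlike a fixed BLOSUM lookup.
    """
    return [[e_score(u, v) for v in emb_b] for u in emb_a]


def global_alignment_score(scores, gap=-1.0):
    """Needleman-Wunsch score using a precomputed score matrix.

    The linear gap penalty is an assumption for this sketch, not a value
    taken from the paper.
    """
    n = len(scores)
    m = len(scores[0]) if n else 0
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + scores[i - 1][j - 1],  # align residues i and j
                dp[i - 1][j] + gap,                       # gap in sequence B
                dp[i][j - 1] + gap,                       # gap in sequence A
            )
    return dp[n][m]


# Hypothetical toy embeddings (real ones have hundreds of dimensions or more).
toy_a = [[1.0, 0.0], [0.0, 1.0]]
toy_b = [[1.0, 0.0], [0.0, 1.0]]
toy_scores = e_score_matrix(toy_a, toy_b)
toy_alignment = global_alignment_score(toy_scores)
```

The only change relative to a classical alignment is where the substitution score comes from: the dynamic programming recurrence is unchanged, but each cell reads a position-specific cosine similarity instead of a fixed matrix entry.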

Full text: 1 Database: MEDLINE Main subject: Algorithms / Sequence Alignment / Computational Biology Language: En Publication year: 2024 Document type: Article
