Tree visualizations of protein sequence embedding space enable improved functional clustering of diverse protein superfamilies.
Yeung, Wayland; Zhou, Zhongliang; Mathew, Liju; Gravel, Nathan; Taujale, Rahil; O'Boyle, Brady; Salcedo, Mariah; Venkat, Aarya; Lanzilotta, William; Li, Sheng; Kannan, Natarajan.
Affiliation
  • Yeung W; Institute of Bioinformatics, University of Georgia, 30602, Georgia, USA.
  • Zhou Z; School of Computing, University of Georgia, 30602, Georgia, USA.
  • Mathew L; Department of Microbiology, University of Georgia, 30602, Georgia, USA.
  • Gravel N; Institute of Bioinformatics, University of Georgia, 30602, Georgia, USA.
  • Taujale R; Institute of Bioinformatics, University of Georgia, 30602, Georgia, USA.
  • O'Boyle B; Department of Biochemistry and Molecular Biology, University of Georgia, 30602, Georgia, USA.
  • Salcedo M; Department of Biochemistry and Molecular Biology, University of Georgia, 30602, Georgia, USA.
  • Venkat A; Department of Biochemistry and Molecular Biology, University of Georgia, 30602, Georgia, USA.
  • Lanzilotta W; Department of Biochemistry and Molecular Biology, University of Georgia, 30602, Georgia, USA.
  • Li S; School of Data Science, University of Virginia, 22903, Virginia, USA.
  • Kannan N; Institute of Bioinformatics, University of Georgia, 30602, Georgia, USA.
Brief Bioinform; 24(1), 2023 Jan 19.
Article in English | MEDLINE | ID: mdl-36642409
ABSTRACT
Protein language models, trained on millions of biologically observed sequences, generate feature-rich numerical representations of protein sequences. These representations, called sequence embeddings, can be used to infer structural and functional properties, even though protein language models are trained on primary sequence alone. While sequence embeddings have been applied to tasks such as structure and function prediction, applications to alignment-free sequence classification have been hindered by the lack of studies that derive, quantify and evaluate relationships between protein sequence embeddings. Here, we develop workflows and visualization methods for the classification of protein families using sequence embeddings derived from protein language models. A benchmark of manifold visualization methods reveals that Neighbor Joining (NJ) embedding trees are highly effective in capturing global structure while achieving similar performance in capturing local structure compared with popular dimensionality reduction techniques such as t-SNE and UMAP. The statistical significance of hierarchical clusters on a tree is evaluated by resampling embeddings using a variational autoencoder (VAE). We demonstrate the application of our methods in the classification of two well-studied enzyme superfamilies, phosphatases and protein kinases. Our embedding-based classifications remain consistent with, and extend, previously published sequence alignment-based classifications. We also propose a new hierarchical classification for the S-Adenosyl-L-Methionine (SAM) enzyme superfamily, which has been difficult to classify using traditional alignment-based approaches. Beyond applications in sequence classification, our results further suggest that NJ trees are a promising general method for visualizing high-dimensional data sets.
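To make the idea of an NJ embedding tree concrete, the following is a minimal sketch of textbook neighbor joining applied to pairwise distances between embedding vectors. It is an illustration only, not the authors' workflow: the Euclidean metric, the function names `pairwise_euclidean` and `neighbor_joining`, the `node0`, `node1`, ... labels for internal nodes, and the toy two-dimensional vectors standing in for real protein language model embeddings are all assumptions made for this example.

```python
import math


def pairwise_euclidean(vectors):
    """Full symmetric matrix of Euclidean distances between vectors."""
    n = len(vectors)
    return [[math.dist(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]


def neighbor_joining(dist, names):
    """Textbook neighbor joining.

    Returns a list of (node, neighbor, branch_length) edges. Internal nodes
    are labeled node0, node1, ... (hypothetical labels for this sketch).
    """
    nodes = list(names)
    # Distance lookup keyed by node label, so new internal nodes can be added.
    d = {a: {b: dist[i][j] for j, b in enumerate(names)} for i, a in enumerate(names)}
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        total = {a: sum(d[a][b] for b in nodes) for a in nodes}
        # Pick the pair minimizing Q(a, b) = (n - 2) d(a, b) - total(a) - total(b).
        best = None
        for i in range(n):
            for j in range(i + 1, n):
                a, b = nodes[i], nodes[j]
                q = (n - 2) * d[a][b] - total[a] - total[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        u = f"node{next_id}"
        next_id += 1
        # Branch lengths from the joined pair to the new internal node.
        len_a = 0.5 * d[a][b] + (total[a] - total[b]) / (2 * (n - 2))
        len_b = d[a][b] - len_a
        edges.append((u, a, len_a))
        edges.append((u, b, len_b))
        # Distances from the new node to every remaining node.
        d[u] = {u: 0.0}
        for c in nodes:
            if c not in (a, b):
                d_uc = 0.5 * (d[a][c] + d[b][c] - d[a][b])
                d[u][c] = d_uc
                d[c][u] = d_uc
        nodes = [c for c in nodes if c not in (a, b)] + [u]
    a, b = nodes
    edges.append((a, b, d[a][b]))  # final edge connecting the last two nodes
    return edges


# Toy vectors standing in for sequence embeddings (hypothetical data).
toy_names = ["famA_1", "famA_2", "famB_1", "famB_2"]
toy_embeddings = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]

toy_edges = neighbor_joining(pairwise_euclidean(toy_embeddings), toy_names)
for parent, child, length in toy_edges:
    print(f"{parent} -- {child}: {length:.3f}")
```

In this toy case the two nearby vectors of each hypothetical family end up as sibling leaves, separated from the other family by a long internal branch. That long branch is the kind of global structure a tree can display directly. Real embeddings have hundreds or thousands of dimensions, and the choice of distance metric would need to be validated rather than assumed.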

Full text: 1 Database: MEDLINE Main subject: Proteins / Amino Acid Sequence Language: En Publication year: 2023 Document type: Article