Contrastive learning on protein embeddings enlightens midnight zone.
Heinzinger, Michael; Littmann, Maria; Sillitoe, Ian; Bordin, Nicola; Orengo, Christine; Rost, Burkhard.
Affiliation
  • Heinzinger M; TUM (Technical University of Munich) Dept Informatics, Bioinformatics & Computational Biology - i12, Boltzmannstr. 3, 85748 Garching/Munich, Germany.
  • Littmann M; TUM (Technical University of Munich) Dept Informatics, Bioinformatics & Computational Biology - i12, Boltzmannstr. 3, 85748 Garching/Munich, Germany.
  • Sillitoe I; Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK.
  • Bordin N; Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK.
  • Orengo C; Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK.
  • Rost B; TUM (Technical University of Munich) Dept Informatics, Bioinformatics & Computational Biology - i12, Boltzmannstr. 3, 85748 Garching/Munich, Germany.
NAR Genom Bioinform ; 4(2): lqac043, 2022 Jun.
Article in En | MEDLINE | ID: mdl-35702380
ABSTRACT
Experimental structures are leveraged through multiple sequence alignments, or more generally through homology-based inference (HBI), facilitating the transfer of information from a protein with known annotation to a query without any annotation. A recent alternative expands the concept of HBI from sequence-distance lookup to embedding-based annotation transfer (EAT). These embeddings are derived from protein Language Models (pLMs). Here, we introduce using single protein representations from pLMs for contrastive learning. This learning procedure creates a new set of embeddings that optimizes constraints captured by hierarchical classifications of protein 3D structures defined by the CATH resource. The approach, dubbed ProtTucker, recognizes distant homologous relationships better than more traditional techniques such as threading or fold recognition. Thus, these embeddings have allowed sequence comparison to step into the 'midnight zone' of protein similarity, i.e. the region in which distantly related sequences have a seemingly random pairwise sequence similarity. The novelty of this work is in the particular combination of tools and sampling techniques that achieved performance comparable to or better than existing state-of-the-art sequence comparison methods. Additionally, since this method does not need to generate alignments, it is also orders of magnitude faster. The code is available at https://github.com/Rostlab/EAT.
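
The two ideas in the abstract, a contrastive objective over embeddings and annotation transfer by embedding lookup, can be illustrated with a minimal sketch. It is not taken from the Rostlab/EAT code. It assumes a triplet-style margin loss and Euclidean distance, neither of which the abstract specifies, and the function names, the toy 2-dimensional vectors and the labels `fold_A` and `fold_B` are hypothetical placeholders rather than real pLM embeddings or CATH identifiers.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Assumed contrastive objective (hypothetical choice, not from the source).

    The loss is zero once the anchor is closer to the positive (same class)
    than to the negative (different class) by at least `margin`.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def transfer_annotation(query, lookup):
    """Embedding-based annotation transfer as a nearest-neighbor lookup.

    `lookup` is a list of (label, embedding) pairs for annotated proteins.
    Returns the label of the closest entry and its distance to the query.
    """
    label, embedding = min(lookup, key=lambda item: euclidean(query, item[1]))
    return label, euclidean(query, embedding)


# Toy lookup set: placeholder labels and made-up 2-d "embeddings".
LOOKUP = [
    ("fold_A", [0.0, 0.0]),
    ("fold_B", [3.0, 4.0]),
]

if __name__ == "__main__":
    label, distance = transfer_annotation([2.9, 4.2], LOOKUP)
    print(label, round(distance, 3))
```

In such a setup the returned distance could serve as a rough reliability signal, since a query far from every annotated entry gives little basis for transfer. No alignment is computed at any point, which is consistent with the speed advantage the abstract reports.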

Full text: 1 Collections: 01-international Database: MEDLINE Language: En Journal: NAR Genom Bioinform Publication year: 2022 Document type: Article Country of affiliation: Germany
