Pesquisa | BVS CLAP/SMR-OPAS/OMS

An analysis of protein language model embeddings for fold prediction.

Villegas-Morcillo, Amelia; Gomez, Angel M; Sanchez, Victoria.

Brief Bioinform ; 23(3)2022 05 13.

Artigo em Inglês | MEDLINE | ID: mdl-35443054

RESUMO

The identification of the protein fold class is a challenging problem in structural biology. Recent computational methods for fold prediction leverage deep learning techniques to extract protein fold-representative embeddings mainly using evolutionary information in the form of multiple sequence alignment (MSA) as input source. In contrast, protein language models (LM) have reshaped the field thanks to their ability to learn efficient protein representations (protein-LM embeddings) from purely sequential information in a self-supervised manner. In this paper, we analyze a framework for protein fold prediction using pre-trained protein-LM embeddings as input to several fine-tuning neural network models, which are supervisedly trained with fold labels. In particular, we compare the performance of six protein-LM embeddings: the long short-term memory-based UniRep and SeqVec, and the transformer-based ESM-1b, ESM-MSA, ProtBERT and ProtT5; as well as three neural networks: Multi-Layer Perceptron, ResCNN-BGRU (RBG) and Light-Attention (LAT). We separately evaluated the pairwise fold recognition (PFR) and direct fold classification (DFC) tasks on well-known benchmark datasets. The results indicate that the combination of transformer-based embeddings, particularly those obtained at amino acid level, with the RBG and LAT fine-tuning models performs remarkably well in both tasks. To further increase prediction accuracy, we propose several ensemble strategies for PFR and DFC, which provide a significant performance boost over the current state-of-the-art results. All this suggests that moving from traditional protein representations to protein-LM embeddings is a very promising approach to protein fold-related tasks.

Assuntos

Idioma , Redes Neurais de Computação , Processamento de Linguagem Natural , Proteínas/química

ManyFold: an efficient and flexible library for training and validating protein folding models.

Villegas-Morcillo, Amelia; Robinson, Louis; Flajolet, Arthur; Barrett, Thomas D.

Bioinformatics ; 39(1)2023 01 01.

Artigo em Inglês | MEDLINE | ID: mdl-36495196

RESUMO

SUMMARY: ManyFold is a flexible library for protein structure prediction with deep learning that (i) supports models that use both multiple sequence alignments (MSAs) and protein language model (pLM) embedding as inputs, (ii) allows inference of existing models (AlphaFold and OpenFold), (iii) is fully trainable, allowing for both fine-tuning and the training of new models from scratch and (iv) is written in Jax to support efficient batched operation in distributed settings. A proof-of-concept pLM-based model, pLMFold, is trained from scratch to obtain reasonable results with reduced computational overheads in comparison to AlphaFold. AVAILABILITY AND IMPLEMENTATION: The source code for ManyFold, the validation dataset and a small sample of training data are available at https://github.com/instadeepai/manyfold. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Assuntos

Proteínas , Software , Proteínas/química , Dobramento de Proteína , Alinhamento de Sequência , Idioma

Unsupervised protein embeddings outperform hand-crafted sequence and structure features at predicting molecular function.

Villegas-Morcillo, Amelia; Makrodimitris, Stavros; van Ham, Roeland C H J; Gomez, Angel M; Sanchez, Victoria; Reinders, Marcel J T.

Bioinformatics ; 37(2): 162-170, 2021 04 19.

Artigo em Inglês | MEDLINE | ID: mdl-32797179

RESUMO

MOTIVATION: Protein function prediction is a difficult bioinformatics problem. Many recent methods use deep neural networks to learn complex sequence representations and predict function from these. Deep supervised models require a lot of labeled training data which are not available for this task. However, a very large amount of protein sequences without functional labels is available. RESULTS: We applied an existing deep sequence model that had been pretrained in an unsupervised setting on the supervised task of protein molecular function prediction. We found that this complex feature representation is effective for this task, outperforming hand-crafted features such as one-hot encoding of amino acids, k-mer counts, secondary structure and backbone angles. Also, it partly negates the need for complex prediction models, as a two-layer perceptron was enough to achieve competitive performance in the third Critical Assessment of Functional Annotation benchmark. We also show that combining this sequence representation with protein 3D structure information does not lead to performance improvement, hinting that 3D structure is also potentially learned during the unsupervised pretraining. AVAILABILITY AND IMPLEMENTATION: Implementations of all used models can be found at https://github.com/stamakro/GCN-for-Structure-and-Function. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Assuntos

Proteínas , Software , Sequência de Aminoácidos , Redes Neurais de Computação , Proteínas/genética

FoldHSphere: deep hyperspherical embeddings for protein fold recognition.

Villegas-Morcillo, Amelia; Sanchez, Victoria; Gomez, Angel M.

BMC Bioinformatics ; 22(1): 490, 2021 Oct 12.

Artigo em Inglês | MEDLINE | ID: mdl-34641786

RESUMO

BACKGROUND: Current state-of-the-art deep learning approaches for protein fold recognition learn protein embeddings that improve prediction performance at the fold level. However, there still exists aperformance gap at the fold level and the (relatively easier) family level, suggesting that it might be possible to learn an embedding space that better represents the protein folds. RESULTS: In this paper, we propose the FoldHSphere method to learn a better fold embedding space through a two-stage training procedure. We first obtain prototype vectors for each fold class that are maximally separated in hyperspherical space. We then train a neural network by minimizing the angular large margin cosine loss to learn protein embeddings clustered around the corresponding hyperspherical fold prototypes. Our network architectures, ResCNN-GRU and ResCNN-BGRU, process the input protein sequences by applying several residual-convolutional blocks followed by a gated recurrent unit-based recurrent layer. Evaluation results on the LINDAHL dataset indicate that the use of our hyperspherical embeddings effectively bridges the performance gap at the family and fold levels. Furthermore, our FoldHSpherePro ensemble method yields an accuracy of 81.3% at the fold level, outperforming all the state-of-the-art methods. CONCLUSIONS: Our methodology is efficient in learning discriminative and fold-representative embeddings for the protein domains. The proposed hyperspherical embeddings are effective at identifying the protein fold class by pairwise comparison, even when amino acid sequence similarities are low.

Assuntos

Algoritmos , Redes Neurais de Computação , Proteínas

Protein Fold Recognition From Sequences Using Convolutional and Recurrent Neural Networks.

Villegas-Morcillo, Amelia; Gomez, Angel M; Morales-Cordovilla, Juan A; Sanchez, Victoria.

IEEE/ACM Trans Comput Biol Bioinform ; 18(6): 2848-2854, 2021.

Artigo em Inglês | MEDLINE | ID: mdl-32750896

RESUMO

The identification of a protein fold type from its amino acid sequence provides important insights about the protein 3D structure. In this paper, we propose a deep learning architecture that can process protein residue-level features to address the protein fold recognition task. Our neural network model combines 1D-convolutional layers with gated recurrent unit (GRU) layers. The GRU cells, as recurrent layers, cope with the processing issues associated to the highly variable protein sequence lengths and so extract a fold-related embedding of fixed size for each protein domain. These embeddings are then used to perform the pairwise fold recognition task, which is based on transferring the fold type of the most similar template structure. We compare our model with several template-based and deep learning-based methods from the state-of-the-art. The evaluation results over the well-known LINDAHL and SCOP_TEST sets, along with a proposed LINDAHL test set updated to SCOP 1.75, show that our embeddings perform significantly better than these methods, specially at the fold level. Supplementary material, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TCBB.2020.3012732, source code and trained models are available at http://sigmat.ugr.es/~amelia/CNN-GRU-RF+/.

Assuntos

Biologia Computacional/métodos , Aprendizado Profundo , Dobramento de Proteína , Proteínas , Análise de Sequência de Proteína/métodos , Algoritmos , Proteínas/química , Proteínas/genética , Proteínas/metabolismo

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

ENVIAR RESULTADO:

SELEÇÃO DE REFERÊNCIAS

DETALHE DA PESQUISA