Search | BVS Regional Portal

ManyFold: an efficient and flexible library for training and validating protein folding models.

Villegas-Morcillo, Amelia; Robinson, Louis; Flajolet, Arthur; Barrett, Thomas D.

Bioinformatics ; 39(1), 2023 Jan 01.

Article in English | MEDLINE | ID: mdl-36495196

ABSTRACT

SUMMARY: ManyFold is a flexible library for protein structure prediction with deep learning that (i) supports models that use both multiple sequence alignments (MSAs) and protein language model (pLM) embeddings as inputs, (ii) allows inference of existing models (AlphaFold and OpenFold), (iii) is fully trainable, allowing for both fine-tuning and the training of new models from scratch, and (iv) is written in JAX to support efficient batched operation in distributed settings. A proof-of-concept pLM-based model, pLMFold, is trained from scratch to obtain reasonable results with reduced computational overhead in comparison to AlphaFold. AVAILABILITY AND IMPLEMENTATION: The source code for ManyFold, the validation dataset and a small sample of training data are available at https://github.com/instadeepai/manyfold. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Proteins; Software; Proteins/chemistry; Protein Folding; Sequence Alignment; Language

An analysis of protein language model embeddings for fold prediction.

Villegas-Morcillo, Amelia; Gomez, Angel M; Sanchez, Victoria.

Brief Bioinform ; 23(3), 2022 May 13.

Article in English | MEDLINE | ID: mdl-35443054

ABSTRACT

The identification of the protein fold class is a challenging problem in structural biology. Recent computational methods for fold prediction leverage deep learning techniques to extract protein fold-representative embeddings, mainly using evolutionary information in the form of multiple sequence alignments (MSAs) as the input source. In contrast, protein language models (LM) have reshaped the field thanks to their ability to learn efficient protein representations (protein-LM embeddings) from purely sequential information in a self-supervised manner. In this paper, we analyze a framework for protein fold prediction using pre-trained protein-LM embeddings as input to several fine-tuning neural network models, which are trained in a supervised manner with fold labels. In particular, we compare the performance of six protein-LM embeddings: the long short-term memory-based UniRep and SeqVec, and the transformer-based ESM-1b, ESM-MSA, ProtBERT and ProtT5; as well as three neural networks: Multi-Layer Perceptron, ResCNN-BGRU (RBG) and Light-Attention (LAT). We separately evaluated the pairwise fold recognition (PFR) and direct fold classification (DFC) tasks on well-known benchmark datasets. The results indicate that the combination of transformer-based embeddings, particularly those obtained at amino acid level, with the RBG and LAT fine-tuning models performs remarkably well in both tasks. To further increase prediction accuracy, we propose several ensemble strategies for PFR and DFC, which provide a significant performance boost over the current state-of-the-art results. All this suggests that moving from traditional protein representations to protein-LM embeddings is a very promising approach to protein fold-related tasks.

Subject(s)

Language; Neural Networks, Computer; Natural Language Processing; Proteins/chemistry

FoldHSphere: deep hyperspherical embeddings for protein fold recognition.

Villegas-Morcillo, Amelia; Sanchez, Victoria; Gomez, Angel M.

BMC Bioinformatics ; 22(1): 490, 2021 Oct 12.

Article in English | MEDLINE | ID: mdl-34641786

ABSTRACT

BACKGROUND: Current state-of-the-art deep learning approaches for protein fold recognition learn protein embeddings that improve prediction performance at the fold level. However, there still exists a performance gap between the fold level and the (relatively easier) family level, suggesting that it might be possible to learn an embedding space that better represents the protein folds. RESULTS: In this paper, we propose the FoldHSphere method to learn a better fold embedding space through a two-stage training procedure. We first obtain prototype vectors for each fold class that are maximally separated in hyperspherical space. We then train a neural network by minimizing the angular large margin cosine loss to learn protein embeddings clustered around the corresponding hyperspherical fold prototypes. Our network architectures, ResCNN-GRU and ResCNN-BGRU, process the input protein sequences by applying several residual-convolutional blocks followed by a gated recurrent unit-based recurrent layer. Evaluation results on the LINDAHL dataset indicate that the use of our hyperspherical embeddings effectively bridges the performance gap between the family and fold levels. Furthermore, our FoldHSpherePro ensemble method yields an accuracy of 81.3% at the fold level, outperforming all the state-of-the-art methods. CONCLUSIONS: Our methodology is efficient in learning discriminative and fold-representative embeddings for the protein domains. The proposed hyperspherical embeddings are effective at identifying the protein fold class by pairwise comparison, even when amino acid sequence similarities are low.
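The second training stage described above can be illustrated with a minimal sketch of a large margin cosine loss computed against fixed class prototypes. This is not the authors' implementation: the function names, the `scale` and `margin` values, and the toy orthogonal prototypes are all assumptions chosen for illustration, and the sketch does not cover how maximally separated prototypes are obtained.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so it lies on the hypersphere."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    u, v = normalize(u), normalize(v)
    return sum(a * b for a, b in zip(u, v))


def lmcl_loss(embedding, prototypes, label, scale=30.0, margin=0.35):
    """Large margin cosine loss for one embedding (illustrative sketch).

    The cosine to the true-class prototype is reduced by `margin` before
    the softmax, which pushes embeddings closer to their own prototype.
    `scale` and `margin` are assumed values, not taken from the paper.
    """
    logits = []
    for j, prototype in enumerate(prototypes):
        cos = cosine(embedding, prototype)
        if j == label:
            cos -= margin  # penalize only the target class
        logits.append(scale * cos)
    # Numerically stable log-sum-exp for the softmax normalizer.
    top = max(logits)
    log_norm = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_norm - logits[label]


# Toy prototypes: two orthogonal unit vectors (hypothetical, for illustration).
prototypes = [[1.0, 0.0], [0.0, 1.0]]
loss_aligned = lmcl_loss([1.0, 0.0], prototypes, label=0)
loss_misaligned = lmcl_loss([1.0, 0.0], prototypes, label=1)
```

An embedding aligned with its own prototype yields a loss near zero, while one aligned with a different prototype yields a large loss, which is the behavior that clusters embeddings around their fold prototypes.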

Subject(s)

Algorithms; Neural Networks, Computer; Proteins

Protein Fold Recognition From Sequences Using Convolutional and Recurrent Neural Networks.

Villegas-Morcillo, Amelia; Gomez, Angel M; Morales-Cordovilla, Juan A; Sanchez, Victoria.

IEEE/ACM Trans Comput Biol Bioinform ; 18(6): 2848-2854, 2021.

Article in English | MEDLINE | ID: mdl-32750896

ABSTRACT

The identification of a protein fold type from its amino acid sequence provides important insights into the protein 3D structure. In this paper, we propose a deep learning architecture that can process protein residue-level features to address the protein fold recognition task. Our neural network model combines 1D-convolutional layers with gated recurrent unit (GRU) layers. The GRU cells, as recurrent layers, cope with the processing issues associated with the highly variable protein sequence lengths and so extract a fold-related embedding of fixed size for each protein domain. These embeddings are then used to perform the pairwise fold recognition task, which is based on transferring the fold type of the most similar template structure. We compare our model with several state-of-the-art template-based and deep learning-based methods. The evaluation results over the well-known LINDAHL and SCOP_TEST sets, along with a proposed LINDAHL test set updated to SCOP 1.75, show that our embeddings perform significantly better than these methods, especially at the fold level. Supplementary material can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TCBB.2020.3012732; source code and trained models are available at http://sigmat.ugr.es/~amelia/CNN-GRU-RF+/.
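The pairwise fold recognition step described above, transferring the fold type of the most similar template, can be sketched as a nearest-neighbor lookup over fixed-size embeddings. Cosine similarity is assumed here as the similarity measure, which the abstract does not specify; the function names, fold labels and embedding values are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def predict_fold(query_embedding, templates):
    """Return the fold label and score of the most similar template.

    `templates` is a list of (fold_label, embedding) pairs. The query
    inherits the fold type of its best-scoring template.
    """
    best_label, best_score = None, -math.inf
    for fold_label, embedding in templates:
        score = cosine_similarity(query_embedding, embedding)
        if score > best_score:
            best_label, best_score = fold_label, score
    return best_label, best_score


# Hypothetical template embeddings; real ones would come from the network.
templates = [
    ("fold_A", [1.0, 0.0, 0.0]),
    ("fold_B", [0.0, 1.0, 0.0]),
    ("fold_A", [0.9, 0.1, 0.0]),
]
label, score = predict_fold([0.8, 0.2, 0.1], templates)
```

Because every domain is mapped to an embedding of the same size, sequences of very different lengths can be compared directly with a single similarity score.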

Subject(s)

Computational Biology/methods; Deep Learning; Protein Folding; Proteins; Sequence Analysis, Protein/methods; Algorithms; Proteins/chemistry; Proteins/genetics; Proteins/metabolism

Unsupervised protein embeddings outperform hand-crafted sequence and structure features at predicting molecular function.

Villegas-Morcillo, Amelia; Makrodimitris, Stavros; van Ham, Roeland C H J; Gomez, Angel M; Sanchez, Victoria; Reinders, Marcel J T.

Bioinformatics ; 37(2): 162-170, 2021 Apr 19.

Article in English | MEDLINE | ID: mdl-32797179

ABSTRACT

MOTIVATION: Protein function prediction is a difficult bioinformatics problem. Many recent methods use deep neural networks to learn complex sequence representations and predict function from these. Deep supervised models require large amounts of labeled training data, which are not available for this task. However, a very large number of protein sequences without functional labels are available. RESULTS: We applied an existing deep sequence model that had been pretrained in an unsupervised setting on the supervised task of protein molecular function prediction. We found that this complex feature representation is effective for this task, outperforming hand-crafted features such as one-hot encoding of amino acids, k-mer counts, secondary structure and backbone angles. Also, it partly negates the need for complex prediction models, as a two-layer perceptron was enough to achieve competitive performance in the third Critical Assessment of Functional Annotation benchmark. We also show that combining this sequence representation with protein 3D structure information does not improve performance, hinting that 3D structure is also potentially learned during the unsupervised pretraining. AVAILABILITY AND IMPLEMENTATION: Implementations of all used models can be found at https://github.com/stamakro/GCN-for-Structure-and-Function. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
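Two of the hand-crafted baselines named above, one-hot encoding of amino acids and k-mer counts, can be sketched in a few lines. This is an illustrative sketch rather than the code from the linked repository; the function names, the 20-letter alphabet ordering and the default `k=3` are assumptions.

```python
from collections import Counter

# The 20 standard amino acids in alphabetical one-letter order (assumed ordering).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(sequence):
    """Encode each residue as a 20-dimensional indicator vector.

    Residues outside the standard alphabet are left as all-zero rows.
    """
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    rows = []
    for residue in sequence:
        row = [0] * len(AMINO_ACIDS)
        if residue in index:
            row[index[residue]] = 1
        rows.append(row)
    return rows


def kmer_counts(sequence, k=3):
    """Count every overlapping substring of length k in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


encoded = one_hot("ACX")
counts = kmer_counts("ACAC", k=2)
```

The one-hot matrix grows with sequence length, whereas k-mer counts give a fixed vocabulary regardless of length; neither captures the context that a pretrained sequence model can learn.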

Subject(s)

Proteins; Software; Amino Acid Sequence; Neural Networks, Computer; Proteins/genetics
