Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 20 de 71
Filtrar
1.
Int J Mol Sci ; 25(8)2024 Apr 20.
Artigo em Inglês | MEDLINE | ID: mdl-38674107

RESUMO

The fibroblast growth factor receptor 2 (FGFR2) gene is one of the most extensively studied genes with many known mutations implicated in several human disorders, including oncogenic ones. Most FGFR2 disease-associated gene mutations are missense mutations that result in constitutive activation of the FGFR2 protein and downstream molecular pathways. Many tertiary structures of the FGFR2 kinase domain are publicly available in the wildtype and mutated forms and in the inactive and activated state of the receptor. The current literature suggests a molecular brake inhibiting the ATP-binding A loop from adopting the activated state. Mutations relieve this brake, triggering allosteric changes between active and inactive states. However, the existing analysis relies on static structures and fails to account for the intrinsic structural dynamics. In this study, we utilize experimentally resolved structures of the FGFR2 tyrosine kinase domain and machine learning to capture the intrinsic structural dynamics, correlate it with functional regions and disease types, and enrich it with predicted structures of variants with currently no experimentally resolved structures. Our findings demonstrate the value of machine learning-enabled characterizations of structure dynamics in revealing the impact of mutations on (dys)function and disorder in FGFR2.


Assuntos
Receptor Tipo 2 de Fator de Crescimento de Fibroblastos , Receptor Tipo 2 de Fator de Crescimento de Fibroblastos/genética , Receptor Tipo 2 de Fator de Crescimento de Fibroblastos/química , Receptor Tipo 2 de Fator de Crescimento de Fibroblastos/metabolismo , Humanos , Mutação , Aprendizado de Máquina , Mutação de Sentido Incorreto , Modelos Moleculares , Conformação Proteica , Domínios Proteicos , Relação Estrutura-Atividade
2.
Artigo em Inglês | MEDLINE | ID: mdl-38621825

RESUMO

Over the years, many computational methods have been created for the analysis of the impact of single amino acid substitutions resulting from single-nucleotide variants in genome coding regions. Historically, all methods have been supervised and thus limited by the inadequate sizes of experimentally curated data sets and by the lack of a standardized definition of variant effect. The emergence of unsupervised, deep learning (DL)-based methods raised an important question: Can machines learn the language of life from the unannotated protein sequence data well enough to identify significant errors in the protein "sentences"? Our analysis suggests that some unsupervised methods perform as well or better than existing supervised methods. Unsupervised methods are also faster and can, thus, be useful in large-scale variant evaluations. For all other methods, however, their performance varies by both evaluation metrics and by the type of variant effect being predicted. We also note that the evaluation of method performance is still lacking on less-studied, nonhuman proteins where unsupervised methods hold the most promise.

3.
bioRxiv ; 2024 Feb 15.
Artigo em Inglês | MEDLINE | ID: mdl-38293094

RESUMO

Understanding the impact of genomic variants on transcription factor binding and gene regulation remains a key area of research, with implications for unraveling the complex mechanisms underlying various functional effects. Our study delves into the role of DNA's biophysical properties, including thermodynamic stability, shape, and flexibility in transcription factor (TF) binding. We developed a multi-modal deep learning model integrating these properties with DNA sequence data. Trained on ChIP-Seq (chromatin immunoprecipitation sequencing) data in vivo involving 690 TF-DNA binding events in human genome, our model significantly improves prediction performance in over 660 binding events, with up to 9.6% increase in AUROC metric compared to the baseline model when using no DNA biophysical properties explicitly. Further, we expanded our analysis to in vitro high-throughput Systematic Evolution of Ligands by Exponential enrichment (SELEX) and Protein Binding Microarray (PBM) datasets, comparing our model with established frameworks. The inclusion of DNA breathing features consistently improved TF binding predictions across different cell lines in these datasets. Notably, for complex ChIP-Seq datasets, integrating DNABERT2 with a cross-attention mechanism provided greater predictive capabilities and insights into the mechanisms of disease-related non-coding variants found in genome-wide association studies. This work highlights the importance of DNA biophysical characteristics in TF binding and the effectiveness of multi-modal deep learning models in gene regulation studies.

4.
Bioinformatics ; 39(11)2023 11 01.
Artigo em Inglês | MEDLINE | ID: mdl-37991847

RESUMO

MOTIVATION: The two strands of the DNA double helix locally and spontaneously separate and recombine in living cells due to the inherent thermal DNA motion. This dynamics results in transient openings in the double helix and is referred to as "DNA breathing" or "DNA bubbles." The propensity to form local transient openings is important in a wide range of biological processes, such as transcription, replication, and transcription factors binding. However, the modeling and computer simulation of these phenomena, have remained a challenge due to the complex interplay of numerous factors, such as, temperature, salt content, DNA sequence, hydrogen bonding, base stacking, and others. RESULTS: We present pyDNA-EPBD, a parallel software implementation of the Extended Peyrard-Bishop-Dauxois (EPBD) nonlinear DNA model that allows us to describe some features of DNA dynamics in detail. The pyDNA-EPBD generates genomic scale profiles of average base-pair openings, base flipping probability, DNA bubble probability, and calculations of the characteristically dynamic length indicating the number of base pairs statistically significantly affected by a single point mutation using the Markov Chain Monte Carlo algorithm. AVAILABILITY AND IMPLEMENTATION: pyDNA-EPBD is supported across most operating systems and is freely available at https://github.com/lanl/pyDNA_EPBD. Extensive documentation can be found at https://lanl.github.io/pyDNA_EPBD/.


Assuntos
DNA , Modelos Químicos , Simulação por Computador , DNA/química , Software , Pareamento de Bases , Conformação de Ácido Nucleico
5.
bioRxiv ; 2023 Sep 12.
Artigo em Inglês | MEDLINE | ID: mdl-37745370

RESUMO

Motivation: The two strands of the DNA double helix locally and spontaneously separate and recombine in living cells due to the inherent thermal DNA motion.This dynamics results in transient openings in the double helix and is referred to as "DNA breathing" or "DNA bubbles." The propensity to form local transient openings is important in a wide range of biological processes, such as transcription, replication, and transcription factors binding. However, the modeling and computer simulation of these phenomena, have remained a challenge due to the complex interplay of numerous factors, such as, temperature, salt content, DNA sequence, hydrogen bonding, base stacking, and others. Results: We present pyDNA-EPBD, a parallel software implementation of the Extended Peyrard-Bishop- Dauxois (EPBD) nonlinear DNA model that allows us to describe some features of DNA dynamics in detail. The pyDNA-EPBD generates genomic scale profiles of average base-pair openings, base flipping probability, DNA bubble probability, and calculations of the characteristically dynamic length indicating the number of base pairs statistically significantly affected by a single point mutation using the Markov Chain Monte Carlo (MCMC) algorithm.

6.
IEEE/ACM Trans Comput Biol Bioinform ; 20(5): 2759-2771, 2023.
Artigo em Inglês | MEDLINE | ID: mdl-34882562

RESUMO

We have long known that characterizing protein structures structure is key to understanding protein function. Computational approaches have largely addressed a narrow formulation of the problem, seeking to compute one native structure from an amino-acid sequence. Now AlphaFold2 is shown to be able to reveal a high-quality native structure for many proteins. However, researchers over the years have argued for broadening our view to account for the multiplicity of native structures. We now know that many protein molecules switch between different structures to regulate interactions with molecular partners in the cell. Elucidating such structures de novo is exceptionally difficult, as it requires exploration of possibly a very large structure space in search of competing, near-optimal structures. Here we report on a novel stochastic optimization method capable of revealing very different structures for a given protein from knowledge of its amino-acid sequence. The method leverages evolutionary search techniques and adapts its exploration of the search space to balance between exploration and exploitation in the presence of a computational budget. In addition to demonstrating the utility of this method for identifying multiple native structures, we additionally provide a benchmark dataset for researchers to continue work on this problem.

7.
Biomolecules ; 12(11)2022 11 18.
Artigo em Inglês | MEDLINE | ID: mdl-36421723

RESUMO

Protein Language Models (PLMs) are shown to be capable of learning sequence representations useful for various prediction tasks, from subcellular localization, evolutionary relationships, family membership, and more. They have yet to be demonstrated useful for protein function prediction. In particular, the problem of automatic annotation of proteins under the Gene Ontology (GO) framework remains open. This paper makes two key contributions. It debuts a novel method that leverages the transformer architecture in two ways. A sequence transformer encodes protein sequences in a task-agnostic feature space. A graph transformer learns a representation of GO terms while respecting their hierarchical relationships. The learned sequence and GO terms representations are combined and utilized for multi-label classification, with the labels corresponding to GO terms. The method is shown superior over recent representative GO prediction methods. The second major contribution in this paper is a deep investigation of different ways of constructing training and testing datasets. The paper shows that existing approaches under- or over-estimate the generalization power of a model. A novel approach is proposed to address these issues, resulting in a new benchmark dataset to rigorously evaluate and compare methods and advance the state-of-the-art.


Assuntos
Proteínas , Ontologia Genética , Proteínas/genética , Proteínas/metabolismo , Sequência de Aminoácidos
8.
Biomolecules ; 12(7)2022 06 29.
Artigo em Inglês | MEDLINE | ID: mdl-35883464

RESUMO

With the debut of AlphaFold2, we now can get a highly-accurate view of a reasonable equilibrium tertiary structure of a protein molecule. Yet, a single-structure view is insufficient and does not account for the high structural plasticity of protein molecules. Obtaining a multi-structure view of a protein molecule continues to be an outstanding challenge in computational structural biology. In tandem with methods formulated under the umbrella of stochastic optimization, we are now seeing rapid advances in the capabilities of methods based on deep learning. In recent work, we advance the capability of these models to learn from experimentally-available tertiary structures of protein molecules of varying lengths. In this work, we elucidate the important role of the composition of the training dataset on the neural network's ability to learn key local and distal patterns in tertiary structures. To make such patterns visible to the network, we utilize a contact map-based representation of protein tertiary structure. We show interesting relationships between data size, quality, and composition on the ability of latent variable models to learn key patterns of tertiary structure. In addition, we present a disentangled latent variable model which improves upon the state-of-the-art variable autoencoder-based model in key, physically-realistic structural patterns. We believe this work opens up further avenues of research on deep learning-based models for computing multi-structure views of protein molecules.


Assuntos
Biologia Computacional , Proteínas , Modelos Teóricos , Proteínas/química
9.
Biomolecules ; 12(7)2022 07 21.
Artigo em Inglês | MEDLINE | ID: mdl-35883567

RESUMO

Over the past decade, Markov State Models (MSM) have emerged as powerful methodologies to build discrete models of dynamics over structures obtained from Molecular Dynamics trajectories. The identification of macrostates for the MSM is a central decision that impacts the quality of the MSM but depends on both the selected representation of a structure and the clustering algorithm utilized over the featurized structures. Motivated by a large molecular system in its free and bound state, this paper investigates two directions of research, further reducing the representation dimensionality in a non-parametric, data-driven manner and including more structures in the computation. Rigorous evaluation of the quality of obtained MSMs via various statistical tests in a comparative setting firmly shows that fewer dimensions and more structures result in a better MSM. Many interesting findings emerge from the best MSM, advancing our understanding of the relationship between antibody dynamics and antibody-antigen recognition.


Assuntos
Algoritmos , Simulação de Dinâmica Molecular , Análise por Conglomerados , Cadeias de Markov
10.
Bioinformatics ; 38(12): 3200-3208, 2022 06 13.
Artigo em Inglês | MEDLINE | ID: mdl-35511125

RESUMO

MOTIVATION: Expanding our knowledge of small molecules beyond what is known in nature or designed in wet laboratories promises to significantly advance cheminformatics, drug discovery, biotechnology and material science. In silico molecular design remains challenging, primarily due to the complexity of the chemical space and the non-trivial relationship between chemical structures and biological properties. Deep generative models that learn directly from data are intriguing, but they have yet to demonstrate interpretability in the learned representation, so we can learn more about the relationship between the chemical and biological space. In this article, we advance research on disentangled representation learning for small molecule generation. We build on recent work by us and others on deep graph generative frameworks, which capture atomic interactions via a graph-based representation of a small molecule. The methodological novelty is how we leverage the concept of disentanglement in the graph variational autoencoder framework both to generate biologically relevant small molecules and to enhance model interpretability. RESULTS: Extensive qualitative and quantitative experimental evaluation in comparison with state-of-the-art models demonstrate the superiority of our disentanglement framework. We believe this work is an important step to address key challenges in small molecule generation with deep generative frameworks. AVAILABILITY AND IMPLEMENTATION: Training and generated data are made available at https://ieee-dataport.org/documents/dataset-disentangled-representation-learning-interpretable-molecule-generation. All code is made available at https://anonymous.4open.science/r/D-MolVAE-2799/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Aprendizado Profundo , Descoberta de Drogas
11.
IEEE/ACM Trans Comput Biol Bioinform ; 19(3): 1670-1682, 2022.
Artigo em Inglês | MEDLINE | ID: mdl-33400654

RESUMO

A central challenge in protein modeling research and protein structure prediction in particular is known as decoy selection. The problem refers to selecting biologically-active/native tertiary structures among a multitude of physically-realistic structures generated by template-free protein structure prediction methods. Research on decoy selection is active. Clustering-based methods are popular, but they fail to identify good/near-native decoys on datasets where near-native decoys are severely under-sampled by a protein structure prediction method. Reasonable progress is reported by methods that additionally take into account the internal energy of a structure and employ it to identify basins in the energy landscape organizing the multitude of decoys. These methods, however, incur significant time costs for extracting basins from the landscape. In this paper, we propose a novel decoy selection method based on non-negative matrix factorization. We demonstrate that our method outperforms energy landscape-based methods. In particular, the proposed method addresses both the time cost issue and the challenge of identifying good decoys in a sparse dataset, successfully recognizing near-native decoys for both easy and hard protein targets.


Assuntos
Algoritmos , Proteínas , Análise por Conglomerados , Conformação Proteica , Dobramento de Proteína , Proteínas/química , Proteínas/genética
12.
Biomolecules ; 11(10)2021 09 29.
Artigo em Inglês | MEDLINE | ID: mdl-34680060

RESUMO

Many biological and biotechnological processes are controlled by protein-protein and protein-solvent interactions. In order to understand, predict, and optimize such processes, it is important to understand how solvents affect protein structure during protein-solvent interactions. In this study, all-atom molecular dynamics are used to investigate the structural dynamics and energetic properties of a C-terminal domain of the Rift Valley Fever Virus L protein solvated in glycerol and aqueous glycerol solutions in different concentrations by molecular weight. The Generalized Amber Force Field is modified by including restrained electrostatic potential atomic charges for the glycerol molecules. The peptide is considered in detail by monitoring properties like the root-mean-squared deviation, root-mean-squared fluctuation, radius of gyration, hydrodynamic radius, end-to-end distance, solvent-accessible surface area, intra-potential energy, and solvent-peptide interaction energies for hundreds of nanoseconds. Secondary structure analysis is also performed to examine the extent of conformational drift for the individual helices and sheets. We predict that the peptide helices and sheets are maintained only when the modeling strategy considers the solvent with lower glycerol concentration. We also find that the solvent-peptide becomes more cohesive with decreasing glycerol concentrations. The density and radial distribution function of glycerol solvent calculated when modeled with the modified atomic charges show a very good agreement with experimental results and other simulations at 298.15K.


Assuntos
Glicerol/química , Vírus da Febre do Vale do Rift/ultraestrutura , Proteínas Virais/ultraestrutura , Água/química , Humanos , Ligação de Hidrogênio , Simulação de Dinâmica Molecular , Peptídeos/química , Domínios Proteicos/genética , Estrutura Secundária de Proteína , Vírus da Febre do Vale do Rift/química , Vírus da Febre do Vale do Rift/genética , Solventes/química , Proteínas Virais/química , Proteínas Virais/genética
14.
Molecules ; 26(5)2021 Feb 24.
Artigo em Inglês | MEDLINE | ID: mdl-33668217

RESUMO

Protein molecules are inherently dynamic and modulate their interactions with different molecular partners by accessing different tertiary structures under physiological conditions. Elucidating such structures remains challenging. Current momentum in deep learning and the powerful performance of generative adversarial networks (GANs) in complex domains, such as computer vision, inspires us to investigate GANs on their ability to generate physically-realistic protein tertiary structures. The analysis presented here shows that several GAN models fail to capture complex, distal structural patterns present in protein tertiary structures. The study additionally reveals that mechanisms touted as effective in stabilizing the training of a GAN model are not all effective, and that performance based on loss alone may be orthogonal to performance based on the quality of generated datasets. A novel contribution in this study is the demonstration that Wasserstein GAN strikes a good balance and manages to capture both local and distal patterns, thus presenting a first step towards more powerful deep generative models for exploring a possibly very diverse set of structures supporting diverse activities of a protein molecule in the cell.


Assuntos
Redes Neurais de Computação , Proteínas/química , Estrutura Terciária de Proteína
15.
J Bioinform Comput Biol ; 19(1): 2140002, 2021 02.
Artigo em Inglês | MEDLINE | ID: mdl-33568002

RESUMO

Many regions of the protein universe remain inaccessible by wet-laboratory or computational structure determination methods. A significant challenge in elucidating these dark regions in silico relates to the ability to discriminate relevant structure(s) among many structures/decoys computed for a protein of interest, a problem known as decoy selection. Clustering decoys based on geometric similarity remains popular. However, it is unclear how exactly to exploit the groups of decoys revealed via clustering to select individual structures for prediction. In this paper, we provide an intuitive formulation of the decoy selection problem as an instance of unsupervised multi-instance learning. We address the problem in three stages, first organizing given decoys of a protein molecule into bags, then identifying relevant bags, and finally drawing individual instances from these bags to offer as prediction. We propose both non-parametric and parametric algorithms for drawing individual instances. Our evaluation utilizes two datasets, one benchmark dataset of ensembles of decoys for a varied list of protein molecules, and a dataset of decoy ensembles for targets drawn from recent CASP competitions. A comparative analysis with state-of-the-art methods reveals that the proposed approach outperforms existing methods, thus warranting further investigation of multi-instance learning to advance our treatment of decoy selection.


Assuntos
Algoritmos , Biologia Computacional/métodos , Proteínas/química , Análise por Conglomerados , Simulação por Computador , Bases de Dados de Proteínas , Aprendizado de Máquina , Estrutura Terciária de Proteína
16.
Curr Opin Struct Biol ; 67: 170-177, 2021 04.
Artigo em Inglês | MEDLINE | ID: mdl-33338762

RESUMO

Much scientific enquiry across disciplines is founded upon a mechanistic treatment of dynamic systems that ties form to function. A highly visible instance of this is in molecular biology, where characterizing macromolecular structure and dynamics is central to a detailed, molecular-level understanding of biological processes in the living cell. The current computational paradigm utilizes optimization as the generative process for modeling both structure and structural dynamics. Computational biology researchers are now attempting to wield generative models employing deep neural networks as an alternative computational paradigm. In this review, we summarize such efforts. We highlight progress and shortcomings. More importantly, we expose challenges that macromolecular structure poses to deep generative models and take this opportunity to introduce the structural biology community to several recent advances in the deep learning community that promise a way forward.


Assuntos
Biologia Computacional , Aprendizado Profundo , Estrutura Molecular , Biologia Molecular , Redes Neurais de Computação
17.
Bioinform Adv ; 1(1): vbab036, 2021.
Artigo em Inglês | MEDLINE | ID: mdl-36700110

RESUMO

Motivation: Modeling the structural plasticity of protein molecules remains challenging. Most research has focused on obtaining one biologically active structure. This includes the recent AlphaFold2 that has been hailed as a breakthrough for protein modeling. Computing one structure does not suffice to understand how proteins modulate their interactions and even evade our immune system. Revealing the structure space available to a protein remains challenging. Data-driven approaches that learn to generate tertiary structures are increasingly garnering attention. These approaches exploit the ability to represent tertiary structures as contact or distance maps and make direct analogies with images to harness convolution-based generative adversarial frameworks from computer vision. Since such opportunistic analogies do not allow capturing highly structured data, current deep models struggle to generate physically realistic tertiary structures. Results: We present novel deep generative models that build upon the graph variational autoencoder framework. In contrast to existing literature, we represent tertiary structures as 'contact' graphs, which allow us to leverage graph-generative deep learning. Our models are able to capture rich, local and distal constraints and additionally compute disentangled latent representations that reveal the impact of individual latent factors. This elucidates what the factors control and makes our models more interpretable. Rigorous comparative evaluation along various metrics shows that the models, we propose advance the state-of-the-art. While there is still much ground to cover, the work presented here is an important first step, and graph-generative frameworks promise to get us to our goal of unraveling the exquisite structural complexity of protein molecules. Availability and implementation: Code is available at https://github.com/anonymous1025/CO-VAE. Supplementary information: Supplementary data are available at Bioinformatics Advances online.

18.
BMC Bioinformatics ; 21(Suppl 1): 189, 2020 Dec 09.
Artigo em Inglês | MEDLINE | ID: mdl-33297949

RESUMO

BACKGROUND: Identifying one or more biologically-active/native decoys from millions of non-native decoys is one of the major challenges in computational structural biology. The extreme lack of balance in positive and negative samples (native and non-native decoys) in a decoy set makes the problem even more complicated. Consensus methods show varied success in handling the challenge of decoy selection despite some issues associated with clustering large decoy sets and decoy sets that do not show much structural similarity. Recent investigations into energy landscape-based decoy selection approaches show promises. However, lack of generalization over varied test cases remains a bottleneck for these methods. RESULTS: We propose a novel decoy selection method, ML-Select, a machine learning framework that exploits the energy landscape associated with the structure space probed through a template-free decoy generation. The proposed method outperforms both clustering and energy ranking-based methods, all the while consistently offering better performance on varied test-cases. Moreover, ML-Select shows promising results even for the decoy sets consisting of mostly low-quality decoys. CONCLUSIONS: ML-Select is a useful method for decoy selection. This work suggests further research in finding more effective ways to adopt machine learning frameworks in achieving robust performance for decoy selection in template-free protein structure prediction.


Assuntos
Biologia Computacional/métodos , Proteínas/química , Análise por Conglomerados , Aprendizado de Máquina , Conformação Proteica , Dobramento de Proteína , Termodinâmica
19.
Molecules ; 25(9)2020 May 09.
Artigo em Inglês | MEDLINE | ID: mdl-32397410

RESUMO

Controlling the quality of tertiary structures computed for a protein molecule remains a central challenge in de-novo protein structure prediction. The rule of thumb is to generate as many structures as can be afforded, effectively acknowledging that having more structures increases the likelihood that some will reside near the sought biologically-active structure. A major drawback with this approach is that computing a large number of structures imposes time and space costs. In this paper, we propose a novel clustering-based approach which we demonstrate to significantly reduce an ensemble of generated structures without sacrificing quality. Evaluations are related on both benchmark and CASP target proteins. Structure ensembles subjected to the proposed approach and the source code of the proposed approach are publicly-available at the links provided in Section 1.


Assuntos
Biologia Computacional/métodos , Proteínas/química , Análise por Conglomerados , Modelos Moleculares , Dobramento de Proteína , Estrutura Terciária de Proteína , Software
20.
IEEE Trans Nanobioscience ; 19(3): 562-570, 2020 07.
Artigo em Inglês | MEDLINE | ID: mdl-32340957

RESUMO

The three-dimensional structures populated by a protein molecule determine to a great extent its biological activities. The rich information encoded by protein structure on protein function continues to motivate the development of computational approaches for determining functionally-relevant structures. The majority of structures generated in silico are not relevant. Discriminating relevant/native protein structures from non-native ones is an outstanding challenge in computational structural biology. Inherently, this is a recognition problem that can be addressed under the umbrella of machine learning. In this paper, based on the premise that near-native structures are effectively anomalies, we build on the concept of anomaly detection in machine learning. We propose methods that automatically select relevant subsets, as well as methods that select a single structure to offer as prediction. Evaluations are carried out on benchmark datasets and demonstrate that the proposed methods advance the state of the art. The presented results motivate further building on and adapting concepts and techniques from machine learning to improve recognition of near-native structures in protein structure prediction.


Assuntos
Biologia Computacional/métodos , Aprendizado de Máquina , Conformação Proteica , Proteínas/química , Algoritmos , Bases de Dados de Proteínas , Proteínas/fisiologia
SELEÇÃO DE REFERÊNCIAS
DETALHE DA PESQUISA