Results 1 - 20 of 54
1.
Nature ; 606(7913): 389-395, 2022 06.
Article in English | MEDLINE | ID: mdl-35589842

ABSTRACT

Cancer immunoediting [1] is a hallmark of cancer [2] that predicts that lymphocytes kill more immunogenic cancer cells to cause less immunogenic clones to dominate a population. Although proven in mice [1,3], whether immunoediting occurs naturally in human cancers remains unclear. Here, to address this, we investigate how 70 human pancreatic cancers evolved over 10 years. We find that, despite having more time to accumulate mutations, rare long-term survivors of pancreatic cancer who have stronger T cell activity in primary tumours develop genetically less heterogeneous recurrent tumours with fewer immunogenic mutations (neoantigens). To quantify whether immunoediting underlies these observations, we infer that a neoantigen is immunogenic (high-quality) by two features: 'non-selfness', based on neoantigen similarity to known antigens [4,5], and 'selfness', based on the antigenic distance required for a neoantigen to differentially bind to the MHC or activate a T cell compared with its wild-type peptide. Using these features, we estimate cancer clone fitness as the aggregate cost of T cells recognizing high-quality neoantigens offset by gains from oncogenic mutations. With this model, we predict the clonal evolution of tumours to reveal that long-term survivors of pancreatic cancer develop recurrent tumours with fewer high-quality neoantigens. Thus, we submit evidence that the human immune system naturally edits neoantigens. Furthermore, we present a model to predict how immune pressure induces cancer cell populations to evolve over time. More broadly, our results argue that the immune system fundamentally surveils host genetic changes to suppress cancer.


Subject(s)
Antigens, Neoplasm , Cancer Survivors , Pancreatic Neoplasms , Antigens, Neoplasm/genetics , Antigens, Neoplasm/immunology , Humans , Pancreatic Neoplasms/genetics , Pancreatic Neoplasms/immunology , Pancreatic Neoplasms/pathology , T-Lymphocytes/immunology , Tumor Escape/immunology
2.
Proc Natl Acad Sci U S A ; 121(26): e2312335121, 2024 Jun 25.
Article in English | MEDLINE | ID: mdl-38889151

ABSTRACT

Predicting the effects of one or more mutations on the in vivo or in vitro properties of a wild-type protein is a major computational challenge, due to the presence of epistasis, that is, of interactions between amino acids in the sequence. We introduce a computationally efficient procedure to build minimal epistatic models to predict mutational effects by combining evolutionary (homologous sequence) data with a small amount of mutational-scan data. Mutagenesis measurements guide the selection of links in a sparse graphical model, while the parameters on the nodes and the edges are inferred from sequence data. We show, on 10 mutational scans, that our pipeline achieves performance comparable to state-of-the-art deep networks trained on much more data, while requiring far fewer parameters and hence being more interpretable. In particular, the identified interactions adapt to the wild-type protein and to the fitness or biochemical property experimentally measured, mostly focus on key functional sites, and are not necessarily related to structural contacts. Therefore, our method is able to extract information relevant for one mutational experiment from homologous sequence data reflecting the multitude of structural and functional constraints acting on proteins throughout evolution.


Subject(s)
Mutation , Proteins , Proteins/genetics , Proteins/metabolism , Proteins/chemistry , Epistasis, Genetic , Evolution, Molecular , Computational Biology/methods
3.
PLoS Comput Biol ; 19(10): e1011521, 2023 Oct.
Article in English | MEDLINE | ID: mdl-37883593

ABSTRACT

Predicting the effects of mutations on protein function is an important issue in evolutionary biology and biomedical applications. Computational approaches, ranging from graphical models to deep-learning architectures, can capture the statistical properties of sequence data and predict the outcome of high-throughput mutagenesis experiments probing the fitness landscape around some wild-type protein. However, how the complexity of the models and the characteristics of the data combine to determine the predictive performance remains unclear. Here, based on a theoretical analysis of the prediction error, we propose descriptors of the sequence data, characterizing their quantity and relevance relative to the model. Our theoretical framework identifies a trade-off between these two quantities, and determines the optimal subset of data for the prediction task, showing that simple models can outperform complex ones when inferred from adequately-selected sequences. We also show how repeated subsampling of the sequence data is informative about how much epistasis in the fitness landscape is not captured by the computational model. Our approach is illustrated on several protein families, as well as on in silico solvable protein models.


Subject(s)
Biological Evolution , Proteins , Proteins/genetics , Mutagenesis , Mutation , Computer Simulation , Genetic Fitness/genetics , Models, Genetic
4.
PLoS Comput Biol ; 19(11): e1011621, 2023 Nov.
Article in English | MEDLINE | ID: mdl-37976326

ABSTRACT

We present here an approach to protein design that combines (i) scarce functional information such as experimental data, (ii) evolutionary information learned from natural sequence variants, and (iii) physics-grounded modeling. Using a Restricted Boltzmann Machine (RBM), we learn a sequence model of a protein family. We use semi-supervision to leverage available functional information during the RBM training. We then propose a strategy to explore the protein representation space that can be informed by external models such as an empirical force-field method (FoldX). Our approach is applied to a domain of the Cas9 protein responsible for recognition of a short DNA motif. We experimentally assess the functionality of 71 variants generated to explore a range of RBM and FoldX energies. Sequences with as many as 50 differences (20% of the protein domain) to the wild-type retained functionality. Overall, 21/71 sequences designed with our method were functional. Interestingly, 6/71 sequences showed an improved activity in comparison with the original wild-type protein sequence. These results demonstrate the interest in further exploring the synergies between machine-learning of protein sequence representations and physics-grounded modeling strategies informed by structural information.


Subject(s)
CRISPR-Cas Systems , Proteins , Proteins/genetics , Proteins/chemistry , Amino Acid Sequence , Machine Learning , Learning
5.
Phys Rev Lett ; 130(15): 158402, 2023 Apr 14.
Article in English | MEDLINE | ID: mdl-37115874

ABSTRACT

Identifying and characterizing mutational paths is an important issue in evolutionary biology, with potential applications to bioengineering. We here propose an algorithm to sample mutational paths, which we benchmark on exactly solvable models of proteins in silico, and apply to data-driven models of natural proteins learned from sequence data with restricted Boltzmann machines. We then use mean-field theory to characterize paths for different mutational dynamics of interest, and to extend Kimura's estimate of evolutionary distances to sequence-based epistatic models of selection.


Subject(s)
Biological Evolution , Proteins , Mutation , Proteins/genetics , Algorithms
6.
PLoS Comput Biol ; 18(9): e1010561, 2022 09.
Article in English | MEDLINE | ID: mdl-36174101

ABSTRACT

Selection protocols such as SELEX, where molecules are selected over multiple rounds for their ability to bind to a target of interest, are popular methods for obtaining binders for diagnostic and therapeutic purposes. We show that Restricted Boltzmann Machines (RBMs), an unsupervised two-layer neural network architecture, can successfully be trained on sequence ensembles from single rounds of SELEX experiments for thrombin aptamers. RBMs assign scores to sequences that can be directly related to their fitnesses estimated through experimental enrichment ratios. Hence, RBMs trained from sequence data at a given round can be used to predict the effects of selection at later rounds. Moreover, the parameters of the trained RBMs are interpretable and identify functional features contributing most to sequence fitness. To exploit the generative capabilities of RBMs, we introduce two different training protocols: one taking into account sequence counts, capable of identifying the few best binders, and another based on unique sequences only, generating more diverse binders. We then use RBM models to generate novel aptamers with putative disruptive mutations or good binding properties, and validate the generated sequences with gel shift assay experiments. Finally, we compare the RBM's performance with different supervised learning approaches that include random forests and several deep neural network architectures.


Subject(s)
Neural Networks, Computer , Thrombin , Machine Learning
7.
Mol Biol Evol ; 38(6): 2428-2445, 2021 05 19.
Article in English | MEDLINE | ID: mdl-33555346

ABSTRACT

COVID-19 can lead to acute respiratory syndrome, which can be due to dysregulated immune signaling. We analyze the distribution of CpG dinucleotides, a pathogen-associated molecular pattern, in the SARS-CoV-2 genome. We characterize CpG content by a CpG force that accounts for statistical constraints acting on the genome at the nucleotide and amino acid levels. The CpG force, as the CpG content, is overall low compared with other pathogenic betacoronaviruses; however, it widely fluctuates along the genome, with a particularly low value, comparable with the circulating seasonal HKU1, in the spike coding region and a greater value, comparable with SARS and MERS, in the highly expressed nucleocapsid coding region (N ORF), whose transcripts are relatively abundant in the cytoplasm of infected cells and present in the 3'UTRs of all subgenomic RNA. This dual nature of CpG content could confer on SARS-CoV-2 the ability to avoid triggering pattern recognition receptors upon entry, while eliciting a stronger response during replication. We then investigate the evolution of synonymous mutations since the outbreak of the COVID-19 pandemic, finding a signature of CpG loss in regions with a greater CpG force. Sequence motifs preceding the CpG-loss-associated loci in the N ORF match recently identified binding patterns of the zinc finger antiviral protein. Using a model of the viral gene evolution under human host pressure, we find that synonymous mutations seem driven in the SARS-CoV-2 genome, and particularly in the N ORF, by the viral codon bias, the transition-transversion bias, and the pressure to lower CpG content.


Subject(s)
COVID-19/genetics , CpG Islands , Evolution, Molecular , Genome, Viral , RNA, Viral/genetics , SARS-CoV-2 , Humans , SARS-CoV-2/genetics , SARS-CoV-2/pathogenicity
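
As a rough illustration of what "CpG content" means in the abstract of entry 7, the sketch below computes the classic observed/expected CpG ratio of a nucleotide sequence. This is a much simpler statistic than the CpG force of the paper, which also accounts for constraints at the amino acid level; the function name `cpg_obs_exp` and the input strings are assumptions made for illustration only.

```python
def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio: f(CG) / (f(C) * f(G)).

    Illustrative only: this is NOT the CpG force defined in the paper.
    """
    s = seq.upper().replace("U", "T")  # accept RNA or DNA alphabets
    n = len(s)
    if n < 2:
        raise ValueError("sequence must contain at least two nucleotides")
    c = s.count("C")
    g = s.count("G")
    if c == 0 or g == 0:
        return 0.0  # no CpG is possible without both C and G
    # Count CG dinucleotides over all n - 1 overlapping windows.
    cg = sum(1 for i in range(n - 1) if s[i:i + 2] == "CG")
    return (cg / (n - 1)) / ((c / n) * (g / n))
```

A ratio well below 1 indicates CpG depletion relative to what the single-nucleotide composition alone would predict.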
8.
Bioinformatics ; 37(22): 4083-4090, 2021 11 18.
Article in English | MEDLINE | ID: mdl-34117879

ABSTRACT

MOTIVATION: Modeling of protein family sequence distribution from homologous sequence data recently received considerable attention, in particular for structure and function predictions, as well as for protein design. In particular, direct coupling analysis, a method to infer effective pairwise interactions between residues, was shown to capture important structural constraints and to successfully generate functional protein sequences. Building on this and other graphical models, we introduce a new framework to assess the quality of the secondary structures of the generated sequences with respect to reference structures for the family. RESULTS: We introduce two scoring functions characterizing the likelihood that the secondary structure of a protein sequence matches a reference structure, called Dot Product and Pattern Matching. We test these scores on published experimental protein mutagenesis and design datasets, and show improvement in the detection of nonfunctional sequences. We also show that use of these scores helps reject nonfunctional sequences generated by graphical models (Restricted Boltzmann Machines) learned from homologous sequence alignments. AVAILABILITY AND IMPLEMENTATION: Data and code available at https://github.com/CyrilMa/ssqa. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Proteins , Proteins/chemistry , Amino Acid Sequence , Sequence Alignment , Protein Structure, Secondary , Mutagenesis
9.
PLoS Comput Biol ; 17(9): e1009297, 2021 09.
Article in English | MEDLINE | ID: mdl-34473697

ABSTRACT

With the increasing ability to use high-throughput next-generation sequencing to quantify the diversity of the human T cell receptor (TCR) repertoire, the ability to use TCR sequences to infer antigen-specificity could greatly aid potential diagnostics and therapeutics. Here, we use a machine-learning approach known as a Restricted Boltzmann Machine to develop a sequence-based inference approach to identify antigen-specific TCRs. Our approach combines probabilistic models of TCR sequences with clone abundance information to extract TCR sequence motifs central to an antigen-specific response. We use this model to identify patient-personalized TCR motifs that respond to individual tumor and infectious disease antigens, and to accurately discriminate specific from non-specific responses. Furthermore, the hidden structure of the model results in an interpretable representation space where TCRs responding to the same antigen cluster, correctly discriminating the response of TCRs to different viral epitopes. The model can be used to identify condition-specific responding TCRs. We focus on the examples of TCRs reactive to candidate neoantigens and selected epitopes in experiments of stimulated TCR clone expansion.


Subject(s)
Computational Biology/methods , Models, Statistical , T-Lymphocytes/immunology , Cancer Survivors , Carcinoma, Pancreatic Ductal/immunology , Cluster Analysis , Datasets as Topic , Humans , Pancreatic Neoplasms/immunology , Receptors, Antigen, T-Cell/immunology
10.
PLoS Comput Biol ; 17(3): e1008751, 2021 03.
Article in English | MEDLINE | ID: mdl-33765014

ABSTRACT

The sequences of antibodies from a given repertoire are highly diverse at a few sites located on the surface of a genome-encoded larger scaffold. The scaffold is often considered to play a lesser role than highly diverse, non-genome-encoded sites in controlling binding affinity and specificity. To gauge the impact of the scaffold, we carried out quantitative phage display experiments where we compare the response to selection for binding to four different targets of three different antibody libraries based on distinct scaffolds but harboring the same diversity at randomized sites. We first show that the response to selection of an antibody library may be captured by two measurable parameters. Second, we provide evidence that one of these parameters is determined by the degree of affinity maturation of the scaffold, affinity maturation being the process by which antibodies accumulate somatic mutations to evolve towards higher affinities during the natural immune response. In all cases, we find that libraries of antibodies built around maturated scaffolds have a lower response to selection to other arbitrary targets than libraries built around germline-based scaffolds. We thus propose that germline-encoded scaffolds have a higher selective potential than maturated ones as a consequence of a selection for this potential over the long-term evolution of germline antibody genes. Our results are a first step towards quantifying the evolutionary potential of biomolecules.


Subject(s)
Antibodies/genetics , Gene Library , Computational Biology , DNA/genetics , Evolution, Molecular , Humans
11.
Neural Comput ; 33(4): 1063-1112, 2021 03 26.
Article in English | MEDLINE | ID: mdl-33513327

ABSTRACT

We study the learning dynamics and the representations emerging in recurrent neural networks (RNNs) trained to integrate one or multiple temporal signals. Combining analytical and numerical investigations, we characterize the conditions under which an RNN with n neurons learns to integrate D(≪n) scalar signals of arbitrary duration. We show, for linear, ReLU, and sigmoidal neurons, that the internal state lives close to a D-dimensional manifold, whose shape is related to the activation function. Each neuron therefore carries, to various degrees, information about the value of all integrals. We discuss the deep analogy between our results and the concept of mixed selectivity forged by computational neuroscientists to interpret cortical recordings.

12.
Phys Rev Lett ; 124(4): 048302, 2020 Jan 31.
Article in English | MEDLINE | ID: mdl-32058781

ABSTRACT

Recurrent neural networks (RNNs) are powerful tools to explain how attractors may emerge from noisy, high-dimensional dynamics. We study here how to learn the $\sim N^{2}$ pairwise interactions in an RNN with $N$ neurons to embed $L$ manifolds of dimension $D \ll N$. We show that the capacity, i.e., the maximal ratio $L/N$, decreases as $|\log \epsilon|^{-D}$, where $\epsilon$ is the error on the position encoded by the neural activity along each manifold. Hence, RNNs are flexible memory devices capable of storing a large number of manifolds at high spatial resolution. Our results rely on a combination of analytical tools from statistical mechanics and random matrix theory, extending Gardner's classical theory of learning to the case of patterns with strong spatial correlations.

13.
Neural Comput ; 31(8): 1671-1717, 2019 08.
Article in English | MEDLINE | ID: mdl-31260391

ABSTRACT

A restricted Boltzmann machine (RBM) is an unsupervised machine learning bipartite graphical model that jointly learns a probability distribution over data and extracts their relevant statistical features. RBMs were recently proposed for characterizing the patterns of coevolution between amino acids in protein sequences and for designing new sequences. Here, we study how the nature of the features learned by RBM changes with its defining parameters, such as the dimensionality of the representations (size of the hidden layer) and the sparsity of the features. We show that for adequate values of these parameters, RBMs operate in a so-called compositional phase in which visible configurations sampled from the RBM are obtained by recombining these features. We then compare the performance of RBM with other standard representation learning algorithms, including principal or independent component analysis (PCA, ICA), autoencoders (AE), variational autoencoders (VAE), and their sparse variants. We show that RBMs, due to the stochastic mapping between data configurations and representations, better capture the underlying interactions in the system and are significantly more robust with respect to sample size than deterministic methods such as PCA or ICA. In addition, this stochastic mapping is not prescribed a priori as in VAE, but learned from data, which allows RBMs to show good performance even with shallow architectures. All numerical results are illustrated on synthetic lattice protein data that share similar statistical features with real protein sequences and for which ground-truth interactions are known.


Subject(s)
Unsupervised Machine Learning , Amino Acid Sequence , Computer Simulation , Models, Molecular , Models, Statistical , Principal Component Analysis , Probability , Proteins/chemistry , Proteins/genetics , Sequence Alignment , Static Electricity , Stochastic Processes
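
The bipartite structure described in the abstract of entry 13 can be sketched with a toy restricted Boltzmann machine. The version below assumes binary (0/1) visible and hidden units, which is a simplification: models of protein sequences typically use categorical visible units and other hidden-unit potentials. All names (`energy`, `free_energy`, `free_energy_bruteforce`) and parameter values are hypothetical.

```python
import math
from itertools import product


def energy(v, h, a, b, W):
    """Joint energy E(v, h) = -a.v - b.h - sum_ij v_i W_ij h_j."""
    vis = sum(ai * vi for ai, vi in zip(a, v))
    hid = sum(bj * hj for bj, hj in zip(b, h))
    inter = sum(v[i] * W[i][j] * h[j]
                for i in range(len(v)) for j in range(len(h)))
    return -(vis + hid + inter)


def free_energy(v, a, b, W):
    """F(v) = -log sum_h exp(-E(v, h)), summed analytically over hidden units."""
    vis = sum(ai * vi for ai, vi in zip(a, v))
    total = 0.0
    for j, bj in enumerate(b):
        x = bj + sum(v[i] * W[i][j] for i in range(len(v)))
        # softplus(x) = log(1 + exp(x)), written to stay stable for large |x|
        total += max(x, 0.0) + math.log1p(math.exp(-abs(x)))
    return -vis - total


def free_energy_bruteforce(v, a, b, W):
    """Same quantity by explicit enumeration of all hidden states (small M only)."""
    z = sum(math.exp(-energy(v, h, a, b, W))
            for h in product((0, 1), repeat=len(b)))
    return -math.log(z)
```

Because no couplings exist within a layer, the sum over hidden configurations factorizes into one term per hidden unit, which is why `free_energy` needs only a single loop while the brute-force version grows exponentially with the number of hidden units.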
14.
Neural Comput ; 31(12): 2324-2347, 2019 12.
Article in English | MEDLINE | ID: mdl-31614108

ABSTRACT

The way grid cells represent space in the rodent brain has been a striking discovery, with theoretical implications still unclear. Unlike hippocampal place cells, which are known to encode multiple, environment-dependent spatial maps, grid cells have been widely believed to encode space through a single low-dimensional manifold, in which coactivity relations between different neurons are preserved when the environment is changed. Does it have to be so? Here, we compute, using two alternative mathematical models, the storage capacity of a population of grid-like units, embedded in a continuous attractor neural network, for multiple spatial maps. We show that distinct representations of multiple environments can coexist, as existing models for grid cells have the potential to express several sets of hexagonal grid patterns, challenging the view of a universal grid map. This suggests that a population of grid cells can encode multiple noncongruent metric relationships, a feature that could in principle allow a grid-like code to represent environments with a variety of different geometries and possibly conceptual and cognitive spaces, which may be expected to entail such context-dependent metric relationships.


Subject(s)
Entorhinal Cortex/physiology , Grid Cells/physiology , Nerve Net/physiology , Space Perception/physiology , Animals , Computer Simulation , Neural Networks, Computer
15.
PLoS Comput Biol ; 14(8): e1006320, 2018 08.
Article in English | MEDLINE | ID: mdl-30106966

ABSTRACT

The hippocampus is known to store cognitive representations, or maps, that encode both positional and contextual information, critical for episodic memories and functional behavior. How path integration and contextual cues are dynamically combined and processed by the hippocampus to keep these representations accurate over time remains unclear. To answer this question, we propose a two-way data analysis and modeling approach to CA3 multi-electrode recordings of a moving rat submitted to rapid changes of contextual (light) cues, triggering back-and-forth instabilities between two cognitive representations ("teleportation" experiment of Jezek et al.). We develop a dual neural activity decoder, capable of independently identifying the recalled cognitive map at high temporal resolution (comparable to theta cycle) and the position of the rodent given a map. Remarkably, position can be reconstructed at any time with an accuracy comparable to fixed-context periods, even during highly unstable periods. These findings provide evidence for the capability of the hippocampal neural activity to maintain an accurate encoding of spatial and contextual variables, while one of these variables undergoes rapid changes independently of the other. To explain this result we introduce an attractor neural network model for the hippocampal activity that processes inputs from external cues and the path integrator. Our model allows us to make predictions on the frequency of the cognitive map instability, its duration, and the detailed nature of the place-cell population activity, which are validated by a further analysis of the data. Our work therefore sheds light on the mechanisms by which the hippocampal network achieves and updates multi-dimensional neural representations from various input streams.


Subject(s)
CA3 Region, Hippocampal/physiology , Nerve Net/physiology , Spatial Behavior/physiology , Action Potentials/physiology , Animals , Cues , Hippocampus/physiology , Male , Memory, Episodic , Models, Neurological , Neural Networks, Computer , Neurons/physiology , Rats , Space Perception/physiology
16.
Rep Prog Phys ; 81(3): 032601, 2018 03.
Article in English | MEDLINE | ID: mdl-29120346

ABSTRACT

In the course of evolution, proteins undergo important changes in their amino acid sequences, while their three-dimensional folded structure and their biological function remain remarkably conserved. Thanks to modern sequencing techniques, sequence data accumulate at an unprecedented pace. This provides large sets of so-called homologous, i.e. evolutionarily related, protein sequences, to which methods of inverse statistical physics can be applied. Using sequence data as the basis for the inference of Boltzmann distributions from samples of microscopic configurations or observables, it is possible to extract information about evolutionary constraints and thus protein function and structure. Here we give an overview of some biologically important questions, and of how statistical-mechanics-inspired modeling approaches can help to answer them. Finally, we discuss some open questions, which we expect to be addressed over the next years.


Subject(s)
Physics/methods , Proteins/chemistry , Amino Acid Sequence , Molecular Sequence Annotation , Proteins/metabolism , Sequence Homology, Amino Acid
17.
Proc Natl Acad Sci U S A ; 112(49): 15154-9, 2015 Dec 08.
Article in English | MEDLINE | ID: mdl-26575629

ABSTRACT

Recent studies have demonstrated abundant transcription of a set of noncoding RNAs (ncRNAs) preferentially within tumors as opposed to normal tissue. Using an approach from statistical physics, we quantify global transcriptome-wide motif use for the first time, to our knowledge, in human and murine ncRNAs, determining that most have motif use consistent with the coding genome. However, an outlier subset of tumor-associated ncRNAs, typically of recent evolutionary origin, has motif use that is often indicative of pathogen-associated RNA. For instance, we show that the tumor-associated human repeat human satellite repeat II (HSATII) is enriched in motifs containing CpG dinucleotides in AU-rich contexts that most of the human genome and human adapted viruses have evolved to avoid. We demonstrate that a key subset of these ncRNAs functions as immunostimulatory "self-agonists" and directly activates cells of the mononuclear phagocytic system to produce proinflammatory cytokines. These ncRNAs arise from endogenous repetitive elements that are normally silenced, yet are often very highly expressed in cancers. We propose that the innate response in tumors may partially originate from direct interaction of immunogenic ncRNAs expressed in cancer cells with innate pattern recognition receptors, and thereby assign a previously unidentified danger-associated function to a set of dark matter repetitive elements. These findings potentially reconcile several observations concerning the role of ncRNA expression in cancers and their relationship to the tumor microenvironment.


Subject(s)
Neoplasms/genetics , RNA, Untranslated/immunology , Animals , Humans , Immunity, Innate , Mice , Neoplasms/immunology
18.
Phys Rev Lett ; 118(4): 048103, 2017 Jan 27.
Article in English | MEDLINE | ID: mdl-28186794

ABSTRACT

Organisms shape their own environment, which in turn affects their survival. This feedback becomes especially important for communities containing a large number of species; however, few existing approaches allow studying this regime, except in simulations. Here, we use methods of statistical physics to analytically solve a classic ecological model of resource competition introduced by MacArthur in 1969. We show that the nonintuitive phenomenology of highly diverse ecosystems includes a phase where the environment constructed by the community becomes fully decoupled from the outside world.


Subject(s)
Ecosystem , Models, Theoretical , Population Dynamics , Computer Simulation , Environment , Physics
19.
J Comput Neurosci ; 43(1): 17-33, 2017 Aug.
Article in English | MEDLINE | ID: mdl-28484899

ABSTRACT

The hippocampus stores spatial representations, or maps, which are recalled each time a subject is placed in the corresponding environment. Across different environments of similar geometry, these representations show strong orthogonality in CA3 of hippocampus, whereas in the CA1 subfield a considerable overlap between the maps can be seen. The lower orthogonality decreases the reliability of various decoders developed in an attempt to identify which of the stored maps is active at the moment. The decoding problem becomes especially acute when data must be analyzed at high temporal resolution. Here, we introduce a functional-connectivity-based decoder, which accounts for the pairwise correlations between the spiking activities of neurons in each map and does not require any positional information, i.e. any knowledge about place fields. We first show, on recordings of hippocampal activity in constant environmental conditions, that our decoder outperforms existing decoding methods in CA1. Our decoder is then applied to data from teleportation experiments, in which an instantaneous switch of the environment identity triggers a recall of the corresponding spatial representation. We test the sensitivity of our approach on the transition dynamics between the respective memory states (maps). We find that the rate of spontaneous state shifts (flickering) after a teleportation event is increased not only within the first few seconds as already reported, but this instability is sustained across much longer (> 1 min) periods.


Subject(s)
CA1 Region, Hippocampal/physiology , Memory , Models, Neurological , Neurons/physiology , Hippocampus , Humans , Reproducibility of Results
20.
PLoS Comput Biol ; 12(5): e1004889, 2016 05.
Article in English | MEDLINE | ID: mdl-27177270

ABSTRACT

Inverse statistical approaches to determine protein structure and function from Multiple Sequence Alignments (MSA) are emerging as powerful tools in computational biology. However, the underlying assumptions about the relationship between the inferred effective Potts Hamiltonian and real protein structure and energetics remain untested so far. Here we use lattice protein (LP) models to benchmark those inverse statistical approaches. We build MSA of highly stable sequences in target LP structures, and infer the effective pairwise Potts Hamiltonians from those MSA. We find that inferred Potts Hamiltonians reproduce many important aspects of 'true' LP structures and energetics. Careful analysis reveals that effective pairwise couplings in inferred Potts Hamiltonians depend not only on the energetics of the native structure but also on competing folds; in particular, the coupling values reflect both positive design (stabilization of native conformation) and negative design (destabilization of competing folds). In addition to providing detailed structural information, the inferred Potts models used as protein Hamiltonians for the design of new sequences are able to generate with high probability completely new sequences with the desired folds, which is not possible using independent-site models. Those are remarkable results as the effective LP Hamiltonians used to generate MSA are not simple pairwise models due to the competition between the folds. Our findings elucidate the reasons for the success of inverse approaches to the modelling of proteins from sequence data, and their limitations.


Subject(s)
Proteins/chemistry , Sequence Alignment/statistics & numerical data , Amino Acid Sequence , Benchmarking , Computational Biology , Computer Simulation , Models, Molecular , Models, Statistical , Protein Conformation , Protein Folding , Proteins/genetics
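
To make the pairwise Potts Hamiltonian mentioned in the abstract of entry 20 concrete, here is a minimal sketch of how such a model scores a sequence. It assumes the common convention H(s) = -sum_i h_i(s_i) - sum_{i<j} J_ij(s_i, s_j), with residues encoded as integers; the sign convention, the function name `potts_energy`, and the data layout (a list of field rows and a dict of coupling matrices) are illustrative assumptions, not taken from the paper.

```python
def potts_energy(seq, fields, couplings):
    """Energy of an integer-encoded sequence under a pairwise Potts model.

    seq       -- sequence of state indices, one per site
    fields    -- fields[i][a] is the local field h_i(a) at site i for state a
    couplings -- dict mapping a site pair (i, j), i < j, to a matrix J
                 with J[a][b] the coupling between state a at i and b at j
    Pairs absent from the dict are treated as uncoupled (J = 0).
    """
    e = 0.0
    for i, a in enumerate(seq):
        e -= fields[i][a]          # single-site contributions
    for (i, j), J in couplings.items():
        e -= J[seq[i]][seq[j]]     # pairwise (epistatic) contributions
    return e
```

Setting `couplings` to an empty dict recovers an independent-site model, the baseline that the abstract contrasts with the pairwise model.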