Results 1 - 20 of 49
1.
Nature ; 606(7913): 389-395, 2022 06.
Article in English | MEDLINE | ID: mdl-35589842

ABSTRACT

Cancer immunoediting1 is a hallmark of cancer2 that predicts that lymphocytes kill more immunogenic cancer cells to cause less immunogenic clones to dominate a population. Although proven in mice1,3, whether immunoediting occurs naturally in human cancers remains unclear. Here, to address this, we investigate how 70 human pancreatic cancers evolved over 10 years. We find that, despite having more time to accumulate mutations, rare long-term survivors of pancreatic cancer who have stronger T cell activity in primary tumours develop genetically less heterogeneous recurrent tumours with fewer immunogenic mutations (neoantigens). To quantify whether immunoediting underlies these observations, we infer that a neoantigen is immunogenic (high-quality) by two features: 'non-selfness', based on neoantigen similarity to known antigens4,5, and 'selfness', based on the antigenic distance required for a neoantigen to differentially bind to the MHC or activate a T cell compared with its wild-type peptide. Using these features, we estimate cancer clone fitness as the aggregate cost of T cells recognizing high-quality neoantigens offset by gains from oncogenic mutations. With this model, we predict the clonal evolution of tumours to reveal that long-term survivors of pancreatic cancer develop recurrent tumours with fewer high-quality neoantigens. Thus, we submit evidence that the human immune system naturally edits neoantigens. Furthermore, we present a model to predict how immune pressure induces cancer cell populations to evolve over time. More broadly, our results argue that the immune system fundamentally surveils host genetic changes to suppress cancer.


Subject(s)
Antigens, Neoplasm , Cancer Survivors , Pancreatic Neoplasms , Antigens, Neoplasm/genetics , Antigens, Neoplasm/immunology , Humans , Pancreatic Neoplasms/genetics , Pancreatic Neoplasms/immunology , Pancreatic Neoplasms/pathology , T-Lymphocytes/immunology , Tumor Escape/immunology
2.
Proc Natl Acad Sci U S A ; 121(26): e2312335121, 2024 Jun 25.
Article in English | MEDLINE | ID: mdl-38889151

ABSTRACT

Predicting the effects of one or more mutations on the in vivo or in vitro properties of a wild-type protein is a major computational challenge, due to the presence of epistasis, that is, of interactions between amino acids in the sequence. We introduce a computationally efficient procedure to build minimal epistatic models to predict mutational effects by combining evolutionary (homologous sequence) data with limited mutational-scan data. Mutagenesis measurements guide the selection of links in a sparse graphical model, while the parameters on the nodes and the edges are inferred from sequence data. We show, on 10 mutational scans, that our pipeline exhibits performance comparable to state-of-the-art deep networks trained on much more data, while requiring far fewer parameters and hence being more interpretable. In particular, the identified interactions adapt to the wild-type protein and to the fitness or biochemical property experimentally measured, mostly focus on key functional sites, and are not necessarily related to structural contacts. Therefore, our method is able to extract information relevant for one mutational experiment from homologous sequence data reflecting the multitude of structural and functional constraints acting on proteins throughout evolution.


Subject(s)
Mutation , Proteins , Proteins/genetics , Proteins/metabolism , Proteins/chemistry , Epistasis, Genetic , Evolution, Molecular , Computational Biology/methods
3.
RNA ; 28(3): 277-289, 2022 03.
Article in English | MEDLINE | ID: mdl-34937774

ABSTRACT

Coronavirus RNA-dependent RNA polymerases produce subgenomic RNAs (sgRNAs) that encode viral structural and accessory proteins. User-friendly bioinformatic tools to detect and quantify sgRNA production are urgently needed to study the growing volume of next-generation sequencing (NGS) data of SARS-CoV-2. We introduced sgDI-tector to identify and quantify sgRNA in SARS-CoV-2 NGS data. sgDI-tector allowed detection of sgRNA without initial knowledge of the transcription-regulatory sequences. We produced NGS data and successfully detected the nested set of sgRNAs with the ranking M > ORF3a > N > ORF6 > ORF7a > ORF8 > S > E > ORF7b. We also compared the level of sgRNA production with other types of viral RNA products such as defective interfering viral genomes.


Subject(s)
Computational Biology/methods , Genome, Viral , RNA, Viral/genetics , SARS-CoV-2/genetics , High-Throughput Nucleotide Sequencing , Open Reading Frames
4.
PLoS Comput Biol ; 19(10): e1011521, 2023 Oct.
Article in English | MEDLINE | ID: mdl-37883593

ABSTRACT

Predicting the effects of mutations on protein function is an important issue in evolutionary biology and biomedical applications. Computational approaches, ranging from graphical models to deep-learning architectures, can capture the statistical properties of sequence data and predict the outcome of high-throughput mutagenesis experiments probing the fitness landscape around some wild-type protein. However, how the complexity of the models and the characteristics of the data combine to determine the predictive performance remains unclear. Here, based on a theoretical analysis of the prediction error, we propose descriptors of the sequence data, characterizing their quantity and relevance relative to the model. Our theoretical framework identifies a trade-off between these two quantities, and determines the optimal subset of data for the prediction task, showing that simple models can outperform complex ones when inferred from adequately selected sequences. We also show how repeated subsampling of the sequence data is informative about how much epistasis in the fitness landscape is not captured by the computational model. Our approach is illustrated on several protein families, as well as on in silico solvable protein models.


Subject(s)
Biological Evolution , Proteins , Proteins/genetics , Mutagenesis , Mutation , Computer Simulation , Genetic Fitness/genetics , Models, Genetic
5.
PLoS Comput Biol ; 19(11): e1011621, 2023 Nov.
Article in English | MEDLINE | ID: mdl-37976326

ABSTRACT

We present here an approach to protein design that combines (i) scarce functional information such as experimental data, (ii) evolutionary information learned from natural sequence variants, and (iii) physics-grounded modeling. Using a Restricted Boltzmann Machine (RBM), we learn a sequence model of a protein family. We use semi-supervision to leverage available functional information during the RBM training. We then propose a strategy to explore the protein representation space that can be informed by external models such as an empirical force-field method (FoldX). Our approach is applied to a domain of the Cas9 protein responsible for recognition of a short DNA motif. We experimentally assess the functionality of 71 variants generated to explore a range of RBM and FoldX energies. Sequences with as many as 50 differences (20% of the protein domain) to the wild-type retained functionality. Overall, 21/71 sequences designed with our method were functional. Interestingly, 6/71 sequences showed an improved activity in comparison with the original wild-type protein sequence. These results demonstrate the interest in further exploring the synergies between machine learning of protein sequence representations and physics-grounded modeling strategies informed by structural information.


Subject(s)
CRISPR-Cas Systems , Proteins , Proteins/genetics , Proteins/chemistry , Amino Acid Sequence , Machine Learning , Learning
6.
Nucleic Acids Res ; 50(21): 12082-12093, 2022 11 28.
Article in English | MEDLINE | ID: mdl-36478056

ABSTRACT

The hybridization kinetics of an oligonucleotide to its template underlie many biological processes such as replication arrest, CRISPR recognition, DNA sequencing, DNA origami, etc. Although single kinetic descriptions exist for special cases of this problem, there are no simple general prediction schemes. In this work, we have measured experimentally, with no fluorescent labelling, the displacement of an oligonucleotide from its substrate in two situations: one corresponding to oligonucleotide binding/unbinding on ssDNA and one in which the oligonucleotide is displaced by the refolding of a dsDNA fork. In this second situation, the fork expels the oligonucleotide, thus significantly reducing its residence time. To account for our data in these two situations, we have constructed a mathematical model, based on the known nearest-neighbour dinucleotide free energies, and provided a good estimate of the residence times of different oligonucleotides (DNA, RNA, LNA) of various lengths in different experimental conditions (force, temperature, buffer conditions, presence of mismatches, etc.). This study provides a foundation for the dynamics of oligonucleotide displacement, a process of importance in numerous biological and bioengineering contexts.


Subject(s)
DNA , Oligonucleotides , DNA/genetics , Nucleic Acid Hybridization , DNA, Single-Stranded , Oligonucleotide Probes
7.
Phys Rev Lett ; 130(15): 158402, 2023 Apr 14.
Article in English | MEDLINE | ID: mdl-37115874

ABSTRACT

Identifying and characterizing mutational paths is an important issue in evolutionary biology, with potential applications to bioengineering. We here propose an algorithm to sample mutational paths, which we benchmark on exactly solvable models of proteins in silico, and apply to data-driven models of natural proteins learned from sequence data with restricted Boltzmann machines. We then use mean-field theory to characterize paths for different mutational dynamics of interest, and to extend Kimura's estimate of evolutionary distances to sequence-based epistatic models of selection.


Subject(s)
Biological Evolution , Proteins , Mutation , Proteins/genetics , Algorithms
8.
PLoS Comput Biol ; 18(9): e1010561, 2022 09.
Article in English | MEDLINE | ID: mdl-36174101

ABSTRACT

Selection protocols such as SELEX, where molecules are selected over multiple rounds for their ability to bind to a target of interest, are popular methods for obtaining binders for diagnostic and therapeutic purposes. We show that Restricted Boltzmann Machines (RBMs), an unsupervised two-layer neural network architecture, can successfully be trained on sequence ensembles from single rounds of SELEX experiments for thrombin aptamers. RBMs assign scores to sequences that can be directly related to their fitnesses estimated through experimental enrichment ratios. Hence, RBMs trained from sequence data at a given round can be used to predict the effects of selection at later rounds. Moreover, the parameters of the trained RBMs are interpretable and identify functional features contributing most to sequence fitness. To exploit the generative capabilities of RBMs, we introduce two different training protocols: one taking into account sequence counts, capable of identifying the few best binders, and another based on unique sequences only, generating more diverse binders. We then use the RBM models to generate novel aptamers with putative disruptive mutations or good binding properties, and validate the generated sequences with gel shift assay experiments. Finally, we compare the RBM's performance with different supervised learning approaches that include random forests and several deep neural network architectures.
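
The notion of fitness estimated through enrichment ratios can be illustrated with a short sketch. The function name `log_enrichment`, the pseudocount, and the toy counts are assumptions made here for illustration; the abstract does not state how the enrichment ratios were actually computed.

```python
import math

def log_enrichment(counts_early, counts_late, pseudocount=1.0):
    """Log ratio of each sequence's frequency in a later round to an earlier one.

    counts_early / counts_late: dicts mapping sequence -> read count.
    pseudocount: hypothetical smoothing term so that sequences missing from
    one round do not produce log(0); with pseudocount=0 every sequence must
    appear in both rounds.
    """
    seqs = set(counts_early) | set(counts_late)
    total_early = sum(counts_early.values()) + pseudocount * len(seqs)
    total_late = sum(counts_late.values()) + pseudocount * len(seqs)
    result = {}
    for s in seqs:
        f_early = (counts_early.get(s, 0) + pseudocount) / total_early
        f_late = (counts_late.get(s, 0) + pseudocount) / total_late
        result[s] = math.log(f_late / f_early)
    return result

# Toy example: "AAA" gains frequency between rounds, "CCC" loses it.
scores = log_enrichment({"AAA": 10, "CCC": 10}, {"AAA": 30, "CCC": 10})
```

A positive value marks a sequence that is enriched by selection; under the abstract's claim, such values would be the quantity to compare against model scores.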


Subject(s)
Neural Networks, Computer , Thrombin , Machine Learning
9.
Mol Biol Evol ; 38(6): 2428-2445, 2021 05 19.
Article in English | MEDLINE | ID: mdl-33555346

ABSTRACT

COVID-19 can lead to acute respiratory syndrome, which can be due to dysregulated immune signaling. We analyze the distribution of CpG dinucleotides, a pathogen-associated molecular pattern, in the SARS-CoV-2 genome. We characterize CpG content by a CpG force that accounts for statistical constraints acting on the genome at the nucleotide and amino acid levels. The CpG force, as the CpG content, is overall low compared with other pathogenic betacoronaviruses; however, it widely fluctuates along the genome, with a particularly low value, comparable with the circulating seasonal HKU1, in the spike coding region and a greater value, comparable with SARS and MERS, in the highly expressed nucleocapsid coding region (N ORF), whose transcripts are relatively abundant in the cytoplasm of infected cells and present in the 3'UTRs of all subgenomic RNA. This dual nature of CpG content could confer on SARS-CoV-2 the ability to avoid triggering pattern recognition receptors upon entry, while eliciting a stronger response during replication. We then investigate the evolution of synonymous mutations since the outbreak of the COVID-19 pandemic, finding a signature of CpG loss in regions with a greater CpG force. Sequence motifs preceding the CpG-loss-associated loci in the N ORF match recently identified binding patterns of the zinc finger antiviral protein. Using a model of the viral gene evolution under human host pressure, we find that synonymous mutations seem driven in the SARS-CoV-2 genome, and particularly in the N ORF, by the viral codon bias, the transition-transversion bias, and the pressure to lower CpG content.
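
As a rough point of reference for what "CpG content" measures, the sketch below computes the common observed/expected CpG ratio. This is a deliberately simpler statistic than the CpG force described in the abstract, which additionally accounts for nucleotide-level and amino-acid-level constraints; the function name `cpg_obs_exp` is an assumption made for illustration.

```python
def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (#CG * length) / (#C * #G).

    Values below 1 indicate CpG depletion relative to what the C and G
    frequencies alone would predict. Returns 0.0 when C or G is absent.
    """
    seq = seq.upper()
    n_c = seq.count("C")
    n_g = seq.count("G")
    if n_c == 0 or n_g == 0:
        return 0.0
    n_cg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    return n_cg * len(seq) / (n_c * n_g)

# Toy example on a short made-up sequence (not a real viral genome).
ratio = cpg_obs_exp("ACGTTGCA")
```

Sliding this function over windows of a genome would give a coarse profile of CpG depletion along the sequence, in the spirit of the region-by-region comparison the abstract describes.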


Subject(s)
COVID-19/genetics , CpG Islands , Evolution, Molecular , Genome, Viral , RNA, Viral/genetics , SARS-CoV-2 , Humans , SARS-CoV-2/genetics , SARS-CoV-2/pathogenicity
10.
Bioinformatics ; 37(22): 4083-4090, 2021 11 18.
Article in English | MEDLINE | ID: mdl-34117879

ABSTRACT

MOTIVATION: Modeling of protein family sequence distribution from homologous sequence data recently received considerable attention, in particular for structure and function predictions, as well as for protein design. In particular, direct coupling analysis, a method to infer effective pairwise interactions between residues, was shown to capture important structural constraints and to successfully generate functional protein sequences. Building on this and other graphical models, we introduce a new framework to assess the quality of the secondary structures of the generated sequences with respect to reference structures for the family. RESULTS: We introduce two scoring functions characterizing the likelihood that the secondary structure of a protein sequence matches a reference structure, called Dot Product and Pattern Matching. We test these scores on published experimental protein mutagenesis and design datasets, and show improvement in the detection of nonfunctional sequences. We also show that use of these scores helps reject nonfunctional sequences generated by graphical models (Restricted Boltzmann Machines) learned from homologous sequence alignments. AVAILABILITY AND IMPLEMENTATION: Data and code available at https://github.com/CyrilMa/ssqa. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Proteins , Proteins/chemistry , Amino Acid Sequence , Sequence Alignment , Protein Structure, Secondary , Mutagenesis
11.
PLoS Comput Biol ; 17(3): e1008751, 2021 03.
Article in English | MEDLINE | ID: mdl-33765014

ABSTRACT

The sequences of antibodies from a given repertoire are highly diverse at a few sites located on the surface of a genome-encoded larger scaffold. The scaffold is often considered to play a lesser role than highly diverse, non-genome-encoded sites in controlling binding affinity and specificity. To gauge the impact of the scaffold, we carried out quantitative phage display experiments where we compare the response to selection for binding to four different targets of three different antibody libraries based on distinct scaffolds but harboring the same diversity at randomized sites. We first show that the response to selection of an antibody library may be captured by two measurable parameters. Second, we provide evidence that one of these parameters is determined by the degree of affinity maturation of the scaffold, affinity maturation being the process by which antibodies accumulate somatic mutations to evolve towards higher affinities during the natural immune response. In all cases, we find that libraries of antibodies built around maturated scaffolds have a lower response to selection for other arbitrary targets than libraries built around germline-based scaffolds. We thus propose that germline-encoded scaffolds have a higher selective potential than maturated ones as a consequence of a selection for this potential over the long-term evolution of germline antibody genes. Our results are a first step towards quantifying the evolutionary potential of biomolecules.


Subject(s)
Antibodies/genetics , Gene Library , Computational Biology , DNA/genetics , Evolution, Molecular , Humans
12.
PLoS Comput Biol ; 17(9): e1009297, 2021 09.
Article in English | MEDLINE | ID: mdl-34473697

ABSTRACT

With the increasing ability to use high-throughput next-generation sequencing to quantify the diversity of the human T cell receptor (TCR) repertoire, the ability to use TCR sequences to infer antigen specificity could greatly aid potential diagnostics and therapeutics. Here, we use a machine-learning approach known as the Restricted Boltzmann Machine to develop a sequence-based inference approach to identify antigen-specific TCRs. Our approach combines probabilistic models of TCR sequences with clone abundance information to extract TCR sequence motifs central to an antigen-specific response. We use this model to identify patient-personalized TCR motifs that respond to individual tumor and infectious disease antigens, and to accurately discriminate specific from non-specific responses. Furthermore, the hidden structure of the model results in an interpretable representation space where TCRs responding to the same antigen cluster, correctly discriminating the response of TCRs to different viral epitopes. The model can be used to identify condition-specific responding TCRs. We focus on the examples of TCRs reactive to candidate neoantigens and selected epitopes in experiments of stimulated TCR clone expansion.


Subject(s)
Computational Biology/methods , Models, Statistical , T-Lymphocytes/immunology , Cancer Survivors , Carcinoma, Pancreatic Ductal/immunology , Cluster Analysis , Datasets as Topic , Humans , Pancreatic Neoplasms/immunology , Receptors, Antigen, T-Cell/immunology
13.
Neural Comput ; 31(8): 1671-1717, 2019 08.
Article in English | MEDLINE | ID: mdl-31260391

ABSTRACT

A restricted Boltzmann machine (RBM) is an unsupervised machine learning bipartite graphical model that jointly learns a probability distribution over data and extracts their relevant statistical features. RBMs were recently proposed for characterizing the patterns of coevolution between amino acids in protein sequences and for designing new sequences. Here, we study how the nature of the features learned by RBM changes with its defining parameters, such as the dimensionality of the representations (size of the hidden layer) and the sparsity of the features. We show that for adequate values of these parameters, RBMs operate in a so-called compositional phase in which visible configurations sampled from the RBM are obtained by recombining these features. We then compare the performance of RBM with other standard representation learning algorithms, including principal or independent component analysis (PCA, ICA), autoencoders (AE), variational autoencoders (VAE), and their sparse variants. We show that RBMs, due to the stochastic mapping between data configurations and representations, better capture the underlying interactions in the system and are significantly more robust with respect to sample size than deterministic methods such as PCA or ICA. In addition, this stochastic mapping is not prescribed a priori as in VAE, but learned from data, which allows RBMs to show good performance even with shallow architectures. All numerical results are illustrated on synthetic lattice protein data that share similar statistical features with real protein sequences and for which ground-truth interactions are known.
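
The bipartite structure and the stochastic mapping between data configurations and representations can be made concrete with a minimal sketch of a binary RBM. It assumes 0/1 visible and hidden units and hand-set parameters; the models in the paper (for example, those with categorical visible units for protein sequences) are richer, and the function names here are illustrative only.

```python
import math
import random

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def rbm_energy(v, h, W, a, b):
    """Energy E(v, h) = -sum_i a_i v_i - sum_j b_j h_j - sum_ij v_i W_ij h_j."""
    visible_term = sum(a_i * v_i for a_i, v_i in zip(a, v))
    hidden_term = sum(b_j * h_j for b_j, h_j in zip(b, h))
    coupling = sum(v[i] * W[i][j] * h[j]
                   for i in range(len(v)) for j in range(len(h)))
    return -visible_term - hidden_term - coupling

def gibbs_step(v, W, a, b, rng):
    """One block-Gibbs update: sample all hidden units given v, then all
    visible units given the sampled hidden units."""
    n_v, n_h = len(a), len(b)
    h = [1 if rng.random() < sigmoid(b[j] + sum(v[i] * W[i][j] for i in range(n_v)))
         else 0 for j in range(n_h)]
    v_new = [1 if rng.random() < sigmoid(a[i] + sum(W[i][j] * h[j] for j in range(n_h)))
             else 0 for i in range(n_v)]
    return v_new, h

# Toy model: 3 visible units, 2 hidden units, arbitrary made-up weights.
rng = random.Random(0)
W = [[1.0, -0.5], [0.0, 2.0], [-1.0, 0.5]]
a, b = [0.1, -0.2, 0.0], [0.0, 0.3]
v, h = gibbs_step([1, 0, 1], W, a, b, rng)
```

Because units within a layer are conditionally independent given the other layer, each half of the update samples a whole layer at once; repeating `gibbs_step` draws configurations from the model's joint distribution.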


Subject(s)
Unsupervised Machine Learning , Amino Acid Sequence , Computer Simulation , Models, Molecular , Models, Statistical , Principal Component Analysis , Probability , Proteins/chemistry , Proteins/genetics , Sequence Alignment , Static Electricity , Stochastic Processes
14.
PLoS Comput Biol ; 14(8): e1006320, 2018 08.
Article in English | MEDLINE | ID: mdl-30106966

ABSTRACT

The hippocampus is known to store cognitive representations, or maps, that encode both positional and contextual information, critical for episodic memories and functional behavior. How path integration and contextual cues are dynamically combined and processed by the hippocampus to keep these representations accurate over time remains unclear. To answer this question, we propose a two-way data analysis and modeling approach to CA3 multi-electrode recordings of a moving rat subjected to rapid changes of contextual (light) cues, triggering back-and-forth instabilities between two cognitive representations ("teleportation" experiment of Jezek et al.). We develop a dual neural activity decoder, capable of independently identifying the recalled cognitive map at high temporal resolution (comparable to theta cycle) and the position of the rodent given a map. Remarkably, position can be reconstructed at any time with an accuracy comparable to fixed-context periods, even during highly unstable periods. These findings provide evidence for the capability of the hippocampal neural activity to maintain an accurate encoding of spatial and contextual variables, while one of these variables undergoes rapid changes independently of the other. To explain this result we introduce an attractor neural network model for the hippocampal activity that processes inputs from external cues and the path integrator. Our model allows us to make predictions on the frequency of the cognitive map instability, its duration, and the detailed nature of the place-cell population activity, which are validated by a further analysis of the data. Our work therefore sheds light on the mechanisms by which the hippocampal network achieves and updates multi-dimensional neural representations from various input streams.


Subject(s)
CA3 Region, Hippocampal/physiology , Nerve Net/physiology , Spatial Behavior/physiology , Action Potentials/physiology , Animals , Cues , Hippocampus/physiology , Male , Memory, Episodic , Models, Neurological , Neural Networks, Computer , Neurons/physiology , Rats , Space Perception/physiology
15.
Rep Prog Phys ; 81(3): 032601, 2018 03.
Article in English | MEDLINE | ID: mdl-29120346

ABSTRACT

In the course of evolution, proteins undergo important changes in their amino acid sequences, while their three-dimensional folded structure and their biological function remain remarkably conserved. Thanks to modern sequencing techniques, sequence data accumulate at an unprecedented pace. This provides large sets of so-called homologous, i.e. evolutionarily related, protein sequences, to which methods of inverse statistical physics can be applied. Using sequence data as the basis for the inference of Boltzmann distributions from samples of microscopic configurations or observables, it is possible to extract information about evolutionary constraints and thus protein function and structure. Here we give an overview of some biologically important questions, and of how statistical-mechanics-inspired modeling approaches can help to answer them. Finally, we discuss some open questions, which we expect to be addressed over the next years.


Subject(s)
Physics/methods , Proteins/chemistry , Amino Acid Sequence , Molecular Sequence Annotation , Proteins/metabolism , Sequence Homology, Amino Acid
16.
Proc Natl Acad Sci U S A ; 112(49): 15154-9, 2015 Dec 08.
Article in English | MEDLINE | ID: mdl-26575629

ABSTRACT

Recent studies have demonstrated abundant transcription of a set of noncoding RNAs (ncRNAs) preferentially within tumors as opposed to normal tissue. Using an approach from statistical physics, we quantify global transcriptome-wide motif use for the first time, to our knowledge, in human and murine ncRNAs, determining that most have motif use consistent with the coding genome. However, an outlier subset of tumor-associated ncRNAs, typically of recent evolutionary origin, has motif use that is often indicative of pathogen-associated RNA. For instance, we show that the tumor-associated human repeat human satellite repeat II (HSATII) is enriched in motifs containing CpG dinucleotides in AU-rich contexts that most of the human genome and human adapted viruses have evolved to avoid. We demonstrate that a key subset of these ncRNAs functions as immunostimulatory "self-agonists" and directly activates cells of the mononuclear phagocytic system to produce proinflammatory cytokines. These ncRNAs arise from endogenous repetitive elements that are normally silenced, yet are often very highly expressed in cancers. We propose that the innate response in tumors may partially originate from direct interaction of immunogenic ncRNAs expressed in cancer cells with innate pattern recognition receptors, and thereby assign a previously unidentified danger-associated function to a set of dark matter repetitive elements. These findings potentially reconcile several observations concerning the role of ncRNA expression in cancers and their relationship to the tumor microenvironment.


Subject(s)
Neoplasms/genetics , RNA, Untranslated/immunology , Animals , Humans , Immunity, Innate , Mice , Neoplasms/immunology
17.
J Comput Neurosci ; 43(1): 17-33, 2017 Aug.
Article in English | MEDLINE | ID: mdl-28484899

ABSTRACT

The hippocampus stores spatial representations, or maps, which are recalled each time a subject is placed in the corresponding environment. Across different environments of similar geometry, these representations show strong orthogonality in CA3 of hippocampus, whereas in the CA1 subfield a considerable overlap between the maps can be seen. The lower orthogonality decreases the reliability of various decoders developed in an attempt to identify which of the stored maps is active at the moment. The decoding problem becomes especially acute when data must be analyzed at high temporal resolution. Here, we introduce a functional-connectivity-based decoder, which accounts for the pairwise correlations between the spiking activities of neurons in each map and does not require any positional information, i.e. any knowledge about place fields. We first show, on recordings of hippocampal activity in constant environmental conditions, that our decoder outperforms existing decoding methods in CA1. Our decoder is then applied to data from teleportation experiments, in which an instantaneous switch between the environment identity triggers a recall of the corresponding spatial representation. We test the sensitivity of our approach on the transition dynamics between the respective memory states (maps). We find that the rate of spontaneous state shifts (flickering) after a teleportation event is increased not only within the first few seconds as already reported, but this instability is sustained across much longer (> 1 min) periods.


Subject(s)
CA1 Region, Hippocampal/physiology , Memory , Models, Neurological , Neurons/physiology , Hippocampus , Humans , Reproducibility of Results
18.
PLoS Comput Biol ; 12(5): e1004889, 2016 05.
Article in English | MEDLINE | ID: mdl-27177270

ABSTRACT

Inverse statistical approaches to determine protein structure and function from Multiple Sequence Alignments (MSA) are emerging as powerful tools in computational biology. However, the underlying assumptions about the relationship between the inferred effective Potts Hamiltonian and real protein structure and energetics remain untested so far. Here we use a lattice protein (LP) model to benchmark those inverse statistical approaches. We build MSA of highly stable sequences in target LP structures, and infer the effective pairwise Potts Hamiltonians from those MSA. We find that inferred Potts Hamiltonians reproduce many important aspects of 'true' LP structures and energetics. Careful analysis reveals that effective pairwise couplings in inferred Potts Hamiltonians depend not only on the energetics of the native structure but also on competing folds; in particular, the coupling values reflect both positive design (stabilization of native conformation) and negative design (destabilization of competing folds). In addition to providing detailed structural information, the inferred Potts models used as protein Hamiltonians for the design of new sequences are able to generate with high probability completely new sequences with the desired folds, which is not possible using independent-site models. Those are remarkable results, as the effective LP Hamiltonians used to generate MSA are not simple pairwise models due to the competition between the folds. Our findings elucidate the reasons for the success of inverse approaches to the modelling of proteins from sequence data, and their limitations.
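
For readers unfamiliar with pairwise Potts Hamiltonians, the sketch below evaluates the energy of a sequence from site fields and pairwise couplings. The sign convention (lower energy for more favourable sequences), the dictionary layout, and the name `potts_energy` are assumptions made for illustration; inferring the parameters from an alignment, which is the subject of the paper, is not shown.

```python
def potts_energy(seq, fields, couplings):
    """Pairwise Potts energy
        E(a_1..a_L) = -sum_i h_i(a_i) - sum_{i<j} J_ij(a_i, a_j).

    fields:    list of dicts, fields[i][a] = h_i(a)
    couplings: dict keyed by site pair (i, j) with i < j, each value a dict
               mapping a symbol pair (a, b) to J_ij(a, b)
    Parameters that are not listed are treated as zero.
    """
    length = len(seq)
    energy = 0.0
    for i in range(length):
        energy -= fields[i].get(seq[i], 0.0)
    for i in range(length):
        for j in range(i + 1, length):
            energy -= couplings.get((i, j), {}).get((seq[i], seq[j]), 0.0)
    return energy

# Toy two-site model with made-up parameters.
fields = [{"A": 1.0}, {"C": 0.5}]
couplings = {(0, 1): {("A", "C"): 2.0}}
e_native = potts_energy("AC", fields, couplings)
e_mutant = potts_energy("AA", fields, couplings)
```

Setting all couplings to zero recovers an independent-site model, the baseline that the abstract reports cannot generate new sequences with the desired folds.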


Subject(s)
Proteins/chemistry , Sequence Alignment/statistics & numerical data , Amino Acid Sequence , Benchmarking , Computational Biology , Computer Simulation , Models, Molecular , Models, Statistical , Protein Conformation , Protein Folding , Proteins/genetics
19.
Nucleic Acids Res ; 43(21): 10444-55, 2015 Dec 02.
Article in English | MEDLINE | ID: mdl-26420827

ABSTRACT

Despite the biological importance of non-coding RNA, their structural characterization remains challenging. Making use of the rapidly growing sequence databases, we analyze nucleotide coevolution across homologous sequences via Direct-Coupling Analysis to detect nucleotide-nucleotide contacts. For a representative set of riboswitches, we show that the results of Direct-Coupling Analysis in combination with a generalized Nussinov algorithm systematically improve the results of RNA secondary structure prediction beyond traditional covariance approaches based on mutual information. Even more importantly, we show that the results of Direct-Coupling Analysis are enriched in tertiary structure contacts. By integrating these predictions into molecular modeling tools, systematically improved tertiary structure predictions can be obtained, as compared to using secondary structure information alone.
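
The Nussinov algorithm mentioned above is a dynamic program over subsequences. The sketch below implements the textbook version, which maximizes the number of nested base pairs; it is not the generalized variant used in the paper, where (presumably) coevolution scores would replace the unit score per pair. The minimum loop length of 3 and the inclusion of G-U wobble pairs are assumptions.

```python
def nussinov_max_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs in an RNA sequence.

    dp[i][j] holds the best pair count for the subsequence seq[i..j].
    A pair (k, j) is allowed only if at least `min_loop` unpaired bases
    lie between k and j.
    """
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case 1: position j stays unpaired
            for k in range(i, j - min_loop):  # case 2: j pairs with some k
                if (seq[k], seq[j]) in pairs:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1] if n else 0

# A short hairpin: three G-C pairs closing an AAA loop.
n_pairs = nussinov_max_pairs("GGGAAACCC")
```

Replacing the `+ 1` with a position-dependent score is the natural place where contact predictions could enter, which is one way to read the "generalized" algorithm described in the abstract.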


Subject(s)
RNA/chemistry , Sequence Analysis, RNA/methods , Algorithms , Evolution, Molecular , Models, Molecular , Nucleic Acid Conformation , Riboswitch , Sequence Alignment , Sequence Homology, Nucleic Acid
20.
Proc Natl Acad Sci U S A ; 111(13): 5054-9, 2014 Apr 01.
Article in English | MEDLINE | ID: mdl-24639520

ABSTRACT

We outline a theory to quantify the interplay of entropic and selective forces on nucleotide organization and apply it to the genomes of single-stranded RNA viruses. We quantify these forces as intensive variables that can easily be compared between sequences, outline a computationally efficient transfer-matrix method for their calculation, and apply this method to influenza and HIV viruses. We find viruses altering their dinucleotide motif use under selective forces, with these forces on CpG dinucleotides growing stronger in influenza the longer it replicates in humans. For a subset of genes in the human genome, many involved in antiviral innate immunity, the forces acting on CpG dinucleotides are even greater than the forces observed in viruses, suggesting that both effects are in response to similar selective forces involving the innate immune system. We further find that the dynamics of entropic forces balancing selective forces can be used to predict how long it will take a virus to adapt to a new host, and that it would take H1N1 several centuries to adapt to humans from birds, typically contributing many of its synonymous substitutions to the forcible removal of CpG dinucleotides. By examining the probability landscape of dinucleotide motifs, we predict where motifs are likely to appear using only a single-force parameter and uncover the localization of UpU motifs in HIV. Essentially, we extend the natural language and concepts of statistical physics, such as entropy and conjugated forces, to understanding viral sequences and, more generally, constrained genome evolution.


Subject(s)
Entropy , Models, Biological , Viruses/genetics , Base Sequence , Codon/genetics , Computer Simulation , Dinucleoside Phosphates/genetics , Humans , Influenza A Virus, H1N1 Subtype/genetics , Molecular Mimicry , Nucleotide Motifs