Results 1 - 20 of 57
1.
Proc Natl Acad Sci U S A ; 121(24): e2316401121, 2024 Jun 11.
Article in English | MEDLINE | ID: mdl-38838016

ABSTRACT

The accurate prediction of binding between T cell receptors (TCR) and their cognate epitopes is key to understanding the adaptive immune response and developing immunotherapies. Current methods face two significant limitations: the shortage of comprehensive high-quality data and the bias introduced by the selection of the negative training data commonly used in the supervised learning approaches. We propose a method, Transformer-based Unsupervised Language model for Interacting Peptides and T cell receptors (TULIP), that addresses both limitations by leveraging incomplete data and unsupervised learning and using the transformer architecture of language models. Our model is flexible and integrates all possible data sources, regardless of their quality or completeness. We demonstrate the existence of a bias introduced by the sampling procedure used in previous supervised approaches, emphasizing the need for an unsupervised approach. TULIP recognizes the specific TCRs binding an epitope, performing well on unseen epitopes. Our model outperforms state-of-the-art models and offers a promising direction for the development of more accurate TCR epitope recognition models.


Subject(s)
Peptides; Receptors, Antigen, T-Cell; Receptors, Antigen, T-Cell/immunology; Receptors, Antigen, T-Cell/metabolism; Peptides/immunology; Peptides/chemistry; Peptides/metabolism; Humans; Epitopes/immunology; Protein Binding; Epitopes, T-Lymphocyte/immunology; Unsupervised Machine Learning
2.
Nucleic Acids Res ; 52(10): 5465-5477, 2024 Jun 10.
Article in English | MEDLINE | ID: mdl-38661206

ABSTRACT

Generative probabilistic models emerge as a new paradigm in data-driven, evolution-informed design of biomolecular sequences. This paper introduces a novel approach, called Edge Activation Direct Coupling Analysis (eaDCA), tailored to the characteristics of RNA sequences, with a strong emphasis on simplicity, efficiency, and interpretability. eaDCA explicitly constructs sparse coevolutionary models for RNA families, achieving performance levels comparable to more complex methods while utilizing a significantly lower number of parameters. Our approach demonstrates efficiency in generating artificial RNA sequences that closely resemble their natural counterparts in both statistical analyses and SHAPE-MaP experiments, and in predicting the effect of mutations. Notably, eaDCA provides a unique feature: estimating the number of potential functional sequences within a given RNA family. For example, in the case of cyclic di-AMP riboswitches (RF00379), our analysis suggests the existence of approximately 10^39 functional nucleotide sequences. While huge compared to the known <4000 natural sequences, this number represents only a tiny fraction of the vast pool of nearly 10^82 possible nucleotide sequences of the same length (136 nucleotides). These results underscore the promise of sparse and interpretable generative models, such as eaDCA, in enhancing our understanding of the expansive RNA sequence space.


Subject(s)
Computational Biology; Models, Genetic; RNA; Algorithms; Base Sequence; Evolution, Molecular; Models, Statistical; Mutation; Nucleic Acid Conformation; Riboswitch/genetics; RNA/chemistry; RNA/genetics; Sequence Analysis, RNA; Computational Biology/methods
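
The sequence-space figures in this abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not part of eaDCA. It assumes a four-letter nucleotide alphabet with no gap symbol, and it reads the reported number of functional sequences as an order of magnitude of 10^39. All names are hypothetical.

```python
import math

SEQ_LENGTH = 136        # nucleotides, as stated in the abstract
ALPHABET_SIZE = 4       # A, C, G, U; gaps ignored (an assumption)
LOG10_FUNCTIONAL = 39   # order of magnitude of functional sequences, read as 10^39


def log10_sequence_space(length, alphabet_size=ALPHABET_SIZE):
    """Return log10 of the number of distinct sequences of the given length."""
    return length * math.log10(alphabet_size)


# Work in log10 throughout to keep the numbers readable.
log10_total = log10_sequence_space(SEQ_LENGTH)
log10_fraction = LOG10_FUNCTIONAL - log10_total

print(f"possible sequences   ~ 10^{log10_total:.1f}")
print(f"functional fraction  ~ 10^{log10_fraction:.1f}")
```

Under these assumptions the total comes out just below 10^82, consistent with the "nearly 10^82" quoted in the abstract.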
3.
Proc Natl Acad Sci U S A ; 119(4)2022 01 25.
Article in English | MEDLINE | ID: mdl-35022216

ABSTRACT

The emergence of new variants of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a major concern given their potential impact on the transmissibility and pathogenicity of the virus as well as the efficacy of therapeutic interventions. Here, we predict the mutability of all positions in SARS-CoV-2 protein domains to forecast the appearance of unseen variants. Using sequence data from other coronaviruses, preexisting to SARS-CoV-2, we build statistical models that not only capture amino acid conservation but also more complex patterns resulting from epistasis. We show that these models are notably superior to conservation profiles in estimating the already observable SARS-CoV-2 variability. In the receptor binding domain of the spike protein, we observe that the predicted mutability correlates well with experimental measures of protein stability and that both are reliable mutability predictors (receiver operating characteristic areas under the curve ∼0.8). Most interestingly, we observe an increasing agreement between our model and the observed variability as more data become available over time, proving the anticipatory capacity of our model. When combined with data concerning the immune response, our approach identifies positions where current variants of concern are highly overrepresented. These results could assist studies on viral evolution and future viral outbreaks and, in particular, guide the exploration and anticipation of potentially harmful future SARS-CoV-2 variants.


Subject(s)
COVID-19/virology; Epistasis, Genetic; Epitopes; Mutation; SARS-CoV-2/genetics; Spike Glycoprotein, Coronavirus/chemistry; Spike Glycoprotein, Coronavirus/genetics; Viral Proteins/chemistry; Algorithms; Area Under Curve; Computational Biology/methods; DNA Mutational Analysis; Databases, Protein; Deep Learning; Epitopes/chemistry; Genome, Viral; Humans; Models, Statistical; Mutagenesis; Probability; Protein Domains; ROC Curve
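
The abstract summarizes predictor quality as an area under the receiver operating characteristic curve (AUC) of about 0.8. As a minimal sketch of what that number measures, the code below computes AUC as the probability that a randomly chosen positive outscores a randomly chosen negative, counting ties as one half. The function name and the toy data are invented for illustration; they are not the authors' code or data.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    scores: predicted values, higher meaning "more likely positive".
    labels: 1 for positive (e.g. position observed to vary), 0 for negative.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5  # ties count as half a correct ranking
    return correct / (len(positives) * len(negatives))


# Toy example: hypothetical predicted mutability per position,
# and whether each position was actually observed to vary.
predicted_mutability = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
observed_variable = [1, 1, 1, 0, 0, 0]
auc = roc_auc(predicted_mutability, observed_variable)
print(f"AUC = {auc:.3f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation. The pairwise loop is quadratic, which is fine for a sketch but would be replaced by a rank-based computation on large data.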
4.
Bioinformatics ; 39(7)2023 07 01.
Article in English | MEDLINE | ID: mdl-37399105

ABSTRACT

MOTIVATION: Being able to artificially design novel proteins of desired function is pivotal in many biological and biomedical applications. Generative statistical modeling has recently emerged as a new paradigm for designing amino acid sequences, including in particular models and embedding methods borrowed from natural language processing (NLP). However, most approaches target single proteins or protein domains, and do not take into account any functional specificity or interaction with the context. To extend beyond current computational strategies, we develop a method for generating protein domain sequences intended to interact with another protein domain. Using data from natural multidomain proteins, we cast the problem as a translation problem from a given interactor domain to the new domain to be generated, i.e. we generate artificial partner sequences conditional on an input sequence. We also show in an example that the same procedure can be applied to interactions between distinct proteins. RESULTS: Evaluating our model's quality using diverse metrics, in part related to distinct biological questions, we show that our method outperforms state-of-the-art shallow autoregressive strategies. We also explore the possibility of fine-tuning pretrained large language models for the same task and of using Alphafold 2 for assessing the quality of sampled sequences. AVAILABILITY AND IMPLEMENTATION: Data and code on https://github.com/barthelemymp/Domain2DomainProteinTranslation.


Subject(s)
Language; Proteins; Amino Acid Sequence; Proteins/chemistry; Protein Domains
5.
PLoS Comput Biol ; 19(3): e1011010, 2023 03.
Article in English | MEDLINE | ID: mdl-36996234

ABSTRACT

Predicting protein-protein interactions from sequences is an important goal of computational biology. Various sources of information can be used to this end. Starting from the sequences of two interacting protein families, one can use phylogeny or residue coevolution to infer which paralogs are specific interaction partners within each species. We show that these two signals can be combined to improve the performance of the inference of interaction partners among paralogs. For this, we first align the sequence-similarity graphs of the two families through simulated annealing, yielding a robust partial pairing. We next use this partial pairing to seed a coevolution-based iterative pairing algorithm. This combined method improves performance over either separate method. The improvement obtained is striking in the difficult cases where the average number of paralogs per species is large or where the total number of sequences is modest.


Subject(s)
Algorithms; Proteins; Protein Binding; Phylogeny; Proteins/chemistry; Computational Biology/methods
6.
Mol Biol Evol ; 39(1)2022 01 07.
Article in English | MEDLINE | ID: mdl-34751386

ABSTRACT

During their evolution, proteins explore sequence space via an interplay between random mutations and phenotypic selection. Here, we build upon recent progress in reconstructing data-driven fitness landscapes for families of homologous proteins, to propose stochastic models of experimental protein evolution. These models predict quantitatively important features of experimentally evolved sequence libraries, like fitness distributions and position-specific mutational spectra. They also allow us to efficiently simulate sequence libraries for a vast array of combinations of experimental parameters like sequence divergence, selection strength, and library size. We showcase the potential of the approach in reanalyzing two recent experiments to determine protein structure from signals of epistasis emerging in experimental sequence libraries. To be detectable, these signals require sufficiently large and sufficiently diverged libraries. Our modeling framework offers a quantitative explanation for different outcomes of recently published experiments. Furthermore, we can forecast the outcome of time- and resource-intensive evolution experiments, opening thereby a way to computationally optimize experimental protocols.


Subject(s)
Epistasis, Genetic; Space Flight; Evolution, Molecular; Genetic Fitness; Models, Genetic; Mutation; Proteins/genetics
7.
J Am Chem Soc ; 144(31): 14057-14070, 2022 08 10.
Article in English | MEDLINE | ID: mdl-35895935

ABSTRACT

Dehydroamino acids are important structural motifs and biosynthetic intermediates for natural products. Many bioactive natural products of nonribosomal origin contain dehydroamino acids; however, the biosynthesis of dehydroamino acids in most nonribosomal peptides is not well understood. Here, we provide biochemical and bioinformatic evidence in support of the role of a unique class of condensation domains in dehydration (CmodAA). We also obtain the crystal structure of a CmodAA domain, which is part of the nonribosomal peptide synthetase AmbE in the biosynthesis of the antibiotic methoxyvinylglycine. Biochemical analysis reveals that AmbE-CmodAA modifies a peptide substrate that is attached to the donor carrier protein. Mutational studies of AmbE-CmodAA identify several key residues for activity, including four residues that are mostly conserved in the CmodAA subfamily. Alanine mutation of these conserved residues either significantly increases or decreases AmbE activity. AmbE exhibits a dimeric conformation, which is uncommon and could enable transfer of an intermediate between different protomers. Our discovery highlights a central dehydrating function for CmodAA domains that unifies dehydroamino acid biosynthesis in diverse nonribosomal peptide pathways. Our work also begins to shed light on the mechanism of CmodAA domains. Understanding CmodAA domain function may facilitate identification of new natural products that contain dehydroamino acids and enable engineering of dehydroamino acids into nonribosomal peptides.


Subject(s)
Biological Products; Peptide Biosynthesis, Nucleic Acid-Independent; Anti-Bacterial Agents; Peptide Synthases/metabolism; Peptides/chemistry
8.
PLoS Comput Biol ; 17(5): e1008957, 2021 05.
Article in English | MEDLINE | ID: mdl-34029316

ABSTRACT

Coevolution-based contact prediction, either directly by coevolutionary couplings resulting from global statistical sequence models or using structural supervision and deep learning, has found widespread application in protein-structure prediction from sequence. However, one of the basic assumptions in global statistical modeling is that sequences form an at least approximately independent sample of an unknown probability distribution, which is to be learned from data. In the case of protein families, this assumption is obviously violated by phylogenetic relations between protein sequences. It has turned out to be notoriously difficult to take phylogenetic correlations into account in coevolutionary model learning. Here, we propose a complementary approach: we develop strategies to randomize or resample sequence data, such that conservation patterns and phylogenetic relations are preserved, while intrinsic (i.e. structure- or function-based) coevolutionary couplings are removed. A comparison between the results of Direct Coupling Analysis applied to real and to resampled data shows that the largest coevolutionary couplings, i.e. those used for contact prediction, are only weakly influenced by phylogeny. However, the phylogeny-induced spurious couplings in the resampled data are compatible in size with the first false-positive contact predictions from real data. Dissecting functional from phylogeny-induced couplings might therefore extend accurate contact predictions to the range of intermediate-size couplings.


Subject(s)
Evolution, Molecular; Phylogeny; Proteins/chemistry; Algorithms; Computational Biology/methods; Protein Conformation; Sequence Alignment
9.
BMC Bioinformatics ; 22(1): 528, 2021 Oct 29.
Article in English | MEDLINE | ID: mdl-34715775

ABSTRACT

BACKGROUND: Boltzmann machines are energy-based models that have been shown to provide an accurate statistical description of domains of evolutionarily related protein and RNA families. They are parametrized in terms of local biases accounting for residue conservation, and pairwise terms to model epistatic coevolution between residues. From the model parameters, it is possible to extract an accurate prediction of the three-dimensional contact map of the target domain. More recently, the accuracy of these models has also been assessed in terms of their ability to predict mutational effects and to generate in silico functional sequences. RESULTS: Our adaptive implementation of Boltzmann machine learning, adabmDCA, can be generally applied to both protein and RNA families and accomplishes several learning set-ups, depending on the complexity of the input data and on the user requirements. The code is fully available at https://github.com/anna-pa-m/adabmDCA. As an example, we have performed the learning of three Boltzmann machines modeling the Kunitz and Beta-lactamase2 protein domains and TPP-riboswitch RNA domain. CONCLUSIONS: The models learned by adabmDCA are comparable to those obtained by state-of-the-art techniques for this task, in terms of the quality of the inferred contact map as well as of the synthetically generated sequences. In addition, the code implements both equilibrium and out-of-equilibrium learning, which allows for an accurate and lossless training when the equilibrium one is prohibitive in terms of computational time, and allows for pruning irrelevant parameters using an information-based criterion.


Subject(s)
Machine Learning; Proteins; Humans; Proteins/genetics; RNA
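
The parametrization described in this abstract, local biases plus pairwise terms, can be written as an energy function over sequences. The sketch below is a generic illustration and not the adabmDCA interface. The sign convention E(s) = -sum_i h_i(s_i) - sum_{i<j} J_ij(s_i, s_j), the sparse dictionary layout, and the toy parameters are all assumptions made for this example.

```python
def potts_energy(seq, fields, couplings):
    """Energy of a sequence under a pairwise (Potts-like) model.

    fields:    list of dicts, fields[i][a] = local bias h_i(a).
    couplings: dict mapping (i, j) with i < j to a dict of
               {(a, b): J_ij(a, b)}; missing entries count as zero,
               which gives a sparse model.
    Lower energy means higher probability, P(s) proportional to exp(-E(s)).
    """
    length = len(seq)
    energy = 0.0
    # Single-site terms: residue conservation.
    for i in range(length):
        energy -= fields[i].get(seq[i], 0.0)
    # Pairwise terms: coevolution between positions i and j.
    for i in range(length):
        for j in range(i + 1, length):
            pair_table = couplings.get((i, j), {})
            energy -= pair_table.get((seq[i], seq[j]), 0.0)
    return energy


# Toy model over the two-letter alphabet {A, B}, three positions.
toy_fields = [{"A": 1.0, "B": 0.0}, {"A": 0.0, "B": 0.5}, {"A": 0.2, "B": 0.0}]
toy_couplings = {(0, 2): {("A", "A"): 2.0}}  # positions 0 and 2 favor A together

energy_aba = potts_energy("ABA", toy_fields, toy_couplings)
energy_bba = potts_energy("BBA", toy_fields, toy_couplings)
print(energy_aba, energy_bba)
```

In this toy model the sequence that satisfies the coupling between positions 0 and 2 has the lower energy. Energy differences between a sequence and its mutants are the kind of quantity used to score mutational effects in this class of models.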
10.
PLoS Comput Biol ; 16(10): e1007621, 2020 10.
Article in English | MEDLINE | ID: mdl-33035205

ABSTRACT

Predicting three-dimensional protein structure and assembling protein complexes using sequence information are among the most prominent tasks in computational biology. Recently, substantial progress has been made in the case of single proteins using a combination of unsupervised coevolutionary sequence analysis with structurally supervised deep learning. While reaching impressive accuracies in predicting residue-residue contacts, deep learning has a number of disadvantages. The need for large structural training sets limits the applicability to multi-protein complexes; and their deep architecture makes the interpretability of the convolutional neural networks intrinsically hard. Here we introduce FilterDCA, a simpler supervised predictor for inter-domain and inter-protein contacts. It is based on the fact that contact maps of proteins show typical contact patterns, which result from secondary structure and are reflected by patterns in coevolutionary analysis. We explicitly integrate averaged contact patterns with coevolutionary scores derived by Direct Coupling Analysis, improving performance over standard coevolutionary analysis, while remaining fully transparent and interpretable. The FilterDCA code is available at http://gitlab.lcqb.upmc.fr/muscat/FilterDCA.


Subject(s)
Computational Biology/methods; Protein Conformation; Proteins/chemistry; Sequence Analysis, Protein/methods; Models, Molecular; Software; Supervised Machine Learning
11.
PLoS Comput Biol ; 15(10): e1007179, 2019 10.
Article in English | MEDLINE | ID: mdl-31609984

ABSTRACT

Determining which proteins interact together is crucial to a systems-level understanding of the cell. Recently, algorithms based on Direct Coupling Analysis (DCA) pairwise maximum-entropy models have made it possible to identify interaction partners among paralogous proteins from sequence data. This success of DCA at predicting protein-protein interactions could be mainly based on its known ability to identify pairs of residues that are in contact in the three-dimensional structure of protein complexes and that coevolve to remain physicochemically complementary. However, interacting proteins possess similar evolutionary histories. What is the role of purely phylogenetic correlations in the performance of DCA-based methods to infer interaction partners? To address this question, we employ controlled synthetic data that only involve phylogeny and no interactions or contacts. We find that DCA accurately identifies the pairs of synthetic sequences that share evolutionary history. While phylogenetic correlations confound the identification of contacting residues by DCA, they are thus useful to predict interacting partners among paralogs. We find that DCA performs as well as phylogenetic methods to this end, and slightly better than them with large and accurate training sets. Employing DCA or phylogenetic methods within an Iterative Pairing Algorithm (IPA) makes it possible to predict pairs of evolutionary partners without a training set. We further demonstrate the ability of these various methods to correctly predict pairings among real paralogous proteins with genome proximity but no known direct physical interaction, illustrating the importance of phylogenetic correlations in natural data. However, for physically interacting and strongly coevolving proteins, DCA and mutual information outperform phylogenetic methods. We finally discuss how to distinguish physically interacting proteins from proteins that only share a common evolutionary history.


Subject(s)
Protein Interaction Mapping/methods; Sequence Analysis, DNA/methods; Algorithms; Evolution, Molecular; Phylogeny; Protein Binding/physiology; Protein Conformation; Proteins/chemistry
12.
PLoS Comput Biol ; 15(10): e1006891, 2019 10.
Article in English | MEDLINE | ID: mdl-31634362

ABSTRACT

Interacting proteins and protein domains coevolve on multiple scales, from their correlated presence across species, to correlations in amino-acid usage. Genomic databases provide rapidly growing data for variability in genomic protein content and in protein sequences, calling for computational predictions of unknown interactions. We first introduce the concept of direct phyletic couplings, based on global statistical models of phylogenetic profiles. They strongly increase the accuracy of predicting pairs of related protein domains beyond simpler correlation-based approaches like phylogenetic profiling (80% vs. 30-50% positives out of the 1000 highest-scoring pairs). Combined with the direct coupling analysis of inter-protein residue-residue coevolution, we provide multi-scale evidence for direct but unknown interaction between protein families. An in-depth discussion shows these to be biologically sensible and directly experimentally testable. Negative phyletic couplings highlight alternative solutions for the same functionality, including documented cases of convergent evolution. Thereby our work proves the strong potential of global statistical modeling approaches to genome-wide coevolutionary analysis, far beyond the established use for individual protein complexes and domain-domain interactions.


Subject(s)
Computational Biology/methods; Protein Interaction Domains and Motifs/physiology; Protein Interaction Mapping/methods; Algorithms; Amino Acids/metabolism; Animals; Biophysical Phenomena; Evolution, Molecular; Humans; Models, Statistical; Phylogeny; Protein Binding/physiology; Protein Domains/physiology; Proteins/chemistry
13.
Proc Natl Acad Sci U S A ; 114(13): E2662-E2671, 2017 03 28.
Article in English | MEDLINE | ID: mdl-28289198

ABSTRACT

Proteins have evolved to perform diverse cellular functions, from serving as reaction catalysts to coordinating cellular propagation and development. Frequently, proteins do not exert their full potential as monomers but rather undergo concerted interactions as either homo-oligomers or with other proteins as hetero-oligomers. The experimental study of such protein complexes and interactions has been arduous. Theoretical structure prediction methods are an attractive alternative. Here, we investigate homo-oligomeric interfaces by tracing residue coevolution via the global statistical direct coupling analysis (DCA). DCA can accurately infer spatial adjacencies between residues. These adjacencies can be included as constraints in structure prediction techniques to predict high-resolution models. By taking advantage of the ongoing exponential growth of sequence databases, we go significantly beyond anecdotal cases of a few protein families and apply DCA to a systematic large-scale study of nearly 2,000 Pfam protein families with sufficient sequence information and structurally resolved homo-oligomeric interfaces. We find that large interfaces are commonly identified by DCA. We further demonstrate that DCA can differentiate between subfamilies with different binding modes within one large Pfam family. Sequence-derived contact information for the subfamilies proves sufficient to assemble accurate structural models of the diverse protein-oligomers. Thus, we provide an approach to investigate oligomerization for arbitrary protein families leading to structural models complementary to often-difficult experimental methods. Combined with ever more abundant sequential data, we anticipate that this study will be instrumental to allow the structural description of many heteroprotein complexes in the future.


Subject(s)
Evolution, Molecular; Proteins/chemistry; Databases, Protein; Models, Molecular; Molecular Biology/methods; Protein Conformation; Protein Interaction Domains and Motifs; Proteins/metabolism
14.
Proc Natl Acad Sci U S A ; 114(43): E9026-E9035, 2017 Oct 24.
Article in English | MEDLINE | ID: mdl-29073099

ABSTRACT

Understanding the extreme variation among bacterial genomes remains an unsolved challenge in evolutionary biology, despite long-standing debate about the relative importance of natural selection, mutation, and random drift. A potentially important confounding factor is the variation in mutation rates between lineages and over evolutionary history, which has been documented in several species. Mutation accumulation experiments have shown that hypermutability can erode genomes over short timescales. These results, however, were obtained under conditions of extremely weak selection, casting doubt on their general relevance. Here, we circumvent this limitation by analyzing genomes from mutator populations that arose during a long-term experiment with Escherichia coli, in which populations have been adaptively evolving for >50,000 generations. We develop an analytical framework to quantify the relative contributions of mutation and selection in shaping genomic characteristics, and we validate it using genomes evolved under regimes of high mutation rates with weak selection (mutation accumulation experiments) and low mutation rates with strong selection (natural isolates). Our results show that, despite sustained adaptive evolution in the long-term experiment, the signature of selection is much weaker than that of mutational biases in mutator genomes. This finding suggests that relatively brief periods of hypermutability can play an outsized role in shaping extant bacterial genomes. Overall, these results highlight the importance of genomic draft, in which strong linkage limits the ability of selection to purge deleterious mutations. These insights are also relevant to other biological systems evolving under strong linkage and high mutation rates, including viruses and cancer cells.


Subject(s)
Escherichia coli/genetics; Evolution, Molecular; Genome, Bacterial; Selection, Genetic; Escherichia coli/physiology; Mutation; Mutation Rate; Phylogeny
15.
Mol Biol Evol ; 35(4): 1018-1027, 2018 04 01.
Article in English | MEDLINE | ID: mdl-29351669

ABSTRACT

Global coevolutionary models of homologous protein families, as constructed by direct coupling analysis (DCA), have recently gained popularity in particular due to their capacity to accurately predict residue-residue contacts from sequence information alone, and thereby to facilitate tertiary and quaternary protein structure prediction. More recently, they have also been used to predict fitness effects of amino-acid substitutions in proteins, and to predict evolutionarily conserved protein-protein interactions. These models are based on two currently unjustified hypotheses: 1) correlations in the amino-acid usage of different positions are resulting collectively from networks of direct couplings; and 2) pairwise couplings are sufficient to capture the amino-acid variability. Here, we propose a highly precise inference scheme based on Boltzmann-machine learning, which allows us to systematically address these hypotheses. We show how correlations are built up in a highly collective way by a large number of coupling paths, which are based on the protein's three-dimensional structure. We further find that pairwise coevolutionary models capture the collective residue variability across homologous proteins even for quantities which are not imposed by the inference procedure, like three-residue correlations, the clustered structure of protein families in sequence space or the sequence distances between homologs. These findings strongly suggest that pairwise coevolutionary models are actually sufficient to accurately capture the residue variability in homologous protein families.


Subject(s)
Biological Coevolution; Models, Genetic; Proteins/genetics; Multigene Family; Sequence Homology, Amino Acid
16.
PLoS Comput Biol ; 14(3): e1005992, 2018 03.
Article in English | MEDLINE | ID: mdl-29543809

ABSTRACT

We present a new educational initiative called Meet-U that aims to train students for collaborative work in computational biology and to bridge the gap between education and research. Meet-U mimics the setup of collaborative research projects and takes advantage of the most popular tools for collaborative work and of cloud computing. Students are grouped in teams of 4-5 people and have to realize a project from A to Z that answers a challenging question in biology. Meet-U promotes "coopetition," as the students collaborate within and across the teams and are also in competition with each other to develop the best final product. Meet-U fosters interactions between different actors of education and research through the organization of a meeting day, open to everyone, where the students present their work to a jury of researchers and jury members give research seminars. This very unique combination of education and research is strongly motivating for the students and provides a formidable opportunity for a scientific community to unite and increase its visibility. We report on our experience with Meet-U in two French universities with master's students in bioinformatics and modeling, with protein-protein docking as the subject of the course. Meet-U is easy to implement and can be straightforwardly transferred to other fields and/or universities. All the information and data are available at www.meet-u.org.


Subject(s)
Computational Biology/education; Computational Biology/methods; Research/education; Humans; Research Design; Students; Universities
17.
Proc Natl Acad Sci U S A ; 113(43): 12186-12191, 2016 10 25.
Article in English | MEDLINE | ID: mdl-27729520

ABSTRACT

Understanding protein-protein interactions is central to our understanding of almost all complex biological processes. Computational tools exploiting rapidly growing genomic databases to characterize protein-protein interactions are urgently needed. Such methods should connect multiple scales from evolutionary conserved interactions between families of homologous proteins, over the identification of specifically interacting proteins in the case of multiple paralogs inside a species, down to the prediction of residues being in physical contact across interaction interfaces. Statistical inference methods detecting residue-residue coevolution have recently triggered considerable progress in using sequence data for quaternary protein structure prediction; they require, however, large joint alignments of homologous protein pairs known to interact. The generation of such alignments is a complex computational task on its own; application of coevolutionary modeling has, in turn, been restricted to proteins without paralogs, or to bacterial systems with the corresponding coding genes being colocalized in operons. Here we show that the direct coupling analysis of residue coevolution can be extended to connect the different scales, and simultaneously to match interacting paralogs, to identify interprotein residue-residue contacts and to discriminate interacting from noninteracting families in a multiprotein system. Our results extend the potential applications of coevolutionary analysis far beyond cases treatable so far.


Subject(s)
Evolution, Molecular; Protein Binding/genetics; Protein Interaction Mapping; Proteins/genetics; Algorithms; Biophysical Phenomena; Computational Biology; Protein Conformation; Proteins/chemistry; Sequence Alignment; Sequence Homology, Amino Acid
18.
Rep Prog Phys ; 81(3): 032601, 2018 03.
Article in English | MEDLINE | ID: mdl-29120346

ABSTRACT

In the course of evolution, proteins undergo important changes in their amino acid sequences, while their three-dimensional folded structure and their biological function remain remarkably conserved. Thanks to modern sequencing techniques, sequence data accumulate at unprecedented pace. This provides large sets of so-called homologous, i.e. evolutionarily related protein sequences, to which methods of inverse statistical physics can be applied. Using sequence data as the basis for the inference of Boltzmann distributions from samples of microscopic configurations or observables, it is possible to extract information about evolutionary constraints and thus protein function and structure. Here we give an overview of some biologically important questions, and how statistical-mechanics-inspired modeling approaches can help to answer them. Finally, we discuss some open questions, which we expect to be addressed over the next years.


Subject(s)
Physics/methods; Proteins/chemistry; Amino Acid Sequence; Molecular Sequence Annotation; Proteins/metabolism; Sequence Homology, Amino Acid
19.
Mol Biol Evol ; 33(1): 268-80, 2016 Jan.
Article in English | MEDLINE | ID: mdl-26446903

ABSTRACT

The quantitative characterization of mutational landscapes is a task of outstanding importance in evolutionary and medical biology: It is, for example, of central importance for our understanding of the phenotypic effect of mutations related to disease and antibiotic drug resistance. Here we develop a novel inference scheme for mutational landscapes, which is based on the statistical analysis of large alignments of homologs of the protein of interest. Our method is able to capture epistatic couplings between residues, and therefore to assess the dependence of mutational effects on the sequence context where they appear. Compared with recent large-scale mutagenesis data of the beta-lactamase TEM-1, a protein providing resistance against beta-lactam antibiotics, our method leads to an increase of about 40% in explicative power as compared with approaches neglecting epistasis. We find that the informative sequence context extends to residues at native distances of about 20 Å from the mutated site, reaching thus far beyond residues in direct physical contact.


Subject(s)
Escherichia coli Proteins/genetics; Evolution, Molecular; Mutation/genetics; beta-Lactamases/genetics; Chromosome Mapping; DNA Mutational Analysis; DNA, Bacterial/analysis; DNA, Bacterial/genetics; Epistasis, Genetic; Models, Genetic
20.
Nucleic Acids Res ; 43(21): 10444-55, 2015 Dec 02.
Article in English | MEDLINE | ID: mdl-26420827

ABSTRACT

Despite the biological importance of non-coding RNA, their structural characterization remains challenging. Making use of the rapidly growing sequence databases, we analyze nucleotide coevolution across homologous sequences via Direct-Coupling Analysis to detect nucleotide-nucleotide contacts. For a representative set of riboswitches, we show that the results of Direct-Coupling Analysis in combination with a generalized Nussinov algorithm systematically improve the results of RNA secondary structure prediction beyond traditional covariance approaches based on mutual information. Even more importantly, we show that the results of Direct-Coupling Analysis are enriched in tertiary structure contacts. By integrating these predictions into molecular modeling tools, systematically improved tertiary structure predictions can be obtained, as compared to using secondary structure information alone.


Subject(s)
RNA/chemistry; Sequence Analysis, RNA/methods; Algorithms; Evolution, Molecular; Models, Molecular; Nucleic Acid Conformation; Riboswitch; Sequence Alignment; Sequence Homology, Nucleic Acid
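
The abstract compares Direct-Coupling Analysis against traditional covariance approaches based on mutual information. As a rough sketch of that baseline only (not of Direct-Coupling Analysis itself), the code below computes the mutual information between two alignment columns from raw frequencies. It assumes no pseudocounts, no sequence reweighting, and no gap handling, and the toy alignment is invented for illustration.

```python
import math
from collections import Counter


def mutual_information(col_i, col_j):
    """Mutual information (in bits) between two alignment columns.

    col_i, col_j: equal-length sequences of symbols, one per aligned sequence.
    Uses plain empirical frequencies (no pseudocounts; an assumption).
    """
    n = len(col_i)
    if n == 0 or n != len(col_j):
        raise ValueError("columns must be non-empty and of equal length")

    count_i = Counter(col_i)
    count_j = Counter(col_j)
    count_ij = Counter(zip(col_i, col_j))

    mi = 0.0
    for (a, b), c in count_ij.items():
        p_ab = c / n
        p_a = count_i[a] / n
        p_b = count_j[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


# Toy alignment: columns 0 and 3 covary like a base pair,
# columns 1 and 2 are fully conserved.
toy_alignment = ["GAAC", "CAAG", "AAAU", "UAAA"]
columns = list(zip(*toy_alignment))

mi_paired = mutual_information(columns[0], columns[3])
mi_conserved = mutual_information(columns[1], columns[2])
print(mi_paired, mi_conserved)
```

Perfectly covarying columns score high while fully conserved columns score zero. Mutual information is a local, pair-by-pair statistic, so it cannot separate direct from indirect correlations; that limitation is the usual motivation for global models such as Direct-Coupling Analysis.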