Search | BVS Bolivia

1.

TULIP: A transformer-based unsupervised language model for interacting peptides and T cell receptors that generalizes to unseen epitopes.

Meynard-Piganeau, Barthelemy; Feinauer, Christoph; Weigt, Martin; Walczak, Aleksandra M; Mora, Thierry.

Proc Natl Acad Sci U S A ; 121(24): e2316401121, 2024 Jun 11.

Article in English | MEDLINE | ID: mdl-38838016

ABSTRACT

The accurate prediction of binding between T cell receptors (TCR) and their cognate epitopes is key to understanding the adaptive immune response and developing immunotherapies. Current methods face two significant limitations: the shortage of comprehensive high-quality data and the bias introduced by the selection of the negative training data commonly used in the supervised learning approaches. We propose a method, Transformer-based Unsupervised Language model for Interacting Peptides and T cell receptors (TULIP), that addresses both limitations by leveraging incomplete data and unsupervised learning and using the transformer architecture of language models. Our model is flexible and integrates all possible data sources, regardless of their quality or completeness. We demonstrate the existence of a bias introduced by the sampling procedure used in previous supervised approaches, emphasizing the need for an unsupervised approach. TULIP recognizes the specific TCRs binding an epitope, performing well on unseen epitopes. Our model outperforms state-of-the-art models and offers a promising direction for the development of more accurate TCR epitope recognition models.

Subject(s)

Peptides; Receptors, Antigen, T-Cell; Receptors, Antigen, T-Cell/immunology; Receptors, Antigen, T-Cell/metabolism; Peptides/immunology; Peptides/chemistry; Peptides/metabolism; Humans; Epitopes/immunology; Protein Binding; Epitopes, T-Lymphocyte/immunology; Unsupervised Machine Learning

2.

Towards parsimonious generative modeling of RNA families.

Calvanese, Francesco; Lambert, Camille N; Nghe, Philippe; Zamponi, Francesco; Weigt, Martin.

Nucleic Acids Res ; 52(10): 5465-5477, 2024 Jun 10.

Article in English | MEDLINE | ID: mdl-38661206

ABSTRACT

Generative probabilistic models emerge as a new paradigm in data-driven, evolution-informed design of biomolecular sequences. This paper introduces a novel approach, called Edge Activation Direct Coupling Analysis (eaDCA), tailored to the characteristics of RNA sequences, with a strong emphasis on simplicity, efficiency, and interpretability. eaDCA explicitly constructs sparse coevolutionary models for RNA families, achieving performance levels comparable to more complex methods while utilizing a significantly lower number of parameters. Our approach demonstrates efficiency in generating artificial RNA sequences that closely resemble their natural counterparts in both statistical analyses and SHAPE-MaP experiments, and in predicting the effect of mutations. Notably, eaDCA provides a unique feature: estimating the number of potential functional sequences within a given RNA family. For example, in the case of cyclic di-AMP riboswitches (RF00379), our analysis suggests the existence of approximately 10^39 functional nucleotide sequences. While huge compared to the known <4000 natural sequences, this number represents only a tiny fraction of the vast pool of nearly 10^82 possible nucleotide sequences of the same length (136 nucleotides). These results underscore the promise of sparse and interpretable generative models, such as eaDCA, in enhancing our understanding of the expansive RNA sequence space.
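As a rough consistency check on the figures quoted in this abstract (an illustrative sketch added to this record, not code from the paper), the size of the full sequence space can be recomputed. The only assumptions are four possible nucleotides at each of the 136 positions and no gap symbols; the variable names are hypothetical.

```python
import math

# Assumption: 4 nucleotides per position, 136 positions, no gaps.
length = 136
log10_total = length * math.log10(4)  # log10 of 4**136

# Estimated number of functional sequences quoted in the abstract (as a power of ten).
log10_functional = 39

# Fraction of sequence space estimated to be functional, in log10 units.
log10_fraction = log10_functional - log10_total

print(f"all sequences of length {length}: about 10^{log10_total:.1f}")
print(f"functional fraction: about 10^{log10_fraction:.1f}")
```

Working in log10 units keeps the comparison readable and makes the gap between the two numbers explicit as a single exponent.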

Subject(s)

Computational Biology; Models, Genetic; RNA; Algorithms; Base Sequence; Evolution, Molecular; Models, Statistical; Mutation; Nucleic Acid Conformation; Riboswitch/genetics; RNA/chemistry; RNA/genetics; Sequence Analysis, RNA; Computational Biology/methods

3.

Generating interacting protein sequences using domain-to-domain translation.

Meynard-Piganeau, Barthelemy; Fabbri, Caterina; Weigt, Martin; Pagnani, Andrea; Feinauer, Christoph.

Bioinformatics ; 39(7), 2023 07 01.

Article in English | MEDLINE | ID: mdl-37399105

ABSTRACT

MOTIVATION: Being able to artificially design novel proteins of desired function is pivotal in many biological and biomedical applications. Generative statistical modeling has recently emerged as a new paradigm for designing amino acid sequences, including in particular models and embedding methods borrowed from natural language processing (NLP). However, most approaches target single proteins or protein domains, and do not take into account any functional specificity or interaction with the context. To extend beyond current computational strategies, we develop a method for generating protein domain sequences intended to interact with another protein domain. Using data from natural multidomain proteins, we cast the problem as a translation problem from a given interactor domain to the new domain to be generated, i.e. we generate artificial partner sequences conditional on an input sequence. We also show in an example that the same procedure can be applied to interactions between distinct proteins. RESULTS: Evaluating our model's quality using diverse metrics, in part related to distinct biological questions, we show that our method outperforms state-of-the-art shallow autoregressive strategies. We also explore the possibility of fine-tuning pretrained large language models for the same task and of using Alphafold 2 for assessing the quality of sampled sequences. AVAILABILITY AND IMPLEMENTATION: Data and code on https://github.com/barthelemymp/Domain2DomainProteinTranslation.

Subject(s)

Language; Proteins; Amino Acid Sequence; Proteins/chemistry; Protein Domains

4.

Combining phylogeny and coevolution improves the inference of interaction partners among paralogous proteins.

Gandarilla-Pérez, Carlos A; Pinilla, Sergio; Bitbol, Anne-Florence; Weigt, Martin.

PLoS Comput Biol ; 19(3): e1011010, 2023 03.

Article in English | MEDLINE | ID: mdl-36996234

ABSTRACT

Predicting protein-protein interactions from sequences is an important goal of computational biology. Various sources of information can be used to this end. Starting from the sequences of two interacting protein families, one can use phylogeny or residue coevolution to infer which paralogs are specific interaction partners within each species. We show that these two signals can be combined to improve the performance of the inference of interaction partners among paralogs. For this, we first align the sequence-similarity graphs of the two families through simulated annealing, yielding a robust partial pairing. We next use this partial pairing to seed a coevolution-based iterative pairing algorithm. This combined method improves performance over either separate method. The improvement obtained is striking in the difficult cases where the average number of paralogs per species is large or where the total number of sequences is modest.

Subject(s)

Algorithms; Proteins; Protein Binding; Phylogeny; Proteins/chemistry; Computational Biology/methods

5.

Structure and Function of a Dehydrating Condensation Domain in Nonribosomal Peptide Biosynthesis.

Patteson, Jon B; Fortinez, Camille Marie; Putz, Andrew T; Rodriguez-Rivas, Juan; Bryant, L Henry; Adhikari, Kamal; Weigt, Martin; Schmeing, T Martin; Li, Bo.

J Am Chem Soc ; 144(31): 14057-14070, 2022 08 10.

Article in English | MEDLINE | ID: mdl-35895935

ABSTRACT

Dehydroamino acids are important structural motifs and biosynthetic intermediates for natural products. Many bioactive natural products of nonribosomal origin contain dehydroamino acids; however, the biosynthesis of dehydroamino acids in most nonribosomal peptides is not well understood. Here, we provide biochemical and bioinformatic evidence in support of the role of a unique class of condensation domains in dehydration (CmodAA). We also obtain the crystal structure of a CmodAA domain, which is part of the nonribosomal peptide synthetase AmbE in the biosynthesis of the antibiotic methoxyvinylglycine. Biochemical analysis reveals that AmbE-CmodAA modifies a peptide substrate that is attached to the donor carrier protein. Mutational studies of AmbE-CmodAA identify several key residues for activity, including four residues that are mostly conserved in the CmodAA subfamily. Alanine mutation of these conserved residues either significantly increases or decreases AmbE activity. AmbE exhibits a dimeric conformation, which is uncommon and could enable transfer of an intermediate between different protomers. Our discovery highlights a central dehydrating function for CmodAA domains that unifies dehydroamino acid biosynthesis in diverse nonribosomal peptide pathways. Our work also begins to shed light on the mechanism of CmodAA domains. Understanding CmodAA domain function may facilitate identification of new natural products that contain dehydroamino acids and enable engineering of dehydroamino acids into nonribosomal peptides.

Subject(s)

Biological Products; Peptide Biosynthesis, Nucleic Acid-Independent; Anti-Bacterial Agents; Peptide Synthases/metabolism; Peptides/chemistry

6.

Deciphering polymorphism in 61,157 Escherichia coli genomes via epistatic sequence landscapes.

Vigué, Lucile; Croce, Giancarlo; Petitjean, Marie; Ruppé, Etienne; Tenaillon, Olivier; Weigt, Martin.

Nat Commun ; 13(1): 4030, 2022 07 12.

Article in English | MEDLINE | ID: mdl-35821377

ABSTRACT

Characterizing the effect of mutations is key to understanding the evolution of protein sequences and to separating neutral amino-acid changes from deleterious ones. Epistatic interactions between residues can lead to a context dependence of mutation effects. Context dependence constrains the amino-acid changes that can contribute to polymorphism in the short term, and the ones that can accumulate between species in the long term. We use computational approaches to accurately predict the polymorphisms segregating in a panel of 61,157 Escherichia coli genomes from the analysis of distant homologues. By comparing a context-aware Direct-Coupling Analysis modelling to a non-epistatic approach, we show that the genetic context strongly constrains the tolerable amino acids in 30% to 50% of amino-acid sites. The study of more distant species suggests the gradual build-up of genetic context over long evolutionary timescales by the accumulation of small epistatic contributions.

Subject(s)

Escherichia coli; Polymorphism, Genetic; Escherichia coli/genetics; Mutation

7.

Author Correction: Efficient generative modeling of protein sequences using simple autoregressive models.

Trinquier, Jeanne; Uguzzoni, Guido; Pagnani, Andrea; Zamponi, Francesco; Weigt, Martin.

Nat Commun ; 13(1): 1889, 2022 Apr 01.

Article in English | MEDLINE | ID: mdl-35365680

8.

Epistatic models predict mutable sites in SARS-CoV-2 proteins and epitopes.

Rodriguez-Rivas, Juan; Croce, Giancarlo; Muscat, Maureen; Weigt, Martin.

Proc Natl Acad Sci U S A ; 119(4), 2022 01 25.

Article in English | MEDLINE | ID: mdl-35022216

ABSTRACT

The emergence of new variants of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a major concern given their potential impact on the transmissibility and pathogenicity of the virus as well as the efficacy of therapeutic interventions. Here, we predict the mutability of all positions in SARS-CoV-2 protein domains to forecast the appearance of unseen variants. Using sequence data from other coronaviruses, predating SARS-CoV-2, we build statistical models that not only capture amino acid conservation but also more complex patterns resulting from epistasis. We show that these models are notably superior to conservation profiles in estimating the already observable SARS-CoV-2 variability. In the receptor binding domain of the spike protein, we observe that the predicted mutability correlates well with experimental measures of protein stability and that both are reliable mutability predictors (receiver operating characteristic areas under the curve ~0.8). Most interestingly, we observe an increasing agreement between our model and the observed variability as more data become available over time, proving the anticipatory capacity of our model. When combined with data concerning the immune response, our approach identifies positions where current variants of concern are highly overrepresented. These results could assist studies on viral evolution and future viral outbreaks and, in particular, guide the exploration and anticipation of potentially harmful future SARS-CoV-2 variants.

Subject(s)

COVID-19/virology; Epistasis, Genetic; Epitopes; Mutation; SARS-CoV-2/genetics; Spike Glycoprotein, Coronavirus/chemistry; Spike Glycoprotein, Coronavirus/genetics; Viral Proteins/chemistry; Algorithms; Area Under Curve; Computational Biology/methods; DNA Mutational Analysis; Databases, Protein; Deep Learning; Epitopes/chemistry; Genome, Viral; Humans; Models, Statistical; Mutagenesis; Probability; Protein Domains; ROC Curve

9.

Modeling Sequence-Space Exploration and Emergence of Epistatic Signals in Protein Evolution.

Bisardi, Matteo; Rodriguez-Rivas, Juan; Zamponi, Francesco; Weigt, Martin.

Mol Biol Evol ; 39(1), 2022 01 07.

Article in English | MEDLINE | ID: mdl-34751386

ABSTRACT

During their evolution, proteins explore sequence space via an interplay between random mutations and phenotypic selection. Here, we build upon recent progress in reconstructing data-driven fitness landscapes for families of homologous proteins, to propose stochastic models of experimental protein evolution. These models predict quantitatively important features of experimentally evolved sequence libraries, like fitness distributions and position-specific mutational spectra. They also allow us to efficiently simulate sequence libraries for a vast array of combinations of experimental parameters like sequence divergence, selection strength, and library size. We showcase the potential of the approach in reanalyzing two recent experiments to determine protein structure from signals of epistasis emerging in experimental sequence libraries. To be detectable, these signals require sufficiently large and sufficiently diverged libraries. Our modeling framework offers a quantitative explanation for different outcomes of recently published experiments. Furthermore, we can forecast the outcome of time- and resource-intensive evolution experiments, opening thereby a way to computationally optimize experimental protocols.

Subject(s)

Epistasis, Genetic; Space Flight; Evolution, Molecular; Genetic Fitness; Models, Genetic; Mutation; Proteins/genetics

10.

adabmDCA: adaptive Boltzmann machine learning for biological sequences.

Muntoni, Anna Paola; Pagnani, Andrea; Weigt, Martin; Zamponi, Francesco.

BMC Bioinformatics ; 22(1): 528, 2021 Oct 29.

Article in English | MEDLINE | ID: mdl-34715775

ABSTRACT

BACKGROUND: Boltzmann machines are energy-based models that have been shown to provide an accurate statistical description of domains of evolutionary-related protein and RNA families. They are parametrized in terms of local biases accounting for residue conservation, and pairwise terms to model epistatic coevolution between residues. From the model parameters, it is possible to extract an accurate prediction of the three-dimensional contact map of the target domain. More recently, the accuracy of these models has been also assessed in terms of their ability in predicting mutational effects and generating in silico functional sequences. RESULTS: Our adaptive implementation of Boltzmann machine learning, adabmDCA, can be generally applied to both protein and RNA families and accomplishes several learning set-ups, depending on the complexity of the input data and on the user requirements. The code is fully available at https://github.com/anna-pa-m/adabmDCA . As an example, we have performed the learning of three Boltzmann machines modeling the Kunitz and Beta-lactamase2 protein domains and TPP-riboswitch RNA domain. CONCLUSIONS: The models learned by adabmDCA are comparable to those obtained by state-of-the-art techniques for this task, in terms of the quality of the inferred contact map as well as of the synthetically generated sequences. In addition, the code implements both equilibrium and out-of-equilibrium learning, which allows for an accurate and lossless training when the equilibrium one is prohibitive in terms of computational time, and allows for pruning irrelevant parameters using an information-based criterion.

Subject(s)

Machine Learning; Proteins; Humans; Proteins/genetics; RNA

11.

Efficient generative modeling of protein sequences using simple autoregressive models.

Trinquier, Jeanne; Uguzzoni, Guido; Pagnani, Andrea; Zamponi, Francesco; Weigt, Martin.

Nat Commun ; 12(1): 5800, 2021 10 04.

Article in English | MEDLINE | ID: mdl-34608136

ABSTRACT

Generative models emerge as promising candidates for novel sequence-data driven approaches to protein design, and for the extraction of structural and functional information about proteins deeply hidden in rapidly growing sequence databases. Here we propose simple autoregressive models as highly accurate but computationally efficient generative sequence models. We show that they perform similarly to existing approaches based on Boltzmann machines or deep generative models, but at a substantially lower computational cost (by a factor between 10^2 and 10^3). Furthermore, the simple structure of our models has distinctive mathematical advantages, which translate into an improved applicability in sequence generation and evaluation. Within these models, we can easily estimate both the probability of a given sequence, and, using the model's entropy, the size of the functional sequence space related to a specific protein family. In the example of response regulators, we find a huge number of ca. 10^68 possible sequences, which nevertheless constitute only the astronomically small fraction 10^-80 of all amino-acid sequences of the same length. These findings illustrate the potential and the difficulty in exploring sequence space via generative sequence models.
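The two quantities mentioned in this abstract, the probability of a given sequence and an entropy-based size of sequence space, can be illustrated with a toy autoregressive model. This is a minimal sketch added to this record under stated assumptions, not the authors' model: the two-letter alphabet, the sequence length of 3, and the `conditional` function are all hypothetical, chosen so the whole space can be enumerated.

```python
import itertools
import math

ALPHABET = "AB"  # toy alphabet (assumption; real proteins use 20 amino acids)
LENGTH = 3       # toy sequence length (assumption)

def conditional(prefix):
    """Toy conditional P(a_i | a_1..a_{i-1}); it favours repeating the last symbol."""
    if not prefix:
        return {"A": 0.5, "B": 0.5}
    last = prefix[-1]
    return {s: (0.8 if s == last else 0.2) for s in ALPHABET}

def log_prob(seq):
    """Chain rule: log P(seq) is the sum over positions of log P(a_i | a_<i)."""
    return sum(math.log(conditional(seq[:i])[seq[i]]) for i in range(len(seq)))

# Enumerate every sequence; feasible only because the toy space has 2**3 members.
sequences = ["".join(s) for s in itertools.product(ALPHABET, repeat=LENGTH)]
total = sum(math.exp(log_prob(s)) for s in sequences)

# Shannon entropy in nats, and exp(entropy) as an effective number of sequences.
entropy = -sum(math.exp(log_prob(s)) * log_prob(s) for s in sequences)
effective_size = math.exp(entropy)

print(f"normalization: {total:.6f}")
print(f"effective number of sequences: {effective_size:.2f} of {len(sequences)}")
```

Because each factor in the chain rule is a normalized conditional, the sequence probability needs no partition function, which is what makes the probability and the entropy cheap to evaluate in models of this kind.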

Subject(s)

Models, Statistical; Proteins/chemistry; Amino Acid Sequence; Computational Biology; Databases, Protein; Epistasis, Genetic; Evolution, Molecular; Machine Learning; Mutation; Proteins/classification; Proteins/genetics; Sequence Alignment

12.

Sparse generative modeling via parameter reduction of Boltzmann machines: Application to protein-sequence families.

Barrat-Charlaix, Pierre; Muntoni, Anna Paola; Shimagaki, Kai; Weigt, Martin; Zamponi, Francesco.

Phys Rev E ; 104(2-1): 024407, 2021 Aug.

Article in English | MEDLINE | ID: mdl-34525554

ABSTRACT

Boltzmann machines (BMs) are widely used as generative models. For example, pairwise Potts models (PMs), which are instances of the BM class, provide accurate statistical models of families of evolutionarily related protein sequences. Their parameters are the local fields, which describe site-specific patterns of amino acid conservation, and the two-site couplings, which mirror the coevolution between pairs of sites. This coevolution reflects structural and functional constraints acting on protein sequences during evolution. The most conservative choice to describe the coevolution signal is to include all possible two-site couplings into the PM. This choice, typical of what is known as Direct Coupling Analysis, has been successful for predicting residue contacts in the three-dimensional structure, mutational effects, and generating new functional sequences. However, the resulting PM suffers from important overfitting effects: many couplings are small, noisy, and hardly interpretable; the PM is close to a critical point, meaning that it is highly sensitive to small parameter perturbations. In this work, we introduce a general parameter-reduction procedure for BMs, via a controlled iterative decimation of the less statistically significant couplings, identified by an information-based criterion that selects either weak or statistically unsupported couplings. For several protein families, our procedure allows one to remove more than 90% of the PM couplings, while preserving the predictive and generative properties of the original dense PM, and the resulting model is far away from criticality, hence more robust to noise.
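The structure described in this abstract, local fields plus two-site couplings, with only a subset of couplings retained after decimation, can be sketched with a toy pairwise Potts model. This is an illustrative example added to this record, not the authors' code: the number of states, the sequence length, and all parameter values are hypothetical, and the model is small enough to normalize by brute-force enumeration.

```python
import itertools
import math

Q = 2  # toy number of states per site (assumption; protein alignments use about 21)
L = 3  # toy sequence length (assumption)

# Local fields h[i][a]: site-specific preferences for state a at site i.
h = [[0.0, 0.5], [0.2, 0.0], [0.0, 0.0]]

# Sparse couplings: only retained pairs (i, j) with i < j are stored.
# Pairs absent from the dictionary contribute nothing, as after decimation.
J = {(0, 1): [[0.8, 0.0], [0.0, 0.8]]}

def log_weight(seq):
    """Unnormalized log-probability: sum of fields plus sum of retained couplings."""
    fields = sum(h[i][a] for i, a in enumerate(seq))
    couplings = sum(Jij[seq[i]][seq[j]] for (i, j), Jij in J.items())
    return fields + couplings

# Brute-force partition function over all Q**L sequences.
states = list(itertools.product(range(Q), repeat=L))
Z = sum(math.exp(log_weight(s)) for s in states)
prob = {s: math.exp(log_weight(s)) / Z for s in states}

print(f"P(0,0,0) = {prob[(0, 0, 0)]:.4f}")
print(f"P(0,1,0) = {prob[(0, 1, 0)]:.4f}")
```

Storing couplings in a dictionary keyed by site pairs makes the parameter count proportional to the number of retained couplings rather than to all L(L-1)/2 pairs.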

13.

On the effect of phylogenetic correlations in coevolution-based contact prediction in proteins.

Rodriguez Horta, Edwin; Weigt, Martin.

PLoS Comput Biol ; 17(5): e1008957, 2021 05.

Article in English | MEDLINE | ID: mdl-34029316

ABSTRACT

Coevolution-based contact prediction, either directly by coevolutionary couplings resulting from global statistical sequence models or using structural supervision and deep learning, has found widespread application in protein-structure prediction from sequence. However, one of the basic assumptions in global statistical modeling is that sequences form an at least approximately independent sample of an unknown probability distribution, which is to be learned from data. In the case of protein families, this assumption is obviously violated by phylogenetic relations between protein sequences. It has turned out to be notoriously difficult to take phylogenetic correlations into account in coevolutionary model learning. Here, we propose a complementary approach: we develop strategies to randomize or resample sequence data, such that conservation patterns and phylogenetic relations are preserved, while intrinsic (i.e. structure- or function-based) coevolutionary couplings are removed. A comparison between the results of Direct Coupling Analysis applied to real and to resampled data shows that the largest coevolutionary couplings, i.e. those used for contact prediction, are only weakly influenced by phylogeny. However, the phylogeny-induced spurious couplings in the resampled data are compatible in size with the first false-positive contact predictions from real data. Dissecting functional from phylogeny-induced couplings might therefore extend accurate contact predictions to the range of intermediate-size couplings.

Subject(s)

Evolution, Molecular; Phylogeny; Proteins/chemistry; Algorithms; Computational Biology/methods; Protein Conformation; Sequence Alignment

14.

FilterDCA: Interpretable supervised contact prediction using inter-domain coevolution.

Muscat, Maureen; Croce, Giancarlo; Sarti, Edoardo; Weigt, Martin.

PLoS Comput Biol ; 16(10): e1007621, 2020 10.

Article in English | MEDLINE | ID: mdl-33035205

ABSTRACT

Predicting three-dimensional protein structure and assembling protein complexes using sequence information are among the most prominent tasks in computational biology. Recently, substantial progress has been made in the case of single proteins using a combination of unsupervised coevolutionary sequence analysis with structurally supervised deep learning. While reaching impressive accuracies in predicting residue-residue contacts, deep learning has a number of disadvantages. The need for large structural training sets limits the applicability to multi-protein complexes; and their deep architecture makes the interpretability of the convolutional neural networks intrinsically hard. Here we introduce FilterDCA, a simpler supervised predictor for inter-domain and inter-protein contacts. It is based on the fact that contact maps of proteins show typical contact patterns, which result from secondary structure and are reflected by patterns in coevolutionary analysis. We explicitly integrate averaged contact patterns with coevolutionary scores derived by Direct Coupling Analysis, improving performance over standard coevolutionary analysis, while remaining fully transparent and interpretable. The FilterDCA code is available at http://gitlab.lcqb.upmc.fr/muscat/FilterDCA.

Subject(s)

Computational Biology/methods; Protein Conformation; Proteins/chemistry; Sequence Analysis, Protein/methods; Models, Molecular; Software; Supervised Machine Learning

15.

An evolution-based model for designing chorismate mutase enzymes.

Russ, William P; Figliuzzi, Matteo; Stocker, Christian; Barrat-Charlaix, Pierre; Socolich, Michael; Kast, Peter; Hilvert, Donald; Monasson, Remi; Cocco, Simona; Weigt, Martin; Ranganathan, Rama.

Science ; 369(6502): 440-445, 2020 07 24.

Article in English | MEDLINE | ID: mdl-32703877

ABSTRACT

The rational design of enzymes is an important goal for both fundamental and practical reasons. Here, we describe a process to learn the constraints for specifying proteins purely from evolutionary sequence data, design and build libraries of synthetic genes, and test them for activity in vivo using a quantitative complementation assay. For chorismate mutase, a key enzyme in the biosynthesis of aromatic amino acids, we demonstrate the design of natural-like catalytic function with substantial sequence diversity. Further optimization focuses the generative model toward function in a specific genomic context. The data show that sequence-based statistical models suffice to specify proteins and provide access to an enormous space of functional sequences. This result provides a foundation for a general process for evolution-based design of artificial proteins.

Subject(s)

Chorismate Mutase; Evolution, Molecular; Models, Genetic; Models, Statistical; Amino Acid Sequence; Chorismate Mutase/chemistry; Chorismate Mutase/genetics; Escherichia coli Proteins/chemistry; Escherichia coli Proteins/genetics

16.

Statistical physics of interacting proteins: Impact of dataset size and quality assessed in synthetic sequences.

Gandarilla-Pérez, Carlos A; Mergny, Pierre; Weigt, Martin; Bitbol, Anne-Florence.

Phys Rev E ; 101(3-1): 032413, 2020 Mar.

Article in English | MEDLINE | ID: mdl-32290011

ABSTRACT

Identifying protein-protein interactions is crucial for a systems-level understanding of the cell. Recently, algorithms based on inverse statistical physics, e.g., direct coupling analysis (DCA), have made it possible to use evolutionarily related sequences to address two conceptually related inference tasks: finding pairs of interacting proteins and identifying pairs of residues which form contacts between interacting proteins. Here we address two underlying questions: How are the performances of both inference tasks related? How does performance depend on dataset size and quality? To this end, we formalize both tasks using Ising models defined over stochastic block models, with individual blocks representing single proteins and interblock couplings protein-protein interactions; controlled synthetic sequence data are generated by Monte Carlo simulations. We show that DCA is able to address both inference tasks accurately when sufficiently large training sets of known interaction partners are available and that an iterative pairing algorithm makes predictions possible even without a training set. Noise in the training data deteriorates performance. In both tasks we find a quadratic scaling relating dataset quality and size that is consistent with noise adding in square-root fashion and signal adding linearly when increasing the dataset. This implies that it is generally good to incorporate more data even if their quality is imperfect, thereby shedding light on the empirically observed performance of DCA applied to natural protein sequences.

Subject(s)

Models, Biological; Protein Interaction Maps; Proteins/metabolism; Monte Carlo Method

17.

Aligning biological sequences by exploiting residue conservation and coevolution.

Muntoni, Anna Paola; Pagnani, Andrea; Weigt, Martin; Zamponi, Francesco.

Phys Rev E ; 102(6-1): 062409, 2020 Dec.

Article in English | MEDLINE | ID: mdl-33465950

ABSTRACT

Sequences of nucleotides (for DNA and RNA) or amino acids (for proteins) are central objects in biology. Among the most important computational problems is that of sequence alignment, i.e., arranging sequences from different organisms in such a way to identify similar regions, to detect evolutionary relationships between sequences, and to predict biomolecular structure and function. This is typically addressed through profile models, which capture position specificities like conservation in sequences but assume an independent evolution of different positions. Over recent years, it has been well established that coevolution of different amino-acid positions is essential for maintaining three-dimensional structure and function. Modeling approaches based on inverse statistical physics can catch the coevolution signal in sequence ensembles, and they are now widely used in predicting protein structure, protein-protein interactions, and mutational landscapes. Here, we present DCAlign, an efficient alignment algorithm based on an approximate message-passing strategy, which is able to overcome the limitations of profile models, to include coevolution among positions in a general way, and to be therefore universally applicable to protein- and RNA-sequence alignment without the need of using complementary structural information. The potential of DCAlign is carefully explored using well-controlled simulated data, as well as real protein and RNA sequences.

Subject(s)

Conserved Sequence; Evolution, Molecular; Models, Genetic

18.

Predicting Interacting Protein Pairs by Coevolutionary Paralog Matching.

Gueudré, Thomas; Baldassi, Carlo; Pagnani, Andrea; Weigt, Martin.

Methods Mol Biol ; 2074: 57-65, 2020.

Article in English | MEDLINE | ID: mdl-31583630

ABSTRACT

Even if we know that two families of homologous proteins interact, we do not necessarily know which specific proteins interact inside each species. The reason is that most families contain paralogs, i.e., more than one homologous sequence per species. We have developed a tool to predict interacting paralogs between the two protein families, which is based on the idea of inter-protein coevolution: our algorithm matches those members of the two protein families that belong to the same species and collectively maximize the detectable coevolutionary signal. It is applicable even in cases where simpler methods based, e.g., on genomic co-localization of genes coding for interacting proteins, or orthology-based methods, fail. In this method paper, we present an efficient implementation of this idea based on freely available software.

Subject(s)

Computational Biology/methods; Proteins/chemistry; Proteins/metabolism; Protein Binding; Software

19.

Structures of a dimodular nonribosomal peptide synthetase reveal conformational flexibility.

Reimer, Janice M; Eivaskhani, Maximilian; Harb, Ingrid; Guarné, Alba; Weigt, Martin; Schmeing, T Martin.

Science ; 366(6466), 2019 11 08.

Article in English | MEDLINE | ID: mdl-31699907

ABSTRACT

Nonribosomal peptide synthetases (NRPSs) are biosynthetic enzymes that synthesize natural product therapeutics using a modular synthetic logic, whereby each module adds one aminoacyl substrate to the nascent peptide. We have determined five x-ray crystal structures of large constructs of the NRPS linear gramicidin synthetase, including a structure of a full core dimodule in conformations organized for the condensation reaction and intermodular peptidyl substrate delivery. The structures reveal differences in the relative positions of adjacent modules, which are not strictly coupled to the catalytic cycle and are consistent with small-angle x-ray scattering data. The structures and covariation analysis of homologs allowed us to create mutants that improve the yield of a peptide from a module-swapped dimodular NRPS.

Subject(s)

Bacterial Proteins/chemistry; Brevibacillus/enzymology; Gramicidin/biosynthesis; Peptide Synthases/chemistry; Catalytic Domain; Crystallography, X-Ray

20.

A multi-scale coevolutionary approach to predict interactions between protein domains.

Croce, Giancarlo; Gueudré, Thomas; Ruiz Cuevas, Maria Virginia; Keidel, Victoria; Figliuzzi, Matteo; Szurmant, Hendrik; Weigt, Martin.

PLoS Comput Biol ; 15(10): e1006891, 2019 10.

Article in English | MEDLINE | ID: mdl-31634362

ABSTRACT

Interacting proteins and protein domains coevolve on multiple scales, from their correlated presence across species, to correlations in amino-acid usage. Genomic databases provide rapidly growing data for variability in genomic protein content and in protein sequences, calling for computational predictions of unknown interactions. We first introduce the concept of direct phyletic couplings, based on global statistical models of phylogenetic profiles. They strongly increase the accuracy of predicting pairs of related protein domains beyond simpler correlation-based approaches like phylogenetic profiling (80% vs. 30-50% positives out of the 1000 highest-scoring pairs). Combined with the direct coupling analysis of inter-protein residue-residue coevolution, we provide multi-scale evidence for direct but unknown interaction between protein families. An in-depth discussion shows these to be biologically sensible and directly experimentally testable. Negative phyletic couplings highlight alternative solutions for the same functionality, including documented cases of convergent evolution. Thereby our work proves the strong potential of global statistical modeling approaches to genome-wide coevolutionary analysis, far beyond the established use for individual protein complexes and domain-domain interactions.

Subject(s)

Computational Biology/methods; Protein Interaction Domains and Motifs/physiology; Protein Interaction Mapping/methods; Algorithms; Amino Acids/metabolism; Animals; Biophysical Phenomena; Evolution, Molecular; Humans; Models, Statistical; Phylogeny; Protein Binding/physiology; Protein Domains/physiology; Proteins/chemistry
