Results 1 - 20 of 42
1.
Proc Natl Acad Sci U S A ; 121(6): e2308895121, 2024 Feb 06.
Article in English | MEDLINE | ID: mdl-38285950

ABSTRACT

Computational models of evolution are valuable for understanding the dynamics of sequence variation, to infer phylogenetic relationships or potential evolutionary pathways and for biomedical and industrial applications. Despite these benefits, few have validated their propensities to generate outputs with in vivo functionality, which would enhance their value as accurate and interpretable evolutionary algorithms. We demonstrate the power of epistasis inferred from natural protein families to evolve sequence variants in an algorithm we developed called sequence evolution with epistatic contributions (SEEC). Utilizing the Hamiltonian of the joint probability of sequences in the family as a fitness metric, we sampled and experimentally tested for in vivo β-lactamase activity in Escherichia coli TEM-1 variants. These evolved proteins can have dozens of mutations dispersed across the structure while preserving sites essential for both catalysis and interactions. Remarkably, these variants retain family-like functionality while being more active than their wild-type predecessor. We found that depending on the inference method used to generate the epistatic constraints, different parameters simulate diverse selection strengths. Under weaker selection, local Hamiltonian fluctuations reliably predict relative changes to variant fitness, recapitulating neutral evolution. SEEC has the potential to explore the dynamics of neofunctionalization, characterize viral fitness landscapes, and facilitate vaccine development.
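The abstract does not spell out the SEEC update rule, so the following is only a minimal sketch of the general idea: a Potts-type Hamiltonian with fields `h` and couplings `J` scores a sequence, and one site is resampled from its conditional distribution given the rest. The function names, the toy random parameters, and the Gibbs-style move are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def potts_energy(seq, h, J):
    """H(s) = -sum_i h[i, s_i] - sum_{i<j} J[i, j, s_i, s_j] (lower is fitter)."""
    L = len(seq)
    energy = -sum(h[i, seq[i]] for i in range(L))
    for i in range(L):
        for j in range(i + 1, L):
            energy -= J[i, j, seq[i], seq[j]]
    return float(energy)

def site_conditional(seq, i, h, J):
    """P(residue a at site i | all other sites).

    Assumes J is symmetric: J[i, j, a, b] == J[j, i, b, a].
    """
    logits = h[i].astype(float)  # astype returns a copy, so h is not modified
    for j in range(len(seq)):
        if j != i:
            logits = logits + J[i, j, :, seq[j]]
    weights = np.exp(logits - logits.max())  # numerically stable softmax
    return weights / weights.sum()

def evolve_one_step(seq, h, J, rng):
    """Hypothetical single step: pick a random site and resample its residue."""
    i = int(rng.integers(len(seq)))
    new_seq = seq.copy()
    new_seq[i] = rng.choice(h.shape[1], p=site_conditional(seq, i, h, J))
    return new_seq

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    L, q = 6, 20  # toy length and alphabet size (assumed)
    h = rng.normal(size=(L, q))
    J = rng.normal(scale=0.1, size=(L, L, q, q))
    J = (J + J.transpose(1, 0, 3, 2)) / 2  # enforce the symmetry assumed above
    seq = rng.integers(0, q, size=L)
    for _ in range(5):
        seq = evolve_one_step(seq, h, J, rng)
        print(seq, round(potts_energy(seq, h, J), 3))
```

In a real application `h` and `J` would come from a model inferred on a multiple sequence alignment of the protein family rather than from random numbers.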


Subjects
Genetic Epistasis , Proteins , Phylogeny , Proteins/genetics , Mutation , Phenotype , Molecular Evolution , Genetic Fitness , Genetic Models
2.
Proc Natl Acad Sci U S A ; 120(6): e2211098120, 2023 02 07.
Article in English | MEDLINE | ID: mdl-36730204

ABSTRACT

The segmented RNA genome of influenza A viruses (IAVs) enables viral evolution through genetic reassortment after multiple IAVs coinfect the same cell, leading to viruses harboring combinations of eight genomic segments from distinct parental viruses. Existing data indicate that reassortant genotypes are not equiprobable; however, the low throughput of available virology techniques does not allow quantitative analysis. Here, we have developed a high-throughput single-cell droplet microfluidic system allowing encapsulation of IAV-infected cells, each cell being infected by a single progeny virion resulting from a coinfection process. Customized barcoded primers for targeted viral RNA sequencing enabled the analysis of 18,422 viral genotypes resulting from coinfection with two circulating human H1N1pdm09 and H3N2 IAVs. Results were highly reproducible, confirmed that genetic reassortment is far from random, and allowed accurate quantification of reassortants including rare events. In total, 159 out of the 254 possible reassortant genotypes were observed but with widely varied prevalence (from 0.038 to 8.45%). In cells where eight segments were detected, all 112 possible pairwise combinations of segments were observed. The inclusion of data from single cells where less than eight segments were detected allowed analysis of pairwise cosegregation between segments with very high confidence. Direct coupling analysis accurately predicted the fraction of pairwise segments and full genotypes. Overall, our results indicate that a large proportion of reassortant genotypes can emerge upon coinfection and be detected over a wide range of frequencies, highlighting the power of our tool for systematic and exhaustive monitoring of the reassortment potential of IAVs.
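The counts quoted in this abstract follow from simple combinatorics, which the short enumeration below reproduces. It assumes two parental viruses and the eight standard IAV segments; the variable names are illustrative only.

```python
from itertools import combinations, product

SEGMENTS = ["PB2", "PB1", "PA", "HA", "NP", "NA", "M", "NS"]
PARENTS = ("H1N1pdm09", "H3N2")

# Every genotype assigns one parental origin to each of the eight segments.
genotypes = list(product(PARENTS, repeat=len(SEGMENTS)))

# Reassortants mix both parents, so the two pure parental genotypes are excluded.
reassortants = [g for g in genotypes if len(set(g)) > 1]

# A pairwise combination is a pair of distinct segments plus an origin for each.
pairwise = [
    (s1, p1, s2, p2)
    for s1, s2 in combinations(SEGMENTS, 2)
    for p1, p2 in product(PARENTS, repeat=2)
]

print(len(genotypes))                 # 256 = 2**8
print(len(reassortants))              # 254 = 256 - 2
print(len(pairwise))                  # 112 = C(8, 2) * 4
print(f"{159 / len(reassortants):.1%}")  # share of possible reassortants observed
```

With 159 of the 254 possible reassortants reported, roughly 63% of the possible genotypes were observed at least once.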


Subjects
Coinfection , Influenza A Virus , Human Influenza , Humans , Influenza A Virus/genetics , Influenza A Virus H3N2 Subtype/genetics , Orthomyxoviridae Infections , Reassortant Viruses/genetics , Viral RNA/genetics , RNA Sequence Analysis
3.
Proc Natl Acad Sci U S A ; 119(4)2022 01 25.
Article in English | MEDLINE | ID: mdl-35022216

ABSTRACT

The emergence of new variants of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a major concern given their potential impact on the transmissibility and pathogenicity of the virus as well as the efficacy of therapeutic interventions. Here, we predict the mutability of all positions in SARS-CoV-2 protein domains to forecast the appearance of unseen variants. Using sequence data from other coronaviruses, preexisting to SARS-CoV-2, we build statistical models that not only capture amino acid conservation but also more complex patterns resulting from epistasis. We show that these models are notably superior to conservation profiles in estimating the already observable SARS-CoV-2 variability. In the receptor binding domain of the spike protein, we observe that the predicted mutability correlates well with experimental measures of protein stability and that both are reliable mutability predictors (receiver operating characteristic areas under the curve ∼0.8). Most interestingly, we observe an increasing agreement between our model and the observed variability as more data become available over time, proving the anticipatory capacity of our model. When combined with data concerning the immune response, our approach identifies positions where current variants of concern are highly overrepresented. These results could assist studies on viral evolution and future viral outbreaks and, in particular, guide the exploration and anticipation of potentially harmful future SARS-CoV-2 variants.


Subjects
COVID-19/virology , Genetic Epistasis , Epitopes , Mutation , SARS-CoV-2/genetics , Coronavirus Spike Glycoprotein/chemistry , Coronavirus Spike Glycoprotein/genetics , Viral Proteins/chemistry , Algorithms , Area Under Curve , Computational Biology/methods , DNA Mutational Analysis , Protein Databases , Deep Learning , Epitopes/chemistry , Viral Genome , Humans , Statistical Models , Mutagenesis , Probability , Protein Domains , ROC Curve
4.
Brief Bioinform ; 23(2)2022 03 10.
Article in English | MEDLINE | ID: mdl-35037015

ABSTRACT

Direct coupling analysis (DCA) has been widely used to infer evolutionarily coupled residue pairs from the multiple sequence alignment (MSA) of homologous sequences. However, effectively selecting residue pairs with significant evolutionary couplings according to the result of DCA is a non-trivial task. In this study, we developed a general statistical framework for significant evolutionary coupling detection, referred to as irreproducible discovery rate (IDR)-DCA, which is based on reproducibility analysis of the coupling scores obtained from DCA on manually created MSA replicates. IDR-DCA was applied to select residue pairs for contact prediction for monomeric proteins, protein-protein interactions and monomeric RNAs, in which three different versions of DCA were applied. We demonstrated that with the application of IDR-DCA, the residue pairs selected using a universal threshold always yielded stable performance for contact prediction. Compared with carefully tuned coupling score cutoffs, IDR-DCA always showed better performance. The robustness of IDR-DCA was also supported through the MSA downsampling analysis. We further demonstrated the effectiveness of applying constraints obtained from residue pairs selected by IDR-DCA to assist RNA secondary structure prediction.


Subjects
Algorithms , Proteins , Protein Secondary Structure , Proteins/chemistry , RNA , Reproducibility of Results , Sequence Alignment
5.
Rep Prog Phys ; 86(5)2023 04 04.
Article in English | MEDLINE | ID: mdl-36944245

ABSTRACT

This review is about statistical genetics, an interdisciplinary topic between statistical physics and population biology. The focus is on the phase of quasi-linkage equilibrium (QLE). Our goals here are to clarify under which conditions the QLE phase can be expected to hold in population biology and how the stability of the QLE phase is lost. The QLE state, which has many similarities to a thermal equilibrium state in statistical mechanics, was discovered by M Kimura for a two-locus two-allele model, and was extended and generalized to the global genome scale by Neher & Shraiman (2011). What we will refer to as the Kimura-Neher-Shraiman theory describes a population evolving due to mutations, recombination, natural selection and possibly genetic drift. A QLE phase exists at sufficiently high recombination rate (r) and/or mutation rate (µ) with respect to selection strength. We show how in QLE it is possible to infer the epistatic parameters of the fitness function from the knowledge of the (dynamical) distribution of genotypes in a population. We further consider the breakdown of the QLE regime for high enough selection strength. We review recent results for the selection-mutation and selection-recombination dynamics. Finally, we identify and characterize a new phase which we call the non-random coexistence where variability persists in the population without either fixating or disappearing.


Subjects
Genetic Models , Genetic Selection , Linkage Disequilibrium , Mutation , Genotype , Population Genetics
6.
Brief Bioinform ; 22(3)2021 05 20.
Article in English | MEDLINE | ID: mdl-32672331

ABSTRACT

Membrane proteins are unique in that they interact with lipid bilayers, making them indispensable for transporting molecules and relaying signals between and across cells. Due to the significance of these proteins' functions, mutations often have profound effects on the fitness of the host. This is apparent both from experimental studies, which implicated numerous missense variants in diseases, and from evolutionary signals that allow elucidating the physicochemical constraints that intermembrane and aqueous environments bring. In this review, we report on the current state of knowledge acquired on missense variants (referred to as single amino acid variants) affecting membrane proteins as well as the insights that can be extrapolated from data already available. This includes an overview of the annotations for membrane protein variants that have been collated within databases dedicated to the topic, bioinformatics approaches that leverage evolutionary information in order to shed light on previously uncharacterized membrane protein structures or interaction interfaces, tools for predicting the effects of mutations tailored specifically towards the characteristics of membrane proteins as well as two clinically relevant case studies explaining the implications of mutated membrane proteins in cancer and cardiomyopathy.


Subjects
Cardiomyopathies/genetics , Molecular Evolution , Membrane Proteins , Missense Mutation , Neoplasm Proteins , Neoplasms/genetics , Amino Acid Substitution , Computational Biology , Humans , Membrane Proteins/chemistry , Membrane Proteins/genetics , Neoplasm Proteins/chemistry , Neoplasm Proteins/genetics , Protein Conformation
7.
Proc Natl Acad Sci U S A ; 117(49): 31519-31526, 2020 12 08.
Article in English | MEDLINE | ID: mdl-33203681

ABSTRACT

Genome-wide epistasis analysis is a powerful tool to infer gene interactions, which can guide drug and vaccine development and lead to deeper understanding of microbial pathogenesis. We have considered all complete severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) genomes deposited in the Global Initiative on Sharing All Influenza Data (GISAID) repository until four different cutoff dates, and used direct coupling analysis together with an assumption of quasi-linkage equilibrium to infer epistatic contributions to fitness from polymorphic loci. We find eight interactions, of which three are between pairs where one locus lies in gene ORF3a, both loci holding nonsynonymous mutations. We also find interactions between two loci in gene nsp13, both holding nonsynonymous mutations, and four interactions involving one locus holding a synonymous mutation. Altogether, we infer interactions between loci in viral genes ORF3a and nsp2, nsp12, and nsp6, between ORF8 and nsp4, and between loci in genes nsp2, nsp13, and nsp14. The paper opens the prospect to use prominent epistatically linked pairs as a starting point to search for combinatorial weaknesses of recombinant viral pathogens.


Subjects
Genetic Epistasis/genetics , Viral Genes/genetics , SARS-CoV-2/genetics , COVID-19/pathology , Coronavirus Nucleocapsid Proteins/genetics , Coronavirus RNA-Dependent RNA Polymerase/genetics , Exoribonucleases/genetics , Viral Genome/genetics , Humans , Methyltransferases/genetics , RNA Helicases/genetics , Genetic Selection/genetics , Viral Nonstructural Proteins/genetics , Viral Proteins/genetics , Viroporin Proteins/genetics
8.
Molecules ; 28(4)2023 Feb 15.
Article in English | MEDLINE | ID: mdl-36838825

ABSTRACT

A growing body of evidence suggests that only a few amino acids ("hot-spots") at the interface contribute most of the binding energy in transient protein-protein interactions. However, experimental protocols to identify these hot-spots are highly labor-intensive and expensive. Computational methods, including evolutionary couplings, have been proposed to predict the hot-spots, but they generally fail to provide details of the interacting amino acids. Here we showed that unbiased evolutionary methods followed by biased molecular dynamics simulations could achieve this goal and reveal critical elements of protein complexes. We applied the methodology to selected G-protein coupled receptors (GPCRs), known for their therapeutic properties. We used the structure-prior-assisted direct coupling analysis (SP-DCA) to predict the binding interfaces of A2aR/D2R, CB1R/D2R, A2aR/CB1R, 5-HT2AR/D2R, and 5-HT2AR/mGluR2 receptor heterodimers, which all agreed with published data. In order to highlight details of the interactions, we performed molecular dynamics (MD) simulations using the newly developed AWSEM energy model. We found that these receptors interact primarily through critical residues at the C and N terminal domains and the third intracellular loop (ICL3). The MD simulations showed that these residues are energetically necessary for dimerization and revealed their native conformational state. We subsequently applied the methodology to the 5-HT2AR/5-HT4R heterodimer, given its implication in drug addiction and neurodegenerative pathologies such as Alzheimer's disease (AD). Further, the SP-DCA analysis showed that 5-HT2AR and 5-HT4R heterodimerize through the C-terminal domain of 5-HT2AR and ICL3 of 5-HT4R. Elucidating the details of GPCR interactions would accelerate the discovery of druggable sites and improve our knowledge of the etiology of common diseases, including AD.


Subjects
Molecular Dynamics Simulation , G-Protein-Coupled Receptors , G-Protein-Coupled Receptors/metabolism , Cell Membrane/metabolism , Dimerization , Amino Acids/metabolism
9.
Mol Biol Evol ; 38(1): 318-328, 2021 01 04.
Article in English | MEDLINE | ID: mdl-32770229

ABSTRACT

The recent technological advances underlying the screening of large combinatorial libraries in high-throughput mutational scans deepen our understanding of adaptive protein evolution and boost its applications in protein design. Nevertheless, the large number of possible genotypes requires suitable computational methods for data analysis, the prediction of mutational effects, and the generation of optimized sequences. We describe a computational method that, trained on sequencing samples from multiple rounds of a screening experiment, provides a model of the genotype-fitness relationship. We tested the method on five large-scale mutational scans, yielding accurate predictions of the mutational effects on fitness. The inferred fitness landscape is robust to experimental and sampling noise and exhibits high generalization power in terms of broader sequence space exploration and higher fitness variant predictions. We investigate the role of epistasis and show that the inferred model provides structural information about the 3D contacts in the molecular fold.


Subjects
Molecular Evolution , Genetic Fitness , Genetic Epistasis , Mutation , Unsupervised Machine Learning
10.
RNA ; 26(7): 794-802, 2020 07.
Article in English | MEDLINE | ID: mdl-32276988

ABSTRACT

RNA molecules play many pivotal roles in a cell that are still not fully understood. Any detailed understanding of RNA function requires knowledge of its three-dimensional structure, yet experimental RNA structure resolution remains demanding. Recent advances in sequencing provide unprecedented amounts of sequence data that can be statistically analyzed by methods such as direct coupling analysis (DCA) to determine spatial proximity or contacts of specific nucleic acid pairs, which improve the quality of structure prediction. To quantify this structure prediction improvement, we here present a well curated data set of about 70 RNA structures of high resolution and compare different nucleotide-nucleotide contact prediction methods available in the literature. We observe only minor differences between the performances of the different methods. Moreover, we discuss how robust these predictions are for different contact definitions and how strongly they depend on procedures used to curate and align the families of homologous RNA sequences.


Subjects
RNA/genetics , Data Analysis , Datasets as Topic , Nucleic Acid Conformation , Sequence Alignment/methods
11.
RNA ; 26(11): 1530-1540, 2020 11.
Article in English | MEDLINE | ID: mdl-32747608

ABSTRACT

Chaperone proteins, the most disordered among all protein groups, help RNAs fold into their functional structure by destabilizing misfolded configurations or stabilizing the functional ones. But disentangling the mechanism underlying RNA chaperoning is challenging, mostly because of inherent disorder of the chaperones and the transient nature of their interactions with RNA. In particular, it is unclear how specific the interactions are and what role is played by amino acid charge and polarity patterns. Here, we address these questions in the RNA chaperone StpA. We adapted direct coupling analysis (DCA) into the αβDCA method that can treat in tandem sequences written in two alphabets, nucleotides and amino acids. With αβDCA, we could analyze StpA-RNA interactions and show consistency with a previously proposed two-pronged mechanism: StpA disrupts specific positions in the group I intron while globally and loosely binding to the entire structure. Moreover, the interactions are strongly associated with the charge pattern: Negatively charged regions in the destabilizing StpA amino-terminal affect a few specific positions in the RNA, located in stems and in the pseudoknot. In contrast, positive regions in the carboxy-terminal contain strongly coupled amino acids that promote nonspecific or weakly specific binding to the RNA. The present study opens new avenues to examine the functions of disordered proteins and to design disruptive proteins based on their charge patterns.


Subjects
DNA-Binding Proteins/chemistry , DNA-Binding Proteins/metabolism , Escherichia coli Proteins/chemistry , Escherichia coli Proteins/metabolism , Escherichia coli/metabolism , Molecular Chaperones/chemistry , Molecular Chaperones/metabolism , RNA/metabolism , Algorithms , Amino Acid Sequence , Base Sequence , DNA-Binding Proteins/genetics , Escherichia coli/chemistry , Escherichia coli Proteins/genetics , Introns , Molecular Models , Molecular Chaperones/genetics , Nucleic Acid Conformation , Protein Binding , RNA/chemistry , RNA Folding
12.
Proc Natl Acad Sci U S A ; 116(34): 16856-16865, 2019 08 20.
Article in English | MEDLINE | ID: mdl-31399549

ABSTRACT

Direct coupling analysis (DCA) for protein folding has made very good progress, but it is not effective for proteins that lack many sequence homologs, even coupled with time-consuming conformation sampling with fragments. We show that we can accurately predict interresidue distance distribution of a protein by deep learning, even for proteins with ∼60 sequence homologs. Using only the geometric constraints given by the resulting distance matrix we may construct 3D models without involving extensive conformation sampling. Our method successfully folded 21 of the 37 CASP12 hard targets with a median family size of 58 effective sequence homologs within 4 h on a Linux computer with 20 central processing units. In contrast, DCA-predicted contacts cannot be used to fold any of these hard targets in the absence of extensive conformation sampling, and the best CASP12 group folded only 11 of them by integrating DCA-predicted contacts into fragment-based conformation sampling. Rigorous experimental validation in CASP13 shows that our distance-based folding server successfully folded 17 of 32 hard targets (with a median family size of 36 sequence homologs) and obtained 70% precision on the top L/5 long-range predicted contacts. The latest experimental validation in CAMEO shows that our server predicted correct folds for 2 membrane proteins while all of the other servers failed. These results demonstrate that it is now feasible to predict the correct fold for many more proteins lacking similar structures in the Protein Data Bank, even on a personal computer.
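The "top L/5 long-range" precision quoted above can be computed as in the sketch below. It relies on common contact-prediction conventions that the abstract itself does not state: a pair counts as a contact when its distance is below 8 Å, and "long-range" means a sequence separation of at least 24 residues. The function and parameter names are illustrative.

```python
import numpy as np

def top_l5_long_range_precision(pred_score, true_dist,
                                min_sep=24, contact_cutoff=8.0):
    """Precision of the L/5 highest-scoring long-range residue pairs.

    pred_score: (L, L) array, higher means more likely in contact.
    true_dist:  (L, L) array of native inter-residue distances in angstroms.
    min_sep and contact_cutoff are assumed conventions, not from the abstract.
    """
    L = pred_score.shape[0]
    # Upper-triangle pairs (i, j) with j - i >= min_sep.
    i, j = np.triu_indices(L, k=min_sep)
    if i.size == 0:
        raise ValueError("sequence too short for the chosen min_sep")
    k = max(1, L // 5)
    top = np.argsort(pred_score[i, j])[::-1][:k]  # best-scoring pairs first
    is_contact = true_dist[i[top], j[top]] < contact_cutoff
    return float(is_contact.mean())

if __name__ == "__main__":
    L = 50
    dist = np.full((L, L), 20.0)
    for a in range(10):              # plant 10 long-range native contacts
        dist[a, a + 30] = dist[a + 30, a] = 5.0
    perfect = 1.0 / (1.0 + dist)     # closer pairs receive higher scores
    print(top_l5_long_range_precision(perfect, dist))
```

A precision of 0.70 on this metric, as reported for CASP13, means that 70% of the L/5 most confident long-range pairs are true contacts under whatever contact definition the assessment used.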


Subjects
Deep Learning , Protein Folding , Algorithms , Membrane Proteins/chemistry , Membrane Proteins/metabolism , Molecular Models , Sequence Alignment , Time Factors
13.
BMC Bioinformatics ; 22(1): 317, 2021 Jun 10.
Article in English | MEDLINE | ID: mdl-34112081

ABSTRACT

BACKGROUND: To assign structural and functional annotations to the ever increasing amount of sequenced proteins, the main approach relies on sequence-based homology search methods, e.g. BLAST or the current state-of-the-art methods based on profile Hidden Markov Models, which rely on significant alignments of query sequences to annotated proteins or protein families. While powerful, these approaches do not take coevolution between residues into account. Taking advantage of recent advances in the field of contact prediction, we propose here to represent proteins by Potts models, which model direct couplings between positions in addition to positional composition, and to compare proteins by aligning these models. Due to non-local dependencies, the problem of aligning Potts models is hard and remains the main computational bottleneck for their use. METHODS: We introduce here an Integer Linear Programming formulation of the problem and PPalign, a program based on this formulation, to compute the optimal pairwise alignment of Potts models representing proteins in tractable time. The approach is assessed with respect to a non-redundant set of reference pairwise sequence alignments from the SISYPHUS benchmark which have lowest sequence identity (between [Formula: see text] and [Formula: see text]) and allow reliable Potts models to be built for each sequence to be aligned. This experimentation confirms that Potts models can be aligned in reasonable time ([Formula: see text] on average on these alignments). The contribution of couplings is evaluated in comparison with HHalign and independent-site PPalign. Although Potts models were not fully optimized for alignment purposes and simple gap scores were used, PPalign yields a better mean [Formula: see text] score and finds significantly better alignments than HHalign and PPalign without couplings in some cases.
CONCLUSIONS: These results show that pairwise couplings from protein Potts models can be used to improve the alignment of remotely related protein sequences in tractable time. Our experimentation nevertheless suggests that new research on the inference of Potts models is now needed to make them more comparable and suitable for homology search. We think that PPalign's guaranteed optimality will be a powerful asset to perform unbiased investigations in this direction.


Subjects
Algorithms , Proteins , Amino Acid Sequence , Humans , Proteins/genetics , Sequence Alignment , Sequence Homology
14.
Mol Biol Evol ; 37(4): 1179-1192, 2020 04 01.
Article in English | MEDLINE | ID: mdl-31670785

ABSTRACT

Protein structure is tightly intertwined with function according to the laws of evolution. Understanding how structure determines function has been the aim of structural biology for decades. Here, we have wondered instead whether it is possible to exploit the function for which a protein was evolutionarily selected to gain information on protein structure and on the landscape explored during the early stages of molecular and natural evolution. To answer this question, we developed a new methodology, which we named CAMELS (Coupling Analysis by Molecular Evolution Library Sequencing), that is able to obtain the in vitro evolution of a protein from an artificial selection based on function. We were able to observe with CAMELS many features of the TEM-1 beta-lactamase local fold exclusively by generating and sequencing large libraries of mutational variants. We demonstrated that we can, whenever a functional phenotypic selection of a protein is available, sketch the structural and evolutionary landscape of a protein without utilizing purified proteins, collecting physical measurements, or relying on the pool of natural protein variants.


Subjects
Directed Molecular Evolution/methods , Structure-Activity Relationship , beta-Lactamases/genetics , Protein Folding , DNA Sequence Analysis
15.
Proc Natl Acad Sci U S A ; 115(47): 11911-11916, 2018 11 20.
Article in English | MEDLINE | ID: mdl-30385633

ABSTRACT

Protein assemblies consisting of structural maintenance of chromosomes (SMC) and kleisin subunits are essential for the process of chromosome segregation across all domains of life. Prokaryotic condensin belonging to this class of protein complexes is composed of a homodimer of SMC that associates with a kleisin protein subunit called ScpA. While limited structural data exist for the proteins that comprise the (SMC)-kleisin complex, the complete structure of the entire complex remains unknown. Using an integrative approach combining both crystallographic data and coevolutionary information, we predict an atomic-scale structure of the whole condensin complex, which our results indicate being composed of a single ring. Coupling coevolutionary information with molecular-dynamics simulations, we study the interaction surfaces between the subunits and examine the plausibility of alternative stoichiometries of the complex. Our analysis also reveals several additional configurational states of the condensin hinge domain and the SMC-kleisin interaction domains, which are likely involved with the functional opening and closing of the condensin ring. This study provides the foundation for future investigations of the structure-function relationship of the various SMC-kleisin protein complexes at atomic resolution.


Subjects
Adenosine Triphosphatases/physiology , Adenosine Triphosphatases/ultrastructure , DNA-Binding Proteins/physiology , DNA-Binding Proteins/ultrastructure , Multiprotein Complexes/physiology , Multiprotein Complexes/ultrastructure , Adenosine Triphosphatases/metabolism , Amino Acid Sequence , Bacterial Proteins/metabolism , Bacterial Proteins/physiology , Cell Cycle Proteins/metabolism , Cell Cycle Proteins/physiology , Non-Histone Chromosomal Proteins/metabolism , Chromosome Segregation/physiology , Chromosomes/metabolism , DNA-Binding Proteins/metabolism , Protein Databases , Multiprotein Complexes/metabolism , Nuclear Proteins/metabolism , Protein Domains , Structure-Activity Relationship
16.
Int J Mol Sci ; 22(20)2021 Oct 09.
Article in English | MEDLINE | ID: mdl-34681569

ABSTRACT

We present Annealed Mutational approximated Landscape (AMaLa), a new method to infer fitness landscapes from Directed Evolution experiments sequencing data. Such experiments typically start from a single wild-type sequence, which undergoes Darwinian in vitro evolution via multiple rounds of mutation and selection for a target phenotype. In recent years, Directed Evolution has emerged as a powerful instrument to probe fitness landscapes under controlled experimental conditions and as a relevant testing ground to develop accurate statistical models and inference algorithms (thanks to high-throughput screening and sequencing). Fitness landscape modeling either uses the enrichment of variant abundances as input, thus requiring the observation of the same variants at different rounds, or assumes the last sequenced round to be sampled from an equilibrium distribution. AMaLa aims at effectively leveraging the information encoded in the whole time evolution. To do so, while assuming statistical sampling independence between sequenced rounds, the possible trajectories in sequence space are gauged with a time-dependent statistical weight consisting of two contributions: (i) an energy term accounting for the selection process and (ii) a generalized Jukes-Cantor model for the purely mutational step. This simple scheme enables accurately describing the Directed Evolution dynamics and inferring a fitness landscape that correctly reproduces the measures of the phenotype under selection (e.g., antibiotic drug resistance), notably outperforming widely used inference strategies. In addition, we assess the reliability of AMaLa by showing how the inferred statistical model could be used to predict relevant structural properties of the wild-type sequence.
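The abstract names a generalized Jukes-Cantor model for the mutational step without giving its parameterization. The sketch below shows one textbook form, assuming `q` equiprobable states, independent sites, a per-site mutation rate `mu`, and elapsed time `t`; these names and the exact parameterization are assumptions, and AMaLa's own formulation may differ.

```python
import math

def jc_site_probs(mu, t, q=20):
    """Per-site probabilities under a q-state Jukes-Cantor model.

    Mutation events hit a site at rate mu; each event draws a state
    uniformly from the q states (possibly the same one), which gives
        p_same = exp(-mu*t) + (1 - exp(-mu*t)) / q
        p_diff = (1 - exp(-mu*t)) / q      for each of the q-1 other states
    """
    decay = math.exp(-mu * t)
    p_diff = (1.0 - decay) / q
    return decay + p_diff, p_diff

def jc_log_prob(seq, wild_type, mu, t, q=20):
    """Log-probability of reaching seq from wild_type by mutation alone."""
    if len(seq) != len(wild_type):
        raise ValueError("sequences must have the same length")
    p_same, p_diff = jc_site_probs(mu, t, q)
    n_diff = sum(a != b for a, b in zip(seq, wild_type))  # Hamming distance
    n_same = len(seq) - n_diff
    log_p = n_same * math.log(p_same) if n_same else 0.0
    if n_diff:
        if p_diff == 0.0:
            return float("-inf")  # no time elapsed, so no mutation is possible
        log_p += n_diff * math.log(p_diff)
    return log_p

if __name__ == "__main__":
    wt = "MSIQHFRVAL"  # arbitrary toy sequence, for illustration only
    for variant in ("MSIQHFRVAL", "MSIQHFRVAV", "MAIQHFKVAV"):
        print(variant, round(jc_log_prob(variant, wt, mu=0.05, t=1.0), 3))
```

Under this form the mutational weight depends only on the Hamming distance to the wild type, so a selection (energy) term is what distinguishes variants at equal distance.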


Subjects
Computational Biology/methods , Directed Molecular Evolution/methods , Mutation , Algorithms , Molecular Evolution , Genetic Fitness , High-Throughput Nucleotide Sequencing , Genetic Models , DNA Sequence Analysis
17.
Methods ; 162-163: 68-73, 2019 06 01.
Article in English | MEDLINE | ID: mdl-31028927

ABSTRACT

Structured RNA plays many functionally relevant roles in molecular life. Structural information, while required to understand the functional cycles in detail, is challenging to gather. Computational methods promise to complement experimental efforts by predicting three-dimensional RNA models. Here, we provide a concise view of the state of the art methodologies with a focus on the strengths and the weaknesses of the different approaches. Furthermore, we analyzed the recent developments regarding the use of coevolutionary information and how it can boost the prediction performances. We finally discuss some open perspectives and challenges for the near future in the RNA structural stability field.


Subjects
Computational Biology/methods , Molecular Models , Nucleic Acid Conformation , RNA/chemistry , RNA Sequence Analysis/methods , RNA/genetics , RNA Stability/genetics , Software
18.
Proc Natl Acad Sci U S A ; 114(13): E2662-E2671, 2017 03 28.
Article in English | MEDLINE | ID: mdl-28289198

ABSTRACT

Proteins have evolved to perform diverse cellular functions, from serving as reaction catalysts to coordinating cellular propagation and development. Frequently, proteins do not exert their full potential as monomers but rather undergo concerted interactions as either homo-oligomers or with other proteins as hetero-oligomers. The experimental study of such protein complexes and interactions has been arduous. Theoretical structure prediction methods are an attractive alternative. Here, we investigate homo-oligomeric interfaces by tracing residue coevolution via the global statistical direct coupling analysis (DCA). DCA can accurately infer spatial adjacencies between residues. These adjacencies can be included as constraints in structure prediction techniques to predict high-resolution models. By taking advantage of the ongoing exponential growth of sequence databases, we go significantly beyond anecdotal cases of a few protein families and apply DCA to a systematic large-scale study of nearly 2,000 Pfam protein families with sufficient sequence information and structurally resolved homo-oligomeric interfaces. We find that large interfaces are commonly identified by DCA. We further demonstrate that DCA can differentiate between subfamilies with different binding modes within one large Pfam family. Sequence-derived contact information for the subfamilies proves sufficient to assemble accurate structural models of the diverse protein-oligomers. Thus, we provide an approach to investigate oligomerization for arbitrary protein families leading to structural models complementary to often-difficult experimental methods. Combined with ever more abundant sequential data, we anticipate that this study will be instrumental to allow the structural description of many heteroprotein complexes in the future.


Subjects
Evolution, Molecular , Proteins/chemistry , Databases, Protein , Models, Molecular , Molecular Biology/methods , Protein Conformation , Protein Interaction Domains and Motifs , Proteins/metabolism
19.
Entropy (Basel) ; 21(11): 1127, 2020 Jan 23.
Article in English | MEDLINE | ID: mdl-32002010

ABSTRACT

Extracting structural information from sequence co-variation has become a common computational biology practice in recent years, mainly due to the availability of large sequence alignments of protein families. However, identifying features that are specific to sub-classes and not shared by all members of the family using sequence-based approaches has remained an elusive problem. We here present a coevolutionary-based method to differentially analyze subfamily specific structural features by a continuous sequence reweighting (SR) approach. We introduce the underlying principles and test its predictive capabilities on the Response Regulator family, whose subfamilies have been previously shown to display distinct, specific homo-dimerization patterns. Our results show that this reweighting scheme is effective in assigning structural features known a priori to subfamilies, even when sequence data is relatively scarce. Furthermore, sequence reweighting allows assessing if individual structural contacts pertain to specific subfamilies and it thus paves the way for the identification of specificity-determining contacts from sequence variation data.

20.
BMC Bioinformatics ; 20(Suppl 2): 100, 2019 Mar 14.
Article in English | MEDLINE | ID: mdl-30871477

ABSTRACT

BACKGROUND: The ability to predict which pairs of amino acid residues in a protein are in contact with each other offers many advantages for various areas of research that focus on proteins. For example, contact prediction can be used to reduce the computational complexity of predicting the structure of proteins and even to help identify functionally important regions of proteins. These predictions are becoming especially important given the relatively low number of experimentally determined protein structures compared to the amount of available protein sequence data. RESULTS: Here we have developed and benchmarked a set of machine learning methods for performing residue-residue contact prediction, including random forests, direct-coupling analysis, support vector machines, and deep networks (stacked denoising autoencoders). These methods are able to predict contacting residue pairs given only the amino acid sequence of a protein. According to our own evaluations performed at a resolution of +/- two residues, the predictors we trained with the random forest algorithm were our top performing methods with average top 10 prediction accuracy scores of 85.13% (short range), 74.49% (medium range), and 54.49% (long range). Our ensemble models (stacked denoising autoencoders combined with support vector machines) were our best performing deep network predictors and achieved top 10 prediction accuracy scores of 75.51% (short range), 60.26% (medium range), and 43.85% (long range) using the same evaluation. These tests were blindly performed on targets from the CASP11 dataset; and the results suggested that our models achieved comparable performance to contact predictors developed by groups that participated in CASP11. CONCLUSIONS: Due to the challenging nature of contact prediction, it is beneficial to develop and benchmark a variety of different prediction methods. Our work has produced useful tools with a simple interface that can provide contact predictions to users without requiring a lengthy installation process. In addition to this, we have released our C++ implementation of the direct-coupling analysis method as a standalone software package. Both this tool and our RFcon web server are freely available to the public at http://dna.cs.miami.edu/RFcon/.


Subjects
Computational Biology/methods , Machine Learning/standards , Proteins/metabolism , Amino Acid Sequence