Results 1 - 20 of 40
1.
PLoS Biol ; 22(5): e3002594, 2024 May.
Article in English | MEDLINE | ID: mdl-38754362

ABSTRACT

The standard genetic code defines the rules of translation for nearly every life form on Earth. It also determines the amino acid changes accessible via single-nucleotide mutations, thus influencing protein evolvability, that is, the ability of mutation to bring forth adaptive variation in protein function. One of the most striking features of the standard genetic code is its robustness to mutation, yet it remains an open question whether such robustness facilitates or frustrates protein evolvability. To answer this question, we use data from massively parallel sequence-to-function assays to construct and analyze 6 empirical adaptive landscapes under hundreds of thousands of rewired genetic codes, including those of codon compression schemes relevant to protein engineering and synthetic biology. We find that robust genetic codes tend to enhance protein evolvability by rendering smooth adaptive landscapes with few peaks, which are readily accessible from throughout sequence space. However, the standard genetic code is rarely exceptional in this regard, because many alternative codes render smoother landscapes than the standard code. By constructing low-dimensional visualizations of these landscapes, which each comprise more than 16 million mRNA sequences, we show that such alternative codes radically alter the topological features of the network of high-fitness genotypes. Whereas the genetic codes that optimize evolvability depend to some extent on the detailed relationship between amino acid sequence and protein function, we also uncover general design principles for engineering nonstandard genetic codes for enhanced and diminished evolvability, which may facilitate directed protein evolution experiments and the bio-containment of synthetic organisms, respectively.


Subject(s)
Molecular Evolution, Genetic Code, Proteins, Proteins/genetics, Proteins/metabolism, Mutation/genetics, Codon/genetics, Genetic Models, Synthetic Biology/methods, Protein Biosynthesis, Protein Engineering/methods
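
The abstract of entry 1 notes that the genetic code determines which amino acid changes are accessible via single-nucleotide mutations. The following is a minimal sketch of that idea for the standard code only, not the paper's analysis pipeline. The function name `accessible_amino_acids` is hypothetical, and stop codons are reported as `*`.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI translation table 1) in TCAG order; '*' marks stop codons.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def accessible_amino_acids(codon):
    """Return the amino acids (or '*') reachable from `codon` by one nucleotide change."""
    reachable = set()
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                mutant = codon[:pos] + base + codon[pos + 1:]
                reachable.add(CODON_TABLE[mutant])
    # Synonymous changes do not alter the protein, so drop the original residue.
    reachable.discard(CODON_TABLE[codon])
    return reachable

print(sorted(accessible_amino_acids("ATG")))
```

Swapping in a different 64-character string for `AMINO_ACIDS` would give a rewired code, which is the kind of comparison the abstract describes at a much larger scale.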
2.
Proc Natl Acad Sci U S A ; 119(7)2022 02 15.
Article in English | MEDLINE | ID: mdl-35145034

ABSTRACT

Evolutionary adaptation often occurs by the fixation of beneficial mutations. This mode of adaptation can be characterized quantitatively by a spectrum of adaptive substitutions, i.e., a distribution for types of changes fixed in adaptation. Recent work establishes that the changes involved in adaptation reflect common types of mutations, raising the question of how strongly the mutation spectrum shapes the spectrum of adaptive substitutions. We address this question with a codon-based model for the spectrum of adaptive amino acid substitutions, applied to three large datasets covering thousands of amino acid changes identified in natural and experimental adaptation in Saccharomyces cerevisiae, Escherichia coli, and Mycobacterium tuberculosis. Using species-specific mutation spectra based on prior knowledge, we find that the mutation spectrum has a proportional influence on the spectrum of adaptive substitutions in all three species. Indeed, we find that by inferring the mutation rates that best explain the spectrum of adaptive substitutions, we can accurately recover the species-specific mutation spectra. However, we also find that the predictive power of the model differs substantially between the three species. To better understand these differences, we use population simulations to explore the factors that influence how closely the spectrum of adaptive substitutions mirrors the mutation spectrum. The results show that the influence of the mutation spectrum decreases with increasing mutational supply ([Formula: see text]) and that predictive power is strongly affected by the number and diversity of beneficial mutations.


Subject(s)
Physiological Adaptation, Escherichia coli/genetics, Mycobacterium tuberculosis/genetics, Saccharomyces cerevisiae/genetics, Bacterial Proteins/genetics, Bacterial Proteins/metabolism, Escherichia coli/physiology, Fungal Proteins/genetics, Fungal Proteins/metabolism, Bacterial Gene Expression Regulation, Fungal Gene Expression Regulation, Mutation, Mycobacterium tuberculosis/physiology, Saccharomyces cerevisiae/physiology, Species Specificity
3.
Proc Natl Acad Sci U S A ; 119(39): e2204233119, 2022 09 27.
Article in English | MEDLINE | ID: mdl-36129941

ABSTRACT

Contemporary high-throughput mutagenesis experiments are providing an increasingly detailed view of the complex patterns of genetic interaction that occur between multiple mutations within a single protein or regulatory element. By simultaneously measuring the effects of thousands of combinations of mutations, these experiments have revealed that the genotype-phenotype relationship typically reflects not only genetic interactions between pairs of sites but also higher-order interactions among larger numbers of sites. However, modeling and understanding these higher-order interactions remains challenging. Here we present a method for reconstructing sequence-to-function mappings from partially observed data that can accommodate all orders of genetic interaction. The main idea is to make predictions for unobserved genotypes that match the type and extent of epistasis found in the observed data. This information on the type and extent of epistasis can be extracted by considering how phenotypic correlations change as a function of mutational distance, which is equivalent to estimating the fraction of phenotypic variance due to each order of genetic interaction (additive, pairwise, three-way, etc.). Using these estimated variance components, we then define an empirical Bayes prior that in expectation matches the observed pattern of epistasis and reconstruct the genotype-phenotype mapping by conducting Gaussian process regression under this prior. To demonstrate the power of this approach, we present an application to the antibody-binding domain GB1 and also provide a detailed exploration of a dataset consisting of high-throughput measurements for the splicing efficiency of human pre-mRNA [Formula: see text] splice sites, for which we also validate our model predictions via additional low-throughput experiments.


Subject(s)
Genetic Epistasis, RNA Precursors, Bayes Theorem, Chromosome Mapping, Computational Biology, Genotype, Humans, Genetic Models, Mutation, Phenotype, RNA Splicing
4.
Proc Natl Acad Sci U S A ; 118(40)2021 10 05.
Article in English | MEDLINE | ID: mdl-34599093

ABSTRACT

Density estimation in sequence space is a fundamental problem in machine learning that is also of great importance in computational biology. Due to the discrete nature and large dimensionality of sequence space, how best to estimate such probability distributions from a sample of observed sequences remains unclear. One common strategy for addressing this problem is to estimate the probability distribution using maximum entropy (i.e., calculating point estimates for some set of correlations based on the observed sequences and predicting the probability distribution that is as uniform as possible while still matching these point estimates). Building on recent advances in Bayesian field-theoretic density estimation, we present a generalization of this maximum entropy approach that provides greater expressivity in regions of sequence space where data are plentiful while still maintaining a conservative maximum entropy character in regions of sequence space where data are sparse or absent. In particular, we define a family of priors for probability distributions over sequence space with a single hyperparameter that controls the expected magnitude of higher-order correlations. This family of priors then results in a corresponding one-dimensional family of maximum a posteriori estimates that interpolate smoothly between the maximum entropy estimate and the observed sample frequencies. To demonstrate the power of this method, we use it to explore the high-dimensional geometry of the distribution of 5' splice sites found in the human genome and to understand patterns of chromosomal abnormalities across human cancers.


Subject(s)
Aneuploidy, Computational Biology/methods, Theoretical Models, Neoplasms/genetics, RNA Splice Sites, Humans, Probability
5.
Am Nat ; 202(4): 534-557, 2023 10.
Article in English | MEDLINE | ID: mdl-37792926

ABSTRACT

The joint distribution of selection coefficients and mutation rates is a key determinant of the genetic architecture of molecular adaptation. Three different distributions are of immediate interest: (1) the "nominal" distribution of possible changes, prior to mutation or selection; (2) the "de novo" distribution of realized mutations; and (3) the "fixed" distribution of selectively established mutations. Here, we formally characterize the relationships between these joint distributions under the strong-selection/weak-mutation (SSWM) regime. The de novo distribution is enriched relative to the nominal distribution for the highest rate mutations, and the fixed distribution is further enriched for the most highly beneficial mutations. Whereas mutation rates and selection coefficients are often assumed to be uncorrelated, we show that even with no correlation in the nominal distribution, the resulting de novo and fixed distributions can have correlations with any combination of signs. Nonetheless, we suggest that natural systems with a finite number of beneficial mutations will frequently have the kind of nominal distribution that induces negative correlations in the fixed distribution. We apply our mathematical framework, along with population simulations, to explore joint distributions of selection coefficients and mutation rates from deep mutational scanning and cancer informatics. Finally, we consider the evolutionary implications of these joint distributions together with two additional joint distributions relevant to parallelism and the rate of adaptation.


Subject(s)
Mutation Rate, Genetic Selection, Genetic Models, Mutation, Biological Evolution, Molecular Evolution
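
The three distributions named in the abstract of entry 5 can be illustrated with a toy calculation. This sketch is not the paper's framework, and all names and numbers are illustrative. It makes three assumptions:

- the nominal distribution is uniform over a small hypothetical set of beneficial mutations;
- the de novo distribution weights each mutation by its mutation rate;
- the fixed distribution further weights by a fixation probability proportional to the selection coefficient (the classical approximation for small positive s).

```python
def normalize(weights):
    """Scale non-negative weights so that they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]

def sswm_distributions(mutation_rates, selection_coefficients):
    """Return (nominal, de_novo, fixed) distributions over a list of beneficial mutations."""
    if len(mutation_rates) != len(selection_coefficients):
        raise ValueError("need one selection coefficient per mutation rate")
    if any(s <= 0 for s in selection_coefficients):
        raise ValueError("this sketch only covers beneficial mutations (s > 0)")
    n = len(mutation_rates)
    nominal = [1.0 / n] * n                      # every possible change counted once
    de_novo = normalize(mutation_rates)          # weighted by how often each arises
    fixed = normalize([mu * s for mu, s in zip(mutation_rates, selection_coefficients)])
    return nominal, de_novo, fixed

# Hypothetical example: a rare, strongly beneficial mutation and a common, weakly beneficial one.
rates = [1e-9, 4e-9]
coeffs = [0.04, 0.01]
nominal, de_novo, fixed = sswm_distributions(rates, coeffs)
print(nominal, de_novo, fixed)
```

In this example the common mutation dominates the de novo distribution, while the rare mutation regains weight in the fixed distribution because of its larger selection coefficient.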
6.
Annu Rev Genomics Hum Genet ; 20: 99-127, 2019 08 31.
Article in English | MEDLINE | ID: mdl-31091417

ABSTRACT

Over the last decade, a rich variety of massively parallel assays have revolutionized our understanding of how biological sequences encode quantitative molecular phenotypes. These assays include deep mutational scanning, high-throughput SELEX, and massively parallel reporter assays. Here, we review these experimental methods and how the data they produce can be used to quantitatively model sequence-function relationships. In doing so, we touch on a diverse range of topics, including the identification of clinically relevant genomic variants, the modeling of transcription factor binding to DNA, the functional and evolutionary landscapes of proteins, and cis-regulatory mechanisms in both transcription and mRNA splicing. We further describe a unified conceptual framework and a core set of mathematical modeling strategies that studies in these diverse areas can make use of. Finally, we highlight key aspects of experimental design and mathematical modeling that are important for the results of such studies to be interpretable and reproducible.


Subject(s)
Genetic Epistasis, Genetic Association Studies, High-Throughput Nucleotide Sequencing/methods, Genetic Models, SELEX Aptamer Technique/methods, DNA/genetics, DNA/metabolism, Genotype, Humans, Mutation, Phenotype, Protein Binding, RNA Splicing, Transcription Factors/genetics, Transcription Factors/metabolism, Genetic Transcription
7.
Proc Natl Acad Sci U S A ; 115(32): E7550-E7558, 2018 08 07.
Article in English | MEDLINE | ID: mdl-30037990

ABSTRACT

Genotype-phenotype relationships are notoriously complicated. Idiosyncratic interactions between specific combinations of mutations occur and are difficult to predict. Yet it is increasingly clear that many interactions can be understood in terms of global epistasis. That is, mutations may act additively on some underlying, unobserved trait, and this trait is then transformed via a nonlinear function to the observed phenotype as a result of subsequent biophysical and cellular processes. Here we infer the shape of such global epistasis in three proteins, based on published high-throughput mutagenesis data. To do so, we develop a maximum-likelihood inference procedure using a flexible family of monotonic nonlinear functions spanned by an I-spline basis. Our analysis uncovers dramatic nonlinearities in all three proteins; in some proteins a model with global epistasis accounts for virtually all of the measured variation, whereas in others we find substantial local epistasis as well. This method allows us to test hypotheses about the form of global epistasis and to distinguish variance components attributable to global epistasis, local epistasis, and measurement error.


Subject(s)
Genetic Epistasis, Molecular Evolution, Genetic Fitness, Genetic Models, Genotype, Statistical Models, Mutation, Nonlinear Dynamics, Phenotype
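
The notion of global epistasis in the abstract of entry 7 (mutations acting additively on an unobserved trait that is then passed through a nonlinear function) can be sketched in a few lines. This is an illustration only. The paper fits a flexible monotonic function with an I-spline basis, whereas a fixed sigmoid is used here as a stand-in. The effect sizes and all function names are hypothetical.

```python
import math

def latent_trait(genotype, site_effects):
    """Additive latent trait: sum of the effect of the allele carried at each site."""
    return sum(site_effects[site][allele] for site, allele in enumerate(genotype))

def sigmoid(x):
    """Stand-in monotonic nonlinearity mapping the latent trait to the phenotype."""
    return 1.0 / (1.0 + math.exp(-x))

def phenotype(genotype, site_effects, nonlinearity=sigmoid):
    """Observed phenotype: nonlinearity applied to the additive latent trait."""
    return nonlinearity(latent_trait(genotype, site_effects))

def pairwise_epistasis(f_wt, f_a, f_b, f_ab):
    """Deviation of the double mutant from the sum of the single-mutant effects."""
    return f_ab - f_a - f_b + f_wt

# Two sites; lowercase alleles are wild type, uppercase are mutant (hypothetical effects).
SITE_EFFECTS = [{"a": 0.0, "A": 2.0}, {"b": 0.0, "B": 2.0}]
GENOTYPES = ["ab", "Ab", "aB", "AB"]

latent = [latent_trait(g, SITE_EFFECTS) for g in GENOTYPES]
observed = [phenotype(g, SITE_EFFECTS) for g in GENOTYPES]

eps_latent = pairwise_epistasis(*latent)      # additive on the latent scale
eps_observed = pairwise_epistasis(*observed)  # nonzero on the observed scale
print(eps_latent, eps_observed)
```

The mutations combine without interaction on the latent scale, yet the saturating nonlinearity makes the double mutant gain less than the sum of the single-mutant gains on the observed scale.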
8.
Mol Biol Evol ; 34(9): 2163-2172, 2017 09 01.
Article in English | MEDLINE | ID: mdl-28645195

ABSTRACT

While mutational biases strongly influence neutral molecular evolution, the role of mutational biases in shaping the course of adaptation is less clear. Here we consider the frequency of transitions relative to transversions among adaptive substitutions. Because mutation rates for transitions are higher than those for transversions, if mutational biases influence the dynamics of adaptation, then transitions should be overrepresented among documented adaptive substitutions. To test this hypothesis, we assembled two sets of data on putatively adaptive amino acid replacements that have occurred in parallel during evolution, either in nature or in the laboratory. We find that the frequency of transitions in these data sets is much higher than would be predicted under a null model where mutation has no effect. Our results are qualitatively similar even if we restrict ourselves to changes that have occurred, not merely twice, but three or more times. These results suggest that the course of adaptation is biased by mutation.


Subject(s)
Physiological Adaptation/genetics, Bias, Biological Evolution, Molecular Evolution, Genetic Models, Mutation/genetics, Mutation Rate, Phylogeny, Point Mutation/genetics, Amino Acid Sequence Homology
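
The abstract of entry 8 rests on the distinction between transitions (purine to purine, or pyrimidine to pyrimidine) and transversions (purine to pyrimidine or the reverse). The sketch below shows how a list of nucleotide substitutions might be classified and tallied. The function names and the example list are hypothetical, and the paper's null model for amino acid replacements is not reproduced here.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def classify_substitution(ref, alt):
    """Label a single-nucleotide change as a 'transition' or a 'transversion'."""
    if ref == alt:
        raise ValueError("ref and alt must differ")
    if not {ref, alt} <= PURINES | PYRIMIDINES:
        raise ValueError("expected DNA bases A, C, G or T")
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"

def transition_fraction(substitutions):
    """Fraction of (ref, alt) pairs that are transitions."""
    labels = [classify_substitution(ref, alt) for ref, alt in substitutions]
    return labels.count("transition") / len(labels)

# Of the 12 possible single-nucleotide changes, 4 are transitions and 8 are transversions.
ALL_CHANGES = [(r, a) for r in "ACGT" for a in "ACGT" if r != a]
print(transition_fraction(ALL_CHANGES))

# Hypothetical observed substitutions, for illustration only.
observed = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C")]
print(transition_fraction(observed))
```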
9.
Heredity (Edinb) ; 121(5): 449-465, 2018 11.
Article in English | MEDLINE | ID: mdl-30232363

ABSTRACT

Understanding evolution on complex fitness landscapes is difficult both because of the large dimensionality of sequence space and the stochasticity inherent to population-genetic processes. Here, I present an integrated suite of mathematical tools for understanding evolution on time-invariant fitness landscapes when mutations occur sufficiently rarely that the population is typically monomorphic and evolution can be modeled as a sequence of well-separated fixation events. The basic intuition behind this suite of tools is that surrounding any particular genotype lies a region of the fitness landscape that is easy to evolve to, while other pieces of the fitness landscape are difficult to evolve to (due to distance, being across a fitness valley, etc.). I propose a rigorous definition for this "dynamical neighborhood" of a genotype which captures several aspects of the distribution of waiting times to evolve from one genotype to another. The neighborhood structure of the landscape as a whole can be summarized as a matrix, and I show how this matrix can be used to approximate the expected waiting time for certain evolutionary events to occur and to provide an intuitive interpretation to existing formal results on the index of dispersion of the molecular clock.


Subject(s)
Physiological Adaptation/genetics, Genetic Fitness, Mutation, Genotype
10.
Proc Natl Acad Sci U S A ; 112(25): E3226-35, 2015 Jun 23.
Article in English | MEDLINE | ID: mdl-26056312

ABSTRACT

The phenotypic effect of an allele at one genetic site may depend on alleles at other sites, a phenomenon known as epistasis. Epistasis can profoundly influence the process of evolution in populations and shape the patterns of protein divergence across species. Whereas epistasis between adaptive substitutions has been studied extensively, relatively little is known about epistasis under purifying selection. Here we use computational models of thermodynamic stability in a ligand-binding protein to explore the structure of epistasis in simulations of protein sequence evolution. Even though the predicted effects on stability of random mutations are almost completely additive, the mutations that fix under purifying selection are enriched for epistasis. In particular, the mutations that fix are contingent on previous substitutions: Although nearly neutral at their time of fixation, these mutations would be deleterious in the absence of preceding substitutions. Conversely, substitutions under purifying selection are subsequently entrenched by epistasis with later substitutions: They become increasingly deleterious to revert over time. Our results imply that, even under purifying selection, protein sequence evolution is often contingent on history and so it cannot be predicted by the phenotypic effects of mutations assayed in the ancestral background.


Subject(s)
Molecular Evolution, Proteins/genetics, Genetic Epistasis, Theoretical Models, Mutation, Protein Stability, Thermodynamics
11.
Theor Popul Biol ; 115: 69-80, 2017 06.
Article in English | MEDLINE | ID: mdl-28476403

ABSTRACT

Matrix projection models are a central tool in many areas of population biology. In most applications, one starts from the projection matrix to quantify the asymptotic growth rate of the population (the dominant eigenvalue), the stable stage distribution, and the reproductive values (the dominant right and left eigenvectors, respectively). Any primitive projection matrix also has an associated ergodic Markov chain that contains information about the genealogy of the population. In this paper, we show that these facts can be used to specify any matrix population model as a triple consisting of the ergodic Markov matrix, the dominant eigenvalue and one of the corresponding eigenvectors. This decomposition of the projection matrix separates properties associated with lineages from those associated with individuals. It also clarifies the relationships between many quantities commonly used to describe such models, including the relationship between eigenvalue sensitivities and elasticities. We illustrate the utility of such a decomposition by introducing a new method for aggregating classes in a matrix population model to produce a simpler model with a smaller number of classes. Unlike the standard method, our method has the advantage of preserving reproductive values and elasticities. It also has conceptually satisfying properties such as commuting with changes of units.


Subject(s)
Genealogy and Heraldry, Theoretical Models, Population Dynamics, Humans, Markov Chains, Population Groups, Reproduction
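
The quantities listed in the abstract of entry 11 (dominant eigenvalue, stable stage distribution, reproductive values, and an associated Markov chain) can be computed for a small example with NumPy. The sketch below rests on several assumptions. It uses the column-to-row convention n(t+1) = A n(t), a hypothetical three-class matrix, and one standard construction of a backward lineage chain, P[i, j] = A[i, j] w[j] / (lambda w[i]). The paper's own definitions may differ in orientation or normalization.

```python
import numpy as np

def dominant_eigen(A):
    """Return (growth rate, stable stage distribution w, reproductive values v)."""
    eigvals, right = np.linalg.eig(A)
    k = np.argmax(eigvals.real)            # Perron root of a primitive non-negative matrix
    lam = eigvals[k].real
    w = np.abs(right[:, k].real)
    w /= w.sum()                           # stage distribution sums to 1

    eigvals_left, left = np.linalg.eig(A.T)
    j = np.argmax(eigvals_left.real)
    v = np.abs(left[:, j].real)
    v /= v @ w                             # scale so that v . w = 1
    return lam, w, v

def lineage_markov_matrix(A, lam, w):
    """Backward chain: probability that a class-i individual descends from class j."""
    return A * w[np.newaxis, :] / (lam * w[:, np.newaxis])

# Hypothetical age-structured example: fertilities in row 0, survival on the subdiagonal.
A = np.array([
    [0.0, 1.5, 2.0],
    [0.5, 0.0, 0.0],
    [0.0, 0.8, 0.0],
])

lam, w, v = dominant_eigen(A)
P = lineage_markov_matrix(A, lam, w)
print(lam, w, v)
print(P.sum(axis=1))  # each row of a Markov matrix sums to 1
```

Given the triple (P, lam, w), the projection matrix can be recovered as A[i, j] = lam * P[i, j] * w[i] / w[j], which is the sense in which such a triple specifies the model under this construction.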
12.
Theor Popul Biol ; 112: 14-21, 2016 12.
Article in English | MEDLINE | ID: mdl-27497738

ABSTRACT

Underdominant mutations have fixed between divergent species, yet classical models suggest that rare underdominant alleles are purged quickly except in small or subdivided populations. We predict that underdominant alleles that also influence mate choice, such as those affecting coloration patterns visible to mates and predators alike, can fix more readily. We analyze a mechanistic model of positive assortative mating in which individuals have n chances to sample compatible mates. This one-parameter model naturally spans random mating (n=1) and complete assortment (n→∞), yet it produces sexual selection whose strength depends non-monotonically on n. This sexual selection interacts with viability selection to either inhibit or facilitate fixation. As mating opportunities increase, underdominant alleles fix as frequently as neutral mutations, even though sexual selection and underdominance independently each suppress rare alleles. This mechanism allows underdominant alleles to fix in large populations and illustrates how life history can affect evolutionary change.


Subject(s)
Biological Evolution, Animal Mating Preference, Mutation/genetics, Reproduction/genetics, Alleles, Animals, Humans, Probability
14.
Theor Popul Biol ; 99: 98-113, 2015 Feb.
Article in English | MEDLINE | ID: mdl-25450112

ABSTRACT

The formula for the probability of fixation of a new mutation is widely used in theoretical population genetics and molecular evolution. Here we derive a series of identities, inequalities and approximations for the exact probability of fixation of a new mutation under the Moran process (equivalent results hold for the approximate probability of fixation under the Wright-Fisher process, after an appropriate change of variables). We show that the logarithm of the fixation probability has particularly simple behavior when the selection coefficient is measured as a difference of Malthusian fitnesses, and we exploit this simplicity to derive inequalities and approximations. We also present a comprehensive comparison of both existing and new approximations for the fixation probability, highlighting those approximations that induce a reversible Markov chain when used to describe the dynamics of evolution under weak mutation. To demonstrate the power of these results, we consider the classical problem of determining the total substitution rate across an ensemble of biallelic loci and prove that, at equilibrium, a strict majority of substitutions are due to drift rather than selection.


Subject(s)
Molecular Evolution, Genetic Models, Mutation/genetics, Alleles, Population Genetics/methods, Markov Chains, Statistical Models, Probability, Genetic Selection
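
For readers unfamiliar with the quantity discussed in the abstract of entry 14, the following sketch evaluates the textbook fixation probability of a single new mutant under the Moran process. The selection coefficient s is taken as a difference of Malthusian fitnesses, so that the mutant's relative fitness is exp(s). The function name is hypothetical, and the abstract itself does not state the formula, so treat this as an assumed standard form rather than the paper's notation.

```python
import math

def moran_fixation_probability(s, N):
    """Fixation probability of one mutant with Malthusian advantage s among N haploids.

    Uses (1 - exp(-s)) / (1 - exp(-N * s)), with the neutral limit 1 / N at s = 0.
    """
    if N < 1:
        raise ValueError("population size must be at least 1")
    if s == 0:
        return 1.0 / N
    # expm1 keeps precision when s is close to zero; the two minus signs cancel.
    return math.expm1(-s) / math.expm1(-N * s)

N = 100
for s in (-0.02, 0.0, 0.02):
    print(s, moran_fixation_probability(s, N))

# The log fixation probabilities of a mutation and its reversal differ by (N - 1) * s.
s = 0.02
log_ratio = math.log(moran_fixation_probability(s, N)) - math.log(moran_fixation_probability(-s, N))
print(log_ratio, (N - 1) * s)
```

For strongly deleterious mutations in large populations, `exp(-N * s)` can overflow a float, so a production implementation would need to handle that range separately.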
15.
Nature ; 497(7451): E1-2; discussion E2-3, 2013 May 30.
Article in English | MEDLINE | ID: mdl-23719465
16.
Proc Natl Acad Sci U S A ; 113(12): 3136-8, 2016 Mar 22.
Article in English | MEDLINE | ID: mdl-26966235

Subject(s)
Genetic Drift, Humans
17.
bioRxiv ; 2024 Jun 24.
Article in English | MEDLINE | ID: mdl-38798625

ABSTRACT

Quantitative models that describe how biological sequences encode functional activities are ubiquitous in modern biology. One important aspect of these models is that they commonly exhibit gauge freedoms, i.e., directions in parameter space that do not affect model predictions. In physics, gauge freedoms arise when physical theories are formulated in ways that respect fundamental symmetries. However, the connections that gauge freedoms in models of sequence-function relationships have to the symmetries of sequence space have yet to be systematically studied. Here we study the gauge freedoms of models that respect a specific symmetry of sequence space: the group of position-specific character permutations. We find that gauge freedoms arise when model parameters transform under redundant irreducible matrix representations of this group. Based on this finding, we describe an "embedding distillation" procedure that enables analytic calculation of the number of independent gauge freedoms, as well as efficient computation of a sparse basis for the space of gauge freedoms. We also study how parameter transformation behavior affects parameter interpretability. We find that in many (and possibly all) nontrivial models, the ability to interpret individual model parameters as quantifying intrinsic allelic effects requires that gauge freedoms be present. This finding establishes an incompatibility between two distinct notions of parameter interpretability. Our work thus advances the understanding of symmetries, gauge freedoms, and parameter interpretability in sequence-function relationships. Significance Statement: Gauge freedoms (directions in parameter space that do not affect model predictions) are ubiquitous in mathematical models of biological sequence-function relationships. But in contrast to theoretical physics, where gauge freedoms play a central role, little is understood about the mathematical properties of gauge freedoms in models of sequence-function relationships.
Here we identify a connection between specific symmetries of sequence space and the gauge freedoms present in a large class of commonly used models for sequence-function relationships. We show that this connection can be used to perform useful mathematical computations, and we discuss the impact of model transformation properties on parameter interpretability. The results fill a major gap in the understanding of quantitative sequence-function relationships.

18.
ArXiv ; 2024 Apr 17.
Article in English | MEDLINE | ID: mdl-38699164

ABSTRACT

Biological sequences do not come at random. Instead, they appear with particular frequencies that reflect properties of the associated system or phenomenon. Knowing how biological sequences are distributed in sequence space is thus a natural first step toward understanding the underlying mechanisms. Here we propose a new method for inferring the probability distribution from which a sample of biological sequences was drawn for the case where the sequences are composed of elements that admit a natural ordering. Our method is based on Bayesian field theory, a physics-based machine learning approach, and can be regarded as a nonparametric extension of the traditional maximum entropy estimate. As an example, we use it to analyze the aneuploidy data pertaining to gliomas from The Cancer Genome Atlas project. In addition, we demonstrate two follow-up analyses that can be performed with the resulting probability distribution. One of them is to investigate the associations among the sequence sites. This provides us with a way to infer the governing biological grammar. The other is to study the global geometry of the probability landscape, which allows us to look at the problem from an evolutionary point of view. It can be seen that this methodology enables us to learn from a sample of sequences about how a biological system or phenomenon in the real world works.

19.
bioRxiv ; 2024 Mar 02.
Article in English | MEDLINE | ID: mdl-38013993

ABSTRACT

Deep neural networks (DNNs) have greatly advanced the ability to predict genome function from sequence. Interpreting genomic DNNs in terms of biological mechanisms, however, remains difficult. Here we introduce SQUID, a genomic DNN interpretability framework based on surrogate modeling. SQUID approximates genomic DNNs in user-specified regions of sequence space using surrogate models, i.e., simpler models that are mechanistically interpretable. Importantly, SQUID removes the confounding effects that nonlinearities and heteroscedastic noise in functional genomics data can have on model interpretation. Benchmarking analysis on multiple genomic DNNs shows that SQUID, when compared to established interpretability methods, identifies motifs that are more consistent across genomic loci and yields improved single-nucleotide variant-effect predictions. SQUID also supports surrogate models that quantify epistatic interactions within and between cis-regulatory elements. SQUID thus advances the ability to mechanistically interpret genomic DNNs.

20.
bioRxiv ; 2024 Jun 24.
Article in English | MEDLINE | ID: mdl-38798671

ABSTRACT

Quantitative models of sequence-function relationships are ubiquitous in computational biology, e.g., for modeling the DNA binding of transcription factors or the fitness landscapes of proteins. Interpreting these models, however, is complicated by the fact that the values of model parameters can often be changed without affecting model predictions. Before the values of model parameters can be meaningfully interpreted, one must remove these degrees of freedom (called "gauge freedoms" in physics) by imposing additional constraints (a process called "fixing the gauge"). However, strategies for fixing the gauge of sequence-function relationships have received little attention. Here we derive an analytically tractable family of gauges for a large class of sequence-function relationships. These gauges are derived in the context of models with all-order interactions, but an important subset of these gauges can be applied to diverse types of models, including additive models, pairwise-interaction models, and models with higher-order interactions. Many commonly used gauges are special cases of gauges within this family. We demonstrate the utility of this family of gauges by showing how different choices of gauge can be used both to explore complex activity landscapes and to reveal simplified models that are approximately correct within localized regions of sequence space. The results provide practical gauge-fixing strategies and demonstrate the utility of gauge-fixing for model exploration and interpretation. Significance Statement: Computational biology relies heavily on mathematical models that predict biological activities from DNA, RNA, or protein sequences. Interpreting the parameters of these models, however, remains difficult. Here we address a core challenge for model interpretation: the presence of "gauge freedoms", i.e., ways of changing model parameters without affecting model predictions.
The results unify commonly used methods for eliminating gauge freedoms and show how these methods can be used to simplify complex models in localized regions of sequence space. This work thus overcomes a major obstacle in the interpretation of quantitative sequence-function relationships.
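
The idea of a gauge freedom in the abstract of entry 20 can be made concrete with the simplest case, an additive model with a constant term and one parameter per position and character. The sketch below shows that shifting every parameter at a position by a constant, and compensating in the constant term, leaves all predictions unchanged. It then fixes the gauge by requiring the parameters at each position to sum to zero. This zero-sum choice is one commonly used gauge, and the abstract does not say how it relates to the family derived in the paper. The function names and the random parameters are hypothetical.

```python
from itertools import product

import numpy as np

def predict(theta0, theta, seq):
    """Additive model: constant term plus one parameter per (position, character)."""
    return theta0 + sum(theta[pos, char] for pos, char in enumerate(seq))

def fix_zero_sum_gauge(theta0, theta):
    """Move each position's mean effect into the constant term (zero-sum gauge)."""
    means = theta.mean(axis=1)
    return theta0 + means.sum(), theta - means[:, np.newaxis]

L, ALPHABET_SIZE = 3, 4                      # e.g. three DNA positions
rng = np.random.default_rng(0)
theta0 = 1.0
theta = rng.normal(size=(L, ALPHABET_SIZE))  # hypothetical fitted parameters

# A gauge transformation: shift position 0 by +5 and the constant by -5.
shifted = theta.copy()
shifted[0] += 5.0
shifted0 = theta0 - 5.0

fixed0, fixed = fix_zero_sum_gauge(theta0, theta)

sequences = list(product(range(ALPHABET_SIZE), repeat=L))
original = np.array([predict(theta0, theta, s) for s in sequences])
after_shift = np.array([predict(shifted0, shifted, s) for s in sequences])
after_fix = np.array([predict(fixed0, fixed, s) for s in sequences])

print(np.allclose(original, after_shift), np.allclose(original, after_fix))
print(fixed.sum(axis=1))  # each position's parameters now sum to (numerically) zero
```

In this parameterization there is one such shift per position, so the model has L gauge freedoms among its 1 + L * ALPHABET_SIZE parameters. After gauge fixing, individual parameters can be compared across positions without that arbitrary offset.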
