1.
Mol Cell ; 71(6): 1012-1026.e3, 2018 09 20.
Article in English | MEDLINE | ID: mdl-30174293

ABSTRACT

Pre-mRNA splicing is an essential step in the expression of most human genes. Mutations at the 5' splice site (5'ss) frequently cause defective splicing and disease due to interference with the initial recognition of the exon-intron boundary by U1 small nuclear ribonucleoprotein (snRNP), a component of the spliceosome. Here, we use a massively parallel splicing assay (MPSA) in human cells to quantify the activity of all 32,768 unique 5'ss sequences (NNN/GYNNNN) in three different gene contexts. Our results reveal that although splicing efficiency is mostly governed by the 5'ss sequence, there are substantial differences in this efficiency across gene contexts. Among other uses, these MPSA measurements facilitate the prediction of 5'ss sequence variants that are likely to cause aberrant splicing. This approach provides a framework to assess potential pathogenic variants in the human genome and streamline the development of splicing-corrective therapies.


Subject(s)
Alternative Splicing/genetics , RNA Splice Sites/genetics , RNA Splice Sites/physiology , Alternative Splicing/physiology , Carrier Proteins/genetics , Conserved Sequence/genetics , Exons , Genes, BRCA2 , HeLa Cells , Humans , Introns , Mutation , RNA Splicing/genetics , RNA Splicing/physiology , RNA, Small Nuclear/physiology , Ribonucleoprotein, U1 Small Nuclear/physiology , Spliceosomes , Survival of Motor Neuron 1 Protein/genetics , Transcriptional Elongation Factors
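
The library size quoted in the abstract above follows directly from the pattern NNN/GYNNNN: three free exonic positions, a fixed G, a pyrimidine, and four free intronic positions. The sketch below is only an illustrative enumeration (the function name `enumerate_5ss` and the use of the RNA alphabet are assumptions, not taken from the paper):

```python
from itertools import product

BASES = "ACGU"  # RNA alphabet; an assumption made for this illustration

def enumerate_5ss():
    """Yield every 9-nt sequence matching NNN/GYNNNN (the '/' marks the exon-intron boundary)."""
    for exon in product(BASES, repeat=3):            # exonic positions -3..-1
        for pyrimidine in "CU":                      # position +2 is Y (C or U)
            for tail in product(BASES, repeat=4):    # intronic positions +3..+6
                yield "".join(exon) + "G" + pyrimidine + "".join(tail)

sites = list(enumerate_5ss())
print(len(sites))  # 4**3 * 2 * 4**4 = 32768
```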
2.
Proc Natl Acad Sci U S A ; 119(39): e2204233119, 2022 09 27.
Article in English | MEDLINE | ID: mdl-36129941

ABSTRACT

Contemporary high-throughput mutagenesis experiments are providing an increasingly detailed view of the complex patterns of genetic interaction that occur between multiple mutations within a single protein or regulatory element. By simultaneously measuring the effects of thousands of combinations of mutations, these experiments have revealed that the genotype-phenotype relationship typically reflects not only genetic interactions between pairs of sites but also higher-order interactions among larger numbers of sites. However, modeling and understanding these higher-order interactions remains challenging. Here we present a method for reconstructing sequence-to-function mappings from partially observed data that can accommodate all orders of genetic interaction. The main idea is to make predictions for unobserved genotypes that match the type and extent of epistasis found in the observed data. This information on the type and extent of epistasis can be extracted by considering how phenotypic correlations change as a function of mutational distance, which is equivalent to estimating the fraction of phenotypic variance due to each order of genetic interaction (additive, pairwise, three-way, etc.). Using these estimated variance components, we then define an empirical Bayes prior that in expectation matches the observed pattern of epistasis and reconstruct the genotype-phenotype mapping by conducting Gaussian process regression under this prior. To demonstrate the power of this approach, we present an application to the antibody-binding domain GB1 and also provide a detailed exploration of a dataset consisting of high-throughput measurements for the splicing efficiency of human pre-mRNA 5' splice sites, for which we also validate our model predictions via additional low-throughput experiments.


Subject(s)
Epistasis, Genetic , RNA Precursors , Bayes Theorem , Chromosome Mapping , Computational Biology , Genotype , Humans , Models, Genetic , Mutation , Phenotype , RNA Splicing
3.
Proc Natl Acad Sci U S A ; 119(23): e2201301119, 2022 06 07.
Article in English | MEDLINE | ID: mdl-35653571

ABSTRACT

In σ-dependent transcriptional pausing, the transcription initiation factor σ, translocating with RNA polymerase (RNAP), makes sequence-specific protein-DNA interactions with a promoter-like sequence element in the transcribed region, inducing pausing. It has been proposed that, in σ-dependent pausing, the RNAP active center can access off-pathway "backtracked" states that are substrates for the transcript-cleavage factors of the Gre family and on-pathway "scrunched" states that mediate pause escape. Here, using site-specific protein-DNA photocrosslinking to define positions of the RNAP trailing and leading edges and of σ relative to DNA at the λPR' promoter, we show directly that σ-dependent pausing in the absence of GreB in vitro predominantly involves a state backtracked by 2–4 bp, and σ-dependent pausing in the presence of GreB in vitro and in vivo predominantly involves a state scrunched by 2–3 bp. Analogous experiments with a library of $4^7$ (∼16,000) transcribed-region sequences show that the state scrunched by 2–3 bp, and only that state, is associated with the consensus sequence $T_{-3}N_{-2}Y_{-1}G_{+1}$ (where −1 corresponds to the position of the RNA 3' end), which is identical to the consensus for pausing in initial transcription and which is related to the consensus for pausing in transcription elongation. Experiments with heteroduplex templates show that sequence information at position $T_{-3}$ resides in the DNA nontemplate strand. A cryoelectron microscopy structure of a complex engaged in σ-dependent pausing reveals positions of DNA scrunching on the DNA nontemplate and template strands and suggests that position $T_{-3}$ of the consensus sequence exerts its effects by facilitating scrunching.


Subject(s)
DNA-Directed RNA Polymerases , Transcription, Genetic , Cryoelectron Microscopy , DNA , DNA-Directed RNA Polymerases/metabolism , Escherichia coli/genetics
4.
Proc Natl Acad Sci U S A ; 118(40)2021 10 05.
Article in English | MEDLINE | ID: mdl-34599093

ABSTRACT

Density estimation in sequence space is a fundamental problem in machine learning that is also of great importance in computational biology. Due to the discrete nature and large dimensionality of sequence space, how best to estimate such probability distributions from a sample of observed sequences remains unclear. One common strategy for addressing this problem is to estimate the probability distribution using maximum entropy (i.e., calculating point estimates for some set of correlations based on the observed sequences and predicting the probability distribution that is as uniform as possible while still matching these point estimates). Building on recent advances in Bayesian field-theoretic density estimation, we present a generalization of this maximum entropy approach that provides greater expressivity in regions of sequence space where data are plentiful while still maintaining a conservative maximum entropy character in regions of sequence space where data are sparse or absent. In particular, we define a family of priors for probability distributions over sequence space with a single hyperparameter that controls the expected magnitude of higher-order correlations. This family of priors then results in a corresponding one-dimensional family of maximum a posteriori estimates that interpolate smoothly between the maximum entropy estimate and the observed sample frequencies. To demonstrate the power of this method, we use it to explore the high-dimensional geometry of the distribution of 5' splice sites found in the human genome and to understand patterns of chromosomal abnormalities across human cancers.


Subject(s)
Aneuploidy , Computational Biology/methods , Models, Theoretical , Neoplasms/genetics , RNA Splice Sites , Humans , Probability
5.
Proc Natl Acad Sci U S A ; 118(27)2021 07 06.
Article in English | MEDLINE | ID: mdl-34187896

ABSTRACT

Chemical modifications of RNA 5'-ends enable "epitranscriptomic" regulation, influencing multiple aspects of RNA fate. In transcription initiation, a large inventory of substrates compete with nucleoside triphosphates for use as initiating entities, providing an ab initio mechanism for altering the RNA 5'-end. In Escherichia coli cells, RNAs with a 5'-end hydroxyl are generated by use of dinucleotide RNAs as primers for transcription initiation, "primer-dependent initiation." Here, we use massively systematic transcript end readout (MASTER) to detect and quantify RNA 5'-ends generated by primer-dependent initiation for ∼$4^{10}$ (∼1,000,000) promoter sequences in E. coli. The results show primer-dependent initiation in E. coli involves any of the 16 possible dinucleotide primers and depends on promoter sequences in, upstream, and downstream of the primer binding site. The results yield a consensus sequence for primer-dependent initiation, $Y_{\mathrm{TSS}-2}N_{\mathrm{TSS}-1}N_{\mathrm{TSS}}W_{\mathrm{TSS}+1}$, where TSS is the transcription start site, $N_{\mathrm{TSS}-1}N_{\mathrm{TSS}}$ is the primer binding site, Y is pyrimidine, and W is A or T. Biochemical and structure-determination studies show that the base pair (nontemplate-strand base:template-strand base) immediately upstream of the primer binding site ($Y{:}R_{\mathrm{TSS}-2}$, where R is purine) exerts its effect through the base on the DNA template strand ($R_{\mathrm{TSS}-2}$) through interchain base stacking with the RNA primer. Results from analysis of a large set of natural, chromosomally encoded E. coli promoters support the conclusions from MASTER. Our findings provide a mechanistic and structural description of how TSS-region sequence hard-codes not only the TSS position but also the potential for epitranscriptomic regulation through primer-dependent transcription initiation.


Subject(s)
DNA Primers/metabolism , Escherichia coli/genetics , Promoter Regions, Genetic , Transcription Initiation, Genetic , Base Sequence , Binding Sites , Chromosomes, Bacterial/genetics , Gene Expression Regulation, Bacterial , RNA, Messenger/genetics , RNA, Messenger/metabolism , Transcription Initiation Site
6.
Annu Rev Genomics Hum Genet ; 20: 99-127, 2019 08 31.
Article in English | MEDLINE | ID: mdl-31091417

ABSTRACT

Over the last decade, a rich variety of massively parallel assays have revolutionized our understanding of how biological sequences encode quantitative molecular phenotypes. These assays include deep mutational scanning, high-throughput SELEX, and massively parallel reporter assays. Here, we review these experimental methods and how the data they produce can be used to quantitatively model sequence-function relationships. In doing so, we touch on a diverse range of topics, including the identification of clinically relevant genomic variants, the modeling of transcription factor binding to DNA, the functional and evolutionary landscapes of proteins, and cis-regulatory mechanisms in both transcription and mRNA splicing. We further describe a unified conceptual framework and a core set of mathematical modeling strategies that studies in these diverse areas can make use of. Finally, we highlight key aspects of experimental design and mathematical modeling that are important for the results of such studies to be interpretable and reproducible.


Subject(s)
Epistasis, Genetic , Genetic Association Studies , High-Throughput Nucleotide Sequencing/methods , Models, Genetic , SELEX Aptamer Technique/methods , DNA/genetics , DNA/metabolism , Genotype , Humans , Mutation , Phenotype , Protein Binding , RNA Splicing , Transcription Factors/genetics , Transcription Factors/metabolism , Transcription, Genetic
7.
Bioinformatics ; 36(7): 2272-2274, 2020 04 01.
Article in English | MEDLINE | ID: mdl-31821414

ABSTRACT

SUMMARY: Sequence logos are visually compelling ways of illustrating the biological properties of DNA, RNA and protein sequences, yet it is currently difficult to generate and customize such logos within the Python programming environment. Here we introduce Logomaker, a Python API for creating publication-quality sequence logos. Logomaker can produce both standard and highly customized logos from either a matrix-like array of numbers or a multiple-sequence alignment. Logos are rendered as native matplotlib objects that are easy to stylize and incorporate into multi-panel figures. AVAILABILITY AND IMPLEMENTATION: Logomaker can be installed using the pip package manager and is compatible with both Python 2.7 and Python 3.6. Documentation is provided at http://logomaker.readthedocs.io; source code is available at http://github.com/jbkinney/logomaker.


Subject(s)
Documentation , Software , DNA , Position-Specific Scoring Matrices
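
As a rough illustration of the "matrix-like array of numbers" that a sequence logo is drawn from, the sketch below builds a per-position information matrix from a toy alignment using only the standard library. It does not use or reproduce Logomaker's own API; the function name `information_matrix` and the example alignment are assumptions, and no small-sample correction is applied.

```python
import math
from collections import Counter

def information_matrix(seqs, alphabet="ACGT"):
    """Return one dict per alignment column mapping each character to its logo height in bits."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences in the alignment must have the same length")
    rows = []
    for pos in range(length):
        counts = Counter(s[pos] for s in seqs)
        total = sum(counts[c] for c in alphabet)
        freqs = {c: counts[c] / total for c in alphabet}
        # Shannon entropy of the column, in bits
        entropy = -sum(p * math.log2(p) for p in freqs.values() if p > 0)
        # Information content relative to a uniform background
        info = math.log2(len(alphabet)) - entropy
        # Each character's height is its frequency times the column information
        rows.append({c: freqs[c] * info for c in alphabet})
    return rows

alignment = ["ACGT", "ACGA", "ACTT", "AGGT"]  # toy data, for illustration only
matrix = information_matrix(alignment)
for pos, row in enumerate(matrix):
    print(pos, {c: round(h, 3) for c, h in row.items()})
```

A fully conserved column (position 0 here) reaches the maximum of 2 bits for a four-letter alphabet, while variable columns carry less.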
8.
Proc Natl Acad Sci U S A ; 115(21): E4796-E4805, 2018 05 22.
Article in English | MEDLINE | ID: mdl-29728462

ABSTRACT

Gene regulation is one of the most ubiquitous processes in biology. However, while the catalog of bacterial genomes continues to expand rapidly, we remain ignorant about how almost all of the genes in these genomes are regulated. At present, characterizing the molecular mechanisms by which individual regulatory sequences operate requires focused efforts using low-throughput methods. Here, we take a first step toward multipromoter dissection and show how a combination of massively parallel reporter assays, mass spectrometry, and information-theoretic modeling can be used to dissect multiple bacterial promoters in a systematic way. We show this approach on both well-studied and previously uncharacterized promoters in the enteric bacterium Escherichia coli. In all cases, we recover nucleotide-resolution models of promoter mechanism. For some promoters, including previously unannotated ones, the approach allowed us to further extract quantitative biophysical models describing input-output relationships. Given the generality of the approach presented here, it opens up the possibility of quantitatively dissecting the mechanisms of promoter function in E. coli and a wide range of other bacteria.


Subject(s)
Escherichia coli Proteins/metabolism , Escherichia coli/genetics , Gene Expression Regulation, Bacterial , Genome, Bacterial , Green Fluorescent Proteins/metabolism , Promoter Regions, Genetic , Escherichia coli/growth & development , Escherichia coli/metabolism , Escherichia coli Proteins/genetics , Transcriptional Activation
9.
PLoS Comput Biol ; 15(2): e1006226, 2019 02.
Article in English | MEDLINE | ID: mdl-30716072

ABSTRACT

Despite the central importance of transcriptional regulation in biology, it has proven difficult to determine the regulatory mechanisms of individual genes, let alone entire gene networks. It is particularly difficult to decipher the biophysical mechanisms of transcriptional regulation in living cells and determine the energetic properties of binding sites for transcription factors and RNA polymerase. In this work, we present a strategy for dissecting transcriptional regulatory sequences using in vivo methods (massively parallel reporter assays) to formulate quantitative models that map a transcription factor binding site's DNA sequence to transcription factor-DNA binding energy. We use these models to predict the binding energies of transcription factor binding sites to within 1 $k_BT$ of their measured values. We further explore how such a sequence-energy mapping relates to the mechanisms of transcriptional regulation in various promoter contexts. Specifically, we show that our models can be used to design specific induction responses, analyze the effects of amino acid mutations on DNA sequence preference, and determine how regulatory context affects a transcription factor's sequence specificity.


Subject(s)
Binding Sites/genetics , Computational Biology/methods , Sequence Analysis, DNA/methods , Chromosome Mapping , DNA/chemistry , Gene Expression Regulation/genetics , Gene Regulatory Networks , Models, Molecular , Promoter Regions, Genetic/genetics , Protein Binding , Transcription Factors/chemistry , Transcription Factors/metabolism , Transcription, Genetic/physiology
10.
Genome Res ; 26(3): 315-30, 2016 Mar.
Article in English | MEDLINE | ID: mdl-26733669

ABSTRACT

Eukaryotic chromosomes initiate DNA synthesis from multiple replication origins in a temporally specific manner during S phase. The replicative helicase Mcm2-7 functions in both initiation and fork progression and thus is an important target of regulation. Mcm4, a helicase subunit, possesses an unstructured regulatory domain that mediates control from multiple kinase signaling pathways, including the Dbf4-dependent Cdc7 kinase (DDK). Following replication stress in S phase, Dbf4 and Sld3, an initiation factor and essential target of Cyclin-Dependent Kinase (CDK), are targets of the checkpoint kinase Rad53 for inhibition of initiation from origins that have yet to be activated, so-called late origins. Here, whole-genome DNA replication profile analysis is used to assess under various conditions the effect of mutations that alter the Mcm4 regulatory domain and the Rad53 targets, Sld3 and Dbf4. Late origin firing occurs under genotoxic stress when the controls on Mcm4, Sld3, and Dbf4 are simultaneously eliminated. The regulatory domain of Mcm4 plays an important role in the timing of late origin firing, both in an unperturbed S phase and in dNTP limitation. Furthermore, checkpoint control of Sld3 impacts fork progression under replication stress. This effect is parallel to the role of the Mcm4 regulatory domain in monitoring fork progression. Hypomorph mutations in sld3 are suppressed by a mcm4 regulatory domain mutation. Thus, in response to cellular conditions, the functions executed by Sld3, Dbf4, and the regulatory domain of Mcm4 intersect to control origin firing and replication fork progression, thereby ensuring genome stability.


Subject(s)
Cell Cycle Proteins/metabolism , DNA Replication , DNA-Binding Proteins/metabolism , Minichromosome Maintenance Complex Component 4/metabolism , Replication Origin , Saccharomyces cerevisiae Proteins/metabolism , Alkylating Agents/pharmacology , Alleles , Checkpoint Kinase 2/metabolism , Chromosomes, Fungal , Cyclin-Dependent Kinases/metabolism , DNA Replication/drug effects , Hydroxyurea/pharmacology , Minichromosome Maintenance Complex Component 4/genetics , Mutation , Phenotype , Phosphorylation , Saccharomyces cerevisiae/drug effects , Saccharomyces cerevisiae/genetics , Saccharomyces cerevisiae/metabolism , Saccharomyces cerevisiae Proteins/genetics , Sequence Deletion , Signal Transduction
11.
Phys Rev Lett ; 121(16): 160605, 2018 Oct 19.
Article in English | MEDLINE | ID: mdl-30387642

ABSTRACT

How might a smooth probability distribution be estimated with accurately quantified uncertainty from a limited amount of sampled data? Here we describe a field-theoretic approach that addresses this problem remarkably well in one dimension, providing an exact nonparametric Bayesian posterior without relying on tunable parameters or large-data approximations. Strong non-Gaussian constraints, which require a nonperturbative treatment, are found to play a major role in reducing distribution uncertainty. A software implementation of this method is provided.

12.
Proc Natl Acad Sci U S A ; 111(9): 3354-9, 2014 Mar 04.
Article in English | MEDLINE | ID: mdl-24550517

ABSTRACT

How should one quantify the strength of association between two random variables without bias for relationships of a specific form? Despite its conceptual simplicity, this notion of statistical "equitability" has yet to receive a definitive mathematical formalization. Here we argue that equitability is properly formalized by a self-consistency condition closely related to Data Processing Inequality. Mutual information, a fundamental quantity in information theory, is shown to satisfy this equitability criterion. These findings are at odds with the recent work of Reshef et al. [Reshef DN, et al. (2011) Science 334(6062):1518-1524], which proposed an alternative definition of equitability and introduced a new statistic, the "maximal information coefficient" (MIC), said to satisfy equitability in contradistinction to mutual information. These conclusions, however, were supported only with limited simulation evidence, not with mathematical arguments. Upon revisiting these claims, we prove that the mathematical definition of equitability proposed by Reshef et al. cannot be satisfied by any (nontrivial) dependence measure. We also identify artifacts in the reported simulation evidence. When these artifacts are removed, estimates of mutual information are found to be more equitable than estimates of MIC. Mutual information is also observed to have consistently higher statistical power than MIC. We conclude that estimating mutual information provides a natural (and often practical) way to equitably quantify statistical associations in large datasets.


Subject(s)
Data Interpretation, Statistical , Information Theory , Statistics as Topic/methods , Bias , Mathematics
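
The abstract above argues that mutual information does not favor relationships of any particular form. A minimal sketch of that property, assuming discrete paired samples and a simple plug-in estimate (the paper itself concerns more general settings and estimators; the function name `mutual_information` is hypothetical):

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Plug-in estimate of I[X; Y] in bits from a list of discrete (x, y) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi

xs = [0, 1, 2, 3]
identity = [(x, x) for x in xs]                  # y = x
scrambled = [(x, (3 * x) % 4) for x in xs]       # a non-monotonic one-to-one map
independent = [(x, y) for x in (0, 1) for y in (0, 1)]

print(mutual_information(identity))     # 2 bits
print(mutual_information(scrambled))    # also 2 bits: the form of the relationship does not matter
print(mutual_information(independent))  # 0 bits
```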
13.
Proc Natl Acad Sci U S A ; 111(18): E1899-908, 2014 May 06.
Article in English | MEDLINE | ID: mdl-24740181

ABSTRACT

Eukaryotic DNA synthesis initiates from multiple replication origins and progresses through bidirectional replication forks to ensure efficient duplication of the genome. Temporal control of initiation from origins and regulation of replication fork functions are important aspects for maintaining genome stability. Multiple kinase-signaling pathways are involved in these processes. The Dbf4-dependent Cdc7 kinase (DDK), cyclin-dependent kinase (CDK), and Mec1, the yeast Ataxia telangiectasia mutated/Ataxia telangiectasia mutated Rad3-related checkpoint regulator, all target the structurally disordered N-terminal serine/threonine-rich domain (NSD) of mini-chromosome maintenance subunit 4 (Mcm4), a subunit of the mini-chromosome maintenance (MCM) replicative helicase complex. Using whole-genome replication profile analysis and single-molecule DNA fiber analysis, we show that under replication stress the temporal pattern of origin activation and DNA replication fork progression are altered in cells with mutations within two separate segments of the Mcm4 NSD. The proximal segment of the NSD residing next to the DDK-docking domain mediates repression of late-origin firing by checkpoint signals because in its absence late origins become active despite an elevated DNA damage-checkpoint response. In contrast, the distal segment of the NSD at the N terminus plays no role in the temporal pattern of origin firing but has a strong influence on replication fork progression and on checkpoint signaling. Both fork progression and checkpoint response are regulated by the phosphorylation of the canonical CDK sites at the distal NSD. Together, our data suggest that the eukaryotic MCM helicase contains an intrinsic regulatory domain that integrates multiple signals to coordinate origin activation and replication fork progression under stress conditions.


Subject(s)
DNA Replication/physiology , DNA, Fungal/biosynthesis , DNA, Fungal/chemistry , Minichromosome Maintenance Complex Component 4/chemistry , Minichromosome Maintenance Complex Component 4/metabolism , Saccharomyces cerevisiae Proteins/chemistry , Saccharomyces cerevisiae Proteins/metabolism , Cell Cycle Checkpoints , Cell Cycle Proteins/metabolism , Cyclin-Dependent Kinases/metabolism , Genome, Fungal , Intracellular Signaling Peptides and Proteins/metabolism , Minichromosome Maintenance Complex Component 4/genetics , Mutation , Nucleic Acid Conformation , Phosphorylation , Protein Serine-Threonine Kinases/metabolism , Protein Structure, Tertiary , Protein Subunits , Replication Origin , Saccharomyces cerevisiae/cytology , Saccharomyces cerevisiae/genetics , Saccharomyces cerevisiae/metabolism , Saccharomyces cerevisiae Proteins/genetics , Signal Transduction
14.
Neural Comput ; 26(4): 637-53, 2014 Apr.
Article in English | MEDLINE | ID: mdl-24479782

ABSTRACT

Motivated by data-rich experiments in transcriptional regulation and sensory neuroscience, we consider the following general problem in statistical inference: when exposed to a high-dimensional signal S, a system of interest computes a representation R of that signal, which is then observed through a noisy measurement M. From a large number of signals and measurements, we wish to infer the "filter" that maps S to R. However, the standard method for solving such problems, likelihood-based inference, requires perfect a priori knowledge of the "noise function" mapping R to M. In practice such noise functions are usually known only approximately, if at all, and using an incorrect noise function will typically bias the inferred filter. Here we show that in the large data limit, this need for a precharacterized noise function can be circumvented by searching for filters that instead maximize the mutual information I[M; R] between observed measurements and predicted representations. Moreover, if the correct filter lies within the space of filters being explored, maximizing mutual information becomes equivalent to simultaneously maximizing every dependence measure that satisfies the data processing inequality. It is important to note that maximizing mutual information will typically leave a small number of directions in parameter space unconstrained. We term these directions diffeomorphic modes and present an equation that allows these modes to be derived systematically. The presence of diffeomorphic modes reflects a fundamental and nontrivial substructure within parameter space, one that is obscured by standard likelihood-based inference.


Subject(s)
Likelihood Functions , Models, Statistical , Algorithms , Animals , Gene Regulatory Networks , Humans
15.
bioRxiv ; 2024 Jun 24.
Article in English | MEDLINE | ID: mdl-38798625

ABSTRACT

Quantitative models that describe how biological sequences encode functional activities are ubiquitous in modern biology. One important aspect of these models is that they commonly exhibit gauge freedoms, i.e., directions in parameter space that do not affect model predictions. In physics, gauge freedoms arise when physical theories are formulated in ways that respect fundamental symmetries. However, the connections that gauge freedoms in models of sequence-function relationships have to the symmetries of sequence space have yet to be systematically studied. Here we study the gauge freedoms of models that respect a specific symmetry of sequence space: the group of position-specific character permutations. We find that gauge freedoms arise when model parameters transform under redundant irreducible matrix representations of this group. Based on this finding, we describe an "embedding distillation" procedure that enables analytic calculation of the number of independent gauge freedoms, as well as efficient computation of a sparse basis for the space of gauge freedoms. We also study how parameter transformation behavior affects parameter interpretability. We find that in many (and possibly all) nontrivial models, the ability to interpret individual model parameters as quantifying intrinsic allelic effects requires that gauge freedoms be present. This finding establishes an incompatibility between two distinct notions of parameter interpretability. Our work thus advances the understanding of symmetries, gauge freedoms, and parameter interpretability in sequence-function relationships.
Significance Statement: Gauge freedoms (directions in parameter space that do not affect model predictions) are ubiquitous in mathematical models of biological sequence-function relationships. But in contrast to theoretical physics, where gauge freedoms play a central role, little is understood about the mathematical properties of gauge freedoms in models of sequence-function relationships. Here we identify a connection between specific symmetries of sequence space and the gauge freedoms present in a large class of commonly used models for sequence-function relationships. We show that this connection can be used to perform useful mathematical computations, and we discuss the impact of model transformation properties on parameter interpretability. The results fill a major gap in the understanding of quantitative sequence-function relationships.

16.
bioRxiv ; 2024 Mar 02.
Article in English | MEDLINE | ID: mdl-38013993

ABSTRACT

Deep neural networks (DNNs) have greatly advanced the ability to predict genome function from sequence. Interpreting genomic DNNs in terms of biological mechanisms, however, remains difficult. Here we introduce SQUID, a genomic DNN interpretability framework based on surrogate modeling. SQUID approximates genomic DNNs in user-specified regions of sequence space using surrogate models, i.e., simpler models that are mechanistically interpretable. Importantly, SQUID removes the confounding effects that nonlinearities and heteroscedastic noise in functional genomics data can have on model interpretation. Benchmarking analysis on multiple genomic DNNs shows that SQUID, when compared to established interpretability methods, identifies motifs that are more consistent across genomic loci and yields improved single-nucleotide variant-effect predictions. SQUID also supports surrogate models that quantify epistatic interactions within and between cis-regulatory elements. SQUID thus advances the ability to mechanistically interpret genomic DNNs.

17.
bioRxiv ; 2024 Jun 24.
Article in English | MEDLINE | ID: mdl-38798671

ABSTRACT

Quantitative models of sequence-function relationships are ubiquitous in computational biology, e.g., for modeling the DNA binding of transcription factors or the fitness landscapes of proteins. Interpreting these models, however, is complicated by the fact that the values of model parameters can often be changed without affecting model predictions. Before the values of model parameters can be meaningfully interpreted, one must remove these degrees of freedom (called "gauge freedoms" in physics) by imposing additional constraints (a process called "fixing the gauge"). However, strategies for fixing the gauge of sequence-function relationships have received little attention. Here we derive an analytically tractable family of gauges for a large class of sequence-function relationships. These gauges are derived in the context of models with all-order interactions, but an important subset of these gauges can be applied to diverse types of models, including additive models, pairwise-interaction models, and models with higher-order interactions. Many commonly used gauges are special cases of gauges within this family. We demonstrate the utility of this family of gauges by showing how different choices of gauge can be used both to explore complex activity landscapes and to reveal simplified models that are approximately correct within localized regions of sequence space. The results provide practical gauge-fixing strategies and demonstrate the utility of gauge-fixing for model exploration and interpretation.
Significance Statement: Computational biology relies heavily on mathematical models that predict biological activities from DNA, RNA, or protein sequences. Interpreting the parameters of these models, however, remains difficult. Here we address a core challenge for model interpretation: the presence of 'gauge freedoms', i.e., ways of changing model parameters without affecting model predictions. The results unify commonly used methods for eliminating gauge freedoms and show how these methods can be used to simplify complex models in localized regions of sequence space. This work thus overcomes a major obstacle in the interpretation of quantitative sequence-function relationships.
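
To make "fixing the gauge" concrete, the sketch below takes a small additive model, moves each position's mean effect into the constant term (a zero-sum gauge, one commonly used choice), and checks that no prediction changes. The parameter values, the two-position model, and the function names `predict` and `fix_zero_sum_gauge` are illustrative assumptions; this is not the family of gauges derived in the paper.

```python
from itertools import product

ALPHABET = "ACGT"

def predict(theta0, theta, seq):
    """Additive model: a constant term plus one effect per (position, character)."""
    return theta0 + sum(theta[pos][char] for pos, char in enumerate(seq))

def fix_zero_sum_gauge(theta0, theta):
    """Shift each position's mean effect into the constant term so every position sums to zero."""
    new_theta0 = theta0
    new_theta = []
    for row in theta:
        mean = sum(row.values()) / len(row)
        new_theta0 += mean
        new_theta.append({char: value - mean for char, value in row.items()})
    return new_theta0, new_theta

# Hypothetical parameters for a two-position model
theta0 = 1.0
theta = [
    {"A": 0.5, "C": -0.2, "G": 0.1, "T": 0.4},
    {"A": -1.0, "C": 0.3, "G": 0.0, "T": 0.7},
]

g0, g = fix_zero_sum_gauge(theta0, theta)

# The parameters change, but every prediction is preserved
for seq in product(ALPHABET, repeat=2):
    assert abs(predict(theta0, theta, seq) - predict(g0, g, seq)) < 1e-12
print(g0, g)
```

After gauge fixing, the constant term equals the mean prediction over all sequences, which is one reason this gauge is convenient for interpretation.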

18.
Nat Commun ; 15(1): 1880, 2024 Feb 29.
Article in English | MEDLINE | ID: mdl-38424098

ABSTRACT

Drugs that target pre-mRNA splicing hold great therapeutic potential, but the quantitative understanding of how these drugs work is limited. Here we introduce mechanistically interpretable quantitative models for the sequence-specific and concentration-dependent behavior of splice-modifying drugs. Using massively parallel splicing assays, RNA-seq experiments, and precision dose-response curves, we obtain quantitative models for two small-molecule drugs, risdiplam and branaplam, developed for treating spinal muscular atrophy. The results quantitatively characterize the specificities of risdiplam and branaplam for 5' splice site sequences, suggest that branaplam recognizes 5' splice sites via two distinct interaction modes, and contradict the prevailing two-site hypothesis for risdiplam activity at SMN2 exon 7. The results also show that anomalous single-drug cooperativity, as well as multi-drug synergy, are widespread among small-molecule drugs and antisense-oligonucleotide drugs that promote exon inclusion. Our quantitative models thus clarify the mechanisms of existing treatments and provide a basis for the rational development of new therapies.


Subject(s)
Muscular Atrophy, Spinal , Pyrimidines , RNA Splicing , Humans , RNA Splicing/genetics , Azo Compounds , Oligonucleotides/genetics , Oligonucleotides, Antisense/genetics , Oligonucleotides, Antisense/therapeutic use , RNA Splice Sites , Muscular Atrophy, Spinal/drug therapy , Muscular Atrophy, Spinal/genetics
19.
Proc Natl Acad Sci U S A ; 107(20): 9158-63, 2010 May 18.
Article in English | MEDLINE | ID: mdl-20439748

ABSTRACT

Cells use protein-DNA and protein-protein interactions to regulate transcription. A biophysical understanding of this process has, however, been limited by the lack of methods for quantitatively characterizing the interactions that occur at specific promoters and enhancers in living cells. Here we show how such biophysical information can be revealed by a simple experiment in which a library of partially mutated regulatory sequences are partitioned according to their in vivo transcriptional activities and then sequenced en masse. Computational analysis of the sequence data produced by this experiment can provide precise quantitative information about how the regulatory proteins at a specific arrangement of binding sites work together to regulate transcription. This ability to reliably extract precise information about regulatory biophysics in the face of experimental noise is made possible by a recently identified relationship between likelihood and mutual information. Applying our experimental and computational techniques to the Escherichia coli lac promoter, we demonstrate the ability to identify regulatory protein binding sites de novo, determine the sequence-dependent binding energy of the proteins that bind these sites, and, importantly, measure the in vivo interaction energy between RNA polymerase and a DNA-bound transcription factor. Our approach provides a generally applicable method for characterizing the biophysical basis of transcriptional regulation by a specified regulatory sequence. The principles of our method can also be applied to a wide range of other problems in molecular biology.


Subject(s)
Gene Expression Regulation/genetics , Models, Biological , Mutation/genetics , Promoter Regions, Genetic/genetics , Base Sequence , Binding Sites/genetics , Biophysics , Computational Biology/methods , Escherichia coli , Flow Cytometry , Gene Expression Regulation/physiology , Green Fluorescent Proteins/metabolism , Lac Operon/genetics , Likelihood Functions , Molecular Sequence Data , Monte Carlo Method , Sequence Analysis, DNA , Thermodynamics