ABSTRACT
Paused RNA polymerase (Pol II) is a pervasive feature of Drosophila embryos and mammalian stem cells, but its role in development is uncertain. Here, we demonstrate that a spectrum of paused Pol II determines the "time to synchrony," the time required to achieve coordinated gene expression across the cells of a tissue. To determine whether synchronous patterns of gene activation are significant in development, we manipulated the timing of snail expression, which controls the coordinated invagination of ~1,000 mesoderm cells during gastrulation. Replacement of the strongly paused snail promoter with moderately paused or nonpaused promoters causes stochastic activation of snail expression and increased variability of mesoderm invagination. Computational modeling of the dorsal-ventral patterning network recapitulates these variable and bistable gastrulation profiles and emphasizes the importance of timing of gene activation in development. We conclude that paused Pol II and transcriptional synchrony are essential for coordinating cell behavior during morphogenesis.
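The abstract describes "time to synchrony" only in words. As a minimal sketch, assuming one possible operational definition (the time elapsed from the first cell switching on until a chosen fraction of cells is on), it could be computed as below. The function name, the 50% threshold, and the onset times are hypothetical illustrations, not values or methods taken from the paper.

```python
import math

def time_to_synchrony(activation_times, fraction=0.5):
    """Elapsed time from the first activation until `fraction` of cells are on.

    Assumed operational definition for illustration only; the source abstract
    does not specify how the quantity was measured.
    """
    if not activation_times:
        raise ValueError("need at least one activation time")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    times = sorted(activation_times)
    # Index of the cell whose activation brings the tissue to the threshold.
    k = max(1, math.ceil(fraction * len(times)))
    return times[k - 1] - times[0]

# Made-up onset times (minutes) for two hypothetical promoters.
paused_like = [10, 11, 11, 12, 12, 13]      # tight, synchronous onset
nonpaused_like = [10, 14, 19, 25, 31, 40]   # spread-out, stochastic onset

sync_paused = time_to_synchrony(paused_like)
sync_nonpaused = time_to_synchrony(nonpaused_like)
```

Under this assumed definition, a smaller value corresponds to a more synchronous onset, which is the qualitative contrast the abstract draws between strongly paused and nonpaused promoters.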
Subject(s)
Drosophila melanogaster/embryology; Drosophila melanogaster/genetics; Embryo, Nonmammalian/metabolism; RNA Polymerase II/metabolism; Transcription, Genetic; Animals; Base Sequence; Drosophila Proteins/metabolism; Drosophila melanogaster/enzymology; Gastrulation; Gene Expression Regulation, Developmental; Models, Biological; Molecular Sequence Data; Morphogenesis; Promoter Regions, Genetic
ABSTRACT
After a COVID-related hiatus, the fifth biennial symposium on Evolution and Core Processes in Gene Regulation met at the Stowers Institute in Kansas City, Missouri July 21 to 24, 2022. This symposium, sponsored by the American Society for Biochemistry and Molecular Biology (ASBMB), featured experts in gene regulation and evolutionary biology. Topic areas covered enhancer evolution, the cis-regulatory code, and regulatory variation, with an overall focus on bringing the power of deep learning (DL) to decipher DNA sequence information. DL is a machine learning method that uses neural networks to learn complex rules that make predictions about diverse types of data. When DL models are trained to predict genomic data from DNA sequence information, their high prediction accuracy allows the identification of impactful genetic variants within and across species. In addition, the learned sequence rules can be extracted from the model and provide important clues about the mechanistic underpinnings of the cis-regulatory code.
Subject(s)
COVID-19; Deep Learning; Humans; Genomics; Neural Networks, Computer; Gene Expression
ABSTRACT
The Spt/Ada-Gcn5 Acetyltransferase (SAGA) coactivator complex has multiple modules with different enzymatic and non-enzymatic functions. How each module contributes to gene expression is not well understood. During Drosophila oogenesis, the enzymatic functions are not equally required, which may indicate that different genes require different enzymatic functions. An analogy for this phenomenon is the handyman principle: while a handyman has many tools, which tool he uses depends on what requires maintenance. Here we analyzed the role of the non-enzymatic core module, which interacts with TBP, during Drosophila oogenesis. We show that depletion of SAGA-specific core subunits blocked egg chamber development at earlier stages than depletion of enzymatic subunits. These results, as well as additional genetic analyses, point to an interaction with TBP and suggest a differential role of SAGA modules at different promoter types. However, SAGA subunits co-occupied all promoter types of active genes in ChIP-seq and ChIP-nexus experiments, and the complex was not specifically associated with distinct promoter types in the ovary. The high-resolution genomic binding profiles were congruent with SAGA recruitment by activators upstream of the start site, and retention on chromatin by interactions with modified histones downstream of the start site. Our data illustrate that a distinct genetic requirement for specific components may conceal the fact that the entire complex is physically present, and suggest that the biological context defines which module functions are critical.
Subject(s)
Drosophila Proteins/metabolism; Drosophila melanogaster/physiology; Histone Acetyltransferases/metabolism; Oogenesis/physiology; Promoter Regions, Genetic; Animals; Drosophila Proteins/genetics; Drosophila melanogaster/genetics; Histone Acetyltransferases/genetics; Oogenesis/genetics
ABSTRACT
Core promoter types differ in the extent to which RNA polymerase II (Pol II) pauses after initiation, but how this affects their tissue-specific gene expression characteristics is not well understood. While promoters with Pol II pausing elements are active throughout development, TATA promoters are highly active in differentiated tissues. We therefore used a genomics approach on late-stage Drosophila embryos to analyze the properties of promoter types. Using tissue-specific Pol II ChIP-seq, we found that paused promoters have high levels of paused Pol II throughout the embryo, even in tissues where the gene is not expressed, while TATA promoters only show Pol II occupancy when the gene is active. The promoter types are associated with different chromatin accessibility in ATAC-seq data and have different expression characteristics in single-cell RNA-seq data. The two promoter types may therefore be optimized for different properties: paused promoters show more consistent expression when active, while TATA promoters have lower background expression when inactive. We propose that tissue-specific genes have evolved to use two different strategies for their differential expression across tissues.
Subject(s)
Drosophila melanogaster/embryology; Gene Expression Profiling/methods; Promoter Regions, Genetic; RNA Polymerase II/metabolism; Animals; Drosophila Proteins/genetics; Drosophila Proteins/metabolism; Drosophila melanogaster/genetics; Drosophila melanogaster/metabolism; Gene Expression Regulation, Developmental; Organ Specificity; Sequence Analysis, RNA; Single-Cell Analysis; TATA Box
ABSTRACT
The TCT core promoter element is present in most ribosomal protein (RP) genes in Drosophila and humans. Here we show that TBP (TATA box-binding protein)-related factor TRF2, but not TBP, is required for transcription of the TCT-dependent RP genes. In cells, TCT-dependent transcription, but not TATA-dependent transcription, increases or decreases upon overexpression or depletion of TRF2. In vitro, purified TRF2 activates TCT but not TATA promoters. ChIP-seq (chromatin immunoprecipitation [ChIP] combined with deep sequencing) experiments revealed the preferential localization of TRF2 at TCT versus TATA promoters. Hence, a specialized TRF2-based RNA polymerase II system functions in the synthesis of RPs and complements the RNA polymerase I and III systems.
Subject(s)
Drosophila/genetics; Drosophila/metabolism; Telomeric Repeat Binding Protein 2/metabolism; Transcription, Genetic/genetics; Amino Acid Motifs; Animals; Cell Line; Gene Expression; Promoter Regions, Genetic; Protein Transport; TATA Box/genetics; TATA-Box Binding Protein/metabolism
ABSTRACT
The HMG-box protein Capicua (Cic) is a conserved transcriptional repressor that functions downstream of receptor tyrosine kinase (RTK) signaling pathways in a relatively simple switch: In the absence of signaling, Cic represses RTK-responsive genes by binding to nearly invariant sites in DNA, whereas activation of RTK signaling down-regulates Cic activity, leading to derepression of its targets. This mechanism controls gene expression in both Drosophila and mammals, but whether Cic can also function via other regulatory mechanisms remains unknown. Here, we characterize an RTK-independent role of Cic in regulating spatially restricted expression of Toll/IL-1 signaling targets in Drosophila embryogenesis. We show that Cic represses those targets by binding to suboptimal DNA sites of lower affinity than its known consensus sites. This binding depends on Dorsal/NF-κB, which translocates into the nucleus upon Toll activation and binds next to the Cic sites. As a result, Cic binds to and represses Toll targets only in regions with nuclear Dorsal. These results reveal a mode of Cic regulation unrelated to the well-established RTK/Cic derepression axis and implicate cooperative binding in conjunction with low-affinity binding sites as an important mechanism of enhancer regulation. Given that Cic plays a role in many developmental and pathological processes in mammals, our results raise the possibility that some of these Cic functions are independent of RTK regulation and may depend on cofactor-assisted DNA binding.
Subject(s)
Drosophila Proteins/metabolism; Drosophila/genetics; HMGB Proteins/metabolism; Receptor Protein-Tyrosine Kinases/metabolism; Repressor Proteins/metabolism; Signal Transduction; Toll-Like Receptors/metabolism; Animals; Basic Helix-Loop-Helix Transcription Factors/genetics; Basic Helix-Loop-Helix Transcription Factors/metabolism; Cell Nucleus/genetics; Cell Nucleus/metabolism; Drosophila/embryology; Drosophila/enzymology; Drosophila/metabolism; Drosophila Proteins/genetics; Female; Gene Expression Regulation, Developmental; HMGB Proteins/genetics; Male; Nuclear Proteins/genetics; Nuclear Proteins/metabolism; Phosphoproteins/genetics; Phosphoproteins/metabolism; Promoter Regions, Genetic; Receptor Protein-Tyrosine Kinases/genetics; Repressor Proteins/genetics; Toll-Like Receptors/genetics; Transcription Factors/genetics; Transcription Factors/metabolism
ABSTRACT
Histone modifications are frequently used as markers for enhancer states, but how to interpret enhancer states in the context of embryonic development is not clear. The poised enhancer signature, involving H3K4me1 and low levels of H3K27ac, has been reported to mark inactive enhancers that are poised for future activation. However, future activation is not always observed, and alternative reasons for the widespread occurrence of this enhancer signature have not been investigated. By analyzing enhancers during dorsal-ventral (DV) axis formation in the Drosophila embryo, we find that the poised enhancer signature is specifically generated during patterning in the tissue where the enhancers are not induced, including at enhancers that are known to be repressed by a transcriptional repressor. These results suggest that, rather than serving exclusively as an intermediate step before future activation, the poised enhancer state may be a mark for spatial regulation during tissue patterning. We discuss the possibility that the poised enhancer state is more generally the result of repression by transcriptional repressors.
Subject(s)
Body Patterning/genetics; Embryonic Development/genetics; Enhancer Elements, Genetic/genetics; Transcription, Genetic; Animals; Drosophila/genetics; Drosophila/growth & development; Epigenetic Repression/genetics; Gene Expression Regulation, Developmental; Histone Code/genetics; Histone-Lysine N-Methyltransferase/genetics; Transcription Factors/genetics
ABSTRACT
Hoxa1 has diverse functional roles in differentiation and development. We identify and characterize properties of regions bound by HOXA1 on a genome-wide basis in differentiating mouse ES cells. HOXA1-bound regions are enriched for clusters of consensus binding motifs for HOX, PBX, and MEIS, and many display co-occupancy of PBX and MEIS. PBX and MEIS are members of the TALE family, and genome-wide analysis of multiple TALE members (PBX, MEIS, TGIF, PREP1, and PREP2) shows that nearly all HOXA1 targets display occupancy of one or more TALE members. The combinatorial binding patterns of TALE proteins define distinct classes of HOXA1 targets, which may create functional diversity. Transgenic reporter assays in zebrafish confirm enhancer activities for many HOXA1-bound regions and the importance of HOX-PBX and TGIF motifs for their regulation. Proteomic analyses show that HOXA1 physically interacts on chromatin with PBX, MEIS, and PREP family members, but not with TGIF, suggesting that TGIF may have an independent input into HOXA1-bound regions. Therefore, TALE proteins appear to represent a wide repertoire of HOX cofactors, which may coregulate enhancers through distinct mechanisms. We also discover extensive auto- and cross-regulatory interactions among the Hoxa1 and TALE genes, indicating that the specificity of HOXA1 during development may be regulated through a complex cross-regulatory network of HOXA1 and TALE proteins. This study provides new insight into a regulatory network involving combinatorial interactions between HOXA1 and TALE proteins.
Subject(s)
Homeodomain Proteins/genetics; Protein Interaction Maps/genetics; Repressor Proteins/genetics; Transcription Factors/genetics; Transcription, Genetic; Animals; Chromatin/genetics; Genome/genetics; Mice; Mouse Embryonic Stem Cells; Protein Binding/genetics; Proteomics
ABSTRACT
Hoxa1 has important functional roles in neural crest specification, hindbrain patterning and heart and ear development, yet the enhancers and genes that are targeted by Hoxa1 are largely unknown. In this study, we performed a comprehensive analysis of Hoxa1 target genes using genome-wide Hoxa1 binding data in mouse ES cells differentiated with retinoic acid (RA) into neural fates in combination with differential gene expression analysis in Hoxa1 gain- and loss-of-function mouse and zebrafish embryos. Our analyses reveal that Hoxa1-bound regions show epigenetic marks of enhancers, occupancy of Hox cofactors and differential expression of nearby genes, suggesting that these regions are enriched for enhancers. In support of this, 80 of them mapped to regions with known reporter activity in transgenic mouse embryos based on the Vista enhancer database. Two additional enhancers in Dok5 and Wls1 were shown to mediate neural expression in developing mouse and zebrafish. Overall, our analysis of the putative target genes indicates that Hoxa1 has input to components of major signaling pathways, including Wnt, TGF-β, Hedgehog and Hippo, and frequently does so by targeting multiple components of a pathway such as secreted inhibitors, ligands, receptors and downstream components. We also identified genes implicated in heart and ear development, neural crest migration and neuronal patterning and differentiation, which may underlie major Hoxa1 mutant phenotypes. Finally, we found evidence for a high degree of evolutionary conservation of many binding regions and downstream targets of Hoxa1 between mouse and zebrafish. Our genome-wide analyses in ES cells suggest that we have enriched for in vivo relevant target genes and pathways associated with functional roles of Hoxa1 in mouse development.
Subject(s)
Embryonic Stem Cells/physiology; Homeodomain Proteins/genetics; Neurons/physiology; Transcription Factors/genetics; Animals; Cell Differentiation/physiology; Embryonic Development/drug effects; Embryonic Development/genetics; Embryonic Development/physiology; Embryonic Stem Cells/cytology; Embryonic Stem Cells/metabolism; Female; Gene Regulatory Networks; Genes, Homeobox; Homeodomain Proteins/metabolism; Mice; Mice, Inbred C57BL; Mice, Transgenic; Neural Crest/cytology; Neurons/cytology; Neurons/metabolism; Pregnancy; Rhombencephalon/cytology; Signal Transduction; Transcription Factors/metabolism; Tretinoin/metabolism; Zebrafish
ABSTRACT
The Drosophila genome activator Vielfaltig (Vfl), also known as Zelda (Zld), is thought to prime enhancers for activation by patterning transcription factors (TFs). Such priming is accompanied by increased chromatin accessibility, but the mechanisms by which this occurs are poorly understood. Here, we analyze the effect of Zld on genome-wide nucleosome occupancy and binding of the patterning TF Dorsal (Dl). Our results show that early enhancers are characterized by an intrinsically high nucleosome barrier. Zld tackles this nucleosome barrier through local depletion of nucleosomes with the effect being dependent on the number and position of Zld motifs. Without Zld, Dl binding decreases at enhancers and redistributes to open regions devoid of enhancer activity. We propose that Zld primes enhancers by lowering the high nucleosome barrier just enough to assist TFs in accessing their binding motifs and promoting spatially controlled enhancer activation if the right patterning TFs are present. We envision that genome activators in general will utilize this mechanism to activate the zygotic genome in a robust and precise manner.
Subject(s)
Drosophila Proteins/metabolism; Drosophila melanogaster/genetics; Gene Expression Regulation, Developmental; Nucleosomes/metabolism; Transcription Factors/metabolism; Animals; Chromatin/genetics; Chromatin/metabolism; Drosophila Proteins/genetics; Drosophila melanogaster/embryology; Genetic Association Studies; Nuclear Proteins; Nucleosomes/genetics; Promoter Regions, Genetic; Sequence Alignment; Sequence Analysis, DNA; Transcription Factors/genetics; Transcriptional Activation
ABSTRACT
The rapid expansion of genomics methods has enabled developmental biologists to address fundamental questions of developmental gene regulation on a genome-wide scale. These efforts have demonstrated that transcription of developmental control genes by RNA polymerase II (Pol II) is commonly regulated at the transition to productive elongation, resulting in the promoter-proximal accumulation of transcriptionally engaged but paused Pol II prior to gene induction. Here we review the mechanisms and possible functions of Pol II pausing and their implications for development.
Subject(s)
Gene Expression Regulation, Developmental; RNA Polymerase II/metabolism; Animals; Drosophila/genetics; Drosophila/growth & development; Drosophila/metabolism; Genes, Insect; Promoter Regions, Genetic; Transcription Elongation, Genetic
ABSTRACT
St. Louis and its famous Gateway Arch were the setting of the Special Symposium: Evolution and Core Processes in Gene Regulation, sponsored by the American Society for Biochemistry and Molecular Biology. Biochemists and evolutionary biologists highlighted growing connections between studies of biochemical mechanism and natural selection on gene expression.
Subject(s)
Biomedical Research/trends; Evolution, Molecular; Gene Expression Regulation; Animals; Biochemistry; Congresses as Topic; Genetic Code; Humans; Missouri; Molecular Biology; Workforce
ABSTRACT
Throughout Metazoa, developmental processes are controlled by a surprisingly limited number of conserved signaling pathways. Precisely how these signaling cassettes were assembled in early animal evolution remains poorly understood, as do the molecular transitions that potentiated the acquisition of their myriad developmental functions. Here we analyze the molecular evolution of the proto-oncogene yes-associated protein (Yap)/Yorkie, a key effector of the Hippo signaling pathway that controls organ size in both Drosophila and mammals. Based on heterologous functional analysis of evolutionarily distant Yap/Yorkie orthologs, we demonstrate that a structurally distinct interaction interface between Yap/Yorkie and its partner TEAD/Scalloped became fixed in the eumetazoan common ancestor. We then combine transcriptional profiling of tissues expressing phylogenetically diverse forms of Yap/Yorkie with ChIP-seq validation to identify a common downstream gene expression program underlying the control of tissue growth in Drosophila. Intriguingly, a subset of the newly identified Yorkie target genes are also induced by Yap in mammalian tissues, thus revealing a conserved Yap-dependent gene expression signature likely to mediate organ size control throughout bilaterian animals. Combined, these experiments provide new mechanistic insights while revealing the ancient evolutionary history of Hippo signaling.
Subject(s)
Drosophila Proteins/metabolism; Evolution, Molecular; Intracellular Signaling Peptides and Proteins/metabolism; Protein Serine-Threonine Kinases/metabolism; Trans-Activators/genetics; Animals; Base Sequence; Drosophila Proteins/chemistry; Drosophila Proteins/genetics; Drosophila melanogaster/genetics; Drosophila melanogaster/growth & development; Eye/growth & development; Eye/metabolism; Gene Expression Profiling; Gene Expression Regulation, Developmental; Humans; Mammals/metabolism; Molecular Sequence Data; Nuclear Proteins/chemistry; Nuclear Proteins/genetics; Nuclear Proteins/metabolism; Phylogeny; Protein Structure, Tertiary; Proto-Oncogene Mas; Sequence Analysis, RNA; Trans-Activators/chemistry; Trans-Activators/metabolism; YAP-Signaling Proteins
ABSTRACT
MOTIVATION: Chromatin immunoprecipitation coupled to next-generation sequencing (ChIP-seq) is widely used to study the in vivo binding sites of transcription factors (TFs) and their regulatory targets. Recent improvements to ChIP-seq, such as increased resolution, promise deeper insights into transcriptional regulation, yet require novel computational tools to fully leverage their advantages. RESULTS: To this aim, we have developed peakzilla, which can identify closely spaced TF binding sites at high resolution (i.e. resolves individual binding sites even if spaced closely), as we demonstrate using semisynthetic datasets, performing ChIP-seq for the TF Twist in Drosophila embryos with different experimental fragment sizes, and analyzing ChIP-exo datasets. We show that the increased resolution reached by peakzilla is highly relevant, as closely spaced Twist binding sites are strongly enriched in transcriptional enhancers, suggesting a signature to discriminate functional from abundant non-functional or neutral TF binding. Peakzilla is easy to use, as it estimates all the necessary parameters from the data and is freely available. AVAILABILITY AND IMPLEMENTATION: The peakzilla program is available from https://github.com/steinmann/peakzilla or http://www.starklab.org/data/peakzilla/. CONTACT: stark@starklab.org. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
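To make the link between fragment size and resolution concrete, here is a toy sketch. It is not peakzilla's algorithm; it only models each binding site's signal as a Gaussian bump whose width stands in for fragment size, then counts local maxima in the summed profile. The function names, site spacing, and widths are hypothetical choices for illustration.

```python
import math

def coverage_profile(site_positions, width, length):
    """Sum of Gaussian bumps, one per binding site (toy stand-in for read coverage)."""
    return [
        sum(math.exp(-((x - p) ** 2) / (2 * width ** 2)) for p in site_positions)
        for x in range(length)
    ]

def find_summits(profile):
    """Indices of local maxima in the profile (a tied plateau is counted once)."""
    return [
        i for i in range(1, len(profile) - 1)
        if profile[i] > profile[i - 1] and profile[i] >= profile[i + 1]
    ]

# Two hypothetical sites 60 bp apart, profiled at two signal widths.
sites = [100, 160]
narrow = find_summits(coverage_profile(sites, width=20, length=261))
wide = find_summits(coverage_profile(sites, width=40, length=261))
```

With the narrow width the two sites appear as separate summits, whereas with the wide width they merge into a single summit between them. This is the general reason shorter fragments and higher-resolution assays help separate closely spaced sites.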
Subject(s)
Algorithms; Chromatin Immunoprecipitation/methods; High-Throughput Nucleotide Sequencing/methods; Sequence Analysis, DNA/methods; Transcription Factors/metabolism; Animals; Binding Sites; Drosophila/genetics; Drosophila Proteins/metabolism; Enhancer Elements, Genetic; Humans; Mice; Twist-Related Protein 1/metabolism
ABSTRACT
Identifying the molecular origins by which new morphological structures evolve is one of the long-standing problems in evolutionary biology. To date, vanishingly few examples provide a compelling account of how new morphologies were initially formed, thereby limiting our understanding of how diverse forms of life derived their complex features. Here, we provide evidence that the large projections on the Drosophila eugracilis phallus that are implicated in sexual conflict have evolved through the partial co-option of the trichome genetic network. These unicellular apical projections on the phallus postgonal sheath are reminiscent of trichomes that cover the Drosophila body but are up to 20-fold larger in size. During their development, they express the transcription factor Shavenbaby, the master regulator of the trichome network. Consistent with the co-option of the Shavenbaby network during the evolution of the D. eugracilis projections, somatic mosaic CRISPR-Cas9 mutagenesis shows that shavenbaby is necessary for their proper length. Moreover, misexpression of Shavenbaby in the sheath of D. melanogaster, a naive species that lacks these projections, is sufficient to induce small trichomes. These induced projections rely on a genetic network that is shared to a large extent with the D. eugracilis projections, indicating its partial co-option but also some genetic rewiring. Thus, by leveraging a genetically tractable evolutionary novelty, our work shows that the trichome-forming network is flexible enough that it can be partially co-opted in a new context and subsequently refined to produce unique apical projections that are barely recognizable compared with their simpler ancestral beginnings.
ABSTRACT
Identifying the molecular origins by which new morphological structures evolve is one of the long-standing problems in evolutionary biology. To date, vanishingly few examples provide a compelling account of how new morphologies were initially formed, thereby limiting our understanding of how diverse forms of life derived their complex features. Here, we provide evidence that the large projections on the Drosophila eugracilis phallus that are implicated in sexual conflict have evolved through co-option of the trichome genetic network. These unicellular apical projections on the phallus postgonal sheath are reminiscent of trichomes that cover the Drosophila body but are up to 20-fold larger in size. During their development, they express the transcription factor Shavenbaby, the master regulator of the trichome network. Consistent with the co-option of the Shavenbaby network during the evolution of the D. eugracilis projections, somatic mosaic CRISPR/Cas9 mutagenesis shows that shavenbaby is necessary for their proper length. Moreover, mis-expression of Shavenbaby in the sheath of D. melanogaster, a naïve species that lacks these extensions, is sufficient to induce small trichomes. These induced extensions rely on a genetic network that is shared to a large extent with the D. eugracilis projections, indicating its co-option but also some genetic rewiring. Thus, by leveraging a genetically tractable evolutionary novelty, our work shows that the trichome-forming network is flexible enough that it can be co-opted in a new context, and subsequently refined to produce unique apical projections that are barely recognizable compared to their simpler ancestral beginnings.
ABSTRACT
While the accessibility of enhancers is dynamically regulated during development, promoters tend to be constitutively accessible and poised for activation by paused Pol II. By studying Lola-I, a Drosophila zinc finger transcription factor, we show here that the promoter state can also be subject to developmental regulation independently of gene activation. Lola-I is ubiquitously expressed at the end of embryogenesis and causes its target promoters to become accessible and acquire paused Pol II throughout the embryo. This promoter transition is required but not sufficient for tissue-specific target gene activation. Lola-I mediates this function by depleting promoter nucleosomes, similar to the action of pioneer factors at enhancers. These results uncover a level of regulation for promoters that is normally found at enhancers and reveal a mechanism for the de novo establishment of paused Pol II at promoters.
Subject(s)
Drosophila; Embryo, Mammalian; Animals; Promoter Regions, Genetic/genetics; Drosophila/genetics; Embryonic Development; Nucleosomes/genetics; RNA Polymerase II/genetics
ABSTRACT
What new questions can we ask about transcriptional regulation given recent developments in large-scale approaches?
Subject(s)
Gene Expression Regulation; Gene Expression Regulation/genetics
ABSTRACT
Transcription factors (TFs) are proteins that bind DNA in a sequence-specific manner to regulate gene transcription. Despite their unique intrinsic sequence preferences, in vivo genomic occupancy profiles of TFs differ across cellular contexts. Hence, deciphering the sequence determinants of TF binding, both intrinsic and context-specific, is essential to understand gene regulation and the impact of regulatory, non-coding genetic variation. Biophysical models trained on in vitro TF binding assays can estimate intrinsic affinity landscapes and predict occupancy based on TF concentration and affinity. However, these models cannot adequately explain context-specific, in vivo binding profiles. Conversely, deep learning models, trained on in vivo TF binding assays, effectively predict and explain genomic occupancy profiles as a function of complex regulatory sequence syntax, albeit without a clear biophysical interpretation. To reconcile these complementary models of in vitro and in vivo TF binding, we developed Affinity Distillation (AD), a method that extracts thermodynamic affinities de novo from deep learning models of TF chromatin immunoprecipitation (ChIP) experiments by marginalizing away the influence of genomic sequence context. Applied to neural networks modeling diverse classes of yeast and mammalian TFs, AD predicts energetic impacts of sequence variation within and surrounding motifs on TF binding as measured by diverse in vitro assays with superior dynamic range and accuracy compared to motif-based methods. Furthermore, AD can accurately discern affinities of TF paralogs. Our results highlight thermodynamic affinity as a key determinant of in vivo binding, suggest that deep learning models of in vivo binding implicitly learn high-resolution affinity landscapes, and show that these affinities can be successfully distilled using AD. This new biophysical interpretation of deep learning models enables high-throughput in silico experiments to explore the influence of sequence context and variation on both intrinsic affinity and in vivo occupancy.
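The idea of "marginalizing away" sequence context can be illustrated with a small sketch: insert a candidate motif into many random background sequences and average the change in a model's prediction. This is an assumption-laden toy, not the authors' implementation. `toy_model` is a hypothetical stand-in for a trained network, and the consensus string, background length, and number of backgrounds are arbitrary.

```python
import numpy as np

BASES = list("ACGT")

def random_background(rng, length):
    """Draw a uniformly random DNA sequence to serve as neutral context."""
    return "".join(rng.choice(BASES, size=length))

def insert_at_center(background, motif):
    """Overwrite the middle of the background with the motif."""
    start = (len(background) - len(motif)) // 2
    return background[:start] + motif + background[start + len(motif):]

def toy_model(seq, consensus="TGAATGAA"):
    """Hypothetical stand-in for a trained model: best consensus match plus a context term."""
    k = len(consensus)
    best = max(
        sum(a == b for a, b in zip(seq[i:i + k], consensus))
        for i in range(len(seq) - k + 1)
    )
    gc_fraction = sum(c in "GC" for c in seq) / len(seq)  # context-dependent nuisance signal
    return best + 2.0 * gc_fraction

def marginal_score(model, motif, n_backgrounds=200, length=100, seed=0):
    """Mean change in model output when the motif is inserted, averaged over backgrounds."""
    rng = np.random.default_rng(seed)
    deltas = []
    for _ in range(n_backgrounds):
        background = random_background(rng, length)
        deltas.append(model(insert_at_center(background, motif)) - model(background))
    return float(np.mean(deltas))

strong = marginal_score(toy_model, "TGAATGAA")  # exact consensus
weak = marginal_score(toy_model, "TGAATGCC")    # two mismatches
```

Because the same backgrounds are used for both motifs (fixed seed), the difference between the two scores reflects the motif itself rather than the surrounding context. In this toy setup the exact consensus scores higher than the mismatched variant.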
ABSTRACT
Chromatin accessibility is integral to the process by which transcription factors (TFs) read out cis-regulatory DNA sequences, but it is difficult to differentiate between TFs that drive accessibility and those that do not. Deep learning models that learn complex sequence rules provide an unprecedented opportunity to dissect this problem. Using zygotic genome activation in Drosophila as a model, we analyzed high-resolution TF binding and chromatin accessibility data with interpretable deep learning and performed genetic validation experiments. We identify a hierarchical relationship between the pioneer TF Zelda and the TFs involved in axis patterning. Zelda consistently pioneers chromatin accessibility proportional to motif affinity, whereas patterning TFs augment chromatin accessibility in sequence contexts where they mediate enhancer activation. We conclude that chromatin accessibility occurs in two tiers: one through pioneering, which makes enhancers accessible but not necessarily active, and the second when the correct combination of TFs leads to enhancer activation.