ABSTRACT
To define the multi-cellular epigenomic and transcriptional landscape of cardiac cellular development, we generated single-cell chromatin accessibility maps of human fetal heart tissues. We identified eight major differentiation trajectories involving primary cardiac cell types, each associated with dynamic transcription factor (TF) activity signatures. We contrasted regulatory landscapes of iPSC-derived cardiac cell types and their in vivo counterparts, which enabled optimization of in vitro differentiation of epicardial cells. Further, we interpreted sequence-based deep learning models of cell-type-resolved chromatin accessibility profiles to decipher underlying TF motif lexicons. De novo mutations predicted to affect chromatin accessibility in arterial endothelium were enriched in congenital heart disease (CHD) cases vs. controls. In vitro studies in iPSCs validated the functional impact of identified variation on the predicted developmental cell types. This work thus defines the cell-type-resolved cis-regulatory sequence determinants of heart development and identifies disruption of cell-type-specific regulatory elements in CHD.
Subject(s)
Chromatin , Heart Defects, Congenital , Humans , Chromatin/genetics , Heart Defects, Congenital/genetics , Heart , Mutation , Single-Cell Analysis
ABSTRACT
Thousands of proteins localize to the nucleus; however, it remains unclear which contain transcriptional effectors. Here, we develop HT-recruit, a pooled assay where protein libraries are recruited to a reporter, and their transcriptional effects are measured by sequencing. Using this approach, we measure gene silencing and activation for thousands of domains. We find a relationship between repressor function and evolutionary age for the KRAB domains, discover that Homeodomain repressor strength is collinear with Hox genetic organization, and identify activities for several domains of unknown function. Deep mutational scanning of the CRISPRi KRAB maps the co-repressor binding surface and identifies substitutions that improve stability/silencing. By tiling 238 proteins, we find repressors as short as ten amino acids. Finally, we report new activator domains, including a divergent KRAB. These results provide a resource of 600 human proteins containing effectors and demonstrate a scalable strategy for assigning functions to protein domains.
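As a rough illustration of how a pooled recruitment screen can be scored from sequencing counts, the sketch below computes a per-domain log2 enrichment between two sorted cell populations. The abstract states only that transcriptional effects are measured by sequencing; the OFF/ON population split, the pseudocount, the function name `log2_off_on`, and the example counts are all assumptions made for illustration, not the authors' pipeline.

```python
import math
from typing import Dict


def log2_off_on(off_counts: Dict[str, int],
                on_counts: Dict[str, int],
                pseudocount: float = 1.0) -> Dict[str, float]:
    """Hypothetical scoring: log2 ratio of read fractions in the
    reporter-OFF versus reporter-ON population for each library member.

    Positive scores suggest enrichment among silenced cells (repressor-like);
    negative scores suggest enrichment among expressing cells.
    """
    off_total = sum(off_counts.values())
    on_total = sum(on_counts.values())
    scores = {}
    for name in off_counts:
        # Normalize by sequencing depth; the pseudocount avoids log(0).
        off_frac = (off_counts[name] + pseudocount) / off_total
        on_frac = (on_counts.get(name, 0) + pseudocount) / on_total
        scores[name] = math.log2(off_frac / on_frac)
    return scores


# Made-up read counts for two hypothetical library members.
off = {"domain_A": 99, "domain_B": 9}
on = {"domain_A": 9, "domain_B": 99}
scores = log2_off_on(off, on)
for name, score in scores.items():
    print(f"{name}: {score:+.2f}")
```

With these invented counts, `domain_A` scores positive (repressor-like) and `domain_B` scores negative; real screens would also need replicate handling and count filtering, which this sketch omits.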
Subject(s)
High-Throughput Screening Assays , Transcription Factors/metabolism , Amino Acid Sequence , CRISPR-Cas Systems/genetics , Female , Gene Silencing , Genes, Reporter , HEK293 Cells , Homeodomain Proteins/genetics , Homeodomain Proteins/metabolism , Humans , K562 Cells , Lentivirus/physiology , Molecular Sequence Annotation , Mutation/genetics , Nuclear Proteins/metabolism , Promoter Regions, Genetic/genetics , Protein Domains , Repressor Proteins/chemistry , Repressor Proteins/metabolism , Reproducibility of Results , Transcription, Genetic , Zinc Fingers
ABSTRACT
GENCODE produces high-quality gene and transcript annotation for the human and mouse genomes. All GENCODE annotation is supported by experimental data and serves as a reference for genome biology and clinical genomics. The GENCODE consortium generates targeted experimental data, develops bioinformatic tools and carries out analyses that, along with externally produced data and methods, support the identification and annotation of transcript structures and the determination of their function. Here, we present an update on the annotation of human and mouse genes, including developments in the tools, data, analyses and major collaborations that underpin this progress. For example, we report the creation of a set of non-canonical ORFs identified in GENCODE transcripts, the LRGASP collaboration to assess the use of long transcriptomic data to build transcript models, the progress in collaborations with RefSeq and UniProt to increase convergence in the annotation of human and mouse protein-coding genes, the propagation of GENCODE across the human pan-genome and the development of new tools to support annotation of regulatory features by GENCODE. Our annotation is accessible via Ensembl, the UCSC Genome Browser and https://www.gencodegenes.org.
Subject(s)
Computational Biology , Genome, Human , Humans , Animals , Mice , Molecular Sequence Annotation , Computational Biology/methods , Genome, Human/genetics , Transcriptome/genetics , Gene Expression Profiling , Databases, Genetic
ABSTRACT
BACKGROUND: Many transcription factors (TFs), such as multi zinc-finger (ZF) TFs, have multiple DNA binding domains (DBDs), and deciphering the DNA binding motifs of individual DBDs is a major challenge. One example of such a TF is CCCTC-binding factor (CTCF), a TF with eleven ZFs that plays a variety of roles in transcriptional regulation, most notably anchoring DNA loops. Previous studies found that CTCF ZFs 3-7 bind CTCF's core motif and ZFs 9-11 bind a specific upstream motif, but the motifs of ZFs 1-2 have yet to be identified. RESULTS: We developed a new approach to identifying the binding motifs of individual DBDs of a TF by analyzing chromatin immunoprecipitation sequencing (ChIP-seq) experiments in which a single DBD is mutated: we train a deep convolutional neural network to predict whether wild-type TF binding sites are preserved in the mutant TF dataset and then interpret the model. We applied this approach to mouse CTCF ChIP-seq data and identified the known binding preferences of CTCF ZFs 3-11 as well as a putative GAG binding motif for ZF 1. We analyzed other CTCF datasets to provide additional evidence that ZF 1 is associated with binding at the motif we identified, and we found that the presence of the motif for ZF 1 is associated with CTCF ChIP-seq peak strength. CONCLUSIONS: Our approach can be applied to any TF for which in vivo binding data from both the wild-type and mutated versions of the TF are available, and our findings provide potential new insights into the binding preferences of CTCF's DBDs.
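The classification task described in RESULTS needs a label for each wild-type binding site saying whether it is preserved in the mutant dataset. The sketch below shows one plausible way to derive such labels by interval overlap between peak sets. The abstract does not say how preservation is defined, so the any-overlap criterion, the `Peak` tuple layout, the function names, and the example coordinates are assumptions for illustration only; the neural network itself is not shown.

```python
from typing import List, Tuple

# Hypothetical peak representation: (chromosome, start, end), half-open.
Peak = Tuple[str, int, int]


def overlaps(a: Peak, b: Peak) -> bool:
    """True if two half-open intervals on the same chromosome share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def label_preserved(wild_type: List[Peak], mutant: List[Peak]) -> List[int]:
    """Label each wild-type peak 1 if any mutant peak overlaps it, else 0.

    This is a quadratic scan, adequate for a toy example; sorted or
    indexed interval lookups would be needed for genome-scale peak sets.
    """
    return [int(any(overlaps(wt, m) for m in mutant)) for wt in wild_type]


# Made-up coordinates, not real CTCF peaks.
wt_peaks = [("chr1", 100, 300), ("chr1", 1000, 1200), ("chr2", 50, 250)]
mut_peaks = [("chr1", 250, 400), ("chr2", 500, 700)]
labels = label_preserved(wt_peaks, mut_peaks)
print(labels)  # [1, 0, 0]
```

In a setup like this, the sequences under the wild-type peaks would be the model inputs and `labels` the binary targets; sites labeled 0 are the ones whose loss may point to the mutated DBD's motif.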