ABSTRACT
Human limbs emerge during the fourth post-conception week as mesenchymal buds, which develop into fully formed limbs over the subsequent months [1]. This process is orchestrated by numerous temporally and spatially restricted gene expression programmes, making congenital alterations in phenotype common [2]. Decades of work with model organisms have defined the fundamental mechanisms underlying vertebrate limb development, but an in-depth characterization of this process in humans has yet to be performed. Here we detail human embryonic limb development across space and time using single-cell and spatial transcriptomics. We demonstrate extensive diversification of cells from a few multipotent progenitors to myriad differentiated cell states, including several novel cell populations. We uncover two waves of human muscle development, each characterized by different cell states regulated by separate gene expression programmes, and identify musculin (MSC) as a key transcriptional repressor maintaining muscle stem cell identity. Through assembly of multiple anatomically continuous spatial transcriptomic samples using VisiumStitcher, we map cells across a sagittal section of a whole fetal hindlimb. We reveal a clear anatomical segregation between genes linked to brachydactyly and polysyndactyly, and uncover transcriptionally and spatially distinct populations of the mesenchyme in the autopod. Finally, we perform single-cell RNA sequencing on mouse embryonic limbs to facilitate cross-species developmental comparison, finding substantial homology between the two species.
ABSTRACT
T cell development comprises a stepwise process of commitment from a multipotent precursor. To define molecular mechanisms controlling this progression, we probed five stages spanning the commitment process using RNA-seq and ChIP-seq to track genome-wide shifts in transcription, cohorts of active transcription factor genes, histone modifications at diverse classes of cis-regulatory elements, and binding repertoire of GATA-3 and PU.1, transcription factors with complementary roles in T cell development. The results highlight potential promoter-distal cis-regulatory elements in play and reveal both activation sites and diverse mechanisms of repression that silence genes used in alternative lineages. Histone marking is dynamic and reversible, and though permissive marks anticipate, repressive marks often lag behind changes in transcription. In vivo binding of PU.1 and GATA-3 relative to epigenetic marking reveals distinctive factor-specific rules for recruitment of these crucial transcription factors to different subsets of their potential sites, dependent on dose and developmental context.
Subjects
Cell Differentiation , Epigenesis, Genetic , T-Lymphocytes/cytology , Animals , GATA3 Transcription Factor/metabolism , Gene Expression Regulation , Genome-Wide Association Study , Histone Code , Mice , Mice, Inbred C57BL , Promoter Regions, Genetic , Proto-Oncogene Proteins/metabolism , Receptors, Notch/metabolism , Regulatory Elements, Transcriptional , Signal Transduction , T-Lymphocytes/metabolism , Trans-Activators/metabolism , Transcription, Genetic
ABSTRACT
Higher-order chromosomal organization for transcription regulation is poorly understood in eukaryotes. Using genome-wide Chromatin Interaction Analysis with Paired-End-Tag sequencing (ChIA-PET), we mapped long-range chromatin interactions associated with RNA polymerase II in human cells and uncovered widespread promoter-centered intragenic, extragenic, and intergenic interactions. These interactions further aggregated into higher-order clusters, wherein proximal and distal genes were engaged through promoter-promoter interactions. Most genes with promoter-promoter interactions were active and transcribed cooperatively, and some interacting promoters could influence each other implying combinatorial complexity of transcriptional controls. Comparative analyses of different cell lines showed that cell-specific chromatin interactions could provide structural frameworks for cell-specific transcription, and suggested significant enrichment of enhancer-promoter interactions for cell-specific functions. Furthermore, genetically-identified disease-associated noncoding elements were found to be spatially engaged with corresponding genes through long-range interactions. Overall, our study provides insights into transcription regulation by three-dimensional chromatin interactions for both housekeeping and cell-specific genes in human cells.
Subjects
Chromatin/metabolism , Gene Expression Regulation , Promoter Regions, Genetic , RNA Polymerase II/metabolism , Transcription, Genetic , Cell Line, Tumor , Chromatin Immunoprecipitation , Enhancer Elements, Genetic , Genome-Wide Association Study , Humans
ABSTRACT
Transcription factors (TFs) are trans-acting proteins that bind cis-regulatory elements (CREs) in DNA to control gene expression. Here, we analyzed the genomic localization profiles of 529 sequence-specific TFs and 151 cofactors and chromatin regulators in the human cancer cell line HepG2, for a total of 680 broadly termed DNA-associated proteins (DAPs). We used this deep collection to model each TF's impact on gene expression, and identified a cohort of 26 candidate transcriptional repressors. We examine high occupancy target (HOT) sites in the context of three-dimensional genome organization and show biased motif placement in distal-promoter connections involving HOT sites. We also found a substantial number of closed chromatin regions with multiple DAPs bound, and explored their properties, finding that a MAFF/MAFK TF pair correlates with transcriptional repression. Altogether, these analyses provide novel insights into the regulatory logic of the human cell line HepG2 genome and show the usefulness of large genomic analyses for elucidation of individual TF functions.
ABSTRACT
Transcription factors are DNA-binding proteins that have key roles in gene regulation [1,2]. Genome-wide occupancy maps of transcriptional regulators are important for understanding gene regulation and its effects on diverse biological processes [3-6]. However, only a minority of the more than 1,600 transcription factors encoded in the human genome has been assayed. Here we present, as part of the ENCODE (Encyclopedia of DNA Elements) project, data and analyses from chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) experiments using the human HepG2 cell line for 208 chromatin-associated proteins (CAPs). These comprise 171 transcription factors and 37 transcriptional cofactors and chromatin regulator proteins, and represent nearly one-quarter of CAPs expressed in HepG2 cells. The binding profiles of these CAPs form major groups associated predominantly with promoters or enhancers, or with both. We confirm and expand the current catalogue of DNA sequence motifs for transcription factors, and describe motifs that correspond to other transcription factors that are co-enriched with the primary ChIP target. For example, FOX family motifs are enriched in ChIP-seq peaks of 37 other CAPs. We show that motif content and occupancy patterns can distinguish between promoters and enhancers. This catalogue reveals high-occupancy target regions at which many CAPs associate, although each contains motifs for only a minority of the numerous associated transcription factors. These analyses provide a more complete overview of the gene regulatory networks that define this cell type, and demonstrate the usefulness of the large-scale production efforts of the ENCODE Consortium.
Subjects
Chromatin Immunoprecipitation Sequencing , Chromatin/genetics , Chromatin/metabolism , DNA-Binding Proteins/metabolism , Molecular Sequence Annotation , Regulatory Sequences, Nucleic Acid/genetics , Datasets as Topic , Enhancer Elements, Genetic/genetics , Hep G2 Cells , Humans , Nucleotide Motifs/genetics , Promoter Regions, Genetic/genetics , Protein Binding , Transcription Factors/metabolism
ABSTRACT
During mammalian embryogenesis, differential gene expression gradually builds the identity and complexity of each tissue and organ system [1]. Here we systematically quantified mouse polyA-RNA from day 10.5 of embryonic development to birth, sampling 17 tissues and organs. The resulting developmental transcriptome is globally structured by dynamic cytodifferentiation, body-axis and cell-proliferation gene sets that were further characterized by the transcription factor motif codes of their promoters. We decomposed the tissue-level transcriptome using single-cell RNA-seq (sequencing of RNA reverse transcribed into cDNA) and found that neurogenesis and haematopoiesis dominate at both the gene and cellular levels, jointly accounting for one-third of differential gene expression and more than 40% of identified cell types. By integrating promoter sequence motifs with companion ENCODE epigenomic profiles, we identified a prominent promoter de-repression mechanism in neuronal expression clusters that was attributable to known and novel repressors. Focusing on the developing limb, single-cell RNA data identified 25 candidate cell types that included progenitor and differentiating states with computationally inferred lineage relationships. We extracted cell-type transcription factor networks and complementary sets of candidate enhancer elements by using single-cell RNA-seq to decompose integrative cis-element (IDEAS) models that were derived from whole-tissue epigenome chromatin data. These ENCODE reference data, computed network components and IDEAS chromatin segmentations are companion resources to the matching epigenomic developmental matrix, and are available for researchers to further mine and integrate.
Subjects
Embryo, Mammalian/cytology , Embryo, Mammalian/embryology , Embryonic Development/genetics , Gene Expression Regulation, Developmental , Single-Cell Analysis , Transcriptome , Animals , Cell Differentiation/genetics , Cell Lineage/genetics , Chromatin/genetics , Embryo, Mammalian/metabolism , Enhancer Elements, Genetic , Epigenomics , Extremities/embryology , Female , Male , Mice , Poly A/genetics , Poly A/metabolism , Promoter Regions, Genetic , RNA-Seq , Transcription Factors/metabolism
ABSTRACT
SUMMARY: Large-scale sharing of genomic quantification data requires standardized access interfaces. In this Global Alliance for Genomics and Health project, we developed RNAget, an API for secure access to genomic quantification data in matrix form. RNAget provides for slicing matrices to extract desired subsets of data and is applicable to all expression matrix-format data, including RNA sequencing and microarrays. Further, it generalizes to quantification matrices of other sequence-based genomics such as ATAC-seq and ChIP-seq. AVAILABILITY AND IMPLEMENTATION: https://ga4gh-rnaseq.github.io/schema/docs/index.html.
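To make "slicing matrices to extract desired subsets of data" concrete, the following is a minimal local sketch in Python. It does not call RNAget or reproduce its endpoints or parameters; the function name `slice_matrix`, the argument names, and the toy values are all hypothetical and chosen only to illustrate selecting rows (features) and columns (samples) from an expression matrix.

```python
from typing import Dict, List, Optional, Sequence


def slice_matrix(
    values: List[List[float]],
    feature_ids: Sequence[str],
    sample_ids: Sequence[str],
    want_features: Optional[Sequence[str]] = None,
    want_samples: Optional[Sequence[str]] = None,
) -> Dict[str, object]:
    """Return the sub-matrix restricted to the requested rows and columns.

    Rows are features (e.g. genes) and columns are samples. Passing None for
    either filter keeps that whole axis. Unknown IDs raise KeyError.
    This is an illustrative helper, not part of any RNAget client.
    """
    row_index = {fid: i for i, fid in enumerate(feature_ids)}
    col_index = {sid: j for j, sid in enumerate(sample_ids)}
    rows = list(feature_ids) if want_features is None else list(want_features)
    cols = list(sample_ids) if want_samples is None else list(want_samples)
    sub = [[values[row_index[f]][col_index[s]] for s in cols] for f in rows]
    return {"feature_ids": rows, "sample_ids": cols, "values": sub}


# Toy expression matrix: 3 genes x 3 samples (made-up numbers).
genes = ["geneA", "geneB", "geneC"]
samples = ["s1", "s2", "s3"]
matrix = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
]

# Keep two genes (in the requested order) and a single sample.
result = slice_matrix(matrix, genes, samples, ["geneC", "geneA"], ["s2"])
print(result["values"])
```

A server-side implementation would presumably apply the same row and column selection before returning data, so that clients transfer only the subset they need; the details of how RNAget expresses such filters are documented at the URL given in the abstract.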
Subjects
RNA , Software , Genomics , Genome , Sequence Analysis, RNA
ABSTRACT
DNA-associated proteins (DAPs) classically regulate gene expression by binding to regulatory loci such as enhancers or promoters. As expanding catalogs of genome-wide DAP binding maps reveal thousands of loci that, unlike the majority of conventional enhancers and promoters, associate with dozens of different DAPs with apparently little regard for motif preference, an understanding of DAP association and coordination at such regulatory loci is essential to deciphering how these regions contribute to normal development and disease. In this study, we aggregated publicly available ChIP-seq data from 469 human DAPs assayed in three cell lines and integrated these data with an orthogonal data set of 352 nonredundant, in vitro-derived motifs mapped to the genome within DNase I hypersensitivity footprints to characterize regions with high numbers of DAP associations. We establish a generalizable definition for high occupancy target (HOT) loci and identify putative driver DAP motifs in HepG2 cells, including HNF4A, SP1, SP5, and ETV4, that are highly prevalent and show sequence conservation at HOT loci. The number of different DAPs associated with an element is positively associated with evidence of regulatory activity, and by systematically mutating 245 HOT loci with a massively parallel mutagenesis assay, we localized regulatory activity to a central core region that depends on the motif sequences of our previously nominated driver DAPs. In sum, this work leverages the increasingly large number of DAP motif and ChIP-seq data publicly available to explore how DAP associations contribute to genome-wide transcriptional regulation.
Subjects
Enhancer Elements, Genetic , Gene Expression Regulation , Promoter Regions, Genetic , Transcription Factors/metabolism , Base Composition , Cell Line , Chromatin/chemistry , Chromatin Immunoprecipitation Sequencing , DNA/chemistry , Genetic Loci , Genome , Hep G2 Cells , Humans , Mutagenesis , Mutation , Nucleotide Motifs
ABSTRACT
MicroRNAs (miRNAs) play a critical role as posttranscriptional regulators of gene expression. The ENCODE Project profiled the expression of miRNAs in an extensive set of organs during a time-course of mouse embryonic development and captured the expression dynamics of 785 miRNAs. We found distinct organ-specific and developmental stage-specific miRNA expression clusters, with an overall pattern of increasing organ-specific expression as embryonic development proceeds. Comparative analysis of conserved miRNAs in mouse and human revealed stronger clustering of expression patterns by organ type rather than by species. An analysis of messenger RNA expression clusters compared with miRNA expression clusters identifies the potential role of specific miRNA expression clusters in suppressing the expression of mRNAs specific to other developmental programs in the organ in which these miRNAs are expressed during embryonic development. Our results provide the most comprehensive time-course of miRNA expression as part of an integrated ENCODE reference data set for mouse embryonic development.
Subjects
Embryonic Development/genetics , MicroRNAs/genetics , Animals , Female , Gene Expression Regulation, Developmental , Mice , Pregnancy , RNA, Messenger/genetics
ABSTRACT
RNA-sequencing (RNA-seq) is commonly used to identify genetic modules that respond to perturbations. In single cells, transcriptomes have been used as phenotypes, but this concept has not been applied to whole-organism RNA-seq. Also, quantifying and interpreting epistatic effects using expression profiles remains a challenge. We developed a single coefficient to quantify transcriptome-wide epistasis that reflects the underlying interactions and which can be interpreted intuitively. To demonstrate our approach, we sequenced four single and two double mutants of Caenorhabditis elegans. From these mutants, we reconstructed the known hypoxia pathway. In addition, we uncovered a class of 56 genes with HIF-1-dependent expression that have opposite changes in expression in mutants of two genes that cooperate to negatively regulate HIF-1 abundance; however, the double mutant of these genes exhibits suppression epistasis. This class violates the classical model of HIF-1 regulation but can be explained by postulating a role of hydroxylated HIF-1 in transcriptional control.
Subjects
Caenorhabditis elegans Proteins/genetics , Caenorhabditis elegans/genetics , Epistasis, Genetic , Gene Regulatory Networks , High-Throughput Nucleotide Sequencing/methods , Transcriptome , Animals , Caenorhabditis elegans/growth & development
ABSTRACT
In situ hybridization methods are used across the biological sciences to map mRNA expression within intact specimens. Multiplexed experiments, in which multiple target mRNAs are mapped in a single sample, are essential for studying regulatory interactions, but remain cumbersome in most model organisms. Programmable in situ amplifiers based on the mechanism of hybridization chain reaction (HCR) overcome this longstanding challenge by operating independently within a sample, enabling multiplexed experiments to be performed with an experimental timeline independent of the number of target mRNAs. To assist biologists working across a broad spectrum of organisms, we demonstrate multiplexed in situ HCR in diverse imaging settings: bacteria, whole-mount nematode larvae, whole-mount fruit fly embryos, whole-mount sea urchin embryos, whole-mount zebrafish larvae, whole-mount chicken embryos, whole-mount mouse embryos and formalin-fixed paraffin-embedded human tissue sections. In addition to straightforward multiplexing, in situ HCR enables deep sample penetration, high contrast and subcellular resolution, providing an incisive tool for the study of interlaced and overlapping expression patterns, with implications for research communities across the biological sciences.
Subjects
In Situ Hybridization/methods , RNA, Messenger/metabolism , Animals , Drosophila , Embryo, Nonmammalian/metabolism , Humans , Zebrafish
ABSTRACT
Single-cell RNA-seq mammalian transcriptome studies are at an early stage in uncovering cell-to-cell variation in gene expression, transcript processing and editing, and regulatory module activity. Despite great progress recently, substantial challenges remain, including discriminating biological variation from technical noise. Here we apply the SMART-seq single-cell RNA-seq protocol to study the reference lymphoblastoid cell line GM12878. By using spike-in quantification standards, we estimate the absolute number of RNA molecules per cell for each gene and find significant variation in total mRNA content: between 50,000 and 300,000 transcripts per cell. We directly measure technical stochasticity by a pool/split design and find that there are significant differences in expression between individual cells, over and above technical variation. Specific gene coexpression modules were preferentially expressed in subsets of individual cells, including one enriched for mRNA processing and splicing factors. We assess cell-to-cell variation in alternative splicing and allelic bias and report evidence of significant differences in splice site usage that exceed splice variation in the pool/split comparison. Finally, we show that transcriptomes from small pools of 30-100 cells approach the information content and reproducibility of contemporary RNA-seq from large amounts of input material. Together, our results define an experimental and computational path forward for analyzing gene expression in rare cell types and cell states.
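The spike-in step can be illustrated with a small Python sketch. The abstract states only that spike-in quantification standards were used to estimate absolute RNA molecules per cell; the proportional model below (molecules = k × measured abundance, fitted by least squares through the origin) is an assumption made for illustration, as are the function names and all numbers.

```python
from typing import Dict, Sequence


def fit_molecules_per_unit(
    spike_abundance: Sequence[float], spike_molecules: Sequence[float]
) -> float:
    """Least-squares slope through the origin: molecules ~= k * abundance.

    spike_abundance: measured abundance of each spike-in (e.g. FPKM).
    spike_molecules: known number of molecules of each spike-in added per cell.
    """
    numerator = sum(a * m for a, m in zip(spike_abundance, spike_molecules))
    denominator = sum(a * a for a in spike_abundance)
    if denominator == 0:
        raise ValueError("spike-in abundances are all zero")
    return numerator / denominator


def estimate_molecules(
    gene_abundance: Dict[str, float], k: float
) -> Dict[str, float]:
    """Convert each gene's measured abundance into estimated molecules per cell."""
    return {gene: k * value for gene, value in gene_abundance.items()}


# Hypothetical spike-ins: measured abundance and known input molecules.
spike_abundance = [10.0, 100.0, 1000.0]
spike_molecules = [20.0, 200.0, 2000.0]
k = fit_molecules_per_unit(spike_abundance, spike_molecules)

# Hypothetical endogenous genes measured in the same cell.
cell = {"geneA": 5.0, "geneB": 250.0}
molecules = estimate_molecules(cell, k)
total_mrna = sum(molecules.values())
print(k, molecules, total_mrna)
```

Summing the per-gene estimates gives a per-cell total of the kind the abstract reports as varying between 50,000 and 300,000 transcripts. A real analysis would also need to account for capture efficiency and for noise at low spike-in concentrations, which this sketch ignores.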
Subjects
Gene Expression Profiling/methods , Genes , RNA Splicing , RNA/analysis , Cell Line, Tumor , Genome, Human , Humans , RNA/genetics , Reproducibility of Results , Sequence Analysis, RNA , Transcriptome
ABSTRACT
With the completion of the human genome sequence, attention turned to identifying and annotating its functional DNA elements. As a complement to genetic and comparative genomics approaches, the Encyclopedia of DNA Elements Project was launched to contribute maps of RNA transcripts, transcriptional regulator binding sites, and chromatin states in many cell types. The resulting genome-wide data reveal sites of biochemical activity with high positional resolution and cell type specificity that facilitate studies of gene regulation and interpretation of noncoding variants associated with human disease. However, the biochemically active regions cover a much larger fraction of the genome than do evolutionarily conserved regions, raising the question of whether nonconserved but biochemically active regions are truly functional. Here, we review the strengths and limitations of biochemical, evolutionary, and genetic approaches for defining functional DNA segments, potential sources for the observed differences in estimated genomic coverage, and the biological implications of these discrepancies. We also analyze the relationship between signal intensity, genomic coverage, and evolutionary conservation. Our results reinforce the principle that each approach provides complementary information and that we need to use combinations of all three to elucidate genome function in human biology and disease.
Subjects
DNA/genetics , Genome, Human/genetics , Biological Evolution , Disease/genetics , Humans , Regulatory Sequences, Nucleic Acid/genetics , Software
ABSTRACT
We tested whether self-organizing maps (SOMs) could be used to effectively integrate, visualize, and mine diverse genomics data types, including complex chromatin signatures. A fine-grained SOM was trained on 72 ChIP-seq histone modifications and DNase-seq data sets from six biologically diverse cell lines studied by The ENCODE Project Consortium. We mined the resulting SOM to identify chromatin signatures related to sequence-specific transcription factor occupancy, sequence motif enrichment, and biological functions. To highlight clusters enriched for specific functions such as transcriptional promoters or enhancers, we overlaid onto the map additional data sets not used during training, such as ChIP-seq, RNA-seq, CAGE, and information on cis-acting regulatory modules from the literature. We used the SOM to parse known transcriptional enhancers according to the cell-type-specific chromatin signature, and we further corroborated this pattern on the map by EP300 (also known as p300) occupancy. New candidate cell-type-specific enhancers were identified for multiple ENCODE cell types in this way, along with new candidates for ubiquitous enhancer activity. An interactive web interface was developed to allow users to visualize and custom-mine the ENCODE SOM. We conclude that large SOMs trained on chromatin data from multiple cell types provide a powerful way to identify complex relationships in genomic data at user-selected levels of granularity.
Subjects
Chromatin/genetics , Chromatin/metabolism , Histones/genetics , Histones/metabolism , Transcription Factors/genetics , Algorithms , Cell Line , Chromosome Mapping , Computational Biology , Data Mining , Gene Ontology , Human Umbilical Vein Endothelial Cells , Humans , K562 Cells , Promoter Regions, Genetic , User-Computer Interface
ABSTRACT
As studies of DNA methylation increase in scope, it has become evident that methylation has a complex relationship with gene expression, plays an important role in defining cell types, and is disrupted in many diseases. We describe large-scale single-base resolution DNA methylation profiling on a diverse collection of 82 human cell lines and tissues using reduced representation bisulfite sequencing (RRBS). Analysis integrating RNA-seq and ChIP-seq data illuminates the functional role of this dynamic mark. Loci that are hypermethylated across cancer types are enriched for sites bound by NANOG in embryonic stem cells, which supports and expands the model of a stem/progenitor cell signature in cancer. CpGs that are hypomethylated across cancer types are concentrated in megabase-scale domains that occur near the telomeres and centromeres of chromosomes, are depleted of genes, and are enriched for cancer-specific EZH2 binding and H3K27me3 (repressive chromatin). In noncancer samples, there are cell-type specific methylation signatures preserved in primary cell lines and tissues as well as methylation differences induced by cell culture. The relationship between methylation and expression is context-dependent, and we find that CpG-rich enhancers bound by EP300 in the bodies of expressed genes are unmethylated despite the dense gene-body methylation surrounding them. Non-CpG cytosine methylation occurs in human somatic tissue, is particularly prevalent in brain tissue, and is reproducible across many individuals. This study provides an atlas of DNA methylation across diverse and well-characterized samples and enables new discoveries about DNA methylation and its role in gene regulation and disease.
Subjects
CpG Islands , DNA Methylation , Cell Line, Tumor , Chromatin , Gene Expression Profiling , Gene Expression Regulation , Humans , Oligonucleotide Array Sequence Analysis , Promoter Regions, Genetic , Sequence Alignment , Sequence Analysis, DNA , Sulfites/metabolism
ABSTRACT
RNA-seq data can be mined for sequence differences relative to the reference genome to identify both genomic SNPs and RNA editing events. We analyzed the long, polyA-selected, unstranded, deeply sequenced RNA-seq data from the ENCODE Project across 14 human cell lines for candidate RNA editing events. On average, 43% of the RNA sequencing variants that are not in dbSNP and are within gene boundaries are A-to-G(I) RNA editing candidates. The vast majority of A-to-G(I) edits are located in introns and 3' UTRs, with only 123 located in protein-coding sequence. In contrast, the majority of non-A-to-G variants (60%-80%) map near exon boundaries and have the characteristics of splice-mapping artifacts. After filtering out all candidates with evidence of private genomic variation using genome resequencing or ChIP-seq data, we find that up to 85% of the high-confidence RNA variants are A-to-G(I) editing candidates. Genes with A-to-G(I) edits are enriched in Gene Ontology terms involving cell division, viral defense, and translation. The distribution and character of the remaining non-A-to-G variants closely resemble known SNPs. We find no reproducible A-to-G(I) edits that result in nonsynonymous substitutions in all three lymphoblastoid cell lines in our study, unlike RNA editing in the brain. The observation that only a fraction of sites are reproducibly edited in multiple cell lines, together with the stronger association we find between editing and specific genes, suggests that the editing of the transcript is more important than the editing of any individual site.
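As a rough illustration of how variants might be sorted into A-to-G(I) candidates and other mismatches, here is a minimal Python sketch. It is not the pipeline used in the study. It assumes variants are reported on the reference forward strand, so that, with unstranded reads, an A-to-G edit in a minus-strand gene appears as a T-to-C mismatch. The function names and the toy variant list are hypothetical.

```python
from typing import Iterable, Tuple

# Each variant: (reference base, observed base, strand of the overlapping gene).
Variant = Tuple[str, str, str]


def is_a_to_g_candidate(ref: str, alt: str, gene_strand: str) -> bool:
    """True if the mismatch is consistent with A-to-G(I) editing of the transcript.

    Bases are given on the reference forward strand. For a gene on the minus
    strand, an A-to-G change in the RNA shows up as T-to-C on the forward strand.
    """
    if gene_strand == "+":
        return (ref, alt) == ("A", "G")
    if gene_strand == "-":
        return (ref, alt) == ("T", "C")
    raise ValueError(f"unknown strand: {gene_strand!r}")


def a_to_g_fraction(variants: Iterable[Variant]) -> float:
    """Fraction of variants that are A-to-G(I) editing candidates."""
    variants = list(variants)
    if not variants:
        return 0.0
    hits = sum(is_a_to_g_candidate(r, a, s) for r, a, s in variants)
    return hits / len(variants)


# Made-up variants within gene boundaries, assumed already filtered against dbSNP.
toy_variants = [
    ("A", "G", "+"),  # candidate
    ("T", "C", "-"),  # candidate (minus-strand gene)
    ("C", "T", "+"),  # not a candidate
    ("A", "G", "-"),  # not a candidate: would be T-to-C in the transcript
    ("T", "C", "+"),  # not a candidate: would be U-to-C in the transcript
]
fraction = a_to_g_fraction(toy_variants)
print(fraction)
```

The filtering steps described in the abstract (removing known SNPs, private genomic variation, and splice-mapping artifacts near exon boundaries) would have to run before a tally like this one is meaningful.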
Subjects
RNA Editing , RNA/metabolism , Cell Line , Cluster Analysis , Computational Biology/methods , Genomics/methods , Humans , Open Reading Frames , Polymorphism, Single Nucleotide , RNA/chemistry , RNA Splicing , RNA, Messenger/metabolism , Sequence Analysis, RNA
ABSTRACT
Two major transcriptional regulators of Caenorhabditis elegans bodywall muscle (BWM) differentiation, hlh-1 and unc-120, are expressed in muscle where they are known to bind and regulate several well-studied muscle-specific genes. Simultaneously mutating both factors profoundly inhibits formation of contractile BWM. These observations were consistent with a simple network model in which the muscle regulatory factors drive tissue-specific transcription by binding selectively near muscle-specific targets to activate them. We tested this model by measuring the number, identity, and tissue-specificity of functional regulatory targets for each factor. Some joint regulatory targets (218) are BWM-specific and enriched for nearby HLH-1 binding. However, contrary to the simple model, the majority of genes regulated by one or both muscle factors are also expressed significantly in non-BWM tissues. We also mapped global factor occupancy by HLH-1, and created a genetic interaction map that identifies hlh-1 collaborating transcription factors. HLH-1 binding did not predict proximate regulatory action overall, despite enrichment for binding among BWM-specific positive regulatory targets of hlh-1. We conclude that these tissue-specific factors contribute much more broadly to the transcriptional output of muscle tissue than previously thought, offering a partial explanation for widespread HLH-1 occupancy. We also identify a novel regulatory connection between the BWM-specific hlh-1 network and the hlh-8/twist nonstriated muscle network. Finally, our results suggest a molecular basis for synthetic lethality in which hlh-1 and unc-120 mutant phenotypes are mutually buffered by joint additive regulation of essential target genes, with additional buffering suggested via newly identified hlh-1 interacting factors.
Subjects
Caenorhabditis elegans/genetics , Caenorhabditis elegans/metabolism , Gene Expression Regulation , Organ Specificity/genetics , Transcription Factors/metabolism , Animals , Base Sequence , Binding Sites , Caenorhabditis elegans Proteins/genetics , Caenorhabditis elegans Proteins/metabolism , Conserved Sequence , Gene Expression Profiling , Gene Regulatory Networks , MADS Domain Proteins/genetics , MADS Domain Proteins/metabolism , Muscle Proteins , Muscles/metabolism , Mutation , Myogenic Regulatory Factors/genetics , Myogenic Regulatory Factors/metabolism , Nuclear Proteins , Nucleotide Motifs , Protein Binding , RNA Interference , Transcription Factors/genetics , Transcriptome
ABSTRACT
A complex interplay between transcription factors (TFs) and the genome regulates transcription. However, connecting variation in genome sequence with variation in TF binding and gene expression is challenging due to environmental differences between individuals and cell types. To address this problem, we measured genome-wide differential allelic occupancy of 24 TFs and EP300 in a human lymphoblastoid cell line GM12878. Overall, 5% of human TF binding sites have an allelic imbalance in occupancy. At many sites, TFs clustered in TF-binding hubs on the same homolog in especially open chromatin. While genetic variation in core TF binding motifs generally resulted in large allelic differences in TF occupancy, most allelic differences in occupancy were subtle and associated with disruption of weak or noncanonical motifs. We also measured genome-wide differential allelic expression of genes with and without heterozygous exonic variants in the same cells. We found that genes with differential allelic expression were overall less expressed both in GM12878 cells and in unrelated human cell lines. Comparing TF occupancy with expression, we found strong association between allelic occupancy and expression within 100 bp of transcription start sites (TSSs), and weak association up to 100 kb from TSSs. Sites of differential allelic occupancy were significantly enriched for variants associated with disease, particularly autoimmune disease, suggesting that allelic differences in TF occupancy give functional insights into intergenic variants associated with disease. Our results have the potential to increase the power and interpretability of association studies by targeting functional intergenic variants in addition to protein coding sequences.
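One common way to ask whether a heterozygous site shows allelic imbalance in occupancy is to compare read counts from the two alleles against an expected 50:50 split. The abstract does not say which statistical test the study used, so the exact two-sided binomial test below is an assumption for illustration only; the function name and the read counts are hypothetical, and the sketch ignores reference mapping bias and overdispersion.

```python
from math import comb


def binomial_two_sided_p(ref_reads: int, alt_reads: int, p: float = 0.5) -> float:
    """Exact two-sided binomial p-value for allelic read counts.

    Tests the null hypothesis that each read comes from the reference allele
    with probability p. The p-value sums the probabilities of all outcomes
    that are no more likely than the observed one.
    """
    n = ref_reads + alt_reads
    if n == 0:
        return 1.0  # no reads, no evidence of imbalance
    pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    observed = pmf[ref_reads]
    # Small relative tolerance so ties are included despite rounding.
    total = sum(q for q in pmf if q <= observed * (1 + 1e-9))
    return min(1.0, total)


# Hypothetical ChIP-seq read counts at three heterozygous sites.
balanced = binomial_two_sided_p(5, 5)
skewed = binomial_two_sided_p(8, 2)
one_sided_site = binomial_two_sided_p(10, 0)
print(balanced, skewed, one_sided_site)
```

With only ten reads, even an 8:2 split is not strong evidence of imbalance, which is one reason subtle allelic differences of the kind the abstract describes require deep coverage and correction for multiple testing across sites.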
Subjects
Alleles , Gene Expression Regulation , Genetic Variation , Transcription Factors/metabolism , Autoimmune Diseases/genetics , Base Sequence , Binding Sites , Cell Line , Chromatin/metabolism , Chromatin Immunoprecipitation , E1A-Associated p300 Protein/metabolism , Exons , Genome, Human , Humans , Introns , Polymorphism, Single Nucleotide , Protein Binding , RNA Polymerase II/metabolism , Regulatory Elements, Transcriptional , Sequence Analysis, RNA
ABSTRACT
Chromatin immunoprecipitation (ChIP) followed by high-throughput DNA sequencing (ChIP-seq) has become a valuable and widely used approach for mapping the genomic location of transcription-factor binding and histone modifications in living cells. Despite its widespread use, there are considerable differences in how these experiments are conducted, how the results are scored and evaluated for quality, and how the data and metadata are archived for public use. These practices affect the quality and utility of any global ChIP experiment. Through our experience in performing ChIP-seq experiments, the ENCODE and modENCODE consortia have developed a set of working standards and guidelines for ChIP experiments that are updated routinely. The current guidelines address antibody validation, experimental replication, sequencing depth, data and metadata reporting, and data quality assessment. We discuss how ChIP quality, assessed in these ways, affects different uses of ChIP-seq data. All data sets used in the analysis have been deposited for public viewing and downloading at the ENCODE (http://encodeproject.org/ENCODE/) and modENCODE (http://www.modencode.org/) portals.