ABSTRACT
After two decades of improvements, the current human reference genome (GRCh38) is the most accurate and complete vertebrate genome ever produced. However, no single chromosome has been finished end to end, and hundreds of unresolved gaps persist [1,2]. Here we present a human genome assembly that surpasses the continuity of GRCh38 [2], along with a gapless, telomere-to-telomere assembly of a human chromosome. This was enabled by high-coverage, ultra-long-read nanopore sequencing of the complete hydatidiform mole CHM13 genome, combined with complementary technologies for quality improvement and validation. Focusing our efforts on the human X chromosome [3], we reconstructed the centromeric satellite DNA array (approximately 3.1 Mb) and closed the 29 remaining gaps in the current reference, including new sequences from the human pseudoautosomal regions and from cancer-testis ampliconic gene families (CT-X and GAGE). These sequences will be integrated into future human reference genome releases. In addition, the complete chromosome X, combined with the ultra-long nanopore data, allowed us to map methylation patterns across complex tandem repeats and satellite arrays. Our results demonstrate that finishing the entire human genome is now within reach, and the data presented here will facilitate ongoing efforts to complete the other human chromosomes.
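Assembly continuity, as compared above, is often summarized with the N50 statistic. The abstract does not say which metric was used, so the following is only a minimal sketch assuming N50 (or NG50 when a genome size is supplied); the function name `n50` and its parameters are hypothetical.

```python
def n50(lengths, genome_size=None):
    """Return the N50 of a list of contig lengths.

    If genome_size is given, the threshold is half of that size instead of
    half of the assembled bases, which yields the NG50 variant.
    """
    if not lengths:
        raise ValueError("no contig lengths given")
    total = genome_size if genome_size is not None else sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half of `total` is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    # Only reachable for NG50 when contigs cover less than half the genome.
    return 0
```

For example, `n50([2, 3, 4, 5, 6])` returns 5, because the two longest contigs (6 and 5) together cover more than half of the 20 assembled bases.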
Subject(s)
Chromosomes, Human, X/genetics , Genome, Human/genetics , Telomere/genetics , Centromere/genetics , CpG Islands/genetics , DNA Methylation , DNA, Satellite/genetics , Female , Humans , Hydatidiform Mole/genetics , Male , Pregnancy , Reproducibility of Results , Testis/metabolism
ABSTRACT
In mammals, cytosine methylation (5mC) is widely distributed throughout the genome but is notably depleted from active promoters and enhancers. While the role of DNA methylation in promoter silencing has been well documented, the function of this epigenetic mark at enhancers remains unclear. Recent experiments have demonstrated that enhancers are enriched for 5-hydroxymethylcytosine (5hmC), an oxidization product of the Tet family of 5mC dioxygenases and an intermediate of DNA demethylation. These results support the involvement of Tet proteins in the regulation of dynamic DNA methylation at enhancers. By mapping DNA methylation and hydroxymethylation at base resolution, we find that deletion of Tet2 causes extensive loss of 5hmC at enhancers, accompanied by enhancer hypermethylation, reduction of enhancer activity, and delayed gene induction in the early steps of differentiation. Our results reveal that DNA demethylation modulates enhancer activity, and its disruption influences the timing of transcriptome reprogramming during cellular differentiation.
Subject(s)
Cell Differentiation/genetics , DNA Methylation/genetics , DNA-Binding Proteins/metabolism , Enhancer Elements, Genetic/genetics , Proto-Oncogene Proteins/metabolism , 5-Methylcytosine/metabolism , Animals , Base Sequence , Cell Line , Cytosine/analogs & derivatives , Cytosine/metabolism , DNA-Binding Proteins/genetics , Dioxygenases , Mice , Mice, Knockout , Oxidation-Reduction , Promoter Regions, Genetic/genetics , Proto-Oncogene Proteins/genetics , Sequence Analysis, DNA , Transcriptome/genetics , Zinc Fingers/genetics
ABSTRACT
Understanding the diversity of human tissues is fundamental to disease and requires linking genetic information, which is identical in most of an individual's cells, with epigenetic mechanisms that could have tissue-specific roles. Surveys of DNA methylation in human tissues have established a complex landscape including both tissue-specific and invariant methylation patterns. Here we report high-coverage methylomes that catalogue cytosine methylation in all contexts for the major human organ systems, integrated with matched transcriptomes and genomic sequence. By combining these diverse data types with each individual's phased genome, we identified widespread tissue-specific differential CG methylation (mCG), partially methylated domains, allele-specific methylation and transcription, and the unexpected presence of non-CG methylation (mCH) in almost all human tissues. mCH correlated with tissue-specific functions, and using this mark, we made novel predictions of genes that escape X-chromosome inactivation in specific tissues. Overall, DNA methylation in several genomic contexts varies substantially among human tissues.
Subject(s)
DNA Methylation , Epigenesis, Genetic , Age Factors , Alleles , Chromosome Mapping , Female , Gene Expression Profiling , Gene Expression Regulation , Genetic Variation , Humans , Male , Organ Specificity
ABSTRACT
Higher-order chromatin structure is emerging as an important regulator of gene expression. Although dynamic chromatin structures have been identified in the genome, the full scope of chromatin dynamics during mammalian development and lineage specification remains to be determined. By mapping genome-wide chromatin interactions in human embryonic stem (ES) cells and four human ES-cell-derived lineages, we uncover extensive chromatin reorganization during lineage specification. We observe that although self-associating chromatin domains are stable during differentiation, chromatin interactions both within and between domains change in a striking manner, altering 36% of active and inactive chromosomal compartments throughout the genome. By integrating chromatin interaction maps with haplotype-resolved epigenome and transcriptome data sets, we find widespread allelic bias in gene expression correlated with allele-biased chromatin states of linked promoters and distal enhancers. Our results therefore provide a global view of chromatin dynamics and a resource for studying long-range control of gene expression in distinct human cell lineages.
Subject(s)
Cell Differentiation , Chromatin Assembly and Disassembly , Chromatin/chemistry , Chromatin/metabolism , Embryonic Stem Cells/cytology , Embryonic Stem Cells/metabolism , Epigenesis, Genetic/genetics , Alleles , Allelic Imbalance/genetics , Cell Differentiation/genetics , Cell Lineage/genetics , Chromatin/genetics , Chromatin Assembly and Disassembly/genetics , Enhancer Elements, Genetic/genetics , Epigenomics , Gene Regulatory Networks , Humans , Promoter Regions, Genetic/genetics , Reproducibility of Results
ABSTRACT
Allelic differences between the two homologous chromosomes can affect the propensity of inheritance in humans; however, the extent of such differences in the human genome has yet to be fully explored. Here we delineate allelic chromatin modifications and transcriptomes among a broad set of human tissues, enabled by a chromosome-spanning haplotype reconstruction strategy. The resulting large collection of haplotype-resolved epigenomic maps reveals extensive allelic biases in both chromatin state and transcription, which show considerable variation across tissues and between individuals, and allow us to investigate cis-regulatory relationships between genes and their control sequences. Analyses of histone modification maps also uncover intriguing characteristics of cis-regulatory elements and tissue-restricted activities of repetitive elements. The rich data sets described here will enhance our understanding of the mechanisms by which cis-regulatory elements control gene expression programs.
Subject(s)
Alleles , Epigenesis, Genetic/genetics , Epigenomics , Haplotypes/genetics , Acetylation , Chromatin/genetics , Chromatin/metabolism , Chromosomes, Human/genetics , Datasets as Topic , Enhancer Elements, Genetic/genetics , Genetic Variation/genetics , Histones/metabolism , Humans , Nucleotide Motifs , Organ Specificity/genetics , Transcription, Genetic/genetics
ABSTRACT
Long-read sequencing and novel long-range assays have revolutionized de novo genome assembly by automating the reconstruction of reference-quality genomes. In particular, Hi-C sequencing is becoming an economical method for generating chromosome-scale scaffolds. Despite its increasing popularity, few open-source tools are available, and error rates, particularly for inversions and fusions across chromosomes, remain higher than with alternative scaffolding technologies. We present a novel open-source Hi-C scaffolder that does not require an a priori estimate of chromosome number and minimizes errors by scaffolding with the assistance of an assembly graph. We demonstrate higher accuracy than the state-of-the-art methods across a variety of Hi-C library preparations and input assembly sizes. The Python and C++ code for our method is openly available at https://github.com/machinegun/SALSA.
Subject(s)
Chromosomes, Human/genetics , Genome, Human , Genomics/methods , Algorithms , Animals , Computational Biology , Computer Simulation , Databases, Nucleic Acid/statistics & numerical data , Genomic Library , Genomics/statistics & numerical data , High-Throughput Nucleotide Sequencing/methods , High-Throughput Nucleotide Sequencing/statistics & numerical data , Humans , Sequence Analysis, DNA/methods , Sequence Analysis, DNA/statistics & numerical data , Software
ABSTRACT
A large number of cis-regulatory sequences have been annotated in the human genome, but defining their target genes remains a challenge. One strategy is to identify the long-range looping interactions at these elements with the use of chromosome conformation capture (3C)-based techniques. However, previous studies lack either the resolution or coverage to permit a whole-genome, unbiased view of chromatin interactions. Here we report a comprehensive chromatin interaction map generated in human fibroblasts using a genome-wide 3C analysis method (Hi-C). We determined over one million long-range chromatin interactions at 5-10-kb resolution, and uncovered general principles of chromatin organization at different types of genomic features. We also characterized the dynamics of promoter-enhancer contacts after TNF-α signalling in these cells. Unexpectedly, we found that TNF-α-responsive enhancers are already in contact with their target promoters before signalling. Such pre-existing chromatin looping, which also exists in other cell types with different extracellular signalling, is a strong predictor of gene induction. Our observations suggest that the three-dimensional chromatin landscape, once established in a particular cell type, is relatively stable and could influence the selection or activation of target genes by a ubiquitous transcription activator in a cell-specific manner.
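The "resolution" of a chromatin interaction map refers to the size of the genomic bins into which read-pair contacts are aggregated. As a rough illustration only (not the pipeline used in the study), the sketch below counts contacts between fixed-size bins on one chromosome; the function name `bin_contacts`, the input layout, and the 10-kb default are assumptions made for this example.

```python
from collections import Counter

def bin_contacts(pairs, bin_size=10_000):
    """Aggregate read-pair contacts into a sparse, symmetric contact map.

    pairs: iterable of (pos_a, pos_b) base-pair coordinates on one chromosome.
    Returns a Counter keyed by (bin_i, bin_j) with bin_i <= bin_j.
    """
    counts = Counter()
    for pos_a, pos_b in pairs:
        # Integer division maps each coordinate to its bin index.
        i, j = sorted((pos_a // bin_size, pos_b // bin_size))
        counts[(i, j)] += 1
    return counts
```

With 10-kb bins, the pairs `(1_000, 25_000)` and `(26_000, 3_000)` both fall into bin pair `(0, 2)`, so that entry receives a count of 2. Smaller bins give finer resolution but need more reads per bin pair to be informative.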
Subject(s)
Chromatin/metabolism , Chromosome Mapping , Genome, Human , Cell Line , Chromatin/chemistry , Chromatin/genetics , Enhancer Elements, Genetic/physiology , Gene Expression Regulation , Humans , Imaging, Three-Dimensional , Promoter Regions, Genetic/physiology , Protein Binding , Signal Transduction , Tumor Necrosis Factor-alpha/metabolism
ABSTRACT
The spatial organization of the genome is intimately linked to its biological function, yet our understanding of higher order genomic structure is coarse, fragmented and incomplete. In the nucleus of eukaryotic cells, interphase chromosomes occupy distinct chromosome territories, and numerous models have been proposed for how chromosomes fold within chromosome territories. These models, however, provide few mechanistic details about the relationship between higher order chromatin structure and genome function. Recent advances in genomic technologies have led to rapid progress in the study of three-dimensional genome organization. In particular, Hi-C has been introduced as a method for identifying higher order chromatin interactions genome wide. Here we investigate the three-dimensional organization of the human and mouse genomes in embryonic stem cells and terminally differentiated cell types at unprecedented resolution. We identify large, megabase-sized local chromatin interaction domains, which we term 'topological domains', as a pervasive structural feature of the genome organization. These domains correlate with regions of the genome that constrain the spread of heterochromatin. The domains are stable across different cell types and highly conserved across species, indicating that topological domains are an inherent property of mammalian genomes. Finally, we find that the boundaries of topological domains are enriched for the insulator binding protein CTCF, housekeeping genes, transfer RNAs and short interspersed element (SINE) retrotransposons, indicating that these factors may have a role in establishing the topological domain structure of the genome.
Subject(s)
Chromatin/genetics , Chromatin/metabolism , Genome , Animals , Binding Sites , CCCTC-Binding Factor , Cell Differentiation , Chromatin/chemistry , Chromosomes/chemistry , Chromosomes/genetics , Chromosomes/metabolism , Embryonic Stem Cells/metabolism , Evolution, Molecular , Female , Genes, Essential/genetics , Heterochromatin/chemistry , Heterochromatin/genetics , Heterochromatin/metabolism , Humans , Male , Mammals/genetics , Mice , RNA, Transfer/genetics , Repressor Proteins/metabolism , Short Interspersed Nucleotide Elements/genetics
ABSTRACT
Phasing of single-nucleotide variants (SNVs) and structural variants (SVs) into chromosome-wide haplotypes in humans has been challenging, and has required either trio sequencing or restricting phasing to population-based haplotypes. Selvaraj et al. demonstrated that single-individual SNV phasing is possible with proximity-ligation (HiC) sequencing. Here, we demonstrate that HiC can phase structural variants into phased scaffolds of SNVs. Since HiC data are noisy and SV calling is challenging, we applied a range of supervised classification techniques, including Support Vector Machines and Random Forest, to phase deletions. Our approach was demonstrated on deletion calls and phasings on the NA12878 human genome. We used three NA12878 chromosomes and simulated chromosomes to train model parameters. The remaining NA12878 chromosomes withheld from training were used to evaluate phasing accuracy. Random Forest had the highest accuracy and correctly phased 86% of the deletions with allele-specific read evidence. Allele-specific read evidence was found for 76% of the deletions. HiC provides significant read evidence for accurately phasing 33% of the deletions. Also, eight of the eight top-ranked deletions phased by HiC alone were validated using long-range polymerase chain reaction and Sanger sequencing. Thus, deletions from a single individual can be accurately phased using a combination of shotgun and proximity ligation sequencing. InPhaDel software is available at: http://l337x911.github.io/inphadel/.
Subject(s)
Genomic Structural Variation , High-Throughput Nucleotide Sequencing/methods , Polymorphism, Single Nucleotide/genetics , Sequence Analysis, DNA/methods , Algorithms , Alleles , Genome, Human , Haplotypes/genetics , Humans , Sequence Deletion/genetics , Software
ABSTRACT
BACKGROUND: The MHC and KIR loci are clinically relevant regions of the genome. Typing the sequence of these loci has a wide range of applications including organ transplantation, drug discovery, pharmacogenomics and furthering fundamental research in immune genetics. Rapid advances in biochemical and next-generation sequencing (NGS) technologies have enabled several strategies for precise genotyping and phasing of candidate HLA alleles. Nonetheless, because typing of candidate HLA alleles alone reveals only limited aspects of the genetics of the MHC region, it is insufficient for these applications. For this reason, we believe phasing the entire MHC and KIR loci onto a single locus-spanning haplotype can be a critical improvement for better understanding transplantation biology. RESULTS: Generating long-range (>1 Mb) phase information is traditionally very challenging. As proximity-ligation-based methods of DNA sequencing preserve chromosome-span phase information, we have used this principle to generate full-length phasing of the MHC and KIR loci in human samples. We accurately (~99%) reconstruct the complete haplotypes for over 90% of sequence variants (coding and non-coding) within these two loci, which collectively span 4 megabases. CONCLUSIONS: By haplotyping a majority of coding and non-coding alleles at the MHC and KIR loci in a single assay, this method has the potential to assist transplantation matching and facilitate investigation of the genetic basis of human immunity and disease.
Subject(s)
Haplotypes/genetics , High-Throughput Nucleotide Sequencing/methods , Major Histocompatibility Complex/genetics , Receptors, KIR/genetics , Genotype , Humans
ABSTRACT
Knowledge of spatial chromosomal organizations is critical for the study of transcriptional regulation and other nuclear processes in the cell. Recently, chromosome conformation capture (3C) based technologies, such as Hi-C and TCC, have been developed to provide a genome-wide, three-dimensional (3D) view of chromatin organization. Appropriate methods for analyzing these data and fully characterizing the 3D chromosomal structure and its structural variations are still under development. Here we describe a novel Bayesian probabilistic approach, denoted as "Bayesian 3D constructor for Hi-C data" (BACH), to infer the consensus 3D chromosomal structure. In addition, we describe a variant algorithm BACH-MIX to study the structural variations of chromatin in a cell population. Applying BACH and BACH-MIX to a high resolution Hi-C dataset generated from mouse embryonic stem cells, we found that most local genomic regions exhibit homogeneous 3D chromosomal structures. We further constructed a model for the spatial arrangement of chromatin, which reveals structural properties associated with euchromatic and heterochromatic regions in the genome. We observed strong associations between structural properties and several genomic and epigenetic features of the chromosome. Using BACH-MIX, we further found that the structural variations of chromatin are correlated with these genomic and epigenetic features. Our results demonstrate that BACH and BACH-MIX have the potential to provide new insights into the chromosomal architecture of mammalian cells.
Subject(s)
Bayes Theorem , Chromosomes , Algorithms , Animals , Embryonic Stem Cells/ultrastructure , Epigenesis, Genetic , Genomics , Mice , Models, Theoretical
ABSTRACT
SUMMARY: We propose a parametric model, HiCNorm, to remove systematic biases in the raw Hi-C contact maps, resulting in a simple, fast, yet accurate normalization procedure. Compared with the existing Hi-C normalization method developed by Yaffe and Tanay, HiCNorm has fewer parameters, runs >1000 times faster and achieves higher reproducibility. AVAILABILITY: Freely available on the web at: http://www.people.fas.harvard.edu/~junliu/HiCNorm/. CONTACT: jliu@stat.harvard.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
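One way to picture a parametric bias correction of this kind is a Poisson regression of raw contact counts on a known bias covariate, followed by dividing each observed count by its fitted expectation. The sketch below is a deliberately simplified, single-covariate illustration and not the HiCNorm model itself; the function names, the choice of one covariate, and the Newton solver are assumptions made for this example. It also assumes the covariate takes at least two distinct values, otherwise the 2x2 system is singular.

```python
import math

def fit_poisson_bias(covariate, counts, iterations=50, tol=1e-12):
    """Fit log(mu) = b0 + b1 * covariate by Newton's method.

    Maximizes the Poisson log-likelihood; returns (b0, b1).
    """
    b0 = math.log(sum(counts) / len(counts))  # intercept-only starting point
    b1 = 0.0
    for _ in range(iterations):
        mu = [math.exp(b0 + b1 * x) for x in covariate]
        # Gradient of the log-likelihood.
        g0 = sum(y - m for y, m in zip(counts, mu))
        g1 = sum((y - m) * x for y, m, x in zip(counts, mu, covariate))
        # Fisher information (negative Hessian), a symmetric 2x2 matrix.
        h00 = sum(mu)
        h01 = sum(m * x for m, x in zip(mu, covariate))
        h11 = sum(m * x * x for m, x in zip(mu, covariate))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system in closed form.
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if abs(step0) + abs(step1) < tol:
            break
    return b0, b1

def normalize(covariate, counts):
    """Return observed / expected ratios under the fitted bias model."""
    b0, b1 = fit_poisson_bias(covariate, counts)
    return [y / math.exp(b0 + b1 * x) for x, y in zip(covariate, counts)]
```

Counts that are fully explained by the covariate normalize to ratios near 1, so any remaining deviation from 1 is signal the bias model does not account for. A realistic model would use several covariates at once, which turns the 2x2 solve into a general linear system.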
Subject(s)
Chromosome Mapping/methods , Linear Models , Software , Base Composition , Chromatin/genetics , Genomic Library , Internet , Reproducibility of Results , Statistics, Nonparametric
ABSTRACT
While genetic variation at chromatin loops is relevant for human disease, the relationships between contact propensity (the probability that loci at loops physically interact), genetics, and gene regulation are unclear. We quantitatively interrogate these relationships by comparing Hi-C and molecular phenotype data across cell types and haplotypes. While chromatin loops consistently form across different cell types, they have subtle quantitative differences in contact frequency that are associated with larger changes in gene expression and H3K27ac. For the vast majority of loci with quantitative differences in contact frequency across haplotypes, the changes in magnitude are smaller than those across cell types; however, the proportional relationships between contact propensity, gene expression, and H3K27ac are consistent. These findings suggest that subtle changes in contact propensity have a biologically meaningful role in gene regulation and could be a mechanism by which regulatory genetic variants in loop anchors mediate effects on expression.
Subject(s)
Chromatin/genetics , DNA/genetics , Gene Expression Regulation , Histones/genetics , Quantitative Trait Loci/genetics , Adolescent , Adult , Aged , Cell Line , Chromatin/metabolism , DNA/metabolism , Female , Histones/metabolism , Humans , Induced Pluripotent Stem Cells , Male , Middle Aged , Myocytes, Cardiac , Nucleic Acid Conformation , Polymorphism, Single Nucleotide , Whole Genome Sequencing , Young Adult
ABSTRACT
The pluripotency of embryonic stem cells (ESCs) is maintained by a small group of master transcription factors including Oct4, Sox2 and Nanog. These core factors form a regulatory circuit controlling the transcription of a number of pluripotency factors including themselves. Although previous studies have identified transcriptional regulators of this core network, the cis-regulatory DNA sequences required for the transcription of these key pluripotency factors remain to be defined. We analyzed epigenomic data within the 1.5 Mb gene-desert regions around the Sox2 gene and identified a 13kb-long super-enhancer (SE) located 100kb downstream of Sox2 in mouse ESCs. This SE is occupied by Oct4, Sox2, Nanog, and the mediator complex, and physically interacts with the Sox2 locus via DNA looping. Using a simple and highly efficient double-CRISPR genome editing strategy we deleted the entire 13-kb SE and characterized transcriptional defects in the resulting monoallelic and biallelic deletion clones with RNA-seq. We showed that the SE is responsible for over 90% of Sox2 expression, and Sox2 is the only target gene along the chromosome. Our results support the functional significance of a SE in maintaining the pluripotency transcription program in mouse ESCs.
Subject(s)
Clustered Regularly Interspaced Short Palindromic Repeats , Embryonic Stem Cells/metabolism , Enhancer Elements, Genetic , SOXB1 Transcription Factors/genetics , Animals , Cell Line , Chromosome Mapping , Computational Biology , Epigenesis, Genetic , Gene Deletion , Gene Expression Regulation , Mice , SOXB1 Transcription Factors/metabolism , Transcription, Genetic
ABSTRACT
Rapid advances in high-throughput sequencing facilitate variant discovery and genotyping, but linking variants into a single haplotype remains challenging. Here we demonstrate HaploSeq, an approach for assembling chromosome-scale haplotypes by exploiting the existence of 'chromosome territories'. We use proximity ligation and sequencing to show that alleles on homologous chromosomes occupy distinct territories, and therefore this experimental protocol preferentially recovers physically linked DNA variants on a homolog. Computational analysis of such data sets allows for accurate (~99.5%) reconstruction of chromosome-spanning haplotypes for ~95% of alleles in hybrid mouse cells with 30× sequencing coverage. To resolve haplotypes for a human genome, which has a low density of variants, we coupled HaploSeq with local conditional phasing to obtain haplotypes for ~81% of alleles with ~98% accuracy from just 17× sequencing. Whereas methods based on proximity ligation were originally designed to investigate spatial organization of genomes, our results lend support for their use as a general tool for haplotyping.
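The central idea, that a read pair covering two heterozygous variants most likely reports alleles from the same homolog, can be illustrated with a toy phasing routine. The sketch below uses a greedy majority vote that propagates phase through linked variants. It is a swapped-in simplification for illustration, not the algorithm used by HaploSeq, and the function name `phase_variants` and the input layout are hypothetical.

```python
from collections import defaultdict, deque

def phase_variants(links):
    """Greedy phasing from pairwise allele observations.

    links: iterable of (var_a, allele_a, var_b, allele_b), one per read pair
    covering two heterozygous variants; alleles are coded 0 or 1.
    Returns {variant: (block, bit)}, where `block` names the seed variant of
    the phase block and `bit` is the haplotype carrying the variant's allele 0.
    """
    votes = defaultdict(lambda: [0, 0])  # [same-phase votes, flipped votes]
    neighbours = defaultdict(set)
    for a, x, b, y in links:
        key = (a, b) if a <= b else (b, a)
        votes[key][x ^ y] += 1  # equal alleles vote "same", unequal "flipped"
        neighbours[a].add(b)
        neighbours[b].add(a)

    phase = {}
    for seed in sorted(neighbours):
        if seed in phase:
            continue
        phase[seed] = (seed, 0)  # start a new phase block at this variant
        queue = deque([seed])
        while queue:
            a = queue.popleft()
            for b in sorted(neighbours[a]):
                if b in phase:
                    continue
                same, flipped = votes[(a, b) if a <= b else (b, a)]
                if same == flipped:
                    continue  # tied evidence: do not extend across this edge
                flip = 0 if same > flipped else 1
                phase[b] = (phase[a][0], phase[a][1] ^ flip)
                queue.append(b)
    return phase
```

Because occasional read pairs join different homologs, a single observation is unreliable; the majority vote per variant pair is the simplest guard against such noise. Practical phasers optimize over all links jointly rather than committing greedily to the first edge that reaches a variant.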