ABSTRACT
Here the Human Pangenome Reference Consortium presents a first draft of the human pangenome reference. The pangenome contains 47 phased, diploid assemblies from a cohort of genetically diverse individuals¹. These assemblies cover more than 99% of the expected sequence in each genome and are more than 99% accurate at the structural and base pair levels. Based on alignments of the assemblies, we generate a draft pangenome that captures known variants and haplotypes and reveals new alleles at structurally complex loci. We also add 119 million base pairs of euchromatic polymorphic sequences and 1,115 gene duplications relative to the existing reference GRCh38. Roughly 90 million of the additional base pairs are derived from structural variation. Using our draft pangenome to analyse short-read data reduced small variant discovery errors by 34% and increased the number of structural variants detected per haplotype by 104% compared with GRCh38-based workflows, which enabled the typing of the vast majority of structural variant alleles per sample.
Subject(s)
Human Genome, Genomics, Humans, Diploidy, Human Genome/genetics, Haplotypes/genetics, DNA Sequence Analysis, Genomics/standards, Reference Standards, Cohort Studies, Alleles, Genetic Variation
ABSTRACT
The Human Genome Project was an enormous accomplishment, providing a foundation for countless explorations into the genetics and genomics of the human species. Yet for many years, the human genome reference sequence remained incomplete and lacked representation of human genetic diversity. Recently, two major advances have emerged to address these shortcomings: complete gap-free human genome sequences, such as the one developed by the Telomere-to-Telomere Consortium, and high-quality pangenomes, such as the one developed by the Human Pangenome Reference Consortium. Facilitated by advances in long-read DNA sequencing and genome assembly algorithms, complete human genome sequences resolve regions that have been historically difficult to sequence, including centromeres, telomeres, and segmental duplications. In parallel, pangenomes capture the extensive genetic diversity across populations worldwide. Together, these advances usher in a new era of genomics research, enhancing the accuracy of genomic analysis, paving the path for precision medicine, and contributing to deeper insights into human biology.
Subject(s)
Human Genome, Human Genome Project, Humans, Genetic Variation, Genomics/methods, DNA Sequence Analysis/methods, Telomere/genetics
ABSTRACT
Reference-free genome phasing is vital for understanding allele inheritance and the impact of single-molecule DNA variation on phenotypes. To achieve thorough phasing across homozygous or repetitive regions of the genome, long-read sequencing technologies are often used to perform phased de novo assembly. As a step toward reducing the cost and complexity of this type of analysis, we describe new methods for accurately phasing Oxford Nanopore Technologies (ONT) sequence data with the Shasta genome assembler and a modular tool for extending phasing to the chromosome scale called GFAse. We test using new variants of ONT PromethION sequencing, including those using proximity ligation, and show that newer, higher accuracy ONT reads substantially improve assembly quality.
Subject(s)
Nanopores, Humans, DNA Sequence Analysis/methods, Nanopore Sequencing/methods, High-Throughput Nucleotide Sequencing/methods, Software, Genomics/methods
ABSTRACT
Pangenomes reduce reference bias by representing genetic diversity better than a single reference sequence. Yet when comparing a sample to a pangenome, variants in the pangenome that are not part of the sample can be misleading, for example, causing false read mappings. These irrelevant variants are generally rarer in terms of allele frequency, and have previously been dealt with by filtering rare variants. However, this blunt heuristic both fails to remove some irrelevant variants and removes many relevant variants. We propose a new approach that imputes a personalized pangenome subgraph by sampling local haplotypes according to k-mer counts in the reads. We implement the approach in the vg toolkit ( https://github.com/vgteam/vg ) for the Giraffe short-read aligner and compare its accuracy to state-of-the-art methods using human pangenome graphs from the Human Pangenome Reference Consortium. This reduces small variant genotyping errors by four times relative to the Genome Analysis Toolkit and makes short-read structural variant genotyping of known variants competitive with long-read variant discovery methods.
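The haplotype-sampling idea in the abstract above can be illustrated with a toy sketch. It is not the vg implementation: the scoring rule (+1 for a haplotype k-mer seen in the reads, -1 for one that is absent), the function names, and the tiny sequences are all assumptions made for illustration.

```python
from collections import Counter


def count_read_kmers(reads, k):
    """Count every k-mer observed in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def haplotype_score(sequence, read_counts, k, min_count=1):
    """Hypothetical score: +1 per haplotype k-mer present in the reads, -1 per absent one."""
    hap_kmers = {sequence[i:i + k] for i in range(len(sequence) - k + 1)}
    return sum(1 if read_counts[kmer] >= min_count else -1 for kmer in hap_kmers)


def sample_haplotypes(haplotypes, reads, k=4, n=2):
    """Keep the n local haplotypes that agree best with the read k-mers."""
    read_counts = count_read_kmers(reads, k)
    ranked = sorted(
        haplotypes,
        key=lambda name: haplotype_score(haplotypes[name], read_counts, k),
        reverse=True,
    )
    return ranked[:n]


# Three candidate haplotypes for one locus; the reads support h1 and h2 only.
haplotypes = {"h1": "ACGTACGGA", "h2": "ACGTTCGGA", "h3": "ACGAACGGA"}
reads = ["ACGTACG", "GTACGGA", "ACGTTCG", "TTCGGA"]
selected = sample_haplotypes(haplotypes, reads)
```

In this toy case the haplotype that is not supported by the reads is dropped, and the remaining two form the personalized subgraph for the locus.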
ABSTRACT
Long-read sequencing technologies substantially overcome the limitations of short reads but have not been considered a feasible replacement for population-scale projects, being some combination of too expensive, not scalable enough, or too error-prone. Here we develop an efficient and scalable wet lab and computational protocol, Napu, for Oxford Nanopore Technologies long-read sequencing that seeks to address those limitations. We applied our protocol to cell lines and brain tissue samples as part of a pilot project for the National Institutes of Health Center for Alzheimer's and Related Dementias. Using a single PromethION flow cell, we can detect single nucleotide polymorphisms with an F1-score comparable to Illumina short-read sequencing. Small indel calling remains difficult within homopolymers and tandem repeats, but achieves good concordance with Illumina indel calls elsewhere. Further, we can discover structural variants with an F1-score on par with state-of-the-art de novo assembly methods. Our protocol phases small and structural variants at megabase scales and produces highly accurate, haplotype-specific methylation calls.
Subject(s)
Human Genome, Nanopore Sequencing, Humans, DNA Sequence Analysis/methods, Haplotypes, Methylation, Pilot Projects, High-Throughput Nucleotide Sequencing/methods
ABSTRACT
The escalation of antibiotic resistance, pandemics, and nosocomial infections underscores the importance of research in both animal and human infectious diseases. Recent advancements in three-dimensional tissue cultures, or "organoids", have revolutionized the development of in vitro models for infectious diseases. Our study conducts a bibliometric analysis on the use of organoids in modeling infectious diseases, offering an in-depth overview of this field's current landscape. We examined scientific contributions from 2009 onward that focused on organoids in host–pathogen interactions using the Web of Science Core Collection and OpenAlex database. Our analysis included temporal trends, reference aging, author and institutional productivity, collaborative networks, citation metrics, keyword cluster dynamics, and disruptiveness of organoid models. VOSviewer, CiteSpace, and Python facilitated this analytical assessment. The findings reveal significant growth and advancements in organoid-based infectious disease research. Analysis of keywords and impactful publications identified three distinct developmental phases in this area that were significantly influenced by outbreaks of Zika and SARS-CoV-2 viruses. The research also highlights the synergistic efforts between academia and publishers in tackling global pandemic challenges. Through mostly consolidating research efforts, organoids are proving to be a promising tool in infectious disease research for both humans and animals. Their integration into the field necessitates methodological refinements for better physiological emulation and the establishment of extensive organoid biobanks. These improvements are crucial for fully harnessing the potential of organoids in understanding infectious diseases and advancing the development of targeted treatments and vaccines.
Subject(s)
Bibliometrics, Organoids, Organoids/virology, Animals, Humans, Communicable Diseases/veterinary, Communicable Diseases/epidemiology, Animal Disease Models, COVID-19/epidemiology, COVID-19/virology
ABSTRACT
Epilepsy will affect nearly 3% of people at some point during their lifetime. Previous copy number variant (CNV) studies of epilepsy have used array-based technology and were restricted to the detection of large or exonic events. In contrast, whole-genome sequencing (WGS) has the potential to profile CNVs more comprehensively, but existing analytic methods suffer from limited accuracy. We show that this is in part due to the non-uniformity of read coverage, even after intra-sample normalization. To improve on this, we developed PopSV, an algorithm that uses multiple samples to control for technical variation and enables the robust detection of CNVs. Using WGS and PopSV, we performed a comprehensive characterization of CNVs in 198 individuals affected with epilepsy and 301 controls. For both large and small variants, we found an enrichment of rare exonic events in epilepsy patients, especially in genes with predicted loss-of-function intolerance. Notably, this genome-wide survey also revealed an enrichment of rare non-coding CNVs near previously known epilepsy genes. This enrichment was strongest for non-coding CNVs located within 100 Kbp of an epilepsy gene and in regions associated with changes in gene expression, such as expression QTLs or DNase I hypersensitive sites. Finally, we report on 21 potentially damaging events that could be associated with known or new candidate epilepsy genes. Our results suggest that comprehensive sequence-based profiling of CNVs could help explain a larger fraction of epilepsy cases.
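One way to picture how multiple samples can control for technical variation is a per-bin comparison against a reference panel. The sketch below is not PopSV itself: the z-score test, the threshold of 3, the function names, and the coverage numbers are assumptions, and the coverage values are taken to be already normalized across samples.

```python
from statistics import mean, stdev


def bin_z_scores(test_coverage, reference_coverage):
    """Z-score of each bin in the test sample against the same bin in the reference samples."""
    scores = []
    for i, value in enumerate(test_coverage):
        ref_values = [sample[i] for sample in reference_coverage]
        mu = mean(ref_values)
        sd = stdev(ref_values)
        scores.append((value - mu) / sd if sd > 0 else 0.0)
    return scores


def call_bins(z_scores, threshold=3.0):
    """Label each bin using a hypothetical symmetric z-score threshold."""
    calls = []
    for z in z_scores:
        if z <= -threshold:
            calls.append("deletion")
        elif z >= threshold:
            calls.append("duplication")
        else:
            calls.append("neutral")
    return calls


# Bin 1 has systematically low coverage in every reference sample (about 50),
# so only a drop well below that level is flagged.
reference = [[100, 50, 80], [102, 48, 82], [98, 52, 78], [100, 50, 80]]
test_sample = [101, 25, 81]
calls = call_bins(bin_z_scores(test_sample, reference))
```

Because each bin is judged against its own distribution across samples, a region whose coverage is consistently low for technical reasons is not mistaken for a deletion.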
Subject(s)
DNA Copy Number Variations, Epilepsy/genetics, Case-Control Studies, Cohort Studies, Humans, Quantitative Trait Loci, Whole Genome Sequencing
ABSTRACT
Developmental and epileptic encephalopathy (DEE) is a group of conditions characterized by the co-occurrence of epilepsy and intellectual disability (ID), typically with developmental plateauing or regression associated with frequent epileptiform activity. The cause of DEE remains unknown in the majority of cases. We performed whole-genome sequencing (WGS) in 197 individuals with unexplained DEE and pharmaco-resistant seizures and in their unaffected parents. We focused our attention on de novo mutations (DNMs) and identified candidate genes containing such variants. We sought to identify additional subjects with DNMs in these genes by performing targeted sequencing in another series of individuals with DEE and by mining various sequencing datasets. We also performed meta-analyses to document enrichment of DNMs in candidate genes by leveraging our WGS dataset with those of several DEE and ID series. By combining these strategies, we were able to provide a causal link between DEE and the following genes: NTRK2, GABRB2, CLTC, DHDDS, NUS1, RAB11A, GABBR2, and SNAP25. Overall, we established a molecular diagnosis in 63/197 (32%) individuals in our WGS series. The main cause of DEE in these individuals was de novo point mutations (53/63 solved cases), followed by inherited mutations (6/63 solved cases) and de novo CNVs (4/63 solved cases). De novo missense variants explained a larger proportion of individuals in our series than in other series that were primarily ascertained because of ID. Moreover, these DNMs were more frequently recurrent than those identified in ID series. These observations indicate that the genetic landscape of DEE might be different from that of ID without epilepsy.
Subject(s)
Brain Diseases/genetics, Epilepsy/genetics, Mutation/genetics, Child, Preschool Child, Female, Human Genome/genetics, Genome-Wide Association Study/methods, Humans, Intellectual Disability/genetics, Male, Recurrence, Seizures/genetics
Subject(s)
Critical Care, Nanopore Sequencing/methods, Neurodevelopmental Disorders/diagnosis, Adolescent, Preschool Child, Female, Humans, Infant, Newborn Infant, Male, Middle Aged, Mutation, Nanopore Sequencing/economics, Neurodevelopmental Disorders/genetics, DNA Sequence Analysis/methods, Status Epilepticus/genetics
ABSTRACT
Copy number variants (CNVs) are known to affect a large portion of the human genome and have been implicated in many diseases. Although whole-genome sequencing (WGS) can help identify CNVs, most analytical methods suffer from limited sensitivity and specificity, especially in regions of low mappability. To address this, we use PopSV, a CNV caller that relies on multiple samples to control for technical variation. We demonstrate that our calls are stable across different types of repeat-rich regions and validate the accuracy of our predictions using orthogonal approaches. Applying PopSV to 640 human genomes, we find that low-mappability regions are approximately 5 times more likely to harbor germline CNVs, in stark contrast to the nearly uniform distribution observed for somatic CNVs in 95 cancer genomes. In addition to known enrichments in segmental duplication and near centromeres and telomeres, we also report that CNVs are enriched in specific types of satellite and in some of the most recent families of transposable elements. Finally, using this comprehensive approach, we identify 3455 regions with recurrent CNVs that were missing from existing catalogs. In particular, we identify 347 genes with a novel exonic CNV in low-mappability regions, including 29 genes previously associated with disease.
Subject(s)
Centromere/genetics, Chromosome Mapping/methods, DNA Copy Number Variations, Human Genome/genetics, Nucleic Acid Repetitive Sequences/genetics, Telomere/genetics, Genomics/methods, Humans, Neoplasms/genetics, Neoplasms/pathology, Single Nucleotide Polymorphism, Reproducibility of Results, Whole Genome Sequencing/methods
ABSTRACT
Genome sequencing projects are discovering millions of genetic variants in humans, and interpretation of their functional effects is essential for understanding the genetic basis of variation in human traits. Here we report sequencing and deep analysis of messenger RNA and microRNA from lymphoblastoid cell lines of 462 individuals from the 1000 Genomes Project, the first uniformly processed high-throughput RNA-sequencing data from multiple human populations with high-quality genome sequences. We discover extremely widespread genetic variation affecting the regulation of most genes, with transcript structure and expression level variation being equally common but genetically largely independent. Our characterization of causal regulatory variation sheds light on the cellular mechanisms of regulatory and loss-of-function variation, and allows us to infer putative causal variants for dozens of disease-associated loci. Altogether, this study provides a deep understanding of the cellular mechanisms of transcriptome variation and of the landscape of functional variants in the human genome.
Subject(s)
Genetic Variation/genetics, Human Genome/genetics, High-Throughput Nucleotide Sequencing, RNA Sequence Analysis, Transcriptome/genetics, Alleles, Transformed Cell Line, Exons/genetics, Gene Expression Profiling, Humans, Single Nucleotide Polymorphism/genetics, Quantitative Trait Loci/genetics, Messenger RNA/analysis, Messenger RNA/genetics
ABSTRACT
Chronic lymphocytic leukemia (CLL) has heterogeneous clinical and biological behavior. Whole-genome and -exome sequencing has contributed to the characterization of the mutational spectrum of the disease, but the underlying transcriptional profile is still poorly understood. We have performed deep RNA sequencing in different subpopulations of normal B-lymphocytes and CLL cells from a cohort of 98 patients, and characterized the CLL transcriptional landscape with unprecedented resolution. We detected thousands of transcriptional elements differentially expressed between the CLL and normal B cells, including protein-coding genes, noncoding RNAs, and pseudogenes. Transposable elements are globally derepressed in CLL cells. In addition, two thousand genes, most of which are not differentially expressed, exhibit CLL-specific splicing patterns. Genes involved in metabolic pathways showed higher expression in CLL, while genes related to spliceosome, proteasome, and ribosome were among the most down-regulated in CLL. Clustering of the CLL samples according to RNA-seq derived gene expression levels unveiled two robust molecular subgroups, C1 and C2. C1/C2 subgroups and the mutational status of the immunoglobulin heavy variable (IGHV) region were the only independent variables in predicting time to treatment in a multivariate analysis with main clinico-biological features. This subdivision was validated in an independent cohort of patients monitored through DNA microarrays. Further analysis shows that B-cell receptor (BCR) activation in the microenvironment of the lymph node may be at the origin of the C1/C2 differences.
Subject(s)
B-Lymphocytes, Neoplastic Gene Expression Regulation, High-Throughput Nucleotide Sequencing, B-Cell Chronic Lymphocytic Leukemia/genetics, Aged, Base Sequence, Female, Gene Expression Profiling, Humans, Immunoglobulin Variable Region, B-Cell Chronic Lymphocytic Leukemia/pathology, Male, Middle Aged, Mutation, Ribosomes/genetics, Spliceosomes/genetics
ABSTRACT
Pangenome references address biases of reference genomes by storing a representative set of diverse haplotypes and their alignment, usually as a graph. Alternate alleles determined by variant callers can be used to construct pangenome graphs, but advances in long-read sequencing are leading to widely available, high-quality phased assemblies. Constructing a pangenome graph directly from assemblies, as opposed to variant calls, leverages the graph's ability to represent variation at different scales. Here we present the Minigraph-Cactus pangenome pipeline, which creates pangenomes directly from whole-genome alignments, and demonstrate its ability to scale to 90 human haplotypes from the Human Pangenome Reference Consortium. The method builds graphs containing all forms of genetic variation while allowing use of current mapping and genotyping tools. We measure the effect of the quality and completeness of reference genomes used for analysis within the pangenomes and show that using the CHM13 reference from the Telomere-to-Telomere Consortium improves the accuracy of our methods. We also demonstrate construction of a Drosophila melanogaster pangenome.
Subject(s)
Drosophila melanogaster, High-Throughput Nucleotide Sequencing, Humans, Animals, Drosophila melanogaster/genetics, Haplotypes/genetics, High-Throughput Nucleotide Sequencing/methods, Alleles, DNA Sequence Analysis, Human Genome/genetics
ABSTRACT
The current reference genome is the backbone of diverse and rich annotations. Simple text formats, like VCF or BED, have been widely adopted and helped the critical exchange of genomic information. There is a dire need for tools and formats enabling pangenomic annotation to facilitate such enrichment of pangenomic references. The Graph Alignment Format (GAF) is a text format, tab-delimited like BED/VCF files, which was proposed to represent alignments. GAF could also be used to store paths representing annotations in a pangenome graph, but there are no tools to index and query them efficiently. Here, we present extensions to vg and HTSlib that provide efficient sorting, indexing, and querying for GAF files. With this approach, annotations overlapping a subgraph can be extracted quickly. Paths are sorted based on the IDs of traversed nodes, compressed with BGZIP, and indexed with HTSlib/tabix via our extensions for the GAF format. Compared to the binary GAM format, GAF files are easier to edit or inspect because they are plain text, and we show that they are twice as fast to sort and half as large on disk. In addition, we updated vg annotate, which takes BED or GFF3 annotation files relative to linear sequences and projects them into the pangenome. It can now produce GAF files representing these annotations' paths through the pangenome. We showcase these new tools on several applications. We projected annotations for all Human Pangenome Reference Consortium Year 1 haplotypes, including genes, segmental duplications, tandem repeats, and repeat annotations, into the Minigraph-Cactus pangenome (GRCh38-based v1.1). We also projected known variants from the GWAS Catalog and expression QTLs from the GTEx project into the pangenome. Finally, we reanalyzed ATAC-seq data from ENCODE to demonstrate what a coverage track could look like in a pangenome graph. These rich annotations can be quickly queried with vg and visualized using existing tools like the Sequence Tube Map or Bandage.
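Sorting GAF records by the IDs of the nodes they traverse can be sketched as follows. This is an illustration only, not the vg or HTSlib code: the (minimum node ID, maximum node ID) sort key, the function names, and the example records are assumptions, and node names are assumed to be integers.

```python
import re

# A GAF path is a series of oriented steps such as ">12>13<14".
_STEP = re.compile(r"([<>])(\d+)")


def node_ids(path_field):
    """Return the integer node IDs visited by a GAF path."""
    return [int(node) for _, node in _STEP.findall(path_field)]


def sort_key(gaf_line):
    """Hypothetical sort key: (smallest node ID, largest node ID) on the path."""
    fields = gaf_line.rstrip("\n").split("\t")
    ids = node_ids(fields[5])  # column 6 of a GAF record holds the path
    if not ids:
        # Records without a numeric path (for example "*") sort last.
        return (float("inf"), float("inf"))
    return (min(ids), max(ids))


def sort_gaf(lines):
    """Sort GAF records so that paths through nearby nodes end up adjacent."""
    return sorted(lines, key=sort_key)


records = [
    "geneB\t300\t0\t300\t+\t>40>41>42\t300\t0\t300\t300\t300\t60",
    "geneA\t200\t0\t200\t+\t>12>13<14\t200\t0\t200\t200\t200\t60",
    "geneC\t100\t0\t100\t+\t<13>15\t100\t0\t100\t100\t100\t60",
]
sorted_names = [line.split("\t")[0] for line in sort_gaf(records)]
```

Once records are ordered this way, a block-compressed file can be indexed by node-ID interval, so a query for a subgraph only needs to read the blocks whose intervals overlap the requested nodes.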
ABSTRACT
More than 50% of families with suspected rare monogenic diseases remain unsolved after whole genome analysis by short read sequencing (SRS). Long-read sequencing (LRS) could help bridge this diagnostic gap by capturing variants inaccessible to SRS, facilitating long-range mapping and phasing, and providing haplotype-resolved methylation profiling. To evaluate LRS's additional diagnostic yield, we sequenced a rare disease cohort of 98 samples, including 41 probands and some family members, using nanopore sequencing, achieving per sample ~36x average coverage and 32 kilobase (kb) read N50 from a single flow cell. Our Napu pipeline generated assemblies, phased variants, and methylation calls. LRS covered, on average, coding exons in ~280 genes and ~5 known Mendelian disease genes that were not covered by SRS. In comparison to SRS, LRS detected additional rare, functionally annotated variants, including SVs and tandem repeats, and completely phased 87% of protein-coding genes. LRS detected additional de novo variants, and could be used to distinguish postzygotic mosaic variants from prezygotic de novo variants. Eleven probands were solved, with diverse underlying genetic causes including de novo and compound heterozygous variants, large-scale SVs, and epigenetic modifications. Our study demonstrates LRS's potential to enhance diagnostic yield for rare monogenic diseases, implying utility in future clinical genomics workflows.
ABSTRACT
The basal breast cancer subtype is enriched for triple-negative breast cancer (TNBC) and displays consistent large chromosomal deletions. Here, we characterize evolution and maintenance of chromosome 4p (chr4p) loss in basal breast cancer. Analysis of The Cancer Genome Atlas data shows recurrent deletion of chr4p in basal breast cancer. Phylogenetic analysis of a panel of 23 primary tumor/patient-derived xenograft basal breast cancers reveals early evolution of chr4p deletion. Mechanistically, we show that chr4p loss is associated with enhanced proliferation. Gene function studies identify an unknown gene, C4orf19, within chr4p, which suppresses proliferation when overexpressed and is a member of the PDCD10-GCKIII kinase module; we name it PGCKA1. Genome-wide pooled overexpression screens using a barcoded library of human open reading frames identify chromosomal regions, including chr4p, that suppress proliferation when overexpressed in a context-dependent manner, implicating network interactions. Together, these results shed light on the early emergence of complex aneuploid karyotypes involving chr4p and adaptive landscapes shaping breast cancer genomes.
Subject(s)
Breast Neoplasms, Gene Regulatory Networks, Humans, Female, Breast Neoplasms/genetics, Breast Neoplasms/pathology, Animals, Mice, Human Chromosome Pair 4/genetics, Cell Proliferation/genetics, Chromosome Aberrations, Tumor Cell Line, Triple Negative Breast Neoplasms/genetics, Triple Negative Breast Neoplasms/pathology
ABSTRACT
Pangenomes, by including genetic diversity, should reduce reference bias by representing new samples better than a single linear reference does. Yet when comparing a new sample to a pangenome, variants in the pangenome that are not part of the sample can be misleading, for example, causing false read mappings. These irrelevant variants are generally rarer in terms of allele frequency, and have previously been dealt with using allele frequency filters. However, this is a blunt heuristic that both fails to remove some irrelevant variants and removes many relevant variants. We propose a new approach, inspired by local ancestry inference methods, that imputes a personalized pangenome subgraph based on sampling local haplotypes according to k-mer counts in the reads. Our approach is tailored for the Giraffe short-read aligner, as the indexes it needs for read mapping can be built quickly. We compare the accuracy of our approach to state-of-the-art methods using graphs from the Human Pangenome Reference Consortium. The resulting personalized pangenome pipelines provide faster pangenome read mapping than comparable pipelines that use a linear reference, reduce small variant genotyping errors by 4x relative to the Genome Analysis Toolkit (GATK) best-practice pipeline, and for the first time make short-read structural variant genotyping competitive with long-read discovery methods.
ABSTRACT
As a step towards simplifying and reducing the cost of haplotype-resolved de novo assembly, we describe new methods for accurately phasing nanopore data with the Shasta genome assembler and a modular tool for extending phasing to the chromosome scale called GFAse. We test using new variants of Oxford Nanopore Technologies' (ONT) PromethION sequencing, including those using proximity ligation, and show that newer, higher accuracy ONT reads substantially improve assembly quality.
ABSTRACT
Long-read sequencing technologies substantially overcome the limitations of short reads but to date have not been considered a feasible replacement at scale because they are some combination of too expensive, not scalable enough, or too error-prone. Here, we develop an efficient and scalable wet lab and computational protocol for Oxford Nanopore Technologies (ONT) long-read sequencing that seeks to provide a genuine alternative to short reads for large-scale genomics projects. We applied our protocol to cell lines and brain tissue samples as part of a pilot project for the NIH Center for Alzheimer's and Related Dementias (CARD). Using a single PromethION flow cell, we can detect SNPs with an F1-score better than Illumina short-read sequencing. Small indel calling remains difficult inside homopolymers and tandem repeats, but is comparable to Illumina calls elsewhere. Further, we can discover structural variants with an F1-score comparable to state-of-the-art methods involving Pacific Biosciences HiFi sequencing and trio information (but at a lower cost and greater throughput). Using ONT-based phasing, we can then combine and phase small and structural variants at megabase scales. Our protocol also produces highly accurate, haplotype-specific methylation calls. Overall, this makes large-scale long-read sequencing projects feasible; the protocol is currently being used to sequence thousands of brain-based genomes as a part of the NIH CARD initiative. We provide the protocol and software as open-source integrated pipelines for generating phased variant calls and assemblies.
ABSTRACT
BACKGROUND: Glioblastoma is a treatment-resistant brain cancer. Its hierarchical cellular nature and its tumor microenvironment (TME) before, during, and after treatments remain unresolved. METHODS: Here, we used single-cell RNA sequencing to analyze new and recurrent glioblastoma and the nearby subventricular zone (SVZ). RESULTS: We found that 4 glioblastoma neural lineages are present in new and recurrent glioblastoma, with an enrichment of the cancer mesenchymal lineage, immune cells, and reactive astrocytes in early recurrences. Cancer lineages were hierarchically organized around cycling oligodendrocytic and astrocytic progenitors that are transcriptomically similar to but distinct from SVZ neural stem cells (NSCs). Furthermore, NSCs from the SVZ of patients with glioblastoma harbored glioblastoma chromosomal anomalies. Lastly, mesenchymal cancer cells and TME reactive astrocytes shared similar gene signatures, which were induced by radiotherapy in a myeloid-dependent fashion in vivo. CONCLUSION: These data reveal the dynamic, immune-dependent nature of glioblastoma's response to treatments and identify distant NSCs as likely cells of origin.