RESUMO
Here the Human Pangenome Reference Consortium presents a first draft of the human pangenome reference. The pangenome contains 47 phased, diploid assemblies from a cohort of genetically diverse individuals1. These assemblies cover more than 99% of the expected sequence in each genome and are more than 99% accurate at the structural and base pair levels. Based on alignments of the assemblies, we generate a draft pangenome that captures known variants and haplotypes and reveals new alleles at structurally complex loci. We also add 119 million base pairs of euchromatic polymorphic sequences and 1,115 gene duplications relative to the existing reference GRCh38. Roughly 90 million of the additional base pairs are derived from structural variation. Using our draft pangenome to analyse short-read data reduced small variant discovery errors by 34% and increased the number of structural variants detected per haplotype by 104% compared with GRCh38-based workflows, which enabled the typing of the vast majority of structural variant alleles per sample.
Assuntos
Genoma Humano , Genômica , Humanos , Diploide , Genoma Humano/genética , Haplótipos/genética , Análise de Sequência de DNA , Genômica/normas , Padrões de Referência , Estudos de Coortes , Alelos , Variação GenéticaRESUMO
Pangenomics is emerging as a powerful computational paradigm in bioinformatics. This field uses population-level genome reference structures, typically consisting of a sequence graph, to mitigate reference bias and facilitate analyses that were challenging with previous reference-based methods. In this work, we extend these methods into transcriptomics to analyze sequencing data using the pantranscriptome: a population-level transcriptomic reference. Our toolchain, which consists of additions to the VG toolkit and a standalone tool, RPVG, can construct spliced pangenome graphs, map RNA sequencing data to these graphs, and perform haplotype-aware expression quantification of transcripts in a pantranscriptome. We show that this workflow improves accuracy over state-of-the-art RNA sequencing mapping methods, and that it can efficiently quantify haplotype-specific transcript expression without needing to characterize the haplotypes of a sample beforehand.
Assuntos
Biologia Computacional , Perfilação da Expressão Gênica , Haplótipos , Metagenômica , TranscriptomaRESUMO
BACKGROUND: Structural variants (SVs), including deletions, insertions, duplications, and inversions, are relatively long genomic variations implicated in a diverse range of processes from human disease to ecology and evolution. Given their complex signatures, tendency to occur in repeated regions, and large size, discovering SVs based on short reads is challenging compared to single-nucleotide variants. The increasing availability of long-read technologies has greatly facilitated SV discovery; however, these technologies remain too costly to apply routinely to population-level studies. Here, we combined short-read and long-read sequencing technologies to provide a comprehensive population-scale assessment of structural variation in a panel of Canadian soybean cultivars. RESULTS: We used Oxford Nanopore long-read sequencing data (~12× mean coverage) for 17 samples to both benchmark SV calls made from Illumina short-read data and predict SVs that were subsequently genotyped in a population of 102 samples using Illumina data. Benchmarking results show that variants discovered using Oxford Nanopore can be accurately genotyped from the Illumina data. We first use the genotyped deletions and insertions for population genetics analyses and show that results are comparable to those based on single-nucleotide variants. We observe that the population frequency and distribution within the genome of deletions and insertions are constrained by the location of genes. Gene Ontology and PFAM domain enrichment analyses also confirm previous reports that genes harboring high-frequency deletions and insertions are enriched for functions in defense response. Finally, we discover polymorphic transposable elements from the deletions and insertions and report evidence of the recent activity of a Stowaway MITE. CONCLUSIONS: We show that structural variants discovered using Oxford Nanopore data can be genotyped with high accuracy from Illumina data. Our results demonstrate that long-read and short-read sequencing technologies can be efficiently combined to enhance SV analysis in large populations, providing a reusable framework for their study in a wider range of samples and non-model species.
Assuntos
Nanoporos , Canadá , Elementos de DNA Transponíveis/genética , Genômica/métodos , Sequenciamento de Nucleotídeos em Larga Escala , Humanos , Nucleotídeos , Análise de Sequência de DNA , Glycine max/genéticaRESUMO
We introduce Giraffe, a pangenome short-read mapper that can efficiently map to a collection of haplotypes threaded through a sequence graph. Giraffe maps sequencing reads to thousands of human genomes at a speed comparable to that of standard methods mapping to a single reference genome. The increased mapping accuracy enables downstream improvements in genome-wide genotyping pipelines for both small variants and larger structural variants. We used Giraffe to genotype 167,000 structural variants, discovered in long-read studies, in 5202 diverse human genomes that were sequenced using short reads. We conclude that pangenomics facilitates a more comprehensive characterization of variation and, as a result, has the potential to improve many genomic analyses.
Assuntos
Variação Genética , Genoma Humano , Genômica/métodos , Técnicas de Genotipagem , Algoritmos , Alelos , Biologia Computacional , Genoma Fúngico , Genótipo , Haplótipos , Sequenciamento de Nucleotídeos em Larga Escala , Humanos , Polimorfismo de Nucleotídeo Único , Locos de Características Quantitativas , Saccharomyces/genética , Saccharomyces cerevisiae/genética , Análise de Sequência de DNARESUMO
Low-cost whole-genome assembly has enabled the collection of haplotype-resolved pangenomes for numerous organisms. In turn, this technological change is encouraging the development of methods that can precisely address the sequence and variation described in large collections of related genomes. These approaches often use graphical models of the pangenome to support algorithms for sequence alignment, visualization, functional genomics, and association studies. The additional information provided to these methods by the pangenome allows them to achieve superior performance on a variety of bioinformatic tasks, including read alignment, variant calling, and genotyping. Pangenome graphs stand to become a ubiquitous tool in genomics. Although it is unclear whether they will replace linearreference genomes, their ability to harmoniously relate multiple sequence and coordinate systems will make them useful irrespective of which pangenomic models become most common in the future.
Assuntos
Algoritmos , Biologia Computacional/métodos , Gráficos por Computador , Genoma Humano , Sequenciamento de Nucleotídeos em Larga Escala , Humanos , Análise de Sequência de DNARESUMO
Structural variants (SVs) remain challenging to represent and study relative to point mutations despite their demonstrated importance. We show that variation graphs, as implemented in the vg toolkit, provide an effective means for leveraging SV catalogs for short-read SV genotyping experiments. We benchmark vg against state-of-the-art SV genotypers using three sequence-resolved SV catalogs generated by recent long-read sequencing studies. In addition, we use assemblies from 12 yeast strains to show that graphs constructed directly from aligned de novo assemblies improve genotyping compared to graphs built from intermediate SV catalogs in the VCF format.
Assuntos
Variação Estrutural do Genoma , Técnicas de Genotipagem/métodos , Software , Genoma Fúngico , Saccharomyces cerevisiae , Sequenciamento Completo do Genoma/métodosRESUMO
In March 2019, 45 scientists and software engineers from around the world converged at the University of California, Santa Cruz for the first pangenomics codeathon. The purpose of the meeting was to propose technical specifications and standards for a usable human pangenome as well as to build relevant tools for genome graph infrastructures. During the meeting, the group held several intense and productive discussions covering a diverse set of topics, including advantages of graph genomes over a linear reference representation, design of new methods that can leverage graph-based data structures, and novel visualization and annotation approaches for pangenomes. Additionally, the participants self-organized themselves into teams that worked intensely over a three-day period to build a set of pipelines and tools for specific pangenomic applications. A summary of the questions raised and the tools developed are reported in this manuscript.