Search | VHL Regional Portal

1.

Analysis and benchmarking of small and large genomic variants across tandem repeats.

English, Adam C; Dolzhenko, Egor; Ziaei Jam, Helyaneh; McKenzie, Sean K; Olson, Nathan D; De Coster, Wouter; Park, Jonghun; Gu, Bida; Wagner, Justin; Eberle, Michael A; Gymrek, Melissa; Chaisson, Mark J P; Zook, Justin M; Sedlazeck, Fritz J.

Nat Biotechnol ; 2024 Apr 26.

Article in English | MEDLINE | ID: mdl-38671154

ABSTRACT

Tandem repeats (TRs) are highly polymorphic in the human genome, have thousands of associated molecular traits and are linked to over 60 disease phenotypes. However, they are often excluded from at-scale studies because of challenges with variant calling and representation, as well as a lack of a genome-wide standard. Here, to promote the development of TR methods, we created a catalog of TR regions and explored TR properties across 86 haplotype-resolved long-read human assemblies. We curated variants from the Genome in a Bottle (GIAB) HG002 individual to create a TR dataset to benchmark existing and future TR analysis methods. We also present an improved variant comparison method that handles variants greater than 4 bp in length and varying allelic representation. The 8.1% of the genome covered by the TR catalog holds ~24.9% of variants per individual, including 124,728 small and 17,988 large variants for the GIAB HG002 'truth-set' TR benchmark. We demonstrate the utility of this pipeline across short-read and long-read technologies.

2.

A High-Quality Blue Whale Genome, Segmental Duplications, and Historical Demography.

Bukhman, Yury V; Morin, Phillip A; Meyer, Susanne; Chu, Li-Fang; Jacobsen, Jeff K; Antosiewicz-Bourget, Jessica; Mamott, Daniel; Gonzales, Maylie; Argus, Cara; Bolin, Jennifer; Berres, Mark E; Fedrigo, Olivier; Steill, John; Swanson, Scott A; Jiang, Peng; Rhie, Arang; Formenti, Giulio; Phillippy, Adam M; Harris, Robert S; Wood, Jonathan M D; Howe, Kerstin; Kirilenko, Bogdan M; Munegowda, Chetan; Hiller, Michael; Jain, Aashish; Kihara, Daisuke; Johnston, J Spencer; Ionkov, Alexander; Raja, Kalpana; Toh, Huishi; Lang, Aimee; Wolf, Magnus; Jarvis, Erich D; Thomson, James A; Chaisson, Mark J P; Stewart, Ron.

Mol Biol Evol ; 41(3)2024 Mar 01.

Article in English | MEDLINE | ID: mdl-38376487

ABSTRACT

The blue whale, Balaenoptera musculus, is the largest animal known to have ever existed, making it an important case study in longevity and resistance to cancer. To further this and other blue whale-related research, we report a reference-quality, long-read-based genome assembly of this fascinating species. We assembled the genome from PacBio long reads and utilized Illumina/10×, optical maps, and Hi-C data for scaffolding, polishing, and manual curation. We also provided long-read RNA-seq data to facilitate the annotation of the assembly by NCBI and Ensembl. Additionally, we annotated both haplotypes using TOGA and measured the genome size by flow cytometry. We then compared the blue whale genome with other cetaceans and artiodactyls, including vaquita (Phocoena sinus), the world's smallest cetacean, to investigate the blue whale's unique biological traits. We found a dramatic amplification of several genes in the blue whale genome resulting from a recent burst in segmental duplications, though the possible connection between this amplification and giant body size requires further study. We also discovered sites in the insulin-like growth factor-1 gene correlated with body size in cetaceans. Finally, using our assembly to examine the heterozygosity and historical demography of Pacific and Atlantic blue whale populations, we found that the genomes of both populations are highly heterozygous and that their genetic isolation dates to the last interglacial period. Taken together, these results indicate how a high-quality, annotated blue whale genome will serve as an important resource for biology, evolution, and conservation research.

Subject(s)

Balaenoptera , Neoplasms , Animals , Balaenoptera/genetics , Segmental Duplications, Genomic , Genome , Demography , Neoplasms/genetics

3.

Chromosome level genome assembly of the Etruscan shrew Suncus etruscus.

Bukhman, Yury V; Meyer, Susanne; Chu, Li-Fang; Abueg, Linelle; Antosiewicz-Bourget, Jessica; Balacco, Jennifer; Brecht, Michael; Dinatale, Erica; Fedrigo, Olivier; Formenti, Giulio; Fungtammasan, Arkarachai; Giri, Swagarika Jaharlal; Hiller, Michael; Howe, Kerstin; Kihara, Daisuke; Mamott, Daniel; Mountcastle, Jacquelyn; Pelan, Sarah; Rabbani, Keon; Sims, Ying; Tracey, Alan; Wood, Jonathan M D; Jarvis, Erich D; Thomson, James A; Chaisson, Mark J P; Stewart, Ron.

Sci Data ; 11(1): 176, 2024 Feb 07.

Article in English | MEDLINE | ID: mdl-38326333

ABSTRACT

Suncus etruscus is one of the world's smallest mammals, with an average body mass of about 2 grams. The Etruscan shrew's small body is accompanied by a very high energy demand and numerous metabolic adaptations. Here we report a chromosome-level genome assembly using PacBio long read sequencing, 10X Genomics linked short reads, optical mapping, and Hi-C linked reads. The assembly is partially phased, with the 2.472 Gbp primary pseudohaplotype and 1.515 Gbp alternate. We manually curated the primary assembly and identified 22 chromosomes, including X and Y sex chromosomes. The NCBI genome annotation pipeline identified 39,091 genes, 19,819 of them protein-coding. We also identified segmental duplications, inferred GO term annotations, and computed orthologs of human and mouse genes. This reference-quality genome will be an important resource for research on mammalian development, metabolism, and body size control.

Subject(s)

Chromosomes , Shrews , Animals , Mice , Chromosomes/genetics , Genome , Genomics , Molecular Sequence Annotation , Shrews/genetics

4.

Scalable, accessible and reproducible reference genome assembly and evaluation in Galaxy.

Larivière, Delphine; Abueg, Linelle; Brajuka, Nadolina; Gallardo-Alba, Cristóbal; Grüning, Bjorn; Ko, Byung June; Ostrovsky, Alex; Palmada-Flores, Marc; Pickett, Brandon D; Rabbani, Keon; Antunes, Agostinho; Balacco, Jennifer R; Chaisson, Mark J P; Cheng, Haoyu; Collins, Joanna; Couture, Melanie; Denisova, Alexandra; Fedrigo, Olivier; Gallo, Guido Roberto; Giani, Alice Maria; Gooder, Grenville MacDonald; Horan, Kathleen; Jain, Nivesh; Johnson, Cassidy; Kim, Heebal; Lee, Chul; Marques-Bonet, Tomas; O'Toole, Brian; Rhie, Arang; Secomandi, Simona; Sozzoni, Marcella; Tilley, Tatiana; Uliano-Silva, Marcela; van den Beek, Marius; Williams, Robert W; Waterhouse, Robert M; Phillippy, Adam M; Jarvis, Erich D; Schatz, Michael C; Nekrutenko, Anton; Formenti, Giulio.

Nat Biotechnol ; 42(3): 367-370, 2024 Mar.

Article in English | MEDLINE | ID: mdl-38278971

Subject(s)

Computational Biology , Software

5.

Benchmarking of small and large variants across tandem repeats.

English, Adam; Dolzhenko, Egor; Jam, Helyaneh Ziaei; Mckenzie, Sean; Olson, Nathan D; De Coster, Wouter; Park, Jonghun; Gu, Bida; Wagner, Justin; Eberle, Michael A; Gymrek, Melissa; Chaisson, Mark J P; Zook, Justin M; Sedlazeck, Fritz J.

bioRxiv ; 2023 Nov 01.

Article in English | MEDLINE | ID: mdl-37961319

ABSTRACT

Tandem repeats (TRs) are highly polymorphic in the human genome, have thousands of associated molecular traits, and are linked to over 60 disease phenotypes. However, their complexity often excludes them from at-scale studies due to challenges with variant calling, representation, and lack of a genome-wide standard. To promote TR methods development, we create a comprehensive catalog of TR regions and explore its properties across 86 samples. We then curate variants from the GIAB HG002 individual to create a tandem repeat benchmark. We also present a variant comparison method that handles small and large alleles and varying allelic representation. The 8.1% of the genome covered by the TR catalog holds ~24.9% of variants per individual, including 124,728 small and 17,988 large variants for the GIAB HG002 TR benchmark. We work with the GIAB community to demonstrate the utility of this benchmark across short and long read technologies.

6.

Advances in the discovery and analyses of human tandem repeats.

Chaisson, Mark J P; Sulovari, Arvis; Valdmanis, Paul N; Miller, Danny E; Eichler, Evan E.

Emerg Top Life Sci ; 7(3): 361-381, 2023 Dec 14.

Article in English | MEDLINE | ID: mdl-37905568

ABSTRACT

Long-read sequencing platforms provide unparalleled access to the structure and composition of all classes of tandemly repeated DNA from STRs to satellite arrays. This review summarizes our current understanding of their organization within the human genome, their importance with respect to disease, as well as the advances and challenges in understanding their genetic diversity and functional effects. Novel computational methods are being developed to visualize and associate these complex patterns of human variation with disease, expression, and epigenetic differences. We predict accurate characterization of this repeat-rich form of human variation will become increasingly relevant to both basic and clinical human genetics.

Subject(s)

DNA , Tandem Repeat Sequences , Humans , Tandem Repeat Sequences/genetics , Epigenesis, Genetic

7.

HQAlign: aligning nanopore reads for SV detection using current-level modeling.

Joshi, Dhaivat; Diggavi, Suhas; Chaisson, Mark J P; Kannan, Sreeram.

Bioinformatics ; 39(10)2023 10 03.

Article in English | MEDLINE | ID: mdl-37738608

ABSTRACT

MOTIVATION: Detection of structural variants (SVs) from the alignment of sample DNA reads to the reference genome is an important problem in understanding human diseases. Long reads that can span repeat regions, along with an accurate alignment of these long reads, play an important role in identifying novel SVs. Long-read sequencers, such as nanopore sequencing, can address this problem by providing very long reads but with high error rates, making accurate alignment challenging. Many errors induced by nanopore sequencing have a bias because of the physics of the sequencing process, and proper utilization of these error characteristics can play an important role in designing a robust aligner for SV detection problems. In this article, we design and evaluate HQAlign, an aligner for SV detection using nanopore sequenced reads. The key ideas of HQAlign include (i) using base-called nanopore reads along with the nanopore physics to improve alignments for SVs, (ii) incorporating SV-specific changes to the alignment pipeline, and (iii) adapting these into an existing state-of-the-art long-read aligner pipeline, minimap2 (v2.24), for efficient alignments. RESULTS: We show that HQAlign captures about 4%-6% complementary SVs across different datasets, which are missed by minimap2 alignments, while having a standalone performance on par with minimap2 for real nanopore read data. For the common SV calls between HQAlign and minimap2, HQAlign improves the start and the end breakpoint accuracy by about 10%-50% for SVs across different datasets. Moreover, HQAlign improves the alignment rate to 89.35% from minimap2's 85.64% for nanopore read alignment to the recent telomere-to-telomere CHM13 assembly, and to 86.65% from 83.48% for nanopore read alignment to the GRCh37 human genome. AVAILABILITY AND IMPLEMENTATION: https://github.com/joshidhaivat/HQAlign.git.

Subject(s)

Nanopores , Humans , Sequence Analysis, DNA , High-Throughput Nucleotide Sequencing , Genome, Human , DNA

8.

vamos: variable-number tandem repeats annotation using efficient motif sets.

Ren, Jingwen; Gu, Bida; Chaisson, Mark J P.

Genome Biol ; 24(1): 175, 2023 07 27.

Article in English | MEDLINE | ID: mdl-37501141

ABSTRACT

Roughly 3% of the human genome is composed of variable-number tandem repeats (VNTRs): arrays of motifs of at least six bases. These loci are highly polymorphic, yet current approaches that define and merge variants based on alignment breakpoints do not capture their full diversity. Here we present a method, vamos (VNTR Annotation using efficient Motif Sets), that instead annotates VNTRs using repeat composition under different levels of motif diversity. Using vamos, we estimate 7.4-16.7 alleles per locus when applied to 74 haplotype-resolved human assemblies, compared to breakpoint-based approaches that estimate 4.0-5.5 alleles per locus.

Subject(s)

Minisatellite Repeats , Humans

9.

The motif composition of variable number tandem repeats impacts gene expression.

Lu, Tsung-Yu; Smaruj, Paulina N; Fudenberg, Geoffrey; Mancuso, Nicholas; Chaisson, Mark J P.

Genome Res ; 33(4): 511-524, 2023 04.

Article in English | MEDLINE | ID: mdl-37037626

ABSTRACT

Understanding the impact of DNA variation on human traits is a fundamental question in human genetics. Variable number tandem repeats (VNTRs) make up ~3% of the human genome but are often excluded from association analysis owing to poor read mappability or divergent repeat content. Although methods exist to estimate VNTR length from short-read data, it is known that VNTRs vary in both length and repeat (motif) composition. Here, we use a repeat-pangenome graph (RPGG) constructed on 35 haplotype-resolved assemblies to detect variation in both VNTR length and repeat composition. We align population-scale data from the Genotype-Tissue Expression (GTEx) Consortium to examine how variations in sequence composition may be linked to expression, including cases independent of overall VNTR length. We find that 9422 out of 39,125 VNTRs are associated with nearby gene expression through motif variations, of which only 23.4% are accessible from length. Fine-mapping identifies 174 genes to be likely driven by variation in certain VNTR motifs and not overall length. We highlight two genes, CACNA1C and RNF213, that have expression associated with motif variation, showing the utility of RPGG analysis as a new approach for trait association in multiallelic and highly variable loci.

Subject(s)

Adenosine Triphosphatases , Minisatellite Repeats , Humans , Minisatellite Repeats/genetics , Phenotype , Haplotypes , Gene Expression , Adenosine Triphosphatases/genetics , Ubiquitin-Protein Ligases/genetics

10.

HQAlign: Aligning nanopore reads for SV detection using current-level modeling.

Joshi, Dhaivat; Diggavi, Suhas; Chaisson, Mark J P; Kannan, Sreeram.

bioRxiv ; 2023 01 09.

Article in English | MEDLINE | ID: mdl-36712127

ABSTRACT

Motivation: Detection of structural variants (SVs) from the alignment of sample DNA reads to the reference genome is an important problem in understanding human diseases. Long reads that can span repeat regions, along with an accurate alignment of these long reads, play an important role in identifying novel SVs. Long-read sequencers such as nanopore sequencing can address this problem by providing very long reads but with high error rates, making accurate alignment challenging. Many errors induced by nanopore sequencing have a bias because of the physics of the sequencing process, and proper utilization of these error characteristics can play an important role in designing a robust aligner for SV detection problems. In this paper, we design and evaluate HQAlign, an aligner for SV detection using nanopore sequenced reads. The key ideas of HQAlign include (i) using basecalled nanopore reads along with the nanopore physics to improve alignments for SVs, (ii) incorporating SV-specific changes to the alignment pipeline, and (iii) adapting these into an existing state-of-the-art long-read aligner pipeline, minimap2 (v2.24), for efficient alignments. Results: We show that HQAlign captures about 4%-6% complementary SVs across different datasets, which are missed by minimap2 alignments, while having a standalone performance on par with minimap2 for real nanopore read data. For the common SV calls between HQAlign and minimap2, HQAlign improves the start and the end breakpoint accuracy for about 10%-50% of SVs across different datasets. Moreover, HQAlign improves the alignment rate to 89.35% from minimap2's 85.64% for nanopore read alignment to the recent telomere-to-telomere CHM13 assembly, and to 86.65% from 83.48% for nanopore read alignment to the GRCh37 human genome. Availability: https://github.com/joshidhaivat/HQAlign.git.

11.

HQAlign: Aligning nanopore reads for SV detection using current-level modeling.

Joshi, Dhaivat; Diggavi, Suhas; Chaisson, Mark J P; Kannan, Sreeram.

ArXiv ; 2023 01 10.

Article in English | MEDLINE | ID: mdl-36713252

ABSTRACT

MOTIVATION: Detection of structural variants (SVs) from the alignment of sample DNA reads to the reference genome is an important problem in understanding human diseases. Long reads that can span repeat regions, along with an accurate alignment of these long reads, play an important role in identifying novel SVs. Long-read sequencers such as nanopore sequencing can address this problem by providing very long reads but with high error rates, making accurate alignment challenging. Many errors induced by nanopore sequencing have a bias because of the physics of the sequencing process, and proper utilization of these error characteristics can play an important role in designing a robust aligner for SV detection problems. In this paper, we design and evaluate HQAlign, an aligner for SV detection using nanopore sequenced reads. The key ideas of HQAlign include (i) using basecalled nanopore reads along with the nanopore physics to improve alignments for SVs, (ii) incorporating SV-specific changes to the alignment pipeline, and (iii) adapting these into an existing state-of-the-art long-read aligner pipeline, minimap2 (v2.24), for efficient alignments. RESULTS: We show that HQAlign captures about 4%-6% complementary SVs across different datasets, which are missed by minimap2 alignments, while having a standalone performance on par with minimap2 for real nanopore read data. For the common SV calls between HQAlign and minimap2, HQAlign improves the start and the end breakpoint accuracy for about 10%-50% of SVs across different datasets. Moreover, HQAlign improves the alignment rate to 89.35% from minimap2's 85.64% for nanopore read alignment to the recent telomere-to-telomere CHM13 assembly, and to 86.65% from 83.48% for nanopore read alignment to the GRCh37 human genome.

12.

A haplotype-resolved genome assembly of the Nile rat facilitates exploration of the genetic basis of diabetes.

Toh, Huishi; Yang, Chentao; Formenti, Giulio; Raja, Kalpana; Yan, Lily; Tracey, Alan; Chow, William; Howe, Kerstin; Bergeron, Lucie A; Zhang, Guojie; Haase, Bettina; Mountcastle, Jacquelyn; Fedrigo, Olivier; Fogg, John; Kirilenko, Bogdan; Munegowda, Chetan; Hiller, Michael; Jain, Aashish; Kihara, Daisuke; Rhie, Arang; Phillippy, Adam M; Swanson, Scott A; Jiang, Peng; Clegg, Dennis O; Jarvis, Erich D; Thomson, James A; Stewart, Ron; Chaisson, Mark J P; Bukhman, Yury V.

BMC Biol ; 20(1): 245, 2022 11 08.

Article in English | MEDLINE | ID: mdl-36344967

ABSTRACT

BACKGROUND: The Nile rat (Arvicanthis niloticus) is an important animal model because of its robust diurnal rhythm, a cone-rich retina, and a propensity to develop diet-induced diabetes without chemical or genetic modifications. A closer similarity to humans in these aspects, compared to the widely used Mus musculus and Rattus norvegicus models, holds the promise of better translation of research findings to the clinic. RESULTS: We report a 2.5 Gb, chromosome-level reference genome assembly with fully resolved parental haplotypes, generated with the Vertebrate Genomes Project (VGP). The assembly is highly contiguous, with contig N50 of 11.1 Mb, scaffold N50 of 83 Mb, and 95.2% of the sequence assigned to chromosomes. We used a novel workflow to identify 3613 segmental duplications and quantify duplicated genes. Comparative analyses revealed unique genomic features of the Nile rat, including some that affect genes associated with type 2 diabetes and metabolic dysfunctions. We discuss 14 genes that are heterozygous in the Nile rat or highly diverged from the house mouse. CONCLUSIONS: Our findings reflect the exceptional level of genomic resolution present in this assembly, which will greatly expand the potential of the Nile rat as a model organism.

Subject(s)

Diabetes Mellitus, Type 2 , Humans , Animals , Haplotypes , Diabetes Mellitus, Type 2/genetics , Murinae , Genome , Genomics

13.

Semi-automated assembly of high-quality diploid human reference genomes.

Jarvis, Erich D; Formenti, Giulio; Rhie, Arang; Guarracino, Andrea; Yang, Chentao; Wood, Jonathan; Tracey, Alan; Thibaud-Nissen, Francoise; Vollger, Mitchell R; Porubsky, David; Cheng, Haoyu; Asri, Mobin; Logsdon, Glennis A; Carnevali, Paolo; Chaisson, Mark J P; Chin, Chen-Shan; Cody, Sarah; Collins, Joanna; Ebert, Peter; Escalona, Merly; Fedrigo, Olivier; Fulton, Robert S; Fulton, Lucinda L; Garg, Shilpa; Gerton, Jennifer L; Ghurye, Jay; Granat, Anastasiya; Green, Richard E; Harvey, William; Hasenfeld, Patrick; Hastie, Alex; Haukness, Marina; Jaeger, Erich B; Jain, Miten; Kirsche, Melanie; Kolmogorov, Mikhail; Korbel, Jan O; Koren, Sergey; Korlach, Jonas; Lee, Joyce; Li, Daofeng; Lindsay, Tina; Lucas, Julian; Luo, Feng; Marschall, Tobias; Mitchell, Matthew W; McDaniel, Jennifer; Nie, Fan; Olsen, Hugh E; Olson, Nathan D.

Nature ; 611(7936): 519-531, 2022 Nov.

Article in English | MEDLINE | ID: mdl-36261518

ABSTRACT

The current human reference genome, GRCh38, represents over 20 years of effort to generate a high-quality assembly, which has benefitted society [1,2]. However, it still has many gaps and errors, and does not represent a biological genome as it is a blend of multiple individuals [3,4]. Recently, a high-quality telomere-to-telomere reference, CHM13, was generated with the latest long-read technologies, but it was derived from a hydatidiform mole cell line with a nearly homozygous genome [5]. To address these limitations, the Human Pangenome Reference Consortium formed with the goal of creating high-quality, cost-effective, diploid genome assemblies for a pangenome reference that represents human genetic diversity [6]. Here, in our first scientific report, we determined which combination of current genome sequencing and assembly approaches yields the most complete and accurate diploid genome assembly with minimal manual curation. Approaches that used highly accurate long reads and parent-child data with graph-based haplotype phasing during assembly outperformed those that did not. Developing a combination of the top-performing methods, we generated our first high-quality diploid reference assembly, containing only approximately four gaps per chromosome on average, with most chromosomes within ±1% of the length of CHM13. Nearly 48% of protein-coding genes have non-synonymous amino acid changes between haplotypes, and centromeric regions showed the highest diversity. Our findings serve as a foundation for assembling near-complete diploid human genomes at scale for a pangenome reference to capture global genetic variation from single nucleotides to structural rearrangements.

Subject(s)

Chromosome Mapping , Diploidy , Genome, Human , Genomics , Humans , Chromosome Mapping/standards , Genome, Human/genetics , Haplotypes/genetics , High-Throughput Nucleotide Sequencing/methods , High-Throughput Nucleotide Sequencing/standards , Sequence Analysis, DNA/methods , Sequence Analysis, DNA/standards , Reference Standards , Genomics/methods , Genomics/standards , Chromosomes, Human/genetics , Genetic Variation/genetics

14.

TT-Mars: structural variants assessment based on haplotype-resolved assemblies.

Yang, Jianzhi; Chaisson, Mark J P.

Genome Biol ; 23(1): 110, 2022 05 06.

Article in English | MEDLINE | ID: mdl-35524317

ABSTRACT

Variant benchmarking is often performed by comparing a test callset to a gold standard set of variants. In repetitive regions of the genome, it may be difficult to establish what is the truth for a call, for example, when different alignment scoring metrics provide equally supported but different variant calls on the same data. Here, we provide an alternative approach, TT-Mars, that takes advantage of the recent production of high-quality haplotype-resolved genome assemblies by providing false discovery rates for variant calls based on how well their call reflects the content of the assembly, rather than comparing calls themselves.

Subject(s)

Polymorphism, Single Nucleotide , Software , Benchmarking , Genome , Haplotypes , High-Throughput Nucleotide Sequencing

15.

The Human Pangenome Project: a global resource to map genomic diversity.

Wang, Ting; Antonacci-Fulton, Lucinda; Howe, Kerstin; Lawson, Heather A; Lucas, Julian K; Phillippy, Adam M; Popejoy, Alice B; Asri, Mobin; Carson, Caryn; Chaisson, Mark J P; Chang, Xian; Cook-Deegan, Robert; Felsenfeld, Adam L; Fulton, Robert S; Garrison, Erik P; Garrison, Nanibaa' A; Graves-Lindsay, Tina A; Ji, Hanlee; Kenny, Eimear E; Koenig, Barbara A; Li, Daofeng; Marschall, Tobias; McMichael, Joshua F; Novak, Adam M; Purushotham, Deepak; Schneider, Valerie A; Schultz, Baergen I; Smith, Michael W; Sofia, Heidi J; Weissman, Tsachy; Flicek, Paul; Li, Heng; Miga, Karen H; Paten, Benedict; Jarvis, Erich D; Hall, Ira M; Eichler, Evan E; Haussler, David.

Nature ; 604(7906): 437-446, 2022 04.

Article in English | MEDLINE | ID: mdl-35444317

ABSTRACT

The human reference genome is the most widely used resource in human genetics and is due for a major update. Its current structure is a linear composite of merged haplotypes from more than 20 people, with a single individual comprising most of the sequence. It contains biases and errors within a framework that does not represent global human genomic variation. A high-quality reference with global representation of common variants, including single-nucleotide variants, structural variants and functional elements, is needed. The Human Pangenome Reference Consortium aims to create a more sophisticated and complete human reference genome with a graph-based, telomere-to-telomere representation of global genomic diversity. Here we leverage innovations in technology, study design and global partnerships with the goal of constructing the highest-possible quality human pangenome reference. Our goal is to improve data representation and streamline analyses to enable routine assembly of complete diploid genomes. With attention to ethical frameworks, the human pangenome reference will contain a more accurate and diverse representation of global genomic variation, improve gene-disease association studies across populations, expand the scope of genomics research to the most repetitive and polymorphic regions of the genome, and serve as the ultimate genetic resource for future biomedical research and precision medicine.

Subject(s)

Genome, Human , Genomics , Genome, Human/genetics , Haplotypes/genetics , High-Throughput Nucleotide Sequencing , Humans , Sequence Analysis, DNA

16.

Profiling variable-number tandem repeat variation across populations using repeat-pangenome graphs.

Lu, Tsung-Yu; Chaisson, Mark J P.

Nat Commun ; 12(1): 4250, 2021 07 12.

Article in English | MEDLINE | ID: mdl-34253730

ABSTRACT

Variable number tandem repeats (VNTRs) are composed of consecutive repetitive DNA with hypervariable repeat count and composition. They include protein coding sequences and associations with clinical disorders. It has been difficult to incorporate VNTR analysis in disease studies that use short-read sequencing because the traditional approach of mapping to the human reference is less effective for repetitive and divergent sequences. In this work, we solve VNTR mapping for short reads with a repeat-pangenome graph (RPGG), a data structure that encodes both the population diversity and repeat structure of VNTR loci from multiple haplotype-resolved assemblies. We develop software to build an RPGG, and use the RPGG to estimate VNTR composition with short reads. We use this to discover VNTRs with length stratified by continental population, and expression quantitative trait loci, indicating that RPGG analysis of VNTRs will be critical for future studies of diversity and disease.

Subject(s)

Genetic Variation , Genetics, Population , Genome, Human , Minisatellite Repeats/genetics , Chromosome Mapping , Gene Expression Regulation , Genetic Loci , Humans , Nucleotide Motifs/genetics , Quantitative Trait Loci/genetics

17.

lra: A long read aligner for sequences and contigs.

Ren, Jingwen; Chaisson, Mark J P.

PLoS Comput Biol ; 17(6): e1009078, 2021 06.

Article in English | MEDLINE | ID: mdl-34153026

ABSTRACT

It is computationally challenging to detect variation by aligning single-molecule sequencing (SMS) reads, or contigs from SMS assemblies. One approach to efficiently align SMS reads is sparse dynamic programming (SDP), where optimal chains of exact matches are found between the sequence and the genome. While straightforward implementations of SDP penalize gaps with a cost that is a linear function of gap length, biological variation is more accurately represented when gap cost is a concave function of gap length. We have developed a method, lra, that uses SDP with a concave-cost gap penalty, and used lra to align long-read sequences from PacBio and Oxford Nanopore (ONT) instruments as well as de novo assembly contigs. This alignment approach increases sensitivity and specificity for SV discovery, particularly for variants above 1 kb and when discovering variation from ONT reads, while having runtimes that are comparable (1.05-3.76×) to current methods. When applied to calling variation from de novo assembly contigs, there is a 3.2% increase in Truvari F1 score compared to minimap2+htsbox. lra is available in bioconda (https://anaconda.org/bioconda/lra) and github (https://github.com/ChaissonLab/LRA).

Subject(s)

Contig Mapping/statistics & numerical data , Sequence Alignment/statistics & numerical data , Software , Cluster Analysis , Computational Biology , Computer Simulation , Databases, Nucleic Acid/statistics & numerical data , Genetic Variation , Genome, Human , High-Throughput Nucleotide Sequencing , Humans , Programming, Linear , Sequence Analysis, DNA
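
Illustrative note on record 17: the abstract contrasts linear and concave gap penalties when chaining exact matches. The sketch below is not lra's implementation. It is a naive quadratic-time chaining routine, and the cost functions, their constants, and all names (`best_chain`, `concave_gap_cost`, `linear_gap_cost`) are assumptions chosen only to show why a concave penalty can keep anchors chained across a large indel.

```python
import math

def linear_gap_cost(gap, a=1.0):
    """Penalty proportional to gap length (assumed constant a)."""
    return a * gap

def concave_gap_cost(gap, a=2.0, b=1.0):
    """Hypothetical concave penalty: grows like log(gap) rather than linearly."""
    return 0.0 if gap == 0 else b + a * math.log(gap + 1)

def best_chain(anchors, gap_cost):
    """Find the highest-scoring colinear chain of exact matches.

    anchors: list of (read_start, ref_start, length) tuples.
    Score = total matched length minus gap_cost of each diagonal shift.
    Naive O(n^2) dynamic programming, for illustration only.
    """
    anchors = sorted(anchors)
    n = len(anchors)
    if n == 0:
        return 0.0, []
    score = [float(a[2]) for a in anchors]  # best chain score ending at each anchor
    prev = [-1] * n                         # back-pointers for traceback
    for j in range(n):
        rj, gj, lj = anchors[j]
        for i in range(j):
            ri, gi, li = anchors[i]
            # predecessor must end before anchor j starts on both sequences
            if ri + li <= rj and gi + li <= gj:
                gap = abs((rj - ri) - (gj - gi))  # shift between diagonals
                cand = score[i] + lj - gap_cost(gap)
                if cand > score[j]:
                    score[j] = cand
                    prev[j] = i
    end = max(range(n), key=score.__getitem__)
    best = score[end]
    chain = []
    while end != -1:
        chain.append(anchors[end])
        end = prev[end]
    return best, chain[::-1]

# Two 20-bp matches separated by a 1000-bp gap on the reference only.
example = [(0, 0, 20), (20, 1020, 20)]
```

With the `example` anchors, `best_chain(example, linear_gap_cost)` keeps a single anchor because the 1000-bp shift costs more than the second match is worth, while `best_chain(example, concave_gap_cost)` chains both anchors across the gap.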

18.

Expectations and blind spots for structural variation detection from long-read assemblies and short-read genome sequencing technologies.

Zhao, Xuefang; Collins, Ryan L; Lee, Wan-Ping; Weber, Alexandra M; Jun, Yukyung; Zhu, Qihui; Weisburd, Ben; Huang, Yongqing; Audano, Peter A; Wang, Harold; Walker, Mark; Lowther, Chelsea; Fu, Jack; Gerstein, Mark B; Devine, Scott E; Marschall, Tobias; Korbel, Jan O; Eichler, Evan E; Chaisson, Mark J P; Lee, Charles; Mills, Ryan E; Brand, Harrison; Talkowski, Michael E.

Am J Hum Genet ; 108(5): 919-928, 2021 05 06.

Article in English | MEDLINE | ID: mdl-33789087

ABSTRACT

Virtually all genome sequencing efforts in national biobanks, complex and Mendelian disease programs, and medical genetic initiatives are reliant upon short-read whole-genome sequencing (srWGS), which presents challenges for the detection of structural variants (SVs) relative to emerging long-read WGS (lrWGS) technologies. Given this ubiquity of srWGS in large-scale genomics initiatives, we sought to establish expectations for routine SV detection from this data type by comparison with lrWGS assembly, as well as to quantify the genomic properties and added value of SVs uniquely accessible to each technology. Analyses from the Human Genome Structural Variation Consortium (HGSVC) of three families captured ~11,000 SVs per genome from srWGS and ~25,000 SVs per genome from lrWGS assembly. Detection power and precision for SV discovery varied dramatically by genomic context and variant class: 9.7% of the current GRCh38 reference is defined by segmental duplication (SD) and simple repeat (SR), yet 91.4% of deletions that were specifically discovered by lrWGS localized to these regions. Across the remaining 90.3% of reference sequence, we observed extremely high (93.8%) concordance between technologies for deletions in these datasets. In contrast, lrWGS was superior for detection of insertions across all genomic contexts. Given that non-SD/SR sequences encompass 95.9% of currently annotated disease-associated exons, improved sensitivity from lrWGS to discover novel pathogenic deletions in these currently interpretable genomic regions is likely to be incremental. However, these analyses highlight the considerable added value of assembly-based lrWGS to create new catalogs of insertions and transposable elements, as well as disease-associated repeat expansions in genomic sequences that were previously recalcitrant to routine assessment.

Subject(s)

Genome, Human/genetics , Genomic Structural Variation , Genomics/methods , Goals , Whole Genome Sequencing/methods , Whole Genome Sequencing/standards , DNA Copy Number Variations , Exons/genetics , Humans , Research Design , Segmental Duplications, Genomic , Sequence Alignment

19.

Haplotype-resolved diverse human genomes and integrated analysis of structural variation.

Ebert, Peter; Audano, Peter A; Zhu, Qihui; Rodriguez-Martin, Bernardo; Porubsky, David; Bonder, Marc Jan; Sulovari, Arvis; Ebler, Jana; Zhou, Weichen; Serra Mari, Rebecca; Yilmaz, Feyza; Zhao, Xuefang; Hsieh, PingHsun; Lee, Joyce; Kumar, Sushant; Lin, Jiadong; Rausch, Tobias; Chen, Yu; Ren, Jingwen; Santamarina, Martin; Höps, Wolfram; Ashraf, Hufsah; Chuang, Nelson T; Yang, Xiaofei; Munson, Katherine M; Lewis, Alexandra P; Fairley, Susan; Tallon, Luke J; Clarke, Wayne E; Basile, Anna O; Byrska-Bishop, Marta; Corvelo, André; Evani, Uday S; Lu, Tsung-Yu; Chaisson, Mark J P; Chen, Junjie; Li, Chong; Brand, Harrison; Wenger, Aaron M; Ghareghani, Maryam; Harvey, William T; Raeder, Benjamin; Hasenfeld, Patrick; Regier, Allison A; Abel, Haley J; Hall, Ira M; Flicek, Paul; Stegle, Oliver; Gerstein, Mark B; Tubio, Jose M C.

Science ; 372(6537), 2021 04 02.

Article in English | MEDLINE | ID: mdl-33632895

ABSTRACT

Long-read and strand-specific sequencing technologies together facilitate the de novo assembly of high-quality haplotype-resolved human genomes without parent-child trio data. We present 64 assembled haplotypes from 32 diverse human genomes. These highly contiguous haplotype assemblies (average minimum contig length needed to cover 50% of the genome: 26 million base pairs) integrate all forms of genetic variation, even across complex loci. We identified 107,590 structural variants (SVs), of which 68% were not discovered with short-read sequencing, and 278 SV hotspots (spanning megabases of gene-rich sequence). We characterized 130 of the most active mobile element source elements and found that 63% of all SVs arise through homology-mediated mechanisms. This resource enables reliable graph-based genotyping from short reads of up to 50,340 SVs, resulting in the identification of 1526 expression quantitative trait loci as well as SV candidates for adaptive selection within the human population.

Subject(s)

Genetic Variation , Genome, Human , Haplotypes , Female , Genotype , High-Throughput Nucleotide Sequencing , Humans , INDEL Mutation , Interspersed Repetitive Sequences , Male , Population Groups/genetics , Quantitative Trait Loci , Retroelements , Sequence Analysis, DNA , Sequence Inversion , Whole Genome Sequencing

20.

Fully phased human genome assembly without parental data using single-cell strand sequencing and long reads.

Porubsky, David; Ebert, Peter; Audano, Peter A; Vollger, Mitchell R; Harvey, William T; Marijon, Pierre; Ebler, Jana; Munson, Katherine M; Sorensen, Melanie; Sulovari, Arvis; Haukness, Marina; Ghareghani, Maryam; Lansdorp, Peter M; Paten, Benedict; Devine, Scott E; Sanders, Ashley D; Lee, Charles; Chaisson, Mark J P; Korbel, Jan O; Eichler, Evan E; Marschall, Tobias.

Nat Biotechnol ; 39(3): 302-308, 2021 03.

Article in English | MEDLINE | ID: mdl-33288906

ABSTRACT

Human genomes are typically assembled as consensus sequences that lack information on parental haplotypes. Here we describe a reference-free workflow for diploid de novo genome assembly that combines the chromosome-wide phasing and scaffolding capabilities of single-cell strand sequencing [1,2] with continuous long-read or high-fidelity [3] sequencing data. Employing this strategy, we produced a completely phased de novo genome assembly for each haplotype of an individual of Puerto Rican descent (HG00733) in the absence of parental data. The assemblies are accurate (quality value > 40) and highly contiguous (contig N50 > 23 Mbp) with low switch error rates (0.17%), providing fully phased single-nucleotide variants, indels and structural variants. A comparison of Oxford Nanopore Technologies and Pacific Biosciences phased assemblies identified 154 regions that are preferential sites of contig breaks, irrespective of sequencing technology or phasing algorithms.

Subject(s)

Genome, Human , High-Throughput Nucleotide Sequencing/methods , Parents , Sequence Analysis, DNA/methods , Single-Cell Analysis/methods , Algorithms , Haplotypes , Humans , Puerto Rico/ethnology

