RESUMEN
MOTIVATION: In diploid organisms, phasing is the problem of assigning the alleles at heterozygous variants to one of two haplotypes. Reads from PacBio HiFi sequencing provide long, accurate observations that can be used as the basis for both calling and phasing variants. HiFi reads also excel at calling larger classes of variation, such as structural or tandem repeat variants. However, current phasing tools typically only phase small variants, leaving larger variants unphased. RESULTS: We developed HiPhase, a tool that jointly phases SNVs, indels, structural, and tandem repeat variants. The main benefits of HiPhase are (i) dual mode allele assignment for detecting large variants, (ii) a novel application of the A*-algorithm to phasing, and (iii) logic allowing phase blocks to span breaks caused by alignment issues around reference gaps and homozygous deletions. In our assessment, HiPhase produced an average phase block NG50 of 480 kb with 929 switchflip errors and fully phased 93.8% of genes, improving over the current state of the art. Additionally, HiPhase jointly phases SNVs, indels, structural, and tandem repeat variants and includes innate multi-threading, statistics gathering, and concurrent phased alignment output generation. AVAILABILITY AND IMPLEMENTATION: HiPhase is available as source code and a pre-compiled Linux binary with a user guide at https://github.com/PacificBiosciences/HiPhase.
Asunto(s)
Genoma Humano , Secuenciación de Nucleótidos de Alto Rendimiento , Humanos , Análisis de Secuencia de ADN , Algoritmos , Haplotipos , Secuencias Repetidas en TándemRESUMEN
Despite significant advancements in rare genetic disease diagnostics, many patients with rare genetic disease remain without a molecular diagnosis. Novel tools and methods are needed to improve the detection of disease-associated variants and understand the genetic basis of many rare diseases. Long-read genome sequencing provides improved sequencing in highly repetitive, homologous, and low-complexity regions, and improved assessment of structural variation and complex genomic rearrangements compared to short-read genome sequencing. As such, it is a promising method to explore overlooked genetic variants in rare diseases with a high suspicion of a genetic basis. We therefore applied PacBio HiFi sequencing in a large multi-generational family presenting with autosomal dominant 46,XY differences of sexual development (DSD), for whom extensive molecular testing over multiple decades had failed to identify a molecular diagnosis. This revealed a rare SINE-VNTR-Alu retroelement insertion in intron 4 of NR5A1, a gene in which loss-of-function variants are an established cause of 46,XY DSD. The insertion segregated among affected family members and was associated with loss-of-expression of alleles in cis, demonstrating a functional impact on NR5A1. This case highlights the power of long-read genome sequencing to detect genomic variants that have previously been intractable to detection by standard short-read genomic testing.
Asunto(s)
Trastorno del Desarrollo Sexual 46,XY , Retroelementos , Humanos , Mutación , Intrones/genética , Retroelementos/genética , Trastorno del Desarrollo Sexual 46,XY/genética , Enfermedades Raras/genética , Desarrollo Sexual , Factor Esteroidogénico 1/genéticaRESUMEN
The use of pharmacogenomics in clinical practice is becoming standard of care. However, due to the complex genetic makeup of pharmacogenes, not all genetic variation is currently accounted for. Here, we show the utility of long-read sequencing to resolve complex pharmacogenes by analyzing a well-characterised sample. This data consists of long reads that were processed to resolve phased haploblocks. 73% of pharmacogenes were fully covered in one phased haploblock, including 9/15 genes that are 100% complex. Variant calling accuracy in the pharmacogenes was high, with 99.8% recall and 100% precision for SNVs and 98.7% precision and 98.0% recall for Indels. For the majority of gene-drug interactions in the DPWG and CPIC guidelines, the associated genes could be fully resolved (62% and 63% respectively). Together, these findings suggest that long-read sequencing data offers promising opportunities in elucidating complex pharmacogenes and haplotype phasing while maintaining accurate variant calling.
Asunto(s)
Farmacogenética/métodos , Análisis de Secuencia de ADN/métodos , Variación Genética , Genoma Humano , Haplotipos , Secuenciación de Nucleótidos de Alto Rendimiento , Humanos , Reproducibilidad de los ResultadosRESUMEN
Tandem repeat (TR) variation is associated with gene expression changes and numerous rare monogenic diseases. Although long-read sequencing provides accurate full-length sequences and methylation of TRs, there is still a need for computational methods to profile TRs across the genome. Here we introduce the Tandem Repeat Genotyping Tool (TRGT) and an accompanying TR database. TRGT determines the consensus sequences and methylation levels of specified TRs from PacBio HiFi sequencing data. It also reports reads that support each repeat allele. These reads can be subsequently visualized with a companion TR visualization tool. Assessing 937,122 TRs, TRGT showed a Mendelian concordance of 98.38%, allowing a single repeat unit difference. In six samples with known repeat expansions, TRGT detected all expansions while also identifying methylation signals and mosaicism and providing finer repeat length resolution than existing methods. Additionally, we released a database with allele sequences and methylation levels for 937,122 TRs across 100 genomes.
Asunto(s)
Metilación de ADN , Secuencias Repetidas en Tándem , Secuencias Repetidas en Tándem/genética , Humanos , Metilación de ADN/genética , Genoma Humano/genética , Alelos , Análisis de Secuencia de ADN/métodos , Programas Informáticos , Bases de Datos GenéticasRESUMEN
Using five complementary short- and long-read sequencing technologies, we phased and assembled >95% of each diploid human genome in a four-generation, 28-member family (CEPH 1463) allowing us to systematically assess de novo mutations (DNMs) and recombination. From this family, we estimate an average of 192 DNMs per generation, including 75.5 de novo single-nucleotide variants (SNVs), 7.4 non-tandem repeat indels, 79.6 de novo indels or structural variants (SVs) originating from tandem repeats, 7.7 centromeric de novo SVs and SNVs, and 12.4 de novo Y chromosome events per generation. STRs and VNTRs are the most mutable with 32 loci exhibiting recurrent mutation through the generations. We accurately assemble 288 centromeres and six Y chromosomes across the generations, documenting de novo SVs, and demonstrate that the DNM rate varies by an order of magnitude depending on repeat content, length, and sequence identity. We show a strong paternal bias (75-81%) for all forms of germline DNM, yet we estimate that 17% of de novo SNVs are postzygotic in origin with no paternal bias. We place all this variation in the context of a high-resolution recombination map (~3.5 kbp breakpoint resolution). We observe a strong maternal recombination bias (1.36 maternal:paternal ratio) with a consistent reduction in the number of crossovers with increasing paternal (r=0.85) and maternal (r=0.65) age. However, we observe no correlation between meiotic crossover locations and de novo SVs, arguing against non-allelic homologous recombination as a predominant mechanism. The use of multiple orthogonal technologies, near-telomere-to-telomere phased genome assemblies, and a multi-generation family to assess transmission has created the most comprehensive, publicly available "truth set" of all classes of genomic variants. The resource can be used to test and benchmark new algorithms and technologies to understand the most fundamental processes underlying human genetic variation.
RESUMEN
BACKGROUND: Long-read sequencing (LRS) techniques have been very successful in identifying structural variants (SVs). However, the high error rate of LRS made the detection of small variants (substitutions and short indels < 20 bp) more challenging. The introduction of PacBio HiFi sequencing makes LRS also suited for detecting small variation. Here we evaluate the ability of HiFi reads to detect de novo mutations (DNMs) of all types, which are technically challenging variant types and a major cause of sporadic, severe, early-onset disease. METHODS: We sequenced the genomes of eight parent-child trios using high coverage PacBio HiFi LRS (~ 30-fold coverage) and Illumina short-read sequencing (SRS) (~ 50-fold coverage). De novo substitutions, small indels, short tandem repeats (STRs) and SVs were called in both datasets and compared to each other to assess the accuracy of HiFi LRS. In addition, we determined the parent-of-origin of the small DNMs using phasing. RESULTS: We identified a total of 672 and 859 de novo substitutions/indels, 28 and 126 de novo STRs, and 24 and 1 de novo SVs in LRS and SRS respectively. For the small variants, there was a 92 and 85% concordance between the platforms. For the STRs and SVs, the concordance was 3.6 and 0.8%, and 4 and 100% respectively. We successfully validated 27/54 LRS-unique small variants, of which 11 (41%) were confirmed as true de novo events. For the SRS-unique small variants, we validated 42/133 DNMs and 8 (19%) were confirmed as true de novo event. Validation of 18 LRS-unique de novo STR calls confirmed none of the repeat expansions as true DNM. Confirmation of the 23 LRS-unique SVs was possible for 19 candidate SVs of which 10 (52.6%) were true de novo events. Furthermore, we were able to assign 96% of DNMs to their parental allele with LRS data, as opposed to just 20% with SRS data. CONCLUSIONS: HiFi LRS can now produce the most comprehensive variant dataset obtainable by a single technology in a single laboratory, allowing accurate calling of substitutions, indels, STRs and SVs. The accuracy even allows sensitive calling of DNMs on all variant levels, and also allows for phasing, which helps to distinguish true positive from false positive DNMs.
Asunto(s)
Secuenciación de Nucleótidos de Alto Rendimiento , Mutación INDEL , Humanos , Alelos , Repeticiones de MicrosatéliteRESUMEN
Circular consensus sequencing with Pacific Biosciences (PacBio) technology generates long (10-25 kilobases), accurate 'HiFi' reads by combining serial observations of a DNA molecule into a consensus sequence. The standard approach to consensus generation, pbccs, uses a hidden Markov model. We introduce DeepConsensus, which uses an alignment-based loss to train a gap-aware transformer-encoder for sequence correction. Compared to pbccs, DeepConsensus reduces read errors by 42%. This increases the yield of PacBio HiFi reads at Q20 by 9%, at Q30 by 27% and at Q40 by 90%. With two SMRT Cells of HG003, reads from DeepConsensus improve hifiasm assembly contiguity (NG50 4.9 megabases (Mb) to 17.2 Mb), increase gene completeness (94% to 97%), reduce the false gene duplication rate (1.1% to 0.5%), improve assembly base accuracy (Q43 to Q45) and reduce variant-calling errors by 24%. DeepConsensus models could be trained to the general problem of analyzing the alignment of other types of sequences, such as unique molecular identifiers or genome assemblies.
Asunto(s)
Secuenciación de Nucleótidos de Alto Rendimiento , Análisis de Secuencia de ADNRESUMEN
Long-read HiFi genome sequencing allows for accurate detection and direct phasing of single nucleotide variants, indels, and structural variants. Recent algorithmic development enables simultaneous detection of CpG methylation for analysis of regulatory element activity directly in HiFi reads. We present a comprehensive haplotype resolved 5-base HiFi genome sequencing dataset from a rare disease cohort of 276 samples in 152 families to identify rare (~0.5%) hypermethylation events. We find that 80% of these events are allele-specific and predicted to cause loss of regulatory element activity. We demonstrate heritability of extreme hypermethylation including rare cis variants associated with short (~200 bp) and large hypermethylation events (>1 kb), respectively. We identify repeat expansions in proximal promoters predicting allelic gene silencing via hypermethylation and demonstrate allelic transcriptional events downstream. On average 30-40 rare hypermethylation tiles overlap rare disease genes per patient, providing indications for variation prioritization including a previously undiagnosed pathogenic allele in DIP2B causing global developmental delay. We propose that use of HiFi genome sequencing in unsolved rare disease cases will allow detection of unconventional diseases alleles due to loss of regulatory element activity.
Asunto(s)
Metilación de ADN , Enfermedades Raras , Humanos , Haplotipos , Enfermedades Raras/genética , Metilación de ADN/genética , Análisis de Secuencia de ADN , Secuencia de Bases , Secuenciación de Nucleótidos de Alto Rendimiento , Proteínas del Tejido Nervioso/genéticaRESUMEN
Similar to many insect species, Drosophila melanogaster is capable of maintaining a stable flight trajectory for periods lasting up to several hours.1,2 Because aerodynamic torque is roughly proportional to the fifth power of wing length,3 even small asymmetries in wing size require the maintenance of subtle bilateral differences in flapping motion to maintain a stable path. Flies can even fly straight after losing half of a wing, a feat they accomplish via very large, sustained kinematic changes to both the damaged and intact wings.4 Thus, the neural network responsible for stable flight must be capable of sustaining fine-scaled control over wing motion across a large dynamic range. In this study, we describe an unusual type of descending neuron (DNg02) that projects directly from visual output regions of the brain to the dorsal flight neuropil of the ventral nerve cord. Unlike many descending neurons, which exist as single bilateral pairs with unique morphology, there is a population of at least 15 DNg02 cell pairs with nearly identical shape. By optogenetically activating different numbers of DNg02 cells, we demonstrate that these neurons regulate wingbeat amplitude over a wide dynamic range via a population code. Using two-photon functional imaging, we show that DNg02 cells are responsive to visual motion during flight in a manner that would make them well suited to continuously regulate bilateral changes in wing kinematics. Collectively, we have identified a critical set of descending neurons that provides the sensitivity and dynamic range required for flight control.
Asunto(s)
Drosophila , Vuelo Animal , Animales , Fenómenos Biomecánicos , Drosophila/fisiología , Drosophila melanogaster/fisiología , Vuelo Animal/fisiología , Modelos Biológicos , Neuronas , Alas de Animales/fisiologíaRESUMEN
The repetitive nature and complexity of some medically relevant genes poses a challenge for their accurate analysis in a clinical setting. The Genome in a Bottle Consortium has provided variant benchmark sets, but these exclude nearly 400 medically relevant genes due to their repetitiveness or polymorphic complexity. Here, we characterize 273 of these 395 challenging autosomal genes using a haplotype-resolved whole-genome assembly. This curated benchmark reports over 17,000 single-nucleotide variations, 3,600 insertions and deletions and 200 structural variations each for human genome reference GRCh37 and GRCh38 across HG002. We show that false duplications in either GRCh37 or GRCh38 result in reference-specific, missed variants for short- and long-read technologies in medically relevant genes, including CBS, CRYAA and KCNE1. When masking these false duplications, variant recall can improve from 8% to 100%. Forming benchmarks from a haplotype-resolved whole-genome assembly may become a prototype for future benchmarks covering the whole genome.
Asunto(s)
Genoma Humano , Genoma Humano/genética , Haplotipos/genética , Humanos , Análisis de Secuencia de ADNRESUMEN
Genome in a Bottle benchmarks are widely used to help validate clinical sequencing pipelines and develop variant calling and sequencing methods. Here we use accurate linked and long reads to expand benchmarks in 7 samples to include difficult-to-map regions and segmental duplications that are challenging for short reads. These benchmarks add more than 300,000 SNVs and 50,000 insertions or deletions (indels) and include 16% more exonic variants, many in challenging, clinically relevant genes not covered previously, such as PMS2. For HG002, we include 92% of the autosomal GRCh38 assembly while excluding regions problematic for benchmarking small variants, such as copy number variants, that should not have been in the previous version, which included 85% of GRCh38. It identifies eight times more false negatives in a short read variant call set relative to our previous benchmark. We demonstrate that this benchmark reliably identifies false positives and false negatives across technologies, enabling ongoing methods development.
RESUMEN
OBJECTIVE: We have used long-read single molecule, real-time (SMRT) sequencing to fully characterize a ~12Mb genomic region on chromosome Xq24-q27, significantly linked to bipolar disorder (BD) in an extended family from a genetic sub-isolate. This family segregates BD in at least four generations with 24 affected individuals. METHODS: We selected 16 family members for targeted sequencing. The selected individuals either carried the disease haplotype, were non-carriers of the disease haplotype, or served as married-in controls. We designed hybrid capture probes enriching for 5-9Kb fragments spanning the entire 12Mb region that were then sequenced to screen for candidate structural variants (SVs) that could explain the increased risk for BD in this extended family. RESULTS: Altogether, 201 variants were detected in the critically linked region. Although most of these represented common variants, three variants emerged that showed near-perfect segregation among all BD type I affected individuals. Two of the SVs were identified in or near genes belonging to the RNA Binding Motif Protein, X-Linked (RBMX) gene family-a 330bp Alu (subfamily AluYa5) deletion in intron 3 of the RBMX2 gene and an intergenic 27bp tandem repeat deletion between the RBMX and G protein-coupled receptor 101 (GPR101) genes. The third SV was a 50bp tandem repeat insertion in intron 1 of the Coagulation Factor IX (F9) gene. CONCLUSIONS: Among the three genetically linked SVs, additional evidence supported the Alu element deletion in RBMX2 as the leading candidate for contributing directly to the disease development of BD type I in this extended family.
Asunto(s)
Elementos Alu , Trastorno Bipolar/genética , Genes Ligados a X , Predisposición Genética a la Enfermedad , Femenino , Humanos , Masculino , LinajeRESUMEN
Structural variation (SV) is typically defined as variation within the human genome that exceeds 50 base pairs (bp). SV may be copy number neutral or it may involve duplications, deletions, and complex rearrangements. Recent studies have shown SV to be associated with many human diseases. However, studies of SV have been challenging due to technological constraints. With the advent of third generation (long-read) sequencing technology, exploration of longer stretches of DNA not easily examined previously has been made possible. In the present study, we utilized third generation (long-read) sequencing techniques to examine SV in the EGFR landscape of four haplotypes derived from two human samples. We analyzed the EGFR gene and its landscape (+/- 500,000 base pairs) using this approach and were able to identify a region of non-coding DNA with over 90% similarity to the most common activating EGFR mutation in non-small cell lung cancer. Based on previously published Alu-element genome instability algorithms, we propose a molecular mechanism to explain how this non-coding region of DNA may be interacting with and impacting the stability of the EGFR gene and potentially generating this cancer-driver gene. By these techniques, we were also able to identify previously hidden structural variation in the four haplotypes and in the human reference genome (hg38). We applied previously published algorithms to compare the relative stabilities of these five different EGFR gene landscape haplotypes to estimate their relative potentials to generate the EGFR exon 19, 15 bp canonical deletion. To our knowledge, the present study is the first to use the differences in genomic architecture between targeted cancer-linked phased haplotypes to estimate their relative potentials to form a common cancer-linked driver mutation.
Asunto(s)
Genes erbB-1/genética , Variación Genética , Genoma Humano/genética , Inestabilidad Genómica , Secuenciación de Nucleótidos de Alto Rendimiento , Carcinoma de Pulmón de Células no Pequeñas/genética , Simulación por Computador , Haplotipos , Humanos , Neoplasias Pulmonares/genética , Análisis de Secuencia de ADNRESUMEN
Most human genomes are characterized by aligning individual reads to the reference genome, but accurate long reads and linked reads now enable us to construct accurate, phased de novo assemblies. We focus on a medically important, highly variable, 5 million base-pair (bp) region where diploid assembly is particularly useful - the Major Histocompatibility Complex (MHC). Here, we develop a human genome benchmark derived from a diploid assembly for the openly-consented Genome in a Bottle sample HG002. We assemble a single contig for each haplotype, align them to the reference, call phased small and structural variants, and define a small variant benchmark for the MHC, covering 94% of the MHC and 22368 variants smaller than 50 bp, 49% more variants than a mapping-based benchmark. This benchmark reliably identifies errors in mapping-based callsets, and enables performance assessment in regions with much denser, complex variation than regions covered by previous benchmarks.
Asunto(s)
Diploidia , Complejo Mayor de Histocompatibilidad/genética , Benchmarking , Línea Celular , Variación Genética , Genoma Humano , Haplotipos , HumanosRESUMEN
Dysregulation of alpha-synuclein expression has been implicated in the pathogenesis of synucleinopathies, in particular Parkinson's Disease (PD) and Dementia with Lewy bodies (DLB). Previous studies have shown that the alternatively spliced isoforms of the SNCA gene are differentially expressed in different parts of the brain for PD and DLB patients. Similarly, SNCA isoforms with skipped exons can have a functional impact on the protein domains. The large intronic region of the SNCA gene was also shown to harbor structural variants that affect transcriptional levels. Here, we apply the first study of using long read sequencing with targeted capture of both the gDNA and cDNA of the SNCA gene in brain tissues of PD, DLB, and control samples using the PacBio Sequel system. The targeted full-length cDNA (Iso-Seq) data confirmed complex usage of known alternative start sites and variable 3' UTR lengths, as well as novel 5' starts and 3' ends not previously described. The targeted gDNA data allowed phasing of up to 81% of the ~114 kb SNCA region, with the longest phased block exceeding 54 kb. We demonstrate that long gDNA and cDNA reads have the potential to reveal long-range information not previously accessible using traditional sequencing methods. This approach has a potential impact in studying disease risk genes such as SNCA, providing new insights into the genetic etiologies, including perturbations to the landscape the gene transcripts, of human complex diseases such as synucleinopathies.
RESUMEN
The DNA sequencing technologies in use today produce either highly accurate short reads or less-accurate long reads. We report the optimization of circular consensus sequencing (CCS) to improve the accuracy of single-molecule real-time (SMRT) sequencing (PacBio) and generate highly accurate (99.8%) long high-fidelity (HiFi) reads with an average length of 13.5 kilobases (kb). We applied our approach to sequence the well-characterized human HG002/NA24385 genome and obtained precision and recall rates of at least 99.91% for single-nucleotide variants (SNVs), 95.98% for insertions and deletions <50 bp (indels) and 95.99% for structural variants. Our CCS method matches or exceeds the ability of short-read sequencing to detect small variants and structural variants. We estimate that 2,434 discordances are correctable mistakes in the 'genome in a bottle' (GIAB) benchmark set. Nearly all (99.64%) variants can be phased into haplotypes, further improving variant detection. De novo genome assembly using CCS reads alone produced a contiguous and accurate genome with a contig N50 of >15 megabases (Mb) and concordance of 99.997%, substantially outperforming assembly with less-accurate long reads.
Asunto(s)
ADN Circular/genética , Genoma Humano , Secuenciación de Nucleótidos de Alto Rendimiento/métodos , Análisis de Secuencia de ADN/métodos , Secuencia de Bases , Variación Genética , Haplotipos , HumanosRESUMEN
Animals discriminate stimuli, learn their predictive value and use this knowledge to modify their behavior. In Drosophila, the mushroom body (MB) plays a key role in these processes. Sensory stimuli are sparsely represented by â¼2000 Kenyon cells, which converge onto 34 output neurons (MBONs) of 21 types. We studied the role of MBONs in several associative learning tasks and in sleep regulation, revealing the extent to which information flow is segregated into distinct channels and suggesting possible roles for the multi-layered MBON network. We also show that optogenetic activation of MBONs can, depending on cell type, induce repulsion or attraction in flies. The behavioral effects of MBON perturbation are combinatorial, suggesting that the MBON ensemble collectively represents valence. We propose that local, stimulus-specific dopaminergic modulation selectively alters the balance within the MBON network for those stimuli. Our results suggest that valence encoded by the MBON ensemble biases memory-based action selection.
Asunto(s)
Conducta de Elección , Drosophila melanogaster/citología , Drosophila melanogaster/fisiología , Memoria , Cuerpos Pedunculados/citología , Cuerpos Pedunculados/inervación , Neuronas/fisiología , Animales , Conducta Apetitiva/efectos de la radiación , Aprendizaje por Asociación/efectos de la radiación , Reacción de Prevención/efectos de la radiación , Conducta Animal/efectos de la radiación , Conducta de Elección/efectos de la radiación , Luz , Memoria/efectos de la radiación , Modelos Neurológicos , Cuerpos Pedunculados/efectos de la radiación , Neuronas/efectos de la radiación , Odorantes , Sueño/efectos de la radiación , Factores de Tiempo , Visión OcularRESUMEN
Infection results in the rapid activation of immunity genes in the Drosophila fat body. Two classes of transcription factors have been implicated in this process: the REL-containing proteins, Dorsal, Dif, and Relish, and the GATA factor Serpent. Here we present evidence that REL-GATA synergy plays a pervasive role in the immune response. SELEX assays identified consensus binding sites that permitted the characterization of several immunity regulatory DNAs. The distribution of REL and GATA sites within these DNAs suggests that most or all fat-specific immunity genes contain a common organization of regulatory elements: closely linked REL and GATA binding sites positioned in the same orientation and located near the transcription start site. Aspects of this "regulatory code" are essential for the immune response. These results suggest that immunity regulatory DNAs contain constrained organizational features, which may be a general property of eukaryotic enhancers.