RESUMO
Long-read sequencing has the potential to transform variant detection by reaching currently difficult-to-map regions and routinely linking together adjacent variations to enable read-based phasing. Third-generation nanopore sequence data have demonstrated a long read length, but current interpretation methods for their novel pore-based signal have unique error profiles, making accurate analysis challenging. Here, we introduce a haplotype-aware variant calling pipeline, PEPPER-Margin-DeepVariant, that produces state-of-the-art variant calling results with nanopore data. We show that our nanopore-based method outperforms the short-read-based single-nucleotide-variant identification method at the whole-genome scale and produces high-quality single-nucleotide variants in segmental duplications and low-mappability regions where short-read-based genotyping fails. We show that our pipeline can provide highly contiguous phase blocks across the genome with nanopore reads, contiguously spanning between 85% and 92% of annotated genes across six samples. We also extend PEPPER-Margin-DeepVariant to PacBio HiFi data, providing an efficient solution with superior performance over the current WhatsHap-DeepVariant standard. Finally, we demonstrate de novo assembly polishing methods that use nanopore and PacBio HiFi reads to produce diploid assemblies with high accuracy (Q35+ nanopore-polished and Q40+ PacBio HiFi-polished).
Assuntos
Genes , Haplótipos , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Nanoporos , Polimorfismo de Nucleotídeo Único , Análise de Sequência de DNA/métodos , Software , Genoma Humano , Humanos , Anotação de Sequência MolecularRESUMO
SUMMARY: Ribbon is an alignment visualization tool that shows how alignments are positioned within both the reference and read contexts, giving an intuitive view that enables a better understanding of structural variants and the read evidence supporting them. Ribbon was born out of a need to curate complex structural variant calls and determine whether each was well supported by long-read evidence, and it uses the same intuitive visualization method to shed light on contig alignments from genome-to-genome comparisons. AVAILABILITY AND IMPLEMENTATION: Ribbon is freely available online at http://genomeribbon.com/ and is open-source at https://github.com/marianattestad/ribbon. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Assuntos
Genômica , Software , GenomaRESUMO
The SK-BR-3 cell line is one of the most important models for HER2+ breast cancers, which affect one in five breast cancer patients. SK-BR-3 is known to be highly rearranged, although much of the variation is in complex and repetitive regions that may be underreported. Addressing this, we sequenced SK-BR-3 using long-read single molecule sequencing from Pacific Biosciences and develop one of the most detailed maps of structural variations (SVs) in a cancer genome available, with nearly 20,000 variants present, most of which were missed by short-read sequencing. Surrounding the important ERBB2 oncogene (also known as HER2), we discover a complex sequence of nested duplications and translocations, suggesting a punctuated progression. Full-length transcriptome sequencing further revealed several novel gene fusions within the nested genomic variants. Combining long-read genome and transcriptome sequencing enables an in-depth analysis of how SVs disrupt the genome and sheds new light on the complex mechanisms involved in cancer genome evolution.
Assuntos
Neoplasias da Mama/genética , Amplificação de Genes/genética , Rearranjo Gênico/genética , Oncogenes/genética , Neoplasias da Mama/patologia , Feminino , Genoma Humano , Variação Estrutural do Genoma , Sequenciamento de Nucleotídeos em Larga Escala , Humanos , Células MCF-7 , Receptor ErbB-2/genética , Sequências Repetitivas de Ácido Nucleico/genética , Transcriptoma/genéticaRESUMO
Structural variations are the greatest source of genetic variation, but they remain poorly understood because of technological limitations. Single-molecule long-read sequencing has the potential to dramatically advance the field, although high error rates are a challenge with existing methods. Addressing this need, we introduce open-source methods for long-read alignment (NGMLR; https://github.com/philres/ngmlr ) and structural variant identification (Sniffles; https://github.com/fritzsedlazeck/Sniffles ) that provide unprecedented sensitivity and precision for variant detection, even in repeat-rich regions and for complex nested events that can have substantial effects on human health. In several long-read datasets, including healthy and cancerous human genomes, we discovered thousands of novel variants and categorized systematic errors in short-read approaches. NGMLR and Sniffles can automatically filter false events and operate on low-coverage data, thereby reducing the high costs that have hindered the application of long reads in clinical and research settings.
Assuntos
Análise Mutacional de DNA/métodos , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Análise de Sequência de DNA/métodos , Genoma Humano , Genômica/métodos , HumanosRESUMO
BACKGROUND: Genome-wide association studies (GWAS) are typically visualized using a two-dimensional Manhattan plot, displaying chromosomal location of SNPs along the x-axis and the negative log-10 of their p-value on the y-axis. This traditional plot provides a broad overview of the results, but offers little opportunity for interaction or expansion of specific regions, and is unable to show additional dimensions of the dataset. RESULTS: We created BigTop, a visualization framework in virtual reality (VR), designed to render a Manhattan plot in three dimensions, wrapping the graph around the user in a simulated cylindrical room. BigTop uses the z-axis to display minor allele frequency of each SNP, allowing for the identification of allelic variants of genes. BigTop also offers additional interactivity, allowing users to select any individual SNP and receive expanded information, including SNP name, exact values, and gene location, if applicable. BigTop is built in JavaScript using the React and A-Frame frameworks, and can be rendered using commercially available VR headsets or in a two-dimensional web browser such as Google Chrome. Data is read into BigTop in JSON format, and can be provided as either JSON or a tab-separated text file. CONCLUSIONS: Using additional dimensions and interactivity options offered through VR, we provide a new, interactive, three-dimensional representation of the traditional Manhattan plot for displaying and exploring GWAS data.
Assuntos
Biologia Computacional/métodos , Realidade Virtual , Genoma , Estudo de Associação Genômica Ampla , Humanos , Polimorfismo de Nucleotídeo Único , Software , Interface Usuário-ComputadorAssuntos
Cuidados Críticos , Sequenciamento por Nanoporos/métodos , Transtornos do Neurodesenvolvimento/diagnóstico , Adolescente , Pré-Escolar , Feminino , Humanos , Lactente , Recém-Nascido , Masculino , Pessoa de Meia-Idade , Mutação , Sequenciamento por Nanoporos/economia , Transtornos do Neurodesenvolvimento/genética , Análise de Sequência de DNA/métodos , Estado Epiléptico/genéticaRESUMO
While genome assembly projects have been successful in many haploid and inbred species, the assembly of noninbred or rearranged heterozygous genomes remains a major challenge. To address this challenge, we introduce the open-source FALCON and FALCON-Unzip algorithms (https://github.com/PacificBiosciences/FALCON/) to assemble long-read sequencing data into highly accurate, contiguous, and correctly phased diploid genomes. We generate new reference sequences for heterozygous samples including an F1 hybrid of Arabidopsis thaliana, the widely cultivated Vitis vinifera cv. Cabernet Sauvignon, and the coral fungus Clavicorona pyxidata, samples that have challenged short-read assembly approaches. The FALCON-based assemblies are substantially more contiguous and complete than alternate short- or long-read approaches. The phased diploid assembly enabled the study of haplotype structure and heterozygosities between homologous chromosomes, including the identification of widespread heterozygous structural variation within coding sequences.
Assuntos
Diploide , Genoma Fúngico/genética , Genoma de Planta/genética , Genômica/métodos , Polimorfismo de Nucleotídeo Único/genética , Algoritmos , Arabidopsis/genética , Basidiomycota/genética , DNA Fúngico/genética , DNA de Plantas/genética , Haplótipos , Heterozigoto , Humanos , Análise de Sequência de DNA , Vitis/genéticaRESUMO
SUMMARY: GenomeScope is an open-source web tool to rapidly estimate the overall characteristics of a genome, including genome size, heterozygosity rate and repeat content from unprocessed short reads. These features are essential for studying genome evolution, and help to choose parameters for downstream analysis. We demonstrate its accuracy on 324 simulated and 16 real datasets with a wide range in genome sizes, heterozygosity levels and error rates. AVAILABILITY AND IMPLEMENTATION: http://genomescope.org , https://github.com/schatzlab/genomescope.git . CONTACT: mschatz@jhu.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Assuntos
Biologia Computacional/métodos , Análise de Sequência de DNA/métodos , Software , Animais , Evolução Molecular , Genoma , Tamanho do Genoma , Heterozigoto , InternetRESUMO
UNLABELLED: Assemblytics is a web app for detecting and analyzing variants from a de novo genome assembly aligned to a reference genome. It incorporates a unique anchor filtering approach to increase robustness to repetitive elements, and identifies six classes of variants based on their distinct alignment signatures. Assemblytics can be applied both to comparing aberrant genomes, such as human cancers, to a reference, or to identify differences between related species. Multiple interactive visualizations enable in-depth explorations of the genomic distributions of variants. AVAILABILITY AND IMPLEMENTATION: http://assemblytics.com, https://github.com/marianattestad/assemblytics CONTACT: mnattest@cshl.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Assuntos
Genoma , Genômica , Variação Genética , Humanos , SoftwareRESUMO
Accurate genome assemblies are essential for biological research, but even the highest quality assemblies retain errors caused by the technologies used to construct them. Base-level errors are typically fixed with an additional polishing step that uses reads aligned to the draft assembly to identify necessary edits. However, current methods struggle to find a balance between over-and under-polishing. Here, we present an encoder-only transformer model for assembly polishing called DeepPolisher, which predicts corrections to the underlying sequence using Pacbio HiFi read alignments to a diploid assembly. Our pipeline introduces a method, PHARAOH (Phasing Reads in Areas Of Homozygosity), which uses ultra-long ONT data to ensure alignments are accurately phased and to correctly introduce heterozygous edits in falsely homozygous regions. We demonstrate that the DeepPolisher pipeline can reduce assembly errors by half, with a greater than 70% reduction in indel errors. We have applied our DeepPolisher-based pipeline to 180 assemblies from the next Human Pangenome Reference Consortium (HPRC) data release, producing an average predicted Quality Value (QV) improvement of 3.4 (54% error reduction) for the majority of the genome.
RESUMO
Long-read sequencing technology has enabled variant detection in difficult-to-map regions of the genome and enabled rapid genetic diagnosis in clinical settings. Rapidly evolving third-generation sequencing platforms like Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) are introducing newer platforms and data types. It has been demonstrated that variant calling methods based on deep neural networks can use local haplotyping information with long-reads to improve the genotyping accuracy. However, using local haplotype information creates an overhead as variant calling needs to be performed multiple times which ultimately makes it difficult to extend to new data types and platforms as they get introduced. In this work, we have developed a local haplotype approximate method that enables state-of-the-art variant calling performance with multiple sequencing platforms including PacBio Revio system, ONT R10.4 simplex and duplex data. This addition of local haplotype approximation simplifies long-read variant calling with DeepVariant.
Assuntos
Haplótipos , Sequenciamento de Nucleotídeos em Larga Escala , Haplótipos/genética , Humanos , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Análise de Sequência de DNA/métodos , Polimorfismo de Nucleotídeo Único , Genoma Humano , Algoritmos , Variação Genética , Redes Neurais de ComputaçãoRESUMO
Long-read sequencing technology has enabled variant detection in difficult-to-map regions of the genome and enabled rapid genetic diagnosis in clinical settings. Rapidly evolving third-generation sequencing platforms like Pacific Biosciences (PacBio) and Oxford nanopore technologies (ONT) are introducing newer platforms and data types. It has been demonstrated that variant calling methods based on deep neural networks can use local haplotyping information with long-reads to improve the genotyping accuracy. However, using local haplotype information creates an overhead as variant calling needs to be performed multiple times which ultimately makes it difficult to extend to new data types and platforms as they get introduced. In this work, we have developed a local haplotype approximate method that enables state-of-the-art variant calling performance with multiple sequencing platforms including PacBio Revio system, ONT R10.4 simplex and duplex data. This addition of local haplotype approximation makes DeepVariant a universal variant calling solution for long-read sequencing platforms.
RESUMO
Circular consensus sequencing with Pacific Biosciences (PacBio) technology generates long (10-25 kilobases), accurate 'HiFi' reads by combining serial observations of a DNA molecule into a consensus sequence. The standard approach to consensus generation, pbccs, uses a hidden Markov model. We introduce DeepConsensus, which uses an alignment-based loss to train a gap-aware transformer-encoder for sequence correction. Compared to pbccs, DeepConsensus reduces read errors by 42%. This increases the yield of PacBio HiFi reads at Q20 by 9%, at Q30 by 27% and at Q40 by 90%. With two SMRT Cells of HG003, reads from DeepConsensus improve hifiasm assembly contiguity (NG50 4.9 megabases (Mb) to 17.2 Mb), increase gene completeness (94% to 97%), reduce the false gene duplication rate (1.1% to 0.5%), improve assembly base accuracy (Q43 to Q45) and reduce variant-calling errors by 24%. DeepConsensus models could be trained to the general problem of analyzing the alignment of other types of sequences, such as unique molecular identifiers or genome assemblies.
Assuntos
Sequenciamento de Nucleotídeos em Larga Escala , Análise de Sequência de DNARESUMO
Whole-genome sequencing (WGS) can identify variants that cause genetic disease, but the time required for sequencing and analysis has been a barrier to its use in acutely ill patients. In the present study, we develop an approach for ultra-rapid nanopore WGS that combines an optimized sample preparation protocol, distributing sequencing over 48 flow cells, near real-time base calling and alignment, accelerated variant calling and fast variant filtration for efficient manual review. Application to two example clinical cases identified a candidate variant in <8 h from sample preparation to variant identification. We show that this framework provides accurate variant calls and efficient prioritization, and accelerates diagnostic clinical genome sequencing twofold compared with previous approaches.
Assuntos
Sequenciamento por Nanoporos , Nanoporos , Mapeamento Cromossômico , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Humanos , Sequenciamento Completo do Genoma/métodosRESUMO
The precisionFDA Truth Challenge V2 aimed to assess the state of the art of variant calling in challenging genomic regions. Starting with FASTQs, 20 challenge participants applied their variant-calling pipelines and submitted 64 variant call sets for one or more sequencing technologies (Illumina, PacBio HiFi, and Oxford Nanopore Technologies). Submissions were evaluated following best practices for benchmarking small variants with updated Genome in a Bottle benchmark sets and genome stratifications. Challenge submissions included numerous innovative methods, with graph-based and machine learning methods scoring best for short-read and long-read datasets, respectively. With machine learning approaches, combining multiple sequencing technologies performed particularly well. Recent developments in sequencing and variant calling have enabled benchmarking variants in challenging genomic regions, paving the way for the identification of previously unknown clinically relevant variants.
RESUMO
The application of next-generation sequencing to estimate genetic diversity of Plasmodium falciparum, the most lethal malaria parasite, has proved challenging due to the skewed AT-richness [â¼80.6% (A + T)] of its genome and the lack of technology to assemble highly polymorphic subtelomeric regions that contain clonally variant, multigene virulence families (Ex: var and rifin). To address this, we performed amplification-free, single molecule, real-time sequencing of P. falciparum genomic DNA and generated reads of average length 12 kb, with 50% of the reads between 15.5 and 50 kb in length. Next, using the Hierarchical Genome Assembly Process, we assembled the P. falciparum genome de novo and successfully compiled all 14 nuclear chromosomes telomere-to-telomere. We also accurately resolved centromeres [â¼90-99% (A + T)] and subtelomeric regions and identified large insertions and duplications that add extra var and rifin genes to the genome, along with smaller structural variants such as homopolymer tract expansions. Overall, we show that amplification-free, long-read sequencing combined with de novo assembly overcomes major challenges inherent to studying the P. falciparum genome. Indeed, this technology may not only identify the polymorphic and repetitive subtelomeric sequences of parasite populations from endemic areas but may also evaluate structural variation linked to virulence, drug resistance and disease transmission.
Assuntos
Genoma de Protozoário , Plasmodium falciparum/genética , Telômero/genética , Mapeamento de Sequências Contíguas , Polimorfismo Genético , Análise de Sequência de DNARESUMO
The methylotrophic yeast, Pichia pastoris, has been genetically engineered to produce many heterologous proteins for industrial and research purposes. In order to secrete proteins for easier purification from the extracellular medium, the coding sequence of recombinant proteins is initially fused to the Saccharomyces cerevisiae α-mating factor secretion signal leader. Extensive site-directed mutagenesis of the prepro-region of the α-mating factor secretion signal sequence was performed in order to determine the effects of various deletions and substitutions on expression. Though some mutations clearly dampened protein expression, deletion of amino acids 57-70, corresponding to the predicted 3rd alpha helix of α-mating factor secretion signal, increased secretion of reporter proteins horseradish peroxidase and lipase at least 50% in small-scale cultures. These findings raise the possibility that the secretory efficiency of the leader can be further enhanced in the future.
Assuntos
Regulação Fúngica da Expressão Gênica , Mutação , Peptídeos/metabolismo , Pichia/genética , Proteínas Recombinantes/biossíntese , Sequência de Aminoácidos , Western Blotting , Deleção de Genes , Genes Reporter , Peroxidase do Rábano Silvestre/genética , Peroxidase do Rábano Silvestre/metabolismo , Lipase/genética , Lipase/metabolismo , Fator de Acasalamento , Dados de Sequência Molecular , Mutagênese Sítio-Dirigida , Peptídeos/genética , Pichia/metabolismo , Plasmídeos , Reação em Cadeia da Polimerase em Tempo Real , Proteínas Recombinantes/genética , Saccharomyces cerevisiae/genética , Saccharomyces cerevisiae/metabolismoRESUMO
Pichia pastoris is a methylotrophic yeast that has been genetically engineered to express over one thousand heterologous proteins valued for industrial, pharmaceutical and basic research purposes. In most cases, the 5' untranslated region (UTR) of the alcohol oxidase 1 (AOX1) gene is fused to the coding sequence of the recombinant gene for protein expression in this yeast. Because the effect of the AOX1 5'UTR on protein expression is not known, site-directed mutagenesis was performed in order to decrease or increase the length of this region. Both of these types of changes were shown to affect translational efficiency, not transcript stability. While increasing the length of the 5'UTR clearly decreased expression of a ß-galactosidase reporter in a proportional manner, a deletion analysis demonstrated that the AOX1 5'UTR contains a complex mixture of both positive and negative cis-acting elements, suggesting that the construction of a synthetic 5'UTR optimized for a higher level of expression may be challenging.