Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 20 de 63
Filtrar
Mais filtros










Base de dados
Intervalo de ano de publicação
1.
Nat Biotechnol ; 2024 Apr 26.
Artigo em Inglês | MEDLINE | ID: mdl-38671154

RESUMO

Tandem repeats (TRs) are highly polymorphic in the human genome, have thousands of associated molecular traits and are linked to over 60 disease phenotypes. However, they are often excluded from at-scale studies because of challenges with variant calling and representation, as well as a lack of a genome-wide standard. Here, to promote the development of TR methods, we created a catalog of TR regions and explored TR properties across 86 haplotype-resolved long-read human assemblies. We curated variants from the Genome in a Bottle (GIAB) HG002 individual to create a TR dataset to benchmark existing and future TR analysis methods. We also present an improved variant comparison method that handles variants greater than 4 bp in length and varying allelic representation. The 8.1% of the genome covered by the TR catalog holds ~24.9% of variants per individual, including 124,728 small and 17,988 large variants for the GIAB HG002 'truth-set' TR benchmark. We demonstrate the utility of this pipeline across short-read and long-read technologies.

2.
medRxiv ; 2024 Mar 07.
Artigo em Inglês | MEDLINE | ID: mdl-38496498

RESUMO

Less than half of individuals with a suspected Mendelian condition receive a precise molecular diagnosis after comprehensive clinical genetic testing. Improvements in data quality and costs have heightened interest in using long-read sequencing (LRS) to streamline clinical genomic testing, but the absence of control datasets for variant filtering and prioritization has made tertiary analysis of LRS data challenging. To address this, the 1000 Genomes Project ONT Sequencing Consortium aims to generate LRS data from at least 800 of the 1000 Genomes Project samples. Our goal is to use LRS to identify a broader spectrum of variation so we may improve our understanding of normal patterns of human variation. Here, we present data from analysis of the first 100 samples, representing all 5 superpopulations and 19 subpopulations. These samples, sequenced to an average depth of coverage of 37x and sequence read N50 of 54 kbp, have high concordance with previous studies for identifying single nucleotide and indel variants outside of homopolymer regions. Using multiple structural variant (SV) callers, we identify an average of 24,543 high-confidence SVs per genome, including shared and private SVs likely to disrupt gene function as well as pathogenic expansions within disease-associated repeats that were not detected using short reads. Evaluation of methylation signatures revealed expected patterns at known imprinted loci, samples with skewed X-inactivation patterns, and novel differentially methylated regions. All raw sequencing data, processed data, and summary statistics are publicly available, providing a valuable resource for the clinical genetics community to discover pathogenic SVs.

3.
bioRxiv ; 2024 Jan 23.
Artigo em Inglês | MEDLINE | ID: mdl-38328152

RESUMO

Tandem repeats are frequent across the human genome, and variation in repeat length has been linked to a variety of traits. Recent improvements in long read sequencing technologies have the potential to greatly improve TR analysis, especially for long or complex repeats. Here we introduce LongTR, which accurately genotypes tandem repeats from high fidelity long reads available from both PacBio and Oxford Nanopore Technologies. LongTR is freely available at https://github.com/gymrek-lab/longtr.

4.
bioRxiv ; 2023 Nov 01.
Artigo em Inglês | MEDLINE | ID: mdl-37961319

RESUMO

Tandem repeats (TRs) are highly polymorphic in the human genome, have thousands of associated molecular traits, and are linked to over 60 disease phenotypes. However, their complexity often excludes them from at-scale studies due to challenges with variant calling, representation, and lack of a genome-wide standard. To promote TR methods development, we create a comprehensive catalog of TR regions and explore its properties across 86 samples. We then curate variants from the GIAB HG002 individual to create a tandem repeat benchmark. We also present a variant comparison method that handles small and large alleles and varying allelic representation. The 8.1% of the genome covered by the TR catalog holds ∼24.9% of variants per individual, including 124,728 small and 17,988 large variants for the GIAB HG002 TR benchmark. We work with the GIAB community to demonstrate the utility of this benchmark across short and long read technologies.

5.
Nat Methods ; 20(8): 1213-1221, 2023 08.
Artigo em Inglês | MEDLINE | ID: mdl-37365340

RESUMO

Advancements in sequencing technologies and assembly methods enable the regular production of high-quality genome assemblies characterizing complex regions. However, challenges remain in efficiently interpreting variation at various scales, from smaller tandem repeats to megabase rearrangements, across many human genomes. We present a PanGenome Research Tool Kit (PGR-TK) enabling analyses of complex pangenome structural and haplotype variation at multiple scales. We apply the graph decomposition methods in PGR-TK to the class II major histocompatibility complex demonstrating the importance of the human pangenome for analyzing complicated regions. Moreover, we investigate the Y-chromosome genes, DAZ1/DAZ2/DAZ3/DAZ4, of which structural variants have been linked to male infertility, and X-chromosome genes OPN1LW and OPN1MW linked to eye disorders. We further showcase PGR-TK across 395 complex repetitive medically important genes. This highlights the power of PGR-TK to resolve complex variation in regions of the genome that were previously too complex to analyze.


Assuntos
Genoma Humano , Genômica , Masculino , Humanos , Complexo Principal de Histocompatibilidade
6.
Nat Rev Genet ; 24(7): 464-483, 2023 Jul.
Artigo em Inglês | MEDLINE | ID: mdl-37059810

RESUMO

Genetic variant calling from DNA sequencing has enabled understanding of germline variation in hundreds of thousands of humans. Sequencing technologies and variant-calling methods have advanced rapidly, routinely providing reliable variant calls in most of the human genome. We describe how advances in long reads, deep learning, de novo assembly and pangenomes have expanded access to variant calls in increasingly challenging, repetitive genomic regions, including medically relevant regions, and how new benchmark sets and benchmarking methods illuminate their strengths and limitations. Finally, we explore the possible future of more complete characterization of human genome variation in light of the recent completion of a telomere-to-telomere human genome reference assembly and human pangenomes, and we consider the innovations needed to benchmark their newly accessible repetitive regions and complex variants.


Assuntos
Benchmarking , Genoma Humano , Humanos , Genômica , Análise de Sequência de DNA , Sequenciamento de Nucleotídeos em Larga Escala
7.
Genome Biol ; 24(1): 31, 2023 02 21.
Artigo em Inglês | MEDLINE | ID: mdl-36810122

RESUMO

The current version of the human reference genome, GRCh38, contains a number of errors including 1.2 Mbp of falsely duplicated and 8.04 Mbp of collapsed regions. These errors impact the variant calling of 33 protein-coding genes, including 12 with medical relevance. Here, we present FixItFelix, an efficient remapping approach, together with a modified version of the GRCh38 reference genome that improves the subsequent analysis across these genes within minutes for an existing alignment file while maintaining the same coordinates. We showcase these improvements over multi-ethnic control samples, demonstrating improvements for population variant calling as well as eQTL studies.


Assuntos
Genoma Humano , Genômica , Humanos , Sequenciamento de Nucleotídeos em Larga Escala , Análise de Sequência de DNA
8.
J Mol Diagn ; 25(1): 3-16, 2023 01.
Artigo em Inglês | MEDLINE | ID: mdl-36244574

RESUMO

In silico approaches for next-generation sequencing (NGS) data modeling have utility in the clinical laboratory as a tool for clinical assay validation. In silico NGS data can take a variety of forms, including pure simulated data or manipulated data files in which variants are inserted into existing data files. In silico data enable simulation of a range of variants that may be difficult to obtain from a single physical sample. Such data allow laboratories to more accurately test the performance of clinical bioinformatics pipelines without sequencing additional cases. For example, clinical laboratories may use in silico data to simulate low variant allele fraction variants to test the analytical sensitivity of variant calling software or simulate a range of insertion/deletion sizes to determine the performance of insertion/deletion calling software. In this article, the Working Group reviews the different types of in silico data with their strengths and limitations, methods to generate in silico data, and how data can be used in the clinical molecular diagnostic laboratory. Survey data indicate how in silico NGS data are currently being used. Finally, potential applications for which in silico data may become useful in the future are presented.


Assuntos
Patologistas , Patologia Molecular , Humanos , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Biologia Computacional/métodos , Software
9.
Cell Genom ; 2(5)2022 May.
Artigo em Inglês | MEDLINE | ID: mdl-36452119

RESUMO

Genome in a Bottle benchmarks are widely used to help validate clinical sequencing pipelines and develop variant calling and sequencing methods. Here we use accurate linked and long reads to expand benchmarks in 7 samples to include difficult-to-map regions and segmental duplications that are challenging for short reads. These benchmarks add more than 300,000 SNVs and 50,000 insertions or deletions (indels) and include 16% more exonic variants, many in challenging, clinically relevant genes not covered previously, such as PMS2. For HG002, we include 92% of the autosomal GRCh38 assembly while excluding regions problematic for benchmarking small variants, such as copy number variants, that should not have been in the previous version, which included 85% of GRCh38. It identifies eight times more false negatives in a short read variant call set relative to our previous benchmark. We demonstrate that this benchmark reliably identifies false positives and false negatives across technologies, enabling ongoing methods development.

10.
Nat Methods ; 19(6): 687-695, 2022 06.
Artigo em Inglês | MEDLINE | ID: mdl-35361931

RESUMO

Advances in long-read sequencing technologies and genome assembly methods have enabled the recent completion of the first telomere-to-telomere human genome assembly, which resolves complex segmental duplications and large tandem repeats, including centromeric satellite arrays in a complete hydatidiform mole (CHM13). Although derived from highly accurate sequences, evaluation revealed evidence of small errors and structural misassemblies in the initial draft assembly. To correct these errors, we designed a new repeat-aware polishing strategy that made accurate assembly corrections in large repeats without overcorrection, ultimately fixing 51% of the existing errors and improving the assembly quality value from 70.2 to 73.9 measured from PacBio high-fidelity and Illumina k-mers. By comparing our results to standard automated polishing tools, we outline common polishing errors and offer practical suggestions for genome projects with limited resources. We also show how sequencing biases in both high-fidelity and Oxford Nanopore Technologies reads cause signature assembly errors that can be corrected with a diverse panel of sequencing technologies.


Assuntos
Sequenciamento de Nucleotídeos em Larga Escala , Nanoporos , Feminino , Genoma Humano , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Humanos , Gravidez , Análise de Sequência de DNA/métodos , Telômero/genética
11.
Science ; 376(6588): eabl3533, 2022 04.
Artigo em Inglês | MEDLINE | ID: mdl-35357935

RESUMO

Compared to its predecessors, the Telomere-to-Telomere CHM13 genome adds nearly 200 million base pairs of sequence, corrects thousands of structural errors, and unlocks the most complex regions of the human genome for clinical and functional study. We show how this reference universally improves read mapping and variant calling for 3202 and 17 globally diverse samples sequenced with short and long reads, respectively. We identify hundreds of thousands of variants per sample in previously unresolved regions, showcasing the promise of the T2T-CHM13 reference for evolutionary and biomedical discovery. Simultaneously, this reference eliminates tens of thousands of spurious variants per sample, including reduction of false positives in 269 medically relevant genes by up to a factor of 12. Because of these improvements in variant discovery coupled with population and functional genomic resources, T2T-CHM13 is positioned to replace GRCh38 as the prevailing reference for human genetics.


Assuntos
Variação Genética , Genoma Humano , Genômica/normas , Análise de Sequência de DNA/normas , Humanos , Padrões de Referência
12.
Nat Biotechnol ; 40(5): 672-680, 2022 05.
Artigo em Inglês | MEDLINE | ID: mdl-35132260

RESUMO

The repetitive nature and complexity of some medically relevant genes poses a challenge for their accurate analysis in a clinical setting. The Genome in a Bottle Consortium has provided variant benchmark sets, but these exclude nearly 400 medically relevant genes due to their repetitiveness or polymorphic complexity. Here, we characterize 273 of these 395 challenging autosomal genes using a haplotype-resolved whole-genome assembly. This curated benchmark reports over 17,000 single-nucleotide variations, 3,600 insertions and deletions and 200 structural variations each for human genome reference GRCh37 and GRCh38 across HG002. We show that false duplications in either GRCh37 or GRCh38 result in reference-specific, missed variants for short- and long-read technologies in medically relevant genes, including CBS, CRYAA and KCNE1. When masking these false duplications, variant recall can improve from 8% to 100%. Forming benchmarks from a haplotype-resolved whole-genome assembly may become a prototype for future benchmarks covering the whole genome.


Assuntos
Genoma Humano , Genoma Humano/genética , Haplótipos/genética , Humanos , Análise de Sequência de DNA
14.
Nat Biotechnol ; 39(9): 1129-1140, 2021 09.
Artigo em Inglês | MEDLINE | ID: mdl-34504351

RESUMO

Assessing the reproducibility, accuracy and utility of massively parallel DNA sequencing platforms remains an ongoing challenge. Here the Association of Biomolecular Resource Facilities (ABRF) Next-Generation Sequencing Study benchmarks the performance of a set of sequencing instruments (HiSeq/NovaSeq/paired-end 2 × 250-bp chemistry, Ion S5/Proton, PacBio circular consensus sequencing (CCS), Oxford Nanopore Technologies PromethION/MinION, BGISEQ-500/MGISEQ-2000 and GS111) on human and bacterial reference DNA samples. Among short-read instruments, HiSeq 4000 and X10 provided the most consistent, highest genome coverage, while BGI/MGISEQ provided the lowest sequencing error rates. The long-read instrument PacBio CCS had the highest reference-based mapping rate and lowest non-mapping rate. The two long-read platforms PacBio CCS and PromethION/MinION showed the best sequence mapping in repeat-rich areas and across homopolymers. NovaSeq 6000 using 2 × 250-bp read chemistry was the most robust instrument for capturing known insertion/deletion events. This study serves as a benchmark for current genomics technologies, as well as a resource to inform experimental design and next-generation sequencing variant calling.


Assuntos
Sequenciamento de Nucleotídeos em Larga Escala/métodos , Sequenciamento de Nucleotídeos em Larga Escala/normas , Análise de Sequência de DNA/métodos , Análise de Sequência de DNA/normas , Pareamento Incorreto de Bases , Benchmarking , DNA/genética , DNA Bacteriano/genética , Genoma Bacteriano , Genoma Humano , Humanos
16.
Genet Med ; 23(9): 1673-1680, 2021 09.
Artigo em Inglês | MEDLINE | ID: mdl-34007000

RESUMO

PURPOSE: To evaluate the impact of technically challenging variants on the implementation, validation, and diagnostic yield of commonly used clinical genetic tests. Such variants include large indels, small copy-number variants (CNVs), complex alterations, and variants in low-complexity or segmentally duplicated regions. METHODS: An interlaboratory pilot study used synthetic specimens to assess detection of challenging variant types by various next-generation sequencing (NGS)-based workflows. One well-performing workflow was further validated and used in clinician-ordered testing of more than 450,000 patients. RESULTS: In the interlaboratory study, only 2 of 13 challenging variants were detected by all 10 workflows, and just 3 workflows detected all 13. Limitations were also observed among 11 less-challenging indels. In clinical testing, 21.6% of patients carried one or more pathogenic variants, of which 13.8% (17,561) were classified as technically challenging. These variants were of diverse types, affecting 556 of 1,217 genes across hereditary cancer, cardiovascular, neurological, pediatric, reproductive carrier screening, and other indicated tests. CONCLUSION: The analytic and clinical sensitivity of NGS workflows can vary considerably, particularly for prevalent, technically challenging variants. This can have important implications for the design and validation of tests (by laboratories) and the selection of tests (by clinicians) for a wide range of clinical indications.


Assuntos
Testes Genéticos , Sequenciamento de Nucleotídeos em Larga Escala , Criança , Variações do Número de Cópias de DNA/genética , Humanos , Mutação INDEL/genética , Projetos Piloto
17.
Nat Biotechnol ; 39(3): 309-312, 2021 03.
Artigo em Inglês | MEDLINE | ID: mdl-33288905

RESUMO

Haplotype-resolved or phased genome assembly provides a complete picture of genomes and their complex genetic variations. However, current algorithms for phased assembly either do not generate chromosome-scale phasing or require pedigree information, which limits their application. We present a method named diploid assembly (DipAsm) that uses long, accurate reads and long-range conformation data for single individuals to generate a chromosome-scale phased assembly within 1 day. Applied to four public human genomes, PGP1, HG002, NA12878 and HG00733, DipAsm produced haplotype-resolved assemblies with minimum contig length needed to cover 50% of the known genome (NG50) up to 25 Mb and phased ~99.5% of heterozygous sites at 98-99% accuracy, outperforming other approaches in terms of both contiguity and phasing completeness. We demonstrate the importance of chromosome-scale phased assemblies for the discovery of structural variants (SVs), including thousands of new transposon insertions, and of highly polymorphic and medically important regions such as the human leukocyte antigen (HLA) and killer cell immunoglobulin-like receptor (KIR) regions. DipAsm will facilitate high-quality precision medicine and studies of individual haplotype variation and population diversity.


Assuntos
Cromossomos Humanos , Genoma Humano , Haplótipos , Algoritmos , Heterozigoto , Humanos , Polimorfismo de Nucleotídeo Único
18.
Nat Commun ; 11(1): 4794, 2020 09 22.
Artigo em Inglês | MEDLINE | ID: mdl-32963235

RESUMO

Most human genomes are characterized by aligning individual reads to the reference genome, but accurate long reads and linked reads now enable us to construct accurate, phased de novo assemblies. We focus on a medically important, highly variable, 5 million base-pair (bp) region where diploid assembly is particularly useful - the Major Histocompatibility Complex (MHC). Here, we develop a human genome benchmark derived from a diploid assembly for the openly-consented Genome in a Bottle sample HG002. We assemble a single contig for each haplotype, align them to the reference, call phased small and structural variants, and define a small variant benchmark for the MHC, covering 94% of the MHC and 22368 variants smaller than 50 bp, 49% more variants than a mapping-based benchmark. This benchmark reliably identifies errors in mapping-based callsets, and enables performance assessment in regions with much denser, complex variation than regions covered by previous benchmarks.


Assuntos
Diploide , Complexo Principal de Histocompatibilidade/genética , Benchmarking , Linhagem Celular , Variação Genética , Genoma Humano , Haplótipos , Humanos
20.
Nat Biotechnol ; 38(9): 1044-1053, 2020 09.
Artigo em Inglês | MEDLINE | ID: mdl-32686750

RESUMO

De novo assembly of a human genome using nanopore long-read sequences has been reported, but it used more than 150,000 CPU hours and weeks of wall-clock time. To enable rapid human genome assembly, we present Shasta, a de novo long-read assembler, and polishing algorithms named MarginPolish and HELEN. Using a single PromethION nanopore sequencer and our toolkit, we assembled 11 highly contiguous human genomes de novo in 9 d. We achieved roughly 63× coverage, 42-kb read N50 values and 6.5× coverage in reads >100 kb using three flow cells per sample. Shasta produced a complete haploid human genome assembly in under 6 h on a single commercial compute node. MarginPolish and HELEN polished haploid assemblies to more than 99.9% identity (Phred quality score QV = 30) with nanopore reads alone. Addition of proximity-ligation sequencing enabled near chromosome-level scaffolds for all 11 genomes. We compare our assembly performance to existing methods for diploid, haploid and trio-binned human samples and report superior accuracy and speed.


Assuntos
Genoma Humano/genética , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Sequenciamento por Nanoporos , Análise de Sequência de DNA/métodos , Algoritmos , Benchmarking , Cromossomos Humanos/genética , Aprendizado Profundo , Genômica , Antígenos HLA/genética , Haploidia , Sequenciamento de Nucleotídeos em Larga Escala/normas , Humanos , Análise de Sequência de DNA/normas
SELEÇÃO DE REFERÊNCIAS
DETALHE DA PESQUISA
...