RESUMO
Genetic variant calling from DNA sequencing has enabled understanding of germline variation in hundreds of thousands of humans. Sequencing technologies and variant-calling methods have advanced rapidly, routinely providing reliable variant calls in most of the human genome. We describe how advances in long reads, deep learning, de novo assembly and pangenomes have expanded access to variant calls in increasingly challenging, repetitive genomic regions, including medically relevant regions, and how new benchmark sets and benchmarking methods illuminate their strengths and limitations. Finally, we explore the possible future of more complete characterization of human genome variation in light of the recent completion of a telomere-to-telomere human genome reference assembly and human pangenomes, and we consider the innovations needed to benchmark their newly accessible repetitive regions and complex variants.
Assuntos
Benchmarking , Genoma Humano , Humanos , Genômica , Análise de Sequência de DNA , Sequenciamento de Nucleotídeos em Larga EscalaRESUMO
The current human reference genome, GRCh38, represents over 20 years of effort to generate a high-quality assembly, which has benefitted society1,2. However, it still has many gaps and errors, and does not represent a biological genome as it is a blend of multiple individuals3,4. Recently, a high-quality telomere-to-telomere reference, CHM13, was generated with the latest long-read technologies, but it was derived from a hydatidiform mole cell line with a nearly homozygous genome5. To address these limitations, the Human Pangenome Reference Consortium formed with the goal of creating high-quality, cost-effective, diploid genome assemblies for a pangenome reference that represents human genetic diversity6. Here, in our first scientific report, we determined which combination of current genome sequencing and assembly approaches yield the most complete and accurate diploid genome assembly with minimal manual curation. Approaches that used highly accurate long reads and parent-child data with graph-based haplotype phasing during assembly outperformed those that did not. Developing a combination of the top-performing methods, we generated our first high-quality diploid reference assembly, containing only approximately four gaps per chromosome on average, with most chromosomes within ±1% of the length of CHM13. Nearly 48% of protein-coding genes have non-synonymous amino acid changes between haplotypes, and centromeric regions showed the highest diversity. Our findings serve as a foundation for assembling near-complete diploid human genomes at scale for a pangenome reference to capture global genetic variation from single nucleotides to structural rearrangements.
Assuntos
Mapeamento Cromossômico , Diploide , Genoma Humano , Genômica , Humanos , Mapeamento Cromossômico/normas , Genoma Humano/genética , Haplótipos/genética , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Sequenciamento de Nucleotídeos em Larga Escala/normas , Análise de Sequência de DNA/métodos , Análise de Sequência de DNA/normas , Padrões de Referência , Genômica/métodos , Genômica/normas , Cromossomos Humanos/genética , Variação Genética/genéticaRESUMO
Fewer than half of individuals with a suspected Mendelian or monogenic condition receive a precise molecular diagnosis after comprehensive clinical genetic testing. Improvements in data quality and costs have heightened interest in using long-read sequencing (LRS) to streamline clinical genomic testing, but the absence of control datasets for variant filtering and prioritization has made tertiary analysis of LRS data challenging. To address this, the 1000 Genomes Project ONT Sequencing Consortium aims to generate LRS data from at least 800 of the 1000 Genomes Project samples. Our goal is to use LRS to identify a broader spectrum of variation so we may improve our understanding of normal patterns of human variation. Here, we present data from analysis of the first 100 samples, representing all 5 superpopulations and 19 subpopulations. These samples, sequenced to an average depth of coverage of 37x and sequence read N50 of 54 kbp, have high concordance with previous studies for identifying single nucleotide and indel variants outside of homopolymer regions. Using multiple structural variant (SV) callers, we identify an average of 24,543 high-confidence SVs per genome, including shared and private SVs likely to disrupt gene function as well as pathogenic expansions within disease-associated repeats that were not detected using short reads. Evaluation of methylation signatures revealed expected patterns at known imprinted loci, samples with skewed X-inactivation patterns, and novel differentially methylated regions. All raw sequencing data, processed data, and summary statistics are publicly available, providing a valuable resource for the clinical genetics community to discover pathogenic SVs.
RESUMO
In response to the COVID-19 pandemic, the National Institute of Standards and Technology released a synthetic RNA material for SARS-CoV-2 in June 2020. The goal was to rapidly produce a material to support molecular diagnostic testing applications. This material, referred to as Research Grade Test Material 10169, was shipped free of charge to laboratories across the globe to provide a non-hazardous material for assay development and assay calibration. The material consisted of two unique regions of the SARS-CoV-2 genome approximately 4 kb nucleotides in length. The concentration of each synthetic fragment was measured using RT-dPCR methods and confirmed to be compatible with RT-qPCR methods. In this report, the preparation, stability, and limitations of this material are described.
Assuntos
COVID-19 , SARS-CoV-2 , Humanos , SARS-CoV-2/genética , COVID-19/diagnóstico , Pandemias , Técnicas de Diagnóstico Molecular/métodos , RNA Viral/genética , Sensibilidade e Especificidade , Teste para COVID-19RESUMO
Allostery is a fundamental biophysical mechanism that underlies cellular sensing, signaling, and metabolism. Yet a quantitative understanding of allosteric genotype-phenotype relationships remains elusive. Here, we report the large-scale measurement of the genotype-phenotype landscape for an allosteric protein: the lac repressor from Escherichia coli, LacI. Using a method that combines long-read and short-read DNA sequencing, we quantitatively measure the dose-response curves for nearly 105 variants of the LacI genetic sensor. The resulting data provide a quantitative map of the effect of amino acid substitutions on LacI allostery and reveal systematic sequence-structure-function relationships. We find that in many cases, allosteric phenotypes can be quantitatively predicted with additive or neural-network models, but unpredictable changes also occur. For example, we were surprised to discover a new band-stop phenotype that challenges conventional models of allostery and that emerges from combinations of nearly silent amino acid substitutions.
Assuntos
Genótipo , Repressores Lac/metabolismo , Fenótipo , Regulação Alostérica , Substituição de Aminoácidos , Escherichia coli/genética , Variação GenéticaRESUMO
Metagenomic samples are snapshots of complex ecosystems at work. They comprise hundreds of known and unknown species, contain multiple strain variants and vary greatly within and across environments. Many microbes found in microbial communities are not easily grown in culture making their DNA sequence our only clue into their evolutionary history and biological function. Metagenomic assembly is a computational process aimed at reconstructing genes and genomes from metagenomic mixtures. Current methods have made significant strides in reconstructing DNA segments comprising operons, tandem gene arrays and syntenic blocks. Shorter, higher-throughput sequencing technologies have become the de facto standard in the field. Sequencers are now able to generate billions of short reads in only a few days. Multiple metagenomic assembly strategies, pipelines and assemblers have appeared in recent years. Owing to the inherent complexity of metagenome assembly, regardless of the assembly algorithm and sequencing method, metagenome assemblies contain errors. Recent developments in assembly validation tools have played a pivotal role in improving metagenomics assemblers. Here, we survey recent progress in the field of metagenomic assembly, provide an overview of key approaches for genomic and metagenomic assembly validation and demonstrate the insights that can be derived from assemblies through the use of assembly validation strategies. We also discuss the potential for impact of long-read technologies in metagenomics. We conclude with a discussion of future challenges and opportunities in the field of metagenomic assembly and validation.
Assuntos
Metagenoma , Metagenômica/métodos , Microbiota/genética , Algoritmos , Biologia Computacional , Bases de Dados Genéticas/estatística & dados numéricos , Sequenciamento de Nucleotídeos em Larga Escala/estatística & dados numéricos , Metagenômica/estatística & dados numéricos , Metagenômica/tendências , SoftwareRESUMO
SUMMARY: We developed the metagenomeFeatures R Bioconductor package along with annotation packages for three 16S rRNA databases (Greengenes, RDP and SILVA) to facilitate working with 16S rRNA databases and marker-gene survey feature data. The metagenomeFeatures package defines two classes, MgDb for working with 16S rRNA sequence databases, and mgFeatures for marker-gene survey feature data. The associated annotation packages provide a consistent interface to the different databases facilitating database comparison and exploration. The mgFeatures-class represents a crucial step in the development of a common data structure for working with 16S marker-gene survey data in R. AVAILABILITY AND IMPLEMENTATION: https://bioconductor.org/packages/release/bioc/html/metagenomeFeatures.html. SUPPLEMENTARY INFORMATION: Supplementary material is available at Bioinformatics online.
Assuntos
Bases de Dados de Ácidos Nucleicos , Software , RNA Ribossômico 16S , Inquéritos e QuestionáriosRESUMO
The rapid adoption of microbial whole genome sequencing in public health, clinical testing, and forensic laboratories requires the use of validated measurement processes. Well-characterized, homogeneous, and stable microbial genomic reference materials can be used to evaluate measurement processes, improving confidence in microbial whole genome sequencing results. We have developed a reproducible and transparent bioinformatics tool, PEPR, Pipelines for Evaluating Prokaryotic References, for characterizing the reference genome of prokaryotic genomic materials. PEPR evaluates the quality, purity, and homogeneity of the reference material genome, and purity of the genomic material. The quality of the genome is evaluated using high coverage paired-end sequence data; coverage, paired-end read size and direction, as well as soft-clipping rates, are used to identify mis-assemblies. The homogeneity and purity of the material relative to the reference genome are characterized by comparing base calls from replicate datasets generated using multiple sequencing technologies. Genomic purity of the material is assessed by checking for DNA contaminants. We demonstrate the tool and its output using sequencing data while developing a Staphylococcus aureus candidate genomic reference material. PEPR is open source and available at https://github.com/usnistgov/pepr .
Assuntos
Biologia Computacional , Genoma , Sequenciamento de Nucleotídeos em Larga EscalaRESUMO
Despite the variety in sequencing platforms, mappers, and variant callers, no single pipeline is optimal across the entire human genome. Therefore, developers, clinicians, and researchers need to make tradeoffs when designing pipelines for their application. Currently, assessing such tradeoffs relies on intuition about how a certain pipeline will perform in a given genomic context. We present StratoMod, which addresses this problem using an interpretable machine-learning classifier to predict germline variant calling errors in a data-driven manner. We show StratoMod can precisely predict recall using Hifi or Illumina and leverage StratoMod's interpretability to measure contributions from difficult-to-map and homopolymer regions for each respective outcome. Furthermore, we use Statomod to assess the effect of mismapping on predicted recall using linear vs. graph-based references, and identify the hard-to-map regions where graph-based methods excelled and by how much. For these we utilize our draft benchmark based on the Q100 HG002 assembly, which contains previously-inaccessible difficult regions. Furthermore, StratoMod presents a new method of predicting clinically relevant variants likely to be missed, which is an improvement over current pipelines which only filter variants likely to be false. We anticipate this being useful for performing precise risk-reward analyses when designing variant calling pipelines.
Assuntos
Aprendizado de Máquina , Humanos , Genoma Humano , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Análise de Sequência de DNA/métodos , Software , Genômica/métodos , Variação GenéticaRESUMO
Tandem repeats (TRs) are highly polymorphic in the human genome, have thousands of associated molecular traits and are linked to over 60 disease phenotypes. However, they are often excluded from at-scale studies because of challenges with variant calling and representation, as well as a lack of a genome-wide standard. Here, to promote the development of TR methods, we created a catalog of TR regions and explored TR properties across 86 haplotype-resolved long-read human assemblies. We curated variants from the Genome in a Bottle (GIAB) HG002 individual to create a TR dataset to benchmark existing and future TR analysis methods. We also present an improved variant comparison method that handles variants greater than 4 bp in length and varying allelic representation. The 8.1% of the genome covered by the TR catalog holds ~24.9% of variants per individual, including 124,728 small and 17,988 large variants for the GIAB HG002 'truth-set' TR benchmark. We demonstrate the utility of this pipeline across short-read and long-read technologies.
RESUMO
Despite the growing variety of sequencing and variant-calling tools, no workflow performs equally well across the entire human genome. Understanding context-dependent performance is critical for enabling researchers, clinicians, and developers to make informed tradeoffs when selecting sequencing hardware and software. Here we describe a set of "stratifications," which are BED files that define distinct contexts throughout the genome. We define these for GRCh37/38 as well as the new T2T-CHM13 reference, adding many new hard-to-sequence regions which are critical for understanding performance as the field progresses. Specifically, we highlight the increase in hard-to-map and GC-rich stratifications in CHM13 relative to the previous references. We then compare the benchmarking performance with each reference and show the performance penalty brought about by these additional difficult regions in CHM13. Additionally, we demonstrate how the stratifications can track context-specific improvements over different platform iterations, using Oxford Nanopore Technologies as an example. The means to generate these stratifications are available as a snakemake pipeline at https://github.com/usnistgov/giab-stratifications . We anticipate this being useful in enabling precise risk-reward calculations when building sequencing pipelines for any of the commonly-used reference genomes.
Assuntos
Genoma Humano , Genômica , Software , Humanos , Genômica/métodos , Análise de Sequência de DNA/métodos , Benchmarking , Sequenciamento de Nucleotídeos em Larga Escala/métodosRESUMO
Less than half of individuals with a suspected Mendelian condition receive a precise molecular diagnosis after comprehensive clinical genetic testing. Improvements in data quality and costs have heightened interest in using long-read sequencing (LRS) to streamline clinical genomic testing, but the absence of control datasets for variant filtering and prioritization has made tertiary analysis of LRS data challenging. To address this, the 1000 Genomes Project ONT Sequencing Consortium aims to generate LRS data from at least 800 of the 1000 Genomes Project samples. Our goal is to use LRS to identify a broader spectrum of variation so we may improve our understanding of normal patterns of human variation. Here, we present data from analysis of the first 100 samples, representing all 5 superpopulations and 19 subpopulations. These samples, sequenced to an average depth of coverage of 37x and sequence read N50 of 54 kbp, have high concordance with previous studies for identifying single nucleotide and indel variants outside of homopolymer regions. Using multiple structural variant (SV) callers, we identify an average of 24,543 high-confidence SVs per genome, including shared and private SVs likely to disrupt gene function as well as pathogenic expansions within disease-associated repeats that were not detected using short reads. Evaluation of methylation signatures revealed expected patterns at known imprinted loci, samples with skewed X-inactivation patterns, and novel differentially methylated regions. All raw sequencing data, processed data, and summary statistics are publicly available, providing a valuable resource for the clinical genetics community to discover pathogenic SVs.
RESUMO
The Genome in a Bottle Consortium (GIAB), hosted by the National Institute of Standards and Technology (NIST), is developing new matched tumor-normal samples, the first to be explicitly consented for public dissemination of genomic data and cell lines. Here, we describe a comprehensive genomic dataset from the first individual, HG008, including DNA from an adherent, epithelial-like pancreatic ductal adenocarcinoma (PDAC) tumor cell line and matched normal cells from duodenal and pancreatic tissues. Data for the tumor-normal matched samples comes from thirteen distinct state-of-the-art whole genome measurement technologies, including high depth short and long-read bulk whole genome sequencing (WGS), single cell WGS, and Hi-C, and karyotyping. These data will be used by the GIAB Consortium to develop matched tumor-normal benchmarks for somatic variant detection. We expect these data to facilitate innovation for whole genome measurement technologies, de novo assembly of tumor and normal genomes, and bioinformatic tools to identify small and structural somatic mutations. This first-of-its-kind broadly consented open-access resource will facilitate further understanding of sequencing methods used for cancer biology.
RESUMO
Previous research on the Caribbean coral Montastraea cavernosa reported the presence of cyanobacterial endosymbionts and nitrogen fixation in orange, but not brown, colonies. We compared the diversity of nifH gene sequences between these two color morphs at three locations in the Caribbean and found that the nifH sequences recovered from M. cavernosa were consistent with previous studies on corals where members of both the α-proteobacteria and cyanobacteria were recovered. A number of nifH operational taxonomic units (OTUs) were significantly more abundant in the orange compared to the brown morphs, and one specific OTU (OTU 17), a cyanobacterial nifH sequence similar to others from corals and sponges and related to the cyanobacterial genus Cyanothece, was found in all orange morphs of M. cavernosa at all locations. The nifH diversity reported here, from a community perspective, was not significantly different between orange and brown morphs of M. cavernosa.
Assuntos
Antozoários/microbiologia , Biodiversidade , Animais , Bactérias/classificação , Bactérias/genética , Região do Caribe , Cianobactérias/classificação , Cianobactérias/genética , Variação Genética , Fixação de Nitrogênio/genética , Oxirredutases/genéticaRESUMO
As synthetic biology expands and accelerates into real-world applications, methods for quantitatively and precisely engineering biological function become increasingly relevant. This is particularly true for applications that require programmed sensing to dynamically regulate gene expression in response to stimuli. However, few methods have been described that can engineer biological sensing with any level of quantitative precision. Here, we present two complementary methods for precision engineering of genetic sensors: in silico selection and machine-learning-enabled forward engineering. Both methods use a large-scale genotype-phenotype dataset to identify DNA sequences that encode sensors with quantitatively specified dose response. First, we show that in silico selection can be used to engineer sensors with a wide range of dose-response curves. To demonstrate in silico selection for precise, multi-objective engineering, we simultaneously tune a genetic sensor's sensitivity (EC50) and saturating output to meet quantitative specifications. In addition, we engineer sensors with inverted dose-response and specified EC50. Second, we demonstrate a machine-learning-enabled approach to predictively engineer genetic sensors with mutation combinations that are not present in the large-scale dataset. We show that the interpretable machine learning results can be combined with a biophysical model to engineer sensors with improved inverted dose-response curves.
Assuntos
Aprendizado de Máquina , Biologia Sintética , Biologia Sintética/métodosRESUMO
Tandem repeats (TRs) are highly polymorphic in the human genome, have thousands of associated molecular traits, and are linked to over 60 disease phenotypes. However, their complexity often excludes them from at-scale studies due to challenges with variant calling, representation, and lack of a genome-wide standard. To promote TR methods development, we create a comprehensive catalog of TR regions and explore its properties across 86 samples. We then curate variants from the GIAB HG002 individual to create a tandem repeat benchmark. We also present a variant comparison method that handles small and large alleles and varying allelic representation. The 8.1% of the genome covered by the TR catalog holds â¼24.9% of variants per individual, including 124,728 small and 17,988 large variants for the GIAB HG002 TR benchmark. We work with the GIAB community to demonstrate the utility of this benchmark across short and long read technologies.
RESUMO
Environmental sampling for microbiological contaminants is a key component of hygiene monitoring and risk characterization practices utilized across diverse fields of application. However, confidence in surface sampling results, both in the field and in controlled laboratory studies, has been undermined by large variation in sampling performance results. Sources of variation include controlled parameters, such as sampling materials and processing methods, which often differ among studies, as well as random and systematic errors; however, the relative contributions of these factors remain unclear. The objective of this study was to determine the relative impacts of sample processing methods, including extraction solution and physical dissociation method (vortexing and sonication), on recovery of Gram-positive (Bacillus cereus) and Gram-negative (Burkholderia thailandensis and Escherichia coli) bacteria from directly inoculated wipes. This work showed that target organism had the largest impact on extraction efficiency and recovery precision, as measured by traditional colony counts. The physical dissociation method (PDM) had negligible impact, while the effect of the extraction solution was organism dependent. Overall, however, extraction of organisms from wipes using phosphate-buffered saline with 0.04% Tween 80 (PBST) resulted in the highest mean recovery across all three organisms. The results from this study contribute to a better understanding of the factors that influence sampling performance, which is critical to the development of efficient and reliable sampling methodologies relevant to public health and biodefense.
Assuntos
Bacillus cereus/isolamento & purificação , Técnicas Bacteriológicas/métodos , Burkholderia/isolamento & purificação , Microbiologia Ambiental , Escherichia coli/isolamento & purificação , Manejo de Espécimes/métodos , Sensibilidade e EspecificidadeRESUMO
Compared to its predecessors, the Telomere-to-Telomere CHM13 genome adds nearly 200 million base pairs of sequence, corrects thousands of structural errors, and unlocks the most complex regions of the human genome for clinical and functional study. We show how this reference universally improves read mapping and variant calling for 3202 and 17 globally diverse samples sequenced with short and long reads, respectively. We identify hundreds of thousands of variants per sample in previously unresolved regions, showcasing the promise of the T2T-CHM13 reference for evolutionary and biomedical discovery. Simultaneously, this reference eliminates tens of thousands of spurious variants per sample, including reduction of false positives in 269 medically relevant genes by up to a factor of 12. Because of these improvements in variant discovery coupled with population and functional genomic resources, T2T-CHM13 is positioned to replace GRCh38 as the prevailing reference for human genetics.
Assuntos
Variação Genética , Genoma Humano , Genômica/normas , Análise de Sequência de DNA/normas , Humanos , Padrões de ReferênciaRESUMO
The repetitive nature and complexity of some medically relevant genes poses a challenge for their accurate analysis in a clinical setting. The Genome in a Bottle Consortium has provided variant benchmark sets, but these exclude nearly 400 medically relevant genes due to their repetitiveness or polymorphic complexity. Here, we characterize 273 of these 395 challenging autosomal genes using a haplotype-resolved whole-genome assembly. This curated benchmark reports over 17,000 single-nucleotide variations, 3,600 insertions and deletions and 200 structural variations each for human genome reference GRCh37 and GRCh38 across HG002. We show that false duplications in either GRCh37 or GRCh38 result in reference-specific, missed variants for short- and long-read technologies in medically relevant genes, including CBS, CRYAA and KCNE1. When masking these false duplications, variant recall can improve from 8% to 100%. Forming benchmarks from a haplotype-resolved whole-genome assembly may become a prototype for future benchmarks covering the whole genome.