ABSTRACT
SUMMARY: Genomics has become an essential technology for surveilling emerging infectious disease outbreaks. A range of technologies and strategies for pathogen genome enrichment and sequencing are being used by laboratories worldwide, together with different, and sometimes ad hoc, analytical procedures for generating genome sequences. A fully integrated analytical process for raw sequence to consensus genome determination, suited to outbreaks such as the ongoing COVID-19 pandemic, is critical to provide a solid genomic basis for epidemiological analyses and well-informed decision making. We have developed a web-based platform and integrated bioinformatic workflows that help to provide consistent high-quality analysis of SARS-CoV-2 sequencing data generated with either Illumina or Oxford Nanopore Technologies (ONT) platforms. Using an intuitive web-based interface, this workflow automates data quality control, SARS-CoV-2 reference-based genome variant and consensus calling, and lineage determination, and provides the ability to submit the consensus sequence and necessary metadata to GenBank, GISAID and INSDC raw data repositories. We tested workflow usability using real-world data, validated the accuracy of variant and lineage analysis using several test datasets, and further performed detailed comparisons with results from the COVID-19 Galaxy Project workflow. Our analyses indicate that the EDGE COVID-19 (EC-19) workflows generate high-quality SARS-CoV-2 genomes. Finally, we share a perspective on patterns and impact observed with Illumina versus ONT technologies on workflow congruence and differences. AVAILABILITY AND IMPLEMENTATION: https://edge-covid19.edgebioinformatics.org and https://github.com/LANL-Bioinformatics/EDGE/tree/SARS-CoV2. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
COVID-19, SARS-CoV-2, Viral Genome, Genomics, Humans, Pandemics, SARS-CoV-2/genetics
ABSTRACT
SUMMARY: Polymerase chain reaction-based assays are the current gold standard for detecting and diagnosing SARS-CoV-2. However, as SARS-CoV-2 mutates, we need to constantly assess whether existing PCR-based assays will continue to detect all known viral strains. To enable the continuous monitoring of SARS-CoV-2 assays, we have developed a web-based assay validation algorithm that checks existing PCR-based assays against the ever-expanding genome databases for SARS-CoV-2 using both thermodynamic and edit-distance metrics. The assay-screening results are displayed as a heatmap showing the number of mismatches between each detection assay and each SARS-CoV-2 genome sequence. Using a mismatch threshold to define detection failure, assay performance is summarized with the true-positive rate (recall) to simplify assay comparisons. AVAILABILITY AND IMPLEMENTATION: The assay evaluation website and supporting software are open source and freely available at https://covid19.edgebioinformatics.org/#/assayValidation, https://github.com/jgans/thermonucleotideBLAST and https://github.com/LANL-Bioinformatics/assay_validation. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
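The idea of summarizing assay performance as recall under a mismatch threshold can be sketched in a few lines of Python. This is only an illustration, not the published tool: it uses a plain Hamming distance over sliding windows in place of the thermodynamic and edit-distance metrics described above, and the function names `min_mismatches` and `assay_recall`, as well as the default threshold of 2, are hypothetical.

```python
def min_mismatches(oligo: str, genome: str) -> int:
    """Smallest Hamming distance between the oligo and any same-length
    window of the genome (assumption: no gaps, forward strand only)."""
    k = len(oligo)
    if len(genome) < k:
        return k  # oligo cannot fit; treat every base as a mismatch
    return min(
        sum(a != b for a, b in zip(oligo, genome[i:i + k]))
        for i in range(len(genome) - k + 1)
    )


def assay_recall(oligos, genomes, max_mismatches=2):
    """Fraction of genomes detected, where 'detected' means every oligo of
    the assay has a binding site within the mismatch threshold."""
    if not genomes:
        raise ValueError("at least one genome is required")
    detected = sum(
        all(min_mismatches(o, g) <= max_mismatches for o in oligos)
        for g in genomes
    )
    return detected / len(genomes)
```

For example, `assay_recall(["ACGT"], ["TTACGTTT", "TTAGGTTT"], max_mismatches=0)` counts only the first genome as detected, because the best site in the second genome carries one mismatch.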
Subject(s)
COVID-19, SARS-CoV-2, COVID-19 Testing, Humans, Polymerase Chain Reaction, Sensitivity and Specificity
ABSTRACT
Continued advancements in sequencing technologies have fueled the development of new sequencing applications and promise to flood current databases with raw data. A number of factors prevent the seamless and easy use of these data, including the breadth of project goals, the wide array of tools that individually perform fractions of any given analysis, the large number of associated software/hardware dependencies, and the detailed expertise required to perform these analyses. To address these issues, we have developed an intuitive web-based environment with a wide assortment of integrated and cutting-edge bioinformatics tools in pre-configured workflows. These workflows, coupled with the ease of use of the environment, provide even novice next-generation sequencing users with the ability to perform many complex analyses with only a few mouse clicks and, within the context of the same environment, to visualize and further interrogate their results. This bioinformatics platform is an initial attempt at Empowering the Development of Genomics Expertise (EDGE) in a wide range of applications for microbial research.
Subject(s)
Bacillus anthracis/classification, Computational Biology/methods, Ebolavirus/classification, Escherichia coli/classification, Software, Yersinia pestis/classification, Anthrax/microbiology, Bacillus anthracis/genetics, Ebolavirus/genetics, Escherichia coli/genetics, Escherichia coli Infections/microbiology, Ebola Hemorrhagic Fever/virology, High-Throughput Nucleotide Sequencing, Humans, Internet, Phylogeny, Plague/microbiology, Yersinia pestis/genetics
ABSTRACT
A major challenge in the field of shotgun metagenomics is the accurate identification of organisms present within a microbial community, based on classification of short sequence reads. Though existing microbial community profiling methods have attempted to rapidly classify the millions of reads output from modern sequencers, the combination of incomplete databases, similarity among otherwise divergent genomes, errors and biases in sequencing technologies, and the large volumes of sequencing data required for metagenome sequencing has led to unacceptably high false discovery rates (FDR). Here, we present the application of a novel, gene-independent and signature-based metagenomic taxonomic profiling method with significantly and consistently smaller FDR than any other available method. Our algorithm circumvents false positives using a series of non-redundant signature databases and examines Genomic Origins Through Taxonomic CHAllenge (GOTTCHA). GOTTCHA was tested and validated on 20 synthetic and mock datasets ranging in community composition and complexity, was applied successfully to data generated from spiked environmental and clinical samples, and robustly demonstrates superior performance compared with other available tools.
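The false discovery rate used as the headline metric above is the share of reported taxa that are not truly present in the sample. A minimal sketch of that calculation for a mock community with known composition follows; the function name `false_discovery_rate` is hypothetical, and this is the generic definition rather than GOTTCHA's own evaluation code.

```python
def false_discovery_rate(predicted, truth):
    """FDR = false positives / all positive calls, for a profile of
    predicted taxa evaluated against a known (mock) community."""
    predicted, truth = set(predicted), set(truth)
    if not predicted:
        return 0.0  # no calls were made, so no false discoveries
    false_positives = len(predicted - truth)
    return false_positives / len(predicted)
```

With four predicted taxa of which two are absent from the mock community, the function returns 0.5.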
Subject(s)
Metagenomics/methods, Air Microbiology, Algorithms, Feces/microbiology, Francisella tularensis/genetics, Francisella tularensis/isolation & purification, Humans, Metagenome, Software
ABSTRACT
BACKGROUND: Illumina is the most widely used next-generation sequencing technology and produces millions of short reads that contain errors. These sequencing errors constitute a major problem in applications such as de novo genome assembly, metagenomics analysis and single nucleotide polymorphism discovery. RESULTS: In this study, we present ADEPT, a dynamic error detection method that uses the quality scores of each nucleotide and its neighboring nucleotides, together with their positions within the read, and compares these to the position-specific quality score distribution of all bases within the sequencing run. This method greatly improves upon other available methods in terms of the true positive rate of error discovery without affecting the false positive rate, particularly within the middle of reads. CONCLUSIONS: ADEPT is the only tool to date that dynamically assesses errors within reads by comparing position-specific and neighboring base quality scores with the distribution of quality scores for the dataset being analyzed. The result is a method that is less prone to position-dependent under-prediction, which is one of the most prominent issues in error prediction. The outcome is that ADEPT improves upon prior efforts in identifying true errors, primarily within the middle of reads, while reducing the false positive rate.
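The general idea of judging a base against the run-wide quality distribution at the same read position, while also looking at its neighbors, can be illustrated as follows. This is a simplified sketch and not the ADEPT algorithm: the decision rule, the cutoff `z`, and the function names `position_stats` and `flag_suspect_bases` are all assumptions made for illustration, and reads are assumed to have equal length.

```python
from statistics import mean, pstdev


def position_stats(quals):
    """Per-position (mean, standard deviation) of quality scores across
    all reads in the run; `quals` is a list of equal-length score lists."""
    return [(mean(col), pstdev(col)) for col in zip(*quals)]


def flag_suspect_bases(read_quals, stats, z=1.0):
    """Return 0-based positions whose quality is more than z standard
    deviations below the run mean for that position AND whose immediate
    neighbors are, on average, also below their own position means."""
    devs = [q - mu for q, (mu, _) in zip(read_quals, stats)]
    flagged = []
    for i, dev in enumerate(devs):
        sd = stats[i][1]
        neighbor_devs = [devs[j] for j in (i - 1, i + 1) if 0 <= j < len(devs)]
        if dev < -z * sd and neighbor_devs and mean(neighbor_devs) < 0:
            flagged.append(i)
    return flagged
```

Because the threshold is computed per position, a score that would be unremarkable at the 3' end of a read can still be flagged in the middle, where the run-wide distribution is typically higher.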
Subject(s)
Bacteria/genetics, High-Throughput Nucleotide Sequencing/methods, Metagenomics, Single Nucleotide Polymorphism/genetics, DNA Sequence Analysis/methods, Software, Algorithms, Bacteria/classification, Computational Biology, Humans, Quality Control
ABSTRACT
Genomic sequencing of clinical samples to identify emerging variants of SARS-CoV-2 has been a key public health tool for curbing the spread of the virus. As a result, an unprecedented number of SARS-CoV-2 genomes were sequenced during the COVID-19 pandemic, which allowed for rapid identification of genetic variants, enabling the timely design and testing of therapies and deployment of new vaccine formulations to combat the new variants. However, despite the technological advances of deep sequencing, the analysis of the raw sequence data generated globally is neither standardized nor consistent, leading to vastly disparate sequences that may impact identification of variants. Here, we show that for both Illumina and Oxford Nanopore sequencing platforms, downstream bioinformatic protocols used by industry, government, and academic groups resulted in different virus sequences from the same sample. These bioinformatic workflows produced consensus genomes that differed in single nucleotide polymorphisms and in the inclusion or exclusion of insertions and/or deletions, despite using the same raw sequence datasets as input. We compare and characterize such discrepancies and propose a specific suite of parameters and protocols that should be adopted across the field. Consistent results from bioinformatic workflows are fundamental to SARS-CoV-2 and future pathogen surveillance efforts, including pandemic preparation, to allow for a data-driven and timely public health response.
Subject(s)
COVID-19, SARS-CoV-2, Humans, SARS-CoV-2/genetics, COVID-19/epidemiology, Pandemics, Workflow, Computational Biology
ABSTRACT
Early and accurate diagnosis of respiratory pathogens and associated outbreaks can allow for the control of spread, epidemiological modeling, targeted treatment, and decision making, as is evident with the current COVID-19 pandemic. Many respiratory infections share common symptoms, making them difficult to diagnose using only syndromic presentation. Yet, with delays in getting reference laboratory tests and limited availability and poor sensitivity of point-of-care tests, syndromic diagnosis is the most relied-upon method in clinical practice today. Here, we examine the variability in diagnostic identification of respiratory infections during the annual infection cycle in northern New Mexico, by comparing syndromic diagnostics with polymerase chain reaction (PCR) and sequencing-based methods, with the goal of assessing gaps in our current ability to identify respiratory pathogens. Of 97 individuals who presented with symptoms of respiratory infection, only 23 were positive for at least one RNA virus, as confirmed by sequencing. Whereas influenza virus (n = 7) was expected during this infection cycle, we also observed coronavirus (n = 7), respiratory syncytial virus (n = 8), parainfluenza virus (n = 4), and human metapneumovirus (n = 1) in individuals with respiratory infection symptoms. Four patients were coinfected with two viruses. Of 21 individuals who tested positive using PCR, RNA sequencing completely matched in only 12 (57%). Few individuals (37.1%) were diagnosed with an upper respiratory tract infection or viral syndrome by syndromic diagnostics, and the type of virus could only be distinguished in one patient. Thus, current syndromic diagnostic approaches fail to accurately identify respiratory pathogens associated with infection and are not suited to capture emerging threats in an accurate fashion. We conclude that there is a critical and urgent need for layered agnostic diagnostics to track known and unknown pathogens at the point of care to control future outbreaks.
ABSTRACT
During the COVID-19 pandemic, SARS-CoV-2 surveillance efforts integrated genome sequencing of clinical samples to identify emergent viral variants and to support rapid experimental examination of genome-informed vaccine and therapeutic designs. Given the broad range of methods applied to generate new viral genomes, it is critical that consensus and variant calling tools yield consistent results across disparate pipelines. Here we examine the impact of sequencing technologies (Illumina and Oxford Nanopore) and 7 different downstream bioinformatic protocols on SARS-CoV-2 variant calling as part of the NIH Accelerating COVID-19 Therapeutic Interventions and Vaccines (ACTIV) Tracking Resistance and Coronavirus Evolution (TRACE) initiative, a public-private partnership established to address the COVID-19 outbreak. Our results indicate that bioinformatic workflows can yield consensus genomes with different single nucleotide polymorphisms, insertions, and/or deletions even when using the same raw sequence input datasets. We introduce the use of a specific suite of parameters and protocols that greatly improves the agreement among pipelines developed by diverse organizations. Such consistency among bioinformatic pipelines is fundamental to SARS-CoV-2 and future pathogen surveillance efforts. The application of analysis standards is necessary to more accurately document phylogenomic trends and support data-driven public health responses.
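The kind of disagreement described above (different SNPs or indels from the same raw reads) can be tallied with a short comparison routine. The sketch below is illustrative only and is not the protocol used in the study: it assumes each pipeline's consensus has already been aligned to the same reference coordinates, with `-` marking deleted positions and `N` marking masked ones, and the function names `compare_consensus` and `pairwise_report` are hypothetical.

```python
from itertools import combinations


def compare_consensus(a: str, b: str) -> dict:
    """List 1-based positions where two aligned consensus genomes differ,
    split into SNPs and indel columns; positions masked with N are skipped."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    diffs = {"snp": [], "indel": []}
    for pos, (x, y) in enumerate(zip(a.upper(), b.upper()), start=1):
        if x == y or "N" in (x, y):
            continue
        kind = "indel" if "-" in (x, y) else "snp"
        diffs[kind].append(pos)
    return diffs


def pairwise_report(consensus_by_pipeline: dict) -> dict:
    """Compare every pair of pipelines; keys are (name_a, name_b) tuples."""
    names = sorted(consensus_by_pipeline)
    return {
        (p, q): compare_consensus(consensus_by_pipeline[p], consensus_by_pipeline[q])
        for p, q in combinations(names, 2)
    }
```

Each gap column is reported separately, so a three-base deletion called by only one pipeline appears as three indel positions; grouping adjacent columns into events would be a natural refinement.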
ABSTRACT
The nascent field of microbiome science is transitioning from a descriptive approach of cataloging taxa and functions present in an environment to applying multi-omics methods to investigate microbiome dynamics and function. A large number of new tools and algorithms have been designed and used for very specific purposes on samples collected by individual investigators or groups. While these developments have been quite instructive, the ability to compare microbiome data generated by many groups of researchers is impeded by the lack of standardized application of bioinformatics methods. Additionally, there are few examples of broad bioinformatics workflows that can process metagenome, metatranscriptome, metaproteome and metabolomic data at scale, and no central hub that allows processing, or provides varied omics data that are findable, accessible, interoperable and reusable (FAIR). Here, we review some of the challenges that exist in analyzing omics data within the microbiome research sphere, and provide context on how the National Microbiome Data Collaborative has adopted a standardized and open access approach to address such challenges.
ABSTRACT
Extensive use of antibiotics in both public health and animal husbandry has resulted in rapid emergence of antibiotic resistance in almost all human pathogens, including biothreat pathogens. Antibiotic resistance has thus become a major concern for both public health and national security. We developed multiplexed assays for rapid, simultaneous pathogen detection and characterization of ciprofloxacin and doxycycline resistance in Bacillus anthracis, Yersinia pestis, and Francisella tularensis. These assays are SNP-based and use Multiplexed Oligonucleotide Ligation-PCR (MOL-PCR). The MOL-PCR assay chemistry and MOLigo probe design process are presented. A web-based tool, MOLigoDesigner (http://MOLigoDesigner.lanl.gov), was developed to facilitate the probe design. All probes were experimentally validated individually and in multiplexed assays, and minimal sets of multiplexed MOLigo probes were identified for simultaneous pathogen detection and antibiotic resistance characterization.
Subject(s)
Microbial Drug Resistance/genetics, Polymerase Chain Reaction/methods, Single Nucleotide Polymorphism, Animals, Bacillus anthracis/drug effects, Bacillus anthracis/genetics, Bacillus anthracis/pathogenicity, Ciprofloxacin/pharmacology, Computational Biology, Bacterial DNA/genetics, Doxycycline/pharmacology, Francisella tularensis/drug effects, Francisella tularensis/genetics, Francisella tularensis/pathogenicity, Humans, Internet, Molecular Probe Techniques, Oligonucleotide Probes/genetics, Yersinia pestis/drug effects, Yersinia pestis/genetics, Yersinia pestis/pathogenicity
ABSTRACT
Next-generation sequencing (NGS) offers unparalleled resolution for untargeted organism detection and characterization. However, the majority of NGS analysis programs require users to be proficient in programming and command-line interfaces. EDGE bioinformatics was developed to offer scientists with little to no bioinformatics expertise a point-and-click platform for analyzing sequencing data in a rapid and reproducible manner. EDGE (Empowering the Development of Genomics Expertise) v1.0, released in January 2017, is an intuitive web-based bioinformatics platform engineered for the analysis of microbial and metagenomic NGS-based data (Li et al., 2017). The EDGE bioinformatics suite combines vetted publicly available tools and tracks settings to ensure reliable and reproducible analysis workflows. To execute the EDGE workflow, only raw sequencing reads and a project ID are necessary. Users can access in-house data or run analyses on samples deposited in the Sequence Read Archive. Default settings offer a robust first glance and are often sufficient for novice users. All analyses are modular; users can easily turn workflows on or off and modify parameters to suit project needs. Results are compiled and available for download in a PDF-formatted report containing publication-quality figures. We caution that interpreting results still requires in-depth scientific understanding; however, report visuals are often informative, even to novice users.
ABSTRACT
Ebolaviruses are a diverse group of RNA viruses comprising five different species, four of which cause fatal hemorrhagic fever in humans. Because of their high infectivity and lethality, ebolaviruses are considered major biothreat agents. Although detection assays exist, no forensic assays are currently available. Here, we report the development of forensic assays that differentiate ebolaviruses. We performed phylogenetic analyses and identified canonical SNPs for all species, major clades and isolates. TaqMan-MGB allelic discrimination assays based on these SNPs were designed, screened against synthetic RNA templates, and validated against ebolavirus genomic RNAs. A total of 45 assays were validated to provide 100% coverage of the species and variants with additional resolution at the isolate level. These assays enabled accurate forensic analysis on 4 "unknown" ebolaviruses. Unknowns were correctly classified to species and variant. A goal of providing resolution below the isolate level was not successful. These high-resolution forensic assays allow rapid and accurate genotyping of ebolaviruses for forensic investigations.
Subject(s)
Ebolavirus/genetics, Single Nucleotide Polymorphism, Alleles, Forensic Genetics, Viral Genome, Phylogeny, Viral RNA/analysis, Real-Time Polymerase Chain Reaction, Sequence Analysis
ABSTRACT
Microbial diversity studies based on metagenomic sequencing have greatly enhanced our knowledge of the microbial world. However, one caveat is that not all microorganisms are equally well detected, which calls into question the universality of this approach. Firmicutes are known to be a dominant bacterial group. Several Firmicutes species are endospore formers, and this property makes them hardy in potentially harsh conditions and thus likely to be present in a wide variety of environments, even as residents and not functional players. While metagenomic libraries can be expected to contain endospore formers, endospores are known to be resilient to many traditional methods of DNA isolation and thus potentially undetectable. In this study we evaluated the representation of endospore-forming Firmicutes in 73 published metagenomic datasets using two molecular markers unique to this bacterial group (spo0A and gpr). Both markers were notably absent in well-known habitats of Firmicutes such as soil, with spo0A found only in three mammalian gut microbiomes. A tailored DNA extraction method resulted in the detection of a large diversity of endospore formers in amplicon sequencing of the 16S rRNA and spo0A genes. However, shotgun classification was still poor, with only a minor fraction of the community assigned to Firmicutes. Thus, removing a specific bias in a molecular workflow improved detection in amplicon sequencing, but it was insufficient to overcome the limitations for detecting endospore-forming Firmicutes in whole-genome metagenomics. In conclusion, this study highlights the importance of understanding the specific methodological biases, which can contribute to improving the universality of metagenomic approaches.
ABSTRACT
We report here the genome sequence of an effective chromium-reducing bacterium, Bacillus cereus strain S612. The size of the draft genome sequence is approximately 5.4 Mb, with a G+C content of 35%, and it is predicted to contain 5,450 protein-coding genes.
ABSTRACT
The genome of strain GS3372 is the first publicly available genome of Aeribacillus pallidus. This endospore-forming thermophilic strain was isolated from a deep geothermal reservoir. The availability of this genome can contribute to the clarification of the taxonomy of the closely related Anoxybacillus, Geobacillus, and Aeribacillus genera.
ABSTRACT
Bacillus alveayuensis strain 24KAM51 was isolated from a marine hydrothermal vent in Milos, Greece. Its genome depicts interesting features of halotolerance and resistance to heavy metals.
ABSTRACT
The genus Burkholderia encompasses both pathogenic (including Burkholderia mallei and Burkholderia pseudomallei, U.S. Centers for Disease Control and Prevention Category B listed), and nonpathogenic Gram-negative bacilli. Here we present full genome sequences for a panel of 59 Burkholderia strains, selected to aid in detection assay development.
ABSTRACT
We report here the genome sequence of Thauera sp. strain SWB20, isolated from a Singaporean wastewater treatment facility using gel microdroplets (GMDs) and single-cell genomics (SCG). This approach provided a single clonal microcolony that was sufficient to obtain a 4.9-Mbp genome assembly of an ecologically relevant Thauera species.