Results 1 - 20 of 258
1.
Methods Mol Biol ; 2802: 587-609, 2024.
Article in English | MEDLINE | ID: mdl-38819573

ABSTRACT

Comparative analysis of (meta)genomes necessitates aggregation, integration, and synthesis of well-annotated data using standards. The Genomic Standards Consortium (GSC) collaborates with the research community to develop and maintain the Minimum Information about any (x) Sequence (MIxS) reporting standard for genomic data. To facilitate the use of the GSC's MIxS reporting standard, we provide a description of the structure and terminology, how to navigate ontologies for required terms in MIxS, and demonstrate practical usage through a soil metagenome example.


Subject(s)
Genomics , Metagenome , Metagenomics , Metagenomics/methods , Metagenomics/standards , Genomics/methods , Genomics/standards , Metagenome/genetics , Databases, Genetic , Soil Microbiology
2.
ISME Commun ; 4(1): ycae057, 2024 Jan.
Article in English | MEDLINE | ID: mdl-38812718

ABSTRACT

Microbial communities are diverse biological systems that include taxa from across multiple kingdoms of life. Notably, interactions between bacteria and fungi play a significant role in determining community structure. However, these statistical associations across kingdoms are more difficult to infer than intra-kingdom associations due to the nature of the data involved using standard network inference techniques. We quantify the challenges of cross-kingdom network inference from both theoretical and practical points of view using synthetic and real-world microbiome data. We detail the theoretical issue presented by combining compositional data sets drawn from the same environment, e.g. 16S and ITS sequencing of a single set of samples, and we survey common network inference techniques for their ability to handle this error. We then test these techniques for the accuracy and usefulness of their intra- and inter-kingdom associations by inferring networks from a set of simulated samples for which a ground-truth set of associations is known. We show that while the two methods mitigate the error of cross-kingdom inference, there is little difference between techniques for key practical applications including identification of strong correlations and identification of possible keystone taxa (i.e. hub nodes in the network). Furthermore, we identify a signature of the error caused by transkingdom network inference and demonstrate that it appears in networks constructed using real-world environmental microbiome data.
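The compositional issue described above can be illustrated with a toy calculation. The sketch below is not the authors' method; the counts and the function names `closure` and `clr` are hypothetical. It assumes a single sample profiled by two separate assays (for example 16S and ITS) and shows that a centered log-ratio (CLR) transform applied to each block separately preserves within-kingdom log-ratios but puts cross-kingdom values on a different scale than a joint transform would.

```python
import math

def closure(counts):
    """Convert raw counts to relative abundances that sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]

def clr(composition):
    """Centered log-ratio: log of each part minus the mean log of all parts."""
    logs = [math.log(x) for x in composition]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

# Hypothetical counts for one sample (illustrative only, no zeros).
bacteria_counts = [120, 30, 50]   # e.g. from a 16S assay
fungi_counts = [8, 2]             # e.g. from an ITS assay

# Each block transformed on its own, then concatenated.
separate = clr(closure(bacteria_counts)) + clr(closure(fungi_counts))
# Both blocks pooled before transforming (assumes comparable depth,
# which real paired assays generally do not guarantee).
joint = clr(closure(bacteria_counts + fungi_counts))

within_separate = separate[0] - separate[1]   # bacterium 0 vs bacterium 1
within_joint = joint[0] - joint[1]
cross_separate = separate[0] - separate[3]    # bacterium 0 vs fungus 0
cross_joint = joint[0] - joint[3]

print(within_separate, within_joint)  # identical log-ratios
print(cross_separate, cross_joint)    # different values
```

Within-kingdom comparisons are unaffected by the choice of reference, while the cross-kingdom comparison depends entirely on how the two blocks were normalized, which is one way to see why inter-kingdom associations are harder to interpret.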

3.
Microbiol Resour Announc ; 13(4): e0067723, 2024 Apr 11.
Article in English | MEDLINE | ID: mdl-38488370

ABSTRACT

We present the complete genome sequence of the probiotic strain Lactobacillus acidophilus ATCC 9224. The genome sequence provides a valuable resource for investigating the phylogenetic evolution of this lineage and conducting comparative genomics with other Lactobacillus strains and species.

4.
Microlife ; 5: uqae004, 2024.
Article in English | MEDLINE | ID: mdl-38463165

ABSTRACT

Bacteriophages play a crucial role in shaping bacterial communities, yet the mechanisms by which nonmotile bacteriophages interact with their hosts remain poorly understood. This knowledge gap is especially pronounced in structured environments like soil, where spatial constraints and air-filled zones hinder aqueous diffusion. In soil, hyphae of filamentous microorganisms form a network of 'fungal highways' (FHs) that facilitate the dispersal of other microorganisms. We propose that FHs also promote bacteriophage dissemination. Viral particles can diffuse in liquid films surrounding hyphae or be transported by infectable (host) or uninfectable (nonhost) bacterial carriers coexisting on FH networks. To test this, two bacteriophages that infect Pseudomonas putida DSM291 (host) but not KT2440 (nonhost) were used. In the absence of carriers, bacteriophages showed limited diffusion on 3D-printed abiotic networks, but diffusion was significantly improved in Pythium ultimum-formed FHs when the number of connecting hyphae exceeded 20. Transport by both host and nonhost carriers enhanced bacteriophage dissemination. Host carriers were five times more effective in transporting bacteriophages, particularly in FHs with over 30 connecting hyphae. This study enhances our understanding of bacteriophage dissemination in nonsaturated environments like soils, highlighting the importance of biotic networks and bacterial hosts in facilitating this process.

5.
Viruses ; 16(3)2024 03 11.
Article in English | MEDLINE | ID: mdl-38543795

ABSTRACT

Genomic sequencing of clinical samples to identify emerging variants of SARS-CoV-2 has been a key public health tool for curbing the spread of the virus. As a result, an unprecedented number of SARS-CoV-2 genomes were sequenced during the COVID-19 pandemic, which allowed for rapid identification of genetic variants, enabling the timely design and testing of therapies and deployment of new vaccine formulations to combat the new variants. However, despite the technological advances of deep sequencing, the analysis of the raw sequence data generated globally is neither standardized nor consistent, leading to vastly disparate sequences that may impact identification of variants. Here, we show that for both Illumina and Oxford Nanopore sequencing platforms, downstream bioinformatic protocols used by industry, government, and academic groups resulted in different virus sequences from the same sample. These bioinformatic workflows produced consensus genomes with differences in single nucleotide polymorphisms and in the inclusion or exclusion of insertions and/or deletions, despite using the same raw sequence input datasets. We compare and characterize such discrepancies and propose a specific suite of parameters and protocols that should be adopted across the field. Consistent results from bioinformatic workflows are fundamental to SARS-CoV-2 and future pathogen surveillance efforts, including pandemic preparation, to allow for a data-driven and timely public health response.


Subject(s)
COVID-19 , SARS-CoV-2 , Humans , SARS-CoV-2/genetics , COVID-19/epidemiology , Pandemics , Workflow , Computational Biology
6.
Astrobiology ; 23(12): 1348-1367, 2023 12.
Article in English | MEDLINE | ID: mdl-38079228

ABSTRACT

Democratizing genomic data science, including bioinformatics, can diversify the STEM workforce and may, in turn, bring new perspectives into the space sciences. In this respect, the development of education and research programs that bridge genome science with "place" and world-views specific to a given region is valuable for Indigenous students and educators. Through a multi-institutional collaboration, we developed an ongoing education program and model that includes Illumina and Oxford Nanopore sequencing, free bioinformatic platforms, and teacher training workshops to address our research and education goals through a place-based science education lens. High school students and researchers cultivated, sequenced, assembled, and annotated the genomes of 13 bacteria from Mars analog sites with cultural relevance, 10 of which were novel species. Students, teachers, and community members assisted with the discovery of new, potentially chemolithotrophic bacteria relevant to astrobiology. This joint education-research program also led to the discovery of species from Mars analog sites capable of producing N-acyl homoserine lactones, which are quorum-sensing molecules used in bacterial communication. Whole genome sequencing was completed in high school classrooms, and connected students to funded space research, increased research output, and provided culturally relevant, place-based science education, with participants naming three novel species described here. Students at St. Andrew's School (Honolulu, Hawai'i) proposed the name Bradyrhizobium prioritasuperba for the type strain, BL16AT, of the new species (DSM 112479T = NCTC 14602T). The nonprofit organization Kauluakalana proposed the name Brenneria ulupoensis for the type strain, K61T, of the new species (DSM 116657T = LMG 33184T), and Hawai'i Baptist Academy students proposed the name Paraflavitalea speifideiaquila for the type strain, BL16ET, of the new species (DSM 112478T = NCTC 14603T).


Subject(s)
Exobiology , Schools , Humans , Hawaii , Genomics , Bacteria
7.
Front Fungal Biol ; 4: 1285531, 2023.
Article in English | MEDLINE | ID: mdl-38155707

ABSTRACT

Members of the fungal genus Morchella are widely known for their important ecological roles and significant economic value. In this study, we used amplicon and genome sequencing to characterize bacterial communities associated with sexual fruiting bodies from wild specimens, as well as vegetative mycelium and sclerotia obtained from Morchella isolates grown in vitro. These investigations included diverse representatives from both Elata and Esculenta Morchella clades. Unique bacterial community compositions were observed across the various structures examined, both within and across individual Morchella isolates or specimens. However, specific bacterial taxa were frequently detected in association with certain structures, providing support for an associated core bacterial community. Bacteria from the genera Pseudomonas and Ralstonia constituted the core bacterial associates of Morchella mycelia and sclerotia, while other genera (e.g., Pedobacter spp., Devosia spp., and Bradyrhizobium spp.) constituted the core bacterial community of fruiting bodies. Furthermore, the importance of Pseudomonas as a key member of the bacteriome was supported by the isolation of several Pseudomonas strains from mycelia during in vitro cultivation. Four of the six mycelial-derived Pseudomonas isolates shared 16S rDNA sequence identity with amplicon sequences recovered directly from the examined fungal structures. Distinct interaction phenotypes (antagonistic or neutral) were observed in confrontation assays between these bacteria and various Morchella isolates. Genome sequences obtained from these Pseudomonas isolates revealed intriguing differences in gene content and annotated functions, specifically with respect to toxin-antitoxin systems, cell adhesion, chitinases, and insecticidal toxins. These genetic differences correlated with the interaction phenotypes. This study provides evidence that Pseudomonas spp. are frequently associated with Morchella and these associations may greatly impact fungal physiology.

8.
Microlife ; 4: uqad042, 2023.
Article in English | MEDLINE | ID: mdl-37965130

ABSTRACT

This study presents an inexpensive approach for the macro- and microscopic observation of fungal mycelial growth. The 'fungal drops' method allows investigation of the development of a mycelial network in filamentous microorganisms at the colony and hyphal scales. A heterogeneous environment is created by depositing 15-20 µl drops on a hydrophobic surface at a fixed distance. This system is akin to a two-dimensional (2D) soil-like structure in which aqueous pockets are intermixed with air-filled pores. The fungus (spores or mycelia) is inoculated into one of the drops, from which hyphal growth and exploration take place. Hyphal structures are assessed at different scales using stereoscopic and microscopic imaging. The former allows evaluation of the local response of regions within the colony (modular behaviour), while the latter can be used for fractal dimension analyses to describe the hyphal network architecture. The method was tested with several species to demonstrate its transferability. In addition, two sets of experiments were carried out to demonstrate its use in fungal biology. First, mycelial reorganization of Fusarium oxysporum was assessed as a response to patches containing different nutrient concentrations. Second, the effect of interactions with the soil bacterium Pseudomonas putida on habitat colonization by the same fungus was assessed. The method proved fast and accessible, allowed for a high level of replication, and complements more complex experimental platforms. Coupled with image analysis, the fungal drops method provides new insights into the study of fungal modularity both macroscopically and at a single-hypha level.

9.
Microorganisms ; 11(11)2023 Nov 18.
Article in English | MEDLINE | ID: mdl-38004814

ABSTRACT

Escherichia albertii is an emerging foodborne pathogen. To better understand the pathogenesis and health risk of this pathogen, comparative genomics and phenotypic characterization were applied to assess the pathogenicity potential of E. albertii strains isolated from wild birds in a major agricultural region in California. Shiga toxin genes stx2f were present in all avian strains. Pangenome analyses of 20 complete genomes revealed a total of 11,249 genes, of which nearly 80% were accessory genes. Both core gene-based phylogenetic and accessory gene-based relatedness analyses consistently grouped the three stx2f-positive clinical strains with the five avian strains carrying ST7971. Among the three Stx2f-converting prophage integration sites identified, ssrA was the most common one. Besides the locus of enterocyte effacement and type three secretion system, the high pathogenicity island, OI-122, and type six secretion systems were identified. Substantial strain variation in virulence gene repertoire, Shiga toxin production, and cytotoxicity was revealed. Six avian strains exhibited significantly higher cytotoxicity than that of stx2f-positive E. coli, and three of them exhibited a level of cytotoxicity comparable to that of enterohemorrhagic E. coli outbreak strains, suggesting that wild birds could serve as a reservoir of E. albertii strains with great potential to cause severe diseases in humans.
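The core/accessory split reported in a pangenome analysis can be sketched with plain set operations. The example below is a minimal illustration, not the pipeline used in the study: the strain names and gene labels are hypothetical, and it assumes the strict definition in which a core gene is present in every genome (some analyses use a relaxed presence threshold instead).

```python
def partition_pangenome(genomes):
    """Split a pangenome into core and accessory genes.

    genomes: dict mapping a strain name to an iterable of gene-family labels.
    Core genes occur in every genome; accessory genes are all the others.
    """
    gene_sets = [set(genes) for genes in genomes.values()]
    if not gene_sets:
        return set(), set()
    pangenome = set().union(*gene_sets)
    core = set.intersection(*gene_sets)
    accessory = pangenome - core
    return core, accessory

# Hypothetical toy input (labels are placeholders, not real gene names).
toy_genomes = {
    "strain_1": ["geneA", "geneB", "geneC"],
    "strain_2": ["geneA", "geneB", "geneD"],
    "strain_3": ["geneA", "geneE"],
}

core, accessory = partition_pangenome(toy_genomes)
accessory_fraction = len(accessory) / (len(core) + len(accessory))
print(sorted(core), sorted(accessory), accessory_fraction)
```

In this toy input one gene is shared by all three strains and four are not, so the accessory fraction is 0.8.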

10.
Front Microbiol ; 14: 1216591, 2023.
Article in English | MEDLINE | ID: mdl-37799600

ABSTRACT

Members of the archaeal order Caldarchaeales (previously the phylum Aigarchaeota) are poorly sampled and are represented in public databases by relatively few genomes. Additional representative genomes will help resolve their placement among all known members of Archaea and provide insights into their roles in the environment. In this study, we analyzed 16S rRNA gene amplicons belonging to the Caldarchaeales that are available in public databases, which demonstrated that archaea of the order Caldarchaeales are diverse, widespread, and most abundant in geothermal habitats. We also constructed five metagenome-assembled genomes (MAGs) of Caldarchaeales from two geothermal features to investigate their metabolic potential and phylogenomic position in the domain Archaea. Two of the MAGs were assembled from microbial community DNA extracted from fumarolic lava rocks from Mauna Ulu, Hawai'i, and three were assembled from DNA obtained from hot spring sinters from the El Tatio geothermal field in Chile. MAGs from Hawai'i are high quality bins with completeness >95% and contamination <1%, and one likely belongs to a novel species in a new genus recently discovered at a submarine volcano off New Zealand. MAGs from Chile have lower completeness levels ranging from 27 to 70%. Gene content of the MAGs revealed that these members of Caldarchaeales are likely metabolically versatile and exhibit the potential for both chemoorganotrophic and chemolithotrophic lifestyles. The wide array of metabolic capabilities exhibited by these members of Caldarchaeales might help them thrive under diverse harsh environmental conditions. All the MAGs except one from Chile harbor putative prophage regions encoding several auxiliary metabolic genes (AMGs) that may confer a fitness advantage on their Caldarchaeales hosts by increasing their metabolic potential and making them better adapted to new environmental conditions. Phylogenomic analysis of the five MAGs and over 3,000 representative archaeal genomes showed the order Caldarchaeales forms a monophyletic group that is sister to the clade comprising the orders Geothermarchaeales (previously Candidatus Geothermarchaeota), Conexivisphaerales and Nitrososphaerales (formerly known as Thaumarchaeota), supporting the status of Caldarchaeales members as a clade distinct from the Thaumarchaeota.

11.
Nat Biotechnol ; 2023 Sep 21.
Article in English | MEDLINE | ID: mdl-37735266

ABSTRACT

Identifying and characterizing mobile genetic elements in sequencing data is essential for understanding their diversity, ecology, biotechnological applications and impact on public health. Here we introduce geNomad, a classification and annotation framework that combines information from gene content and a deep neural network to identify sequences of plasmids and viruses. geNomad uses a dataset of more than 200,000 marker protein profiles to provide functional gene annotation and taxonomic assignment of viral genomes. Using a conditional random field model, geNomad also detects proviruses integrated into host genomes with high precision. In benchmarks, geNomad achieved high classification performance for diverse plasmids and viruses (Matthews correlation coefficient of 77.8% and 95.3%, respectively), substantially outperforming other tools. Leveraging geNomad's speed and scalability, we processed over 2.7 trillion base pairs of sequencing data, leading to the discovery of millions of viruses and plasmids that are available through the IMG/VR and IMG/PR databases. geNomad is available at https://portal.nersc.gov/genomad .
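For readers unfamiliar with the benchmark metric quoted above, the Matthews correlation coefficient (MCC) can be computed directly from a binary confusion matrix. The sketch below uses the standard formula; the function name `matthews_corrcoef_from_counts` and the example counts are hypothetical and are not taken from the geNomad benchmarks. The function returns a value between -1 and 1, which can be multiplied by 100 to match a percentage presentation.

```python
import math

def matthews_corrcoef_from_counts(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    Returns 0.0 when the denominator is zero (a common convention).
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return numerator / denominator

# Hypothetical counts for illustration.
perfect = matthews_corrcoef_from_counts(tp=50, tn=50, fp=0, fn=0)
inverted = matthews_corrcoef_from_counts(tp=0, tn=0, fp=50, fn=50)
chance = matthews_corrcoef_from_counts(tp=25, tn=25, fp=25, fn=25)
print(perfect, inverted, chance)
```

Unlike plain accuracy, MCC uses all four cells of the confusion matrix, so it stays informative when the two classes are very unequal in size.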

12.
Commun Biol ; 6(1): 948, 2023 09 18.
Article in English | MEDLINE | ID: mdl-37723238

ABSTRACT

Diverse members of early-diverging Mucoromycota, including mycorrhizal taxa and soil-associated Mortierellaceae, are known to harbor Mollicutes-related endobacteria (MRE). It has been hypothesized that MRE were acquired by a common ancestor and transmitted vertically. Alternatively, MRE endosymbionts could have invaded after the divergence of Mucoromycota lineages and subsequently spread to new hosts horizontally. To better understand the evolutionary history of MRE symbionts, we generated and analyzed four complete MRE genomes from two Mortierellaceae genera: Linnemannia (MRE-L) and Benniella (MRE-B). These genomes include the smallest known genomes of fungal endosymbionts and showed signals of a tight relationship with hosts, including a reduced functional capacity and genes transferred from fungal hosts to MRE. Phylogenetic reconstruction including nine MRE from mycorrhizal fungi revealed that MRE-B genomes are more closely related to MRE from Glomeromycotina than MRE-L from the same host family. We posit that reductions in genome size, GC content, pseudogene content, and repeat content in MRE-L may reflect a longer-term relationship with their fungal hosts. These data indicate Linnemannia and Benniella MRE were likely acquired independently after their fungal hosts diverged from a common ancestor. This work expands upon foundational knowledge on minimal genomes and provides insights into the evolution of bacterial endosymbionts.


Subject(s)
Mycorrhizae , Tenericutes , Phylogeny , Genomics , Mycorrhizae/genetics , Genome Size
13.
Microbiome ; 11(1): 192, 2023 08 26.
Article in English | MEDLINE | ID: mdl-37626434

ABSTRACT

As microbiome research has progressed, it has become clear that most, if not all, eukaryotic organisms are hosts to microbiomes composed of prokaryotes, other eukaryotes, and viruses. Fungi have only recently been considered holobionts with their own microbiomes, as filamentous fungi have been found to harbor bacteria (including cyanobacteria), mycoviruses, other fungi, and whole algal cells within their hyphae. Constituents of this complex endohyphal microbiome have been interrogated using multi-omic approaches. However, a lack of tools, techniques, and standardization for integrative multi-omics for small-scale microbiomes (e.g., intracellular microbiomes) has limited progress towards investigating and understanding the total diversity of the endohyphal microbiome and its functional impacts on fungal hosts. Understanding microbiome impacts on fungal hosts will advance explorations of how "microbiomes within microbiomes" affect broader microbial community dynamics and ecological functions. Progress to date as well as ongoing challenges of performing integrative multi-omics on the endohyphal microbiome is discussed herein. Addressing the challenges associated with the sample extraction, sample preparation, multi-omic data generation, and multi-omic data analysis and integration will help advance current knowledge of the endohyphal microbiome and provide a road map for shrinking microbiome investigations to smaller scales.


Subject(s)
Microbiota , Multiomics , Data Analysis , Eukaryota , Microbiota/genetics , Prokaryotic Cells
14.
Sci Total Environ ; 892: 164506, 2023 Sep 20.
Article in English | MEDLINE | ID: mdl-37295515

ABSTRACT

Microbial communities, and their ecological importance, have been investigated in several habitats. However, so far, most studies could not describe the closest microbial interactions and their functionalities. This study investigates the co-occurring interactions between fungi and bacteria in plant rhizoplanes and their potential functions. The partnerships were obtained using fungal-highway columns with four plant-based media. The fungi and associated microbiomes isolated from the columns were identified by sequencing the ITS (fungi) and 16S rRNA genes (bacteria). Statistical analyses including Exploratory Graph and Network Analysis were used to visualize the presence of underlying clusters in the microbial communities and evaluate the metabolic functions associated with the fungal microbiome (PICRUSt2). Our findings characterize the presence of both unique and complex bacterial communities associated with different fungi. The results showed that Bacillus was associated as exo-bacteria in 80 % of the fungi but occurred as putative endo-bacteria in 15 %. A shared core of putative endo-bacterial genera, potentially involved in the nitrogen cycle was found in 80 % of the isolated fungi. The comparison of potential metabolic functions of the putative endo- and exo-communities highlighted the potential essential factors to establish an endosymbiotic relationship, such as the loss of pathways associated with metabolites obtained from the host while maintaining pathways responsible for bacterial survival within the hypha.


Subject(s)
Microbiota , Mycobiome , Fungi , RNA, Ribosomal, 16S/genetics , Plant Roots/microbiology , Bacteria , Soil Microbiology
15.
Fungal Biol ; 127(5): 1005-1009, 2023 05.
Article in English | MEDLINE | ID: mdl-37142360

ABSTRACT

Research on bacterial-fungal interactions (BFIs) has revealed that fungi and bacteria frequently interact with one another within diverse ecosystems and microbiomes. Assessing the current state of knowledge within the field of BFI research, particularly with respect to what interactions between bacteria and fungi have been previously described, is very challenging and time consuming. This is largely due to a lack of any centralized resource, with reports of BFIs being spread across publications in numerous journals using non-standardized text to describe the relationships. To address this issue, we have developed the BFI Research Portal, a publicly accessible database of previously reported interactions between bacterial and fungal taxa to serve as a centralized resource for the field. Users can query bacterial or fungal taxa to see what members from the other kingdom have been observed as interaction partners. Search results are accompanied by interactive and intuitive visual outputs, and the database is a dynamic resource that will be updated as new BFIs are reported.


Subject(s)
Fungi , Microbiota , Bacteria
16.
Environ Microbiol ; 24(12): 6320-6335, 2022 12.
Article in English | MEDLINE | ID: mdl-36530021

ABSTRACT

Endosporulation is a complex morphophysiological process resulting in a more resistant cellular structure that is produced within the mother cell and is called an endospore. Endosporulation evolved in the common ancestor of Firmicutes, but it is lost in descendant lineages classified as asporogenic. While Kurthia spp. is considered to comprise only asporogenic species, we show here that strain 11kri321, which was isolated from an oligotrophic geothermal reservoir, produces phase-bright spore-like structures. Phylogenomics of strain 11kri321 and other Kurthia strains reveals little similarity to genetic determinants of sporulation known from endosporulating Bacilli. However, morphological hallmarks of endosporulation were observed in two of the four Kurthia strains tested, resulting in spore-like structures (cryptospores). In contrast to classic endospores, these cryptospores did not protect against heat or UV damage and successive sub-culturing led to the loss of the cryptosporulating phenotype. Our findings imply that a cryptosporulation phenotype may have been prevalent and subsequently lost by laboratory culturing in other Firmicutes currently considered as asporogenic. Cryptosporulation might thus represent an ancestral but unstable and adaptive developmental state in Firmicutes that is under selection under harsh environmental conditions.


Subject(s)
Bacillus , Firmicutes , Spores, Bacterial/genetics , Phylogeny
18.
bioRxiv ; 2022 Nov 03.
Article in English | MEDLINE | ID: mdl-36380755

ABSTRACT

During the COVID-19 pandemic, SARS-CoV-2 surveillance efforts integrated genome sequencing of clinical samples to identify emergent viral variants and to support rapid experimental examination of genome-informed vaccine and therapeutic designs. Given the broad range of methods applied to generate new viral genomes, it is critical that consensus and variant calling tools yield consistent results across disparate pipelines. Here we examine the impact of sequencing technologies (Illumina and Oxford Nanopore) and 7 different downstream bioinformatic protocols on SARS-CoV-2 variant calling as part of the NIH Accelerating COVID-19 Therapeutic Interventions and Vaccines (ACTIV) Tracking Resistance and Coronavirus Evolution (TRACE) initiative, a public-private partnership established to address the COVID-19 outbreak. Our results indicate that bioinformatic workflows can yield consensus genomes with different single nucleotide polymorphisms, insertions, and/or deletions even when using the same raw sequence input datasets. We introduce the use of a specific suite of parameters and protocols that greatly improves the agreement among pipelines developed by diverse organizations. Such consistency among bioinformatic pipelines is fundamental to SARS-CoV-2 and future pathogen surveillance efforts. The application of analysis standards is necessary to more accurately document phylogenomic trends and support data-driven public health responses.

19.
Viruses ; 14(10)2022 09 27.
Article in English | MEDLINE | ID: mdl-36298683

ABSTRACT

Despite unprecedented global sequencing and surveillance of SARS-CoV-2, timely identification of the emergence and spread of novel variants of concern (VoCs) remains a challenge. Several million raw genome sequencing runs are now publicly available. We sought to survey these datasets for intrahost variation to study emerging mutations of concern. We developed iSKIM ("intrahost SARS-CoV-2 k-mer identification method") to relatively quickly and efficiently screen the many SARS-CoV-2 datasets to identify intrahost mutations belonging to lineages of concern. Certain mutations surged in frequency as intrahost minor variants just prior to, or while lineages of concern arose. The Spike N501Y change common to several VoCs was found as a minor variant in 834 samples as early as October 2020. This coincides with the timing of the first detected samples with this mutation in the Alpha/B.1.1.7 and Beta/B.1.351 lineages. Using iSKIM, we also found that Spike L452R was detected as an intrahost minor variant as early as September 2020, prior to the observed rise of the Epsilon/B.1.429/B.1.427 lineages in late 2020. iSKIM rapidly screens for mutations of interest in raw data, prior to genome assembly, and can be used to detect increases in intrahost variants, potentially providing an early indication of novel variant spread.
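The idea of screening raw reads for a mutation before any assembly can be sketched with simple substring matching. This is a minimal illustration of k-mer screening in general, not the iSKIM implementation: the function names, the k-mers, and the reads are all hypothetical, and the sketch ignores reverse complements, sequencing errors, and base quality.

```python
def count_kmer_hits(reads, kmer):
    """Number of reads that contain the given k-mer at least once."""
    return sum(1 for read in reads if kmer in read)

def variant_fraction(reads, ref_kmer, alt_kmer):
    """Fraction of informative reads that carry the variant k-mer.

    Returns None when no read contains either k-mer.
    """
    ref_hits = count_kmer_hits(reads, ref_kmer)
    alt_hits = count_kmer_hits(reads, alt_kmer)
    total = ref_hits + alt_hits
    if total == 0:
        return None
    return alt_hits / total

# Hypothetical k-mers spanning one site; they differ only at the middle base.
REF_KMER = "ACGTACGTA"
ALT_KMER = "ACGTTCGTA"

# Hypothetical raw reads: nine carry the reference k-mer, one the variant.
toy_reads = ["GG" + REF_KMER + "CC"] * 9 + ["TT" + ALT_KMER + "AA"]

fraction = variant_fraction(toy_reads, REF_KMER, ALT_KMER)
print(fraction)
```

A fraction well below 0.5, as in this toy sample, is what would be called an intrahost minor variant; tracking that fraction across many datasets over time is the kind of signal the abstract describes.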


Subject(s)
COVID-19 , SARS-CoV-2 , Humans , SARS-CoV-2/genetics , COVID-19/diagnosis , COVID-19/epidemiology , Mutation , Spike Glycoprotein, Coronavirus/genetics
20.
PeerJ ; 10: e13821, 2022.
Article in English | MEDLINE | ID: mdl-36093336

ABSTRACT

Background: Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the cause of coronavirus disease 2019 (COVID-19), has spread globally and is being surveilled with an international genome sequencing effort. Surveillance consists of sample acquisition, library preparation, and whole genome sequencing. This has necessitated a classification scheme detailing Variants of Concern (VOC) and Variants of Interest (VOI), and the rapid expansion of bioinformatics tools for sequence analysis. These bioinformatic tools are means for major actionable results: maintaining quality assurance and checks, defining population structure, performing genomic epidemiology, and inferring lineage to allow reliable and actionable identification and classification. Additionally, the pandemic has required public health laboratories to reach high throughput proficiency in sequencing library preparation and downstream data analysis rapidly. However, both processes can be limited by a lack of a standardized sequence dataset. Methods: We identified six SARS-CoV-2 sequence datasets from recent publications, public databases and internal resources. In addition, we created a method to mine public databases to identify representative genomes for these datasets. Using this novel method, we identified several genomes as either VOI/VOC representatives or non-VOI/VOC representatives. To describe each dataset, we utilized a previously published datasets format, which describes accession information and whole dataset information. Additionally, a script from the same publication has been enhanced to download and verify all data from this study. Results: The benchmark datasets focus on the two most widely used sequencing platforms: long read sequencing data from the Oxford Nanopore Technologies platform and short read sequencing data from the Illumina platform. There are six datasets: three were derived from recent publications; two were derived from data mining public databases to answer common questions not covered by published datasets; one unique dataset representing common sequence failures was obtained by rigorously scrutinizing data that did not pass quality checks. The dataset summary table, data mining script and quality control (QC) values for all sequence data are publicly available on GitHub: https://github.com/CDCgov/datasets-sars-cov-2. Discussion: The datasets presented here were generated to help public health laboratories build sequencing and bioinformatics capacity, benchmark different workflows and pipelines, and calibrate QC thresholds to ensure sequencing quality. Together, improvements in these areas support accurate and timely outbreak investigation and surveillance, providing actionable data for pandemic management. Furthermore, these publicly available and standardized benchmark data will facilitate the development and adjudication of new pipelines.


Subject(s)
COVID-19 , SARS-CoV-2 , Humans , SARS-CoV-2/genetics , COVID-19/epidemiology , Benchmarking , Computational Biology , Sequence Analysis