ABSTRACT
In recent years, large-scale oceanic sequencing efforts have provided a deeper understanding of marine microbial communities and their dynamics. These research endeavors require the acquisition of complex and varied datasets through large, interdisciplinary and collaborative efforts. However, no unifying framework currently exists for the marine science community to integrate sequencing data with physical, geological, and geochemical datasets. Planet Microbe is a web-based platform that enables data discovery from curated historical and ongoing oceanographic sequencing efforts. In Planet Microbe, each 'omics sample is linked with other biological and physicochemical measurements collected for the same water samples or during the same sample collection event, to provide a broader environmental context. This work highlights the need for curated aggregation efforts that can enable new insights into high-quality metagenomic datasets. Planet Microbe is freely accessible from https://www.planetmicrobe.org/.
Subjects
Aquatic Organisms/microbiology , Data Analysis , Environment , Metagenomics , Planets , Databases, Genetic , Reference Standards , User-Computer Interface
ABSTRACT
Although secondary metabolites are typically associated with competitive or pathogenic interactions, the high bioactivity of endophytic fungi in the Xylariales, coupled with their abundance and broad host ranges spanning all lineages of land plants and lichens, suggests that enhanced secondary metabolism might facilitate symbioses with phylogenetically diverse hosts. Here, we examined secondary metabolite gene clusters (SMGCs) across 96 Xylariales genomes in two clades (Xylariaceae s.l. and Hypoxylaceae), including 88 newly sequenced genomes of endophytes and closely related saprotrophs and pathogens. We paired genomic data with extensive metadata on endophyte hosts and substrates, enabling us to examine genomic factors related to the breadth of symbiotic interactions and ecological roles. All genomes contain hyperabundant SMGCs; however, Xylariaceae have increased numbers of gene duplications, horizontal gene transfers (HGTs) and SMGCs. Enhanced metabolic diversity of endophytes is associated with a greater diversity of hosts and increased capacity for lignocellulose decomposition. Our results suggest that, as host and substrate generalists, Xylariaceae endophytes experience greater selection to diversify SMGCs compared with more ecologically specialised Hypoxylaceae species. Overall, our results provide new evidence that SMGCs may facilitate symbiosis with phylogenetically diverse hosts, highlighting the importance of microbial symbioses to drive fungal metabolic diversity.
Subjects
Lichens , Xylariales , Endophytes , Fungi , Lichens/microbiology , Multigene Family , Symbiosis/genetics
ABSTRACT
Infections are a serious health concern worldwide, particularly in vulnerable populations such as the immunocompromised, elderly, and young. Advances in metagenomic sequencing availability, speed, and decreased cost offer the opportunity to supplement or even replace culture-based identification of pathogens with DNA sequence-based diagnostics. Adopting metagenomic analysis for clinical use requires that all aspects of the workflow are optimized and tested, including data analysis and computational time and resources. We tested the accuracy, sensitivity, and resource requirements of three top metagenomic taxonomic classifiers that use fast k-mer based algorithms: Centrifuge, CLARK, and KrakenUniq. Binary mixtures of bacteria showed that all three reliably identified organisms down to 1% relative abundance, while only the relative abundance estimates of Centrifuge and CLARK were accurate. All three classifiers identified the organisms present in their default databases from a mock bacterial community of 20 organisms, but only Centrifuge had no false positives. In addition, Centrifuge required far fewer computational resources and less time for analysis. Centrifuge analysis of metagenomes obtained from samples of ventilator-associated pneumonia (VAP), infected diabetic foot ulcers (DFUs), and febrile neutropenia (FN) identified pathogenic bacteria and one virus that were corroborated by culture or a clinical PCR assay. Importantly, in both diabetic foot ulcer patients, metagenomic sequencing identified pathogens 4-6 weeks before culture. Finally, we show that Centrifuge results were minimally affected by elimination of time-consuming read quality control and host screening steps.
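The 1% relative-abundance detection limit reported above can be illustrated with a minimal sketch. It assumes per-taxon read counts are already available from a classifier report; the function names, the example counts, and the plain read-fraction definition of abundance are illustrative assumptions and are not how Centrifuge, CLARK, or KrakenUniq compute their own estimates.

```python
def relative_abundance(read_counts):
    """Convert per-taxon read counts into fractions of all classified reads."""
    total = sum(read_counts.values())
    if total == 0:
        return {}
    return {taxon: count / total for taxon, count in read_counts.items()}


def detected_taxa(read_counts, min_fraction=0.01):
    """Return taxa at or above a relative-abundance threshold (default 1%)."""
    fractions = relative_abundance(read_counts)
    return sorted(t for t, f in fractions.items() if f >= min_fraction)


# Hypothetical counts for a binary mixture plus a few spurious hits.
counts = {
    "Escherichia coli": 9890,
    "Staphylococcus aureus": 105,
    "spurious hit": 5,
}
print(detected_taxa(counts))
```

A read-fraction threshold like this is only a first-pass filter; a real evaluation would also account for genome size and database composition.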
Subjects
Bacteria/genetics , Bacteria/isolation & purification , Metagenomics/methods , Algorithms , DNA Barcoding, Taxonomic/methods , High-Throughput Nucleotide Sequencing , Humans , Metagenome , Microbiota/genetics , Sensitivity and Specificity , Sequence Analysis, DNA/methods
ABSTRACT
Gramene (http://www.gramene.org) is a curated online resource for comparative functional genomics in crops and model plant species, currently hosting 27 fully and 10 partially sequenced reference genomes in its build number 38. Its strength derives from the application of a phylogenetic framework for genome comparison and the use of ontologies to integrate structural and functional annotation data. Whole-genome alignments complemented by phylogenetic gene family trees help infer syntenic and orthologous relationships. Genetic variation data, sequences and genome mappings available for 10 species, including Arabidopsis, rice and maize, help infer putative variant effects on genes and transcripts. The pathways section also hosts 10 species-specific metabolic pathways databases developed in-house or by our collaborators using Pathway Tools software, which facilitates searches for pathway, reaction and metabolite annotations, and allows analyses of user-defined expression datasets. Recently, we released a Plant Reactome portal featuring 133 curated rice pathways. This portal will be expanded for Arabidopsis, maize and other plant species. We continue to provide genetic and QTL maps and marker datasets developed by crop researchers. The project provides a unique community platform to support scientific research in plant genomics including studies in evolution, genetics, plant breeding, molecular biology, biochemistry and systems biology.
Subjects
Databases, Genetic , Genome, Plant , Genomics , Crops, Agricultural/genetics , Genetic Variation , Internet , Metabolic Networks and Pathways/genetics , Molecular Sequence Annotation , Plants/genetics , Plants/metabolism
ABSTRACT
Ensembl Genomes (http://www.ensemblgenomes.org) is an integrating resource for genome-scale data from non-vertebrate species. The project exploits and extends technologies for genome annotation, analysis and dissemination, developed in the context of the vertebrate-focused Ensembl project, and provides a complementary set of resources for non-vertebrate species through a consistent set of programmatic and interactive interfaces. These provide access to data including reference sequence, gene models, transcriptional data, polymorphisms and comparative analysis. This article provides an update to the previous publications about the resource, with a focus on recent developments. These include the addition of important new genomes (and related data sets) including crop plants, vectors of human disease and eukaryotic pathogens. In addition, the resource has scaled up its representation of bacterial genomes, and now includes the genomes of over 9000 bacteria. Specific extensions to the web and programmatic interfaces have been developed to support users in navigating these large data sets. Looking forward, analytic tools to allow targeted selection of data for visualization and download are likely to become increasingly important in future as the number of available genomes increases within all domains of life, and some of the challenges faced in representing bacterial data are likely to become commonplace for eukaryotes in future.
Subjects
Databases, Genetic , Genome , Animals , Edible Grain/genetics , Genome, Bacterial , Genome, Fungal , Genome, Plant , Genomics , Internet , Molecular Sequence Annotation , Software
ABSTRACT
Now in its 10th year, the Gramene database (http://www.gramene.org) has grown from its primary focus on rice, the first fully-sequenced grass genome, to become a resource for major model and crop plants including Arabidopsis, Brachypodium, maize, sorghum, poplar and grape in addition to several species of rice. Gramene began with the addition of an Ensembl genome browser and has expanded in the last decade to become a robust resource for plant genomics hosting a wide array of data sets including quantitative trait loci (QTL), metabolic pathways, genetic diversity, genes, proteins, germplasm, literature, ontologies and a fully-structured markers and sequences database integrated with genome browsers and maps from various published studies (genetic, physical, bin, etc.). In addition, Gramene now hosts a variety of web services including a Distributed Annotation Server (DAS), BLAST and a public MySQL database. Twice a year, Gramene releases a major build of the database and makes interim releases to correct errors or to make important updates to software and/or data.
Subjects
Databases, Genetic , Genome, Plant , Plants/genetics , Chromosome Mapping , Genes, Plant , Genetic Variation , Genomics , Metabolic Networks and Pathways , Plants/metabolism , Quantitative Trait Loci , Synteny
ABSTRACT
Environmental contamination is a fundamental determinant of health and well-being, and when the environment is compromised, vulnerabilities are generated. The complex challenges associated with environmental health and food security are influenced by current and emerging political, social, economic, and environmental contexts. To solve these "wicked" dilemmas, disparate public health surveillance efforts are conducted by local, state, and federal agencies. More recently, citizen/community science (CS) monitoring efforts are providing site-specific data. One of the biggest challenges in using these government datasets, let alone incorporating CS data, for a holistic assessment of environmental exposure is data management and interoperability. To facilitate a more holistic perspective and approach to solution generation, we have developed a method to provide a common data model that will allow environmental health researchers working at different scales and research domains to exchange data and ask new questions. We anticipate that this method will help to address environmental health disparities, which are unjust and avoidable, while ensuring CS datasets are ethically integrated to achieve environmental justice. Specifically, we used a transdisciplinary research framework to develop a methodology to integrate CS data with existing governmental environmental monitoring and social attribute data (vulnerability and resilience variables) that span across 10 different federal and state agencies. A key challenge in integrating such different datasets is the lack of widely adopted ontologies for vulnerability and resiliency factors. In addition to following the best practice of submitting new term requests to existing ontologies to fill gaps, we have also created an application ontology, the Superfund Research Project Data Interface Ontology (SRPDIO).
ABSTRACT
CMap is a web-based tool for displaying and comparing maps of any type and from any species. A user can compare an unlimited number of maps, view pair-wise comparisons of known correspondences, and search for maps or for features by name, species, type and accession. CMap is freely available, can run on a variety of database engines and uses only free and open software components. AVAILABILITY: http://www.gmod.org/cmap
Subjects
Computational Biology/methods , Software , Animals , Internet
ABSTRACT
Gramene (www.gramene.org) is a curated resource for genetic, genomic and comparative genomics data for the major crop species, including rice, maize, wheat and many other plant (mainly grass) species. Gramene is an open-source project. All data and software are freely downloadable through the ftp site (ftp.gramene.org/pub/gramene) and available for use without restriction. Gramene's core data types include genome assembly and annotations, other DNA/mRNA sequences, genetic and physical maps/markers, genes, quantitative trait loci (QTLs), proteins, ontologies, literature and comparative mappings. Since our last NAR publication 2 years ago, we have updated these data types to include new datasets and new connections among them. Completely new features include rice pathways for functional annotation of rice genes; genetic diversity data from rice, maize and wheat to show genetic variations among different germplasms; large-scale genome comparisons among Oryza sativa and its wild relatives for evolutionary studies; and the creation of orthologous gene sets and phylogenetic trees among rice, Arabidopsis thaliana, maize, poplar and several animal species (for reference purpose). We have significantly improved the web interface in order to provide a more user-friendly browsing experience, including a dropdown navigation menu system, unified web page for markers, genes, QTLs and proteins, and enhanced quick search functions.
Subjects
Crops, Agricultural/genetics , Databases, Genetic , Genome, Plant , Arabidopsis/genetics , Chromosome Mapping , Crops, Agricultural/metabolism , Genetic Markers , Genetic Variation , Genomics , Internet , Oryza/genetics , Poaceae/genetics , Triticum/genetics , User-Computer Interface , Zea mays/genetics
ABSTRACT
High-throughput sequencing technologies provide unprecedented power to identify novel viruses from a wide variety of (environmental) samples. The field of 'viral metagenomics' has dramatically expanded our understanding of viral diversity. Viral metagenomic approaches imply that many novel viruses will not be described by researchers who are experts on (the genomic organization of) that virus family. We have developed the papillomavirus annotation tool (PuMA) to provide researchers with a convenient and reproducible method to annotate and report novel papillomaviruses. PuMA currently correctly annotates 99% of the papillomavirus genes when benchmarked against the 655 reference genomes in the papillomavirus episteme. Compared to another viral annotation pipeline, PuMA annotates more viral features while being more accurate. To demonstrate its general applicability, we also developed a preliminary version of PuMA that can annotate polyomaviruses. PuMA is available on GitHub (https://github.com/KVD-lab/puma) and through the iMicrobe online environment (https://www.imicrobe.us/#/apps/puma).
ABSTRACT
Background: Shotgun metagenomics provides powerful insights into microbial community biodiversity and function. Yet, inferences from metagenomic studies are often limited by dataset size and complexity and are restricted by the availability and completeness of existing databases. De novo comparative metagenomics enables the comparison of metagenomes based on their total genetic content. Results: We developed a tool called Libra that performs an all-vs-all comparison of metagenomes for precise clustering based on their k-mer content. Libra uses a scalable Hadoop framework for massive metagenome comparisons, Cosine Similarity for calculating the distance using sequence composition and abundance while normalizing for sequencing depth, and a web-based implementation in iMicrobe (http://imicrobe.us) that uses the CyVerse advanced cyberinfrastructure to promote broad use of the tool by the scientific community. Conclusions: A comparison of Libra to equivalent tools using both simulated and real metagenomic datasets, ranging from 80 million to 4.2 billion reads, reveals that methods commonly implemented to reduce compute time for large datasets, such as data reduction, read count normalization, and presence/absence distance metrics, greatly diminish the resolution of large-scale comparative analyses. In contrast, Libra uses all of the reads to calculate k-mer abundance in a Hadoop architecture that can scale to any size dataset to enable global-scale analyses and link microbial signatures to biological processes.
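The k-mer comparison described in this abstract can be sketched in a few lines. This is a minimal in-memory illustration of cosine similarity over k-mer abundance vectors, not Libra's Hadoop implementation; the function names, the toy reads, and the choice of k = 3 are assumptions made for the example.

```python
from collections import Counter
from math import sqrt


def kmer_counts(reads, k):
    """Count every overlapping k-mer across a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def cosine_similarity(a, b):
    """Cosine of the angle between two k-mer abundance vectors."""
    dot = sum(a[kmer] * b[kmer] for kmer in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy "metagenomes": sample_b is sample_a sequenced at twice the depth.
sample_a = kmer_counts(["ACGTACGT", "TTGACA"], k=3)
sample_b = kmer_counts(["ACGTACGT", "TTGACA"] * 2, k=3)
sample_c = kmer_counts(["GGGGGGGG"], k=3)

print(cosine_similarity(sample_a, sample_b))
print(cosine_similarity(sample_a, sample_c))
```

Because cosine similarity depends only on the direction of the abundance vector, doubling the sequencing depth of a sample leaves its similarity to the original unchanged, which is the depth-normalizing property the abstract refers to.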
Subjects
Metagenomics/methods , Microbiota/genetics , Software , Algorithms , Cluster Analysis , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA/methods
ABSTRACT
BACKGROUND: Scientists have amassed a wealth of microbiome datasets, making it possible to study microbes in biotic and abiotic systems on a population or planetary scale; however, this potential has not been fully realized given that the tools, datasets, and computation are available in diverse repositories and locations. To address this challenge, we developed iMicrobe.us, a community-driven microbiome data marketplace and tool exchange for users to integrate their own data and tools with those from the broader community. FINDINGS: The iMicrobe platform brings together analysis tools and microbiome datasets by leveraging National Science Foundation-supported cyberinfrastructure and computing resources from CyVerse, Agave, and XSEDE. The primary purpose of iMicrobe is to provide users with a freely available, web-based platform to (1) maintain and share project data, metadata, and analysis products, (2) search for related public datasets, and (3) use and publish bioinformatics tools that run on highly scalable computing resources. Analysis tools are implemented in containers that encapsulate complex software dependencies and run on freely available XSEDE resources via the Agave API, which can retrieve datasets from the CyVerse Data Store or any web-accessible location (e.g., FTP, HTTP). CONCLUSIONS: iMicrobe promotes data integration, sharing, and community-driven tool development by making open source data and tools accessible to the research community in a web-based platform.
Subjects
Metagenomics/methods , Microbiota/genetics , Software , Big Data , Metagenome
ABSTRACT
A wealth of viral data sits untapped in publicly available metagenomic data sets when it might be extracted to create a usable index for the virological research community. We hypothesized that work of this complexity and scale could be done in a hackathon setting. Ten teams comprising over 40 participants from six countries assembled to create a crowd-sourced set of analysis and processing pipelines for a complex biological data set in a three-day event on the San Diego State University campus starting 9 January 2019. Prior to the hackathon, 141,676 metagenomic data sets from the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) were pre-assembled into contiguous assemblies (contigs) by NCBI staff. During the hackathon, a subset consisting of 2953 SRA data sets (approximately 55 million contigs) was selected, which were further filtered for a minimal length of 1 kb. This resulted in 4.2 million contigs, which were aligned using BLAST against all known virus genomes, phylogenetically clustered and assigned metadata. Out of the 4.2 million contigs, 360,000 contigs were labeled with domains and an additional subset containing 4400 contigs was screened for virus or virus-like genes. The work yielded valuable insights into both SRA data and the cloud infrastructure required to support such efforts, revealing analysis bottlenecks and possible workarounds. Mainly: (i) conservative assemblies of SRA data improve initial analysis steps; (ii) existing bioinformatic software with weak multithreading/multicore support can be elevated by wrapper scripts to use all cores within a computing node; (iii) redesigning existing bioinformatic algorithms for a cloud infrastructure facilitates their use by a wider audience; and (iv) a cloud infrastructure allows a diverse group of researchers to collaborate effectively. The scientific findings will be extended during a follow-up event. Here, we present the applied workflows, initial results, and lessons learned from the hackathon.
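The 1 kb minimum-length filter mentioned in this abstract could look roughly like the following. It is a minimal sketch that assumes contigs arrive as plain FASTA text; the helper names and the toy records are hypothetical and do not reproduce the hackathon's actual pipeline.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_length=1000):
    """Keep only contigs whose sequence is at least min_length bases long."""
    return [(h, s) for h, s in records if len(s) >= min_length]


# Toy input: one 4-base contig and one 1200-base contig split over two lines.
fasta_lines = [">short_contig", "ACGT", ">long_contig", "A" * 600, "C" * 600]
kept = filter_by_length(read_fasta(fasta_lines))
print([header for header, _ in kept])
```

Streaming records through a generator keeps memory use flat, which matters when the input is tens of millions of contigs rather than two.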
Subjects
Cloud Computing/standards , Genome, Viral , Metagenome , Metagenomics/methods , Big Data , Genome, Human , Humans , Metagenomics/standards , Software
ABSTRACT
Rice, maize, sorghum, wheat, barley and the other major crop grasses from the family Poaceae (Gramineae) are mankind's most important source of calories and contribute tens of billions of dollars annually to the world economy (FAO 1999, http://www.fao.org; USDA 1997, http://www.usda.gov). Continued improvement of Poaceae crops is necessary in order to continue to feed an ever-growing world population. However, of the major crop grasses, only rice (Oryza sativa), with a compact genome of approximately 400 Mbp, has been sequenced and annotated. The Gramene database (http://www.gramene.org) takes advantage of the known genetic colinearity (synteny) between rice and the major crop plant genomes to provide maize, sorghum, millet, wheat, oat and barley researchers with the benefits of an annotated genome years before their own species are sequenced. Gramene is a one stop portal for finding curated literature, genetic and genomic datasets related to maps, markers, genes, genomes and quantitative trait loci. The addition of several new tools to Gramene has greatly facilitated the potential for comparative analysis among the grasses and contributes to our understanding of the anatomy, development, environmental responses and the factors influencing agronomic performance of cereal crops. Since the last publication on Gramene database by D. H. Ware, P. Jaiswal, J. Ni, I. V. Yap, X. Pan, K. Y. Clark, L. Teytelman, S. C. Schmidt, W. Zhao, K. Chang et al. [(2002), Plant Physiol., 130, 1606-1613], the database has undergone extensive changes that are described in this publication.
Subjects
Chromosome Mapping , Databases, Genetic , Edible Grain/genetics , Genome, Plant , Arabidopsis/genetics , Genes, Plant , Genetic Markers , Genomics , Internet , Oryza/genetics , Plant Proteins/genetics , Quantitative Trait Loci , User-Computer Interface , Vocabulary, Controlled , Zea mays/genetics
ABSTRACT
Microbes affect nutrient and energy transformations throughout the world's ecosystems, yet they do so under viral constraints. In complex communities, viral metagenome (virome) sequencing is transforming our ability to quantify viral diversity and impacts. Although some bottlenecks, for example, few reference genomes and nonquantitative viromics, have been overcome, the void of centralized data sets and specialized tools now prevents viromics from being broadly applied to answer fundamental ecological questions. Here we present iVirus, a community resource that leverages the CyVerse cyberinfrastructure to provide access to viromic tools and data sets. The iVirus Data Commons contains both raw and processed data from 1866 samples and 73 projects derived from global ocean expeditions, as well as existing and legacy public repositories. Through the CyVerse Discovery Environment, users can interrogate these data sets using existing analytical tools (software applications known as 'Apps') for assembly, open reading frame prediction and annotation, as well as several new Apps specifically developed for analyzing viromes. Because Apps are web based and powered by CyVerse supercomputing resources, they enable scalable analyses for a broad user base. Finally, a use-case scenario documents how to apply these advances toward new data. This growing iVirus resource should help researchers utilize viromics as yet another tool to elucidate viral roles in nature.
Subjects
Databases, Genetic , Viruses/isolation & purification , Environmental Microbiology , Internet , Metagenome , Open Reading Frames , Software , Viruses/classification , Viruses/genetics
ABSTRACT
Bacteriophages play an important role in host-driven biological processes by controlling bacterial population size, horizontally transferring genes between hosts and expressing host-derived genes to alter host metabolism. Metagenomics provides the genetic basis for understanding the interplay between uncultured bacteria, their phage and the environment. In particular, viral metagenomes (viromes) are providing new insight into phage-encoded host genes (i.e. auxiliary metabolic genes; AMGs) that reprogram host metabolism during infection. Yet, despite deep sequencing efforts of viral communities, the majority of sequences have no match to known proteins. Reference-independent computational techniques, such as protein clustering, contig spectra and ecological profiling are overcoming these barriers to examine both the known and unknown components of viromes. As the field of viral metagenomics progresses, a critical assessment of tools is required as the majority of algorithms have been developed for analyzing bacteria. The aim of this paper is to offer an overview of current computational methodologies for virome analysis and to provide an example of reference-independent approaches using human skin viromes. Additionally, we present methods to carefully validate AMGs from host contamination. Despite computational challenges, these new methods offer novel insights into the diversity and functional roles of phages in diverse environments.
Subjects
Bacteriophages/genetics , Genome, Viral , Metagenomics , Bacteria/genetics , Bacteriophages/physiology , Computational Biology , DNA, Viral/genetics , Gene Transfer, Horizontal , Host-Pathogen Interactions/genetics , Humans , Metagenome , Skin/virology
ABSTRACT
Gramene is an integrated informatics resource for accessing, visualizing, and comparing plant genomes and biological pathways. Originally targeting grasses, Gramene has grown to host annotations for economically important and research model crops, including wheat, potato, tomato, banana, grape, poplar, and Chlamydomonas. Its strength derives from the application of a phylogenetic framework for genome comparison and the use of ontologies to integrate structural and functional annotation data. This chapter outlines system requirements for end users and database hosting, data types and basic navigation within Gramene, and provides examples of how to (1) view a phylogenetic tree for a family of transcription factors, (2) explore genetic variation in the orthologues of a gene with a known trait association, and (3) upload, visualize, and privately share end user data into a new genome browser track. Moreover, this is the first publication describing Gramene's new web interface, intended to provide a simplified portal to the most complete and up-to-date set of plant genome and pathway annotations.
Subjects
Computational Biology/methods , Plants/genetics , Plants/metabolism , Software , Genome, Plant , Metabolic Networks and Pathways , Signal Transduction , Web Browser
ABSTRACT
PREMISE OF THE STUDY: We report the de novo assembly and characterization of the transcriptomes of Brachypodium sylvaticum (slender false-brome) accessions from native populations of Spain and Greece, and an invasive population west of Corvallis, Oregon, USA. METHODS AND RESULTS: More than 350 million sequence reads from the mRNA libraries prepared from three B. sylvaticum genotypes were assembled into 120,091 (Corvallis), 104,950 (Spain), and 177,682 (Greece) transcript contigs. In comparison with the B. distachyon Bd21 reference genome and GenBank protein sequences, we estimate >90% exome coverage for B. sylvaticum. The transcripts were assigned Gene Ontology and InterPro annotations. Brachypodium sylvaticum sequence reads aligned against the Bd21 genome revealed 394,654 single-nucleotide polymorphisms (SNPs) and >20,000 simple sequence repeat (SSR) DNA sites. CONCLUSIONS: To our knowledge, this is the first report of transcriptome sequencing of an invasive plant species with a closely related sequenced reference genome. The sequences and identified SNP variant and SSR sites will provide tools for developing novel genetic markers for use in genotyping and characterization of invasive behavior of B. sylvaticum.
ABSTRACT
Gramene is a well-established resource for plant comparative genome analysis. Data are generated through automated and curated analyses and made available through web interfaces such as GrameneMart. The Gramene project was an early adopter of the BioMart software, which remains an integral and well-used component of the Gramene website. BioMart accessible data sets include plant gene annotations, plant variation catalogues, genetic markers, physical mapping entities, public DNA/mRNA sequences of various types and curated quantitative trait loci for various species. DATABASE URL: http://www.gramene.org/biomart/martview.