ABSTRACT
Extremely large datasets have become routine in biology. However, performing a computational analysis of a large dataset can be overwhelming, especially for novices. Here, we present a step-by-step guide to computing workflows with the biologist end-user in mind. Starting from a foundation of sound data management practices, we make specific recommendations on how to approach and perform computational analyses of large datasets, with a view to enabling sound, reproducible biological research.
Subjects
Biology, Computational Biology/methods, Computing Methodologies, Workflow, Animals, Biology/trends, Computational Biology/trends, Database Management Systems/trends, Datasets as Topic/trends, Guidelines as Topic, Humans, Reproducibility of Results, Terminology as Topic, Workforce
ABSTRACT
Computers are now essential in all branches of science, but most researchers are never taught the equivalent of basic lab skills for research computing. As a result, data can get lost, analyses can take much longer than necessary, and researchers are limited in how effectively they can work with software and data. Computing workflows need to follow the same practices as lab projects and notebooks, with organized data, documented steps, and the project structured for reproducibility, but researchers new to computing often don't know where to start. This paper presents a set of good computing practices that every researcher can adopt, regardless of their current level of computational skill. These practices, which encompass data management, programming, collaborating with colleagues, organizing projects, tracking work, and writing manuscripts, are drawn from a wide variety of published sources, from our daily lives, and from our work with volunteer organizations that have delivered workshops to over 11,000 people since 2010.
Subjects
Computer Security/standards, Computing Methodologies, Data Accuracy, Research/standards, Science/standards, Software/standards, Documentation/standards, Guidelines as Topic
ABSTRACT
The scale and magnitude of complex and pressing environmental issues lend urgency to the need for integrative and reproducible analysis and synthesis, facilitated by data-intensive research approaches. However, the recent pace of technological change has been such that appropriate skills to accomplish data-intensive research are lacking among environmental scientists, who more than ever need greater access to training and mentorship in computational skills. Here, we provide a roadmap for raising data competencies of current and next-generation environmental researchers by describing the concepts and skills needed for effectively engaging with the heterogeneous, distributed, and rapidly growing volumes of available data. We articulate five key skills: (1) data management and processing, (2) analysis, (3) software skills for science, (4) visualization, and (5) communication methods for collaboration and dissemination. We provide an overview of the current suite of training initiatives available to environmental scientists and models for closing the skill-transfer gap.
ABSTRACT
Agriculture is being challenged to provide food, and increasingly fuel, for an expanding global population. Producing bioenergy crops on marginal lands (farmland suboptimal for food crops) could help meet energy goals while minimizing competition with food production. However, the ecological costs and benefits of growing bioenergy feedstocks (primarily annual grain crops) on marginal lands have been questioned. Here we show that perennial bioenergy crops provide an alternative to annual grains that increases biodiversity of multiple taxa and sustains a variety of ecosystem functions, promoting the creation of multifunctional agricultural landscapes. We found that switchgrass and prairie plantings harbored significantly greater plant, methanotrophic bacteria, arthropod, and bird diversity than maize. Although biomass production was greater in maize, all other ecosystem services, including methane consumption, pest suppression, pollination, and conservation of grassland birds, were higher in perennial grasslands. Moreover, we found that the linkage between biodiversity and ecosystem services is dependent not only on the choice of bioenergy crop but also on its location relative to other habitats, with local landscape context as important as crop choice in determining provision of some services. Our study suggests that bioenergy policy that supports coordinated land use can diversify agricultural landscapes and sustain multiple critical ecosystem services.
Subjects
Biodiversity, Conservation of Energy Resources, Ecosystem, Poaceae, Animals
ABSTRACT
BACKGROUND: Many tools exist for the analysis of bacterial RNA sequencing (RNA-seq) transcriptional profiling experiments to identify differentially expressed genes between experimental conditions. Generally, the workflow includes quality control of reads, mapping to a reference, counting transcript abundance, and statistical tests for differentially expressed genes. In spite of the numerous tools developed for each component of an RNA-seq analysis workflow, easy-to-use bacterially oriented workflow applications to combine multiple tools and automate the process are lacking. With many tools to choose from for each step, the task of identifying a specific tool, adapting the input/output options to the specific use-case, and integrating the tools into a coherent analysis pipeline is not a trivial endeavor, particularly for microbiologists with limited bioinformatics experience. RESULTS: To make bacterial RNA-seq data analysis more accessible, we developed a Simple Program for Automated reference-based bacterial RNA-seq Transcriptome Analysis (SPARTA). SPARTA is a reference-based bacterial RNA-seq analysis workflow application for single-end Illumina reads. SPARTA is turnkey software that simplifies the process of analyzing RNA-seq data sets, making bacterial RNA-seq analysis a routine process that can be undertaken on a personal computer or in the classroom. The easy-to-install, complete workflow processes whole transcriptome shotgun sequencing data files by trimming reads and removing adapters, mapping reads to a reference, counting gene features, calculating differential gene expression, and, importantly, checking for potential batch effects within the data set. SPARTA outputs quality analysis reports, gene feature counts and differential gene expression tables and scatterplots. CONCLUSIONS: SPARTA provides an easy-to-use bacterial RNA-seq transcriptional profiling workflow to identify differentially expressed genes between experimental conditions. This software will enable microbiologists with limited bioinformatics experience to analyze their data and integrate next generation sequencing (NGS) technologies into the classroom. The SPARTA software and tutorial are available at sparta.readthedocs.org.
Subjects
Computational Biology/methods, Gene Expression Profiling/methods, High-Throughput Nucleotide Sequencing/methods, Bacterial RNA/genetics, RNA Sequence Analysis/methods, Software, Transcriptome/genetics, Automation, Reference Standards
ABSTRACT
Ballast water is one of the most important vectors for the transport of non-native species to new aquatic environments. Due to the development of new ballast water quality standards for viruses, this study aimed to determine the taxonomic diversity and composition of viral communities (viromes) in ballast and harbor waters using metagenomic approaches. Ballast waters from different sources within the North American Great Lakes and paired harbor waters were collected around the Port of Duluth-Superior. Bioinformatics analysis of over 550 million sequences showed that a majority of the viral sequences could not be assigned to any taxa associated with reference sequences, indicating the lack of knowledge on viruses in ballast and harbor waters. However, the assigned viruses were dominated by double-stranded DNA phages, and sequences associated with potentially emerging viral pathogens of fish and shrimp were detected with low amino acid similarity in both ballast and harbor waters. Annotation-independent comparisons showed that viromes were distinct among the Great Lakes, and the Great Lakes viromes were closely related to viromes of other cold natural freshwater systems but distant from viromes of marine and human designed/managed freshwater systems. These results represent the most detailed characterization to date of viruses in ballast water, demonstrating their diversity and the potential significance of the ship-mediated spread of viruses.
Subjects
Lakes/virology, Metagenomics/methods, Ships, Viruses/genetics, Animals, Bacteriophages/genetics, Bacteriophages/pathogenicity, Crustacea/virology, Environmental Monitoring/methods, Fishes/virology, Viral Genome, Great Lakes Region, Viruses/classification, Viruses/pathogenicity, Water
ABSTRACT
Science, technology, engineering, mathematics, and medicine (STEMM) fields change rapidly and are increasingly interdisciplinary. Commonly, STEMM practitioners use short-format training (SFT) such as workshops and short courses for upskilling and reskilling, but unaddressed challenges limit SFT's effectiveness and inclusiveness. Education researchers, students in SFT courses, and organizations have called for research and strategies that can strengthen SFT in terms of effectiveness, inclusiveness, and accessibility across multiple dimensions. This paper describes the project that resulted in a consensus set of 14 actionable recommendations to systematically strengthen SFT. A diverse international group of 30 experts in education, accessibility, and life sciences came together from 10 countries to develop recommendations that can help strengthen SFT globally. Participants, including representation from some of the largest life science training programs globally, assembled findings in the educational sciences and encompassed the experiences of several of the largest life science SFT programs. The 14 recommendations were derived through a Delphi method, where consensus was achieved in real time as the group completed a series of meetings and tasks designed to elicit specific recommendations. Recommendations cover the breadth of SFT contexts and stakeholder groups and include actions for instructors (e.g., make equity and inclusion an ethical obligation), programs (e.g., centralize infrastructure for assessment and evaluation), as well as organizations and funders (e.g., professionalize training SFT instructors; deploy SFT to counter inequity). Recommendations are aligned with a purpose-built framework, "The Bicycle Principles," that prioritizes evidence-based teaching, inclusiveness, and equity, as well as the ability to scale, share, and sustain SFT. We also describe how the Bicycle Principles and recommendations are consistent with educational change theories and can overcome systemic barriers to delivering consistently effective, inclusive, and career-spanning SFT.
Subjects
Students, Technology, Humans, Consensus, Engineering
ABSTRACT
Here, we report our educational approach and learner evaluations of the first 5 years of the Explorations in Data Analysis for Metagenomic Advances in Microbial Ecology (EDAMAME) workshop, held annually at Michigan State University's Kellogg Biological Station from 2014 to 2018. We hope this information will be useful for others who want to organize computing-intensive workshops and will encourage quantitative skill development among microbiologists. IMPORTANCE: High-throughput sequencing and related statistical and bioinformatic analyses have become routine in microbiology in the past decade, but there are few formal training opportunities to develop these skills. A weeklong workshop can offer sufficient time for novices to be introduced to best computing practices and common workflows in sequence analysis. We report our experiences in executing such a workshop targeted to professional learners (graduate students, postdoctoral scientists, faculty, and research staff).
ABSTRACT
This article describes the motivation, design, and progress of the Journal of Open Source Software (JOSS). JOSS is a free and open-access journal that publishes articles describing research software. It has the dual goals of improving the quality of the software submitted and providing a mechanism for research software developers to receive credit. While designed to work within the current merit system of science, JOSS addresses the dearth of rewards for key contributions to science made in the form of software. JOSS publishes articles that encapsulate scholarship contained in the software itself, and its rigorous peer review targets the software components: functionality, documentation, tests, continuous integration, and the license. A JOSS article contains an abstract describing the purpose and functionality of the software, references, and a link to the software archive. The article is the entry point of a JOSS submission, which encompasses the full set of software artifacts. Submission and review proceed in the open, on GitHub. Editors, reviewers, and authors work collaboratively and openly. Unlike other journals, JOSS does not reject articles requiring major revision; while not yet accepted, articles remain visible and under review until the authors make adequate changes (or withdraw, if unable to meet requirements). Once an article is accepted, JOSS gives it a digital object identifier (DOI), deposits its metadata in Crossref, and the article can begin collecting citations on indexers like Google Scholar and other services. Authors retain copyright of their JOSS article, releasing it under a Creative Commons Attribution 4.0 International License. In its first year, starting in May 2016, JOSS published 111 articles, with more than 40 additional articles under review. JOSS is a sponsored project of the nonprofit organization NumFOCUS and is an affiliate of the Open Source Initiative (OSI).
ABSTRACT
In biology, a missing link connecting data generation and data-driven discovery is the training that prepares researchers to effectively manage and analyze data. National and international cyberinfrastructure along with evolving private sector resources place biologists and students within reach of the tools needed for data-intensive biology, but training is still required to make effective use of them. In this concept paper, we review a number of opportunities and challenges that can inform the creation of a national bioinformatics training infrastructure capable of servicing the large number of emerging and existing life scientists. While college curricula are slower to adapt, grassroots startup-spirited organizations, such as Software and Data Carpentry, have made impressive inroads in training on the best practices of software use, development, and data analysis. Given the transformative potential of biology and medicine as full-fledged data sciences, more support is needed to organize, amplify, and assess these efforts and their impacts.
Subjects
Computational Biology/education, Intersectoral Collaboration, Competency-Based Education/trends, Computational Biology/instrumentation, Computational Biology/trends, Data Mining/methods, Data Mining/trends, Database Management Systems/organization & administration, Database Management Systems/trends, Humans, Peer Influence, Terminology as Topic, United States, Workforce
ABSTRACT
BACKGROUND: The intestinal microbiome represents a complex network of microbes that are important for human health and preventing pathogen invasion. Studies that examine differences in intestinal microbial communities across individuals with and without enteric infections are useful for identifying microbes that support or impede intestinal health. RESULTS: 16S rRNA gene sequencing was conducted on stool DNA from patients with enteric infections (n = 200) and 75 healthy family members to identify differences in intestinal community composition. Stools from 13 patients were also examined post-infection to better understand how intestinal communities recover. Patient communities had lower species richness, evenness, and diversity than uninfected communities, while principal coordinate analysis demonstrated close clustering of uninfected communities, but not the patient communities, irrespective of age, gender, and race. Differences in community composition between patients and family members were mostly due to variation in the abundance of phyla Proteobacteria, Bacteroidetes, and Firmicutes. Patient communities had significantly more Proteobacteria representing genus Escherichia relative to uninfected communities, which were dominated by Bacteroides. Intestinal communities from patients with bloody diarrhea clustered together in the neighbor-joining phylogeny, while communities from 13 patients post-infection had a significant increase in Bacteroidetes and Firmicutes and clustered together with uninfected communities. CONCLUSIONS: These data demonstrate that the intestinal communities in patients with enteric bacterial infections are altered in similar ways. Furthermore, preventing an increase in Escherichia abundance may be an important consideration for future prevention strategies.
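The richness, evenness, and diversity comparison reported above can be illustrated with a minimal sketch. The abstract does not say which indices were used, so the Shannon index and Pielou's evenness are assumptions here, and the function name `alpha_diversity` and the example counts are hypothetical.

```python
import math


def alpha_diversity(counts):
    """Return (richness, shannon, evenness) for a list of per-taxon read counts.

    Assumption: diversity is the Shannon index (natural log) and evenness is
    Pielou's J = H / ln(S); the abstract does not name the indices it used.
    """
    observed = [c for c in counts if c > 0]
    richness = len(observed)  # number of taxa actually detected
    total = sum(observed)
    if richness == 0:
        return 0, 0.0, 0.0
    # Shannon index: -sum(p * ln p) over relative abundances p
    shannon = -sum((c / total) * math.log(c / total) for c in observed)
    # Pielou's evenness is undefined for a single taxon; report 0.0 in that case
    evenness = shannon / math.log(richness) if richness > 1 else 0.0
    return richness, shannon, evenness


# Hypothetical counts: an even community versus one dominated by a single taxon
even_community = [25, 25, 25, 25]
dominated_community = [97, 1, 1, 1, 0]
print(alpha_diversity(even_community))
print(alpha_diversity(dominated_community))
```

In this toy comparison the dominated community has lower richness, lower Shannon diversity, and lower evenness than the even one, which is the direction of difference the abstract reports for patient versus uninfected communities.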
Subjects
Enteritis/microbiology, Enteritis/rehabilitation, Gastrointestinal Microbiome, Adolescent, Adult, Aged, Bacteria/classification, Bacteria/genetics, Biodiversity, Case-Control Studies, Child, Preschool Child, Cluster Analysis, Enteritis/epidemiology, Female, Humans, Infant, Newborn Infant, Male, Metagenome, Michigan/epidemiology, Middle Aged, Public Health Surveillance, 16S Ribosomal RNA/genetics, Time Factors, Young Adult
ABSTRACT
Agriculture has marked impacts on the production of carbon dioxide (CO2) and consumption of methane (CH4) by microbial communities in upland soils, Earth's largest biological sink for atmospheric CH4. To determine whether the diversity of microbes that catalyze the flux of these greenhouse gases is related to the magnitude and stability of these ecosystem-level processes, we conducted molecular surveys of CH4-oxidizing bacteria (methanotrophs) and total bacterial diversity across a range of land uses and measured the in situ flux of CH4 and CO2 at a site in the upper United States Midwest. Conversion of native lands to row-crop agriculture led to a sevenfold reduction in CH4 consumption and a proportionate decrease in methanotroph diversity. Sites with the greatest stability in CH4 consumption harbored the most methanotroph diversity. In fields abandoned from agriculture, the rate of CH4 consumption increased with time along with the diversity of methanotrophs. Conversely, estimates of total bacterial diversity in soil were not related to the rate or stability of CO2 emission. These combined results are consistent with the expectation that microbial diversity is a better predictor of the magnitude and stability of processes catalyzed by organisms with highly specialized metabolisms, like CH4 oxidation, as compared with processes driven by widely distributed metabolic processes, like CO2 production in heterotrophs. The data also suggest that managing lands to conserve or restore methanotroph diversity could mitigate the atmospheric concentrations of this potent greenhouse gas.
Subjects
Agriculture, Bacteria/metabolism, Carbon Dioxide/metabolism, Methane/metabolism, Soil Microbiology, Biodiversity, Ecosystem, Gases/metabolism, Soil Pollutants/metabolism, United States
ABSTRACT
An intrinsic artifact of 454-based pyrosequencing leads to artificial overrepresentation of >10% of the original DNA sequencing templates. This artificial amplification of sequences is unbiased with regard to position on the pyrosequencing plate or sequence identity, and it occurs in all currently available 454 technologies. The amplified sequences start at the same position and are identical (duplicates), or vary in length, or contain a sequencing discrepancy. If the abundance of any sequence in a data set is going to be enumerated, either for comparative community analysis, transcriptional analysis or other applications, it is important to remove these artificial replicates before analysis. A web-based tool that incorporates the clustering algorithm cd-hit was developed to identify and remove artificially replicated sequences in 454-based pyrosequencing data sets. This tool cannot be used for data sets that have an initial amplification step before the standard pyrosequencing procedure, because artificial replicates cannot be distinguished from expected replication due to polymerase chain reaction (PCR) amplification, e.g., in sequencing of amplified gene "tags." This protocol provides details on how to use the replicate filter and obtain a file of unique sequences for use in metagenomic or transcriptomic analyses.
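The replicate definition above (reads that start at the same position and are identical, differ in length, or contain a sequencing discrepancy) can be sketched as a simple filter. This is only an illustration under stated assumptions: it is not the cd-hit algorithm or the web tool's implementation, and the function name `remove_replicates` and the parameters `prefix_len` and `max_mismatch_frac` are hypothetical.

```python
def remove_replicates(reads, prefix_len=3, max_mismatch_frac=0.1):
    """Keep one representative per group of reads that look like artificial replicates.

    Assumption (not from the protocol): two reads are replicates when they share
    the same first `prefix_len` bases and differ at no more than
    `max_mismatch_frac` of the positions of the shorter read.
    """
    # Process longest reads first so each representative is the longest in its group.
    ordered = sorted(reads, key=len, reverse=True)
    representatives = []
    for read in ordered:
        is_replicate = False
        for rep in representatives:
            if rep[:prefix_len] != read[:prefix_len]:
                continue  # different start, so not a replicate of this representative
            # zip stops at the shorter read, so length variation alone is tolerated
            mismatches = sum(1 for a, b in zip(rep, read) if a != b)
            if mismatches <= max_mismatch_frac * len(read):
                is_replicate = True
                break
        if not is_replicate:
            representatives.append(read)
    return representatives


# Hypothetical reads: an exact duplicate, a one-base discrepancy, a shorter copy,
# and one unrelated sequence
reads = [
    "ACGTACGTACGT",
    "ACGTACGTACGT",
    "ACGTACGTACGA",
    "ACGTACGT",
    "TTGACCAGGTCA",
]
unique_reads = remove_replicates(reads)
print(unique_reads)
```

The all-against-all comparison is quadratic in the number of reads, so this sketch only suits small examples; a clustering tool such as cd-hit is the practical choice for real data sets.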
Subjects
Computational Biology/methods, DNA Sequence Analysis/methods, Software, Statistics as Topic/methods, Cluster Analysis, Diagnostic Errors, Internet
ABSTRACT
Metagenomics is providing an unprecedented view of the taxonomic diversity, metabolic potential and ecological role of microbial communities in biomes as diverse as the mammalian gastrointestinal tract, the marine water column and soils. However, we have found a systematic error in metagenomes generated by 454-based pyrosequencing that leads to an overestimation of gene and taxon abundance; between 11% and 35% of sequences in a typical metagenome are artificial replicates. Here we document the error in several published and original datasets and offer a web-based solution (http://microbiomes.msu.edu/replicates) for identifying and removing these artifacts.
Subjects
Genetic Databases/standards, Metagenome, Metagenomics/standards, Soil Microbiology, Base Sequence, Molecular Sequence Data, Sequence Alignment, DNA Sequence Analysis/standards
ABSTRACT
Textpresso is a text-mining system for scientific literature. Its two major features are access to the full text of research papers and the development and use of categories of biological concepts as well as categories that describe or relate objects. A search engine enables the user to search for one or a combination of these categories and/or keywords within an entire literature. Here we describe Textpresso for Neuroscience, part of the core Neuroscience Information Framework (NIF). The Textpresso site currently consists of 67,500 full text papers and 131,300 abstracts. We show that using categories in literature can make a pure keyword query more refined and meaningful. We also show how semantic queries can be formulated with categories only. We explain the build and content of the database and describe the main features of the web pages and the advanced search options. We also give detailed illustrations of the web service developed to provide programmatic access to Textpresso. This web service is used by the NIF interface to access Textpresso. The standalone website of Textpresso for Neuroscience can be accessed at http://www.textpresso.org/neuroscience/.
Subjects
Computational Biology/methods, Databases as Topic, Neurosciences/methods, Periodicals as Topic, Access to Information, Animals, Computational Biology/organization & administration, Computational Biology/trends, Databases as Topic/organization & administration, Databases as Topic/trends, Humans, Information Storage and Retrieval/methods, Information Storage and Retrieval/trends, Internet/organization & administration, Internet/trends, Neurosciences/organization & administration, Neurosciences/trends, Periodicals as Topic/trends, Publishing/trends, Semantics, Software
ABSTRACT
It is thought that bacteria excrete redox-active pigments as antibiotics to inhibit competitors. In Pseudomonas aeruginosa, the endogenous antibiotic pyocyanin activates SoxR, a transcription factor conserved in Proteo- and Actinobacteria. In Escherichia coli, SoxR regulates the superoxide stress response. Bioinformatic analysis coupled with gene expression studies in P. aeruginosa and Streptomyces coelicolor revealed that the majority of SoxR regulons in bacteria lack the genes required for stress responses, despite the fact that many of these organisms still produce redox-active small molecules, which indicates that redox-active pigments play a role independent of oxidative stress. These compounds had profound effects on the structural organization of colony biofilms in both P. aeruginosa and S. coelicolor, which shows that "secondary metabolites" play important conserved roles in gene expression and development.
Subjects
Anti-Bacterial Agents/metabolism, Bacteria/genetics, Bacterial Proteins/metabolism, Bacterial Gene Expression Regulation, Biological Pigments/metabolism, Pseudomonas aeruginosa/physiology, Streptomyces coelicolor/physiology, Transcription Factors/metabolism, Bacteria/metabolism, Bacterial Proteins/genetics, Biofilms/growth & development, Computational Biology, Escherichia coli/genetics, Escherichia coli/metabolism, Escherichia coli Proteins/genetics, Escherichia coli Proteins/metabolism, Oxidation-Reduction, Oxidative Stress, Phenazines/metabolism, Phenotype, Pseudomonas aeruginosa/genetics, Pseudomonas aeruginosa/growth & development, Regulon, Streptomyces coelicolor/genetics, Streptomyces coelicolor/metabolism, Trans-Activators/genetics, Trans-Activators/metabolism, Transcription Factors/genetics, Up-Regulation
ABSTRACT
Biofilms, or surface-attached microbial communities, are both ubiquitous and resilient in the environment. Although much is known about how biofilms form, develop, and detach, very little is understood about how these events are related to metabolism and its dynamics. It is commonly thought that large subpopulations of cells within biofilms are not actively producing proteins or generating energy and are therefore dead. An alternative hypothesis is that within the growth-inactive domains of biofilms, significant populations of living cells persist and retain the capacity to dynamically regulate their metabolism. To test this, we employed unstable fluorescent reporters to measure growth activity and protein synthesis in vivo over the course of biofilm development and created a quantitative routine to compare domains of activity in independently grown biofilms. Here we report that Shewanella oneidensis biofilm structures reproducibly stratify with respect to growth activity and metabolism as a function of size. Within domains of growth-inactive cells, genes typically upregulated under anaerobic conditions are expressed well after growth has ceased. These findings reveal that, far from being dead, the majority of cells in mature S. oneidensis biofilms have actively turned-on metabolic programs appropriate to their local microenvironment and developmental stage.