ABSTRACT
After an initial evolution in a reducing environment, life was successively challenged by reactive oxygen species (ROS), especially during the great oxidation event (GOE) that followed the development of photosynthesis. ROS are therefore deeply intertwined with the physiological, morphological and transcriptional responses of most present-day organisms. Copper-zinc superoxide dismutases (CuZnSODs) evolved during the GOE and are present in charophytes and extant land plants, but nearly absent from chlorophytes. The chemical inhibitor of CuZnSOD, lung cancer screen 1 (LCS-1), could greatly facilitate the study of SODs in diverse plants. Here, we determined the impact of chemical inhibition of plant CuZnSOD activity on plant growth, transcription and metabolism. We followed a comparative approach using different plant species, including Marchantia polymorpha and Physcomitrium patens, representing bryophytes, the sister lineage to vascular plants, and Arabidopsis thaliana. We show that LCS-1 causes oxidative stress in plants and that the inhibition of CuZnSODs provoked a similar core response that mainly impacted glutathione homoeostasis in all plant species analysed. That said, Physcomitrium and Arabidopsis, which contain multiple CuZnSOD isoforms, showed a more complex and exacerbated response. In addition, an untargeted metabolomics approach revealed a specific metabolic signature for each plant species. Our comparative analysis exposes a conserved core response to LCS-1 at the physiological and transcriptional level, while the metabolic response varies widely. These differences correlate with the number and localization of the CuZnSOD isoforms present in each species.
ABSTRACT
BACKGROUND: In clinical research, data have to be accessible and reproducible, but the generated data are becoming larger and their analysis more complex. Here we propose a platform for Findable, Accessible, Interoperable, and Reusable (FAIR) data access and for creating reproducible findings. Standardized access to a major genomic repository, the European Genome-Phenome Archive (EGA), has been achieved with API services like PyEGA3. We aim to provide a FAIR data analysis service in Galaxy by retrieving genomic data from the EGA and to provide a generalized "omics" platform for FAIR data analysis. RESULTS: To demonstrate this, we implemented an end-to-end Galaxy workflow to replicate the findings from an RD-Connect synthetic dataset, Beyond the 1 Million Genomes (synB1MG), available from the EGA. We developed the PyEGA3 connector within Galaxy to easily download multiple datasets from the EGA. We added the gene.iobio tool, a diagnostic environment for precision genomics, to Galaxy and demonstrate that it provides a more dynamic and interpretable view for trio analysis results. We developed a Galaxy trio analysis workflow to determine the pathogenic variants from the synB1MG trios using the GEMINI and gene.iobio tools. The complete workflow is available at WorkflowHub, and an associated tutorial was created in the Galaxy Training Network, which helps researchers unfamiliar with Galaxy to run the workflow. CONCLUSIONS: We showed the feasibility of reusing data from the EGA in Galaxy via PyEGA3 and validated the workflow by rediscovering spiked-in variants in synthetic data. Finally, we improved existing tools in Galaxy and created a workflow for trio analysis to demonstrate the value of FAIR genomics analysis in Galaxy.
Subjects
Genomics , Software , Genomics/methods , Genome , Workflow
ABSTRACT
There is an ongoing explosion of scientific datasets being generated, brought on by recent technological advances in many areas of the natural sciences. As a result, the life sciences have become increasingly computational in nature, and bioinformatics has taken on a central role in research studies. However, basic computational skills, data analysis, and stewardship are still rarely taught in life science educational programs, resulting in a skills gap in many of the researchers tasked with analysing these big datasets. In order to address this skills gap and empower researchers to perform their own data analyses, the Galaxy Training Network (GTN) has previously developed the Galaxy Training Platform (https://training.galaxyproject.org), an open access, community-driven framework for the collection of FAIR (Findable, Accessible, Interoperable, Reusable) training materials for data analysis utilizing the user-friendly Galaxy framework as its primary data analysis platform. Since its inception, this training platform has thrived, with the number of tutorials and contributors growing rapidly, and the range of topics extending beyond life sciences to include topics such as climatology, cheminformatics, and machine learning. While initially aimed at supporting researchers directly, the GTN framework has proven to be an invaluable resource for educators as well. We have focused our efforts in recent years on adding increased support for this growing community of instructors. New features have been added to facilitate the use of the materials in a classroom setting, the contribution flow for new materials has been simplified, and a set of train-the-trainer lessons has been added. Here, we present the latest developments in the GTN project, aimed at facilitating the use of the Galaxy Training materials by educators, and its usage in different learning environments.
Subjects
Computational Biology , Software , Humans , Computational Biology/methods , Data Analysis , Research Personnel
ABSTRACT
BACKGROUND: The ability to accurately distinguish bacterial from viral infection would help clinicians better target antimicrobial therapy during suspected lower respiratory tract infections (LRTI). Although technological developments make it feasible to rapidly generate patient-specific microbiota profiles, evidence is required to show the clinical value of using microbiota data for infection diagnosis. In this study, we investigated whether adding nasal cavity microbiota profiles to readily available clinical information could improve machine learning classifiers to distinguish bacterial from viral infection in patients with LRTI. RESULTS: Various multi-parametric Random Forests classifiers were evaluated on the clinical and microbiota data of 293 LRTI patients for their prediction accuracies to differentiate bacterial from viral infection. The most predictive variable was C-reactive protein (CRP). We observed a marginal prediction improvement when the 7 most prevalent nasal microbiota genera were added to the CRP model. In contrast, adding three clinical variables (absolute neutrophil count, consolidation on X-ray, and age group) to the CRP model significantly improved the prediction. The best model correctly predicted 85% of the 'bacterial' patients and 82% of the 'viral' patients using 13 clinical variables and 3 nasal cavity microbiota genera (Staphylococcus, Moraxella, and Streptococcus). CONCLUSIONS: We developed high-accuracy multi-parametric machine learning classifiers to differentiate bacterial from viral infections in LRTI patients of various ages. We demonstrated the predictive value of four easy-to-collect clinical variables which facilitate personalized and accurate clinical decision-making. We observed that nasal cavity microbiota correlate with the clinical variables and thus may not add significant value to diagnostic algorithms that aim to differentiate bacterial from viral infections.
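As an illustration of what "correctly predicted 85% of the 'bacterial' patients and 82% of the 'viral' patients" measures, the following minimal Python sketch computes the fraction of correctly predicted cases per class. The labels are invented toy values and not study data, and the function name `per_class_recall` is hypothetical; the study's own Random Forests pipeline is not reproduced here.

```python
from collections import Counter


def per_class_recall(true_labels, predicted_labels):
    """Fraction of cases in each true class that received the correct label."""
    totals = Counter(true_labels)
    correct = Counter(
        truth for truth, guess in zip(true_labels, predicted_labels) if truth == guess
    )
    return {label: correct[label] / totals[label] for label in totals}


# Toy labels (hypothetical, for illustration only).
truth = ["bacterial"] * 4 + ["viral"] * 4
predicted = [
    "bacterial", "bacterial", "bacterial", "viral",   # one bacterial case missed
    "viral", "viral", "bacterial", "viral",           # one viral case missed
]

recall = per_class_recall(truth, predicted)
for label, value in recall.items():
    print(f"{label}: {value:.0%} correctly predicted")
```

Reporting the two classes separately, as the abstract does, avoids hiding a weak class behind a single overall accuracy value.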
Subjects
Bacterial Infections , Microbiota , Respiratory Tract Infections , Virus Diseases , Bacterial Infections/drug therapy , C-Reactive Protein/metabolism , Humans , Nose/microbiology , Respiratory Tract Infections/drug therapy , Virus Diseases/diagnosis
ABSTRACT
The genomes of thousands of individuals are profiled within Dutch healthcare and research each year. However, this valuable genomic data, associated clinical data and consent are captured in different ways and stored across many systems and organizations. This makes it difficult to discover rare disease patients, reuse data for personalized medicine and establish research cohorts based on specific parameters. FAIR Genomes aims to enable NGS data reuse by developing metadata standards for the data descriptions needed to FAIRify genomic data while also addressing ELSI issues. We developed a semantic schema of essential data elements harmonized with international FAIR initiatives. The FAIR Genomes schema v1.1 contains 110 elements in 9 modules. It reuses common ontologies such as NCIT, DUO and EDAM, only introducing new terms when necessary. The schema is represented by a YAML file that can be transformed into templates for data entry software (EDC) and programmatic interfaces (JSON, RDF) to ease genomic data sharing in research and healthcare. The schema, documentation and MOLGENIS reference implementation are available at https://fairgenomes.org.
Subjects
High-Throughput Nucleotide Sequencing , Metadata , Delivery of Health Care , Genomics , Humans , Software
ABSTRACT
BACKGROUND: Hands-on training, whether in bioinformatics or other domains, often requires significant technical resources and knowledge to set up and run. Instructors must have access to powerful compute infrastructure that can support resource-intensive jobs running efficiently. Often this is achieved using a private server where there is no contention for the queue. However, this places a significant prerequisite knowledge or labor barrier on instructors, who must spend time coordinating deployment and management of compute resources. Furthermore, with the increase of virtual and hybrid teaching, where learners are located in separate physical locations, it is difficult to track student progress as efficiently as during in-person courses. FINDINGS: Together with the Galaxy community, we have created Training Infrastructure-as-a-Service (TIaaS), originally developed by Galaxy Europe and the Gallantries project and aimed at providing user-friendly training infrastructure to the global training community. TIaaS provides dedicated training resources for Galaxy-based courses and events. Event organizers register their course, after which trainees are transparently placed in a private queue on the compute infrastructure, which ensures jobs complete quickly, even when the main queue is experiencing high wait times. A built-in dashboard allows instructors to monitor student progress. CONCLUSIONS: TIaaS provides a significant improvement for instructors and learners, as well as infrastructure administrators. The instructor dashboard makes remote events not only possible but also easy. Students experience continuity of learning, as all training happens on Galaxy, which they can continue to use after the event. In the past 60 months, 504 training events with over 24,000 learners have used this infrastructure for Galaxy training.
Subjects
Learning , Software , Humans , Europe , Computational Biology
ABSTRACT
The Earth Microbiome Project (EMP) aided in understanding the role of microbial communities and the influence of collective genetic material (the 'microbiome') and microbial diversity patterns across the habitats of our planet. With the evolution of new sequencing technologies, researchers can now investigate the microbiome and map its influence on the environment and human health. Advances in bioinformatics methods for next-generation sequencing (NGS) data analysis have helped researchers to gain an in-depth knowledge about the taxonomic and genetic composition of microbial communities. Metagenomic-based methods have been the most commonly used approaches for microbiome analysis; however, they primarily extract information about the taxonomic composition and genetic potential of the microbiome under study, and lack quantification of the gene products (RNA and proteins). On the other hand, metatranscriptomics, the study of a microbial community's RNA expression, can reveal the dynamic gene expression of individual microbial populations and the community as a whole, ultimately providing information about the active pathways in the microbiome. In order to address the analysis of NGS data, the ASaiM analysis framework was previously developed and made available via the Galaxy platform. Although developed for both metagenomics and metatranscriptomics, the original publication demonstrated the use of ASaiM only for metagenomics, while thorough testing for metatranscriptomics data was lacking. In the current study, we have focused on validating and optimizing the tools within ASaiM for metatranscriptomics data. As a result, we deliver a robust workflow that will enable researchers to understand the dynamic functional response of the microbiome in a wide variety of metatranscriptomics studies.
This improved and optimized ASaiM-metatranscriptomics (ASaiM-MT) workflow is publicly available via the ASaiM framework, documented and supported with training material so that users can interrogate and characterize metatranscriptomic data, as part of larger meta-omic studies of microbiomes.
Subjects
Metagenomics , Microbiota , High-Throughput Nucleotide Sequencing , Humans , Metagenome , Microbiota/genetics , Workflow
ABSTRACT
BACKGROUND: Bacterial plasmids often carry antibiotic resistance genes and are a significant factor in the spread of antibiotic resistance. The ability to completely assemble plasmid sequences would facilitate the localization of antibiotic resistance genes, the identification of genes that promote plasmid transmission and the accurate tracking of plasmid mobility. However, the complete assembly of plasmid sequences using the currently most widely used sequencing platform (Illumina-based sequencing) is restricted due to the generation of short sequence lengths. The long-read Oxford Nanopore Technologies (ONT) sequencing platform overcomes this limitation. Still, the assembly of plasmid sequence data remains challenging due to software incompatibility with long-reads and the error rate generated using ONT sequencing. Bioinformatics pipelines have been developed for ONT-generated sequencing but require computational skills that frequently are beyond the abilities of scientific researchers. To overcome this challenge, the authors developed 'WeFaceNano', a user-friendly Web interFace for rapid assembly and analysis of plasmid DNA sequences generated using the ONT platform. WeFaceNano includes: a read statistics report; two assemblers (Miniasm and Flye); BLAST searching; the detection of antibiotic resistance- and replicon genes and several plasmid visualizations. A user-friendly interface displays the main features of WeFaceNano and gives access to the analysis tools. RESULTS: Publicly available ONT sequence data of 21 plasmids were used to validate WeFaceNano, with plasmid assemblages and anti-microbial resistance gene detection being concordant with the published results. Interestingly, the "Flye" assembler with "meta" settings generated the most complete plasmids. 
CONCLUSIONS: WeFaceNano is a user-friendly open-source software pipeline suitable for accurate plasmid assembly and the detection of anti-microbial resistance genes in (clinical) samples where multiple plasmids can be present.
Subjects
Bacteria/genetics , Molecular Sequence Annotation/methods , Plasmids/genetics , Software , Bacteria/classification , Bacteria/drug effects , Bacteria/isolation & purification , Bacterial Proteins/genetics , Computational Biology/instrumentation , Computational Biology/methods , Drug Resistance, Bacterial , High-Throughput Nucleotide Sequencing
ABSTRACT
The COVID-19 pandemic is shifting teaching to an online setting all over the world. The Galaxy framework facilitates the online learning process and makes it accessible by providing a library of high-quality community-curated training materials, enabling easy access to data and tools, and facilitating the sharing of achievements and progress between students and instructors. By combining Galaxy with robust communication channels, effective instruction can be designed inclusively, regardless of the students' environments.
Subjects
COVID-19/epidemiology , Computer-Assisted Instruction , Education, Distance/organization & administration , COVID-19/virology , Computational Biology , Humans , Information Dissemination , Pandemics , SARS-CoV-2/isolation & purification
ABSTRACT
BACKGROUND: Long-read sequencing can be applied to generate very long contigs and even completely assembled genomes at relatively low cost and with minimal sample preparation. As a result, long-read sequencing platforms are becoming more popular. In this respect, the Oxford Nanopore Technologies-based long-read sequencing "nanopore" platform is becoming a widely used tool with a broad range of applications and end-users. However, the need to explore and manipulate the complex data generated by long-read sequencing platforms necessitates accompanying specialized bioinformatics platforms and tools to process the long-read data correctly. Importantly, such tools should additionally help democratize bioinformatics analysis by enabling easy access and ease-of-use solutions for researchers. RESULTS: The Galaxy platform provides a user-friendly interface to computational command line-based tools, handles the software dependencies, and provides refined workflows. The users do not have to possess programming experience or extended computer skills. The interface enables researchers to perform powerful bioinformatics analysis, including the assembly and analysis of short- or long-read sequence data. The newly developed "NanoGalaxy" is a Galaxy-based toolkit for analysing long-read sequencing data, which is suitable for diverse applications, including de novo genome assembly from genomic, metagenomic, and plasmid sequence reads. CONCLUSIONS: A range of best-practice tools and workflows for long-read sequence genome assembly has been integrated into a NanoGalaxy platform to facilitate easy access and use of bioinformatics tools for researchers. NanoGalaxy is freely available at the European Galaxy server https://nanopore.usegalaxy.eu with supporting self-learning training material available at https://training.galaxyproject.org.
Subjects
Nanopore Sequencing , Nanopores , Data Analysis , High-Throughput Nucleotide Sequencing , Sequence Analysis, DNA , Software
ABSTRACT
Illumina and nanopore sequencing technologies are powerful tools that can be used to determine the bacterial composition of complex microbial communities. In this study, we compared nasal microbiota results at genus level using both Illumina and nanopore 16S rRNA gene sequencing. We also monitored the progression of nanopore sequencing in the accurate identification of species, using pure, single species cultures, and evaluated the performance of the nanopore EPI2ME 16S data analysis pipeline. Fifty-nine nasal swabs were sequenced using Illumina MiSeq and Oxford Nanopore 16S rRNA gene sequencing technologies. In addition, five pure cultures of relevant bacterial species were sequenced with the nanopore sequencing technology. The Illumina MiSeq sequence data were processed using bioinformatics modules present in the Mothur software package. Albacore and Guppy base calling, a workflow in nanopore EPI2ME (Oxford Nanopore Technologies-ONT, Oxford, UK) and an in-house developed bioinformatics script were used to analyze the nanopore data. At genus level, similar bacterial diversity profiles were found, and five main and established genera were identified by both platforms. However, probably due to mismatching of the nanopore sequence primers, the nanopore sequencing platform identified Corynebacterium in much lower abundance compared to Illumina sequencing. Further, when using default settings in the EPI2ME workflow, almost all sequence reads that seem to belong to the bacterial genus Dolosigranulum and a considerable part to the genus Haemophilus were only identified at family level. 
Nanopore sequencing of single species cultures demonstrated at least 88% accurate identification of the species at genus and species level for 4/5 strains tested, including improvements in accurate sequence read identification when the basecallers Guppy and Albacore, and when flowcell versions R9.4 (Oxford Nanopore Technologies-ONT, Oxford, UK) and R9.2 (Oxford Nanopore Technologies-ONT, Oxford, UK), were compared. In conclusion, the current study shows that the nanopore sequencing platform is comparable with the Illumina platform in detecting bacterial genera of the nasal microbiota, but the nanopore platform does have problems in detecting bacteria within the genus Corynebacterium. Although advances are being made, thorough validation of the nanopore platform is still recommended.
Subjects
Genes, rRNA/genetics , High-Throughput Nucleotide Sequencing/methods , Microbiota/genetics , Nanopore Sequencing/methods , Nasal Cavity/microbiology , RNA, Ribosomal, 16S/genetics , Adolescent , Adult , Aged , Child , Child, Preschool , Computational Biology/methods , DNA Primers/genetics , DNA, Bacterial/genetics , Female , Humans , Infant , Male , Middle Aged , Nanopores , Young Adult
ABSTRACT
BACKGROUND: Circos is a popular, highly flexible software package for the circular visualization of complex datasets. While especially popular in the field of genomic analysis, Circos enables interactive graphing of any analytical data, including alternative scientific domain data and non-scientific data. This high degree of flexibility also comes with a high degree of complexity, which may present an obstacle for researchers not trained in programming or the UNIX command line. The Galaxy platform provides a user-friendly browser-based graphical interface incorporating a broad range of "wrapped" command line tools to facilitate accessibility. FINDINGS: We have developed a Galaxy wrapper for Circos, thus combining the power of Circos with the accessibility and ease of use of the Galaxy platform. The combination substantially simplifies the specification and configuration of Circos plots for end users while retaining the power to produce publication-quality visualizations of complex multidimensional datasets. CONCLUSIONS: Galactic Circos enables the creation of publication-ready Circos plots using only a web browser, via the Galaxy platform. Users may download the full set of Circos configuration files of their plots for further manual customization. This version of Circos is available as an open-source installable application from the Galaxy ToolShed, with its use clarified in a training manual hosted by the Galaxy Training Network.
Subjects
Computational Biology/methods , Genomics/methods , Software , Computational Biology/standards , Genomics/standards , Workflow
ABSTRACT
BACKGROUND: The determination of microbial communities using the mothur tool suite (https://www.mothur.org) is well established. However, mothur requires bioinformatics-based proficiency in order to perform calculations via the command-line. Galaxy is a project dedicated to providing a user-friendly web interface for such command-line tools (https://galaxyproject.org/). RESULTS: We have integrated the full set of 125+ mothur tools into Galaxy as the Galaxy mothur Toolset (GmT) and provided a set of workflows to perform end-to-end 16S rRNA gene analyses and integrate with third-party visualization and reporting tools. We demonstrate the utility of GmT by analyzing the mothur MiSeq standard operating procedure (SOP) dataset (https://www.mothur.org/wiki/MiSeq_SOP). CONCLUSIONS: GmT is available from the Galaxy Tool Shed, and a workflow definition file and full Galaxy training manual for the mothur SOP have been created. A Docker image with a fully configured GmT Galaxy is also available.
Subjects
Computational Biology/methods , Microbiota/genetics , RNA, Ribosomal, 16S , Sequence Analysis, DNA/methods , Software
ABSTRACT
The primary problem with the explosion of biomedical datasets is not the data, not computational resources, and not the required storage space, but the general lack of trained and skilled researchers to manipulate and analyze these data. Eliminating this problem requires development of comprehensive educational resources. Here we present a community-driven framework that enables modern, interactive teaching of data analytics in life sciences and facilitates the development of training materials. The key feature of our system is that it is not a static but a continuously improved collection of tutorials. By coupling tutorials with a web-based analysis framework, biomedical researchers can learn by performing computation themselves through a web browser without the need to install software or search for example datasets. Our ultimate goal is to expand the breadth of training materials to include fundamental statistical and data science topics and to precipitate a complete re-engineering of undergraduate and graduate curricula in life sciences. This project is accessible at https://training.galaxyproject.org.
Subjects
Computational Biology/education , Computational Biology/methods , Research Personnel/education , Curriculum , Data Analysis , Education, Distance/methods , Education, Distance/trends , Humans , Software
ABSTRACT
Background: New generations of sequencing platforms coupled to numerous bioinformatics tools have led to rapid technological progress in metagenomics and metatranscriptomics to investigate complex microorganism communities. Nevertheless, a combination of different bioinformatic tools remains necessary to draw conclusions out of microbiota studies. Modular and user-friendly tools would greatly improve such studies. Findings: We therefore developed ASaiM, an open-source Galaxy-based framework dedicated to microbiota data analyses. ASaiM provides an extensive collection of tools to assemble, extract, explore, and visualize microbiota information from raw metataxonomic, metagenomic, or metatranscriptomic sequences. To guide the analyses, several customizable workflows are included and are supported by tutorials and Galaxy interactive tours, which guide users through the analyses step by step. ASaiM is implemented as a Galaxy Docker flavour. It is scalable to thousands of datasets but can also be used on a normal PC. The associated source code is available under the Apache 2 license at https://github.com/ASaiM/framework and documentation can be found online (http://asaim.readthedocs.io). Conclusions: Based on the Galaxy framework, ASaiM offers a sophisticated environment with a variety of tools, workflows, documentation, and training to scientists working on complex microorganism communities. It makes analysis and exploration of microbiota data easy, quick, transparent, reproducible, and shareable.
Subjects
Microbiota , Software , Statistics as Topic , Base Sequence , Metagenomics
ABSTRACT
Galaxy (homepage: https://galaxyproject.org, main public server: https://usegalaxy.org) is a web-based scientific analysis platform used by tens of thousands of scientists across the world to analyze large biomedical datasets such as those found in genomics, proteomics, metabolomics and imaging. Started in 2005, Galaxy continues to focus on three key challenges of data-driven biomedical science: making analyses accessible to all researchers, ensuring analyses are completely reproducible, and making it simple to communicate analyses so that they can be reused and extended. During the last two years, the Galaxy team and the open-source community around Galaxy have made substantial improvements to Galaxy's core framework, user interface, tools, and training materials. Framework and user interface improvements now enable Galaxy to be used for analyzing tens of thousands of datasets, and >5500 tools are now available from the Galaxy ToolShed. The Galaxy community has led an effort to create numerous high-quality tutorials focused on common types of genomic analyses. The Galaxy developer and user communities continue to grow and be integral to Galaxy's development. The number of Galaxy public servers, developers contributing to the Galaxy framework and its tools, and users of the main Galaxy server have all increased substantially.
Subjects
Genomics/statistics & numerical data , Metabolomics/statistics & numerical data , Molecular Imaging/statistics & numerical data , Proteomics/statistics & numerical data , User-Computer Interface , Datasets as Topic , Humans , Information Dissemination , International Cooperation , Internet , Reproducibility of Results
ABSTRACT
Microbiota profiling has the potential to greatly impact on routine clinical diagnostics by detecting DNA derived from live, fastidious, and dead bacterial cells present within clinical samples. Such results could potentially be used to benefit patients by influencing antibiotic prescribing practices or to generate new classical-based diagnostic methods, e.g., culture or PCR. However, technical flaws in 16S rRNA gene next-generation sequencing (NGS) protocols, together with the requirement for access to bioinformatics, currently hinder the introduction of microbiota analysis into clinical diagnostics. Here, we report on the development and evaluation of an "end-to-end" microbiota profiling platform (MYcrobiota), which combines our previously validated micelle PCR/NGS (micPCR/NGS) methodology with an easy-to-use, dedicated bioinformatics pipeline. The newly designed bioinformatics pipeline processes micPCR/NGS data automatically and summarizes the results in interactive, but simple web reports. In order to explore the utility of MYcrobiota in clinical diagnostics, 47 clinical samples (40 "damaged skin" samples and 7 synovial fluids) were investigated using routine bacterial culture as comparator. MYcrobiota confirmed the presence of bacterial DNA in 37/37 culture-positive samples and detected bacterial taxa in 2/10 culture-negative samples. Moreover, 36/38 potentially relevant aerobic bacterial taxa and 3/3 mixtures of anaerobic bacteria were identified using culture and MYcrobiota, with the sensitivity and specificity being 95%. Interestingly, the majority of the 448 bacterial taxa identified using MYcrobiota were not identified using culture, which could potentially have an impact on clinical decision-making. Taken together, the development of MYcrobiota is a promising step towards the introduction of microbiota analysis into clinical diagnostic laboratories.
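The agreement counts quoted in the abstract above can be turned into percentages with a few lines of Python. This is only a worked-arithmetic sketch: the counts are taken from the abstract, the labels and the helper name `percent` are made up for readability, and the abstract does not state exactly how the reported 95% sensitivity and specificity were calculated, so no attempt is made to reproduce that figure.

```python
# Counts as reported in the abstract: (numerator, denominator).
reported = {
    "culture-positive samples with bacterial DNA detected": (37, 37),
    "culture-negative samples with bacterial taxa detected": (2, 10),
    "relevant aerobic taxa identified by both methods": (36, 38),
    "anaerobic mixtures identified by both methods": (3, 3),
}


def percent(numerator: int, denominator: int) -> float:
    """Return numerator/denominator as a percentage, rounded to one decimal."""
    return round(100 * numerator / denominator, 1)


summary = {label: percent(n, d) for label, (n, d) in reported.items()}
for label, value in summary.items():
    print(f"{label}: {value}%")
```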
Subjects
Bacteria/genetics , Clinical Laboratory Techniques/methods , Computational Biology/methods , DNA, Bacterial/genetics , Microbiota/genetics , Bacteria/isolation & purification , Clinical Laboratory Techniques/instrumentation , High-Throughput Nucleotide Sequencing/methods , Humans , Molecular Diagnostic Techniques/instrumentation , Molecular Diagnostic Techniques/methods , Phylogeny , Polymerase Chain Reaction/methods , RNA, Ribosomal, 16S/genetics , Retrospective Studies , Sequence Analysis, DNA/methods , Ulcer/microbiology , Wounds and Injuries/microbiology
ABSTRACT
The availability of high-throughput molecular profiling techniques has provided more accurate and informative data for regular clinical studies. Nevertheless, complex computational workflows are required to interpret these data. Over the past years, the data volume has been growing explosively, requiring robust human data management to organise and integrate the data efficiently. For this reason, we set up an ELIXIR implementation study, together with the Translational research IT (TraIT) programme, to design a data ecosystem that is able to link raw and interpreted data. In this project, the data from the TraIT Cell Line Use Case (TraIT-CLUC) are used as a test case for this system. Within this ecosystem, we use the European Genome-phenome Archive (EGA) to store raw molecular profiling data; tranSMART to collect interpreted molecular profiling data and clinical data for corresponding samples; and Galaxy to store, run and manage the computational workflows. We can integrate these data by linking their repositories systematically. To showcase our design, we have structured the TraIT-CLUC data, which contain a variety of molecular profiling data types, for storage in both tranSMART and EGA. The metadata provided allows referencing between tranSMART and EGA, fulfilling the cycle of data submission and discovery; we have also designed a data flow from EGA to Galaxy, enabling reanalysis of the raw data in Galaxy. In this way, users can select patient cohorts in tranSMART, trace them back to the raw data and perform (re)analysis in Galaxy. Our conclusion is that the majority of metadata does not necessarily need to be stored (redundantly) in both databases, but that instead FAIR persistent identifiers should be available for well-defined data ontology levels: study, data access committee, physical sample, data sample and raw data file. This approach will pave the way for the stable linkage and reuse of data.
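The linkage idea in the abstract above (select a cohort, then trace it back to raw data files through persistent identifiers) can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the class names, field names and identifier strings are all hypothetical, only three of the five identifier levels named in the abstract are modelled, and nothing here reflects the actual tranSMART or EGA data models.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSample:
    data_sample_id: str      # hypothetical identifier of the profiled sample in the archive
    physical_sample_id: str  # hypothetical identifier shared with the clinical database


@dataclass(frozen=True)
class RawDataFile:
    file_id: str             # hypothetical identifier of one archived raw file
    data_sample_id: str      # link back to the data sample it belongs to


def trace_raw_files(cohort, data_samples, raw_files):
    """Return the sorted raw-file IDs that belong to the physical samples in `cohort`."""
    wanted = {s.data_sample_id for s in data_samples if s.physical_sample_id in cohort}
    return sorted(f.file_id for f in raw_files if f.data_sample_id in wanted)


# Made-up identifiers, for illustration only.
data_samples = [DataSample("DS-1", "PS-A"), DataSample("DS-2", "PS-B")]
raw_files = [
    RawDataFile("F-1", "DS-1"),
    RawDataFile("F-2", "DS-1"),
    RawDataFile("F-3", "DS-2"),
]

selected = trace_raw_files({"PS-A"}, data_samples, raw_files)
print(selected)
```

Because each record carries only the identifier of the level above it, the descriptive metadata does not have to be duplicated in both systems, which is the point the abstract's conclusion makes.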
ABSTRACT
High-throughput molecular profiling techniques are routinely generating vast amounts of data for translational medicine studies. Secure, access-controlled systems are needed to manage, store, transfer and distribute these data because of their personally identifiable nature. The European Genome-phenome Archive (EGA) was created to facilitate access to, and management of, long-term archives of biomolecular data. Each data provider is responsible for ensuring that a Data Access Committee is in place to grant access to data stored in the EGA. Moreover, the transfer of data during upload and download is encrypted. ELIXIR, a European research infrastructure for life-science data, initiated a project (2016 Human Data Implementation Study) to understand and document the ELIXIR requirements for secure management of controlled-access data. As part of this project, a full ecosystem was designed to connect archived raw experimental molecular profiling data with interpreted data and the computational workflows, using the CTMM Translational Research IT (CTMM-TraIT) infrastructure (http://www.ctmm-trait.nl) as an example. Here we present the first outcome of this project: a framework that enables the secure download of EGA data to a Galaxy server. Galaxy provides an intuitive user interface for molecular biologists and bioinformaticians to run and design data analysis workflows. More specifically, we developed a tool, ega_download_streamer, that can download data securely from EGA into a Galaxy server, where the data can subsequently be processed further. This tool allows a user to run, within the browser, an entire analysis containing sensitive data from EGA, and to make this analysis available to other researchers in a reproducible manner, as shown with a proof-of-concept study. The tool ega_download_streamer is available in the Galaxy tool shed: https://toolshed.g2.bx.psu.edu/view/yhoogstrate/ega_download_streamer.
ABSTRACT
UNLABELLED: A new generation of tools that identify fusion genes in RNA-seq data is limited in sensitivity and/or specificity. To allow further downstream analysis and to estimate performance, predicted fusion genes from different tools have to be compared. However, the transcriptomic context complicates genomic location-based matching. FusionMatcher (FuMa) is a program that reports identical fusion genes based on gene-name annotations. FuMa automatically compares and summarizes all combinations of two or more datasets in a single run, with no additional programming required. FuMa uses one gene annotation, avoiding mismatches caused by tool-specific gene annotations. FuMa matches 10% more fusion genes than exact gene matching because it accounts for overlapping genes, and it accepts intermediate output files that allow a stepwise analysis of the corresponding tools. AVAILABILITY AND IMPLEMENTATION: The code is available at https://github.com/ErasmusMC-Bioinformatics/fuma, is available for Galaxy in the tool sheds, and is directly accessible at https://bioinf-galaxian.erasmusmc.nl/galaxy/ CONTACT: y.hoogstrate@erasmusmc.nl or a.stubbs@erasmusmc.nl SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
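The difference between exact gene matching and matching that tolerates overlapping genes, as described in the abstract above, can be illustrated with a minimal sketch. This is not FuMa's implementation: the functions `exact_match` and `overlap_match`, the gene names, and the choice to ignore orientation are all assumptions made for illustration. Each predicted fusion is represented as a pair of gene-name sets, one set per breakpoint, taken from a single shared annotation.

```python
def exact_match(a, b):
    """Fusions match only if the gene sets at both ends are identical
    (in either orientation; ignoring orientation is an assumption)."""
    (a_left, a_right), (b_left, b_right) = a, b
    return (a_left == b_left and a_right == b_right) or \
           (a_left == b_right and a_right == b_left)


def overlap_match(a, b):
    """Fusions match if the gene sets at both ends share at least one gene
    (in either orientation; ignoring orientation is an assumption)."""
    (a_left, a_right), (b_left, b_right) = a, b
    same = bool(a_left & b_left) and bool(a_right & b_right)
    swapped = bool(a_left & b_right) and bool(a_right & b_left)
    return same or swapped


# Hypothetical predictions. The breakpoint reported by tool 1 lies in a
# region where two annotated genes overlap, so it maps to both names.
tool1 = ({"GENE_A", "GENE_A_AS"}, {"GENE_B"})
tool2 = ({"GENE_A"}, {"GENE_B"})
tool3 = ({"GENE_C"}, {"GENE_B"})

print(exact_match(tool1, tool2))    # False: the left gene sets differ
print(overlap_match(tool1, tool2))  # True: both ends share a gene
print(overlap_match(tool1, tool3))  # False: the left ends share no gene
```

In this toy case the two tools report the same event, but only the overlap-tolerant comparison recognises it, which is the kind of match that exact gene matching misses.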