ABSTRACT
The main goals and challenges for the life science communities in the Open Science framework are to increase reuse and sustainability of data resources, software tools, and workflows, especially in large-scale data-driven research and computational analyses. Here, we present key findings, procedures, effective measures and recommendations for generating and establishing sustainable life science resources based on the collaborative, cross-disciplinary work done within the EOSC-Life (European Open Science Cloud for Life Sciences) consortium. Bringing together 13 European life science research infrastructures, it has laid the foundation for an open, digital space to support biological and medical research. Using lessons learned from 27 selected projects, we describe the organisational, technical, financial and legal/ethical challenges that represent the main barriers to sustainability in the life sciences. We show how EOSC-Life provides a model for sustainable data management according to FAIR (findability, accessibility, interoperability, and reusability) principles, including solutions for sensitive and industry-related resources, by means of cross-disciplinary training and the sharing of best practices. Finally, we illustrate how data harmonisation and collaborative work facilitate the interoperability of tools, data and solutions, and lead to a better understanding of concepts, semantics and functionalities in the life sciences.
Subjects
Biological Science Disciplines, Biomedical Research, Software, Workflow
ABSTRACT
There are thousands of well-maintained high-quality open-source software utilities for all aspects of scientific data analysis. For more than a decade, the Galaxy Project has been providing computational infrastructure and a unified user interface for these tools to make them accessible to a wide range of researchers. To streamline the process of integrating tools and constructing workflows as much as possible, we have developed Planemo, a software development kit for tool and workflow developers and Galaxy power users. Here we outline Planemo's implementation and describe its broad range of functionality for designing, testing, and executing Galaxy tools, workflows, and training material. In addition, we discuss the philosophy underlying Galaxy tool and workflow development, and how Planemo encourages the use of development best practices, such as test-driven development, by its users, including those who are not professional software developers.
Subjects
Computational Biology, Software, Workflow, Data Analysis
ABSTRACT
MOTIVATION: Many diseases, such as cancer, are characterized by an alteration of cellular metabolism allowing cells to adapt to changes in the microenvironment. Stable isotope-resolved metabolomics (SIRM) and downstream data analyses are widely used techniques for unraveling cells' metabolic activity to understand the altered functioning of metabolic pathways in the diseased state. While a number of bioinformatic solutions exist for the differential analysis of SIRM data, there is currently no available resource providing a comprehensive toolbox. RESULTS: In this work, we present DIMet, a one-stop comprehensive tool for differential analysis of targeted tracer data. DIMet accepts metabolite total abundances, isotopologue contributions, and isotopic mean enrichment, and supports differential comparison (pairwise and multi-group), time-series analyses, and labeling profile comparison. Moreover, it integrates transcriptomics and targeted metabolomics data through network-based metabolograms. We illustrate the use of DIMet in real SIRM datasets obtained from glioblastoma P3 cell-line samples. DIMet is open-source and readily available for routine downstream analysis of isotope-labeled targeted metabolomics data, as it can be used either from the command-line interface or as a complete toolkit in the public Galaxy Europe and Workflow4Metabolomics web platforms. AVAILABILITY AND IMPLEMENTATION: DIMet is freely available at https://github.com/cbib/DIMet, and through https://usegalaxy.eu and https://workflow4metabolomics.usegalaxy.fr. All the datasets are available at Zenodo https://zenodo.org/records/10925786.
Subjects
Isotope Labeling, Metabolomics, Software, Metabolomics/methods, Humans, Isotope Labeling/methods, Glioblastoma/metabolism, Tumor Cell Line
ABSTRACT
In modern reproducible, hypothesis-driven plant research, scientists are increasingly relying on research data management (RDM) services and infrastructures to streamline the processes of collecting, processing, sharing, and archiving research data. FAIR (i.e., findable, accessible, interoperable, and reusable) research data play a pivotal role in enabling the integration of interdisciplinary knowledge and facilitating the comparison and synthesis of a wide range of analytical findings. The PLANTdataHUB offers a solution that realizes RDM of scientific (meta)data as evolving collections of files in a directory - yielding FAIR digital objects called ARCs - with tools that enable scientists to plan, communicate, collaborate, publish, and reuse data on the same platform while gaining continuous quality control insights. The centralized platform is scalable from personal use to global communities and provides advanced federation capabilities for institutions that prefer to host their own satellite instances. This approach borrows many concepts from software development and adapts them to fit the challenges of the field of modern plant science undergoing digital transformation. The PLANTdataHUB supports researchers in each stage of a scientific project with adaptable continuous quality control insights, from the early planning phase to data publication. The central live instance of PLANTdataHUB is accessible at (https://git.nfdi4plants.org), and it will continue to evolve as a community-driven and dynamic resource that serves the needs of contemporary plant science.
Subjects
Databases as Topic, Information Dissemination, Plants
ABSTRACT
There is an ongoing explosion of scientific datasets being generated, brought on by recent technological advances in many areas of the natural sciences. As a result, the life sciences have become increasingly computational in nature, and bioinformatics has taken on a central role in research studies. However, basic computational skills, data analysis, and stewardship are still rarely taught in life science educational programs, resulting in a skills gap in many of the researchers tasked with analysing these big datasets. In order to address this skills gap and empower researchers to perform their own data analyses, the Galaxy Training Network (GTN) has previously developed the Galaxy Training Platform (https://training.galaxyproject.org), an open access, community-driven framework for the collection of FAIR (Findable, Accessible, Interoperable, Reusable) training materials for data analysis utilizing the user-friendly Galaxy framework as its primary data analysis platform. Since its inception, this training platform has thrived, with the number of tutorials and contributors growing rapidly, and the range of topics extending beyond life sciences to include topics such as climatology, cheminformatics, and machine learning. While initially aimed at supporting researchers directly, the GTN framework has proven to be an invaluable resource for educators as well. We have focused our efforts in recent years on adding increased support for this growing community of instructors. We have added new features to facilitate the use of the materials in a classroom setting, simplified the contribution flow for new materials, and added a set of train-the-trainer lessons. Here, we present the latest developments in the GTN project, aimed at facilitating the use of the Galaxy Training materials by educators, and its usage in different learning environments.
Subjects
Computational Biology, Software, Humans, Computational Biology/methods, Data Analysis, Research Personnel
ABSTRACT
BACKGROUND: Galaxy is a web-based open-source platform for scientific analyses. Researchers use thousands of high-quality tools and workflows for their respective analyses in Galaxy. A tool recommender system predicts a collection of tools that can be used to extend an analysis. In this work, a tool recommender system is developed by training a transformer on workflows available on Galaxy Europe, and its performance is compared to other neural networks such as recurrent, convolutional and dense neural networks. RESULTS: The transformer neural network achieves two times faster convergence, has significantly lower model usage (model reconstruction and prediction) time and shows a better generalisation that goes beyond training workflows than the older tool recommender system created using an RNN in Galaxy. In addition, the transformer also outperforms the CNN and DNN on several key indicators. It achieves a faster convergence time, lower model usage time, and higher quality tool recommendations than the CNN. Compared to the DNN, it converges faster to a higher precision@k metric (approximately 0.98 for the transformer compared to approximately 0.9 for the DNN) and shows higher quality tool recommendations. CONCLUSION: Our work shows a novel usage of transformers to recommend tools for extending scientific workflows. A more robust tool recommendation model, created using a transformer, having significantly lower usage time than the RNN and CNN, higher precision@k than the DNN, and higher quality tool recommendations than all three neural networks, will benefit researchers in creating scientifically significant workflows and exploratory data analysis in Galaxy. Additionally, the ability to train faster than all three neural networks imparts more scalability for training on larger datasets consisting of millions of tool sequences. Open-source scripts to create the recommendation model are available under MIT licence at https://github.com/anuprulez/galaxy_tool_recommendation_transformers.
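The precision@k metric mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the common definition (the fraction of the top k recommended items that are relevant); the abstract does not state the exact variant used by the authors, and the function name and tool names below are hypothetical.

```python
from typing import Sequence, Set


def precision_at_k(ranked_tools: Sequence[str], relevant_tools: Set[str], k: int) -> float:
    """Fraction of the top-k ranked recommendations that are relevant.

    Assumes the common textbook definition; the paper may use a variant.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    top_k = ranked_tools[:k]
    hits = sum(1 for tool in top_k if tool in relevant_tools)
    # Divide by k (not len(top_k)) so that short rankings are penalised.
    return hits / k


# Hypothetical ranking produced by a recommender for one workflow step.
ranked = ["bwa", "samtools", "fastqc", "multiqc"]
# Hypothetical set of tools that actually followed this step in real workflows.
relevant = {"samtools", "multiqc", "bcftools"}

score_top2 = precision_at_k(ranked, relevant, k=2)
score_top4 = precision_at_k(ranked, relevant, k=4)
```

In practice the score would be averaged over many held-out workflow paths rather than computed for a single ranking.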
Subjects
Computer Neural Networks, Software, Workflow, Data Analysis, Europe
ABSTRACT
Among the 30 nonsynonymous nucleotide substitutions in the Omicron S-gene are 13 that have only rarely been seen in other SARS-CoV-2 sequences. These mutations cluster within three functionally important regions of the S-gene at sites that will likely impact (1) interactions between subunits of the Spike trimer and the predisposition of subunits to shift from down to up configurations, (2) interactions of Spike with ACE2 receptors, and (3) the priming of Spike for membrane fusion. We show here that, based on both the rarity of these 13 mutations in intrapatient sequencing reads and patterns of selection at the codon sites where the mutations occur in SARS-CoV-2 and related sarbecoviruses, prior to the emergence of Omicron the mutations would have been predicted to decrease the fitness of any virus within which they occurred. We further propose that the mutations in each of the three clusters therefore cooperatively interact to both mitigate their individual fitness costs, and, in combination with other mutations, adaptively alter the function of Spike. Given the evident epidemic growth advantages of Omicron over all previously known SARS-CoV-2 lineages, it is crucial to determine both how such complex and highly adaptive mutation constellations were assembled within the Omicron S-gene, and why, despite unprecedented global genomic surveillance efforts, the early stages of this assembly process went completely undetected.
Subjects
COVID-19, Coronavirus Spike Glycoprotein, COVID-19/genetics, Humans, Mutation, SARS-CoV-2/genetics, Coronavirus Spike Glycoprotein/genetics
ABSTRACT
Quantitative mass spectrometry-based proteomics has become a high-throughput technology for the identification and quantification of thousands of proteins in complex biological samples. Two frequently used tools, MaxQuant and MSstats, allow for the analysis of raw data and finding proteins with differential abundance between conditions of interest. To enable accessible and reproducible quantitative proteomics analyses in a cloud environment, we have integrated MaxQuant (including TMTpro 16/18plex), Proteomics Quality Control (PTXQC), MSstats, and MSstatsTMT into the open-source Galaxy framework. This enables the web-based analysis of label-free and isobaric labeling proteomics experiments via Galaxy's graphical user interface on public clouds. MaxQuant and MSstats in Galaxy can be applied in conjunction with thousands of existing Galaxy tools and integrated into standardized, sharable workflows. Galaxy tracks all metadata and intermediate results in analysis histories, which can be shared privately for collaborations or publicly, allowing full reproducibility and transparency of published analysis. To further increase accessibility, we provide detailed hands-on training materials. The integration of MaxQuant and MSstats into the Galaxy framework enables their usage in a reproducible way on accessible large computational infrastructures, hence realizing the foundation for high-throughput proteomics data science for everyone.
Subjects
Proteomics, Software, Cloud Computing, Mass Spectrometry/methods, Proteins/analysis, Proteomics/methods, Reproducibility of Results
ABSTRACT
The zebrafish embryo is transcriptionally mostly quiescent during the first 10 cell cycles, until the main wave of zygotic genome activation (ZGA) occurs, accompanied by fast chromatin remodeling. At ZGA, homologs of the mammalian stem cell transcription factors (TFs) Pou5f3, Nanog, and Sox19b bind to thousands of developmental enhancers to initiate transcription. So far, how these TFs influence chromatin dynamics at ZGA has remained unresolved. To address this question, we analyzed nucleosome positions in wild-type and maternal-zygotic (MZ) mutants for pou5f3 and nanog by MNase-seq. We show that Nanog, Sox19b, and Pou5f3 bind to the high nucleosome affinity regions (HNARs). HNARs span over 600 bp and feature high in vivo and predicted in vitro nucleosome occupancy and a high predicted propeller twist DNA shape value. We suggest a two-step nucleosome destabilization-depletion model, in which the same intrinsic DNA properties of HNARs promote both high nucleosome occupancy and differential binding of TFs. In the first step, already before ZGA, Pou5f3 and Nanog destabilize nucleosomes at HNAR centers genome-wide. In the second step, post-ZGA, Nanog, Pou5f3, and SoxB1 maintain an open chromatin state on a subset of HNARs, acting synergistically. Nanog binds to the HNAR center, whereas Pou5f3 stabilizes the flanks. The HNAR model will provide a useful tool for genome regulatory studies in a variety of biological systems.
Subjects
Chromatin Assembly and Disassembly, Nanog Homeobox Protein/metabolism, Nucleosomes/genetics, Octamer Transcription Factor-3/metabolism, SOX Transcription Factors/metabolism, Zebrafish Proteins/metabolism, Zygote/metabolism, Animals, Developmental Gene Expression Regulation, Nanog Homeobox Protein/genetics, Nucleosomes/metabolism, Octamer Transcription Factor-3/genetics, Protein Binding, SOX Transcription Factors/genetics, Zebrafish, Zebrafish Proteins/genetics
ABSTRACT
MOTIVATION: Hi-C technology provides insights into the 3D organization of the chromatin, and the single-cell Hi-C method enables researchers to gain knowledge about the chromatin state at the level of individual cells. Single-cell Hi-C interaction matrices are high dimensional and very sparse. To cluster thousands of single-cell Hi-C interaction matrices, they are flattened and compiled into one matrix. Depending on the resolution, this matrix can have a few million or even billions of features; therefore, computations can be memory intensive. We present a single-cell Hi-C clustering approach using an approximate nearest neighbors method based on locality-sensitive hashing to reduce the dimensions and the computational resources. RESULTS: The presented method can process a 10 kb single-cell Hi-C dataset with 2600 cells and needs 40 GB of memory, while competitive approaches are not computable even with 1 TB of memory. It can be shown that the differentiation of the cells by their chromatin folding properties and, therefore, the quality of the clustering of single-cell Hi-C data is advantageous compared to competitive algorithms. AVAILABILITY AND IMPLEMENTATION: The presented clustering algorithm is part of scHiCExplorer and is available on GitHub at https://github.com/joachimwolff/scHiCExplorer and as a conda package via the bioconda channel. The approximate nearest neighbors implementation is available via https://github.com/joachimwolff/sparse-neighbors-search and as a conda package via the bioconda channel. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
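The abstract above does not describe the hashing scheme itself. As an illustration only, the following sketch assumes MinHash, one common locality-sensitive hashing family for sparse binary data, and represents each flattened single-cell interaction matrix as the set of indices of its non-zero bins. All names are hypothetical, and this is not the scHiCExplorer or sparse-neighbors-search implementation.

```python
import random
from typing import List, Set, Tuple

PRIME = 2_147_483_647  # Mersenne prime 2^31 - 1, used as the hash modulus


def make_hash_params(num_hashes: int, seed: int = 0) -> List[Tuple[int, int]]:
    """Draw (a, b) pairs for universal hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(num_hashes)]


def minhash_signature(nonzero: Set[int], params: List[Tuple[int, int]]) -> List[int]:
    """Compress a sparse feature set into a short signature (one minimum per hash)."""
    if not nonzero:
        raise ValueError("cannot build a signature for an empty feature set")
    return [min((a * idx + b) % PRIME for idx in nonzero) for a, b in params]


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


params = make_hash_params(num_hashes=64, seed=42)

# Hypothetical cells: indices of non-zero bins in the flattened contact matrix.
cell_a = {1, 5, 9, 42, 1000}
cell_b = {1, 5, 9, 42, 1000}   # identical contacts to cell_a
cell_c = {2, 6, 10, 77}        # no contacts shared with cell_a

sig_a = minhash_signature(cell_a, params)
sig_b = minhash_signature(cell_b, params)
sig_c = minhash_signature(cell_c, params)

same = estimated_jaccard(sig_a, sig_b)
different = estimated_jaccard(sig_a, sig_c)
```

Each cell is reduced from potentially millions of features to a fixed-length signature, so neighbour search and clustering can operate on the signatures rather than on the full matrix.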
Subjects
Chromosomes, Software, Chromatin, Algorithms, Cluster Analysis
ABSTRACT
MOTIVATION: Single-cell Hi-C research currently lacks an efficient, easy-to-use and shareable data storage format. Recent studies have used a variety of sub-optimal solutions: publishing raw data only, text-based interaction matrices, or reusing established Hi-C storage formats for single interaction matrices. These approaches are storage and pre-processing intensive, require long labour time and are often error-prone. RESULTS: The single-cell cooler file format (scool) provides an efficient, user-friendly and storage-saving approach for single-cell Hi-C data. It is a flavour of the established cooler format and guarantees stable API support. AVAILABILITY AND IMPLEMENTATION: The single-cell cooler format is part of the cooler file format as of API version 0.8.9. It is available via pip, conda and GitHub: https://github.com/mirnylab/cooler. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subjects
Information Storage and Retrieval, Software
ABSTRACT
MOTIVATION: Generating publication-ready plots to display multiple genomic tracks can pose a serious challenge. Making desirable and accurate figures requires considerable effort. This is usually done by hand or using vector graphics software. RESULTS: pyGenomeTracks (PGT) is a modular plotting tool that easily combines multiple tracks. It enables a reproducible and standardized generation of highly customizable and publication-ready images. AVAILABILITY AND IMPLEMENTATION: PGT is available through a graphical interface on https://usegalaxy.eu and through the command line. It is provided on conda via the bioconda channel and on pip, and it is openly developed on GitHub: https://github.com/deeptools/pyGenomeTracks. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subjects
Genome, Genomics, Software
ABSTRACT
SUMMARY: Many aspects of the global response to the COVID-19 pandemic are enabled by the fast and open publication of SARS-CoV-2 genetic sequence data. The European Nucleotide Archive (ENA) is the European recommended open repository for genetic sequences. In this work, we present a tool for submitting raw sequencing reads of SARS-CoV-2 to ENA. The tool features a single-step submission process, a graphical user interface, tabular-formatted metadata and the possibility to remove human reads prior to submission. A Galaxy wrapper for the tool allows users with little or no bioinformatics knowledge to do bulk sequencing read submissions. The tool is also packaged in a Docker container to ease deployment. AVAILABILITY AND IMPLEMENTATION: CLI ENA upload tool is available at github.com/usegalaxy-eu/ena-upload-cli (DOI 10.5281/zenodo.4537621); Galaxy ENA upload tool at toolshed.g2.bx.psu.edu/view/iuc/ena_upload/382518f24d6d and github.com/galaxyproject/tools-iuc/tree/master/tools/ena_upload (development); and ENA upload Galaxy container at github.com/ELIXIR-Belgium/ena-upload-container (DOI 10.5281/zenodo.4730785).
Subjects
COVID-19, Software, Humans, SARS-CoV-2, Nucleotides, Pandemics
ABSTRACT
The current state of much of the Wuhan pneumonia virus (severe acute respiratory syndrome coronavirus 2 [SARS-CoV-2]) research shows a regrettable lack of data sharing and considerable analytical obfuscation. This impedes global research cooperation, which is essential for tackling public health emergencies and requires unimpeded access to data, analysis tools, and computational infrastructure. Here, we show that community efforts in developing open analytical software tools over the past 10 years, combined with national investments into scientific computational infrastructure, can overcome these deficiencies and provide an accessible platform for tackling global health emergencies in an open and transparent manner. Specifically, we use all SARS-CoV-2 genomic data available in the public domain so far to (1) underscore the importance of access to raw data and (2) demonstrate that existing community efforts in curation and deployment of biomedical software can reliably support rapid, reproducible research during global health crises. All our analyses are fully documented at https://github.com/galaxyproject/SARS-CoV-2.
Subjects
Betacoronavirus/pathogenicity, Coronavirus Infections/virology, Viral Pneumonia/virology, Public Health, Severe Acute Respiratory Syndrome/virology, COVID-19, Data Analysis, Humans, Pandemics, SARS-CoV-2
ABSTRACT
The Coronavirus Disease 2019 (COVID-19) outbreaks have caused universities all across the globe to close their campuses and forced them to initiate online teaching. This article reviews the pedagogical foundations for developing effective distance education practices, starting from the assumption that promoting autonomous thinking is an essential element to guarantee full citizenship in a democracy and for moral decision-making in situations of rapid change, which has become a pressing need in the context of a pandemic. In addition, the main obstacles related to this new context are identified, and solutions are proposed according to the existing bibliography in learning sciences.
Subjects
COVID-19/epidemiology, Computational Biology, Distance Education/organization & administration, Quarantine, Teaching, COVID-19/virology, Decision Making, Humans, Pandemics, SARS-CoV-2/isolation & purification
ABSTRACT
Supervised machine learning is an essential but difficult-to-use approach in biomedical data analysis. The Galaxy-ML toolkit (https://galaxyproject.org/community/machine-learning/) makes supervised machine learning more accessible to biomedical scientists by enabling them to perform end-to-end reproducible machine learning analyses at large scale using only a web browser. Galaxy-ML extends Galaxy (https://galaxyproject.org), a biomedical computational workbench used by tens of thousands of scientists across the world, with a suite of tools for all aspects of supervised machine learning.
Subjects
Computational Biology/methods, Machine Learning, Reproducibility of Results, Software
ABSTRACT
The COVID-19 pandemic is shifting teaching to an online setting all over the world. The Galaxy framework facilitates the online learning process and makes it accessible by providing a library of high-quality community-curated training materials, enabling easy access to data and tools, and facilitating the sharing of achievements and progress between students and instructors. By combining Galaxy with robust communication channels, effective instruction can be designed inclusively, regardless of the students' environments.
Subjects
COVID-19/epidemiology, Computer-Assisted Instruction, Distance Education/organization & administration, COVID-19/virology, Computational Biology, Humans, Information Dissemination, Pandemics, SARS-CoV-2/isolation & purification
ABSTRACT
The Galaxy HiCExplorer provides a web service at https://hicexplorer.usegalaxy.eu. It enables the integrative analysis of chromosome conformation by providing tools and computational resources to pre-process, analyse and visualize Hi-C, Capture Hi-C (cHi-C) and single-cell Hi-C (scHi-C) data. Since the last publication, Galaxy HiCExplorer has been expanded considerably with new tools to facilitate the analysis of cHi-C and to provide an in-depth analysis of Hi-C data. Moreover, it supports the analysis of scHi-C data by offering a broad range of tools. With the help of the standard graphical user interface of Galaxy, presented workflows, extensive documentation and tutorials, novices as well as Hi-C experts are supported in their Hi-C data analysis with Galaxy HiCExplorer.
Subjects
Chromatin/chemistry, Software, Computer Graphics, Genetic Techniques/standards, Internet, Molecular Conformation, Reproducibility of Results, Single-Cell Analysis/standards
ABSTRACT
The Omics Discovery Index (OmicsDI) is an open-source platform that can be used to access, discover and disseminate omics datasets. OmicsDI integrates proteomics, genomics, metabolomics, models and transcriptomics datasets. Using an efficient indexing system, OmicsDI integrates different biological entities including genes, transcripts, proteins, metabolites and the corresponding publications from PubMed. In addition, it implements a group of pipelines to estimate the impact of each dataset by tracing the number of citations, reanalyses and biological entities reported by each dataset. Here, we present the OmicsDI REST interface (www.omicsdi.org/ws/) to enable programmatic access to any dataset in OmicsDI or all the datasets for a specific provider (database). Clients can perform queries on the API using different metadata information such as sample details (species, tissues, etc.), instrumentation (mass spectrometer, sequencer), keywords and other provided annotations. In addition, we present two different libraries in R and Python to facilitate the development of tools that can programmatically interact with the OmicsDI REST interface.
Subjects
Gene Expression Profiling/methods, Proteomics/methods, Software, Genetic Databases, Datasets as Topic, Genomics/methods, Metabolomics/methods, User-Computer Interface
ABSTRACT
BioContainers is an open-source project that aims to create, store, and distribute bioinformatics software containers and packages. The BioContainers community has developed a set of guidelines to standardize software containers including the metadata, versions, licenses, and software dependencies. BioContainers supports multiple packaging and container technologies such as Conda, Docker, and Singularity. BioContainers provides over 9000 bioinformatics tools, including more than 200 proteomics and mass spectrometry tools. Here we introduce the BioContainers Registry and RESTful API to make containerized bioinformatics tools more findable, accessible, interoperable, and reusable (FAIR). The BioContainers Registry provides a fast and convenient way to find and retrieve bioinformatics tool packages and containers. By doing so, it will increase the use of bioinformatics packages and containers while promoting replicability and reproducibility in research.