Pesquisa | BVS Violência e Saúde

1.

Towards Machine-FAIR: Representing software and datasets to facilitate reuse and scientific discovery by machines.

Wagner, Michael M; Hogan, William R; Levander, John D; Diller, Matthew.

J Biomed Inform ; 154: 104647, 2024 Jun.

Artigo em Inglês | MEDLINE | ID: mdl-38692465

RESUMO

OBJECTIVE: To use software, datasets, and data formats in the domain of Infectious Disease Epidemiology as a test collection to evaluate a novel M1 use case, which we introduce in this paper. M1 is a machine that upon receipt of a new digital object of research exhaustively finds all valid compositions of it with existing objects. METHOD: We implemented a data-format-matching-only M1 using exhaustive search, which we refer to as M1DFM. We then ran M1DFM on the test collection and used error analysis to identify needed semantic constraints. RESULTS: Precision of M1DFM search was 61.7%. Error analysis identified needed semantic constraints and needed changes in handling of data services. Most semantic constraints were simple, but one data format was sufficiently complex to be practically impossible to represent semantic constraints over, from which we conclude limitatively that software developers will have to meet the machines halfway by engineering software whose inputs are sufficiently simple that their semantic constraints can be represented, akin to the simple APIs of services. We summarize these insights as M1-FAIR guiding principles for composability and suggest a roadmap for progressively capable devices in the service of reuse and accelerated scientific discovery. CONCLUSION: Algorithmic search of digital repositories for valid workflow compositions has potential to accelerate scientific discovery but requires a scalable solution to the problem of knowledge acquisition about semantic constraints on software inputs. Additionally, practical limitations on the logical complexity of semantic constraints must be respected, which has implications for the design of software.

Assuntos

Software , Humanos , Semântica , Aprendizado de Máquina , Algoritmos , Bases de Dados Factuais

2.

APE in the Wild: Automated Exploration of Proteomics Workflows in the bio.tools Registry.

Kasalica, Vedran; Schwämmle, Veit; Palmblad, Magnus; Ison, Jon; Lamprecht, Anna-Lena.

J Proteome Res ; 20(4): 2157-2165, 2021 04 02.

Artigo em Inglês | MEDLINE | ID: mdl-33720735

RESUMO

The bio.tools registry is a main catalogue of computational tools in the life sciences. More than 17â¯000 tools have been registered by the international bioinformatics community. The bio.tools metadata schema includes semantic annotations of tool functions, that is, formal descriptions of tools' data types, formats, and operations with terms from the EDAM bioinformatics ontology. Such annotations enable the automated composition of tools into multistep pipelines or workflows. In this Technical Note, we revisit a previous case study on the automated composition of proteomics workflows. We use the same four workflow scenarios but instead of using a small set of tools with carefully handcrafted annotations, we explore workflows directly on bio.tools. We use the Automated Pipeline Explorer (APE), a reimplementation and extension of the workflow composition method previously used. Moving "into the wild" opens up an unprecedented wealth of tools and a huge number of alternative workflows. Automated composition tools can be used to explore this space of possibilities systematically. Inevitably, the mixed quality of semantic annotations in bio.tools leads to unintended or erroneous tool combinations. However, our results also show that additional control mechanisms (tool filters, configuration options, and workflow constraints) can effectively guide the exploration toward smaller sets of more meaningful workflows.

Assuntos

Proteômica , Software , Biologia Computacional , Sistema de Registros , Fluxo de Trabalho

3.

Fault-Tolerant and Data-Intensive Resource Scheduling and Management for Scientific Applications in Cloud Computing.

Ahmad, Zulfiqar; Jehangiri, Ali Imran; Ala'anzy, Mohammed Alaa; Othman, Mohamed; Umar, Arif Iqbal.

Sensors (Basel) ; 21(21)2021 Oct 30.

Artigo em Inglês | MEDLINE | ID: mdl-34770545

RESUMO

Cloud computing is a fully fledged, matured and flexible computing paradigm that provides services to scientific and business applications in a subscription-based environment. Scientific applications such as Montage and CyberShake are organized scientific workflows with data and compute-intensive tasks and also have some special characteristics. These characteristics include the tasks of scientific workflows that are executed in terms of integration, disintegration, pipeline, and parallelism, and thus require special attention to task management and data-oriented resource scheduling and management. The tasks executed during pipeline are considered as bottleneck executions, the failure of which result in the wholly futile execution, which requires a fault-tolerant-aware execution. The tasks executed during parallelism require similar instances of cloud resources, and thus, cluster-based execution may upgrade the system performance in terms of make-span and execution cost. Therefore, this research work presents a cluster-based, fault-tolerant and data-intensive (CFD) scheduling for scientific applications in cloud environments. The CFD strategy addresses the data intensiveness of tasks of scientific workflows with cluster-based, fault-tolerant mechanisms. The Montage scientific workflow is considered as a simulation and the results of the CFD strategy were compared with three well-known heuristic scheduling policies: (a) MCT, (b) Max-min, and (c) Min-min. The simulation results showed that the CFD strategy reduced the make-span by 14.28%, 20.37%, and 11.77%, respectively, as compared with the existing three policies. Similarly, the CFD reduces the execution cost by 1.27%, 5.3%, and 2.21%, respectively, as compared with the existing three policies. In case of the CFD strategy, the SLA is not violated with regard to time and cost constraints, whereas it is violated by the existing policies numerous times.

Assuntos

Algoritmos , Computação em Nuvem , Simulação por Computador , Heurística , Fluxo de Trabalho

4.

Bioinformatics recipes: creating, executing and distributing reproducible data analysis workflows.

Aberra, Natay; Sebastian, Aswathy; Maloy, Aaron P; Rees, Christopher B; Bartron, Meredith L; Albert, Istvan.

BMC Bioinformatics ; 21(1): 292, 2020 Jul 08.

Artigo em Inglês | MEDLINE | ID: mdl-32640986

RESUMO

BACKGROUND: Bioinformaticians collaborating with life scientists need software that allows them to involve their collaborators in the process of data analysis. RESULTS: We have developed a web application that allows researchers to publish and execute data analysis scripts. Within the platform bioinformaticians are able to deploy data analysis workflows (recipes) that their collaborators can execute via point and click interfaces. The results generated by the recipes are viewable via the web interface and consist of a snapshot of all the commands, printed messages and files that have been generated during the recipe run. A demonstration version of our software is available at https://www.bioinformatics.recipes/ . Detailed documentation for the software is available at: https://bioinformatics-recipes.readthedocs.io . The source code for the software is distributed through GitHub at https://github.com/ialbert/biostar-central . CONCLUSIONS: Our software platform supports collaborative interactions between bioinformaticians and life scientists. The software is presented via a web application that provides a high utility and user-friendly approach for conducting reproducible research. The recipes developed and shared through the web application are generic, with broad applicability and may be downloaded and executed on other computing platforms.

Assuntos

Biologia Computacional/métodos , Software , Análise de Dados , Reprodutibilidade dos Testes , Interface Usuário-Computador , Fluxo de Trabalho

5.

Reusable, extensible, and modifiable R scripts and Kepler workflows for comprehensive single set ChIP-seq analysis.

Cormier, Nathan; Kolisnik, Tyler; Bieda, Mark.

BMC Bioinformatics ; 17(1): 270, 2016 Jul 05.

Artigo em Inglês | MEDLINE | ID: mdl-27377783

RESUMO

BACKGROUND: There has been an enormous expansion of use of chromatin immunoprecipitation followed by sequencing (ChIP-seq) technologies. Analysis of large-scale ChIP-seq datasets involves a complex series of steps and production of several specialized graphical outputs. A number of systems have emphasized custom development of ChIP-seq pipelines. These systems are primarily based on custom programming of a single, complex pipeline or supply libraries of modules and do not produce the full range of outputs commonly produced for ChIP-seq datasets. It is desirable to have more comprehensive pipelines, in particular ones addressing common metadata tasks, such as pathway analysis, and pipelines producing standard complex graphical outputs. It is advantageous if these are highly modular systems, available as both turnkey pipelines and individual modules, that are easily comprehensible, modifiable and extensible to allow rapid alteration in response to new analysis developments in this growing area. Furthermore, it is advantageous if these pipelines allow data provenance tracking. RESULTS: We present a set of 20 ChIP-seq analysis software modules implemented in the Kepler workflow system; most (18/20) were also implemented as standalone, fully functional R scripts. The set consists of four full turnkey pipelines and 16 component modules. The turnkey pipelines in Kepler allow data provenance tracking. Implementation emphasized use of common R packages and widely-used external tools (e.g., MACS for peak finding), along with custom programming. This software presents comprehensive solutions and easily repurposed code blocks for ChIP-seq analysis and pipeline creation. Tasks include mapping raw reads, peakfinding via MACS, summary statistics, peak location statistics, summary plots centered on the transcription start site (TSS), gene ontology, pathway analysis, and de novo motif finding, among others. CONCLUSIONS: These pipelines range from those performing a single task to those performing full analyses of ChIP-seq data. The pipelines are supplied as both Kepler workflows, which allow data provenance tracking, and, in the majority of cases, as standalone R scripts. These pipelines are designed for ease of modification and repurposing.

Assuntos

Imunoprecipitação da Cromatina/métodos , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Biologia Computacional/métodos , Humanos

6.

Optimization of tomographic reconstruction workflows on geographically distributed resources.

Bicer, Tekin; Gürsoy, DogËa; Kettimuthu, Rajkumar; De Carlo, Francesco; Foster, Ian T.

J Synchrotron Radiat ; 23(Pt 4): 997-1005, 2016 07.

Artigo em Inglês | MEDLINE | ID: mdl-27359149

RESUMO

New technological advancements in synchrotron light sources enable data acquisitions at unprecedented levels. This emergent trend affects not only the size of the generated data but also the need for larger computational resources. Although beamline scientists and users have access to local computational resources, these are typically limited and can result in extended execution times. Applications that are based on iterative processing as in tomographic reconstruction methods require high-performance compute clusters for timely analysis of data. Here, time-sensitive analysis and processing of Advanced Photon Source data on geographically distributed resources are focused on. Two main challenges are considered: (i) modeling of the performance of tomographic reconstruction workflows and (ii) transparent execution of these workflows on distributed resources. For the former, three main stages are considered: (i) data transfer between storage and computational resources, (i) wait/queue time of reconstruction jobs at compute resources, and (iii) computation of reconstruction tasks. These performance models allow evaluation and estimation of the execution time of any given iterative tomographic reconstruction workflow that runs on geographically distributed resources. For the latter challenge, a workflow management system is built, which can automate the execution of workflows and minimize the user interaction with the underlying infrastructure. The system utilizes Globus to perform secure and efficient data transfer operations. The proposed models and the workflow management system are evaluated by using three high-performance computing and two storage resources, all of which are geographically distributed. Workflows were created with different computational requirements using two compute-intensive tomographic reconstruction algorithms. Experimental evaluation shows that the proposed models and system can be used for selecting the optimum resources, which in turn can provide up to 3.13× speedup (on experimented resources). Moreover, the error rates of the models range between 2.1 and 23.3% (considering workflow execution times), where the accuracy of the model estimations increases with higher computational demands in reconstruction tasks.

7.

Facilitating the use of large-scale biological data and tools in the era of translational bioinformatics.

Kouskoumvekaki, Irene; Shublaq, Nour; Brunak, Søren.

Brief Bioinform ; 15(6): 942-52, 2014 Nov.

Artigo em Inglês | MEDLINE | ID: mdl-23908249

RESUMO

As both the amount of generated biological data and the processing compute power increase, computational experimentation is no longer the exclusivity of bioinformaticians, but it is moving across all biomedical domains. For bioinformatics to realize its translational potential, domain experts need access to user-friendly solutions to navigate, integrate and extract information out of biological databases, as well as to combine tools and data resources in bioinformatics workflows. In this review, we present services that assist biomedical scientists in incorporating bioinformatics tools into their research. We review recent applications of Cytoscape, BioGPS and DAVID for data visualization, integration and functional enrichment. Moreover, we illustrate the use of Taverna, Kepler, GenePattern, and Galaxy as open-access workbenches for bioinformatics workflows. Finally, we mention services that facilitate the integration of biomedical ontologies and bioinformatics tools in computational workflows.

Assuntos

Biologia Computacional/métodos , Ontologias Biológicas , Biologia Computacional/tendências , Interpretação Estatística de Dados , Sistemas de Gerenciamento de Base de Dados , Feminino , Sequenciamento de Nucleotídeos em Larga Escala/estatística & dados numéricos , Humanos , Masculino , Software , Pesquisa Translacional Biomédica

8.

PARAMO: a PARAllel predictive MOdeling platform for healthcare analytic research using electronic health records.

Ng, Kenney; Ghoting, Amol; Steinhubl, Steven R; Stewart, Walter F; Malin, Bradley; Sun, Jimeng.

J Biomed Inform ; 48: 160-70, 2014 Apr.

Artigo em Inglês | MEDLINE | ID: mdl-24370496

RESUMO

OBJECTIVE: Healthcare analytics research increasingly involves the construction of predictive models for disease targets across varying patient cohorts using electronic health records (EHRs). To facilitate this process, it is critical to support a pipeline of tasks: (1) cohort construction, (2) feature construction, (3) cross-validation, (4) feature selection, and (5) classification. To develop an appropriate model, it is necessary to compare and refine models derived from a diversity of cohorts, patient-specific features, and statistical frameworks. The goal of this work is to develop and evaluate a predictive modeling platform that can be used to simplify and expedite this process for health data. METHODS: To support this goal, we developed a PARAllel predictive MOdeling (PARAMO) platform which (1) constructs a dependency graph of tasks from specifications of predictive modeling pipelines, (2) schedules the tasks in a topological ordering of the graph, and (3) executes those tasks in parallel. We implemented this platform using Map-Reduce to enable independent tasks to run in parallel in a cluster computing environment. Different task scheduling preferences are also supported. RESULTS: We assess the performance of PARAMO on various workloads using three datasets derived from the EHR systems in place at Geisinger Health System and Vanderbilt University Medical Center and an anonymous longitudinal claims database. We demonstrate significant gains in computational efficiency against a standard approach. In particular, PARAMO can build 800 different models on a 300,000 patient data set in 3h in parallel compared to 9days if running sequentially. CONCLUSION: This work demonstrates that an efficient parallel predictive modeling platform can be developed for EHR data. This platform can facilitate large-scale modeling endeavors and speed-up the research workflow and reuse of health information. This platform is only a first step and provides the foundation for our ultimate goal of building analytic pipelines that are specialized for health data researchers.

Assuntos

Registros Eletrônicos de Saúde , Informática Médica/métodos , Algoritmos , Área Sob a Curva , Sistemas Computacionais , Sistemas de Apoio a Decisões Clínicas , Pesquisa sobre Serviços de Saúde , Humanos , Modelos Teóricos , Reprodutibilidade dos Testes , Software , Tennessee , Fatores de Tempo

9.

A qualitative assessment of using ChatGPT as large language model for scientific workflow development.

Sänger, Mario; De Mecquenem, Ninon; Lewinska, Katarzyna Ewa; Bountris, Vasilis; Lehmann, Fabian; Leser, Ulf; Kosch, Thomas.

Gigascience ; 132024 01 02.

Artigo em Inglês | MEDLINE | ID: mdl-38896539

RESUMO

BACKGROUND: Scientific workflow systems are increasingly popular for expressing and executing complex data analysis pipelines over large datasets, as they offer reproducibility, dependability, and scalability of analyses by automatic parallelization on large compute clusters. However, implementing workflows is difficult due to the involvement of many black-box tools and the deep infrastructure stack necessary for their execution. Simultaneously, user-supporting tools are rare, and the number of available examples is much lower than in classical programming languages. RESULTS: To address these challenges, we investigate the efficiency of large language models (LLMs), specifically ChatGPT, to support users when dealing with scientific workflows. We performed 3 user studies in 2 scientific domains to evaluate ChatGPT for comprehending, adapting, and extending workflows. Our results indicate that LLMs efficiently interpret workflows but achieve lower performance for exchanging components or purposeful workflow extensions. We characterize their limitations in these challenging scenarios and suggest future research directions. CONCLUSIONS: Our results show a high accuracy for comprehending and explaining scientific workflows while achieving a reduced performance for modifying and extending workflow descriptions. These findings clearly illustrate the need for further research in this area.

Assuntos

Fluxo de Trabalho , Linguagens de Programação , Software , Biologia Computacional/métodos , Humanos

10.

Developing and reusing bioinformatics data analysis pipelines using scientific workflow systems.

Djaffardjy, Marine; Marchment, George; Sebe, Clémence; Blanchet, Raphael; Bellajhame, Khalid; Gaignard, Alban; Lemoine, Frédéric; Cohen-Boulakia, Sarah.

Comput Struct Biotechnol J ; 21: 2075-2085, 2023.

Artigo em Inglês | MEDLINE | ID: mdl-36968012

RESUMO

Data analysis pipelines are now established as an effective means for specifying and executing bioinformatics data analysis and experiments. While scripting languages, particularly Python, R and notebooks, are popular and sufficient for developing small-scale pipelines that are often intended for a single user, it is now widely recognized that they are by no means enough to support the development of large-scale, shareable, maintainable and reusable pipelines capable of handling large volumes of data and running on high performance computing clusters. This review outlines the key requirements for building large-scale data pipelines and provides a mapping of existing solutions that fulfill them. We then highlight the benefits of using scientific workflow systems to get modular, reproducible and reusable bioinformatics data analysis pipelines. We finally discuss current workflow reuse practices based on an empirical study we performed on a large collection of workflows.

11.

Neural simulation pipeline: Enabling container-based simulations on-premise and in public clouds.

Chlasta, Karol; Sochaczewski, Pawel; Wójcik, Grzegorz M; Krejtz, Izabela.

Front Neuroinform ; 17: 1122470, 2023.

Artigo em Inglês | MEDLINE | ID: mdl-37025550

RESUMO

In this study, we explore the simulation setup in computational neuroscience. We use GENESIS, a general purpose simulation engine for sub-cellular components and biochemical reactions, realistic neuron models, large neural networks, and system-level models. GENESIS supports developing and running computer simulations but leaves a gap for setting up today's larger and more complex models. The field of realistic models of brain networks has overgrown the simplicity of earliest models. The challenges include managing the complexity of software dependencies and various models, setting up model parameter values, storing the input parameters alongside the results, and providing execution statistics. Moreover, in the high performance computing (HPC) context, public cloud resources are becoming an alternative to the expensive on-premises clusters. We present Neural Simulation Pipeline (NSP), which facilitates the large-scale computer simulations and their deployment to multiple computing infrastructures using the infrastructure as the code (IaC) containerization approach. The authors demonstrate the effectiveness of NSP in a pattern recognition task programmed with GENESIS, through a custom-built visual system, called RetNet(8 × 5,1) that uses biologically plausible Hodgkin-Huxley spiking neurons. We evaluate the pipeline by performing 54 simulations executed on-premise, at the Hasso Plattner Institute's (HPI) Future Service-Oriented Computing (SOC) Lab, and through the Amazon Web Services (AWS), the biggest public cloud service provider in the world. We report on the non-containerized and containerized execution with Docker, as well as present the cost per simulation in AWS. The results show that our neural simulation pipeline can reduce entry barriers to neural simulations, making them more practical and cost-effective.

12.

Provenance-and machine learning-based recommendation of parameter values in scientific workflows.

Silva Junior, Daniel; Pacitti, Esther; Paes, Aline; de Oliveira, Daniel.

PeerJ Comput Sci ; 7: e606, 2021.

Artigo em Inglês | MEDLINE | ID: mdl-34307859

RESUMO

Scientific Workflows (SWfs) have revolutionized how scientists in various domains of science conduct their experiments. The management of SWfs is performed by complex tools that provide support for workflow composition, monitoring, execution, capturing, and storage of the data generated during execution. In some cases, they also provide components to ease the visualization and analysis of the generated data. During the workflow's composition phase, programs must be selected to perform the activities defined in the workflow specification. These programs often require additional parameters that serve to adjust the program's behavior according to the experiment's goals. Consequently, workflows commonly have many parameters to be manually configured, encompassing even more than one hundred in many cases. Wrongly parameters' values choosing can lead to crash workflows executions or provide undesired results. As the execution of data- and compute-intensive workflows is commonly performed in a high-performance computing environment e.g., (a cluster, a supercomputer, or a public cloud), an unsuccessful execution configures a waste of time and resources. In this article, we present FReeP-Feature Recommender from Preferences, a parameter value recommendation method that is designed to suggest values for workflow parameters, taking into account past user preferences. FReeP is based on Machine Learning techniques, particularly in Preference Learning. FReeP is composed of three algorithms, where two of them aim at recommending the value for one parameter at a time, and the third makes recommendations for n parameters at once. The experimental results obtained with provenance data from two broadly used workflows showed FReeP usefulness in the recommendation of values for one parameter. Furthermore, the results indicate the potential of FReeP to recommend values for n parameters in scientific workflows.

13.

The Collaborative Research Center FONDA.

Leser, Ulf; Hilbrich, Marcus; Draxl, Claudia; Eisert, Peter; Grunske, Lars; Hostert, Patrick; Kainmüller, Dagmar; Kao, Odej; Kehr, Birte; Kehrer, Timo; Koch, Christoph; Markl, Volker; Meyerhenke, Henning; Rabl, Tilmann; Reinefeld, Alexander; Reinert, Knut; Ritter, Kerstin; Scheuermann, Björn; Schintke, Florian; Schweikardt, Nicole; Weidlich, Matthias.

Datenbank Spektrum ; 21(3): 255-260, 2021.

Artigo em Inglês | MEDLINE | ID: mdl-34786019

RESUMO

Today's scientific data analysis very often requires complex Data Analysis Workflows (DAWs) executed over distributed computational infrastructures, e.g., clusters. Much research effort is devoted to the tuning and performance optimization of specific workflows for specific clusters. However, an arguably even more important problem for accelerating research is the reduction of development, adaptation, and maintenance times of DAWs. We describe the design and setup of the Collaborative Research Center (CRC) 1404 "FONDA -- Foundations of Workflows for Large-Scale Scientific Data Analysis", in which roughly 50 researchers jointly investigate new technologies, algorithms, and models to increase the portability, adaptability, and dependability of DAWs executed over distributed infrastructures. We describe the motivation behind our project, explain its underlying core concepts, introduce FONDA's internal structure, and sketch our vision for the future of workflow-based scientific data analysis. We also describe some lessons learned during the "making of" a CRC in Computer Science with strong interdisciplinary components, with the aim to foster similar endeavors.

14.

Perspectives on automated composition of workflows in the life sciences.

Lamprecht, Anna-Lena; Palmblad, Magnus; Ison, Jon; Schwämmle, Veit; Al Manir, Mohammad Sadnan; Altintas, Ilkay; Baker, Christopher J O; Ben Hadj Amor, Ammar; Capella-Gutierrez, Salvador; Charonyktakis, Paulos; Crusoe, Michael R; Gil, Yolanda; Goble, Carole; Griffin, Timothy J; Groth, Paul; Ienasescu, Hans; Jagtap, Pratik; Kalas, Matús; Kasalica, Vedran; Khanteymoori, Alireza; Kuhn, Tobias; Mei, Hailiang; Ménager, Hervé; Möller, Steffen; Richardson, Robin A; Robert, Vincent; Soiland-Reyes, Stian; Stevens, Robert; Szaniszlo, Szoke; Verberne, Suzan; Verhoeven, Aswin; Wolstencroft, Katherine.

F1000Res ; 10: 897, 2021.

Artigo em Inglês | MEDLINE | ID: mdl-34804501

RESUMO

Scientific data analyses often combine several computational tools in automated pipelines, or workflows. Thousands of such workflows have been used in the life sciences, though their composition has remained a cumbersome manual process due to a lack of standards for annotation, assembly, and implementation. Recent technological advances have returned the long-standing vision of automated workflow composition into focus. This article summarizes a recent Lorentz Center workshop dedicated to automated composition of workflows in the life sciences. We survey previous initiatives to automate the composition process, and discuss the current state of the art and future perspectives. We start by drawing the "big picture" of the scientific workflow development life cycle, before surveying and discussing current methods, technologies and practices for semantic domain modelling, automation in workflow development, and workflow assessment. Finally, we derive a roadmap of individual and community-based actions to work toward the vision of automated workflow development in the forthcoming years. A central outcome of the workshop is a general description of the workflow life cycle in six stages: 1) scientific question or hypothesis, 2) conceptual workflow, 3) abstract workflow, 4) concrete workflow, 5) production workflow, and 6) scientific results. The transitions between stages are facilitated by diverse tools and methods, usually incorporating domain knowledge in some form. Formal semantic domain modelling is hard and often a bottleneck for the application of semantic technologies. However, life science communities have made considerable progress here in recent years and are continuously improving, renewing interest in the application of semantic technologies for workflow exploration, composition and instantiation. Combined with systematic benchmarking with reference data and large-scale deployment of production-stage workflows, such technologies enable a more systematic process of workflow development than we know today. We believe that this can lead to more robust, reusable, and sustainable workflows in the future.

Assuntos

Disciplinas das Ciências Biológicas , Biologia Computacional , Benchmarking , Software , Fluxo de Trabalho

15.

Towards FAIR protocols and workflows: the OpenPREDICT use case.

Celebi, Remzi; Rebelo Moreira, Joao; Hassan, Ahmed A; Ayyar, Sandeep; Ridder, Lars; Kuhn, Tobias; Dumontier, Michel.

PeerJ Comput Sci ; 6: e281, 2020.

Artigo em Inglês | MEDLINE | ID: mdl-33816932

RESUMO

It is essential for the advancement of science that researchers share, reuse and reproduce each other's workflows and protocols. The FAIR principles are a set of guidelines that aim to maximize the value and usefulness of research data, and emphasize the importance of making digital objects findable and reusable by others. The question of how to apply these principles not just to data but also to the workflows and protocols that consume and produce them is still under debate and poses a number of challenges. In this paper we describe a two-fold approach of simultaneously applying the FAIR principles to scientific workflows as well as the involved data. We apply and evaluate our approach on the case of the PREDICT workflow, a highly cited drug repurposing workflow. This includes FAIRification of the involved datasets, as well as applying semantic technologies to represent and store data about the detailed versions of the general protocol, of the concrete workflow instructions, and of their execution traces. We propose a semantic model to address these specific requirements and was evaluated by answering competency questions. This semantic model consists of classes and relations from a number of existing ontologies, including Workflow4ever, PROV, EDAM, and BPMN. This allowed us then to formulate and answer new kinds of competency questions. Our evaluation shows the high degree to which our FAIRified OpenPREDICT workflow now adheres to the FAIR principles and the practicality and usefulness of being able to answer our new competency questions.

16.

Sharing interoperable workflow provenance: A review of best practices and their practical application in CWLProv.

Khan, Farah Zaib; Soiland-Reyes, Stian; Sinnott, Richard O; Lonie, Andrew; Goble, Carole; Crusoe, Michael R.

Gigascience ; 8(11)2019 11 01.

Artigo em Inglês | MEDLINE | ID: mdl-31675414

RESUMO

BACKGROUND: The automation of data analysis in the form of scientific workflows has become a widely adopted practice in many fields of research. Computationally driven data-intensive experiments using workflows enable automation, scaling, adaptation, and provenance support. However, there are still several challenges associated with the effective sharing, publication, and reproducibility of such workflows due to the incomplete capture of provenance and lack of interoperability between different technical (software) platforms. RESULTS: Based on best-practice recommendations identified from the literature on workflow design, sharing, and publishing, we define a hierarchical provenance framework to achieve uniformity in provenance and support comprehensive and fully re-executable workflows equipped with domain-specific information. To realize this framework, we present CWLProv, a standard-based format to represent any workflow-based computational analysis to produce workflow output artefacts that satisfy the various levels of provenance. We use open source community-driven standards, interoperable workflow definitions in Common Workflow Language (CWL), structured provenance representation using the W3C PROV model, and resource aggregation and sharing as workflow-centric research objects generated along with the final outputs of a given workflow enactment. We demonstrate the utility of this approach through a practical implementation of CWLProv and evaluation using real-life genomic workflows developed by independent groups. CONCLUSIONS: The underlying principles of the standards utilized by CWLProv enable semantically rich and executable research objects that capture computational workflows with retrospective provenance such that any platform supporting CWL will be able to understand the analysis, reuse the methods for partial reruns, or reproduce the analysis to validate the published findings.

Assuntos

Genômica , Modelos Teóricos , Fluxo de Trabalho , Humanos , Software

17.

BioWorkbench: a high-performance framework for managing and analyzing bioinformatics experiments.

Mondelli, Maria Luiza; Magalhães, Thiago; Loss, Guilherme; Wilde, Michael; Foster, Ian; Mattoso, Marta; Katz, Daniel; Barbosa, Helio; de Vasconcelos, Ana Tereza R; Ocaña, Kary; Gadelha, Luiz M R.

PeerJ ; 6: e5551, 2018.

Artigo em Inglês | MEDLINE | ID: mdl-30186700

RESUMO

Advances in sequencing techniques have led to exponential growth in biological data, demanding the development of large-scale bioinformatics experiments. Because these experiments are computation- and data-intensive, they require high-performance computing techniques and can benefit from specialized technologies such as Scientific Workflow Management Systems and databases. In this work, we present BioWorkbench, a framework for managing and analyzing bioinformatics experiments. This framework automatically collects provenance data, including both performance data from workflow execution and data from the scientific domain of the workflow application. Provenance data can be analyzed through a web application that abstracts a set of queries to the provenance database, simplifying access to provenance information. We evaluate BioWorkbench using three case studies: SwiftPhylo, a phylogenetic tree assembly workflow; SwiftGECKO, a comparative genomics workflow; and RASflow, a RASopathy analysis workflow. We analyze each workflow from both computational and scientific domain perspectives, by using queries to a provenance and annotation database. Some of these queries are available as a pre-built feature of the BioWorkbench web application. Through the provenance data, we show that the framework is scalable and achieves high-performance, reducing up to 98% of the case studies execution time. We also show how the application of machine learning techniques can enrich the analysis process.

18.

Molecular structures enumeration and virtual screening in the chemical space with RetroPath2.0.

Koch, Mathilde; Duigou, Thomas; Carbonell, Pablo; Faulon, Jean-Loup.

J Cheminform ; 9(1): 64, 2017 Dec 19.

Artigo em Inglês | MEDLINE | ID: mdl-29260340

RESUMO

BACKGROUND: Network generation tools coupled with chemical reaction rules have been mainly developed for synthesis planning and more recently for metabolic engineering. Using the same core algorithm, these tools apply a set of rules to a source set of compounds, stopping when a sink set of compounds has been produced. When using the appropriate sink, source and rules, this core algorithm can be used for a variety of applications beyond those it has been developed for. RESULTS: Here, we showcase the use of the open source workflow RetroPath2.0. First, we mathematically prove that we can generate all structural isomers of a molecule using a reduced set of reaction rules. We then use this enumeration strategy to screen the chemical space around a set of monomers and predict their glass transition temperatures, as well as around aminoglycosides to search structures maximizing antibacterial activity. We also perform a screening around aminoglycosides with enzymatic reaction rules to ensure biosynthetic accessibility. We finally use our workflow on an E. coli model to complete E. coli metabolome, with novel molecules generated using promiscuous enzymatic reaction rules. These novel molecules are searched on the MS spectra of an E. coli cell lysate interfacing our workflow with OpenMS through the KNIME Analytics Platform. CONCLUSION: We provide an easy to use and modify, modular, and open-source workflow. We demonstrate its versatility through a variety of use cases including molecular structure enumeration, virtual screening in the chemical space, and metabolome completion. Because it is open source and freely available on MyExperiment.org, workflow community contributions should likely expand further the features of the tool, even beyond the use cases presented in the paper.

19.

Kepler WebView: A Lightweight, Portable Framework for Constructing Real-time Web Interfaces of Scientific Workflows.

Crawl, Daniel; Singh, Alok; Altintas, Ilkay.

Procedia Comput Sci ; 80: 673-679, 2016.

Artigo em Inglês | MEDLINE | ID: mdl-28232853

RESUMO

Modern web technologies facilitate the creation of high-quality data visualizations, and rich, interactive components across a wide variety of devices. Scientific workflow systems can greatly benefit from these technologies by giving scientists a better understanding of their data or model leading to new insights. While several projects have enabled web access to scientific workflow systems, they are primarily organized as a large portal server encapsulating the workflow engine. In this vision paper, we propose the design for Kepler WebView, a lightweight framework that integrates web technologies with the Kepler Scientific Workflow System. By embedding a web server in the Kepler process, Kepler WebView enables a wide variety of usage scenarios that would be difficult or impossible using the portal model.

20.

Scientific workflows for bibliometrics.

Guler, Arzu Tugce; Waaijer, Cathelijn J F; Palmblad, Magnus.

Scientometrics ; 107: 385-398, 2016.

Artigo em Inglês | MEDLINE | ID: mdl-27122644

RESUMO

Scientific workflows organize the assembly of specialized software into an overall data flow and are particularly well suited for multi-step analyses using different types of software tools. They are also favorable in terms of reusability, as previously designed workflows could be made publicly available through the myExperiment community and then used in other workflows. We here illustrate how scientific workflows and the Taverna workbench in particular can be used in bibliometrics. We discuss the specific capabilities of Taverna that makes this software a powerful tool in this field, such as automated data import via Web services, data extraction from XML by XPaths, and statistical analysis and visualization with R. The support of the latter is particularly relevant, as it allows integration of a number of recently developed R packages specifically for bibliometrics. Examples are used to illustrate the possibilities of Taverna in the fields of bibliometrics and scientometrics.

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

Assuntos

RESUMO

RESUMO

Assuntos

RESUMO

RESUMO

RESUMO

RESUMO

ENVIAR RESULTADO:

SELEÇÃO DE REFERÊNCIAS

DETALHE DA PESQUISA