|

1.

Artificial intelligence based data curation: enabling a patient-centric European health data space.

de Zegher, Isabelle; Norak, Kerli; Steiger, Dominik; Müller, Heimo; Kalra, Dipak; Scheenstra, Bart; Cina, Isabella; Shulz, Stefan; Uma, Kanimozhi; Kalendralis, Petros; Lotmam, Eno-Martin; Benedikt, Martin; Dumontier, Michel; Celebi, Remzi.

Front Med (Lausanne) ; 11: 1365501, 2024.

Article En | MEDLINE | ID: mdl-38813389

The emerging European Health Data Space (EHDS) Regulation opens new prospects for large-scale sharing and re-use of health data. Yet, the proposed regulation suffers from two important limitations: it is designed to benefit the whole population with limited consideration for individuals, and the generation of secondary datasets from heterogeneous, unlinked patient data will remain burdensome. AIDAVA, a Horizon Europe project that started in September 2022, proposes to address both shortcomings by providing patients with an AI-based virtual assistant that maximises automation in the integration and transformation of their health data into an interoperable, longitudinal health record. This personal record can then be used to inform patient-related decisions at the point of care, whether this is the usual point of care or a possible cross-border point of care. The personal record can also be used to generate population datasets for research and policymaking. The proposed solution will enable a much-needed paradigm shift in health data management, implementing a 'curate once at patient level, use many times' approach, primarily for the benefit of patients and their care providers, but also for more efficient generation of high-quality secondary datasets. After 15 months, the project shows promising preliminary results in achieving automation in the integration and transformation of heterogeneous data of each individual patient, once the content of the data sources managed by the data holders has been formally described. Additionally, the conceptualization phase of the project identified a set of recommendations for the development of a patient-centric EHDS, significantly facilitating the generation of data for secondary use.

2.

Generating synthetic personal health data using conditional generative adversarial networks combining with differential privacy.

Sun, Chang; van Soest, Johan; Dumontier, Michel.

J Biomed Inform ; 143: 104404, 2023 07.

Article En | MEDLINE | ID: mdl-37268168

A large amount of personal health data that is highly valuable to the scientific community is still not accessible or requires a lengthy request process due to privacy concerns and legal restrictions. As a solution, synthetic data has been studied and proposed to be a promising alternative to this issue. However, generating realistic and privacy-preserving synthetic personal health data retains challenges such as simulating the characteristics of the patients' data that are in the minority classes, capturing the relations among variables in imbalanced data and transferring them to the synthetic data, and preserving individual patients' privacy. In this paper, we propose a differentially private conditional Generative Adversarial Network model (DP-CGANS) consisting of data transformation, sampling, conditioning, and network training to generate realistic and privacy-preserving personal data. Our model distinguishes categorical and continuous variables and transforms them into latent space separately for better training performance. We tackle the unique challenges of generating synthetic patient data due to the special data characteristics of personal health data. For example, patients with a certain disease are typically the minority in the dataset and the relations among variables are crucial to be observed. Our model is structured with a conditional vector as an additional input to present the minority class in the imbalanced data and maximally capture the dependency between variables. Moreover, we inject statistical noise into the gradients in the networking training process of DP-CGANS to provide a differential privacy guarantee. We extensively evaluate our model with state-of-the-art generative models on personal socio-economic datasets and real-world personal health datasets in terms of statistical similarity, machine learning performance, and privacy measurement. We demonstrate that our model outperforms other comparable models, especially in capturing the dependence between variables. Finally, we present the balance between data utility and privacy in synthetic data generation considering the different data structures and characteristics of real-world personal health data such as imbalanced classes, abnormal distributions, and data sparsity.

Machine Learning , Privacy , Humans , Minority Groups

3.

What do we mean with sound semantics, exactly? A survey of taxonomies and ontologies of everyday sounds.

Giordano, Bruno L; de Miranda Azevedo, Ricardo; Plasencia-Calaña, Yenisel; Formisano, Elia; Dumontier, Michel.

Front Psychol ; 13: 964209, 2022.

Article En | MEDLINE | ID: mdl-36312201

Taxonomies and ontologies for the characterization of everyday sounds have been developed in several research fields, including auditory cognition, soundscape research, artificial hearing, sound design, and medicine. Here, we surveyed 36 of such knowledge organization systems, which we identified through a systematic literature search. To evaluate the semantic domains covered by these systems within a homogeneous framework, we introduced a comprehensive set of verbal sound descriptors (sound source properties; attributes of sensation; sound signal descriptors; onomatopoeias; music genres), which we used to manually label the surveyed descriptor classes. We reveal that most taxonomies and ontologies were developed to characterize higher-level semantic relations between sound sources in terms of the sound-generating objects and actions involved (what/how), or in terms of the environmental context (where). This indicates the current lack of a comprehensive ontology of everyday sounds that covers simultaneously all semantic aspects of the relation between sounds. Such an ontology may have a wide range of applications and purposes, ranging from extending our scientific knowledge of auditory processes in the real world, to developing artificial hearing systems.

4.

Studying the association of diabetes and healthcare cost on distributed data from the Maastricht Study and Statistics Netherlands using a privacy-preserving federated learning infrastructure.

Sun, Chang; van Soest, Johan; Koster, Annemarie; Eussen, Simone J P M; Schram, Miranda T; Stehouwer, Coen D A; Dagnelie, Pieter C; Dumontier, Michel.

J Biomed Inform ; 134: 104194, 2022 10.

Article En | MEDLINE | ID: mdl-36064113

The mining of personal data collected by multiple organizations remains challenging in the presence of technical barriers, privacy concerns, and legal and/or organizational restrictions. While a number of privacy-preserving and data mining frameworks have recently emerged, much remains to show their practical utility. In this study, we implement and utilize a secure infrastructure using data from Statistics Netherlands and the Maastricht Study to learn the association between Type 2 Diabetes Mellitus (T2DM) and healthcare expenses considering the impact of lifestyle, physical activities, and complications of T2DM. Through experiments using real-world distributed personal data, we present the feasibility and effectiveness of the secure infrastructure for practical use cases of linking and analyzing vertically partitioned data across multiple organizations. We discovered that individuals diagnosed with T2DM had significantly higher expenses than those with prediabetes, while participants with prediabetes spent more than those without T2DM in all the included healthcare categories to different degrees. We further discuss a joint effort from technical, ethical-legal, and domain-specific experts that is highly valued for applying such a secure infrastructure to real-life use cases to protect data privacy.

Diabetes Mellitus, Type 2 , Prediabetic State , Diabetes Mellitus, Type 2/therapy , Health Care Costs , Humans , Netherlands , Privacy

5.

Biolink Model: A universal schema for knowledge graphs in clinical, biomedical, and translational science.

Unni, Deepak R; Moxon, Sierra A T; Bada, Michael; Brush, Matthew; Bruskiewich, Richard; Caufield, J Harry; Clemons, Paul A; Dancik, Vlado; Dumontier, Michel; Fecho, Karamarie; Glusman, Gustavo; Hadlock, Jennifer J; Harris, Nomi L; Joshi, Arpita; Putman, Tim; Qin, Guangrong; Ramsey, Stephen A; Shefchek, Kent A; Solbrig, Harold; Soman, Karthik; Thessen, Anne E; Haendel, Melissa A; Bizon, Chris; Mungall, Christopher J.

Clin Transl Sci ; 15(8): 1848-1855, 2022 08.

Article En | MEDLINE | ID: mdl-36125173

Within clinical, biomedical, and translational science, an increasing number of projects are adopting graphs for knowledge representation. Graph-based data models elucidate the interconnectedness among core biomedical concepts, enable data structures to be easily updated, and support intuitive queries, visualizations, and inference algorithms. However, knowledge discovery across these "knowledge graphs" (KGs) has remained difficult. Data set heterogeneity and complexity; the proliferation of ad hoc data formats; poor compliance with guidelines on findability, accessibility, interoperability, and reusability; and, in particular, the lack of a universally accepted, open-access model for standardization across biomedical KGs has left the task of reconciling data sources to downstream consumers. Biolink Model is an open-source data model that can be used to formalize the relationships between data structures in translational science. It incorporates object-oriented classification and graph-oriented features. The core of the model is a set of hierarchical, interconnected classes (or categories) and relationships between them (or predicates) representing biomedical entities such as gene, disease, chemical, anatomic structure, and phenotype. The model provides class and edge attributes and associations that guide how entities should relate to one another. Here, we highlight the need for a standardized data model for KGs, describe Biolink Model, and compare it with other models. We demonstrate the utility of Biolink Model in various initiatives, including the Biomedical Data Translator Consortium and the Monarch Initiative, and show how it has supported easier integration and interoperability of biomedical KGs, bringing together knowledge from multiple sources and helping to realize the goals of translational science.

Pattern Recognition, Automated , Translational Science, Biomedical , Knowledge

6.

Progress toward a universal biomedical data translator.

Fecho, Karamarie; Thessen, Anne E; Baranzini, Sergio E; Bizon, Chris; Hadlock, Jennifer J; Huang, Sui; Roper, Ryan T; Southall, Noel; Ta, Casey; Watkins, Paul B; Williams, Mark D; Xu, Hao; Byrd, William; Dancík, Vlado; Duby, Marc P; Dumontier, Michel; Glusman, Gustavo; Harris, Nomi L; Hinderer, Eugene W; Hyde, Greg; Johs, Adam; Su, Andrew I; Qin, Guangrong; Zhu, Qian.

Clin Transl Sci ; 2022 May 25.

Article En | MEDLINE | ID: mdl-35611543

Clinical, biomedical, and translational science has reached an inflection point in the breadth and diversity of available data and the potential impact of such data to improve human health and well-being. However, the data are often siloed, disorganized, and not broadly accessible due to discipline-specific differences in terminology and representation. To address these challenges, the Biomedical Data Translator Consortium has developed and tested a pilot knowledge graph-based "Translator" system capable of integrating existing biomedical data sets and "translating" those data into insights intended to augment human reasoning and accelerate translational science. Having demonstrated feasibility of the Translator system, the Translator program has since moved into development, and the Translator Consortium has made significant progress in the research, design, and implementation of an operational system. Herein, we describe the current system's architecture, performance, and quality of results. We apply Translator to several real-world use cases developed in collaboration with subject-matter experts. Finally, we discuss the scientific and technical features of Translator and compare those features to other state-of-the-art, biomedical graph-based question-answering systems.

7.

BioSimulators: a central registry of simulation engines and services for recommending specific tools.

Shaikh, Bilal; Smith, Lucian P; Vasilescu, Dan; Marupilla, Gnaneswara; Wilson, Michael; Agmon, Eran; Agnew, Henry; Andrews, Steven S; Anwar, Azraf; Beber, Moritz E; Bergmann, Frank T; Brooks, David; Brusch, Lutz; Calzone, Laurence; Choi, Kiri; Cooper, Joshua; Detloff, John; Drawert, Brian; Dumontier, Michel; Ermentrout, G Bard; Faeder, James R; Freiburger, Andrew P; Fröhlich, Fabian; Funahashi, Akira; Garny, Alan; Gennari, John H; Gleeson, Padraig; Goelzer, Anne; Haiman, Zachary; Hasenauer, Jan; Hellerstein, Joseph L; Hermjakob, Henning; Hoops, Stefan; Ison, Jon C; Jahn, Diego; Jakubowski, Henry V; Jordan, Ryann; Kalas, Matús; König, Matthias; Liebermeister, Wolfram; Sheriff, Rahuman S Malik; Mandal, Synchon; McDougal, Robert; Medley, J Kyle; Mendes, Pedro; Müller, Robert; Myers, Chris J; Naldi, Aurelien; Nguyen, Tung V N; Nickerson, David P.

Nucleic Acids Res ; 50(W1): W108-W114, 2022 07 05.

Article En | MEDLINE | ID: mdl-35524558

Computational models have great potential to accelerate bioscience, bioengineering, and medicine. However, it remains challenging to reproduce and reuse simulations, in part, because the numerous formats and methods for simulating various subsystems and scales remain siloed by different software tools. For example, each tool must be executed through a distinct interface. To help investigators find and use simulation tools, we developed BioSimulators (https://biosimulators.org), a central registry of the capabilities of simulation tools and consistent Python, command-line and containerized interfaces to each version of each tool. The foundation of BioSimulators is standards, such as CellML, SBML, SED-ML and the COMBINE archive format, and validation tools for simulation projects and simulation tools that ensure these standards are used consistently. To help modelers find tools for particular projects, we have also used the registry to develop recommendation services. We anticipate that BioSimulators will help modelers exchange, reproduce, and combine simulations.

Computer Simulation , Software , Humans , Bioengineering , Models, Biological , Registries , Research Personnel

8.

Semantic modelling of common data elements for rare disease registries, and a prototype workflow for their deployment over registry data.

Kaliyaperumal, Rajaram; Wilkinson, Mark D; Moreno, Pablo Alarcón; Benis, Nirupama; Cornet, Ronald; Dos Santos Vieira, Bruna; Dumontier, Michel; Bernabé, César Henrique; Jacobsen, Annika; Le Cornec, Clémence M A; Godoy, Mario Prieto; Queralt-Rosinach, Núria; Schultze Kool, Leo J; Swertz, Morris A; van Damme, Philip; van der Velde, K Joeri; Lalout, Nawel; Zhang, Shuxin; Roos, Marco.

J Biomed Semantics ; 13(1): 9, 2022 03 15.

Article En | MEDLINE | ID: mdl-35292119

BACKGROUND: The European Platform on Rare Disease Registration (EU RD Platform) aims to address the fragmentation of European rare disease (RD) patient data, scattered among hundreds of independent and non-coordinating registries, by establishing standards for integration and interoperability. The first practical output of this effort was a set of 16 Common Data Elements (CDEs) that should be implemented by all RD registries. Interoperability, however, requires decisions beyond data elements - including data models, formats, and semantics. Within the European Joint Programme on Rare Diseases (EJP RD), we aim to further the goals of the EU RD Platform by generating reusable RD semantic model templates that follow the FAIR Data Principles. RESULTS: Through a team-based iterative approach, we created semantically grounded models to represent each of the CDEs, using the SemanticScience Integrated Ontology as the core framework for representing the entities and their relationships. Within that framework, we mapped the concepts represented in the CDEs, and their possible values, into domain ontologies such as the Orphanet Rare Disease Ontology, Human Phenotype Ontology and National Cancer Institute Thesaurus. Finally, we created an exemplar, reusable ETL pipeline that we will be deploying over these non-coordinating data repositories to assist them in creating model-compliant FAIR data without requiring site-specific coding nor expertise in Linked Data or FAIR. CONCLUSIONS: Within the EJP RD project, we determined that creating reusable, expert-designed templates reduced or eliminated the requirement for our participating biomedical domain experts and rare disease data hosts to understand OWL semantics. This enabled them to publish highly expressive FAIR data using tools and approaches that were already familiar to them.

Common Data Elements , Rare Diseases , Humans , Registries , Semantics , Workflow

9.

Prediction of illness remission in patients with Obsessive-Compulsive Disorder with supervised machine learning.

Grassi, Massimiliano; Rickelt, Judith; Caldirola, Daniela; Eikelenboom, Merijn; van Oppen, Patricia; Dumontier, Michel; Perna, Giampaolo; Schruers, Koen.

J Affect Disord ; 296: 117-125, 2022 01 01.

Article En | MEDLINE | ID: mdl-34600172

INTRODUCTION: The course of OCD differs widely among OCD patients, varying from chronic symptoms to full remission. No tools for individual prediction of OCD remission are currently available. This study aimed to develop a machine learning algorithm to predict OCD remission after two years, using solely predictors easily accessible in the daily clinical routine. METHODS: Subjects were recruited in a longitudinal multi-center study (NOCDA). Gradient boosted decision trees were used as supervised machine learning technique. The training of the algorithm was performed with 227 predictors and 213 observations collected in a single clinical center. Hyper-parameter optimization was performed with cross-validation and a Bayesian optimization strategy. The predictive performance of the algorithm was subsequently tested in an independent sample of 215 observations collected in five different centers. Between-center differences were investigated with a bootstrap resampling approach. RESULTS: The average predictive performance of the algorithm in the test centers resulted in an AUROC of 0.7820, a sensitivity of 73.42%, and a specificity of 71.45%. Results also showed a significant between-center variation in the predictive performance. The most important predictors resulted related to OCD severity, OCD chronic course, use of psychotropic medications, and better global functioning. LIMITATIONS: All recruiting centers followed the same assessment protocol and are in The Netherlands. Moreover, the sample of the data recruited in some of the test centers was limited in size. DISCUSSION: The algorithm demonstrated a moderate average predictive performance, and future studies will focus on increasing the stability of the predictive performance across clinical settings.

Obsessive-Compulsive Disorder , Bayes Theorem , Humans , Machine Learning , Obsessive-Compulsive Disorder/therapy , Remission Induction , Supervised Machine Learning

10.

Relation extraction from DailyMed structured product labels by optimally combining crowd, experts and machines.

Shingjergji, Krist; Celebi, Remzi; Scholtes, Jan; Dumontier, Michel.

J Biomed Inform ; 122: 103902, 2021 10.

Article En | MEDLINE | ID: mdl-34481057

The effectiveness of machine learning models to provide accurate and consistent results in drug discovery and clinical decision support is strongly dependent on the quality of the data used. However, substantive amounts of open data that drive drug discovery suffer from a number of issues including inconsistent representation, inaccurate reporting, and incomplete context. For example, databases of FDA-approved drug indications used in computational drug repositioning studies do not distinguish between treatments that simply offer symptomatic relief from those that target the underlying pathology. Moreover, drug indication sources often lack proper provenance and have little overlap. Consequently, new predictions can be of poor quality as they offer little in the way of new insights. Hence, work remains to be done to establish higher quality databases of drug indications that are suitable for use in drug discovery and repositioning studies. Here, we report on the combination of weak supervision (i.e., programmatic labeling and crowdsourcing) and deep learning methods for relation extraction from DailyMed text to create a higher quality drug-disease relation dataset. The generated drug-disease relation data shows a high overlap with DrugCentral, a manually curated dataset. Using this dataset, we constructed a machine learning model to classify relations between drugs and diseases from text into four categories; treatment, symptomatic relief, contradiction, and effect, exhibiting an improvement of 15.5% with Bi-LSTM (F1 score of 71.8%) over the best performing discrete method. Access to high quality data is crucial to building accurate and reliable drug repurposing prediction models. Our work suggests how the combination of crowds, experts, and machine learning methods can go hand-in-hand to improve datasets and predictive models.

Crowdsourcing , Machine Learning , Drug Repositioning

11.

Privacy preserving distributed learning classifiers - Sequential learning with small sets of data.

Zerka, Fadila; Urovi, Visara; Bottari, Fabio; Leijenaar, Ralph T H; Walsh, Sean; Gabrani-Juma, Hanif; Gueuning, Martin; Vaidyanathan, Akshayaa; Vos, Wim; Occhipinti, Mariaelena; Woodruff, Henry C; Dumontier, Michel; Lambin, Philippe.

Comput Biol Med ; 136: 104716, 2021 09.

Article En | MEDLINE | ID: mdl-34364262

BACKGROUND: Artificial intelligence (AI) typically requires a significant amount of high-quality data to build reliable models, where gathering enough data within a single institution can be particularly challenging. In this study we investigated the impact of using sequential learning to exploit very small, siloed sets of clinical and imaging data to train AI models. Furthermore, we evaluated the capacity of such models to achieve equivalent performance when compared to models trained with the same data over a single centralized database. METHODS: We propose a privacy preserving distributed learning framework, learning sequentially from each dataset. The framework is applied to three machine learning algorithms: Logistic Regression, Support Vector Machines (SVM), and Perceptron. The models were evaluated using four open-source datasets (Breast cancer, Indian liver, NSCLC-Radiomics dataset, and Stage III NSCLC). FINDINGS: The proposed framework ensured a comparable predictive performance against a centralized learning approach. Pairwise DeLong tests showed no significant difference between the compared pairs for each dataset. INTERPRETATION: Distributed learning contributes to preserve medical data privacy. We foresee this technology will increase the number of collaborative opportunities to develop robust AI, becoming the default solution in scenarios where collecting enough data from a single reliable source is logistically impossible. Distributed sequential learning provides privacy persevering means for institutions with small but clinically valuable datasets to collaboratively train predictive AI while preserving the privacy of their patients. Such models perform similarly to models that are built on a larger central dataset.

Artificial Intelligence , Privacy , Algorithms , Humans , Machine Learning , Neural Networks, Computer

12.

Semantic micro-contributions with decentralized nanopublication services.

Kuhn, Tobias; Taelman, Ruben; Emonet, Vincent; Antonatos, Haris; Soiland-Reyes, Stian; Dumontier, Michel.

PeerJ Comput Sci ; 7: e387, 2021.

Article En | MEDLINE | ID: mdl-33817033

While the publication of Linked Data has become increasingly common, the process tends to be a relatively complicated and heavy-weight one. Linked Data is typically published by centralized entities in the form of larger dataset releases, which has the downside that there is a central bottleneck in the form of the organization or individual responsible for the releases. Moreover, certain kinds of data entries, in particular those with subjective or original content, currently do not fit into any existing dataset and are therefore more difficult to publish. To address these problems, we present here an approach to use nanopublications and a decentralized network of services to allow users to directly publish small Linked Data statements through a simple and user-friendly interface, called Nanobench, powered by semantic templates that are themselves published as nanopublications. The published nanopublications are cryptographically verifiable and can be queried through a redundant and decentralized network of services, based on the grlc API generator and a new quad extension of Triple Pattern Fragments. We show here that these two kinds of services are complementary and together allow us to query nanopublications in a reliable and efficient manner. We also show that Nanobench makes it indeed very easy for users to publish Linked Data statements, even for those who have no prior experience in Linked Data publishing.

13.

SIENA: Semi-automatic semantic enhancement of datasets using concept recognition.

Grigoriu, Andreea; Zaveri, Amrapali; Weiss, Gerhard; Dumontier, Michel.

J Biomed Semantics ; 12(1): 5, 2021 03 24.

Article En | MEDLINE | ID: mdl-33761996

BACKGROUND: The amount of available data, which can facilitate answering scientific research questions, is growing. However, the different formats of published data are expanding as well, creating a serious challenge when multiple datasets need to be integrated for answering a question. RESULTS: This paper presents a semi-automated framework that provides semantic enhancement of biomedical data, specifically gene datasets. The framework involved a concept recognition task using machine learning, in combination with the BioPortal annotator. Compared to using methods which require only the BioPortal annotator for semantic enhancement, the proposed framework achieves the highest results. CONCLUSIONS: Using concept recognition combined with machine learning techniques and annotation with a biomedical ontology, the proposed framework can provide datasets to reach their full potential of providing meaningful information, which can answer scientific research questions.

Biological Ontologies , Semantics , Machine Learning

14.

InContext: curation of medical context for drug indications.

Moodley, Kody; Rieswijk, Linda; Oprea, Tudor I; Dumontier, Michel.

J Biomed Semantics ; 12(1): 2, 2021 02 12.

Article En | MEDLINE | ID: mdl-33579375

Accurate and precise information about the therapeutic uses (indications) of a drug is essential for applications in drug repurposing and precision medicine. Leading online drug resources such as DrugCentral and DrugBank provide rich information about various properties of drugs, including their indications. However, because indications in such databases are often partly automatically mined, some may prove to be inaccurate or imprecise. Particularly challenging for text mining methods is the task of distinguishing between general disease mentions in drug product labels and actual indications for the drug. For this, the qualifying medical context of the disease mentions in the text should be studied. Some examples include contraindications, co-prescribed drugs and target patient qualifications. No existing indication curation efforts attempt to capture such information in a precise way. Here we fill this gap by presenting a novel curation protocol for extracting indications and machine processable annotations of contextual information about the therapeutic use of a drug. We implemented the protocol on a reference set of FDA-approved drug product labels on the DailyMed website to curate indications for 150 anti-cancer and cardiovascular drugs. The resulting corpus - InContext - focuses on anti-cancer and cardiovascular drugs because of the heightened societal interest in cancer and heart disease. In order to understand how InContext relates with existing reputable drug indication databases, we analysed it's overlap with a state-of-the-art indications database - LabeledIn - as well as a reputable online drug compendium - DrugCentral. We found that 40% of indications sampled from DrugCentral (and 23% from LabeledIn) respectively, could not be accounted for in InContext. This raises questions about the veracity of indications not appearing in InContext. The additional contextual information curated by InContext about disease mentions in drug SPLs provides a foundation for more precise, structured and formal representations of knowledge related to drug therapeutic use, in order to increase accuracy and agreement of drug indication extraction methods for in silico drug repurposing.

Data Mining , Pharmaceutical Preparations , Databases, Pharmaceutical , Humans , Precision Medicine

15.

Finding the Evidence Base Using Citation Networks: Do 300 to 400 US Physicians Die by Suicide Annually?

Leung, Tiffany I; Pendharkar, Sima; Chen, Chwen-Yuen Angie; Dumontier, Michel.

J Gen Intern Med ; 36(4): 1129-1131, 2021 04.

Article En | MEDLINE | ID: mdl-32462565

Physicians , Suicide Prevention , Humans

16.

Ten simple rules for making training materials FAIR.

Garcia, Leyla; Batut, Bérénice; Burke, Melissa L; Kuzak, Mateusz; Psomopoulos, Fotis; Arcila, Ricardo; Attwood, Teresa K; Beard, Niall; Carvalho-Silva, Denise; Dimopoulos, Alexandros C; Del Angel, Victoria Dominguez; Dumontier, Michel; Gurwitz, Kim T; Krause, Roland; McQuilton, Peter; Le Pera, Loredana; Morgan, Sarah L; Rauste, Päivi; Via, Allegra; Kahlem, Pascal; Rustici, Gabriella; van Gelder, Celia W G; Palagi, Patricia M.

PLoS Comput Biol ; 16(5): e1007854, 2020 05.

Article En | MEDLINE | ID: mdl-32437350

Everything we do today is becoming more and more reliant on the use of computers. The field of biology is no exception; but most biologists receive little or no formal preparation for the increasingly computational aspects of their discipline. In consequence, informal training courses are often needed to plug the gaps; and the demand for such training is growing worldwide. To meet this demand, some training programs are being expanded, and new ones are being developed. Key to both scenarios is the creation of new course materials. Rather than starting from scratch, however, it's sometimes possible to repurpose materials that already exist. Yet finding suitable materials online can be difficult: They're often widely scattered across the internet or hidden in their home institutions, with no systematic way to find them. This is a common problem for all digital objects. The scientific community has attempted to address this issue by developing a set of rules (which have been called the Findable, Accessible, Interoperable and Reusable [FAIR] principles) to make such objects more findable and reusable. Here, we show how to apply these rules to help make training materials easier to find, (re)use, and adapt, for the benefit of all.

Computer-Assisted Instruction/standards , Guidelines as Topic , Biology/education , Computational Biology , Humans , Information Storage and Retrieval

17.

Towards FAIR protocols and workflows: the OpenPREDICT use case.

Celebi, Remzi; Rebelo Moreira, Joao; Hassan, Ahmed A; Ayyar, Sandeep; Ridder, Lars; Kuhn, Tobias; Dumontier, Michel.

PeerJ Comput Sci ; 6: e281, 2020.

Article En | MEDLINE | ID: mdl-33816932

It is essential for the advancement of science that researchers share, reuse and reproduce each other's workflows and protocols. The FAIR principles are a set of guidelines that aim to maximize the value and usefulness of research data, and emphasize the importance of making digital objects findable and reusable by others. The question of how to apply these principles not just to data but also to the workflows and protocols that consume and produce them is still under debate and poses a number of challenges. In this paper we describe a two-fold approach of simultaneously applying the FAIR principles to scientific workflows as well as the involved data. We apply and evaluate our approach on the case of the PREDICT workflow, a highly cited drug repurposing workflow. This includes FAIRification of the involved datasets, as well as applying semantic technologies to represent and store data about the detailed versions of the general protocol, of the concrete workflow instructions, and of their execution traces. We propose a semantic model to address these specific requirements and was evaluated by answering competency questions. This semantic model consists of classes and relations from a number of existing ontologies, including Workflow4ever, PROV, EDAM, and BPMN. This allowed us then to formulate and answer new kinds of competency questions. Our evaluation shows the high degree to which our FAIRified OpenPREDICT workflow now adheres to the FAIR principles and the practicality and usefulness of being able to answer our new competency questions.

18.

A Minimal Information Model for Potential Drug-Drug Interactions.

Hochheiser, Harry; Jing, Xia; Garcia, Elizabeth A; Ayvaz, Serkan; Sahay, Ratnesh; Dumontier, Michel; Banda, Juan M; Beyan, Oya; Brochhausen, Mathias; Draper, Evan; Habiel, Sam; Hassanzadeh, Oktie; Herrero-Zazo, Maria; Hocum, Brian; Horn, John; LeBaron, Brian; Malone, Daniel C; Nytrø, Øystein; Reese, Thomas; Romagnoli, Katrina; Schneider, Jodi; Zhang, Louisa Yu; Boyce, Richard D.

Front Pharmacol ; 11: 608068, 2020.

Article En | MEDLINE | ID: mdl-33762928

Despite the significant health impacts of adverse events associated with drug-drug interactions, no standard models exist for managing and sharing evidence describing potential interactions between medications. Minimal information models have been used in other communities to establish community consensus around simple models capable of communicating useful information. This paper reports on a new minimal information model for describing potential drug-drug interactions. A task force of the Semantic Web in Health Care and Life Sciences Community Group of the World-Wide Web consortium engaged informaticians and drug-drug interaction experts in in-depth examination of recent literature and specific potential interactions. A consensus set of information items was identified, along with example descriptions of selected potential drug-drug interactions (PDDIs). User profiles and use cases were developed to demonstrate the applicability of the model. Ten core information items were identified: drugs involved, clinical consequences, seriousness, operational classification statement, recommended action, mechanism of interaction, contextual information/modifying factors, evidence about a suspected drug-drug interaction, frequency of exposure, and frequency of harm to exposed persons. Eight best practice recommendations suggest how PDDI knowledge artifact creators can best use the 10 information items when synthesizing drug interaction evidence into artifacts intended to aid clinicians. This model has been included in a proposed implementation guide developed by the HL7 Clinical Decision Support Workgroup and in PDDIs published in the CDS Connect repository. The complete description of the model can be found at https://w3id.org/hclscg/pddi.

19.

Evaluation of knowledge graph embedding approaches for drug-drug interaction prediction in realistic settings.

Celebi, Remzi; Uyar, Huseyin; Yasar, Erkan; Gumus, Ozgur; Dikenelli, Oguz; Dumontier, Michel.

BMC Bioinformatics ; 20(1): 726, 2019 Dec 18.

Article En | MEDLINE | ID: mdl-31852427

BACKGROUND: Current approaches to identifying drug-drug interactions (DDIs), include safety studies during drug development and post-marketing surveillance after approval, offer important opportunities to identify potential safety issues, but are unable to provide complete set of all possible DDIs. Thus, the drug discovery researchers and healthcare professionals might not be fully aware of potentially dangerous DDIs. Predicting potential drug-drug interaction helps reduce unanticipated drug interactions and drug development costs and optimizes the drug design process. Methods for prediction of DDIs have the tendency to report high accuracy but still have little impact on translational research due to systematic biases induced by networked/paired data. In this work, we aimed to present realistic evaluation settings to predict DDIs using knowledge graph embeddings. We propose a simple disjoint cross-validation scheme to evaluate drug-drug interaction predictions for the scenarios where the drugs have no known DDIs. RESULTS: We designed different evaluation settings to accurately assess the performance for predicting DDIs. The settings for disjoint cross-validation produced lower performance scores, as expected, but still were good at predicting the drug interactions. We have applied Logistic Regression, Naive Bayes and Random Forest on DrugBank knowledge graph with the 10-fold traditional cross validation using RDF2Vec, TransE and TransD. RDF2Vec with Skip-Gram generally surpasses other embedding methods. We also tested RDF2Vec on various drug knowledge graphs such as DrugBank, PharmGKB and KEGG to predict unknown drug-drug interactions. The performance was not enhanced significantly when an integrated knowledge graph including these three datasets was used. CONCLUSION: We showed that the knowledge embeddings are powerful predictors and comparable to current state-of-the-art methods for inferring new DDIs. We addressed the evaluation biases by introducing drug-wise and pairwise disjoint test classes. Although the performance scores for drug-wise and pairwise disjoint seem to be low, the results can be considered to be realistic in predicting the interactions for drugs with limited interaction information.

Drug Interactions , Bayes Theorem , Knowledge , Logistic Models , Pattern Recognition, Automated

20.

FAIRshake: Toolkit to Evaluate the FAIRness of Research Digital Resources.

Clarke, Daniel J B; Wang, Lily; Jones, Alex; Wojciechowicz, Megan L; Torre, Denis; Jagodnik, Kathleen M; Jenkins, Sherry L; McQuilton, Peter; Flamholz, Zachary; Silverstein, Moshe C; Schilder, Brian M; Robasky, Kimberly; Castillo, Claris; Idaszak, Ray; Ahalt, Stanley C; Williams, Jason; Schurer, Stephan; Cooper, Daniel J; de Miranda Azevedo, Ricardo; Klenk, Juergen A; Haendel, Melissa A; Nedzel, Jared; Avillach, Paul; Shimoyama, Mary E; Harris, Rayna M; Gamble, Meredith; Poten, Rudy; Charbonneau, Amanda L; Larkin, Jennie; Brown, C Titus; Bonazzi, Vivien R; Dumontier, Michel J; Sansone, Susanna-Assunta; Ma'ayan, Avi.

Cell Syst ; 9(5): 417-421, 2019 11 27.

Article En | MEDLINE | ID: mdl-31677972

As more digital resources are produced by the research community, it is becoming increasingly important to harmonize and organize them for synergistic utilization. The findable, accessible, interoperable, and reusable (FAIR) guiding principles have prompted many stakeholders to consider strategies for tackling this challenge. The FAIRshake toolkit was developed to enable the establishment of community-driven FAIR metrics and rubrics paired with manual and automated FAIR assessments. FAIR assessments are visualized as an insignia that can be embedded within digital-resources-hosting websites. Using FAIRshake, a variety of biomedical digital resources were manually and automatically evaluated for their level of FAIRness.

Information Dissemination/methods , Internet/trends , Online Systems/standards , Health Resources/standards , Humans