Results 1 - 7 of 7
1.
Brief Bioinform ; 20(4): 1477-1491, 2019 07 19.
Article in English | MEDLINE | ID: mdl-29579141

ABSTRACT

MOTIVATION: Searching for precise terms and terminological definitions in the biomedical data space is problematic, as researchers find overlapping, closely related and even equivalent concepts in a single ontology or across multiple ontologies. Search engines that retrieve ontological resources often suggest an extensive list of search results for a given input term, which leads to the tedious task of selecting the best-fit ontological resource (class or property) for the input term and reduces user confidence in the retrieval engines. A systematic evaluation of these search engines is necessary to understand their strengths and weaknesses under different search requirements. RESULTS: We have implemented seven comparable Information Retrieval ranking algorithms to search through ontologies and compared them against four search engines for ontologies. Free-text queries were performed, their outcomes were judged by experts, and the ranking algorithms and search engines were evaluated against the expert-based ground truth (GT). In addition, we propose a probabilistic GT, developed automatically, that provides deeper insight into and greater confidence in the expert-based GT and allows a broader range of search queries to be evaluated. CONCLUSION: The main outcome of this work is the identification of key search factors for biomedical ontologies, together with search requirements and a set of recommendations that will help biomedical experts and ontology engineers select the best-suited retrieval mechanism for their search scenarios. We expect that this evaluation will allow researchers and practitioners to apply the current search techniques more reliably and help them select the right solution for their daily work. AVAILABILITY: The source code (of the seven ranking algorithms), ground truths and experimental results are available at https://github.com/danielapoliveira/bioont-search-benchmark.


Subject(s)
Biological Ontologies/statistics & numerical data , Algorithms , Computational Biology , Expert Systems , Humans , Information Storage and Retrieval , Models, Statistical , Search Engine
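The seven ranking algorithms themselves live in the linked repository; as a rough illustration of the kind of Information Retrieval ranking being compared, here is a minimal TF-IDF ranker over ontology class labels. This is a generic sketch, not code from the benchmark; the tokenisation and scoring details are our own simplifications.

```python
import math
from collections import Counter

def tfidf_rank(query, docs):
    """Rank documents (e.g. ontology class labels) against a free-text
    query using a simple TF-IDF score with length normalisation."""
    n = len(docs)
    tokenized = [d.lower().split() for d in docs]
    # document frequency of each term across the label corpus
    df = Counter(t for toks in tokenized for t in set(toks))

    def idf(t):
        # smoothed inverse document frequency
        return math.log((n + 1) / (df.get(t, 0) + 1)) + 1

    q_terms = query.lower().split()
    scores = []
    for i, toks in enumerate(tokenized):
        tf = Counter(toks)
        score = sum(tf[t] * idf(t) ** 2 for t in q_terms)
        # normalise by label length so short exact labels win over long ones
        score /= math.sqrt(len(toks)) or 1
        scores.append((score, i))
    return [docs[i] for _, i in sorted(scores, reverse=True)]
```

For the query "heart", this ranks the exact label "heart" above longer labels that merely contain the term, which mirrors the best-fit selection problem the abstract describes.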
2.
Brief Bioinform ; 19(5): 1035-1050, 2018 09 28.
Article in English | MEDLINE | ID: mdl-28419324

ABSTRACT

Data workflow systems (DWFSs) enable bioinformatics researchers to combine components for data access and data analytics, and to share the final data analytics approach with their collaborators. Increasingly, such systems have to cope with large-scale data, such as full genomes (about 200 GB each), public fact repositories (about 100 TB of data) and 3D imaging data at even larger scales. As moving the data becomes cumbersome, the DWFS needs to embed its processes into a cloud infrastructure, where the data are already hosted. As standardized public data play an increasingly important role, the DWFS needs to comply with Semantic Web technologies. This advancement of DWFSs would reduce overhead costs and accelerate progress in bioinformatics research based on large-scale data and public resources, as researchers would require less specialized IT knowledge for the implementation. Furthermore, the high data growth rates in bioinformatics research drive the demand for parallel and distributed computing, which in turn imposes a need for scalability and high-throughput capabilities on the DWFS. As a result, requirements for data sharing and access to public knowledge bases suggest that compliance of the DWFS with Semantic Web standards is necessary. In this article, we analyze existing DWFSs with regard to their capabilities toward public open data use as well as large-scale computational and human interface requirements. We untangle the parameters for selecting a preferable solution for bioinformatics research, with particular consideration given to using cloud services and Semantic Web technologies. Our analysis leads to research guidelines and recommendations toward the development of future DWFSs for the bioinformatics research community.


Subject(s)
Cloud Computing , Computational Biology/methods , Workflow , Big Data , Data Interpretation, Statistical , Database Management Systems , Drug Discovery/statistics & numerical data , Genomics/statistics & numerical data , Humans , Information Dissemination , Knowledge Bases , Semantic Web/statistics & numerical data , User-Computer Interface
3.
BMC Bioinformatics ; 20(1): 462, 2019 Sep 09.
Article in English | MEDLINE | ID: mdl-31500564

ABSTRACT

BACKGROUND: Determining the association between a tumor sample and a gene is demanding because genetic experiments are costly to conduct, and a discovered association further requires clinical verification and validation. This entire process is time-consuming and expensive. For these reasons, predicting associations between tumor samples and genes remains a challenge in biomedicine. RESULTS: Here we present a computational model based on a heat diffusion algorithm that can predict associations between tumor samples and genes. We propose a 2-layered graph: in the first layer, tumor sample and gene nodes are connected by a "hasGene" relationship; in the second layer, gene nodes are connected by an "interaction" relationship. We applied the heat diffusion algorithm to nine variants of genetic interaction networks extracted from the STRING and BioGRID databases. The heat diffusion algorithm predicted links between tumor samples and genes with a mean AUC-ROC score of 0.84, obtained using weighted genetic interactions from the fusion or co-occurrence channels of the STRING database. For the unweighted genetic interactions from the BioGRID database, the algorithm predicts links with an AUC-ROC score of 0.74. CONCLUSIONS: We demonstrate that gene-gene interaction scores can improve the predictive power of the heat diffusion model for predicting links between tumor samples and genes. We show the efficient runtime of the heat diffusion algorithm on various genetic interaction networks and statistically validate the quality of the predicted links between tumor samples and genes.


Subject(s)
Algorithms , Genes, Neoplasm , Neoplasms/genetics , Area Under Curve , DNA Methylation/genetics , Databases, Factual , Diffusion , Epistasis, Genetic , Gene Regulatory Networks , Humans , ROC Curve , Reproducibility of Results
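The heat diffusion idea in this abstract can be sketched in a few lines: heat placed on a tumor-sample node spreads over the graph Laplacian, and the amount reaching each gene node ranks candidate links. The code below is a generic Euler-step approximation of the diffusion operator, not the authors' implementation; the toy graph, diffusion time and step count are illustrative choices.

```python
import numpy as np

def heat_diffusion(adj, seed, t=0.5, steps=50):
    """Approximate heat diffusion on a graph: integrate dh/dt = -L h
    (L = graph Laplacian) from the initial heat vector `seed` with
    small explicit Euler steps."""
    deg = adj.sum(axis=1)
    lap = np.diag(deg) - adj          # graph Laplacian
    h = seed.astype(float).copy()
    dt = t / steps
    for _ in range(steps):
        h = h - dt * lap @ h          # one Euler step of the heat equation
    return h

# Toy 2-layer graph: node 0 is a tumor sample linked to gene 1 ("hasGene"),
# genes 1 and 2 interact, gene 3 is isolated.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 0],
                [0, 0, 0, 0]], dtype=float)
heat = heat_diffusion(adj, np.array([1.0, 0.0, 0.0, 0.0]))
```

Gene 2 receives heat through its interaction with gene 1, while the isolated gene 3 receives none; ranking genes by received heat yields the link predictions the abstract evaluates with AUC-ROC.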
4.
Article in English | MEDLINE | ID: mdl-32750845

ABSTRACT

The study of genetic variants (GVs) can help find correlated population groups, identify cohorts predisposed to common diseases, and explain differences in disease susceptibility and in how patients react to drugs. Machine learning techniques are increasingly being applied to identify interacting GVs and understand their complex phenotypic traits. Since the performance of a learning algorithm depends not only on the size and nature of the data but also on the quality of the underlying representation, deep neural networks (DNNs) can learn non-linear mappings that transform GV data into representations friendlier to clustering and classification than manual feature selection. In this paper, we propose convolutional embedded networks (CEN), combining two DNN architectures, convolutional embedded clustering (CEC) and a convolutional autoencoder (CAE) classifier, for clustering individuals and predicting geographic ethnicity based on GVs, respectively. We applied CAE-based representation learning to 95 million GVs from the '1000 Genomes' (covering 2,504 individuals from 26 ethnic origins) and 'Simons Genome Diversity' (covering 279 individuals from 130 ethnic origins) projects. Quantitative and qualitative analyses focused on accuracy and scalability show that our approach outperforms state-of-the-art approaches such as VariantSpark and ADMIXTURE. In particular, CEC can cluster targeted population groups in 22 hours with an adjusted Rand index (ARI) of 0.915, a normalized mutual information (NMI) of 0.92, and a clustering accuracy (ACC) of 89 percent. The CAE classifier, in turn, can predict the geographic ethnicity of unknown samples with an F1 score of 0.9004 and a Matthews correlation coefficient (MCC) of 0.8245. Further, to provide interpretations of the predictions, we identify significant biomarkers using gradient boosted trees (GBT) and SHapley Additive exPlanations (SHAP). Overall, our approach is transparent and faster than the baseline methods, and scales from 5 to 100 percent of the full human genome.


Subject(s)
Machine Learning , Neural Networks, Computer , Algorithms , Cluster Analysis , Humans
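The CEC pipeline (representation learning followed by clustering) is far beyond a snippet, but its shape can be illustrated with a deliberately simplified stand-in: a truncated-SVD embedding in place of the convolutional autoencoder, followed by plain k-means with deterministic farthest-point initialisation. Everything here (function name, dimensions, initialisation scheme) is our own illustration, not the paper's code.

```python
import numpy as np

def embed_and_cluster(X, dim=2, k=2, iters=20):
    """Stand-in for an embed-then-cluster pipeline: project a genotype
    matrix X (samples x variants) to a low-dimensional embedding via
    SVD, then cluster the embedding with k-means."""
    Xc = X - X.mean(axis=0)
    # truncated SVD as a crude substitute for a learned representation
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    Z = U[:, :dim] * S[:dim]
    # deterministic farthest-point initialisation of the k centers
    centers = [Z[0]]
    while len(centers) < k:
        d = np.min([((Z - c) ** 2).sum(-1) for c in centers], axis=0)
        centers.append(Z[int(np.argmax(d))])
    centers = np.array(centers)
    # standard Lloyd iterations
    for _ in range(iters):
        labels = np.argmin(((Z[:, None] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = Z[labels == j].mean(axis=0)
    return labels

# Two synthetic "populations" with disjoint variant patterns.
X = np.vstack([np.tile([1.0, 1, 0, 0], (5, 1)),
               np.tile([0.0, 0, 1, 1], (5, 1))])
labels = embed_and_cluster(X)
```

On this toy input the two variant-pattern blocks land in separate clusters, which is the behaviour the paper quantifies at scale with ARI, NMI and ACC.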
5.
Front Pharmacol ; 11: 608068, 2020.
Article in English | MEDLINE | ID: mdl-33762928

ABSTRACT

Despite the significant health impacts of adverse events associated with drug-drug interactions, no standard models exist for managing and sharing evidence describing potential interactions between medications. Minimal information models have been used in other communities to establish community consensus around simple models capable of communicating useful information. This paper reports on a new minimal information model for describing potential drug-drug interactions. A task force of the Semantic Web in Health Care and Life Sciences Community Group of the World Wide Web Consortium engaged informaticians and drug-drug interaction experts in an in-depth examination of recent literature and specific potential interactions. A consensus set of information items was identified, along with example descriptions of selected potential drug-drug interactions (PDDIs). User profiles and use cases were developed to demonstrate the applicability of the model. Ten core information items were identified: drugs involved, clinical consequences, seriousness, operational classification statement, recommended action, mechanism of interaction, contextual information/modifying factors, evidence about a suspected drug-drug interaction, frequency of exposure, and frequency of harm to exposed persons. Eight best-practice recommendations suggest how PDDI knowledge artifact creators can best use the 10 information items when synthesizing drug interaction evidence into artifacts intended to aid clinicians. This model has been included in a proposed implementation guide developed by the HL7 Clinical Decision Support Workgroup and in PDDIs published in the CDS Connect repository. The complete description of the model can be found at https://w3id.org/hclscg/pddi.
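The ten core information items lend themselves to a simple record type. The sketch below is our own shorthand rendering, not an official schema from the community group; the field names and the example interaction are illustrative only (the full model is at the URL above).

```python
from dataclasses import dataclass, fields

@dataclass
class PDDIDescription:
    """One record per potential drug-drug interaction, with one field
    per core information item named in the abstract (our shorthand)."""
    drugs_involved: tuple
    clinical_consequences: str
    seriousness: str
    operational_classification: str
    recommended_action: str
    mechanism: str
    # the last four items are often unknown, so they default to empty
    modifying_factors: str = ""
    evidence: str = ""
    frequency_of_exposure: str = ""
    frequency_of_harm: str = ""

# Illustrative instance (a commonly cited interaction, simplified):
example = PDDIDescription(
    drugs_involved=("warfarin", "aspirin"),
    clinical_consequences="increased bleeding risk",
    seriousness="serious",
    operational_classification="use only if benefit outweighs risk",
    recommended_action="monitor for signs of bleeding",
    mechanism="additive anticoagulant/antiplatelet effect",
)
```

A record type like this makes the "minimal" aspect concrete: an artifact is complete when the six required items are filled, and the remaining four capture evidence and frequency context when available.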

6.
J Biomed Semantics ; 8(1): 5, 2017 Feb 01.
Article in English | MEDLINE | ID: mdl-28148277

ABSTRACT

BACKGROUND: Several query federation engines have been proposed for accessing public Linked Open Data sources. However, in many domains resources are sensitive and access to them is tightly controlled by stakeholders; consequently, privacy is a major concern when federating queries over such datasets. In the Healthcare and Life Sciences (HCLS) domain, real-world datasets contain sensitive statistical information: strict ownership is granted to individuals working in hospitals, research labs, clinical trial organisers, etc. Therefore, the legal and ethical concerns of (i) preserving the anonymity of patients (or clinical subjects) and (ii) respecting data ownership through access control are key challenges faced by the data analytics community working within the HCLS domain. Likewise, statistical data play a key role in the domain, where the RDF Data Cube Vocabulary has been proposed as a standard format to enable the exchange of such data. However, to the best of our knowledge, no existing approach has looked to optimise federated queries over such statistical data. RESULTS: We present SAFE: a query federation engine that enables policy-aware access to sensitive statistical datasets represented as RDF data cubes. SAFE is designed specifically to query statistical RDF data cubes in a distributed setting, where access control is coupled with source selection, user profiles and their access rights. SAFE proposes a join-aware source selection method that avoids wasteful requests to irrelevant and unauthorised data sources. In order to preserve anonymity and enforce stricter access control, SAFE's indexing system does not hold any data instances: it stores only predicates and endpoints. The resulting data summary has a significantly lower index generation time and size compared to existing engines, which allows for faster updates when sources change. CONCLUSIONS: We validate the performance of the system with experiments over real-world datasets provided by three clinical organisations as well as legacy linked datasets. We show that SAFE enables granular graph-level access control over distributed clinical RDF data cubes and efficiently reduces source selection and overall query execution time when compared with general-purpose SPARQL query federation engines in the targeted setting.


Subject(s)
Biological Ontologies , Information Storage and Retrieval , Algorithms , Confidentiality , Databases, Factual , Humans , Software
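SAFE's instance-free index and policy-aware source selection can be sketched as a small filter: an endpoint is consulted for a triple pattern only if its advertised predicates match and the user is authorised for it. The index layout and names below are our own simplification, not SAFE's actual data structures.

```python
def select_sources(triple_predicates, index, authorized):
    """Policy-aware source selection sketch. `index` maps each SPARQL
    endpoint to the set of predicates it advertises (no data instances
    are stored); `authorized` is the set of endpoints the current user
    may query. An endpoint is selected for a predicate only if it is
    both relevant and authorised, avoiding wasteful federated requests."""
    selection = {}
    for pred in triple_predicates:
        selection[pred] = {
            endpoint
            for endpoint, predicates in index.items()
            if pred in predicates and endpoint in authorized
        }
    return selection

# Hypothetical setup: two clinical endpoints plus one open dataset.
index = {
    "hospA": {"ex:hasValue", "ex:refPeriod"},
    "hospB": {"ex:hasValue"},
    "open":  {"ex:label"},
}
plan = select_sources(["ex:hasValue", "ex:label"], index,
                      authorized={"hospA", "open"})
```

Here `hospB` is relevant to `ex:hasValue` but the user is not authorised for it, so it is pruned before any network request, which is the join-aware, access-controlled pruning the abstract credits for the reduced query execution time.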
7.
J Biomed Semantics ; 8(1): 40, 2017 Sep 19.
Article in English | MEDLINE | ID: mdl-28927463

ABSTRACT

BACKGROUND: Next Generation Sequencing (NGS) plays a key role in therapeutic decision making for cancer prognosis and treatment. NGS technologies produce massive sequencing datasets, often published by isolated and disparate sequencing facilities. Consequently, sharing and aggregating multisite sequencing datasets is thwarted by issues such as the need to discover relevant data from different sources, build scalable repositories, automate data linkage, handle the volume of the data, query it efficiently, and provide information-rich, intuitive visualisation. RESULTS: We present an approach to link and query different sequencing datasets (TCGA, COSMIC, REACTOME, KEGG and GO) to indicate risks for four cancer types - Ovarian Serous Cystadenocarcinoma (OV), Uterine Corpus Endometrial Carcinoma (UCEC), Uterine Carcinosarcoma (UCS), and Cervical Squamous Cell Carcinoma and Endocervical Adenocarcinoma (CESC) - covering the 16 healthy tissue-specific genes from Illumina Human Body Map 2.0. The differentially expressed genes from Illumina Human Body Map 2.0 are analysed together with the gene expressions reported in the COSMIC and TCGA repositories, leading to the discovery of potential biomarkers for tissue-specific cancers. CONCLUSION: We analyse tissue expression of genes, copy number variation (CNV), somatic mutation, and promoter methylation to identify associated pathways and find novel biomarkers. We discovered twenty (20) mutated genes and three (3) potential pathways causing promoter changes in different gynaecological cancer types. We propose a data-interlinked platform called BIOOPENER that glues together heterogeneous cancer and biomedical repositories. The key approach is to find correspondences (or data links) among genetic, cellular and molecular features across isolated cancer datasets, giving insight into cancer progression from normal to diseased tissues. BIOOPENER enriches mutations by filling in missing links across the TCGA, COSMIC, REACTOME, KEGG and GO datasets and provides an interlinking mechanism, with pathway components, for understanding cancer progression from normal to diseased tissues, which in turn helped to map mutations, associated phenotypes, pathways, and mechanisms.


Subject(s)
Biomarkers, Tumor/metabolism , Genital Diseases, Female/metabolism , Medical Informatics/methods , Precision Medicine/methods , Data Mining , Databases, Factual , Female , Humans