Search | VHL Regional Portal

Robust importance sampling for error estimation in the context of optimal Bayesian transfer learning.

Maddouri, Omar; Qian, Xiaoning; Alexander, Francis J; Dougherty, Edward R; Yoon, Byung-Jun.

Patterns (N Y) ; 3(3): 100428, 2022 Mar 11.

Article in English | MEDLINE | ID: mdl-35510184

ABSTRACT

Classification has been a major task for building intelligent systems because it enables decision-making under uncertainty. Classifier design aims at building models from training data for representing feature-label distributions-either explicitly or implicitly. In many scientific or clinical settings, training data are typically limited, which impedes the design and evaluation of accurate classifiers. Atlhough transfer learning can improve the learning in target domains by incorporating data from relevant source domains, it has received little attention for performance assessment, notably in error estimation. Here, we investigate knowledge transferability in the context of classification error estimation within a Bayesian paradigm. We introduce a class of Bayesian minimum mean-square error estimators for optimal Bayesian transfer learning, which enables rigorous evaluation of classification error under uncertainty in small-sample settings. Using Monte Carlo importance sampling, we illustrate the outstanding performance of the proposed estimator for a broad family of classifiers that span diverse learning capabilities.

Synthetic data for design and evaluation of binary classifiers in the context of Bayesian transfer learning.

Maddouri, Omar; Qian, Xiaoning; Alexander, Francis J; Dougherty, Edward R; Yoon, Byung-Jun.

Data Brief ; 42: 108113, 2022 Jun.

Article in English | MEDLINE | ID: mdl-35434232

ABSTRACT

Transfer learning (TL) techniques can enable effective learning in data scarce domains by allowing one to re-purpose data or scientific knowledge available in relevant source domains for predictive tasks in a target domain of interest. In this Data in Brief article, we present a synthetic dataset for binary classification in the context of Bayesian transfer learning, which can be used for the design and evaluation of TL-based classifiers. For this purpose, we consider numerous combinations of classification settings, based on which we simulate a diverse set of feature-label distributions with varying learning complexity. For each set of model parameters, we provide a pair of target and source datasets that have been jointly sampled from the underlying feature-label distributions in the target and source domains, respectively. For both target and source domains, the data in a given class and domain are normally distributed, where the distributions across domains are related to each other through a joint prior. To ensure the consistency of the classification complexity across the provided datasets, we have controlled the Bayes error such that it is maintained within a range of predefined values that mimic realistic classification scenarios across different relatedness levels. The provided datasets may serve as useful resources for designing and benchmarking transfer learning schemes for binary classification as well as the estimation of classification error.

Deep graph representations embed network information for robust disease marker identification.

Maddouri, Omar; Qian, Xiaoning; Yoon, Byung-Jun.

Bioinformatics ; 38(4): 1075-1086, 2022 01 27.

Article in English | MEDLINE | ID: mdl-34788368

ABSTRACT

MOTIVATION: Accurate disease diagnosis and prognosis based on omics data rely on the effective identification of robust prognostic and diagnostic markers that reflect the states of the biological processes underlying the disease pathogenesis and progression. In this article, we present GCNCC, a Graph Convolutional Network-based approach for Clustering and Classification, that can identify highly effective and robust network-based disease markers. Based on a geometric deep learning framework, GCNCC learns deep network representations by integrating gene expression data with protein interaction data to identify highly reproducible markers with consistently accurate prediction performance across independent datasets possibly from different platforms. GCNCC identifies these markers by clustering the nodes in the protein interaction network based on latent similarity measures learned by the deep architecture of a graph convolutional network, followed by a supervised feature selection procedure that extracts clusters that are highly predictive of the disease state. RESULTS: By benchmarking GCNCC based on independent datasets from different diseases (psychiatric disorder and cancer) and different platforms (microarray and RNA-seq), we show that GCNCC outperforms other state-of-the-art methods in terms of accuracy and reproducibility. AVAILABILITY AND IMPLEMENTATION: https://github.com/omarmaddouri/GCNCC. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Neural Networks, Computer , Protein Interaction Maps , Humans , Reproducibility of Results

Neuro-symbolic representation learning on biological knowledge graphs.

Alshahrani, Mona; Khan, Mohammad Asif; Maddouri, Omar; Kinjo, Akira R; Queralt-Rosinach, Núria; Hoehndorf, Robert.

Bioinformatics ; 33(17): 2723-2730, 2017 Sep 01.

Article in English | MEDLINE | ID: mdl-28449114

ABSTRACT

MOTIVATION: Biological data and knowledge bases increasingly rely on Semantic Web technologies and the use of knowledge graphs for data integration, retrieval and federated queries. In the past years, feature learning methods that are applicable to graph-structured data are becoming available, but have not yet widely been applied and evaluated on structured biological knowledge. Results: We develop a novel method for feature learning on biological knowledge graphs. Our method combines symbolic methods, in particular knowledge representation using symbolic logic and automated reasoning, with neural networks to generate embeddings of nodes that encode for related information within knowledge graphs. Through the use of symbolic logic, these embeddings contain both explicit and implicit information. We apply these embeddings to the prediction of edges in the knowledge graph representing problems of function prediction, finding candidate genes of diseases, protein-protein interactions, or drug target relations, and demonstrate performance that matches and sometimes outperforms traditional approaches based on manually crafted features. Our method can be applied to any biological knowledge graph, and will thereby open up the increasing amount of Semantic Web based knowledge bases in biology to use in machine learning and data analytics. AVAILABILITY AND IMPLEMENTATION: https://github.com/bio-ontology-research-group/walking-rdf-and-owl. CONTACT: robert.hoehndorf@kaust.edu.sa. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Computational Biology/methods , Knowledge Bases , Machine Learning , Neural Networks, Computer , Humans

ABSTRACT

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL