
1.

Leveraging Bayesian networks and information theory to learn risk factors for breast cancer metastasis.

Jiang, Xia; Wells, Alan; Brufsky, Adam; Shetty, Darshan; Shajihan, Kahmil; Neapolitan, Richard E.

BMC Bioinformatics ; 21(1): 298, 2020 Jul 10.

Article in English | MEDLINE | ID: mdl-32650714

ABSTRACT

BACKGROUND: Even though we have established a few risk factors for metastatic breast cancer (MBC) through epidemiologic studies, these risk factors have not proven to be effective in predicting an individual's risk of developing metastasis. Therefore, identifying critical risk factors for MBC continues to be a major research imperative, and one which can lead to advances in breast cancer clinical care. The objective of this research is to leverage Bayesian Networks (BN) and information theory to identify key risk factors for breast cancer metastasis from data. METHODS: We develop the Markov Blanket and Interactive risk factor Learner (MBIL) algorithm, which learns single and interactive risk factors having a direct influence on a patient's outcome. We evaluate the effectiveness of MBIL using simulated datasets, and compare MBIL with the BN learning algorithms Fast Greedy Search (FGS), PC algorithm (PC), and CPC algorithm (CPC). We apply MBIL to learn risk factors for 5 year breast cancer metastasis using a clinical dataset we curated. We evaluate the learned risk factors by consulting with breast cancer experts and literature. We further evaluate the effectiveness of MBIL at learning risk factors for breast cancer metastasis by comparing it to the BN learning algorithms Necessary Path Condition (NPC) and Greedy Equivalent Search (GES). RESULTS: The averages of the Jaccard index for the simulated datasets containing 2000 records were 0.705, 0.272, 0.228, and 0.147 for MBIL, FGS, PC, and CPC respectively. MBIL, NPC, and GES all learned that grade and lymph_nodes_positive are direct risk factors for 5 year metastasis. Only MBIL and NPC found that surgical_margins is a direct risk factor. Only NPC found that invasive is a direct risk factor. MBIL learned that HER2 and ER interact to directly affect 5 year metastasis. Neither GES nor NPC learned that HER2 and ER are direct risk factors. 
DISCUSSION: The results involving simulated datasets indicated that MBIL can learn direct risk factors substantially better than standard Bayesian network learning algorithms. An application of MBIL to a real breast cancer dataset identified both single and interactive risk factors that directly influence breast cancer metastasis, which can be investigated further.
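The Jaccard index reported in the results compares the set of learned risk factors with the set of true ones. A minimal sketch follows, assuming each set is given as a collection of variable names. The example sets and the empty-set convention are illustrative assumptions, and the abstract does not say how interactive risk factors were counted.

```python
def jaccard_index(learned, true):
    """Return |learned ∩ true| / |learned ∪ true| for two collections."""
    learned, true = set(learned), set(true)
    union = learned | true
    if not union:
        # Assumed convention: two empty sets count as a perfect match.
        return 1.0
    return len(learned & true) / len(union)


# Hypothetical example: two of three true risk factors are recovered,
# together with one false positive.
true_factors = {"grade", "lymph_nodes_positive", "surgical_margins"}
learned_factors = {"grade", "lymph_nodes_positive", "invasive"}
score = jaccard_index(learned_factors, true_factors)
```

A score of 1 means the learned and true sets coincide; a score of 0 means they share no variables.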

Subject(s)

Algorithms , Breast Neoplasms/pathology , Bayes Theorem , Female , Humans , Information Theory , Markov Chains , Neoplasm Metastasis , Risk Factors

2.

Using natural language processing and machine learning to identify breast cancer local recurrence.

Zeng, Zexian; Espino, Sasa; Roy, Ankita; Li, Xiaoyu; Khan, Seema A; Clare, Susan E; Jiang, Xia; Neapolitan, Richard; Luo, Yuan.

BMC Bioinformatics ; 19(Suppl 17): 498, 2018 Dec 28.

Article in English | MEDLINE | ID: mdl-30591037

ABSTRACT

BACKGROUND: Identifying local recurrences in breast cancer from patient data sets is important for clinical research and practice. Developing a model using natural language processing and machine learning to identify local recurrences in breast cancer patients can reduce the time-consuming work of a manual chart review. METHODS: We design a novel concept-based filter and a prediction model to detect local recurrences using EHRs. In the training dataset, we manually review a development corpus of 50 progress notes and extract partial sentences that indicate breast cancer local recurrence. We process these partial sentences to obtain a set of Unified Medical Language System (UMLS) concepts using MetaMap, which we call the positive concept set. We apply MetaMap to patients' progress notes and retain only the concepts that fall within the positive concept set. These features combined with the number of pathology reports recorded for each patient are used to train a support vector machine to identify local recurrences. RESULTS: We compared our model with three baseline classifiers using either full MetaMap concepts, filtered MetaMap concepts, or bag of words. Our model achieved the best AUC (0.93 in cross-validation, 0.87 in held-out testing). CONCLUSIONS: Compared to a labor-intensive chart review, our model provides an automated way to identify breast cancer local recurrences. We expect that by minimally adapting the positive concept set, this study has the potential to be replicated at other institutions with a moderately sized training dataset.

Subject(s)

Breast Neoplasms/diagnosis , Machine Learning , Natural Language Processing , Neoplasm Recurrence, Local/diagnosis , Cohort Studies , Electronic Health Records , Female , Humans , Reproducibility of Results , Support Vector Machine , Unified Medical Language System

3.

Evaluation of a two-stage framework for prediction using big genomic data.

Jiang, Xia; Neapolitan, Richard E.

Brief Bioinform ; 16(6): 912-21, 2015 Nov.

Article in English | MEDLINE | ID: mdl-25788325

ABSTRACT

We are in the era of abundant 'big' or 'high-dimensional' data. These data afford us the opportunity to discover predictors of an event of interest, and to estimate occurrence of the event based on values of these predictors. For example, 'genome-wide association studies' examine millions of single-nucleotide polymorphisms (SNPs), along with disease status. We can learn SNPs that affect disease status from these data sets, and use the knowledge learned to predict disease likelihood. Owing to the large number of features, it is difficult for many prediction methods to use all the features directly. The ReliefF algorithm ranks a set of features in terms of how well they predict a target. It can be used to identify good predictors, which can then be provided to a prediction method. We compared the performance of eight prediction methods when predicting binary outcomes using high-dimensional discrete data sets. We performed two-stage prediction, where ReliefF is used in the first stage to identify good predictors. Bayesian network (BN)-based methods performed best overall. Furthermore, ReliefF did not improve their performance. The BN-based methods use the Bayesian Dirichlet Equivalent Uniform score to evaluate candidate models, and use BN inference algorithms to perform prediction. This score and these algorithms were developed for discrete variables. This perhaps explains why they perform better in this domain. Many prediction methods are available, and researchers have little reason for choosing one over the other in the domain of binary prediction using high-dimensional data sets. Our results indicate that the best choices overall are BN-based methods.

Subject(s)

Genomics , Algorithms , Evaluation Studies as Topic , Genome-Wide Association Study , Polymorphism, Single Nucleotide

4.

Discovering causal interactions using Bayesian network scoring and information gain.

Zeng, Zexian; Jiang, Xia; Neapolitan, Richard.

BMC Bioinformatics ; 17(1): 221, 2016 May 26.

Article in English | MEDLINE | ID: mdl-27230078

ABSTRACT

BACKGROUND: The problem of learning causal influences from data has recently attracted much attention. Standard statistical methods can have difficulty learning discrete causes, which interact to affect a target, because the assumptions in these methods often do not model discrete causal relationships well. An important task then is to learn such interactions from data. Motivated by the problem of learning epistatic interactions from datasets developed in genome-wide association studies (GWAS), researchers conceived new methods for learning discrete interactions. However, many of these methods do not differentiate a model representing a true interaction from a model representing non-interacting causes with strong individual effects. The recent algorithm MBS-IGain addresses this difficulty by using Bayesian network learning and information gain to discover interactions from high-dimensional datasets. However, MBS-IGain requires marginal effects to detect interactions containing more than two causes. If the dataset is not high-dimensional, we can avoid this shortcoming by doing an exhaustive search. RESULTS: We develop Exhaustive-IGain, which is like MBS-IGain but does an exhaustive search. We compare the performance of Exhaustive-IGain to MBS-IGain using low-dimensional simulated datasets based on interactions with marginal effects and ones based on interactions without marginal effects. Their performance is similar on the datasets based on marginal effects. However, Exhaustive-IGain compellingly outperforms MBS-IGain on the datasets based on 3 and 4-cause interactions without marginal effects. We apply Exhaustive-IGain to investigate how clinical variables interact to affect breast cancer survival, and obtain results that agree with judgements of a breast cancer oncologist. 
CONCLUSIONS: We conclude that the combined use of information gain and Bayesian network scoring enables us to discover higher order interactions with no marginal effects if we perform an exhaustive search. We further conclude that Exhaustive-IGain can be effective when applied to real data.
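The idea of an interaction without marginal effects can be illustrated with information gain. The sketch below is not necessarily the criterion used by Exhaustive-IGain, which the abstract does not define. It assumes a simple measure, the joint information gain minus the sum of the individual gains, and the function names and the XOR data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def information_gain(target, *features):
    """IG(target; features) = H(target) - H(target | features)."""
    n = len(target)
    groups = {}
    for key, t in zip(zip(*features), target):
        groups.setdefault(key, []).append(t)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(target) - conditional


def interaction_strength(target, a, b):
    """Assumed measure: joint gain minus the sum of the individual gains."""
    return (information_gain(target, a, b)
            - information_gain(target, a)
            - information_gain(target, b))


# Hypothetical XOR data: neither cause is informative alone,
# but together they determine the target.
a = [0, 0, 1, 1] * 25
b = [0, 1, 0, 1] * 25
target = [x ^ y for x, y in zip(a, b)]

marginal_a = information_gain(target, a)
joint = information_gain(target, a, b)
strength = interaction_strength(target, a, b)
```

On this data each cause has zero marginal gain while the pair has a full bit of joint gain, which is the situation a search guided only by marginal effects would miss.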

Subject(s)

Bayes Theorem , Databases, Genetic/standards , Genome-Wide Association Study/methods , Breast Neoplasms/genetics , Breast Neoplasms/mortality , Female , Humans

5.

LEAP: biomarker inference through learning and evaluating association patterns.

Jiang, Xia; Neapolitan, Richard E.

Genet Epidemiol ; 39(3): 173-84, 2015 Mar.

Article in English | MEDLINE | ID: mdl-25677188

ABSTRACT

Single nucleotide polymorphism (SNP) high-dimensional datasets are available from Genome Wide Association Studies (GWAS). Such data provide researchers opportunities to investigate the complex genetic basis of diseases. Much of genetic risk might be due to undiscovered epistatic interactions, which are interactions in which combinations of several genes affect disease. Research aimed at discovering interacting SNPs from GWAS datasets proceeded in two directions. First, tools were developed to evaluate candidate interactions. Second, algorithms were developed to search over the space of candidate interactions. Another problem when learning interacting SNPs, which has not received much attention, is evaluating how likely it is that the learned SNPs are associated with the disease. A complete system should provide this information as well. We develop such a system. Our system, called LEAP, includes a new heuristic search algorithm for learning interacting SNPs, and a Bayesian network based algorithm for computing the probability of their association. We evaluated the performance of LEAP using 100 1,000-SNP simulated datasets, each of which contains 15 SNPs involved in interactions. When learning interacting SNPs from these datasets, LEAP outperformed seven other methods. Furthermore, only SNPs involved in interactions were found to be probable. We also used LEAP to analyze real Alzheimer's disease and breast cancer GWAS datasets. We obtained interesting and new results from the Alzheimer's dataset, but limited results from the breast cancer dataset. We conclude that our results support that LEAP is a useful tool for extracting candidate interacting SNPs from high-dimensional datasets and determining their probability.

Subject(s)

Algorithms , Alzheimer Disease/genetics , Biomarkers/analysis , Breast Neoplasms/genetics , Genome-Wide Association Study , Polymorphism, Single Nucleotide/genetics , Artificial Intelligence , Bayes Theorem , Epistasis, Genetic , Female , Humans

6.

Pan-cancer analysis of TCGA data reveals notable signaling pathways.

Neapolitan, Richard; Horvath, Curt M; Jiang, Xia.

BMC Cancer ; 15: 516, 2015 Jul 14.

Article in English | MEDLINE | ID: mdl-26169172

ABSTRACT

BACKGROUND: A signal transduction pathway (STP) is a network of intercellular information flow initiated when extracellular signaling molecules bind to cell-surface receptors. Many aberrant STPs have been associated with various cancers. To develop optimal treatments for cancer patients, it is important to discover which STPs are implicated in a cancer or cancer-subtype. The Cancer Genome Atlas (TCGA) makes available gene expression level data on cases and controls in ten different types of cancer including breast cancer, colon adenocarcinoma, glioblastoma, kidney renal papillary cell carcinoma, low grade glioma, lung adenocarcinoma, lung squamous cell carcinoma, ovarian carcinoma, rectum adenocarcinoma, and uterine corpus endometrioid carcinoma. Signaling Pathway Impact Analysis (SPIA) is a software package that analyzes gene expression data to identify whether a pathway is relevant in a given condition. METHODS: We present the results of a study that uses SPIA to investigate all 157 signaling pathways in the KEGG PATHWAY database. We analyzed each of the ten cancer types mentioned above separately, and we performed a pan-cancer analysis by grouping the data for all the cancer types. RESULTS: In each analysis several pathways were found to be markedly more significant than all the other pathways. We call them notable. Research has already established a connection between many of these pathways and the corresponding cancer type. However, some of our discovered pathways appear to be new findings. Altogether there were 37 notable findings in the separate analyses; 26 of them occurred in 7 pathways. These 7 pathways included the 4 notable pathways discovered in the pan-cancer analysis. So, our results suggest that these 7 pathways account for much of the mechanisms of cancer. Furthermore, by looking at the overlap among pathways, we identified possible regions on the pathways where the aberrant activity is occurring. 
CONCLUSIONS: We obtained 37 notable findings concerning 18 pathways. Some of them appear to be new discoveries. Furthermore, we identified regions on pathways where the aberrant activity might be occurring. We conclude that our results will prove to be valuable to cancer researchers because they provide many opportunities for laboratory and clinical follow-up studies.

Subject(s)

Genomics , Neoplasms/genetics , Neoplasms/metabolism , Signal Transduction , Cluster Analysis , Computational Biology , Databases, Genetic , Female , Genomics/methods , Humans , Male , Protein Interaction Mapping , Protein Interaction Maps

7.

Learning genetic epistasis using Bayesian network scoring criteria.

Jiang, Xia; Neapolitan, Richard E; Barmada, M Michael; Visweswaran, Shyam.

BMC Bioinformatics ; 12: 89, 2011 Mar 31.

Article in English | MEDLINE | ID: mdl-21453508

ABSTRACT

BACKGROUND: Gene-gene epistatic interactions likely play an important role in the genetic basis of many common diseases. Recently, machine-learning and data mining methods have been developed for learning epistatic relationships from data. A well-known combinatorial method that has been successfully applied for detecting epistasis is Multifactor Dimensionality Reduction (MDR). Jiang et al. created a combinatorial epistasis learning method called BNMBL to learn Bayesian network (BN) epistatic models. They compared BNMBL to MDR using simulated data sets. Each of these data sets was generated from a model that associates two SNPs with a disease and includes 18 unrelated SNPs. For each data set, BNMBL and MDR were used to score all 2-SNP models, and BNMBL learned significantly more correct models. In real data sets, we ordinarily do not know the number of SNPs that influence phenotype. BNMBL may not perform as well if we also scored models containing more than two SNPs. Furthermore, a number of other BN scoring criteria have been developed. They may detect epistatic interactions even better than BNMBL. Although BNs are a promising tool for learning epistatic relationships from data, we cannot confidently use them in this domain until we determine which scoring criteria work best or even well when we try learning the correct model without knowledge of the number of SNPs in that model. RESULTS: We evaluated the performance of 22 BN scoring criteria using 28,000 simulated data sets and a real Alzheimer's GWAS data set. Our results were surprising in that the Bayesian scoring criterion with large values of a hyperparameter called α performed best. This score performed better than other BN scoring criteria and MDR at recall using simulated data sets, at detecting the hardest-to-detect models using simulated data sets, and at substantiating previous results using the real Alzheimer's data set. 
CONCLUSIONS: We conclude that representing epistatic interactions using BN models and scoring them using a BN scoring criterion holds promise for identifying epistatic genetic variants in data. In particular, the Bayesian scoring criterion with large values of a hyperparameter α appears more promising than a number of alternatives.
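To make the role of the hyperparameter α concrete, the sketch below scores a candidate model in which a set of discrete parents (for example SNPs) points to a target (for example disease status). It assumes the Bayesian Dirichlet equivalent uniform (BDeu) form of the Bayesian score, which this abstract does not spell out. The function name, argument layout, and toy data are hypothetical.

```python
from collections import Counter
from math import lgamma


def log_bdeu(target, parents, alpha, r_target, parent_cards):
    """Log BDeu score of `target` given a list of parent columns.

    alpha        : equivalent sample size (the hyperparameter)
    r_target     : number of states of the target
    parent_cards : number of states of each parent, in order
    """
    q = 1
    for card in parent_cards:
        q *= card                       # number of parent configurations
    configs = list(zip(*parents)) if parents else [()] * len(target)
    n_j = Counter(configs)              # records per parent configuration
    n_jk = Counter(zip(configs, target))  # records per (configuration, state)
    a_j = alpha / q
    a_jk = alpha / (q * r_target)
    score = 0.0
    # Unobserved configurations and states contribute exactly zero.
    for count in n_j.values():
        score += lgamma(a_j) - lgamma(a_j + count)
    for count in n_jk.values():
        score += lgamma(a_jk + count) - lgamma(a_jk)
    return score


# Hypothetical toy data: the target is the XOR of two binary parents.
snp1 = [0, 0, 1, 1] * 25
snp2 = [0, 1, 0, 1] * 25
disease = [x ^ y for x, y in zip(snp1, snp2)]

score_null = log_bdeu(disease, [], 1.0, 2, [])
score_pair = log_bdeu(disease, [snp1, snp2], 1.0, 2, [2, 2])
```

Here the two-parent model scores far higher than the parentless one. Varying `alpha` in this sketch is one way to explore how the prior strength shifts the comparison between candidate models.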

Subject(s)

Computational Biology/methods , Epistasis, Genetic , Models, Genetic , Bayes Theorem , Genotype , Humans , Multifactor Dimensionality Reduction , Polymorphism, Single Nucleotide/genetics

8.

Stopping Rules for Computer Adaptive Testing When Item Banks Have Nonuniform Information.

Morris, Scott B; Bass, Michael; Howard, Elizabeth; Neapolitan, Richard E.

Int J Test ; 20(2): 146-168, 2020.

Article in English | MEDLINE | ID: mdl-32982603

ABSTRACT

The standard error (SE) stopping rule, which terminates a computer adaptive test (CAT) when the SE is less than a threshold, is effective when there are informative questions for all trait levels. However, in domains such as patient reported outcomes, the items in a bank might all target one end of the trait continuum (e.g., negative symptoms), and the bank may lack depth for many individuals. In such cases, the predicted standard error reduction (PSER) stopping rule will stop the CAT even if the SE threshold has not been reached, and can avoid administering excessive questions that provide little additional information. By tuning the parameters of the PSER algorithm, a practitioner can specify a desired tradeoff between accuracy and efficiency. Using simulated data for the PROMIS Anxiety and Physical Function banks, we demonstrate that these parameters can substantially impact CAT performance. When the parameters were optimally tuned, the PSER stopping rule was found to outperform the SE stopping rule overall and particularly for individuals not targeted by the bank, and presented roughly the same number of items across the trait continuum. Therefore, the PSER stopping rule provides an effective method for balancing the precision and efficiency of a CAT.
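The two stopping rules can be contrasted in a few lines. This is a simplified sketch, not the PSER algorithm as specified by the authors: it assumes the usual IRT relation SE = 1 / sqrt(test information), and the threshold names and default values are hypothetical.

```python
from math import sqrt


def standard_error(information):
    """SE of the trait estimate, assuming SE = 1 / sqrt(test information)."""
    return 1.0 / sqrt(information)


def should_stop(current_info, best_next_item_info,
                se_threshold=0.3, min_se_reduction=0.01):
    """Return (stop, reason) for a CAT after the items given so far.

    current_info        : test information accumulated so far
    best_next_item_info : information of the most informative unused item
    se_threshold        : SE rule threshold (hypothetical default)
    min_se_reduction    : smallest worthwhile predicted SE drop (hypothetical)
    """
    se_now = standard_error(current_info)
    if se_now < se_threshold:
        return True, "SE"      # precision target reached
    predicted_se = standard_error(current_info + best_next_item_info)
    if se_now - predicted_se < min_se_reduction:
        return True, "PSER"    # next item would add too little precision
    return False, None


# Hypothetical cases: precise enough, bank exhausted, keep testing.
case_precise = should_stop(16.0, 1.0)
case_no_depth = should_stop(4.0, 0.05)
case_continue = should_stop(4.0, 2.0)
```

The second case is the one the abstract highlights: the SE threshold has not been reached, but the remaining items are too uninformative for this individual to justify more questions.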

9.

A clinical decision support system learned from data to personalize treatment recommendations towards preventing breast cancer metastasis.

Jiang, Xia; Wells, Alan; Brufsky, Adam; Neapolitan, Richard.

PLoS One ; 14(3): e0213292, 2019.

Article in English | MEDLINE | ID: mdl-30849111

ABSTRACT

OBJECTIVE: A Clinical Decision Support System (CDSS) that can amass Electronic Health Record (EHR) and other patient data holds promise to provide accurate classification and guide treatment choices. Our objective is to develop the Decision Support System for Making Personalized Assessments and Recommendations Concerning Breast Cancer Patients (DPAC), which is a CDSS learned from data that recommends the optimal treatment decisions based on a patient's features. METHOD: We developed a Bayesian network architecture called Causal Modeling with Internal Layers (CAMIL), and an algorithm called Treatment Feature Interactions (TFI), which learns from data the interactions needed in a CAMIL model. Using the TFI algorithm, we learned interactions for six treatments from the LSDS-5YDM dataset. We created a CAMIL model using these interactions, resulting in a DPAC which recommends treatments towards preventing 5-year breast cancer metastasis. RESULTS: In a 5-fold cross-validation analysis, we compared the probability of being metastasis free in 5 years for patients who made decisions recommended by DPAC to those who did not. These probabilities are (the probability for those making the decisions appears first): chemotherapy (.938, .872); breast/chest wall radiation (.939, .902); nodal field radiation (.940, .784); antihormone (.941, .906); HER2 inhibitors (.934, .880); neoadjuvant therapy (.931, .837). In an application of DPAC to the independent METABRIC dataset, the probabilities for chemotherapy were (.845, .788). DISCUSSION: Patients who took the advice of DPAC had, as a group, notably better outcomes than those who did not. We conclude that DPAC is effective at amassing and analyzing data towards treatment recommendations. Some of the findings in DPAC are controversial. For example, DPAC says that chemotherapy increases the chances of metastasis for many node negative patients. 
This controversy shows the importance of developing a conclusive version of DPAC to ensure we provide patients with the best patient-specific treatment recommendations.

Subject(s)

Bayes Theorem , Breast Neoplasms/therapy , Decision Support Systems, Clinical , Models, Theoretical , Practice Guidelines as Topic/standards , Precision Medicine , Adolescent , Adult , Algorithms , Child , Child, Preschool , Combined Modality Therapy , Female , Humans , Infant , Infant, Newborn , Middle Aged , Neoplasm Metastasis , Young Adult

10.

Conjugated equine estrogen and medroxyprogesterone acetate are associated with decreased risk of breast cancer relative to bioidentical hormone therapy and controls.

Zeng, Zexian; Jiang, Xia; Li, Xiaoyu; Wells, Alan; Luo, Yuan; Neapolitan, Richard.

PLoS One ; 13(5): e0197064, 2018.

Article in English | MEDLINE | ID: mdl-29768475

ABSTRACT

OBJECTIVE: By the 1990s it became popular for women to use hormone therapy (HT) to ease menopause symptoms. Bioidentical estrogen and progesterone are supplements whose molecular structures are identical to what is made in the human body, while synthetic supplements are ones whose structures are not. After the Women's Health Initiative found that the combined use of the synthetics conjugated equine estrogen (CEE) and medroxyprogesterone acetate (MPA) increased breast cancer risk, prescriptions for synthetic HT declined considerably. Since then there has been an increased interest in bioidentical HT; today there are a plethora of websites touting their benefits. However, no peer-reviewed articles support these claims. We performed a retrospective study with the objective of verifying the hypothesis that bioidentical HT is associated with lower breast cancer risk than CEE & MPA. METHODS: We searched The Northwestern Medicine Enterprise Data Warehouse for women who initiated HT use after age 50. Women who did not take any HT drug after age 50 served as controls. Nine HT protocols were investigated for breast cancer risk. RESULTS: Significant results include CEE Alone is associated with decreased breast cancer risk (HR = 0.31), Other Synthetic Estrogen Alone is associated with increased breast cancer risk (HR = 1.49), Bioidentical Estrogen Alone is associated with decreased breast cancer risk (HR = 0.65), CEE & MPA is associated with reduced breast cancer risk (HR = 0.43), and CEE & MPA is associated with reduced breast cancer risk relative to Bioidentical Estrogen & Progesterone (HR = 0.25). DISCUSSION: Our results indicate CEE & MPA is superior to bioidentical HT as far as breast cancer risk. Furthermore, this combination is associated with a decrease in breast cancer risk, contrary to previous findings. Additional retrospective studies are needed to confirm our results.

Subject(s)

Breast Neoplasms/epidemiology , Estrogen Replacement Therapy , Estrogens, Conjugated (USP)/therapeutic use , Medroxyprogesterone/therapeutic use , Menopause/drug effects , Breast Neoplasms/chemically induced , Estrogens, Conjugated (USP)/adverse effects , Female , Humans , Medroxyprogesterone/adverse effects , Middle Aged

11.

Advancing the efficiency and efficacy of patient reported outcomes with multivariate computer adaptive testing.

Morris, Scott; Bass, Mike; Lee, Mirinae; Neapolitan, Richard E.

J Am Med Inform Assoc ; 24(5): 897-902, 2017 Sep 01.

Article in English | MEDLINE | ID: mdl-28444397

ABSTRACT

OBJECTIVE: The Patient Reported Outcomes Measurement Information System (PROMIS) initiative developed an array of patient reported outcome (PRO) measures. To reduce the number of questions administered, PROMIS utilizes unidimensional item response theory and unidimensional computer adaptive testing (UCAT), which means a separate set of questions is administered for each measured trait. Multidimensional item response theory (MIRT) and multidimensional computer adaptive testing (MCAT) simultaneously assess correlated traits. The objective was to investigate the extent to which MCAT reduces patient burden relative to UCAT in the case of PROs. METHODS: One MIRT and 3 unidimensional item response theory models were developed using the related traits anxiety, depression, and anger. Using these models, MCAT and UCAT performance was compared with simulated individuals. RESULTS: Surprisingly, the root mean squared error for both methods increased with the number of items. These results were driven by large errors for individuals with low trait levels. A second analysis focused on individuals aligned with item content. For these individuals, both MCAT and UCAT accuracies improved with additional items. Furthermore, MCAT reduced the test length by 50%. DISCUSSION: For the PROMIS Emotional Distress banks, neither UCAT nor MCAT provided accurate estimates for individuals at low trait levels. Because the items in these banks were designed to detect clinical levels of distress, there is little information for individuals with low trait values. However, trait estimates for individuals targeted by the banks were accurate and MCAT asked substantially fewer questions. CONCLUSION: By reducing the number of items administered, MCAT can allow clinicians and researchers to assess a wider range of PROs with less patient burden.

Subject(s)

Computers , Patient Reported Outcome Measures , Surveys and Questionnaires , Humans , Information Systems , Models, Psychological , Precision Medicine

12.

A Primer on Bayesian Decision Analysis With an Application to a Kidney Transplant Decision.

Neapolitan, Richard; Jiang, Xia; Ladner, Daniela P; Kaplan, Bruce.

Transplantation ; 100(3): 489-96, 2016 Mar.

Article in English | MEDLINE | ID: mdl-26900809

ABSTRACT

A clinical decision support system (CDSS) is a computer program designed to assist health care professionals with decision-making tasks. A well-developed CDSS weighs the benefits of therapy versus the cost in terms of loss of quality of life and financial loss and recommends the decision that can be expected to provide maximum overall benefit. This article provides an introduction to developing CDSSs using Bayesian networks; such CDSSs can help with the often complex decisions involving transplants. First, we review Bayes theorem in the context of medical decision making. Then, we introduce Bayesian networks, which can model probabilistic relationships among many related variables and are based on Bayes theorem. Next, we discuss influence diagrams, which are Bayesian networks augmented with decision and value nodes and which can be used to develop CDSSs that are able to recommend decisions that maximize the expected utility of the predicted outcomes to the patient. By way of comparison, we examine the benefit and challenges of using the Kidney Donor Risk Index as the sole decision tool. Finally, we develop a schema for an influence diagram that models generalized kidney transplant decisions and show how the influence diagram approach can provide the clinician and the potential transplant recipient with a valuable decision support tool.
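The two building blocks named in this abstract, Bayes theorem and expected-utility maximization, can be sketched for a single binary condition and a single test. All numbers, decision names, and utilities below are hypothetical and are not taken from the article. A real influence diagram would involve many more variables.

```python
def posterior(prior, sensitivity, specificity):
    """P(condition | positive test) by Bayes theorem."""
    true_pos = sensitivity * prior
    false_pos = (1.0 - specificity) * (1.0 - prior)
    return true_pos / (true_pos + false_pos)


def best_decision(p_condition, utilities):
    """Pick the decision with maximum expected utility.

    utilities maps each decision to a pair
    (utility if the condition is present, utility if it is absent).
    """
    expected = {
        decision: p_condition * u_present + (1.0 - p_condition) * u_absent
        for decision, (u_present, u_absent) in utilities.items()
    }
    return max(expected, key=expected.get), expected


# Hypothetical inputs: 1% prevalence, 90% sensitivity, 95% specificity.
p = posterior(prior=0.01, sensitivity=0.90, specificity=0.95)

# Hypothetical utilities on a 0-1 scale.
utilities = {"treat": (0.8, 0.9), "wait": (0.3, 1.0)}
choice_after_test, _ = best_decision(p, utilities)
choice_if_likely, _ = best_decision(0.5, utilities)
```

With these made-up numbers a positive test raises the probability only to about 15%, so waiting still has the higher expected utility; at a probability of 50% the recommendation flips to treating.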

Subject(s)

Bayes Theorem , Decision Support Systems, Clinical , Decision Support Techniques , Donor Selection/methods , Kidney Transplantation/methods , Patient Selection , Algorithms , Decision Trees , Humans , Kidney Transplantation/adverse effects , Life Expectancy , Risk Assessment , Risk Factors , Treatment Outcome

13.

Study of integrated heterogeneous data reveals prognostic power of gene expression for breast cancer survival.

Neapolitan, Richard E; Jiang, Xia.

PLoS One ; 10(2): e0117658, 2015.

Article in English | MEDLINE | ID: mdl-25723490

ABSTRACT

BACKGROUND: Studies show that thousands of genes are associated with prognosis of breast cancer. Towards utilizing available genetic data, efforts have been made to predict outcomes using gene expression data, and a number of commercial products have been developed. These products have the following shortcomings: 1) They use the Cox model for prediction. However, the RSF model has been shown to significantly outperform the Cox model. 2) Testing was not done to see if a complete set of clinical predictors could predict as well as the gene expression signatures. METHODOLOGY/FINDINGS: We address these shortcomings. The METABRIC data set concerns 1981 breast cancer tumors. Features include 21 clinical features, expression levels for 16,384 genes, and survival. We compare the survival prediction performance of the Cox model and the RSF model using the clinical data and the gene expression data to their performance using only the clinical data. We obtain significantly better results when we used both clinical data and gene expression data for 5 year, 10 year, and 15 year survival prediction. When we replace the gene expression data by PAM50 subtype, our results are significant only for 5 year and 15 year prediction. We obtain significantly better results using the RSF model over the Cox model. Finally, our results indicate that gene expression data alone may predict long-term survival. CONCLUSIONS/SIGNIFICANCE: Our results indicate that we can obtain improved survival prediction using clinical data and gene expression data compared to prediction using only clinical data. We further conclude that we can obtain improved survival prediction using the RSF model instead of the Cox model. These results are significant because by incorporating more gene expression data with clinical features and using the RSF model, we could develop decision support systems that better utilize heterogeneous information to improve outcome prediction and decision making.

Subject(s)

Breast Neoplasms/genetics , Breast Neoplasms/mortality , Gene Expression Regulation, Neoplastic , Algorithms , Breast Neoplasms/pathology , Cluster Analysis , Computational Biology , Female , Gene Expression Profiling , Humans , Lymphatic Metastasis , Neoplasm Staging , Prognosis , Time Factors , Transcriptome , Tumor Burden

14.

Learning Predictive Interactions Using Information Gain and Bayesian Network Scoring.

Jiang, Xia; Jao, Jeremy; Neapolitan, Richard.

PLoS One ; 10(12): e0143247, 2015.

Article in English | MEDLINE | ID: mdl-26624895

ABSTRACT

BACKGROUND: The problems of correlation and classification are long-standing in the fields of statistics and machine learning, and techniques have been developed to address these problems. We are now in the era of high-dimensional data, which is data that can concern billions of variables. These data present new challenges. In particular, it is difficult to discover predictive variables, when each variable has little marginal effect. An example concerns Genome-wide Association Studies (GWAS) datasets, which involve millions of single nucleotide polymorphisms (SNPs), where some of the SNPs interact epistatically to affect disease status. Towards determining these interacting SNPs, researchers developed techniques that addressed this specific problem. However, the problem is more general, and so these techniques are applicable to other problems concerning interactions. A difficulty with many of these techniques is that they do not distinguish whether a learned interaction is actually an interaction or whether it involves several variables with strong marginal effects. METHODOLOGY/FINDINGS: We address this problem using information gain and Bayesian network scoring. First, we identify candidate interactions by determining whether together variables provide more information than they do separately. Then we use Bayesian network scoring to see if a candidate interaction really is a likely model. Our strategy is called MBS-IGain. Using 100 simulated datasets and a real GWAS Alzheimer's dataset, we investigated the performance of MBS-IGain. CONCLUSIONS/SIGNIFICANCE: When analyzing the simulated datasets, MBS-IGain substantially out-performed nine previous methods at locating interacting predictors, and at identifying interactions exactly. When analyzing the real Alzheimer's dataset, we obtained new results and results that substantiated previous findings. We conclude that MBS-IGain is highly effective at finding interactions in high-dimensional datasets. 
This result is significant because we have increasingly abundant high-dimensional data in many domains, and to learn causes and perform prediction/classification using these data, we often must first identify interactions.
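
The first step described in the abstract, checking whether variables together provide more information than they do separately, can be illustrated with a minimal sketch. This is not the MBS-IGain implementation: the function names, the plain mutual-information estimate, and the toy XOR data are all assumptions chosen for illustration.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from observed frequencies."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def interaction_gain(x1, x2, y):
    """Information the pair gives about y beyond the two single-variable gains.

    Hypothetical helper: a positive value suggests the variables are more
    informative together than separately.
    """
    joint = list(zip(x1, x2))
    return (mutual_information(joint, y)
            - mutual_information(x1, y)
            - mutual_information(x2, y))


# Toy data (an assumption): y is the XOR of two binary predictors, so
# neither predictor has a marginal effect but the pair determines y.
x1 = [0, 0, 1, 1] * 25
x2 = [0, 1, 0, 1] * 25
y = [a ^ b for a, b in zip(x1, x2)]

single_gain = mutual_information(x1, y)
pair_gain = interaction_gain(x1, x2, y)
print(f"gain from x1 alone: {single_gain:.3f} bits")
print(f"gain from interaction: {pair_gain:.3f} bits")
```

In this toy case each predictor alone carries no information about the outcome, while the pair carries one full bit. A real analysis would also need the second step the abstract mentions, scoring the candidate as a Bayesian network model.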

Subject(s)

Computational Biology/methods , Machine Learning , Algorithms , Bayes Theorem , Genome-Wide Association Study , Polymorphism, Single Nucleotide

15.

Utilizing Multidimensional Computer Adaptive Testing to Mitigate Burden With Patient Reported Outcomes.

Bass, Michael; Morris, Scott; Neapolitan, Richard.

AMIA Annu Symp Proc ; 2015: 320-8, 2015.

Article in English | MEDLINE | ID: mdl-26958163

ABSTRACT

Utilization of patient-reported outcome measures (PROs) had been limited by the lack of psychometrically sound measures scored in real-time. The Patient Reported Outcomes Measurement Information System (PROMIS) initiative developed a broad array of high-quality PRO measures. Towards reducing the number of items administered in measuring PROs, PROMIS employs Item Response Theory (IRT) and Computer Adaptive Testing (CAT). By only administering questions targeted to the subject's trait level, CAT has cut testing times in half(1). The IRT/CAT implementation in PROMIS is unidimensional in that there is a separate set of questions administered for each measured trait. However, there are often correlations among traits. Multidimensional IRT (MIRT) and multidimensional CAT (MCAT) provide items concerning several correlated traits, and should ameliorate patient burden. We developed an MIRT model using existing PROMIS item banks for depression and anxiety, developed MCAT software, and compared the efficiency of the MCAT approach to the unidimensional approach. Note: Research reported in this publication was supported in part by the National Library of Medicine of the National Institutes of Health under Award Number R01LM011962.
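
The idea of administering only questions targeted to the subject's trait level can be sketched as follows. This is a simplified, unidimensional illustration under stated assumptions: it uses a dichotomous two-parameter logistic (2PL) IRT model and maximum-information item selection, and the item names and parameters are made up. It is not the PROMIS or MCAT software described in the abstract.

```python
from math import exp


def prob_endorse(theta, discrimination, difficulty):
    """2PL model: probability of endorsing an item at trait level theta."""
    return 1.0 / (1.0 + exp(-discrimination * (theta - difficulty)))


def item_information(theta, discrimination, difficulty):
    """Fisher information of a 2PL item at theta: a^2 * P * (1 - P)."""
    p = prob_endorse(theta, discrimination, difficulty)
    return discrimination ** 2 * p * (1.0 - p)


def pick_next_item(theta, items, administered):
    """Return the unadministered item that is most informative at theta."""
    candidates = [name for name in items if name not in administered]
    return max(candidates, key=lambda name: item_information(theta, *items[name]))


# Hypothetical item bank: name -> (discrimination, difficulty).
items = {
    "item_low": (1.2, -1.5),
    "item_mid": (1.2, 0.0),
    "item_high": (1.2, 1.5),
}

# With a current trait estimate near 0, the mid-difficulty item is chosen.
next_item = pick_next_item(0.1, items, administered=set())
print(next_item)
```

A multidimensional version would replace the scalar trait estimate with a vector of correlated traits and the scalar information with an information matrix, which is where the burden reduction the abstract anticipates would come from.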

Subject(s)

Anxiety/diagnosis , Decision Making, Computer-Assisted , Depression/diagnosis , Patient Reported Outcome Measures , Psychometrics/methods , Computer Simulation , Diagnosis, Computer-Assisted , Humans

16.

Inferring Aberrant Signal Transduction Pathways in Ovarian Cancer from TCGA Data.

Neapolitan, Richard; Jiang, Xia.

Cancer Inform ; 13(Suppl 1): 29-36, 2014.

Article in English | MEDLINE | ID: mdl-25392681

ABSTRACT

This paper concerns a new method for identifying aberrant signal transduction pathways (STPs) in cancer using case/control gene expression-level datasets, and applying that method and an existing method to an ovarian carcinoma dataset. Both methods identify STPs that are plausibly linked to all cancers based on current knowledge. Thus, the paper is most appropriate for the cancer informatics community. Our hypothesis is that STPs that are altered in tumorous tissue can be identified by applying a new Bayesian network (BN)-based method (causal analysis of STP aberration (CASA)) and an existing method (signaling pathway impact analysis (SPIA)) to the cancer genome atlas (TCGA) gene expression-level datasets. To test this hypothesis, we analyzed 20 cancer-related STPs and 6 randomly chosen STPs using the 591 cases in the TCGA ovarian carcinoma dataset, and the 102 controls in all 5 TCGA cancer datasets. We identified all the genes related to each of the 26 pathways, and developed separate gene expression datasets for each pathway. The results of the two methods were highly correlated. Furthermore, many of the STPs that ranked highest according to both methods are plausibly linked to all cancers based on current knowledge. Finally, CASA ranked the cancer-related STPs over the randomly selected STPs at a significance level below 0.05 (P = 0.047), but SPIA did not (P = 0.083).

17.

Modeling the altered expression levels of genes on signaling pathways in tumors as causal bayesian networks.

Neapolitan, Richard; Xue, Diyang; Jiang, Xia.

Cancer Inform ; 13: 77-84, 2014.

Article in English | MEDLINE | ID: mdl-24932098

ABSTRACT

This paper concerns a study indicating that the expression levels of genes in signaling pathways can be modeled using a causal Bayesian network (BN) that is altered in tumorous tissue. These results open up promising areas of future research that can help identify driver genes and therapeutic targets. So, it is most appropriate for the cancer informatics community. Our central hypothesis is that the expression levels of genes that code for proteins on a signal transduction pathway (STP) are causally related and that this causal structure is altered when the STP is involved in cancer. To test this hypothesis, we analyzed 5 STPs associated with breast cancer, 7 STPs associated with other cancers, and 10 randomly chosen pathways, using a breast cancer gene expression level dataset containing 529 cases and 61 controls. We identified all the genes related to each of the 22 pathways and developed separate gene expression datasets for each pathway. We obtained significant results indicating that the causal structure of the expression levels of genes coding for proteins on STPs, which are believed to be implicated in both breast cancer and in all cancers, is more altered in the cases relative to the controls than the causal structure of the randomly chosen pathways.

18.

A new method for predicting patient survivorship using efficient bayesian network learning.

Jiang, Xia; Xue, Diyang; Brufsky, Adam; Khan, Seema; Neapolitan, Richard.

Cancer Inform ; 13: 47-57, 2014.

Article in English | MEDLINE | ID: mdl-24558297

ABSTRACT

The purpose of this investigation is to develop and evaluate a new Bayesian network (BN)-based patient survivorship prediction method. The central hypothesis is that the method predicts patient survivorship well, while having the capability to handle high-dimensional data and be incorporated into a clinical decision support system (CDSS). We have developed EBMC_Survivorship (EBMC_S), which predicts survivorship for each year individually. EBMC_S is based on the EBMC BN algorithm, which has been shown to handle high-dimensional data. BNs have excellent architecture for decision support systems. In this study, we evaluate EBMC_S using the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) dataset, which concerns breast tumors. A 5-fold cross-validation study indicates that EBMC_S performs better than the Cox proportional hazard model and is comparable to the random survival forest method. We show that EBMC_S provides additional information such as sensitivity analyses, which covariates predict each year, and yearly areas under the ROC curve (AUROCs). We conclude that our investigation supports the central hypothesis.

19.

A comparative analysis of methods for predicting clinical outcomes using high-dimensional genomic datasets.

Jiang, Xia; Cai, Binghuang; Xue, Diyang; Lu, Xinghua; Cooper, Gregory F; Neapolitan, Richard E.

J Am Med Inform Assoc ; 21(e2): e312-9, 2014 Oct.

Article in English | MEDLINE | ID: mdl-24737607

ABSTRACT

OBJECTIVE: The objective of this investigation is to evaluate binary prediction methods for predicting disease status using high-dimensional genomic data. The central hypothesis is that the Bayesian network (BN)-based method called efficient Bayesian multivariate classifier (EBMC) will do well at this task because EBMC builds on BN-based methods that have performed well at learning epistatic interactions. METHOD: We evaluate how well eight methods perform binary prediction using high-dimensional discrete genomic datasets containing epistatic interactions. The methods are as follows: naive Bayes (NB), model averaging NB (MANB), feature selection NB (FSNB), EBMC, logistic regression (LR), support vector machines (SVM), Lasso, and extreme learning machines (ELM). We use a hundred 1000-single nucleotide polymorphism (SNP) simulated datasets, ten 10,000-SNP datasets, six semi-synthetic sets, and two real genome-wide association studies (GWAS) datasets in our evaluation. RESULTS: In fivefold cross-validation studies, the SVM performed best on the 1000-SNP dataset, while the BN-based methods performed best on the other datasets, with EBMC exhibiting the best overall performance. In-sample testing indicates that LR, SVM, Lasso, ELM, and NB tend to overfit the data. DISCUSSION: EBMC performed better than NB when there are several strong predictors, whereas NB performed better when there are many weak predictors. Furthermore, for all BN-based methods, prediction capability did not degrade as the dimension increased. CONCLUSIONS: Our results support the hypothesis that EBMC performs well at binary outcome prediction using high-dimensional discrete datasets containing epistatic-like interactions. Future research using more GWAS datasets is needed to further investigate the potential of EBMC.

Subject(s)

Bayes Theorem , Databases, Genetic , Genomics , Neural Networks, Computer , Epistasis, Genetic , Genome-Wide Association Study , Humans , Prognosis , ROC Curve

20.

Mining pure, strict epistatic interactions from high-dimensional datasets: ameliorating the curse of dimensionality.

Jiang, Xia; Neapolitan, Richard E.

PLoS One ; 7(10): e46771, 2012.

Article in English | MEDLINE | ID: mdl-23071633

ABSTRACT

BACKGROUND: The interaction between loci to affect phenotype is called epistasis. It is strict epistasis if no proper subset of the interacting loci exhibits a marginal effect. For many diseases, it is likely that unknown epistatic interactions affect disease susceptibility. A difficulty when mining epistatic interactions from high-dimensional datasets concerns the curse of dimensionality. There are too many combinations of SNPs to perform an exhaustive search. A method that could locate strict epistasis without an exhaustive search can be considered the brass ring of methods for analyzing high-dimensional datasets. METHODOLOGY/FINDINGS: A SNP pattern is a Bayesian network representing SNP-disease relationships. The Bayesian score for a SNP pattern is the probability of the data given the pattern, and has been used to learn SNP patterns. We identified a bound for the score of a SNP pattern. The bound provides an upper limit on the Bayesian score of any pattern that could be obtained by expanding a given pattern. We felt that the bound might enable the data to say something about the promise of expanding a 1-SNP pattern even when there are no marginal effects. We tested the bound using simulated datasets and semi-synthetic high-dimensional datasets obtained from GWAS datasets. We found that the bound was able to dramatically reduce the search time for strict epistasis. Using an Alzheimer's dataset, we showed that it is possible to discover an interaction involving the APOE gene based on its score because of its large marginal effect, but that the bound is most effective at discovering interactions without marginal effects. CONCLUSIONS/SIGNIFICANCE: We conclude that the bound appears to ameliorate the curse of dimensionality in high-dimensional datasets. This is a very consequential result and could be pivotal in our efforts to reveal the dark matter of genetic disease risk from high-dimensional datasets.
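
The notion of a Bayesian score for a SNP pattern, the probability of the data given the pattern, can be illustrated with a small sketch. It assumes the K2 score (uniform Dirichlet priors), which may differ from the scoring criterion used in the paper, and it does not implement the bound the abstract describes. The function name and the toy data are hypothetical.

```python
from collections import defaultdict
from math import lgamma


def log_k2_score(parent_columns, outcome, num_states=2):
    """Log probability of the outcome column given a pattern whose parents
    are the listed SNP columns, under the K2 score (uniform Dirichlet priors).

    An empty parent list scores the null pattern with no SNPs.
    """
    # Count outcome states within each observed parent configuration.
    counts = defaultdict(lambda: [0] * num_states)
    for i, y in enumerate(outcome):
        config = tuple(col[i] for col in parent_columns)
        counts[config][y] += 1

    score = 0.0
    for state_counts in counts.values():
        n = sum(state_counts)
        # log[(r - 1)! / (n + r - 1)!] plus the sum of log(n_k!)
        score += lgamma(num_states) - lgamma(n + num_states)
        score += sum(lgamma(c + 1) for c in state_counts)
    return score


# Toy strict-epistasis data (an assumption): disease status is the XOR of
# two binary loci, so neither locus alone shows a marginal effect.
snp_a = [0, 0, 1, 1] * 25
snp_b = [0, 1, 0, 1] * 25
disease = [a ^ b for a, b in zip(snp_a, snp_b)]

null_score = log_k2_score([], disease)
one_snp_score = log_k2_score([snp_a], disease)
two_snp_score = log_k2_score([snp_a, snp_b], disease)
print(null_score, one_snp_score, two_snp_score)
```

On this toy data the 1-SNP pattern scores no better than the null pattern while the 2-SNP pattern scores far higher, which is why a search guided only by 1-SNP scores can miss strict epistasis and why a bound on the score of expanded patterns is attractive.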

Subject(s)

Algorithms , Data Mining , Epistasis, Genetic , Alzheimer Disease/genetics , Bayes Theorem , Breast Neoplasms/genetics , Case-Control Studies , Computer Simulation , Female , Genome-Wide Association Study , Humans , Markov Chains , Models, Genetic , Oligonucleotide Array Sequence Analysis , Polymorphism, Single Nucleotide
