1.
Am J Psychiatry ; : appiajp20230247, 2024 May 15.
Article in English | MEDLINE | ID: mdl-38745458

ABSTRACT

OBJECTIVE: Treatment-resistant depression (TRD) occurs in roughly one-third of all individuals with major depressive disorder (MDD). Although research has suggested a significant common variant genetic component of liability to TRD, with heritability estimated at 8% when compared with non-treatment-resistant MDD, no replicated genetic loci have been identified, and the genetic architecture of TRD remains unclear. A key barrier to this work has been the paucity of adequately powered cohorts for investigation, largely because of the challenge in prospectively investigating this phenotype. The objective of this study was to perform a well-powered genetic study of TRD. METHODS: Using receipt of electroconvulsive therapy (ECT) as a surrogate for TRD, the authors applied standard machine learning methods to electronic health record data to derive predicted probabilities of receiving ECT. These probabilities were then applied as a quantitative trait in a genome-wide association study of 154,433 genotyped patients across four large biobanks. RESULTS: Heritability estimates ranged from 2% to 4.2%, and significant genetic overlap was observed with cognition, attention deficit hyperactivity disorder, schizophrenia, alcohol and smoking traits, and body mass index. Two genome-wide significant loci were identified, both previously implicated in metabolic traits, suggesting shared biology and potential pharmacological implications. CONCLUSIONS: This work provides support for the utility of estimation of disease probability for genomic investigation and provides insights into the genetic architecture and biology of TRD.

2.
medRxiv ; 2024 Mar 18.
Article in English | MEDLINE | ID: mdl-38562678

ABSTRACT

Suicide prevention requires risk identification, appropriate intervention, and follow-up. Traditional risk identification relies on patient self-reporting, support network reporting, or face-to-face screening with validated instruments or history and physical exam. In the last decade, statistical risk models have been studied and more recently deployed to augment clinical judgment. Models have generally been found to be low precision or problematic at scale due to low incidence. Few have been tested in clinical practice, and none have been tested in clinical trials to our knowledge. Methods: We report the results of a pragmatic randomized controlled trial (RCT) in three outpatient adult Neurology clinic settings. This two-arm trial compared the effectiveness of Interruptive and Non-Interruptive Clinical Decision Support (CDS) to prompt further screening of suicidal ideation for those predicted to be high risk using a real-time, validated statistical risk model of suicide attempt risk, with the decision to screen as the primary end point. Secondary outcomes included rates of suicidal ideation and attempts in both arms. Manual chart review of every trial encounter was used to determine if suicide risk assessment was subsequently documented. Results: From August 16, 2022, through February 16, 2023, our study randomized 596 patient encounters across 561 patients for providers to receive either Interruptive or Non-Interruptive CDS in a 1:1 ratio. Adjusting for provider cluster effects, Interruptive CDS led to significantly higher numbers of decisions to screen (42%=121/289 encounters) compared to Non-Interruptive CDS (4%=12/307) (odds ratio=17.7, p-value <0.001). Secondarily, no documented episodes of suicidal ideation or attempts occurred in either arm. 
While the proportion of documented assessments among those noting the decision to screen was higher for providers in the Non-Interruptive arm (92%=11/12) than in the Interruptive arm (52%=63/121), the interruptive CDS was associated with more frequent documentation of suicide risk assessment (63/289 encounters compared to 11/307, p-value<0.001). Conclusions: In this pragmatic RCT of real-time predictive CDS to guide suicide risk assessment, Interruptive CDS led to higher numbers of decisions to screen and documented suicide risk assessments. Well-powered large-scale trials randomizing this type of CDS compared to standard of care are indicated to measure effectiveness in reducing suicidal self-harm. ClinicalTrials.gov Identifier: NCT05312437.
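
The reported effect size can be sanity-checked from the encounter counts alone. The sketch below computes a crude (unadjusted) odds ratio. It is not the authors' analysis, which adjusted for provider cluster effects, and the function name is hypothetical.

```python
def odds_ratio(events_a, total_a, events_b, total_b):
    """Unadjusted odds ratio of arm A versus arm B from event counts."""
    odds_a = events_a / (total_a - events_a)
    odds_b = events_b / (total_b - events_b)
    return odds_a / odds_b


# Decisions to screen per encounter, as reported in the abstract:
# Interruptive 121/289, Non-Interruptive 12/307.
crude_or = odds_ratio(121, 289, 12, 307)
print(round(crude_or, 1))  # 17.7
```

The crude value is close to the adjusted odds ratio quoted above, which suggests the clustering adjustment changed the point estimate very little. It says nothing about the confidence interval.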

3.
Transl Psychiatry ; 14(1): 58, 2024 Jan 25.
Article in English | MEDLINE | ID: mdl-38272862

ABSTRACT

Bipolar disorder is a leading contributor to disability, premature mortality, and suicide. Early identification of risk for bipolar disorder using generalizable predictive models trained on diverse cohorts around the United States could improve targeted assessment of high risk individuals, reduce misdiagnosis, and improve the allocation of limited mental health resources. This observational case-control study intended to develop and validate generalizable predictive models of bipolar disorder as part of the multisite, multinational PsycheMERGE Network across diverse and large biobanks with linked electronic health records (EHRs) from three academic medical centers: in the Northeast (Massachusetts General Brigham), the Mid-Atlantic (Geisinger) and the Mid-South (Vanderbilt University Medical Center). Predictive models were developed and validated with multiple algorithms at each study site: random forests, gradient boosting machines, penalized regression, including stacked ensemble learning algorithms combining them. Predictors were limited to widely available EHR-based features agnostic to a common data model including demographics, diagnostic codes, and medications. The main study outcome was bipolar disorder diagnosis as defined by the International Cohort Collection for Bipolar Disorder, 2015. In total, the study included records for 3,529,569 patients including 12,533 cases (0.3%) of bipolar disorder. After internal and external validation, algorithms demonstrated optimal performance in their respective development sites. The stacked ensemble achieved the best combination of overall discrimination (AUC = 0.82-0.87) and calibration performance with positive predictive values above 5% in the highest risk quantiles at all three study sites. In conclusion, generalizable predictive models of risk for bipolar disorder can be feasibly developed across diverse sites to enable precision medicine. 
Comparison of a range of machine learning methods indicated that an ensemble approach provides the best performance overall but required local retraining. These models will be disseminated via the PsycheMERGE Network website.


Subject(s)
Bipolar Disorder , Humans , Bipolar Disorder/diagnosis , Case-Control Studies , Risk Assessment/methods , Machine Learning , Electronic Health Records
4.
medRxiv ; 2023 Dec 01.
Article in English | MEDLINE | ID: mdl-38076830

ABSTRACT

Post marketing safety surveillance depends in part on the ability to detect concerning clinical events at scale. Spontaneous reporting might be an effective component of safety surveillance, but it requires awareness and understanding among healthcare professionals to achieve its potential. Reliance on readily available structured data such as diagnostic codes risks under-coding and imprecision. Clinical textual data might bridge these gaps, and natural language processing (NLP) has been shown to aid in scalable phenotyping across healthcare records in multiple clinical domains. In this study, we developed and validated a novel incident phenotyping approach using unstructured clinical textual data agnostic to Electronic Health Record (EHR) and note type. It is based on a published, validated approach (PheRe) used to ascertain social determinants of health and suicidality across entire healthcare records. To demonstrate generalizability, we validated this approach on two separate phenotypes that share common challenges with respect to accurate ascertainment: 1) suicide attempt; 2) sleep-related behaviors. With samples of 89,428 records and 35,863 records for suicide attempt and sleep-related behaviors, respectively, we conducted silver standard (diagnostic coding) and gold standard (manual chart review) validation. We showed Area Under the Precision-Recall Curve of ∼ 0.77 (95% CI 0.75-0.78) for suicide attempt and AUPR ∼ 0.31 (95% CI 0.28-0.34) for sleep-related behaviors. We also evaluated performance by coded race and demonstrated differences in performance by race were dissimilar across phenotypes and require algorithmovigilance and debiasing prior to implementation.

5.
medRxiv ; 2023 Nov 01.
Article in English | MEDLINE | ID: mdl-37961557

ABSTRACT

The value of genetic information for improving the performance of clinical risk prediction models has yielded variable conclusions. Many methodological decisions have the potential to contribute to differential results across studies. Here, we performed multiple modeling experiments integrating clinical and demographic data from electronic health records (EHR) and genetic data to understand which decision points may affect performance. Clinical data in the form of structured diagnostic codes, medications, procedural codes, and demographics were extracted from two large independent health systems and polygenic risk scores (PRS) were generated across all patients with genetic data in the corresponding biobanks. Crohn's disease was used as the model phenotype based on its substantial genetic component, established EHR-based definition, and sufficient prevalence for model training and testing. We investigated the impact of PRS integration method, as well as choices regarding training sample, model complexity, and performance metrics. Overall, our results show that including PRS resulted in higher performance by some metrics but the gain in performance was only robust when combined with demographic data alone. Improvements were inconsistent or negligible after including additional clinical information. The impact of genetic information on performance also varied by PRS integration method, with a small improvement in some cases from combining PRS with the output of a clinical model (late-fusion) compared to its inclusion as an additional feature (early-fusion). The effects of other modeling decisions varied between institutions though performance increased with more compute-intensive models such as random forest. This work highlights the importance of considering methodological decision points in interpreting the impact on prediction performance when including PRS information in clinical models.

6.
JAMA Netw Open ; 6(11): e2342750, 2023 Nov 01.
Article in English | MEDLINE | ID: mdl-37938841

ABSTRACT

Importance: Suicide remains an ongoing concern in the US military. Statistical models have not been broadly disseminated for US Navy service members. Objective: To externally validate and update a statistical suicide risk model initially developed in a civilian setting with an emphasis on primary care. Design, Setting, and Participants: This retrospective cohort study used data collected from 2007 through 2017 among active-duty US Navy service members. The external civilian model was applied to every visit at Naval Medical Center Portsmouth (NMCP), its NMCP Naval Branch Health Clinics (NBHCs), and TRICARE Prime Clinics (TPCs) that fall within the NMCP area. The model was retrained and recalibrated using visits to NBHCs and TPCs and updated using Department of Defense (DoD)-specific billing codes and demographic characteristics, including expanded race and ethnicity categories. Domain and temporal analyses were performed with bootstrap validation. Data analysis was performed from September 2020 to December 2022. Exposure: Visit to US NMCP. Main Outcomes and Measures: Recorded suicidal behavior on the day of or within 30 days of a visit. Performance was assessed using area under the receiver operating curve (AUROC), area under the precision recall curve (AUPRC), Brier score, and Spiegelhalter z-test statistic. Results: Of the 260 583 service members, 6529 (2.5%) had a recorded suicidal behavior, 206 412 (79.2%) were male; 104 835 (40.2%) were aged 20 to 24 years; and 9458 (3.6%) were Asian, 56 715 (21.8%) were Black or African American, and 158 277 (60.7%) were White. Applying the civilian-trained model resulted in an AUROC of 0.77 (95% CI, 0.74-0.79) and an AUPRC of 0.004 (95% CI, 0.003-0.005) at NBHCs with poor calibration (Spiegelhalter P < .001). Retraining the algorithm improved AUROC to 0.92 (95% CI, 0.91-0.93) and AUPRC to 0.66 (95% CI, 0.63-0.68). 
Number needed to screen in the top risk tiers was 366 for the external model and 200 for the retrained model; the lower number indicates better performance. Domain validation showed AUROC of 0.90 (95% CI, 0.90-0.91) and AUPRC of 0.01 (95% CI, 0.01-0.01), and temporal validation showed AUROC of 0.75 (95% CI, 0.72-0.78) and AUPRC of 0.003 (95% CI, 0.003-0.005). Conclusions and Relevance: In this cohort study of active-duty Navy service members, a civilian suicide attempt risk model was externally validated. Retraining and updating with DoD-specific variables improved performance. Domain and temporal validation results were similar to external validation, suggesting that implementing an external model in US Navy primary care clinics may bypass the need for costly internal development and expedite the automation of suicide prevention in these clinics.
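
Number needed to screen (NNS) can be read as a positive predictive value. The sketch below assumes the common convention that NNS in a risk tier is the reciprocal of the PPV in that tier. The abstract does not state its definition, so this is an assumption, and the function name is hypothetical.

```python
def ppv_from_nns(nns):
    """Positive predictive value implied by a number needed to screen,
    assuming NNS = 1 / PPV within the risk tier (an assumption here)."""
    if nns <= 0:
        raise ValueError("NNS must be positive")
    return 1.0 / nns


# NNS values reported in the abstract for the top risk tiers.
for label, nns in [("external model", 366), ("retrained model", 200)]:
    print(f"{label}: NNS={nns}, implied PPV={ppv_from_nns(nns):.2%}")
```

Under that assumption, the external model's NNS of 366 corresponds to a PPV of about 0.27%, and the retrained model's NNS of 200 corresponds to 0.50%.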


Subject(s)
Models, Statistical , Suicide, Attempted , Humans , Male , Female , Cohort Studies , Retrospective Studies , Primary Health Care
7.
medRxiv ; 2023 Sep 30.
Article in English | MEDLINE | ID: mdl-37808705

ABSTRACT

Purpose: To estimate the association of psychiatric polygenic scores with healthcare utilization and comorbidity burden. Methods: Observational cohort study (N = 118,882) of adolescent and adult biobank participants with linked electronic health records (EHRs) from three diverse study sites (Massachusetts General Brigham, Vanderbilt University Medical Center, Geisinger). Polygenic scores (PGS) were derived from the largest available GWAS of major depressive disorder, bipolar disorder, and schizophrenia at the time of analysis. Negative binomial regression models were used to estimate the association between each psychiatric PGS and healthcare utilization and comorbidity burden. Healthcare utilization was measured as frequency of emergency department (ED), inpatient (IP), and outpatient (OP) visits. Comorbidity burden was defined by the Elixhauser Comorbidity Index and the Charlson Comorbidity Index. Results: Participants had a median follow-up duration of 12 years in the EHR. Individuals in the top decile of polygenic score for major depressive disorder had significantly more ED visits (RR=1.22, 95% CI 1.17, 1.29) compared to those in the lowest decile. Increases were also observed with IP and comorbidity burden. Among those diagnosed with depression and in the highest decile of the PGS, there was an increase in all utilization types (ED: RR=1.56, 95% CI 1.41, 1.72; OP: RR=1.16, 95% CI 1.08, 1.24; IP: RR=1.23, 95% CI 1.12, 1.36) post-diagnosis. No clinically significant results were observed with bipolar and schizophrenia polygenic scores. Conclusions: Polygenic score for depression is modestly associated with increased healthcare resource utilization and comorbidity burden, in the absence of diagnosis. Following a diagnosis of depression, the PGS was associated with further increases in healthcare utilization. These findings suggest that depression genetic risk is associated with utilization and burden of chronic disease in real-world settings.

8.
medRxiv ; 2023 Feb 26.
Article in English | MEDLINE | ID: mdl-36865341

ABSTRACT

Bipolar disorder is a leading contributor to disability, premature mortality, and suicide. Early identification of risk for bipolar disorder using generalizable predictive models trained on diverse cohorts around the United States could improve targeted assessment of high risk individuals, reduce misdiagnosis, and improve the allocation of limited mental health resources. This observational case-control study intended to develop and validate generalizable predictive models of bipolar disorder as part of the multisite, multinational PsycheMERGE Consortium across diverse and large biobanks with linked electronic health records (EHRs) from three academic medical centers: in the Northeast (Massachusetts General Brigham), the Mid-Atlantic (Geisinger) and the Mid-South (Vanderbilt University Medical Center). Predictive models were developed and validated with multiple algorithms at each study site: random forests, gradient boosting machines, penalized regression, including stacked ensemble learning algorithms combining them. Predictors were limited to widely available EHR-based features agnostic to a common data model including demographics, diagnostic codes, and medications. The main study outcome was bipolar disorder diagnosis as defined by the International Cohort Collection for Bipolar Disorder, 2015. In total, the study included records for 3,529,569 patients including 12,533 cases (0.3%) of bipolar disorder. After internal and external validation, algorithms demonstrated optimal performance in their respective development sites. The stacked ensemble achieved the best combination of overall discrimination (AUC = 0.82 - 0.87) and calibration performance with positive predictive values above 5% in the highest risk quantiles at all three study sites. In conclusion, generalizable predictive models of risk for bipolar disorder can be feasibly developed across diverse sites to enable precision medicine. 
Comparison of a range of machine learning methods indicated that an ensemble approach provides the best performance overall but required local retraining. These models will be disseminated via the PsycheMERGE Consortium website.

9.
Sci Rep ; 12(1): 15146, 2022 09 07.
Article in English | MEDLINE | ID: mdl-36071081

ABSTRACT

Methods relying on diagnostic codes to identify suicidal ideation and suicide attempt in Electronic Health Records (EHRs) at scale are suboptimal because suicide-related outcomes are heavily under-coded. We propose to improve the ascertainment of suicidal outcomes using natural language processing (NLP). We developed information retrieval methodologies to search over 200 million notes from the Vanderbilt EHR. Suicide query terms were extracted using word2vec. A weakly supervised approach was designed to label cases of suicidal outcomes. The NLP validation of the top 200 retrieved patients showed high performance for suicidal ideation (area under the receiver operator curve [AUROC]: 98.6, 95% confidence interval [CI] 97.1-99.5) and suicide attempt (AUROC: 97.3, 95% CI 95.2-98.7). Case extraction produced the best performance when combining NLP and diagnostic codes and when accounting for negated suicide expressions in notes. Overall, we demonstrated that scalable and accurate NLP methods can be developed to identify suicidal behavior in EHRs to enhance prevention efforts, predictive models, and precision medicine.


Subject(s)
Suicidal Ideation , Suicide, Attempted , Electronic Health Records , Humans , Information Storage and Retrieval , Natural Language Processing
10.
JAMA Netw Open ; 5(5): e2212095, 2022 05 02.
Article in English | MEDLINE | ID: mdl-35560048

ABSTRACT

Importance: Understanding the differences and potential synergies between traditional clinician assessment and automated machine learning might enable more accurate and useful suicide risk detection. Objective: To evaluate the respective and combined abilities of a real-time machine learning model and the Columbia Suicide Severity Rating Scale (C-SSRS) to predict suicide attempt (SA) and suicidal ideation (SI). Design, Setting, and Participants: This cohort study included encounters with adult patients (aged ≥18 years) at a major academic medical center. The C-SSRS was administered during routine care, and a Vanderbilt Suicide Attempt and Ideation Likelihood (VSAIL) prediction was generated in the electronic health record. Encounters took place in the inpatient, ambulatory surgical, and emergency department settings. Data were collected from June 2019 to September 2020. Main Outcomes and Measures: Primary outcomes were the incidence of SA and SI, encoded as International Classification of Diseases codes, occurring within various time periods after an index visit. We evaluated the retrospective validity of the C-SSRS, VSAIL, and ensemble models combining both. Discrimination metrics included area under the receiver operating curve (AUROC), area under the precision-recall curve (AUPR), sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). Results: The cohort included 120 398 unique index visits for 83 394 patients (mean [SD] age, 51.2 [20.6] years; 38 107 [46%] men; 45 273 [54%] women; 13 644 [16%] Black; 63 869 [77%] White). Within 30 days of an index visit, the combined models had higher AUROC (SA: 0.874-0.887; SI: 0.869-0.879) than both the VSAIL (SA: 0.729; SI: 0.773) and C-SSRS (SA: 0.823; SI: 0.777) models. 
In the highest risk-decile, ensemble methods had PPV of 1.3% to 1.4% for SA and 8.3% to 8.7% for SI and sensitivity of 77.6% to 79.5% for SA and 67.4% to 70.1% for SI, outperforming VSAIL (PPV for SA: 0.4%; PPV for SI: 3.9%; sensitivity for SA: 28.8%; sensitivity for SI: 35.1%) and C-SSRS (PPV for SA: 0.5%; PPV for SI: 3.5%; sensitivity for SA: 76.6%; sensitivity for SI: 68.8%). Conclusions and Relevance: In this study, suicide risk prediction was optimal when leveraging both in-person screening (for acute measures of risk in patient-reported suicidality) and historical EHR data (for underlying clinical factors that can quantify a patient's passive risk level). To improve suicide risk classification, prediction systems could combine pretrained machine learning with structured clinician assessment without needing to retrain the original model.


Subject(s)
Suicidal Ideation , Suicide, Attempted , Adolescent , Adult , Cohort Studies , Female , Humans , Machine Learning , Male , Middle Aged , Retrospective Studies
11.
J Am Med Inform Assoc ; 29(1): 22-32, 2021 12 28.
Article in English | MEDLINE | ID: mdl-34665246

ABSTRACT

OBJECTIVE: To develop and validate algorithms for predicting 30-day fatal and nonfatal opioid-related overdose using statewide data sources including prescription drug monitoring program data, Hospital Discharge Data System data, and Tennessee (TN) vital records. Current overdose prevention efforts in TN rely on descriptive and retrospective analyses without prognostication. MATERIALS AND METHODS: Study data included 3 041 668 TN patients with 71 479 191 controlled substance prescriptions from 2012 to 2017. Statewide data and socioeconomic indicators were used to train, ensemble, and calibrate 10 nonparametric "weak learner" models. Validation was performed using area under the receiver operating curve (AUROC), area under the precision recall curve, risk concentration, and Spiegelhalter z-test statistic. RESULTS: Within 30 days, 2574 fatal overdoses occurred after 4912 prescriptions (0.0069%) and 8455 nonfatal overdoses occurred after 19 460 prescriptions (0.027%). Discrimination and calibration improved after ensembling (AUROC: 0.79-0.83; Spiegelhalter P value: 0-.12). Risk concentration captured 47-52% of cases in the top quantiles of predicted probabilities. DISCUSSION: Partitioning and ensembling enabled all study data to be used given computational limits and helped mediate case imbalance. Predicting risk at the prescription level can aggregate risk to the patient, provider, pharmacy, county, and regional levels. Implementing these models into Tennessee Department of Health systems might enable more granular risk quantification. Prospective validation with more recent data is needed. CONCLUSION: Predicting opioid-related overdose risk at statewide scales remains difficult and models like these, which required a partnership between an academic institution and state health agency to develop, may complement traditional epidemiological methods of risk identification and inform public health decisions.


Subject(s)
Analgesics, Opioid , Prescription Drug Monitoring Programs , Analgesics, Opioid/therapeutic use , Hospitals , Humans , Machine Learning , Patient Discharge , Retrospective Studies , Tennessee/epidemiology
12.
JAMA Netw Open ; 4(3): e211428, 2021 03 01.
Article in English | MEDLINE | ID: mdl-33710291

ABSTRACT

Importance: Numerous prognostic models of suicide risk have been published, but few have been implemented outside of integrated managed care systems. Objective: To evaluate performance of a suicide attempt risk prediction model implemented in a vendor-supplied electronic health record to predict subsequent (1) suicidal ideation and (2) suicide attempt. Design, Setting, and Participants: This observational cohort study evaluated implementation of a suicide attempt prediction model in live clinical systems without alerting. The cohort comprised patients seen for any reason in adult inpatient, emergency department, and ambulatory surgery settings at an academic medical center in the mid-South from June 2019 to April 2020. Main Outcomes and Measures: Primary measures assessed external, prospective, and concurrent validity. Manual medical record validation of coded suicide attempts confirmed incident behaviors with intent to die. Subgroup analyses were performed based on demographic characteristics, relevant clinical context/setting, and presence or absence of universal screening. Performance was evaluated using discrimination (number needed to screen, C statistics, positive/negative predictive values) and calibration (Spiegelhalter z statistic). Recalibration was performed with logistic calibration. Results: The system generated 115 905 predictions for 77 973 patients (42 490 [54%] men, 35 404 [45%] women, 60 586 [78%] White, 12 620 [16%] Black). Numbers needed to screen in highest risk quantiles were 23 and 271 for suicidal ideation and attempt, respectively. Performance was maintained across demographic subgroups. Numbers needed to screen for suicide attempt by sex were 256 for men and 323 for women; and by race: 373, 176, and 407 for White, Black, and non-White/non-Black patients, respectively. 
Model C statistics were, across the health system: 0.836 (95% CI, 0.836-0.837); adult hospital: 0.77 (95% CI, 0.77-0.772); emergency department: 0.778 (95% CI, 0.777-0.778); psychiatry inpatient settings: 0.634 (95% CI, 0.633-0.636). Predictions were initially miscalibrated (Spiegelhalter z = -3.1; P = .001) with improvement after recalibration (Spiegelhalter z = 1.1; P = .26). Conclusions and Relevance: In this study, this real-time predictive model of suicide attempt risk showed reasonable numbers needed to screen in nonpsychiatric specialty settings in a large clinical system. Assuming that research-valid models will translate without performing this type of analysis risks inaccuracy in clinical practice, misclassification of risk, wasted effort, and missed opportunity to correct and prevent such problems. The next step is careful pairing with low-cost, low-harm preventive strategies in a pragmatic trial of effectiveness in preventing future suicidality.
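
The calibration check named above, the Spiegelhalter z statistic, can be sketched in a few lines. This is a minimal illustration using the standard published formula, not the authors' code. The function names are hypothetical, and the two-sided p-value convention is an assumption.

```python
import math


def spiegelhalter_z(y_true, y_prob):
    """Spiegelhalter z statistic for binary outcomes and predicted probabilities.

    z = sum((y - p) * (1 - 2p)) / sqrt(sum((1 - 2p)^2 * p * (1 - p)))
    Values near 0 are consistent with good calibration.
    """
    numerator = sum((y - p) * (1 - 2 * p) for y, p in zip(y_true, y_prob))
    variance = sum((1 - 2 * p) ** 2 * p * (1 - p) for p in y_prob)
    if variance == 0:
        raise ValueError("variance is zero (all predictions are 0, 0.5, or 1)")
    return numerator / math.sqrt(variance)


def two_sided_p(z):
    """Two-sided p-value for z under a standard normal reference."""
    return math.erfc(abs(z) / math.sqrt(2))


# Toy data for illustration only (not from the study).
outcomes = [1, 0, 1, 0]
predictions = [0.8, 0.2, 0.8, 0.2]
z = spiegelhalter_z(outcomes, predictions)
print(z, two_sided_p(z))
```

A large absolute z with a small p-value, as reported before recalibration, indicates that the predicted probabilities depart from the observed outcomes by more than chance would explain.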


Subject(s)
Electronic Health Records , Models, Statistical , Risk Assessment/methods , Suicidal Ideation , Suicide, Attempted/statistics & numerical data , Adult , Cohort Studies , Computer Systems , Female , Humans , Male , Middle Aged , Predictive Value of Tests
13.
AMIA Annu Symp Proc ; 2020: 1050-1058, 2020.
Article in English | MEDLINE | ID: mdl-33936481

ABSTRACT

Primary care represents a major opportunity for suicide prevention in the military. Significant advances have been made in using electronic health record data to predict suicide attempts in patient populations. With a user-centered design approach, we are developing an intervention that uses predictive analytics to inform care teams about their patients' risk of suicide attempt. We present our experience working with clinicians and staff in a military primary care setting to create preliminary designs and a context-specific usability testing plan for the deployment of the suicide risk indicator.


Subject(s)
Machine Learning , Military Personnel/psychology , Suicide Prevention , Suicide, Attempted/prevention & control , Suicide, Attempted/psychology , User-Centered Design , Electronic Health Records , Humans , Predictive Value of Tests , Risk Assessment , Risk Factors
14.
Genet Med ; 20(4): 470-473, 2018 04.
Article in English | MEDLINE | ID: mdl-28837159

ABSTRACT

Purpose: The Genomic Oligoarray and SNP Array Evaluation Tool 3.0 matches candidate genes within regions of homozygosity with a patient's phenotype, by mining OMIM for gene entries that contain a Clinical Synopsis. However, the tool cannot identify genes/disorders whose OMIM entries lack a descriptor of the mode of (Mendelian) inheritance. This study aimed to improve the tool's diagnostic power by building a database of autosomal recessive diseases not diagnosable through OMIM. Methods: We extracted a list of all genes in OMIM that produce disease phenotypes but lack Clinical Synopses or other statements of mode of inheritance. We then searched PubMed for literature regarding each gene in order to infer its inheritance pattern. Results: We analyzed 1,392 genes. Disorders associated with 372 genes were annotated as recessive and 430 as dominant. Autosomal genes were ranked from 1 to 3, with 3 indicating the strongest evidence behind the inferred mode of inheritance. Of 834 autosomal genes, 158 were ranked as 1, 228 as 2, and 448 as 3. Conclusion: The 372 genes associated with recessive disorders will be contributed to the SNP array tool, and the entire database to OMIM. We anticipate that these findings will be useful in rare disease diagnostics.


Subject(s)
Computational Biology/methods , Databases, Genetic , Genetic Predisposition to Disease , Genome-Wide Association Study/methods , Inheritance Patterns , Genomics/methods , Genotype , Humans , Molecular Sequence Annotation , Phenotype
15.
BMC Bioinformatics ; 18(Suppl 14): 509, 2017 12 28.
Article in English | MEDLINE | ID: mdl-29297276

ABSTRACT

BACKGROUND: NCBI's Gene Expression Omnibus (GEO) is a rich community resource containing millions of gene expression experiments from human, mouse, rat, and other model organisms. However, information about each experiment (metadata) is in the format of an open-ended, non-standardized textual description provided by the depositor. Thus, classification of experiments for meta-analysis by factors such as gender, age of the sample donor, and tissue of origin is not feasible without assigning labels to the experiments. Automated approaches are preferable for this, primarily because of the size and volume of the data to be processed, but also because they ensure standardization and consistency. While some of these labels can be extracted directly from the textual metadata, many of the data available do not contain explicit text informing the researcher about the age and gender of the subjects in the study. To bridge this gap, machine-learning methods can be trained to use the gene expression patterns associated with the text-derived labels to refine label-prediction confidence. RESULTS: Our analysis shows only 26% of metadata text contains information about gender and 21% about age. In order to ameliorate the lack of available labels for these data sets, we first extract labels from the textual metadata for each GEO RNA dataset and evaluate the performance against a gold standard of manually curated labels. We then use machine-learning methods to predict labels, based upon gene expression of the samples and compare this to the text-based method. CONCLUSION: Here we present an automated method to extract labels for age, gender, and tissue from textual metadata and GEO data using both a heuristic approach as well as machine learning. We show the two methods together improve accuracy of label assignment to GEO samples.


Subject(s)
Algorithms , Gene Expression , Metadata , Age Factors , Animals , Automation , Databases, Genetic , Female , Gene Ontology , Humans , Machine Learning , Male , Middle Aged , Molecular Sequence Annotation , Rats , Reference Standards