ABSTRACT
Artificial intelligence (AI) is being increasingly integrated into scientific discovery to augment and accelerate research, helping scientists to generate hypotheses, design experiments, collect and interpret large datasets, and gain insights that might not have been possible using traditional scientific methods alone. Here we examine breakthroughs over the past decade that include self-supervised learning, which allows models to be trained on vast amounts of unlabelled data, and geometric deep learning, which leverages knowledge about the structure of scientific data to enhance model accuracy and efficiency. Generative AI methods can create designs, such as small-molecule drugs and proteins, by analysing diverse data modalities, including images and sequences. We discuss how these methods can help scientists throughout the scientific process and the central issues that remain despite such advances. Both developers and users of AI tools need a better understanding of when such approaches require improvement, and challenges posed by poor data quality and stewardship remain. These issues cut across scientific disciplines and require developing foundational algorithmic approaches that can contribute to scientific understanding or acquire it autonomously, making them critical areas of focus for AI innovation.
Subject(s)
Artificial Intelligence; Research Design; Artificial Intelligence/standards; Artificial Intelligence/trends; Datasets as Topic; Deep Learning; Research Design/standards; Research Design/trends; Unsupervised Machine Learning
ABSTRACT
MOTIVATION: Thanks to the increasing availability of drug-drug interaction (DDI) datasets and large biomedical knowledge graphs (KGs), accurate detection of adverse DDIs using machine learning models has become possible. However, it remains largely an open problem how to effectively utilize large and noisy biomedical KGs for DDI detection. Due to their sheer size and amount of noise, it is often less beneficial to directly integrate KGs with other smaller but higher-quality data (e.g. experimental data). Most existing approaches ignore KGs altogether; some try to directly integrate KGs with other data via graph neural networks, with limited success. Furthermore, most previous works focus on binary DDI prediction, whereas multi-typed prediction of DDI pharmacological effects is a more meaningful but harder task. RESULTS: To fill these gaps, we propose a new method, SumGNN (knowledge summarization graph neural network), which is enabled by a subgraph extraction module that can efficiently anchor on relevant subgraphs from a KG, a self-attention-based subgraph summarization scheme to generate reasoning paths within the subgraph, and a multi-channel knowledge and data integration module that utilizes massive external biomedical knowledge for significantly improved multi-typed DDI predictions. SumGNN outperforms the best baseline by up to 5.54%, and the performance gain is particularly significant in low-data relation types. In addition, SumGNN provides interpretable predictions via the generated reasoning paths for each prediction. AVAILABILITY AND IMPLEMENTATION: The code is available in the Supplementary Material. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
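As a toy illustration of the subgraph-anchoring idea described above (not the SumGNN implementation itself), the following Python sketch extracts the local neighbourhood around a drug pair from a knowledge graph using networkx; the graph and node names are hypothetical.

import networkx as nx

def extract_pair_subgraph(kg, drug_a, drug_b, k_hops=2):
    """Return the k-hop neighbourhood subgraph around both drugs."""
    nodes = set()
    for drug in (drug_a, drug_b):
        # ego_graph collects all nodes within k_hops of the anchor node
        nodes.update(nx.ego_graph(kg, drug, radius=k_hops).nodes)
    return kg.subgraph(nodes).copy()

# Toy knowledge graph: drugs, genes, a pathway, and a side effect
kg = nx.Graph()
kg.add_edges_from([
    ("drugA", "geneX"), ("drugB", "geneX"),
    ("geneX", "pathwayP"), ("drugB", "sideEffectS"),
    ("pathwayP", "geneY"),
])

sub = extract_pair_subgraph(kg, "drugA", "drugB", k_hops=1)
print(sorted(sub.nodes))  # nodes reachable within one hop of either drug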
Subject(s)
Neural Networks, Computer; Pattern Recognition, Automated; Drug Interactions; Machine Learning
ABSTRACT
MOTIVATION: Drug-target interaction (DTI) prediction is a foundational task for in-silico drug discovery, which is costly and time-consuming due to the need for experimental search over a large compound space. Recent years have witnessed promising progress for deep learning in DTI prediction. However, the following challenges remain open: (i) existing molecular representation learning approaches ignore the sub-structural nature of DTI, thus producing results that are less accurate and difficult to explain; and (ii) existing methods focus on limited labeled data while ignoring the value of massive unlabeled molecular data. RESULTS: We propose a Molecular Interaction Transformer (MolTrans) to address these limitations via: (i) a knowledge-inspired sub-structural pattern mining algorithm and interaction modeling module for more accurate and interpretable DTI prediction and (ii) an augmented transformer encoder to better extract and capture the semantic relations among sub-structures extracted from massive unlabeled biomedical data. We evaluate MolTrans on real-world data and show that it improves DTI prediction performance compared to state-of-the-art baselines. AVAILABILITY AND IMPLEMENTATION: The model scripts are available at https://github.com/kexinhuang12345/moltrans. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
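The sub-structural pattern mining step can be pictured as a byte-pair-encoding-style procedure over SMILES strings: repeatedly merge the most frequent adjacent token pair into a larger sub-structure token. The sketch below is a simplified, hypothetical illustration of that idea, not the exact MolTrans algorithm.

from collections import Counter

def most_frequent_pair(tokenized_smiles):
    """Count adjacent token pairs across a corpus of tokenized SMILES."""
    pairs = Counter()
    for tokens in tokenized_smiles:
        for left, right in zip(tokens, tokens[1:]):
            pairs[(left, right)] += 1
    return pairs.most_common(1)[0] if pairs else None

def merge_pair(tokens, pair):
    """Merge every occurrence of pair into a single sub-structure token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

corpus = [list("CCO"), list("CCN"), list("CCOC")]  # character-level start
pair, count = most_frequent_pair(corpus)
corpus = [merge_pair(tokens, pair) for tokens in corpus]
print(pair, count, corpus)  # ('C', 'C') is merged into a 'CC' sub-structure token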
Subject(s)
Drug Development; Pharmaceutical Preparations; Algorithms; Computer Simulation; Drug Discovery
ABSTRACT
SUMMARY: Accurate prediction of drug-target interactions (DTI) is crucial for drug discovery. Recently, deep learning (DL) models have shown promising performance for DTI prediction. However, these models can be difficult to use for both computer scientists entering the biomedical field and bioinformaticians with limited DL experience. We present DeepPurpose, a comprehensive and easy-to-use DL library for DTI prediction. DeepPurpose supports training of customized DTI prediction models by implementing 15 compound and protein encoders and over 50 neural architectures, along with providing many other useful features. We demonstrate state-of-the-art performance of DeepPurpose on several benchmark datasets. AVAILABILITY AND IMPLEMENTATION: https://github.com/kexinhuang12345/DeepPurpose. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
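A minimal usage sketch following the quick-start workflow documented in the DeepPurpose repository is shown below; function names and parameters should be checked against the installed version, and the dataset path is a placeholder.

from DeepPurpose import utils, dataset
from DeepPurpose import DTI as models

# Load a benchmark DTI dataset (drug SMILES, protein sequences, binding affinities)
X_drugs, X_targets, y = dataset.load_process_DAVIS(path='./data', binary=False)

drug_encoding, target_encoding = 'CNN', 'Transformer'
train, val, test = utils.data_process(
    X_drugs, X_targets, y,
    drug_encoding, target_encoding,
    split_method='random', frac=[0.7, 0.1, 0.2],
)

config = utils.generate_config(
    drug_encoding=drug_encoding,
    target_encoding=target_encoding,
    train_epoch=5,
)
model = models.model_initialize(**config)
model.train(train, val, test)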
Subject(s)
Deep Learning; Pharmaceutical Preparations; Drug Development; Drug Discovery; Proteins
ABSTRACT
OBJECTIVE: This study was undertaken to determine the dose-response relation between epileptiform activity burden and outcomes in acutely ill patients. METHODS: A single-center retrospective analysis was performed of 1,967 neurologic, medical, and surgical patients who underwent >16 hours of continuous electroencephalography (EEG) between 2011 and 2017. We developed an artificial intelligence algorithm to annotate 11.02 terabytes of EEG and quantify epileptiform activity burden within 72 hours of recording. We evaluated burden (1) in the first 24 hours of recording, (2) in the 12-hour epoch with the highest burden (peak burden), and (3) cumulatively through the first 72 hours of monitoring. Machine learning was applied to estimate the effect of epileptiform burden on outcome. The outcome measure was the modified Rankin Scale at discharge, dichotomized as good (0-4) versus poor (5-6). RESULTS: Peak epileptiform burden was independently associated with poor outcomes (p < 0.0001). Other independent associations included age, Acute Physiology and Chronic Health Evaluation II score, seizure on presentation, and diagnosis of hypoxic-ischemic encephalopathy. Model calibration error was calculated across 3 strata based on the time interval between the last EEG measurement (up to 72 hours of monitoring) and discharge: (1) <5 days between last measurement and discharge, 0.0941 (95% confidence interval [CI] = 0.0706-0.1191); (2) 5 to 10 days between last measurement and discharge, 0.0946 (95% CI = 0.0631-0.1290); (3) >10 days between last measurement and discharge, 0.0998 (95% CI = 0.0698-0.1335). After adjusting for covariates, an increase in peak epileptiform activity burden from 0 to 100% increased the probability of a poor outcome by 35%. INTERPRETATION: Automated measurement of peak epileptiform activity burden affords a convenient, consistent, and quantifiable target for future multicenter randomized trials investigating whether suppressing epileptiform activity improves outcomes. ANN NEUROL 2021;90:300-311.
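A minimal sketch of the three burden summaries, assuming epileptiform activity has already been annotated per 10-second EEG window as a value in [0, 1] (the window length and the synthetic data are illustrative assumptions, not the study's pipeline):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
windows_per_hour = 360  # 10-second windows
burden = pd.Series(rng.random(72 * windows_per_hour))  # synthetic 72-hour record

# (1) burden in the first 24 hours of recording
first_24h_burden = burden.iloc[:24 * windows_per_hour].mean()

# (2) peak burden: highest mean over any contiguous 12-hour epoch
peak_12h_burden = burden.rolling(window=12 * windows_per_hour).mean().max()

# (3) cumulative burden through the first 72 hours of monitoring
cumulative_72h_burden = burden.mean()

print(f"first 24 h: {first_24h_burden:.3f}, peak 12 h: {peak_12h_burden:.3f}, "
      f"cumulative 72 h: {cumulative_72h_burden:.3f}")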
Subject(s)
Artificial Intelligence; Cost of Illness; Seizures/diagnosis; Seizures/physiopathology; Aged; Cohort Studies; Electroencephalography/methods; Female; Humans; Male; Middle Aged; Retrospective Studies; Treatment Outcome
ABSTRACT
The goal of molecular optimization is to generate molecules similar to a target molecule but with better chemical properties. Deep generative models have shown great success in molecule optimization. However, due to the iterative local generation process of deep generative models, the resulting molecules can significantly deviate from the input in molecular similarity and size, leading to poor chemical properties. The key issue is that existing deep generative models restrict their attention to substructure-level generation without considering the entire molecule as a whole. To address this challenge, we propose Molecule-Level Reward functions (MOLER) to encourage (1) the input and the generated molecule to be similar and (2) the generated molecule to have a size similar to the input. The proposed method can be combined with various deep generative models. A policy gradient technique is introduced to optimize reward-based objectives with small computational overhead. Empirical studies show that MOLER achieves up to 20.2% relative improvement in success rate over the best baseline method on several properties, including QED, DRD2 and LogP.
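The molecule-level idea can be illustrated with a reward that scores similarity to the input and penalizes size deviation; the RDKit-based sketch below is a simplified stand-in with a hypothetical weighting, not the exact MOLER reward.

from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs

def molecule_level_reward(input_smiles, generated_smiles, size_weight=0.1):
    mol_in = Chem.MolFromSmiles(input_smiles)
    mol_gen = Chem.MolFromSmiles(generated_smiles)
    if mol_gen is None:  # invalid molecules receive no reward
        return 0.0
    fp_in = AllChem.GetMorganFingerprintAsBitVect(mol_in, radius=2, nBits=2048)
    fp_gen = AllChem.GetMorganFingerprintAsBitVect(mol_gen, radius=2, nBits=2048)
    similarity = DataStructs.TanimotoSimilarity(fp_in, fp_gen)  # reward similarity to the input
    size_penalty = abs(mol_in.GetNumHeavyAtoms() - mol_gen.GetNumHeavyAtoms())
    return similarity - size_weight * size_penalty  # penalize size deviation

print(molecule_level_reward("CCO", "CCN"))  # similar size, moderate similarity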
ABSTRACT
There is a growing interest in applying deep learning (DL) to healthcare, driven by the availability of data with multiple feature channels in rich-data environments (e.g., intensive care units). However, in many other practical situations, we can only access data with far fewer feature channels in poor-data environments (e.g., at home), which often results in predictive models with poor performance. How can we boost the performance of models learned from such a poor-data environment by leveraging knowledge extracted from existing models trained using rich data in a related environment? To address this question, we develop a knowledge infusion framework named CHEER that can succinctly summarize such a rich model into transferable representations, which can be incorporated into the poor model to improve its performance. The infused model is analyzed theoretically and evaluated empirically on several datasets. Our empirical results showed that CHEER outperformed baselines by 5.60% to 46.80% in terms of the macro-F1 score on multiple physiological datasets.
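The knowledge-transfer idea can be sketched as a distillation-style objective in which a student model trained in the poor-data environment is also encouraged to match soft targets produced by the rich-data model; this PyTorch snippet only illustrates that general idea and is not the exact CHEER summarization mechanism.

import torch
import torch.nn.functional as F

def infusion_loss(student_logits, teacher_logits, labels, alpha=0.5, temperature=2.0):
    """Blend a supervised loss with agreement to a frozen rich-data model."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    return alpha * hard + (1 - alpha) * soft

student_logits = torch.randn(8, 3, requires_grad=True)
teacher_logits = torch.randn(8, 3)  # produced by the model trained on rich data
labels = torch.randint(0, 3, (8,))
loss = infusion_loss(student_logits, teacher_logits, labels)
loss.backward()
print(float(loss))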
ABSTRACT
BACKGROUND: Activity or audit log data are required for EHR privacy and security management but may also be useful for understanding desktop workflow. OBJECTIVE: We determined whether the EHR audit log file, a rich source of complex time-stamped data on desktop activities, could be processed to derive primary care provider (PCP) level workflow measures. METHODS: We analyzed audit log data on 876 PCPs across 17,455 ambulatory care encounters that generated 578,394 time-stamped records. Each individual record represents a user interaction (e.g., point and click) that reflects all or part of a specific activity (e.g., order entry access). No dictionary exists to define how to combine clusters of sequential audit log records to represent identifiable PCP tasks. We determined whether PARAFAC2 tensor factorization could: (1) learn to identify audit log record clusters that specifically represent defined PCP tasks; and (2) identify variation in how tasks are completed without the need for ground-truth labels. To interpret the result, we used the following PARAFAC2 factors: a matrix representing the task definitions and a matrix containing the frequency measure of each task for each encounter. RESULTS: PARAFAC2 automatically identified 4 clusters of audit log records that represent 4 common clinical encounter tasks: (1) medication access, (2) notes access, (3) order entry access, and (4) diagnosis modification. PARAFAC2 also identified the most common variants in how PCPs accomplish these tasks. It discovered variation in how the notes access task was done, including identification of 9 distinct variants of notes access that explained 77% of the input data variation for notes. The discovered variants mapped to two known workflows for notes access and to two distinct PCP user groups who accessed notes using either the Visit Navigator or the Wrap-Up option. CONCLUSIONS: Our results demonstrate that EHR audit log data can be rapidly processed to create higher-level constructed features that represent time-stamped PCP tasks.
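A small sketch of fitting PARAFAC2 to irregular per-encounter matrices with TensorLy follows; the data are synthetic stand-ins, and the unpacking of the returned decomposition (weights, factors, projections) should be checked against the installed TensorLy version.

import numpy as np
from tensorly.decomposition import parafac2

rng = np.random.default_rng(0)
# One matrix per encounter: (number of audit log records) x (action types);
# the number of records varies across encounters, the action vocabulary is fixed.
slices = [rng.random((rng.integers(20, 40), 12)) for _ in range(30)]

decomposition = parafac2(slices, rank=4, n_iter_max=100)
weights, factors, projections = decomposition
A, B, C = factors
# C (action types x rank) plays the role of the task definitions;
# A (encounters x rank) gives the per-encounter intensity of each task.
print(A.shape, C.shape, len(projections))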
Subject(s)
Electronic Health Records; Health Personnel; Humans; Workflow
ABSTRACT
OBJECTIVE: Our aim is to extract clinically meaningful phenotypes from the longitudinal electronic health records (EHRs) of medically complex children. This is a fragile set of patients who consume a disproportionate amount of pediatric care resources but often end up with sub-optimal clinical outcomes. The rise in available EHR data provides a rich source that can be used to disentangle their complex clinical conditions into concise, clinically meaningful groups of characteristics. We aim to identify those phenotypes and their temporal evolution in a scalable, computational manner that avoids time-consuming manual chart review. MATERIALS AND METHODS: We analyze longitudinal EHRs from Children's Healthcare of Atlanta including 1,045 medically complex patients with a total of 59,948 encounters over 2 years. We apply a tensor factorization method called PARAFAC2 to extract: (a) clinically meaningful groups of features, (b) concise patient representations indicating the presence of a phenotype for each patient, and (c) temporal signatures indicating the evolution of those phenotypes over time for each patient. RESULTS: We identified four medically complex phenotypes, namely gastrointestinal disorders, oncological conditions, blood-related disorders, and neurological system disorders, which have distinct clinical characterizations among patients. We demonstrate the utility of the patient representations produced by PARAFAC2 for identifying groups of patients with significant survival variations. Finally, we showcase representative examples of the temporal phenotypic trends extracted for different patients. DISCUSSION: Unsupervised temporal phenotyping is an important task since it minimizes the burden on clinical experts by limiting their involvement to validation of the output phenotypes. PARAFAC2 enjoys several compelling properties for temporal computational phenotyping: (a) it can handle high-dimensional data and variable numbers of encounters across patients, (b) it has an intuitive interpretation, and (c) it is free from ad-hoc parameter choices. Computational phenotypes, such as the ones computed by our approach, have multiple applications; we highlight three that are particularly useful for medically complex children: (1) integration into clinical decision support systems, (2) interpretable mortality prediction, and (3) clinical trial recruitment. CONCLUSION: PARAFAC2 can be applied to unsupervised temporal phenotyping tasks where precise definitions of different phenotypes are absent and the lengths of patient records vary.
Subject(s)
Data Mining/methods; Electronic Health Records; Phenotype; Algorithms; Child; Georgia; Humans; Longitudinal Studies
ABSTRACT
BACKGROUND: Personalized, precision, P4, or stratified medicine is understood as a medical approach in which patients are stratified based on their disease subtype, risk, prognosis, or treatment response using specialized diagnostic tests. The key idea is to base medical decisions on individual patient characteristics, including molecular and behavioral biomarkers, rather than on population averages. Personalized medicine is deeply connected to and dependent on data science, specifically machine learning (often named artificial intelligence in the mainstream media). While in recent years there has been a lot of enthusiasm about the potential of 'big data' and machine learning-based solutions, only a few examples exist that have an impact on current clinical practice. The lack of impact on clinical practice can largely be attributed to insufficient performance of predictive models, difficulties in interpreting complex model predictions, and lack of validation via prospective clinical trials that demonstrate a clear benefit compared to the standard of care. In this paper, we review the potential of state-of-the-art data science approaches for personalized medicine, discuss open challenges, and highlight directions that may help to overcome them in the future. CONCLUSIONS: There is a need for an interdisciplinary effort, including data scientists, physicians, patient advocates, regulatory agencies, and health insurance organizations. Partially unrealistic expectations and concerns about data science-based solutions need to be better managed. In parallel, computational methods must advance further to provide direct benefit to clinical practice.
Subject(s)
Precision Medicine/methods; Humans; Prospective Studies
ABSTRACT
Patients with drug-resistant epilepsy (DRE) are at high risk of morbidity and mortality, yet their referral to specialist care is frequently delayed. The ability to identify patients at high risk of DRE at the time of treatment initiation, and to subsequently steer their treatment pathway toward more personalized interventions, has high clinical utility. Here, we aim to demonstrate the feasibility of developing algorithms for predicting DRE using machine learning methods. Longitudinal, intersected data sourced from US pharmacy, medical, and adjudicated hospital claims from 1,376,756 patients from 2006 to 2015 were analyzed; 292,892 met inclusion criteria for epilepsy, and 38,382 were classified as having DRE using a proxy measure for drug resistance. Patients were characterized using 1,270 features reflecting demographics, comorbidities, medications, procedures, epilepsy status, and payer status. Data from 175,735 randomly selected patients were used to train three algorithms; data from the remainder were used to assess the trained models' predictive power. A model with only age and sex was used as a benchmark. The best model, random forest, achieved an area under the receiver operating characteristic curve (95% confidence interval [CI]) of 0.764 (0.759, 0.770), compared with 0.657 (0.651, 0.663) for the benchmark model. Moreover, predicted probabilities for DRE were well calibrated with the observed frequencies in the data. The model predicted drug resistance approximately 2 years before patients in the test dataset had failed two antiepileptic drugs (AEDs). Machine learning models constructed using claims data predicted which patients are likely to fail ≥3 AEDs and are at risk of developing DRE at the time of the first AED prescription. The use of such models can ensure that patients with predicted DRE receive specialist care, with potentially more aggressive therapeutic interventions from diagnosis, to help reduce the serious sequelae of DRE.
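A schematic version of the modeling setup (random forest with probability calibration and AUROC evaluation) is sketched below using scikit-learn on synthetic stand-in data; the study's 1,270 claims-derived features and proxy outcome are not reproduced here.

import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((5000, 50))  # stand-in for demographics, comorbidities, medications, ...
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.3, 5000) > 1.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Calibrated random forest: predicted probabilities should track observed frequencies
model = CalibratedClassifierCV(RandomForestClassifier(n_estimators=200, random_state=0))
model.fit(X_train, y_train)

proba = model.predict_proba(X_test)[:, 1]
print(f"AUROC: {roc_auc_score(y_test, proba):.3f}")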
Subject(s)
Anticonvulsants/therapeutic use; Drug Resistant Epilepsy; Machine Learning; Adult; Algorithms; Drug Resistant Epilepsy/diagnosis; Drug Resistant Epilepsy/drug therapy; Feasibility Studies; Female; Humans; Insurance Claim Reporting/statistics & numerical data; Male; Middle Aged; ROC Curve; Regression Analysis
ABSTRACT
OBJECTIVE: A significant challenge in treating rare forms of cancer such as glioblastoma (GBM) is to find optimal personalized treatment plans for patients. The goals of our study are to predict which patients survive longer than the median survival time for GBM based on clinical and genomic factors, and to assess the predictive power of treatment patterns. METHOD: We developed a predictive model based on the clinical and genomic data from approximately 300 newly diagnosed GBM patients over a period of 2 years. We proposed sequential mining algorithms with novel clinical constraints, namely 'exact-order' and 'temporal overlap' constraints, to extract treatment patterns as features used in predictive modeling. With diverse features from clinical information, genomic information, and treatment patterns, we applied both logistic regression and Cox regression to model patient survival outcome. RESULTS: The most predictive features influencing the survival period of GBM patients included mRNA expression levels of certain genes, clinical characteristics such as age and Karnofsky performance score, and therapeutic agents prescribed in treatment patterns. Our models achieved a c-statistic of 0.85 for logistic regression and 0.84 for Cox regression. CONCLUSIONS: We demonstrated the importance of diverse sources of features in predicting GBM patient survival outcome. The predictive model presented in this study is a preliminary step in a long-term plan of developing personalized treatment plans for GBM patients that can later be extended to other types of cancers.
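For the survival modeling step, a minimal Cox regression sketch with the lifelines package is shown below on synthetic stand-in data (the study's clinical, genomic, and treatment-pattern features are not reproduced).

import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "age": rng.normal(58, 10, n),
    "karnofsky_score": rng.integers(40, 100, n),
    "gene_expression": rng.normal(0, 1, n),   # stand-in for an mRNA feature
    "survival_months": rng.exponential(14, n),
    "event_observed": rng.integers(0, 2, n),  # 1 = death observed, 0 = censored
})

cph = CoxPHFitter()
cph.fit(df, duration_col="survival_months", event_col="event_observed")
print(f"c-statistic (concordance): {cph.concordance_index_:.2f}")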
Subject(s)
Brain Neoplasms; Data Mining; Genetic Markers; Glioblastoma; Algorithms; Humans; Models, Theoretical; Prognosis; RNA, Messenger/metabolism; Survival Rate
ABSTRACT
OBJECTIVE: Data in electronic health records (EHRs) are being increasingly leveraged for secondary uses, ranging from biomedical association studies to comparative effectiveness research. To perform studies at scale and transfer knowledge from one institution to another in a meaningful way, we need to harmonize the phenotypes in such systems. Traditionally, this has been accomplished through expert specification of phenotypes via standardized terminologies, such as billing codes. However, this approach may be biased by the experience and expectations of the experts, as well as by the vocabulary used to describe such patients. The goal of this work is to develop a data-driven strategy to (1) infer phenotypic topics within patient populations and (2) assess the degree to which such topics facilitate a mapping across populations in disparate healthcare systems. METHODS: We adapt a generative topic modeling strategy, based on latent Dirichlet allocation, to infer phenotypic topics. We utilize a variance analysis to assess the projection of a patient population from one healthcare system onto the topics learned from another system. The consistency of learned phenotypic topics was evaluated using (1) the similarity of topics, (2) the stability of a patient population across topics, and (3) the transferability of a topic across sites. We evaluated our approaches using four months of inpatient data from two geographically distinct healthcare systems: (1) Northwestern Memorial Hospital (NMH) and (2) Vanderbilt University Medical Center (VUMC). RESULTS: The method learned 25 phenotypic topics from each healthcare system. The average cosine similarity between matched topics across the two sites was 0.39, a remarkably high value given the very high dimensionality of the feature space. The average stability of VUMC and NMH patients across the topics of the two sites was 0.988 and 0.812, respectively, as measured by the Pearson correlation coefficient. Additionally, the VUMC and NMH topics showed smaller variance in characterizing the patient populations of the two sites than standard clinical terminologies (e.g., ICD-9), suggesting they may be more reliably transferred across hospital systems. CONCLUSIONS: Phenotypic topics learned from EHR data can be more stable and transferable than billing codes for characterizing the general status of a patient population. This suggests that EHR-based research may be able to leverage such phenotypic topics as variables when pooling patient populations in predictive models.
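A condensed sketch of the topic-learning and cross-site comparison steps is given below using scikit-learn's latent Dirichlet allocation on synthetic count data; the matrix sizes and the best-match heuristic are illustrative assumptions, not the study's exact procedure.

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
n_codes = 300  # shared clinical code vocabulary
site_a = rng.poisson(0.2, size=(400, n_codes))  # patients x codes, site A
site_b = rng.poisson(0.2, size=(350, n_codes))  # patients x codes, site B

lda_a = LatentDirichletAllocation(n_components=25, random_state=0).fit(site_a)
lda_b = LatentDirichletAllocation(n_components=25, random_state=0).fit(site_b)

# Normalize topic-code distributions, then match topics across sites by cosine similarity
topics_a = lda_a.components_ / lda_a.components_.sum(axis=1, keepdims=True)
topics_b = lda_b.components_ / lda_b.components_.sum(axis=1, keepdims=True)
similarity = cosine_similarity(topics_a, topics_b)
print(f"mean best-match cosine similarity: {similarity.max(axis=1).mean():.2f}")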
Subject(s)
Electronic Health Records/organization & administration; Information Storage and Retrieval/methods; Machine Learning; Medical Record Linkage/methods; Vocabulary, Controlled; Electronic Health Records/classification; Natural Language Processing; Phenotype; United States
ABSTRACT
BACKGROUND: The electronic health record (EHR) contains a tremendous amount of data that, if appropriately detected, can lead to earlier identification of disease states such as heart failure (HF). Using a novel text and data analytic tool, we explored the longitudinal EHRs of over 50,000 primary care patients to identify the documentation of the signs and symptoms of HF in the years preceding its diagnosis. METHODS AND RESULTS: The retrospective analysis consisted of 4,644 incident HF cases and 45,981 group-matched control subjects. Documentation of Framingham HF signs and symptoms within encounter notes was carried out with the use of a previously validated natural language processing procedure. A total of 892,805 affirmed criteria were documented over an average observation period of 3.4 years. Among eventual HF cases, 85% had ≥1 criterion within 1 year before their HF diagnosis, as did 55% of control subjects. Substantial variability in the prevalence of individual signs and symptoms was found in both case and control subjects. CONCLUSIONS: HF signs and symptoms are frequently documented in a primary care population, as identified through automated text and data mining of EHRs. Their frequent identification demonstrates the rich data available within EHRs that will allow future work on automated criterion identification to help develop predictive models for HF.
Subject(s)
Data Mining/statistics & numerical data; Electronic Health Records/statistics & numerical data; Heart Failure/diagnosis; Heart Failure/epidemiology; Population Surveillance; Primary Health Care; Aged; Aged, 80 and over; Case-Control Studies; Cohort Studies; Data Mining/methods; Female; Humans; Male; Middle Aged; Population Surveillance/methods; Prevalence; Primary Health Care/methods; Retrospective Studies
ABSTRACT
The dissemination of Electronic Health Records (EHRs) can be highly beneficial for a range of medical studies, spanning from clinical trials to epidemic control studies, but it must be performed in a way that preserves patients' privacy. This is not straightforward, because the disseminated data need to be protected against several privacy threats while remaining useful for subsequent analysis tasks. In this work, we present a survey of algorithms that have been proposed for publishing structured patient data in a privacy-preserving way. We review more than 45 algorithms, derive insights into their operation, and highlight their advantages and disadvantages. We also provide a discussion of promising directions for future research in this area.
Subject(s)
Algorithms; Electronic Health Records; Privacy; Truth Disclosure
ABSTRACT
OBJECTIVE: Healthcare analytics research increasingly involves the construction of predictive models for disease targets across varying patient cohorts using electronic health records (EHRs). To facilitate this process, it is critical to support a pipeline of tasks: (1) cohort construction, (2) feature construction, (3) cross-validation, (4) feature selection, and (5) classification. To develop an appropriate model, it is necessary to compare and refine models derived from a diversity of cohorts, patient-specific features, and statistical frameworks. The goal of this work is to develop and evaluate a predictive modeling platform that can be used to simplify and expedite this process for health data. METHODS: To support this goal, we developed a PARAllel predictive MOdeling (PARAMO) platform which (1) constructs a dependency graph of tasks from specifications of predictive modeling pipelines, (2) schedules the tasks in a topological ordering of the graph, and (3) executes those tasks in parallel. We implemented this platform using Map-Reduce to enable independent tasks to run in parallel in a cluster computing environment. Different task scheduling preferences are also supported. RESULTS: We assess the performance of PARAMO on various workloads using three datasets derived from the EHR systems in place at Geisinger Health System and Vanderbilt University Medical Center and an anonymous longitudinal claims database. We demonstrate significant gains in computational efficiency against a standard approach. In particular, PARAMO can build 800 different models on a 300,000-patient dataset in 3 hours in parallel, compared to 9 days if run sequentially. CONCLUSION: This work demonstrates that an efficient parallel predictive modeling platform can be developed for EHR data. This platform can facilitate large-scale modeling endeavors and speed up the research workflow and reuse of health information. This platform is only a first step and provides the foundation for our ultimate goal of building analytic pipelines that are specialized for health data researchers.
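The scheduling idea, building a task dependency graph, ordering it topologically, and running each ready wave of tasks in parallel, can be sketched in a few lines of Python; a thread pool stands in here for the Map-Reduce cluster backend the platform actually uses.

from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter

def run(task_name):
    print(f"running {task_name}")

# task -> set of tasks it depends on (one pipeline; PARAMO manages many at once)
pipeline = {
    "cohort_construction": set(),
    "feature_construction": {"cohort_construction"},
    "cross_validation": {"feature_construction"},
    "feature_selection": {"cross_validation"},
    "classification": {"feature_selection"},
}

sorter = TopologicalSorter(pipeline)
sorter.prepare()
with ThreadPoolExecutor(max_workers=4) as pool:
    while sorter.is_active():
        ready = list(sorter.get_ready())  # tasks whose dependencies are satisfied
        list(pool.map(run, ready))        # execute this wave in parallel
        for name in ready:
            sorter.done(name)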
Subject(s)
Electronic Health Records; Medical Informatics/methods; Algorithms; Area Under Curve; Computer Systems; Decision Support Systems, Clinical; Health Services Research; Humans; Models, Theoretical; Reproducibility of Results; Software; Tennessee; Time Factors
ABSTRACT
The rapidly increasing availability of electronic health records (EHRs) from multiple heterogeneous sources has spearheaded the adoption of data-driven approaches for improved clinical research, decision making, prognosis, and patient management. Unfortunately, EHR data do not always directly and reliably map to the medical concepts that clinical researchers need or use. Some recent studies have focused on EHR-derived phenotyping, which aims at mapping the EHR data to specific medical concepts; however, most of these approaches require labor-intensive supervision from experienced clinical professionals. Furthermore, existing approaches are often disease-centric and specialized to the idiosyncrasies of the information technology and/or business practices of a single healthcare organization. In this paper, we propose Limestone, a nonnegative tensor factorization method to derive phenotype candidates with virtually no human supervision. Limestone represents the data source interactions naturally using tensors (a generalization of matrices). In particular, we investigate the interaction of diagnoses and medications among patients. The resulting tensor factors are reported as phenotype candidates that automatically reveal patient clusters on specific diagnoses and medications. Using the proposed method, multiple phenotypes can be identified simultaneously from data. We demonstrate the capability of Limestone on a cohort of 31,815 patient records from the Geisinger Health System. The dataset spans 7 years of longitudinal patient records and was initially constructed for a heart failure onset prediction study. Our experiments demonstrate the robustness, stability, and conciseness of Limestone-derived phenotypes. Our results show that, using only 40 phenotypes, we can outperform the original 640 features (169 diagnosis categories and 471 medication types) and achieve an area under the receiver operating characteristic curve (AUC) of 0.720 (95% CI 0.715 to 0.725). Moreover, in consultation with a medical expert, we confirmed that 82% of the top 50 candidates automatically extracted by Limestone are clinically meaningful.
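The core factorization step can be sketched with TensorLy's non-negative CP (PARAFAC) decomposition of a patient x diagnosis x medication count tensor; the synthetic tensor and rank below are illustrative stand-ins, not the study's data or settings.

import numpy as np
import tensorly as tl
from tensorly.decomposition import non_negative_parafac

rng = np.random.default_rng(0)
# Synthetic patient x diagnosis x medication co-occurrence counts
tensor = tl.tensor(rng.poisson(0.1, size=(200, 40, 30)).astype(float))

weights, factors = non_negative_parafac(tensor, rank=10, n_iter_max=200)
patient_factor, diagnosis_factor, medication_factor = factors

# Each rank-1 component is a phenotype candidate: a diagnosis loading vector,
# a medication loading vector, and per-patient membership scores.
top_diagnoses = np.argsort(diagnosis_factor[:, 0])[::-1][:5]
print("top diagnosis indices for phenotype 0:", top_diagnoses)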
Subject(s)
Data Mining/methods; Electronic Health Records/classification; Algorithms; Databases, Factual/classification; Humans; Phenotype
ABSTRACT
The underrepresentation of gender, racial, and ethnic minorities in clinical trials is a problem that undermines the efficacy of treatments for minorities and prevents precise estimates of treatment effects within these subgroups. We propose FRAMM, a deep reinforcement learning framework for fair trial site selection to help address this problem. We focus on two real-world challenges: the data modalities used to guide selection are often incomplete for many potential trial sites, and the site selection needs to simultaneously optimize for both enrollment and diversity. To address the missing data challenge, FRAMM has a modality encoder with a masked cross-attention mechanism for bypassing missing data. To make efficient trade-offs, FRAMM uses deep reinforcement learning with a reward function designed to simultaneously optimize for both enrollment and fairness. We evaluate FRAMM using real-world historical clinical trials and show that it outperforms the leading baseline in enrollment-only settings while also greatly improving diversity.
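A toy sketch of a reward that trades off expected enrollment against the diversity of the enrolled population (measured here by the entropy of the demographic mix) is shown below; the weighting scheme and inputs are hypothetical simplifications of the learned reward FRAMM actually optimizes.

import numpy as np

def site_selection_reward(site_enrollments, group_counts, fairness_weight=0.5):
    """site_enrollments: expected enrollment per selected site;
    group_counts: expected enrollees per demographic group across those sites."""
    total_enrollment = float(np.sum(site_enrollments))
    p = np.asarray(group_counts, dtype=float)
    p = p / p.sum()
    entropy = -np.sum(p * np.log(p + 1e-12))  # higher entropy = more diverse mix
    normalized_diversity = entropy / np.log(len(p))
    return ((1 - fairness_weight) + fairness_weight * normalized_diversity) * total_enrollment

# Equal total enrollment, different diversity: the balanced mix scores higher
print(site_selection_reward([120, 80], [100, 60, 40]))
print(site_selection_reward([120, 80], [190, 8, 2]))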