ABSTRACT
INTRODUCTION: Existing data is often used for reproductive research and quality improvement. Electronic health records (EHRs) with a single data field for sex and gender conflate sex assigned at birth, genotype, gender identity, and the presence of anatomic tissue and organs. This is problematic for inclusion of transgender and gender-diverse populations in research. This article discusses considerations with a single-item sex and gender variable drawn from EHR records and describes an audit to determine variable validity as a criterion for inclusion or exclusion in perinatal research. METHODS: Individuals with a live birth at a large academic medical center from 2010 to 2022 were identified via electronic query, and records with male demographic information were reviewed to validate (1) the patient's date of birth and delivery date in the EHR matched the medical record number, (2) male sex and gender demographic information, and (3) male gender terms in EHR notes. RESULTS: All health records of male birthing individuals (n = 8) had EHR evidence of giving birth within the health system during the timeframe, and the date of birth matched the medical record number of the EHR. All had male gender in the EHR demographic information. Six patients did not have any male gender terms in available EHR notes, only female gender terms. Two records had recent notes using male gender terms. DISCUSSION: Current EHRs may not have reliable data on the gender and sex of gender-diverse individuals. A single sex and gender variable drawn from EHRs should not be used as inclusion or exclusion criteria for health research or quality improvement without additional record review. EHRs can be updated to collect more data on sex, gender identity, and other relevant variables to improve research and quality improvement.
ABSTRACT
Current studies regarding the secondary use of electronic health records (EHR) predominantly rely on domain expertise and existing medical knowledge. Though significant efforts have been devoted to investigating the application of machine learning algorithms in the EHR, efficient and powerful representation of patients is needed to unleash the potential of discovering new medical patterns underlying the EHR. Here, we present an unsupervised method for embedding high-dimensional EHR data at the patient level, aimed at characterizing patient heterogeneity in complex diseases and identifying new disease patterns associated with clinical outcome disparities. Inspired by the architecture of modern language models, specifically transformers with attention mechanisms, we use patient diagnosis and procedure codes as vocabularies and treat each patient as a sentence to perform the patient embedding. We applied this approach to 34,851 unique medical codes across 1,046,649 longitudinal patient events, including 102,739 patients from the electronic Medical Records and GEnomics (eMERGE) Network. The resulting patient vectors demonstrated excellent performance in predicting future disease events (median AUROC = 0.87 within one year) and bulk phenotyping (median AUROC = 0.84). We then illustrated the utility of these patient vectors in revealing heterogeneous comorbidity patterns, exemplified by disease subtypes in colorectal cancer and systemic lupus erythematosus, and capturing distinct longitudinal disease trajectories. External validation using EHR data from the University of Washington confirmed robust model performance, with median AUROCs of 0.83 and 0.84 for bulk phenotyping tasks and disease onset prediction, respectively. Importantly, the model reproduced the clustering results of disease subtypes identified in the eMERGE cohort and uncovered variations in overall mortality among these subtypes.
Together, these results underscore the potential of representation learning in EHRs to enhance patient characterization and associated clinical outcomes, thereby advancing disease forecasting and facilitating personalized medicine.
ABSTRACT
Background: Large population-based DNA biobanks linked to electronic health records (EHRs) may provide novel opportunities to identify genetic drivers of acute respiratory distress syndrome (ARDS). Research Question: Can we develop an EHR-based algorithm to identify ARDS in a biobank database, and can this validate a previously reported ARDS genetic risk factor? Study Design and Methods: We analyzed two parallel genotyped cohorts: a prospective biomarker cohort of critically ill adults (VALID), and a retrospective cohort of hospitalized participants enrolled in a de-identified EHR biobank (BioVU). ARDS was identified by clinician-investigator review in VALID and by an EHR algorithm in BioVU (EHR-ARDS). We tested the association of the MUC5B promoter polymorphism rs35705950 with development of ARDS and assessed whether age modified this genetic association in each cohort. Results: In VALID, 2,795 patients were included, age was 55 [43, 66] (median [IQR]) years, and 718 (25.7%) developed ARDS. In BioVU, 9,025 hospitalized participants were included, age was 60 [48, 70] years, and 1,056 (11.7%) developed EHR-ARDS. We observed a significant age-related interaction effect on ARDS in VALID: among older patients, rs35705950 was associated with increased ARDS risk (OR: 1.44; 95% CI: 1.08-1.92; p=0.012), whereas among younger patients this effect was absent (OR: 0.84; 95% CI: 0.62-1.14; p=0.26). In BioVU, rs35705950 was associated with increased risk for EHR-ARDS among all participants (OR: 1.20; 95% CI: 1.00-1.43; p=0.043), and this did not vary by age. The polymorphism was also associated with worse oxygenation in mechanically ventilated BioVU participants, but had no association with oxygenation in VALID. Interpretation: The MUC5B promoter polymorphism was associated with ARDS in two cohorts of at-risk adults. Although age-related effect modification was observed only in VALID, BioVU identified a consistent association between MUC5B and ARDS risk regardless of age, and a novel association with oxygenation impairment.
Our study highlights the potential for EHR biobanks to enable precision-medicine ARDS studies.
ABSTRACT
Hypertriglyceridemia (HTG) is a common cardiovascular risk factor characterized by elevated triglyceride (TG) levels. Researchers have assessed the genetic factors that influence HTG in studies focused predominantly on individuals of European ancestry. However, relatively little is known about the contribution of genetic variation to HTG in people of African ancestry (AA), potentially constraining research and treatment opportunities. Our objective was to characterize genetic profiles among individuals of AA with mild-to-moderate HTG and severe HTG versus those with normal TGs by leveraging whole-genome sequencing data and longitudinal electronic health records available in the All of Us program. We compared the enrichment of functional variants within five canonical TG metabolism genes, an AA-specific polygenic risk score for TGs, and frequencies of 145 known potentially causal TG variants between patients with HTG and those with normal TG among a cohort of AA patients (N = 15,373). Those with mild-to-moderate HTG (N = 342) and severe HTG (N ≤ 20) were more likely to carry APOA5 p.S19W (odds ratio [OR] = 1.94, 95% confidence interval [CI]: [1.48-2.54], P = 1.63 × 10⁻⁶ and OR = 3.65, 95% CI: [1.22-10.93], P = 0.02, respectively) than those with normal TG. They were also more likely to have an elevated (top 10%) polygenic risk score, to have elevated carriage of potentially causal variant alleles, and to carry any genetic risk factor. Alternative definitions of HTG yielded comparable results. In conclusion, individuals of AA with HTG were enriched for genetic risk factors compared to individuals with normal TGs.
Subject(s)
Hypertriglyceridemia, Triglycerides, Adult, Female, Humans, Male, Middle Aged, Apolipoprotein A-V/genetics, Black or African American/genetics, Black People/genetics, Hypertriglyceridemia/ethnology, Hypertriglyceridemia/genetics, Triglycerides/blood, United States/epidemiology
ABSTRACT
OBJECTIVE: Observational studies examining outcomes among opioid-exposed infants are limited by phenotype algorithms that may underidentify opioid-exposed infants without neonatal opioid withdrawal syndrome (NOWS). We developed phenotype algorithms to identify opioid-exposed infants using electronic health record data and validated their performance. METHODS: We developed phenotype algorithms for the identification of opioid-exposed infants among a population of birthing person-infant dyads from an academic health care system (2010-2022). We derived phenotype algorithms from combinations of 6 unique indicators of in utero opioid exposure, including those from the infant record (NOWS or opioid-exposure diagnosis, positive toxicology) and birthing person record (opioid use disorder diagnosis, opioid drug exposure record, opioid listed on medication reconciliation, positive toxicology). We determined the positive predictive value (PPV) and 95% confidence interval for each phenotype algorithm using medical record review as the gold standard. RESULTS: Among 41 047 dyads meeting exclusion criteria, we identified 1558 infants (3.80%) with evidence of at least 1 indicator for opioid exposure and 32 (0.08%) meeting all 6 indicators of the phenotype algorithm. Among the sample of dyads randomly selected for review (n = 600), the PPV for the phenotype requiring only a single indicator was 95.4% (confidence interval: 93.3-96.8), with varying PPVs for the other phenotype algorithms derived from a combination of infant and birthing person indicators (PPV range: 95.4-100.0). CONCLUSIONS: Opioid-exposed infants can be accurately identified using electronic health record data. Our publicly available phenotype algorithms can be used to conduct research examining outcomes among opioid-exposed infants with and without NOWS.
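The PPV and confidence-interval calculation described in the abstract above can be illustrated with a minimal sketch. It assumes a Wilson score interval, which the abstract does not specify, and uses hypothetical review counts (95 confirmed out of 100 flagged) rather than the study's data. The function name `ppv_with_wilson_ci` is illustrative only.

```python
import math


def ppv_with_wilson_ci(true_positives, flagged, z=1.96):
    """Return (PPV, lower, upper) using a Wilson score interval.

    true_positives: flagged records confirmed by chart review.
    flagged: all records flagged by the phenotype algorithm and reviewed.
    z: normal quantile (1.96 for an approximate 95% interval).
    The Wilson interval is an assumption; the source does not name its method.
    """
    if flagged <= 0:
        raise ValueError("flagged must be positive")
    p = true_positives / flagged
    denom = 1 + z * z / flagged
    center = (p + z * z / (2 * flagged)) / denom
    half_width = z * math.sqrt(
        p * (1 - p) / flagged + z * z / (4 * flagged ** 2)
    ) / denom
    return p, center - half_width, center + half_width


# Hypothetical counts, not taken from the study.
ppv, low, high = ppv_with_wilson_ci(95, 100)
print(f"PPV = {ppv:.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```

The Wilson interval is used here because it stays within [0, 1] when the PPV is close to 100%, which is the range the abstract reports.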
Subject(s)
Algorithms, Electronic Health Records, Neonatal Abstinence Syndrome, Phenotype, Humans, Newborn Infant, Female, Pregnancy, Neonatal Abstinence Syndrome/diagnosis, Opioid Analgesics/adverse effects, Opioid-Related Disorders/diagnosis, Male
ABSTRACT
Background: Endometriosis affects 10% of reproductive-age women, and yet it goes undiagnosed for 3.6 years on average after symptom onset. Despite large GWAS meta-analyses (N > 750,000), only a few dozen causal loci have been identified. We hypothesized that the challenges in identifying causal genes for endometriosis stem from heterogeneity across clinical and biological factors underlying endometriosis diagnosis. Methods: We extracted known endometriosis risk factors, symptoms, and concomitant conditions from the Penn Medicine Biobank (PMBB) and performed unsupervised spectral clustering on 4,078 women with endometriosis. The 5 clusters were characterized using additional electronic health record (EHR) variables, such as endometriosis-related comorbidities and confirmed surgical phenotypes. From four EHR-linked genetic datasets (PMBB, eMERGE, AOU, and UKBB), we extracted lead variants and tag variants for 39 known endometriosis loci for association testing. We meta-analyzed ancestry-stratified case/control tests for each locus and cluster in addition to a positive control (total N endometriosis cases = 10,108). Results: We have designated the five subtype clusters as pain comorbidities, uterine disorders, pregnancy complications, cardiometabolic comorbidities, and EHR-asymptomatic based on enriched features from each group. One locus, RNLS, surpassed the genome-wide significance threshold in the positive control. Thirteen more loci reached a Bonferroni threshold of 1.3 × 10⁻³ (0.05 / 39) in the positive control. The cluster-stratified tests yielded more significant associations than the positive control for anywhere from 5 to 15 loci depending on the cluster. Bonferroni-significant loci were identified for four out of five clusters, including WNT4 and GREB1 for the uterine disorders cluster, RNLS for the cardiometabolic cluster, FSHB for the pregnancy complications cluster, and SYNE1 and CDKN2B-AS1 for the EHR-asymptomatic cluster.
This study enhances our understanding of the clinical presentation patterns of endometriosis subtypes, showcasing the innovative approach employed to investigate this complex disease.
ABSTRACT
Background: Statins reduce low-density lipoprotein cholesterol (LDL-C) and are efficacious in the prevention of atherosclerotic cardiovascular disease (ASCVD). Dose-response to statins varies among patients and can be modeled using three distinct pharmacological properties: (1) E0 (baseline LDL-C); (2) ED50 (potency: median dose achieving 50% reduction in LDL-C); and (3) Emax (efficacy: maximum LDL-C reduction). However, individualized dose-response and its association with ASCVD events remain unknown. Objective: We analyze the relationship of ED50 and Emax with real-world cardiovascular disease outcomes. Method: We leveraged de-identified electronic health record data to identify individuals exposed to multiple doses of the three most commonly prescribed statins (atorvastatin, simvastatin, or rosuvastatin) within the context of their longitudinal healthcare. We derived ED50 and Emax and quantified their relationship with a composite outcome of ASCVD events and all-cause mortality. Results: We estimated ED50 and Emax for 3,033 unique individuals (atorvastatin: 1,632, simvastatin: 1,089, and rosuvastatin: 312) using a nonlinear, mixed effects dose-response model. Time-to-event analyses revealed that ED50 and Emax are independently associated with the primary endpoint. Hazard ratios were 0.85 (p < 0.01), 0.83 (p < 0.01), and 0.87 (p = 0.10) for ED50 and 1.13 (p < 0.001), 1.06 (p < 0.001), and 1.15 (p = 0.009) for Emax in the atorvastatin, simvastatin, and rosuvastatin cohorts, respectively. Conclusion: The class-wide association of ED50 and Emax with clinical outcomes indicates that these measures influence the risk for ASCVD events in patients on statins.
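As a rough illustration of how the three parameters named in the abstract above relate, the sketch below uses the standard hyperbolic Emax curve. This is an assumption: the abstract says only that a nonlinear mixed-effects dose-response model was used and does not give its functional form. The sketch also treats Emax as an absolute LDL-C reduction in the same units as E0, and the parameter values are hypothetical.

```python
def predicted_ldl(dose, e0, ed50, emax):
    """Predicted LDL-C at a given dose under a hyperbolic Emax curve.

    e0:   baseline LDL-C (value at dose 0).
    ed50: dose at which half of the maximum reduction is reached.
    emax: maximum achievable LDL-C reduction (assumed absolute, same units as e0).
    """
    if dose < 0 or ed50 <= 0:
        raise ValueError("dose must be >= 0 and ed50 must be > 0")
    return e0 - emax * dose / (ed50 + dose)


# Hypothetical patient: baseline 160 mg/dL, ED50 of 10 mg, maximum reduction 80 mg/dL.
for dose in (0, 10, 20, 40, 80):
    print(dose, round(predicted_ldl(dose, e0=160.0, ed50=10.0, emax=80.0), 1))
```

Under this form, a smaller ED50 means the curve approaches its maximum reduction at lower doses (higher potency), while a larger Emax means a deeper achievable reduction (higher efficacy).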
ABSTRACT
OBJECTIVES: Phenotyping is a core task in observational health research utilizing electronic health records (EHRs). Developing an accurate algorithm demands substantial input from domain experts, involving extensive literature review and evidence synthesis. This burdensome process limits scalability and delays knowledge discovery. We investigate the potential for leveraging large language models (LLMs) to enhance the efficiency of EHR phenotyping by generating high-quality algorithm drafts. MATERIALS AND METHODS: We prompted four LLMs (GPT-4 and GPT-3.5 of ChatGPT, Claude 2, and Bard) in October 2023, asking them to generate executable phenotyping algorithms in the form of SQL queries adhering to a common data model (CDM) for three phenotypes (ie, type 2 diabetes mellitus, dementia, and hypothyroidism). Three phenotyping experts evaluated the returned algorithms across several critical metrics. We further implemented the top-rated algorithms and compared them against clinician-validated phenotyping algorithms from the Electronic Medical Records and Genomics (eMERGE) network. RESULTS: GPT-4 and GPT-3.5 exhibited significantly higher overall expert evaluation scores in instruction following, algorithmic logic, and SQL executability, when compared to Claude 2 and Bard. Although GPT-4 and GPT-3.5 effectively identified relevant clinical concepts, they exhibited immature capability in organizing phenotyping criteria with the proper logic, leading to phenotyping algorithms that were either excessively restrictive (with low recall) or overly broad (with low positive predictive values). CONCLUSION: GPT versions 3.5 and 4 are capable of drafting phenotyping algorithms by identifying relevant clinical criteria aligned with a CDM. However, expertise in informatics and clinical experience are still required to assess and further refine generated algorithms.
Subject(s)
Algorithms, Electronic Health Records, Phenotype, Humans, Type 2 Diabetes Mellitus, Dementia, Hypothyroidism, Natural Language Processing
ABSTRACT
Apart from ancestry, personal or environmental covariates may contribute to differences in polygenic score (PGS) performance. We analyzed effects of covariate stratification and interaction on body mass index (BMI) PGS (PGSBMI) across four cohorts of European (N=491,111) and African (N=21,612) ancestry. Stratifying on binary covariates and quintiles for continuous covariates, 18/62 covariates had significant and replicable R² differences among strata. Covariates with the largest differences included age, sex, blood lipids, physical activity, and alcohol consumption, with R² being nearly double between best and worst performing quintiles for certain covariates. 28 covariates had significant PGSBMI-covariate interaction effects, modifying PGSBMI effects by nearly 20% per standard deviation change. We observed overlap between covariates that had significant R² differences among strata and interaction effects: across all covariates, their main effects on BMI were correlated with their maximum R² differences and interaction effects (0.56 and 0.58, respectively), suggesting high-PGSBMI individuals have highest R² and increase in PGS effect. Using quantile regression, we show the effect of PGSBMI increases as BMI itself increases, and that these differences in effects are directly related to differences in R² when stratifying by different covariates. Given significant and replicable evidence for context-specific PGSBMI performance and effects, we investigated ways to increase model performance taking into account non-linear effects. Machine learning models (neural networks) increased relative model R² (mean 23%) across datasets. Finally, creating PGSBMI directly from GxAge GWAS effects increased relative R² by 7.8%.
These results demonstrate that certain covariates, especially those most associated with BMI, significantly affect both PGSBMI performance and effects across diverse cohorts and ancestries, and we provide avenues to improve model performance that consider these effects.
ABSTRACT
The differential performance of polygenic risk scores (PRSs) by group is one of the major ethical barriers to their clinical use. It is also one of the main practical challenges for any implementation effort. The social repercussions of how people are grouped in PRS research must be considered in communications with research participants, including return of results. Here, we outline the decisions faced and choices made by a large multi-site clinical implementation study returning PRSs to diverse participants in handling this issue of differential performance. Our approach to managing the complexities associated with the differential performance of PRSs serves as a case study that can help future implementers of PRSs to plot an anticipatory course in response to this issue.
Subject(s)
Genetic Predisposition to Disease, Multifactorial Inheritance, Humans, Multifactorial Inheritance/genetics, Risk Factors, Genome-Wide Association Study, Risk Assessment, Genetic Testing/methods, Genetic Risk Score
ABSTRACT
Hypertriglyceridemia (HTG) is a common cardiovascular risk factor characterized by elevated circulating triglyceride (TG) levels. Researchers have assessed the genetic factors that influence HTG in studies focused predominantly on individuals of European ancestry (EA). However, relatively little is known about the contribution of genetic variation to HTG in people of African ancestry (AA), potentially constraining research and treatment opportunities; the lipid profile for AA populations differs from that of EA populations, which may be partially attributable to genetics. Our objective was to characterize genetic profiles among individuals of AA with mild-to-moderate HTG and severe HTG versus those with normal TGs by leveraging whole genome sequencing (WGS) data and longitudinal electronic health records (EHRs) available in the All of Us (AoU) program. We compared the enrichment of functional variants within five canonical TG metabolism genes, an AA-specific polygenic risk score (PRS) for TGs, and frequencies of 145 known potentially causal TG variants between patients with HTG and those with normal TG among a cohort of AA patients (N=15,373). Those with mild-to-moderate HTG (N=342) and severe HTG (N≤20) were more likely to carry APOA5 p.S19W (OR=1.94, 95% CI [1.48-2.54], p=1.63×10⁻⁶ and OR=3.65, 95% CI [1.22-10.93], p=0.02, respectively) than those with normal TG. They were also more likely to have an elevated (top 10%) PRS, to have elevated carriage of potentially causal variant alleles, and to carry any genetic risk factor. Alternative definitions of HTG yielded comparable results. In conclusion, individuals of AA with HTG were enriched for genetic risk factors compared to individuals with normal TGs.
ABSTRACT
Polygenic variation unrelated to disease contributes to interindividual variation in baseline white blood cell (WBC) counts, but its clinical significance is uncharacterized. We investigated the clinical consequences of a genetic predisposition toward lower WBC counts among 89,559 biobank participants from tertiary care centers using a polygenic score for WBC count (PGSWBC) comprising single nucleotide polymorphisms not associated with disease. A predisposition to lower WBC counts was associated with a decreased risk of identifying pathology on a bone marrow biopsy performed for a low WBC count (odds ratio = 0.55 per standard deviation increase in PGSWBC [95% CI, 0.30-0.94], p = 0.04) and an increased risk of leukopenia (a low WBC count) when treated with a chemotherapeutic (n = 1724, hazard ratio [HR] = 0.78 [0.69-0.88], p = 4.0 × 10⁻⁵) or immunosuppressant (n = 354, HR = 0.61 [0.38-0.99], p = 0.04). A predisposition to benign lower WBC counts was associated with an increased risk of discontinuing azathioprine treatment (n = 1,466, HR = 0.62 [0.44-0.87], p = 0.006). Collectively, these findings suggest that there are genetically predisposed individuals who are susceptible to escalations or alterations in clinical care that may be harmful or of little benefit.
Subject(s)
Genetic Predisposition to Disease, Leukopenia, Multifactorial Inheritance, Single Nucleotide Polymorphism, Humans, Leukocyte Count, Male, Female, Leukopenia/genetics, Leukopenia/blood, Middle Aged, Aged, Adult, Immunosuppressive Agents/therapeutic use
ABSTRACT
BACKGROUND: Systemic lupus erythematosus (SLE) is a rare autoimmune disorder characterized by an unpredictable course of flares and remission with diverse manifestations. Lupus nephritis, one of the major manifestations of SLE contributing to organ damage and mortality, is a key component of lupus classification criteria. Accurately identifying lupus nephritis in electronic health records (EHRs) would therefore benefit large cohort observational studies and clinical trials where characterization of the patient population is critical for recruitment, study design, and analysis. Lupus nephritis can be recognized through procedure codes and structured data, such as laboratory tests. However, other critical information documenting lupus nephritis, such as histologic reports from kidney biopsies and prior medical history narratives, requires sophisticated text processing to mine information from pathology reports and clinical notes. In this study, we developed algorithms to identify lupus nephritis with and without natural language processing (NLP) using EHR data from the Northwestern Medicine Enterprise Data Warehouse (NMEDW). METHODS: We developed five algorithms: a rule-based algorithm using only structured data (baseline algorithm) and four algorithms using different NLP models. The first NLP model applied a simple regular-expression keyword search combined with structured data. The other three NLP models were based on regularized logistic regression and used different sets of features: positive mention of concept unique identifiers (CUIs), number of appearances of CUIs, and a mixture of three components (i.e., a curated list of CUIs, regular expression concepts, and structured data), respectively. The baseline algorithm and the best performing NLP algorithm were externally validated on a dataset from Vanderbilt University Medical Center (VUMC).
RESULTS: Our best performing NLP model incorporated features from structured data, regular expression concepts, and mapped CUIs and showed improved F-measure in both the NMEDW (0.41 vs 0.79) and VUMC (0.52 vs 0.93) datasets compared to the baseline lupus nephritis algorithm. CONCLUSION: Our NLP MetaMap mixed model improved the F-measure greatly compared to the structured-data-only algorithm in both internal and external validation datasets. The NLP algorithms can serve as powerful tools to accurately identify the lupus nephritis phenotype in EHRs for clinical research and better targeted therapies.
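The simplest NLP variant described in the abstract above, a keyword regular expression combined with structured data, and the F-measure used to compare algorithms can be sketched as follows. Everything here is illustrative: the pattern, the function names, and the rule of combining the two signals with a logical OR are assumptions, since the abstract does not give the actual keywords or combination logic. The sketch also ignores negation ("no evidence of lupus nephritis"), which a real pipeline would need to handle.

```python
import re

# Hypothetical keyword pattern; the study's actual expressions are not given.
NEPHRITIS_PATTERN = re.compile(r"\blupus\s+nephritis\b", re.IGNORECASE)


def mentions_lupus_nephritis(note_text):
    """True if the note contains the keyword phrase (no negation handling)."""
    return NEPHRITIS_PATTERN.search(note_text) is not None


def flag_patient(note_text, has_structured_evidence):
    """Flag a patient if either structured data or the note text is positive.

    Combining with OR is an assumption made for this sketch.
    """
    return has_structured_evidence or mentions_lupus_nephritis(note_text)


def f_measure(true_pos, false_pos, false_neg):
    """Harmonic mean of precision and recall (F1); 0.0 when undefined."""
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical usage with made-up notes and counts.
print(flag_patient("Biopsy consistent with Lupus nephritis class IV.", False))
print(flag_patient("No renal involvement documented.", False))
print(round(f_measure(true_pos=8, false_pos=2, false_neg=2), 2))
```

Because notes often carry evidence that is never coded, adding even a simple text signal can raise recall over a structured-data-only rule, which is consistent with the direction of the improvement the abstract reports.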
Subject(s)
Systemic Lupus Erythematosus, Lupus Nephritis, Humans, Lupus Nephritis/diagnosis, Electronic Health Records, Natural Language Processing, Phenotype, Rare Diseases
ABSTRACT
Drug repurposing represents an attractive alternative to the costly and time-consuming process of new drug development, particularly for serious, widespread conditions with limited effective treatments, such as Alzheimer's disease (AD). Emerging generative artificial intelligence (GAI) technologies like ChatGPT offer the promise of expediting the review and summary of scientific knowledge. To examine the feasibility of using GAI for identifying drug repurposing candidates, we iteratively tasked ChatGPT with proposing the twenty most promising drugs for repurposing in AD, and tested the top ten for risk of incident AD in exposed and unexposed individuals over age 65 in two large clinical datasets: (1) Vanderbilt University Medical Center and (2) the All of Us Research Program. Among the candidates suggested by ChatGPT, metformin, simvastatin, and losartan were associated with lower AD risk in meta-analysis. These findings suggest GAI technologies can assimilate scientific insights from an extensive Internet-based search space, helping to prioritize drug repurposing candidates and facilitate the treatment of diseases.
ABSTRACT
Genome-wide association studies (GWAS) have been instrumental in identifying genetic associations for various diseases and traits. However, uncovering genetic underpinnings among traits beyond univariate phenotype associations remains a challenge. Multi-phenotype associations (MPA), or genetic pleiotropy, offer important insights into shared genes and pathways among traits, enhancing our understanding of genetic architectures of complex diseases. GWAS of biobank-linked electronic health record (EHR) data are increasingly being utilized to identify MPA among various traits and diseases. However, methodologies that can efficiently take advantage of distributed EHRs to detect MPA are still lacking. Here, we introduce mixWAS, a novel algorithm that efficiently and losslessly integrates multiple EHRs via summary statistics, allowing the detection of MPA among mixed phenotypes while accounting for heterogeneities across EHRs. Simulations demonstrate that mixWAS outperforms the widely used MPA detection method, phenome-wide association study (PheWAS), across diverse scenarios. Applying mixWAS to data from seven EHRs in the US, we identified 4,534 MPA among blood lipids, BMI, and circulatory diseases. Validation in independent EHR data from the UK confirmed 97.7% of the associations. mixWAS fundamentally improves the detection of MPA and is available as free, open-source software.
ABSTRACT
INTRODUCTION: Phenotyping algorithms enable the interpretation of complex health data and definition of clinically relevant phenotypes; they have become crucial in biomedical research. However, the lack of standardization and transparency inhibits the cross-comparison of findings among different studies, limits large scale meta-analyses, confuses the research community, and prevents the reuse of algorithms, which results in duplication of efforts and the waste of valuable resources. RECOMMENDATIONS: Here, we propose five independent fundamental dimensions of phenotyping algorithms (complexity, performance, efficiency, implementability, and maintenance) through which researchers can describe, measure, and deploy any algorithm efficiently and effectively. These dimensions must be considered in the context of explicit use cases and transparent methods to ensure that they do not reflect unexpected biases or exacerbate inequities.
Subject(s)
Biomedical Research, Electronic Health Records, Algorithms, Phenotype, Reference Standards
ABSTRACT
Objectives: Phenotyping is a core task in observational health research utilizing electronic health records (EHRs). Developing an accurate algorithm demands substantial input from domain experts, involving extensive literature review and evidence synthesis. This burdensome process limits scalability and delays knowledge discovery. We investigate the potential for leveraging large language models (LLMs) to enhance the efficiency of EHR phenotyping by generating high-quality algorithm drafts. Materials and Methods: We prompted four LLMs (GPT-4 and GPT-3.5 of ChatGPT, Claude 2, and Bard) in October 2023, asking them to generate executable phenotyping algorithms in the form of SQL queries adhering to a common data model (CDM) for three phenotypes (i.e., type 2 diabetes mellitus, dementia, and hypothyroidism). Three phenotyping experts evaluated the returned algorithms across several critical metrics. We further implemented the top-rated algorithms and compared them against clinician-validated phenotyping algorithms from the Electronic Medical Records and Genomics (eMERGE) network. Results: GPT-4 and GPT-3.5 exhibited significantly higher overall expert evaluation scores in instruction following, algorithmic logic, and SQL executability, when compared to Claude 2 and Bard. Although GPT-4 and GPT-3.5 effectively identified relevant clinical concepts, they exhibited immature capability in organizing phenotyping criteria with the proper logic, leading to phenotyping algorithms that were either excessively restrictive (with low recall) or overly broad (with low positive predictive values). Conclusion: GPT versions 3.5 and 4 are capable of drafting phenotyping algorithms by identifying relevant clinical criteria aligned with a CDM. However, expertise in informatics and clinical experience are still required to assess and further refine generated algorithms.
ABSTRACT
OBJECTIVE: Pediatric patients have different diseases and outcomes than adults; however, existing phecodes do not capture the distinctive pediatric spectrum of disease. We aim to develop specialized pediatric phecodes (Peds-Phecodes) to enable efficient, large-scale phenotypic analyses of pediatric patients. MATERIALS AND METHODS: We adopted a hybrid data- and knowledge-driven approach leveraging electronic health records (EHRs) and genetic data from Vanderbilt University Medical Center to modify the most recent version of phecodes to better capture pediatric phenotypes. First, we compared the prevalence of patient diagnoses in pediatric and adult populations to identify disease phenotypes differentially affecting children and adults. We then used clinical domain knowledge to remove phecodes representing phenotypes unlikely to affect pediatric patients and create new phecodes for phenotypes relevant to the pediatric population. We further compared phenome-wide association study (PheWAS) outcomes replicating known pediatric genotype-phenotype associations between Peds-Phecodes and phecodes. RESULTS: The Peds-Phecodes aggregate 15 533 ICD-9-CM codes and 82 949 ICD-10-CM codes into 2051 distinct phecodes. Peds-Phecodes replicated more known pediatric genotype-phenotype associations than phecodes (248 vs 192 out of 687 SNPs, P < .001). DISCUSSION: We introduce Peds-Phecodes, a high-throughput EHR phenotyping tool tailored for use in pediatric populations. We successfully validated the Peds-Phecodes using genetic replication studies. Our findings also reveal the potential use of Peds-Phecodes in detecting novel genotype-phenotype associations for pediatric conditions. We expect that Peds-Phecodes will facilitate large-scale phenomic and genomic analyses in pediatric populations. CONCLUSION: Peds-Phecodes capture higher-quality pediatric phenotypes and deliver superior PheWAS outcomes compared to phecodes.
Subject(s)
Electronic Health Records, Genome-Wide Association Study, Child, Humans, Genetic Association Studies, Genomics, Phenotype, Polymorphism, Single Nucleotide
ABSTRACT
African Americans have a significantly higher risk of developing chronic kidney disease, especially focal segmental glomerulosclerosis (FSGS), than European Americans. Two coding variants (G1 and G2) in the APOL1 gene play a major role in this disparity. While 13% of African Americans carry the high-risk recessive genotypes, only a fraction of these individuals develops FSGS or kidney failure, indicating the involvement of additional disease modifiers. Here, we show that the presence of the APOL1 p.N264K missense variant, when co-inherited with the G2 APOL1 risk allele, substantially reduces the penetrance of the G1G2 and G2G2 high-risk genotypes by rendering these genotypes low-risk. These results align with prior functional evidence showing that the p.N264K variant reduces the toxicity of the APOL1 high-risk alleles. These findings have important implications for our understanding of the mechanisms of APOL1-associated nephropathy, as well as for the clinical management of individuals with high-risk genotypes that include the G2 allele.
Subject(s)
Glomerulosclerosis, Focal Segmental, Humans, Glomerulosclerosis, Focal Segmental/genetics, Apolipoprotein L1/genetics, Genetic Predisposition to Disease, Risk Factors, Genotype, Apolipoproteins/genetics
ABSTRACT
Background: Two risk variants in the apolipoprotein L1 gene (APOL1) have been associated with increased susceptibility to sepsis in Black patients. However, it remains unclear whether APOL1 high-risk genotypes are associated with occurrence of either sepsis or sepsis-related phenotypes in patients hospitalized with infections, independent of their association with pre-existing severe renal disease. Methods: A retrospective cohort study of 2242 Black patients hospitalized with infections. We assessed whether carriage of APOL1 high-risk genotypes was associated with the risk of sepsis and sepsis-related phenotypes in patients hospitalized with infections. The primary outcome was sepsis; secondary outcomes were short-term mortality, and organ failure related to sepsis. Results: Of 2242 Black patients hospitalized with infections, 565 developed sepsis. Patients with high-risk APOL1 genotypes had a significantly increased risk of sepsis (odds ratio [OR]=1.29 [95% CI, 1.00-1.67; p=0.047]); however, this association was not significant after adjustment for pre-existing severe renal disease (OR = 1.14 [95% CI, 0.88-1.48; p=0.33]), nor after exclusion of those patients with pre-existing severe renal disease (OR = 0.99 [95% CI, 0.70-1.39; p=0.95]). APOL1 high-risk genotypes were significantly associated with the renal dysfunction component of the Sepsis-3 criteria (OR = 1.64 [95% CI, 1.21-2.22; p=0.001]), but not with other sepsis-related organ dysfunction or short-term mortality. The association between high-risk APOL1 genotypes and sepsis-related renal dysfunction was markedly attenuated by adjusting for pre-existing severe renal disease (OR = 1.36 [95% CI, 1.00-1.86; p=0.05]) and was nullified after exclusion of patients with pre-existing severe renal disease (OR = 1.16 [95% CI, 0.74-1.81; p=0.52]). 
Conclusions: APOL1 high-risk genotypes were associated with an increased risk of sepsis; however, this increased risk was attributable predominantly to pre-existing severe renal disease. Funding: This study was supported by R01GM120523 (QF), R01HL163854 (QF), R35GM131770 (CMS), HL133786 (WQW), and Vanderbilt Faculty Research Scholar Fund (QF). The dataset(s) used for the analyses described were obtained from Vanderbilt University Medical Center's BioVU which is supported by institutional funding, the 1S10RR025141-01 instrumentation award, and by the CTSA grant UL1TR0004from NCATS/NIH. Additional funding provided by the NIH through grants P50GM115305 and U19HL065962. The authors wish to acknowledge the expert technical support of the VANTAGE and VANGARD core facilities, supported in part by the Vanderbilt-Ingram Cancer Center (P30 CA068485) and Vanderbilt Vision Center (P30 EY08126). The funders had no role in design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.
When the body is fighting off an infection, the processes it uses to protect itself can sometimes overreact. This results in a condition known as sepsis which can cause life-threatening damage to multiple organs. In the United States, Black patients are 60-80% more likely to develop sepsis compared to individuals who identify as White; differences remain even after accounting for socio-economic status and presence of other illnesses. Recent work has suggested that two variants of the APOL1 gene which are almost exclusively found in people with African ancestry may be a contributing factor to this disparity. These 'high-risk' genetic variants have also been shown to increase the likelihood of kidney diseases. It is therefore possible that the elevated chance of sepsis is not directly linked to these variations of APOL1, but rather is the result of patients already having reduced kidney function. To understand the relationship between APOL1 and sepsis, Jiang et al. analyzed data from patients admitted to Vanderbilt University Medical Center in the United States between 2000 and 2020. This included 2,242 patients who identified as Black and had been hospitalized with an infection. The analyses showed that 16% of these individuals were carriers of the APOL1 high-risk variants. The high-risk patients were more likely to experience sepsis and demonstrate kidney damage. But other organs commonly damaged by sepsis were not affected more in these individuals compared to the other 84% of patients who did not have these variants. Furthermore, when individuals with pre-existing kidney diseases were removed from this high-risk group, the increased likelihood of sepsis was no longer prominent. These findings suggest that the APOL1 variants do not directly increase the risk of sepsis, and this association is primarily due to patients with these genetic variations being more susceptible to kidney diseases. There are new drugs under development targeting the APOL1 variants. While these may provide protection against kidney diseases, they are unlikely to be successful at preventing or treating sepsis once a patient has been hospitalized with an infection.
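The associations in the abstract above are reported as odds ratios with 95% confidence intervals. As a minimal sketch of how an unadjusted odds ratio and a Wald interval are computed from a 2x2 table, consider the following. The cell counts are hypothetical and chosen only to illustrate the arithmetic; they are not the study's data, and adjusted estimates such as those reported would typically come from a regression model, which this sketch does not attempt.

```python
import math


def odds_ratio_wald_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald confidence interval for a 2x2 table.

    a -- exposed with outcome      b -- exposed without outcome
    c -- unexposed with outcome    d -- unexposed without outcome
    z -- normal quantile (1.96 gives an approximate 95% interval)
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive")
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the usual large-sample approximation.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts, chosen only to illustrate the calculation.
or_value, ci_lower, ci_upper = odds_ratio_wald_ci(a=30, b=70, c=20, d=80)
print(round(or_value, 2), round(ci_lower, 2), round(ci_upper, 2))
```

An interval whose lower bound sits at or below 1, as with several of the adjusted estimates in the abstract, is consistent with no association at the conventional 5% level.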