ABSTRACT
Diagnosis for rare genetic diseases often relies on phenotype-driven methods, which hinge on the accuracy and completeness of the rare disease phenotypes in the underlying annotation knowledgebase. Existing knowledgebases are often manually curated with additional annotations found in published case reports. Despite their potential, real-world data such as electronic health records (EHRs) have not been fully exploited to derive rare disease annotations. Here, we present open annotation for rare diseases (OARD), a real-world-data-derived resource with annotation for rare-disease-related phenotypes. This resource is derived from the EHRs of two academic health institutions containing more than 10 million individuals spanning wide age ranges and different disease subgroups. By leveraging ontology mapping and advanced natural-language-processing (NLP) methods, OARD automatically and efficiently extracts concepts for both rare diseases and their phenotypic traits from billing codes and lab tests as well as over 100 million clinical narratives. The rare disease prevalences derived by OARD are highly correlated with those annotated in the original rare disease knowledgebase. By performing association analysis, we identified more than 1 million novel disease-phenotype association pairs that were previously missed by human annotation, and >60% of a sample of these pairs were confirmed as true associations by manual review. Compared to manually curated annotation, OARD is 100% data driven and its pipeline can be shared across different institutions. By supporting privacy-preserving sharing of aggregated summary statistics, such as term frequencies and disease-phenotype associations, it fills an important gap to facilitate data-driven research in the rare disease community.
Subject(s)
Natural Language Processing; Rare Diseases; Electronic Health Records; Humans; Phenotype; Rare Diseases/genetics
ABSTRACT
BACKGROUND: Current hemovigilance methods generally rely on survey data or administrative claims data utilizing billing and revenue codes, each of which has limitations. We used electronic health records (EHR) linked to blood bank data to comprehensively characterize red blood cell (RBC) utilization patterns and trends in three healthcare systems participating in the U.S. Food and Drug Administration Center for Biologics Evaluation and Research Biologics Effectiveness and Safety (BEST) initiative. METHODS: We used Information Standard for Blood and Transplant (ISBT) 128 codes linked to EHR from three healthcare system data sources to identify and quantify RBC-transfused individuals, RBC transfusion episodes, transfused RBC units, and processing methods per year during 2012-2018. RESULTS: There were 577,822 RBC units transfused among 112,705 patients comprising 345,373 transfusion episodes between 2012 and 2018. Utilization in terms of RBC units and patients increased slightly in one and decreased slightly in the other two healthcare facilities. About 90% of RBC-transfused patients had 1 (~46%) or 2-5 (~42%) transfusion episodes in 2018. Among the small proportion of patients with ≥12 transfusion episodes per year, approximately 60% of episodes included only one RBC unit. All facilities used leukocyte-reduced RBCs during the study period whereas irradiated RBC utilization patterns differed across facilities. DISCUSSION: ISBT 128 codes and EHRs were used to observe patterns of RBC transfusion and modification methods at the unit level and patient level in three healthcare systems participating in the BEST initiative. This study shows that the ISBT 128 coding system in an EHR environment provides a feasible source for hemovigilance activities.
Subject(s)
Electronic Health Records; Erythrocyte Transfusion; Humans; Female; Male; Middle Aged; Adult; United States; Erythrocytes; Aged; Biological Products/therapeutic use; Blood Banks/standards; Blood Banks/statistics & numerical data; Adolescent
ABSTRACT
OBJECTIVE: Automated identification of eligible patients is a bottleneck of clinical research. We propose Criteria2Query (C2Q) 3.0, a system that leverages GPT-4 for the semi-automatic transformation of clinical trial eligibility criteria text into executable clinical database queries. MATERIALS AND METHODS: C2Q 3.0 integrated three GPT-4 prompts for concept extraction, SQL query generation, and reasoning. Each prompt was designed and evaluated separately. The concept extraction prompt was benchmarked against manual annotations from 20 clinical trials by two evaluators, who later also measured SQL generation accuracy and identified errors in GPT-generated SQL queries from 5 clinical trials. The reasoning prompt was assessed by three evaluators on four metrics: readability, correctness, coherence, and usefulness, using corrected SQL queries and an open-ended feedback questionnaire. RESULTS: Out of 518 concepts from 20 clinical trials, GPT-4 achieved an F1-score of 0.891 in concept extraction. For SQL generation, 29 errors spanning seven categories were detected, with logic errors being the most common (n = 10; 34.48 %). Reasoning evaluations yielded a high coherence rating, with the mean score being 4.70 but relatively lower readability, with a mean of 3.95. Mean scores of correctness and usefulness were identified as 3.97 and 4.37, respectively. CONCLUSION: GPT-4 significantly improves the accuracy of extracting clinical trial eligibility criteria concepts in C2Q 3.0. Continued research is warranted to ensure the reliability of large language models.
Subject(s)
Clinical Trials as Topic; Humans; Natural Language Processing; Software; Patient Selection
ABSTRACT
INTRODUCTION: Electronic Health Records (EHR) are a useful data source for research, but their usability is hindered by measurement errors. This study investigated an automatic error detection algorithm for adult height and weight measurements in EHR for the All of Us Research Program (All of Us). METHODS: We developed reference charts for adult heights and weights that were stratified on participant sex. Our analysis included 4,076,534 height and 5,207,328 weight measurements from ~150,000 participants. Errors were identified using modified standard deviation scores, differences from their expected values, and significant changes between consecutive measurements. We evaluated our method with chart-reviewed heights (8,092) and weights (9,039) from 250 randomly selected participants and compared it with the current cleaning algorithm in All of Us. RESULTS: The proposed algorithm classified 1.4 % of height and 1.5 % of weight measurements as errors in the full cohort. Sensitivity was 90.4 % (95 % CI: 79.0-96.8 %) for heights and 65.9 % (95 % CI: 56.9-74.1 %) for weights. Precision was 73.4 % (95 % CI: 60.9-83.7 %) for heights and 62.9 % (95 % CI: 54.0-71.1 %) for weights. In comparison, the current cleaning algorithm has inferior performance in sensitivity (55.8 %) and precision (16.5 %) for height errors while having higher precision (94.0 %) and lower sensitivity (61.9 %) for weight errors. DISCUSSION: Our proposed algorithm performed better at detecting height errors than weight errors. It can serve as a valuable addition to the current All of Us cleaning algorithm for identifying erroneous height values.
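To make the idea of "differences from their expected values" concrete, here is a minimal sketch. It is not the study's algorithm, which relies on sex-stratified reference charts and modified standard deviation scores; instead it assumes, purely for illustration, that a participant's own median is the expected value and that a relative deviation above 10 % is suspect. The function name `flag_suspect_values`, the threshold `max_rel_dev`, and the sample heights are all hypothetical.

```python
from statistics import median

def flag_suspect_values(values, max_rel_dev=0.10):
    """Flag values that deviate from the person's own median by more than max_rel_dev.

    Assumes all values are positive (e.g. heights in cm), so the median is non-zero.
    """
    expected = median(values)
    return [abs(v - expected) / expected > max_rel_dev for v in values]

# Hypothetical adult height series in cm; 17.0 looks like a misplaced decimal point.
heights_cm = [170.0, 171.0, 17.0, 170.0]
print(flag_suspect_values(heights_cm))
```

A production rule would also need to handle participants with very few measurements, where the median itself may be an error.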
Subject(s)
Algorithms; Body Height; Body Weight; Electronic Health Records; Humans; Male; Adult; Female; Middle Aged; United States; Reference Values; Aged; Young Adult
ABSTRACT
OBJECTIVE: More than one third of appropriately treated patients with epilepsy have continued seizures despite two or more medication trials, meeting criteria for drug-resistant epilepsy (DRE). Accurate and reliable identification of patients with DRE in observational data would enable large-scale, real-world comparative effectiveness research and improve access to specialized epilepsy care. In the present study, we aim to develop and compare the performance of computable phenotypes for DRE using the Observational Medical Outcomes Partnership (OMOP) Common Data Model. METHODS: We randomly sampled 600 patients from our academic medical center's electronic health record (EHR)-derived OMOP database meeting previously validated criteria for epilepsy (January 2015-August 2021). Two reviewers manually classified patients as having DRE, drug-responsive epilepsy, undefined drug responsiveness, or no epilepsy as of the last EHR encounter in the study period based on consensus definitions. Demographic characteristics and codes for diagnoses, antiseizure medications (ASMs), and procedures were tested for association with DRE. Algorithms combining permutations of these factors were applied to calculate sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) for DRE. The F1 score was used to compare overall performance. RESULTS: Among 412 patients with source record-confirmed epilepsy, 62 (15.0%) had DRE, 163 (39.6%) had drug-responsive epilepsy, 124 (30.0%) had undefined drug responsiveness, and 63 (15.3%) had insufficient records. The best performing phenotype for DRE in terms of the F1 score was the presence of ≥1 intractable epilepsy code and ≥2 unique non-gabapentinoid ASM exposures each with ≥90-day drug era (sensitivity = .661, specificity = .937, PPV = .594, NPV = .952, F1 score = .626). Several phenotypes achieved higher sensitivity at the expense of specificity and vice versa. 
SIGNIFICANCE: OMOP algorithms can identify DRE in EHR-derived data with varying tradeoffs between sensitivity and specificity. These computable phenotypes can be applied across the largest international network of standardized clinical databases for further validation, reproducible observational research, and improving access to appropriate care.
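As a quick arithmetic check on the figures above, the F1 score is the harmonic mean of PPV (precision) and sensitivity (recall). The sketch below recomputes it from the two reported values; the function name `f1_from_ppv_sensitivity` is ours and does not come from the study.

```python
def f1_from_ppv_sensitivity(ppv: float, sensitivity: float) -> float:
    """Harmonic mean of PPV (precision) and sensitivity (recall)."""
    if ppv + sensitivity == 0:
        return 0.0
    return 2 * ppv * sensitivity / (ppv + sensitivity)

# Values reported for the best-performing DRE phenotype.
best_f1 = f1_from_ppv_sensitivity(ppv=0.594, sensitivity=0.661)
print(round(best_f1, 3))
```

Rounded to three decimals, the result agrees with the reported F1 score of .626.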
Subject(s)
Drug Resistant Epilepsy; Epilepsy; Humans; Electronic Health Records; Drug Resistant Epilepsy/diagnosis; Drug Resistant Epilepsy/drug therapy; Databases, Factual; Data Collection; Algorithms; Epilepsy/diagnosis; Epilepsy/drug therapy
ABSTRACT
BACKGROUND: We investigated whether we could use influenza data to develop prediction models for COVID-19 to increase the speed at which prediction models can reliably be developed and validated early in a pandemic. We developed COVID-19 Estimated Risk (COVER) scores that quantify a patient's risk of hospital admission with pneumonia (COVER-H), hospitalization with pneumonia requiring intensive services or death (COVER-I), or fatality (COVER-F) in the 30 days following COVID-19 diagnosis using historical data from patients with influenza or flu-like symptoms and tested this in COVID-19 patients. METHODS: We analyzed a federated network of electronic medical records and administrative claims data from 14 data sources and 6 countries containing data collected on or before 4/27/2020. We used a 2-step process to develop 3 scores using historical data from patients with influenza or flu-like symptoms any time prior to 2020. The first step was to create a data-driven model using LASSO regularized logistic regression, the covariates of which were used to develop aggregate covariates for the second step where the COVER scores were developed using a smaller set of features. These 3 COVER scores were then externally validated on patients with 1) influenza or flu-like symptoms and 2) confirmed or suspected COVID-19 diagnosis across 5 databases from South Korea, Spain, and the United States. Outcomes included i) hospitalization with pneumonia, ii) hospitalization with pneumonia requiring intensive services or death, and iii) death in the 30 days after index date. RESULTS: Overall, 44,507 COVID-19 patients were included for model validation. We identified 7 predictors (history of cancer, chronic obstructive pulmonary disease, diabetes, heart disease, hypertension, hyperlipidemia, kidney disease) which combined with age and sex discriminated which patients would experience any of our three outcomes. The models achieved good performance in influenza and COVID-19 cohorts.
For COVID-19, the AUC ranges were COVER-H: 0.69-0.81, COVER-I: 0.73-0.91, and COVER-F: 0.72-0.90. Calibration varied across the validations with some of the COVID-19 validations being less well calibrated than the influenza validations. CONCLUSIONS: This research demonstrated the utility of using a proxy disease to develop a prediction model. The 3 COVER models with 9 predictors that were developed using influenza data perform well for COVID-19 patients for predicting hospitalization, intensive services, and fatality. The scores showed good discriminatory performance which transferred well to the COVID-19 population. There was some miscalibration in the COVID-19 validations, which is potentially due to the difference in symptom severity between the two diseases. A possible solution for this is to recalibrate the models in each location before use.
Subject(s)
COVID-19; Influenza, Human; Pneumonia; COVID-19 Testing; Humans; Influenza, Human/epidemiology; SARS-CoV-2; United States
ABSTRACT
INTRODUCTION: Efforts to characterize variability in epilepsy treatment pathways are limited by the large number of possible antiseizure medication (ASM) regimens and sequences, heterogeneity of patients, and challenges of measuring confounding variables and outcomes across institutions. The Observational Health Data Science and Informatics (OHDSI) collaborative is an international data network representing over 1 billion patient records using common data standards. However, few studies have applied OHDSI's Common Data Model (CDM) to the population with epilepsy and none have validated relevant concepts. The goals of this study were to demonstrate the feasibility of characterizing adult patients with epilepsy and ASM treatment pathways using the CDM in an electronic health record (EHR)-derived database. METHODS: We validated a phenotype algorithm for epilepsy in adults using the CDM in an EHR-derived database (2001-2020) against source records and a prospectively maintained database of patients with confirmed epilepsy. We obtained the frequency of all antecedent conditions and procedures for patients meeting the epilepsy phenotype criteria and characterized ASM exposure sequences over time and by age and sex. RESULTS: The phenotype algorithm identified epilepsy with 73.0-85.0% positive predictive value and 86.3% sensitivity. Many patients had neurologic conditions and diagnoses antecedent to meeting epilepsy criteria. Levetiracetam incrementally replaced phenytoin as the most common first-line agent, but significant heterogeneity remained, particularly in second-line and subsequent agents. Drug sequences included up to 8 unique ingredients and a total of 1,235 unique pathways were observed. CONCLUSIONS: Despite the availability of additional ASMs in the last 2 decades and accumulated guidelines and evidence, ASM use varies significantly in practice, particularly for second-line and subsequent agents. 
Multi-center OHDSI studies have the potential to better characterize the full extent of variability and support observational comparative effectiveness research, but additional work is needed to validate covariates and outcomes.
Subject(s)
Electronic Health Records; Epilepsy; Databases, Factual; Epilepsy/diagnosis; Epilepsy/drug therapy; Feasibility Studies; Humans; Levetiracetam
ABSTRACT
OBJECTIVE: Patients with autoimmune diseases were advised to shield to avoid coronavirus disease 2019 (COVID-19), but information on their prognosis is lacking. We characterized 30-day outcomes and mortality after hospitalization with COVID-19 among patients with prevalent autoimmune diseases, and compared outcomes after hospital admissions among similar patients with seasonal influenza. METHODS: A multinational network cohort study was conducted using electronic health records data from Columbia University Irving Medical Center (USA), Optum (USA), Department of Veterans Affairs (USA), Information System for Research in Primary Care-Hospitalization Linked Data (Spain) and claims data from IQVIA Open Claims (USA) and Health Insurance and Review Assessment (South Korea). All patients with prevalent autoimmune diseases, diagnosed and/or hospitalized between January and June 2020 with COVID-19, and similar patients hospitalized with influenza in 2017-18 were included. Outcomes were death and complications within 30 days of hospitalization. RESULTS: We studied 133 589 patients diagnosed and 48 418 hospitalized with COVID-19 with prevalent autoimmune diseases. Most patients were female, aged ≥50 years with previous comorbidities. The prevalence of hypertension (45.5-93.2%), chronic kidney disease (14.0-52.7%) and heart disease (29.0-83.8%) was higher in hospitalized vs diagnosed patients with COVID-19. Compared with 70 660 hospitalized with influenza, those admitted with COVID-19 had more respiratory complications including pneumonia and acute respiratory distress syndrome, and higher 30-day mortality (6.32-24.6% vs 2.2-4.3% with influenza). CONCLUSION: Compared with influenza, COVID-19 is a more severe disease, leading to more complications and higher mortality.
Subject(s)
Autoimmune Diseases/mortality; Autoimmune Diseases/virology; COVID-19/mortality; Hospitalization/statistics & numerical data; Influenza, Human/mortality; Adult; Aged; Aged, 80 and over; COVID-19/immunology; Cohort Studies; Female; Humans; Influenza, Human/immunology; Male; Middle Aged; Prevalence; Prognosis; Republic of Korea/epidemiology; SARS-CoV-2; Spain/epidemiology; United States/epidemiology; Young Adult
ABSTRACT
OBJECTIVES: Concern has been raised in the rheumatology community regarding recent regulatory warnings that hydroxychloroquine (HCQ) used in the coronavirus disease 2019 pandemic could cause acute psychiatric events. We aimed to study whether there is risk of incident depression, suicidal ideation or psychosis associated with HCQ as used for rheumatoid arthritis (RA). METHODS: We performed a new-user cohort study using claims and electronic medical records from 10 sources and 3 countries (Germany, UK and USA). RA patients ≥18 years of age and initiating HCQ were compared with those initiating sulfasalazine (SSZ; active comparator) and followed up in the short (30 days) and long term (on treatment). Study outcomes included depression, suicide/suicidal ideation and hospitalization for psychosis. Propensity score stratification and calibration using negative control outcomes were used to address confounding. Cox models were fitted to estimate database-specific calibrated hazard ratios (HRs), with estimates pooled where I² < 40%. RESULTS: A total of 918 144 and 290 383 users of HCQ and SSZ, respectively, were included. No consistent risk of psychiatric events was observed with short-term HCQ (compared with SSZ) use, with meta-analytic HRs of 0.96 (95% CI 0.79, 1.16) for depression, 0.94 (95% CI 0.49, 1.77) for suicide/suicidal ideation and 1.03 (95% CI 0.66, 1.60) for psychosis. No consistent long-term risk was seen, with meta-analytic HRs of 0.94 (95% CI 0.71, 1.26) for depression, 0.77 (95% CI 0.56, 1.07) for suicide/suicidal ideation and 0.99 (95% CI 0.72, 1.35) for psychosis. CONCLUSION: HCQ as used to treat RA does not appear to increase the risk of depression, suicide/suicidal ideation or psychosis compared with SSZ. No effects were seen in the short or long term. Use at a higher dose or for different indications needs further investigation. TRIAL REGISTRATION: Registered with EU PAS (reference no. EUPAS34497; http://www.encepp.eu/encepp/viewResource.htm?id=34498).
The full study protocol and analysis source code can be found at https://github.com/ohdsi-studies/Covid19EstimationHydroxychloroquine2.
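The rule of pooling estimates only when I² is below 40% can be illustrated with a short sketch. It computes Cochran's Q and I² from per-database log hazard ratios and applies inverse-variance fixed-effect pooling for brevity; the study's own pooling model may differ, and the analysis code in the repository above is the authoritative source. The function name `pooled_log_hr_and_i2` and the input numbers are hypothetical and are not study data.

```python
import math

def pooled_log_hr_and_i2(log_hrs, std_errs):
    """Return (pooled log HR, I^2 in percent) using inverse-variance weights."""
    weights = [1.0 / se ** 2 for se in std_errs]
    pooled = sum(w * y for w, y in zip(weights, log_hrs)) / sum(weights)
    # Cochran's Q measures how far the individual estimates sit from the pooled one.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, log_hrs))
    df = len(log_hrs) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return pooled, i2

# Hypothetical database-specific hazard ratios and standard errors (not study data).
hrs = [0.90, 1.05, 0.98]
ses = [0.10, 0.12, 0.15]
pooled, i2 = pooled_log_hr_and_i2([math.log(h) for h in hrs], ses)
if i2 < 40:
    print(f"pooled HR = {math.exp(pooled):.2f}, I2 = {i2:.1f}%")
else:
    print(f"I2 = {i2:.1f}%: too heterogeneous to pool under the 40% rule")
```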
Subject(s)
Antirheumatic Agents/adverse effects; COVID-19 Drug Treatment; Depression/chemically induced; Depression/epidemiology; Hydroxychloroquine/adverse effects; Psychoses, Substance-Induced/epidemiology; Psychoses, Substance-Induced/etiology; Suicidal Ideation; Suicide/statistics & numerical data; Adolescent; Adult; Aged; Antirheumatic Agents/therapeutic use; Arthritis, Rheumatoid/drug therapy; Cohort Studies; Female; Germany; Humans; Hydroxychloroquine/therapeutic use; Male; Middle Aged; Risk Assessment; United Kingdom; United States; Young Adult
ABSTRACT
OBJECTIVES: Large language models (LLMs) like Generative pre-trained transformer (ChatGPT) are powerful algorithms that have been shown to produce human-like text from input data. Several potential clinical applications of this technology have been proposed and evaluated by biomedical informatics experts. However, few have surveyed health care providers for their opinions about whether the technology is fit for use. METHODS: We distributed a validated mixed-methods survey to gauge practicing clinicians' comfort with LLMs for a breadth of tasks in clinical practice, research, and education, which were selected from the literature. RESULTS: A total of 30 clinicians fully completed the survey. Of the 23 tasks, 16 were rated positively by more than 50% of the respondents. Based on our qualitative analysis, health care providers considered LLMs to have excellent synthesis skills and efficiency. However, our respondents had concerns that LLMs could generate false information and propagate training data bias. Our survey respondents were most comfortable with scenarios that allow LLMs to function in an assistive role, like a physician extender or trainee. CONCLUSION: In a mixed-methods survey of clinicians about LLM use, health care providers were supportive of using LLMs in health care for many tasks, especially in assistive roles. There is a need for continued human-centered development of both LLMs and artificial intelligence in general.
Subject(s)
Algorithms; Artificial Intelligence; Humans; Health Facilities; Health Personnel; Language
ABSTRACT
Importance: Interdisciplinary practice parameters recommend that patients with drug-resistant epilepsy (DRE) undergo comprehensive neurodiagnostic evaluation, including presurgical assessment. Reporting from specialized centers suggests long delays to referral and underuse of surgery; however, longitudinal data are limited to characterize neurodiagnostic evaluation among patients with DRE in more diverse US settings and populations. Objective: To examine the rate and factors associated with neurodiagnostic studies and comprehensive evaluation among patients with DRE within 3 US cohorts. Design, Setting, and Participants: A retrospective cross-sectional study was conducted using the Observational Medical Outcomes Partnership Common Data Model including US multistate Medicaid data, commercial claims data, and Columbia University Medical Center (CUMC) electronic health record data. Patients meeting a validated computable phenotype algorithm for DRE between January 1, 2015, and April 1, 2020, were included. No eligible participants were excluded. Exposure: Demographic and clinical variables were queried. Main Outcomes and Measures: The proportion of patients receiving a composite proxy for comprehensive neurodiagnostic evaluation, including (1) magnetic resonance or other advanced brain imaging, (2) video electroencephalography, and (3) neuropsychological evaluation within 2 years of meeting the inclusion criteria. Results: A total of 33 542 patients with DRE were included in the Medicaid cohort, 22 496 in the commercial insurance cohort, and 2741 in the CUMC database. A total of 31 516 patients (53.6%) were women. The proportion of patients meeting the comprehensive evaluation main outcome in the Medicaid cohort was 4.5% (n = 1520); in the commercial insurance cohort, 8.0% (n = 1796); and in the CUMC cohort, 14.3% (n = 393).
Video electroencephalography (24.9% Medicaid, 28.4% commercial, 63.2% CUMC) and magnetic resonance imaging of the brain (35.6% Medicaid, 43.4% commercial, 52.6% CUMC) were performed more regularly than neuropsychological evaluation (13.0% Medicaid, 16.6% commercial, 19.2% CUMC) or advanced imaging (3.2% Medicaid, 5.4% commercial, 13.1% CUMC). Factors independently associated with greater odds of evaluation across all 3 data sets included the number of inpatient and outpatient nonemergency epilepsy visits and focal rather than generalized epilepsy. Conclusions and Relevance: The findings of this study suggest there is a gap in the use of diagnostic studies to evaluate patients with DRE. Care setting, insurance type, frequency of nonemergency visits, and epilepsy type are all associated with evaluation. A common data model can be used to measure adherence with best practices across a variety of observational data sources.
Subject(s)
Drug Resistant Epilepsy; Humans; Female; Male; Adult; Drug Resistant Epilepsy/diagnosis; Cross-Sectional Studies; Retrospective Studies; Middle Aged; Young Adult; United States; Electroencephalography; Adolescent; Magnetic Resonance Imaging; Neuroimaging; Medicaid/statistics & numerical data
ABSTRACT
PURPOSE: The specific aims of this paper are to (1) develop and operationalize an electronic health record (EHR) data quality framework, (2) apply the dimensions of the framework to the phenotype and treatment pathways of ductal carcinoma in situ (DCIS) using All of Us Research Program data, and (3) propose and apply a checklist to evaluate the application of the framework. METHODS: We developed a framework of five data quality dimensions (DQD; completeness, concordance, conformance, plausibility, and temporality). Participants signed a consent and Health Insurance Portability and Accountability Act authorization to share EHR data and responded to demographic questions in the Basics questionnaire. We evaluated the internal characteristics of the data and compared data with external benchmarks with descriptive and inferential statistics. We developed a DQD checklist to evaluate concept selection, internal verification, and external validity for each DQD. The Observational Medical Outcomes Partnership Common Data Model (OMOP CDM) concept ID codes for DCIS were used to select a cohort of 2,209 females 18 years and older. RESULTS: Using the proposed DQD checklist criteria, (1) concepts were selected and internally verified for conformance; (2) concepts were selected and internally verified for completeness; (3) concepts were selected, internally verified, and externally validated for concordance; (4) concepts were selected, internally verified, and externally validated for plausibility; and (5) concepts were selected, internally verified, and externally validated for temporality. CONCLUSION: This assessment and evaluation provided insights into data quality for the DCIS phenotype using EHR data from the All of Us Research Program. The review demonstrates that salient clinical measures can be selected, applied, and operationalized within a conceptual framework and evaluated for fitness for use by applying a proposed checklist.
Subject(s)
Breast Neoplasms; Carcinoma, Intraductal, Noninfiltrating; Data Accuracy; Electronic Health Records; Humans; Electronic Health Records/standards; Female; Carcinoma, Intraductal, Noninfiltrating/therapy; Carcinoma, Intraductal, Noninfiltrating/pathology; Carcinoma, Intraductal, Noninfiltrating/epidemiology; Breast Neoplasms/therapy; Breast Neoplasms/epidemiology; Breast Neoplasms/diagnosis; United States; Middle Aged; Adult; Aged
ABSTRACT
Large language models (LLMs) hold great promise in summarizing medical evidence. Most recent studies focus on the application of proprietary LLMs. Using proprietary LLMs introduces multiple risk factors, including a lack of transparency and vendor dependency. While open-source LLMs allow better transparency and customization, their performance falls short compared to the proprietary ones. In this study, we investigated to what extent fine-tuning open-source LLMs can further improve their performance. Utilizing a benchmark dataset, MedReview, consisting of 8161 pairs of systematic reviews and summaries, we fine-tuned three broadly-used, open-sourced LLMs, namely PRIMERA, LongT5, and Llama-2. Overall, the performance of open-source models was all improved after fine-tuning. The performance of fine-tuned LongT5 is close to GPT-3.5 with zero-shot settings. Furthermore, smaller fine-tuned models sometimes even demonstrated superior performance compared to larger zero-shot models. The above trends of improvement were manifested in both a human evaluation and a larger-scale GPT4-simulated evaluation.
ABSTRACT
Large language models (LLMs) hold great promise in summarizing medical evidence. Most recent studies focus on the application of proprietary LLMs. Using proprietary LLMs introduces multiple risk factors, including a lack of transparency and vendor dependency. While open-source LLMs allow better transparency and customization, their performance falls short compared to proprietary ones. In this study, we investigated to what extent fine-tuning open-source LLMs can further improve their performance in summarizing medical evidence. Utilizing a benchmark dataset, MedReview, consisting of 8,161 pairs of systematic reviews and summaries, we fine-tuned three broadly-used, open-sourced LLMs, namely PRIMERA, LongT5, and Llama-2. Overall, the fine-tuned LLMs obtained an increase of 9.89 in ROUGE-L (95% confidence interval: 8.94-10.81), 13.21 in METEOR score (95% confidence interval: 12.05-14.37), and 15.82 in CHRF score (95% confidence interval: 13.89-16.44). The performance of fine-tuned LongT5 is close to GPT-3.5 with zero-shot settings. Furthermore, smaller fine-tuned models sometimes even demonstrated superior performance compared to larger zero-shot models. The above trends of improvement were also manifested in both human and GPT4-simulated evaluations. Our results can be applied to guide model selection for tasks demanding particular domain knowledge, such as medical evidence summarization.
ABSTRACT
OBJECTIVES: Chart review as the current gold standard for phenotype evaluation cannot support observational research on electronic health records and claims data sources at scale. We aimed to evaluate the ability of structured data to support efficient and interpretable phenotype evaluation as an alternative to chart review. MATERIALS AND METHODS: We developed Knowledge-Enhanced Electronic Profile Review (KEEPER) as a phenotype evaluation tool that extracts a patient's structured data elements relevant to a phenotype and presents them in a standardized fashion following clinical reasoning principles. We evaluated its performance (interrater agreement, intermethod agreement, accuracy, and review time) compared to manual chart review for 4 conditions using a randomized 2-period, 2-sequence crossover design. RESULTS: Case ascertainment with KEEPER was twice as fast as manual chart review. 88.1% of the patients were classified concordantly using charts and KEEPER, but agreement varied depending on the condition. Missing data and differences in interpretation accounted for most of the discrepancies. Pairs of clinicians agreed in case ascertainment in 91.2% of the cases when using KEEPER compared to 76.3% when using charts. Patient classification aligned with the gold standard in 88.1% and 86.9% of the cases respectively. CONCLUSION: Structured data can be used for efficient and interpretable phenotype evaluation if they are limited to a relevant subset and organized according to clinical reasoning principles. A system that implements these principles can achieve noninferior performance compared to chart review in a fraction of the time.
Subject(s)
Electronic Health Records; Humans; Phenotype
ABSTRACT
With the burgeoning development of computational phenotypes, it is increasingly difficult to identify the right phenotype for the right tasks. This study uses a mixed-methods approach to develop and evaluate a novel metadata framework for retrieving and reusing computational phenotypes. Twenty active phenotyping researchers from 2 large research networks, Electronic Medical Records and Genomics and Observational Health Data Sciences and Informatics, were recruited to suggest metadata elements. Once consensus was reached on 39 metadata elements, 47 new researchers were surveyed to evaluate the utility of the metadata framework. The survey consisted of 5-point Likert-scale multiple-choice questions and open-ended questions. Two more researchers were asked to use the metadata framework to annotate 8 type-2 diabetes mellitus phenotypes. More than 90% of the survey respondents rated metadata elements regarding phenotype definition and validation methods and metrics positively, with a score of 4 or 5. Both researchers completed annotation of each phenotype within 60 min. Our thematic analysis of the narrative feedback indicates that the metadata framework was effective in capturing rich and explicit descriptions and enabling the search for phenotypes, compliance with data standards, and comprehensive validation metrics. Current limitations were its complexity for data collection and the entailed human costs.
ABSTRACT
Measurement concepts are essential to observational healthcare research; however, a lack of concept harmonization limits the quality of research that can be done on multisite research networks. We developed five methods that used a combination of automated, semi-automated, and manual approaches for generating measurement concept sets. We validated our concept sets by calculating their frequencies in cohorts from the Columbia University Irving Medical Center (CUIMC) database. For heart transplant patients, the preoperative frequencies of basic metabolic panel concept sets, which we generated by a semi-automated approach, were greater than 99%. We also made concept sets for lumbar puncture and coagulation panels, by automated and manual methods, respectively.
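The validation step described above, calculating how often a concept set appears in a cohort, can be sketched as the share of cohort patients with at least one recorded measurement whose concept belongs to the set. This is an illustrative reading rather than the authors' implementation: the in-memory data layout, the function name `concept_set_frequency`, and the integer identifiers are hypothetical, and the abstract does not specify time windows or how repeated measurements were handled.

```python
from typing import Dict, Iterable, Set


def concept_set_frequency(
    cohort: Iterable[int],
    measurements: Dict[int, Set[int]],
    concept_set: Set[int],
) -> float:
    """Fraction of cohort patients with at least one measurement in the set.

    cohort:       patient identifiers (hypothetical layout)
    measurements: patient id -> concept ids recorded for that patient
    concept_set:  concept ids that make up the concept set under validation
    """
    patients = list(cohort)
    if not patients:
        raise ValueError("cohort must not be empty")
    hits = sum(
        1 for pid in patients
        if measurements.get(pid, set()) & concept_set
    )
    return hits / len(patients)


# Hypothetical example: patients 1 and 2 have a concept from the set.
example_frequency = concept_set_frequency(
    cohort=[1, 2, 3, 4],
    measurements={1: {10, 11}, 2: {12}, 3: {99}},
    concept_set={10, 12},
)
```

A frequency near 100% in a cohort where the measurement is clinically expected, as with the basic metabolic panel before heart transplant, suggests that the concept set captures the relevant codes.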
Subject(s)
Information Storage and Retrieval , Logical Observation Identifiers Names and Codes , Databases, Factual , Humans , Systematized Nomenclature of Medicine
ABSTRACT
Easy access to large quantities of accurate health data is required to understand medical and scientific information in real-time; evaluate public health measures before, during, and after times of crisis; and prevent medical errors. Introducing a system in the USA that allows for efficient access to such health data and ensures auditability of data facts, while avoiding data silos, will require fundamental changes in current practices. Here, we recommend the implementation of standardized data collection and transmission systems, universal identifiers for individual patients and end users, a reference standard infrastructure to support calibration and integration of laboratory results from equivalent tests, and modernized working practices. Requiring comprehensive and binding standards, rather than incentivizing voluntary and often piecemeal efforts for data exchange, will allow us to achieve the analytical information environment that patients need.
ABSTRACT
Background: Characterization studies of COVID-19 patients with chronic obstructive pulmonary disease (COPD) are limited in size and scope. The aim of the study is to provide a large-scale characterization of COVID-19 patients with COPD. Methods: We included thirteen databases contributing data from January to June 2020 from North America (US), Europe and Asia. We defined two cohorts of patients with COVID-19, namely a 'diagnosed' cohort and a 'hospitalized' cohort. We followed patients from COVID-19 index date to 30 days or death. We performed descriptive analysis and reported the frequency of characteristics and outcomes among COPD patients with COVID-19. Results: The study included 934,778 patients in the diagnosed COVID-19 cohort and 177,201 in the hospitalized COVID-19 cohort. Observed COPD prevalence in the diagnosed cohort ranged from 3.8% (95% CI 3.5-4.1%) in French data to 22.7% (95% CI 22.4-23.0%) in US data; in the hospitalized cohort, it ranged from 1.9% (95% CI 1.6-2.2%) in South Korean data to 44.0% (95% CI 43.1-45.0%) in US data. COPD patients in the hospitalized cohort had greater comorbidity than those in the diagnosed cohort, including hypertension, heart disease, diabetes and obesity. Mortality was higher in COPD patients in the hospitalized cohort and ranged from 7.6% (95% CI 6.9-8.4%) to 32.2% (95% CI 28.0-36.7%) across databases. ARDS, acute renal failure, cardiac arrhythmia and sepsis were the most common outcomes among hospitalized COPD patients. Conclusion: COPD patients with COVID-19 have high levels of COVID-19-associated comorbidities and poor COVID-19 outcomes. Further research is required to identify patients with COPD at high risk of worse outcomes.
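Prevalence estimates with 95% confidence intervals, such as those reported above, are proportions with an interval around them. The abstract does not state which interval method was used, so the following sketch uses the Wilson score interval purely as an assumption; the function name `wilson_interval` and the example counts are hypothetical and are not taken from the study.

```python
import math
from typing import Tuple


def wilson_interval(cases: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 for about 95%)."""
    if total <= 0 or not 0 <= cases <= total:
        raise ValueError("need total > 0 and 0 <= cases <= total")
    p = cases / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    )
    return center - half_width, center + half_width


# Hypothetical counts: 50 patients with the condition out of 100.
lower, upper = wilson_interval(50, 100)
```

With database cohorts of tens of thousands of patients, any standard interval method gives narrow bounds, which is consistent with the tight intervals reported for the larger data sources.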
ABSTRACT
Purpose: Routinely collected real world data (RWD) have great utility in aiding the novel coronavirus disease (COVID-19) pandemic response. Here we present the international Observational Health Data Sciences and Informatics (OHDSI) Characterizing Health Associated Risks and Your Baseline Disease In SARS-COV-2 (CHARYBDIS) framework for standardisation and analysis of COVID-19 RWD. Patients and Methods: We conducted a descriptive retrospective database study using a federated network of data partners in the United States, Europe (the Netherlands, Spain, the UK, Germany, France and Italy) and Asia (South Korea and China). The study protocol and analytical package were released on 11th June 2020 and are iteratively updated via GitHub. We identified three non-mutually exclusive cohorts of 4,537,153 individuals with a clinical COVID-19 diagnosis or positive test, 886,193 hospitalized with COVID-19, and 113,627 hospitalized with COVID-19 requiring intensive services. Results: We aggregated over 22,000 unique characteristics describing patients with COVID-19. All comorbidities, symptoms, medications, and outcomes are described by cohort in aggregate counts and are readily available online. Globally, we observed similarities in the USA and Europe: more women diagnosed than men but more men hospitalized than women, most diagnosed cases between 25 and 60 years of age versus most hospitalized cases between 60 and 80 years of age. South Korea differed with more women than men hospitalized. Common comorbidities included type 2 diabetes, hypertension, chronic kidney disease and heart disease. Common presenting symptoms were dyspnea, cough and fever. Symptom data availability was more common in hospitalized cohorts than diagnosed. Conclusion: We constructed a global, multi-centre view to describe trends in COVID-19 progression, management and evolution over time. 
By characterising baseline variability in patients and geography, our work provides critical context that may otherwise be misconstrued as data quality issues. This is important as we perform studies on adverse events of special interest in COVID-19 vaccine surveillance.