ABSTRACT
Patients with chronic lymphocytic leukemia (CLL) and non-Hodgkin lymphoma (NHL) can develop hypogammaglobulinemia, a form of secondary immune deficiency (SID), from the disease and treatments. Patients with hypogammaglobulinemia and recurrent infections may benefit from immunoglobulin replacement therapy (IgRT). This study evaluated patterns of immunoglobulin G (IgG) testing and the effectiveness of IgRT in real-world patients with CLL or NHL. A retrospective, longitudinal study was conducted among adult patients diagnosed with CLL or NHL. Clinical data from the Massachusetts General Brigham Research Patient Data Registry were used. IgG testing, infections, and antimicrobial use were compared before vs 3, 6, and 12 months after IgRT initiation. Generalized estimating equation logistic regression models were used to estimate odds ratios, 95% confidence intervals, and P values. The study population included 17 192 patients (CLL: n = 3960; median age, 68 years; NHL: n = 13 232; median age, 64 years). In the CLL and NHL cohorts, 67% and 51.2% had IgG testing, and 6.5% and 4.7% received IgRT, respectively. After IgRT initiation, the proportion of patients with hypogammaglobulinemia, the odds of infections or severe infections, and associated antimicrobial use decreased significantly. Increased frequency of IgG testing was associated with a significantly lower likelihood of severe infection. In conclusion, in real-world patients with CLL or NHL, IgRT was associated with significant reductions in hypogammaglobulinemia, infections, severe infections, and associated antimicrobial use. Optimizing IgG testing and IgRT is warranted for the comprehensive management of SID in patients with CLL or NHL.
Subject(s)
Immunoglobulin G , B-Cell Chronic Lymphocytic Leukemia , Non-Hodgkin Lymphoma , Humans , B-Cell Chronic Lymphocytic Leukemia/complications , B-Cell Chronic Lymphocytic Leukemia/therapy , Aged , Middle Aged , Male , Female , Immunoglobulin G/blood , Non-Hodgkin Lymphoma/therapy , Non-Hodgkin Lymphoma/complications , Retrospective Studies , Infections/etiology , Agammaglobulinemia/complications , Agammaglobulinemia/therapy , Agammaglobulinemia/etiology , Treatment Outcome , Longitudinal Studies , Aged 80 and Over , Adult , Passive Immunization/methods
ABSTRACT
OBJECTIVE: Intracranial aneurysms (IA) and aortic aneurysms (AA) are both abnormal dilations of arteries with familial predisposition and have been proposed to share co-prevalence and pathophysiology. Associations of IA and non-aortic peripheral aneurysms are less well-studied. The goal of the study was to understand the patterns of aortic and peripheral (extracranial) aneurysms in patients with IA, and risk factors associated with the development of these aneurysms. METHODS: A total of 4701 patients were included in our retrospective analysis of all patients with intracranial aneurysms at our institution over the past 26 years. Patient demographics, comorbidities, and aneurysmal locations were analyzed. Univariate and multivariate analyses were performed to identify factors associated with the presence of extracranial aneurysms. RESULTS: A total of 3.4% of patients (161 of 4701) with IA had at least one extracranial aneurysm, and 2.8% had thoracic or abdominal aortic aneurysms. Age, male sex, hypertension, coronary artery disease, history of ischemic cerebral infarction, connective tissue disease, and family history of extracranial aneurysms in a first-degree relative were associated with the presence of extracranial aneurysms and a higher number of extracranial aneurysms. In addition, family history of extracranial aneurysms in a second-degree relative was associated with the presence of extracranial aneurysms, and atrial fibrillation was associated with a higher number of extracranial aneurysms. CONCLUSION: Significant comorbidities are associated with extracranial aneurysms in patients with IA. Family history of extracranial aneurysms has the strongest association and suggests that IA patients with a family history of extracranial aneurysms may benefit from screening.
Subject(s)
Intracranial Aneurysm , Humans , Male , Female , Intracranial Aneurysm/epidemiology , Intracranial Aneurysm/complications , Middle Aged , Retrospective Studies , Aged , Risk Factors , Adult , Aortic Aneurysm/epidemiology , Aortic Aneurysm/genetics , Aortic Aneurysm/diagnostic imaging , Aged 80 and Over
ABSTRACT
Scalable identification of patients with the post-acute sequelae of COVID-19 (PASC) is challenging due to a lack of reproducible precision phenotyping algorithms and the suboptimal accuracy, demographic biases, and underestimation of the PASC diagnosis code (ICD-10 U09.9). In a retrospective case-control study, we developed a precision phenotyping algorithm for identifying research cohorts of PASC patients, defined as a diagnosis of exclusion. We used longitudinal electronic health records (EHR) data from over 295 thousand patients from 14 hospitals and 20 community health centers in Massachusetts. The algorithm employs an attention mechanism to exclude sequelae that prior conditions can explain. We performed independent chart reviews to tune and validate our precision phenotyping algorithm. Our PASC phenotyping algorithm improves precision and prevalence estimation and reduces bias in identifying Long COVID patients compared to the U09.9 diagnosis code. Our algorithm identified a PASC research cohort of over 24 thousand patients (compared to about 6 thousand when using the U09.9 diagnosis code), with a 79.9 percent precision (compared to 77.8 percent from the U09.9 diagnosis code). Our estimated prevalence of PASC was 22.8 percent, which is close to the national estimates for the region. We also provide an in-depth analysis outlining the clinical attributes, encompassing identified lingering effects by organ, comorbidity profiles, and temporal differences in the risk of PASC. The PASC phenotyping method presented in this study boasts superior precision, accurately gauges the prevalence of PASC without underestimating it, and exhibits less bias in pinpointing Long COVID patients. The PASC cohort derived from our algorithm will serve as a springboard for delving into Long COVID's genetic, metabolomic, and clinical intricacies, surmounting the constraints of recent PASC cohort studies, which were hampered by their limited size and available outcome data.
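The precision figures quoted above are the kind of quantity obtained from a chart review of algorithm-flagged patients. As a minimal sketch, assuming a simple review in which each flagged patient is adjudicated as a true or false positive, precision and an approximate 95% Wilson score interval could be computed as follows. The function name and the counts are hypothetical and are not taken from the study.

```python
import math


def precision_with_wilson_ci(true_pos, false_pos, z=1.96):
    """Return (precision, lower, upper) using the Wilson score interval."""
    n = true_pos + false_pos
    if n == 0:
        raise ValueError("no flagged patients were reviewed")
    p = true_pos / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return p, center - half, center + half


# Hypothetical chart review: 100 algorithm-flagged patients, 80 confirmed.
precision, low, high = precision_with_wilson_ci(true_pos=80, false_pos=20)
print(f"precision={precision:.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

The Wilson interval is used here only because it behaves well for proportions near 0 or 1; the abstract does not state which interval method, if any, the authors applied.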
ABSTRACT
Background: Characterizing Post-Acute Sequelae of COVID (SARS-CoV-2 Infection), or PASC, has been challenging due to the multitude of sub-phenotypes, temporal attributes, and definitions. Scalable characterization of PASC sub-phenotypes can enhance screening capacities, disease management, and treatment planning. Methods: We conducted a retrospective multi-centre observational cohort study, leveraging longitudinal electronic health record (EHR) data of 30,422 patients from three healthcare systems in the Consortium for the Clinical Characterization of COVID-19 by EHR (4CE). From the total cohort, we applied a deductive approach on 12,424 individuals with follow-up data and developed a distributed representation learning process for providing augmented definitions for PASC sub-phenotypes. Findings: Our framework characterized seven PASC sub-phenotypes. We estimated that on average 15.7% of the hospitalized COVID-19 patients were likely to suffer from at least one PASC symptom and 5.98%, on average, had multiple symptoms. Joint pain and dyspnea had the highest prevalence, with an average prevalence of 5.45% and 4.53%, respectively. Interpretation: We provided a scalable framework to every participating healthcare system for estimating PASC sub-phenotypes prevalence and temporal attributes, thus developing a unified model that characterizes augmented sub-phenotypes across the different systems. Funding: Authors are supported by National Institute of Allergy and Infectious Diseases, National Institute on Aging, National Center for Advancing Translational Sciences, National Medical Research Council, National Institute of Neurological Disorders and Stroke, European Union, and National Institutes of Health.
ABSTRACT
OBJECTIVE: Patients who receive most care within a single healthcare system (colloquially called a "loyalty cohort" since they typically return to the same providers) have mostly complete data within that organization's electronic health record (EHR). Loyalty cohorts have low data missingness, which can unintentionally bias research results. Using proxies of routine care and healthcare utilization metrics, we compute a per-patient score that identifies a loyalty cohort. MATERIALS AND METHODS: We implemented a computable program for the widely adopted i2b2 platform that identifies loyalty cohorts in EHRs based on a machine-learning model, which was previously validated using linked claims data. We developed a novel validation approach, which tests, using only EHR data, whether patients returned to the same healthcare system after the training period. We evaluated these tools at 3 institutions using data from 2017 to 2019. RESULTS: Loyalty cohort calculations to identify patients who returned during a 1-year follow-up yielded a mean area under the receiver operating characteristic curve of 0.77 using the original model and 0.80 after calibrating the model at individual sites. Factors such as multiple medications or visits contributed significantly at all sites. Screening tests' contributions (eg, colonoscopy) varied across sites, likely due to coding and population differences. DISCUSSION: This open-source implementation of a "loyalty score" algorithm had good predictive power. Enriching research cohorts by utilizing these low-missingness patients is a way to obtain the data completeness necessary for accurate causal analysis. CONCLUSION: i2b2 sites can use this approach to select cohorts with mostly complete EHR data.
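The area under the receiver operating characteristic curve reported above can be read as the probability that a randomly chosen patient who returned receives a higher score than a randomly chosen patient who did not. A minimal sketch of that pairwise definition follows, assuming binary return labels and one score per patient. The scores and labels are hypothetical, and the code is not the i2b2 implementation described in the abstract.

```python
def auroc(scores, labels):
    """AUROC via the pairwise (Mann-Whitney) definition; ties count as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # returning patient scored higher
            elif p == n:
                wins += 0.5   # tie contributes half a concordant pair
    return wins / (len(pos) * len(neg))


# Hypothetical loyalty scores and whether each patient returned (1) or not (0).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
returned = [1, 1, 0, 1, 0, 0]
print(auroc(scores, returned))
```

The double loop is quadratic in cohort size and is meant only to make the definition explicit; a rank-based computation would be preferable for real cohorts.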
Subject(s)
Algorithms , Electronic Health Records , Humans , Machine Learning , Delivery of Health Care , Electronics
ABSTRACT
Rationale: Patients with chronic obstructive pulmonary disease (COPD) and type 2 diabetes (T2D) have worse clinical outcomes compared with patients without metabolic dysregulation. GLP-1 (glucagon-like peptide 1) receptor agonists (GLP-1RAs) reduce asthma exacerbation risk and improve FVC in patients with COPD. Objectives: To determine whether GLP-1RA use is associated with reduced COPD exacerbation rates, and severe and moderate exacerbation risk, compared with other T2D therapies. Methods: A retrospective, observational, electronic health records-based study was conducted using an active comparator, new-user design of 1,642 patients with COPD in a U.S. health system from 2012 to 2022. The COPD cohort was identified using a previously validated machine learning algorithm that includes a natural language processing tool. Exposures were defined as prescriptions for GLP-1RAs (reference group), DPP-4 (dipeptidyl peptidase 4) inhibitors (DPP-4is), SGLT2 (sodium-glucose cotransporter 2) inhibitors, or sulfonylureas. Measurements and Main Results: Unadjusted COPD exacerbation counts were lower in GLP-1RA users. Adjusted exacerbation rates were significantly higher in DPP-4i (incidence rate ratio, 1.48 [95% confidence interval, 1.08-2.04]; P = 0.02) and sulfonylurea (incidence rate ratio, 2.09 [95% confidence interval, 1.62-2.69]; P < 0.0001) users compared with GLP-1RA users. GLP-1RA use was also associated with significantly reduced risk of severe exacerbations compared with DPP-4i and sulfonylurea use, and of moderate exacerbations compared with sulfonylurea use. After adjustment for clinical covariates, moderate exacerbation risk was also lower in GLP-1RA users compared with DPP-4i users. No statistically significant difference in exacerbation outcomes was seen between GLP-1RA and SGLT2 inhibitor users. Conclusions: Prospective studies of COPD exacerbations in patients with comorbid T2D are warranted. 
Additional research may elucidate the mechanisms underlying these observed associations with T2D medications.
Subject(s)
Type 2 Diabetes Mellitus , Dipeptidyl-Peptidase IV Inhibitors , Chronic Obstructive Pulmonary Disease , Humans , Type 2 Diabetes Mellitus/complications , Type 2 Diabetes Mellitus/drug therapy , Hypoglycemic Agents/therapeutic use , Glucagon-Like Peptide-1 Receptor Agonists , Retrospective Studies , Dipeptidyl-Peptidase IV Inhibitors/therapeutic use , Prospective Studies , Sulfonylurea Compounds/therapeutic use , Chronic Obstructive Pulmonary Disease/complications , Chronic Obstructive Pulmonary Disease/drug therapy , Chronic Obstructive Pulmonary Disease/chemically induced
ABSTRACT
Physical and psychological symptoms lasting months following an acute COVID-19 infection are now recognized as post-acute sequelae of COVID-19 (PASC). Accurate tools for identifying such patients could enhance screening capabilities for the recruitment for clinical trials, improve the reliability of disease estimates, and allow for more accurate downstream cohort analysis. In this retrospective cohort study, we analyzed the EHR of hospitalized COVID-19 patients across three healthcare systems to develop a pipeline for better identifying patients with persistent PASC symptoms (dyspnea, fatigue, or joint pain) after their SARS-CoV-2 infection. We implemented distributed representation learning powered by the Machine Learning for modeling Health Outcomes (MLHO) to identify novel EHR features that could suggest PASC symptoms outside of typical diagnosis codes. MLHO applies an entropy-based feature selection and boosting algorithms for representation mining. These improved definitions were then used for estimating PASC among hospitalized patients. 30,422 hospitalized patients were diagnosed with COVID-19 across three healthcare systems between March 13, 2020 and February 28, 2021. The mean age of the population was 62.3 years (SD, 21.0 years) and 15,124 (49.7%) were female. We implemented the distributed representation learning technique to augment PASC definitions. These definitions were found to have positive predictive values of 0.73, 0.74, and 0.91 for dyspnea, fatigue, and joint pain, respectively. We estimated that 25 percent (CI 95%: 6-48), 11 percent (CI 95%: 6-15), and 13 percent (CI 95%: 8-17) of hospitalized COVID-19 patients will have dyspnea, fatigue, and joint pain, respectively, 3 months or longer after a COVID-19 diagnosis. We present a validated framework for screening and identifying patients with PASC in the EHR and then use the tool to estimate its prevalence among hospitalized COVID-19 patients.
ABSTRACT
BACKGROUND: Alzheimer's Disease (AD) is a complex clinical phenotype with unprecedented social and economic tolls on an ageing global population. Real-world data (RWD) from electronic health records (EHRs) offer opportunities to accelerate precision drug development and scale epidemiological research on AD. A precise characterization of AD cohorts is needed to address the noise abundant in RWD. METHODS: We conducted a retrospective cohort study to develop and test computational models for AD cohort identification using clinical data from 8 Massachusetts healthcare systems. We mined temporal representations from EHR data using the transitive sequential pattern mining algorithm (tSPM) to train and validate our models. We then tested our models against a held-out test set from a review of medical records to adjudicate the presence of AD. We trained two classes of Machine Learning models, using Gradient Boosting Machine (GBM), to compare the utility of AD diagnosis records versus the tSPM temporal representations (comprising sequences of diagnosis and medication observations) from electronic medical records for characterizing AD cohorts. FINDINGS: In a group of 4985 patients, we identified 219 tSPM temporal representations (i.e., transitive sequences) of medical records for constructing the best classification models. The models with sequential features improved AD classification by a magnitude of 3-16 percent over the use of AD diagnosis codes alone. The computed cohort included 663 patients, 35 of whom had no record of AD. Six groups of tSPM sequences were identified for characterizing the AD cohorts. INTERPRETATION: We present sequential patterns of diagnosis and medication codes from electronic medical records, as digital markers of Alzheimer's Disease. Classification algorithms developed on sequential patterns can replace standard features from EHRs to enrich phenotype modelling. 
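The temporal representations described above are ordered pairs of observations drawn from a patient's record. As a rough sketch of the idea only, assuming that each code is anchored at its first occurrence and that a pair is emitted whenever one code first appears strictly before another, the pairs for a single patient could be enumerated as follows. The function name, the first-occurrence rule, and the example codes are illustrative assumptions, not the published tSPM implementation.

```python
from itertools import combinations


def transitive_sequences(events):
    """Return the set of (earlier_code, later_code) pairs for one patient.

    events: iterable of (iso_date, code) tuples; ISO dates sort as strings.
    Assumption: only the first occurrence of each code is considered.
    """
    first_seen = {}
    for date, code in sorted(events):
        first_seen.setdefault(code, date)
    ordered = sorted(first_seen.items(), key=lambda item: item[1])
    pairs = set()
    for (code_a, date_a), (code_b, date_b) in combinations(ordered, 2):
        if date_a < date_b:  # same-day codes have no defined order
            pairs.add((code_a, code_b))
    return pairs


# Hypothetical record with made-up code labels.
record = [
    ("2019-01-05", "memory_loss_dx"),
    ("2019-03-10", "donepezil_rx"),
    ("2019-06-01", "memory_loss_dx"),
    ("2020-02-01", "ad_dx"),
]
print(sorted(transitive_sequences(record)))
```

Each pair could then serve as a binary feature for a classifier such as the gradient boosting models mentioned in the abstract.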
FUNDING: National Institutes of Health: the National Institute on Aging (RF1AG074372) and the National Institute of Allergy and Infectious Diseases (R01AI165535).
Subject(s)
Alzheimer Disease , Humans , Alzheimer Disease/diagnosis , Retrospective Studies , Algorithms , Machine Learning , Electronic Health Records
ABSTRACT
This cohort study uses hospitalization and 30-day mortality risks to create a temporal profile of the severity of COVID-19 in Massachusetts from July 2021 to December 2022.
Subject(s)
COVID-19 , Humans , Massachusetts/epidemiology , SARS-CoV-2
ABSTRACT
BACKGROUND: In electronic health records, patterns of missing laboratory test results could capture patients' course of disease as well as reflect clinicians' concerns or worries for possible conditions. These patterns are often understudied and overlooked. This study aims to identify informative patterns of missingness among laboratory data collected across 15 healthcare system sites in three countries for COVID-19 inpatients. METHODS: We collected and analyzed demographic, diagnosis, and laboratory data for 69,939 patients with positive COVID-19 PCR tests across three countries from 1 January 2020 through 30 September 2021. We analyzed missing laboratory measurements across sites, missingness stratification by demographic variables, temporal trends of missingness, correlations between labs based on missingness indicators over time, and clustering of groups of labs based on their missingness/ordering pattern. RESULTS: With these analyses, we identified mapping issues faced in seven out of 15 sites. We also identified nuances in data collection and variable definition for the various sites. Temporal trend analyses may support the use of laboratory test result missingness patterns in identifying severe COVID-19 patients. Lastly, using missingness patterns, we determined relationships between various labs that reflect clinical behaviors. CONCLUSION: In this work, we use computational approaches to relate missingness patterns to hospital treatment capacity and highlight the heterogeneity of looking at COVID-19 over time and at multiple sites, where there might be different phases, policies, etc. Changes in missingness could suggest a change in a patient's condition, and patterns of missingness among laboratory measurements could potentially identify clinical outcomes. This allows sites to consider missing data as informative to analyses and helps researchers identify which sites are better poised to study particular questions.
Subject(s)
COVID-19 , Electronic Health Records , Humans , Data Collection , Records , Cluster Analysis
ABSTRACT
PURPOSE: Assessing the risk of common, complex diseases requires consideration of clinical risk factors as well as monogenic and polygenic risks, which in turn may be reflected in family history. Returning risks to individuals and providers may influence preventive care or use of prophylactic therapies for those individuals at high genetic risk. METHODS: To enable integrated genetic risk assessment, the eMERGE (electronic MEdical Records and GEnomics) network is enrolling 25,000 diverse individuals in a prospective cohort study across 10 sites. The network developed methods to return cross-ancestry polygenic risk scores, monogenic risks, family history, and clinical risk assessments via a genome-informed risk assessment (GIRA) report and will assess uptake of care recommendations after return of results. RESULTS: GIRAs include summary care recommendations for 11 conditions, education pages, and clinical laboratory reports. The return of high-risk GIRA to individuals and providers includes guidelines for care and lifestyle recommendations. Assembling the GIRA required infrastructure and workflows for ingesting and presenting content from multiple sources. Recruitment began in February 2022. CONCLUSION: Return of a novel report for communicating monogenic, polygenic, and family history-based risk factors will inform the benefits of integrated genetic risk assessment for routine health care.
Subject(s)
Genome , Genomics , Humans , Prospective Studies , Genomics/methods , Risk Factors , Risk Assessment
ABSTRACT
OBJECTIVE: High BMI is associated with many comorbidities and mortality. This study aimed to elucidate the overall clinical risk of obesity using a genome- and phenome-wide approach. METHODS: This study performed a phenome-wide association study of BMI using a clinical cohort of 736,726 adults. This was followed by genetic association studies using two separate cohorts: one consisting of 65,174 adults in the Electronic Medical Records and Genomics (eMERGE) Network and another with 405,432 participants in the UK Biobank. RESULTS: Class 3 obesity was associated with 433 phenotypes, representing 59.3% of all billing codes in individuals with severe obesity. A genome-wide polygenic risk score for BMI, accounting for 7.5% of variance in BMI, was associated with 296 clinical diseases, including strong associations with type 2 diabetes, sleep apnea, hypertension, and chronic liver disease. In all three cohorts, 199 phenotypes were associated with class 3 obesity and polygenic risk for obesity, including novel associations such as increased risk of renal failure, venous insufficiency, and gastroesophageal reflux. CONCLUSIONS: This combined genomic and phenomic systematic approach demonstrated that obesity has a strong genetic predisposition and is associated with a considerable burden of disease across all disease classes.
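A polygenic risk score of the kind mentioned above is conventionally a weighted sum of an individual's effect-allele counts. The sketch below assumes per-variant weights (for example, effect sizes from a prior association study) and allele dosages between 0 and 2. The variant identifiers, weights, and dosages are hypothetical and do not come from the study.

```python
def polygenic_risk_score(dosages, weights):
    """Sum of weight * effect-allele dosage over the scored variants.

    dosages: dict of variant id -> dosage in [0, 2] for one individual.
    weights: dict of variant id -> per-allele effect size.
    """
    missing = [variant for variant in weights if variant not in dosages]
    if missing:
        raise KeyError(f"no dosage for variants: {missing}")
    return sum(weights[variant] * dosages[variant] for variant in weights)


# Hypothetical three-variant score for one individual.
weights = {"variant_A": 0.10, "variant_B": -0.05, "variant_C": 0.20}
dosages = {"variant_A": 2, "variant_B": 1, "variant_C": 0}
print(round(polygenic_risk_score(dosages, weights), 4))
```

Genome-wide scores apply the same sum over a very large number of variants and are usually standardized within a reference population before being tested for association with phenotypes.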
Subject(s)
Type 2 Diabetes Mellitus , Phenomics , Humans , Electronic Health Records , Genome-Wide Association Study , Type 2 Diabetes Mellitus/epidemiology , Type 2 Diabetes Mellitus/genetics , Single Nucleotide Polymorphism , Genomics , Genetic Predisposition to Disease , Obesity/epidemiology , Obesity/genetics , Phenotype , Cost of Illness
ABSTRACT
Importance: The SARS-CoV-2 Omicron subvariant, BA.2, may be less severe than previous variants; however, confounding factors make interpreting the intrinsic severity challenging. Objective: To compare the adjusted risks of mortality, hospitalization, intensive care unit admission, and invasive ventilation between the BA.2 subvariant and the Omicron and Delta variants, after accounting for multiple confounders. Design, Setting, and Participants: This was a retrospective cohort study that applied an entropy balancing approach. Patients in a multicenter inpatient and outpatient system in New England with COVID-19 between March 3, 2020, and June 20, 2022, were identified. Exposures: Cases were assigned as being exposed to the Delta (B.1.617.2) variant, the Omicron (B.1.1.529) variant, or the Omicron BA.2 lineage subvariants. Main Outcomes and Measures: The primary study outcome planned before analysis was risk of 30-day mortality. Secondary outcomes included the risks of hospitalization, invasive ventilation, and intensive care unit admissions. Results: Of 102 315 confirmed COVID-19 cases (mean [SD] age, 44.2 [21.6] years; 63 482 women [62.0%]), 20 770 were labeled as Delta variants, 52 605 were labeled as the Omicron B.1.1.529 variant, and 28 940 were labeled as Omicron BA.2 subvariants. Patient cases were excluded if they occurred outside the prespecified temporal windows associated with the variants or had minimal longitudinal data in the Mass General Brigham system before COVID-19. Mortality rates were 0.7% for Delta (B.1.617.2), 0.4% for Omicron (B.1.1.529), and 0.3% for Omicron (BA.2). The adjusted odds ratio of mortality from the Delta variant compared with the Omicron BA.2 subvariants was 2.07 (95% CI, 1.04-4.10) and that of the original Omicron variant compared with the Omicron BA.2 subvariant was 2.20 (95% CI, 1.56-3.11). For all outcomes, the Omicron BA.2 subvariants were significantly less severe than the Omicron and Delta variants.
Conclusions and Relevance: In this cohort study, after having accounted for a variety of confounding factors associated with SARS-CoV-2 outcomes, the Omicron BA.2 subvariant was found to be intrinsically less severe than both the Delta and Omicron variants. With respect to these variants, the severity profile of SARS-CoV-2 appears to be diminishing after taking into account various factors including therapeutics, vaccinations, and prior infections.
Subject(s)
COVID-19 , SARS-CoV-2 , Humans , Female , Adult , COVID-19/epidemiology , Cohort Studies , Retrospective Studies , New England/epidemiology
ABSTRACT
MOTIVATION: The i2b2 platform is used at major academic health institutions and research consortia for querying electronic health data. However, a major obstacle to wider utilization of the platform is the complexity of data loading, which entails a steep learning curve for the platform's complex data schemas. To address this problem, we have developed the i2b2-etl package that simplifies the data loading process, which will facilitate wider deployment and utilization of the platform. RESULTS: We have implemented i2b2-etl as a Python application that imports ontology and patient data using simplified input file schemas and provides inbuilt record number de-identification and data validation. We describe a real-world deployment of i2b2-etl for a population-management initiative at Mass General Brigham. AVAILABILITY AND IMPLEMENTATION: i2b2-etl is a free, open-source application implemented in Python available under the Mozilla 2 license. The application can be downloaded as compiled docker images. A live demo is available at https://i2b2clinical.org/demo-i2b2etl/ (username: demo, password: Etl@2021). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Electronic Health Records , Information Storage and Retrieval , Biology , Factual Databases , Humans , Informatics
ABSTRACT
OBJECTIVE: This study aimed to: (1) extend the Integrating the Biology and the Bedside (i2b2) data and application models to include medical imaging appropriate use criteria, enabling it to serve as a platform to monitor local impact of the Protecting Access to Medicare Act's (PAMA) imaging clinical decision support (CDS) requirements, and (2) validate the i2b2 extension using data from the Medicare Imaging Demonstration (MID) CDS implementation. MATERIALS AND METHODS: This study provided a reference implementation and assessed its validity and reliability using data from the MID, the federal government's predecessor to PAMA's imaging CDS program. The Star Schema was extended to describe the interactions of imaging ordering providers with the CDS. New ontologies were added to enable mapping medical imaging appropriateness data to the i2b2 schema. A z-ratio was used to test the significance of the difference between 2 independent proportions. RESULTS: The reference implementation used 26 327 orders for imaging examinations which were persisted to the modified i2b2 schema. As an illustration of the analytical capabilities of the Web Client, we report that 331/1192 or 28.1% of imaging orders were deemed appropriate by the CDS system at the end of the intervention period (September 2013), an increase from 162/1223 or 13.2% for the first month of the baseline period, December 2011 (P = .0212), consistent with previous studies. CONCLUSIONS: The i2b2 platform can be extended to monitor local impact of PAMA's appropriateness of imaging ordering CDS requirements.
Subject(s)
Clinical Decision Support Systems , Aged , Diagnostic Imaging , Humans , Medicare , Physiologic Monitoring , Reproducibility of Results , United States
ABSTRACT
OBJECTIVE: For multi-center heterogeneous Real-World Data (RWD) with time-to-event outcomes and high-dimensional features, we propose the SurvMaximin algorithm to estimate Cox model feature coefficients for a target population by borrowing summary information from a set of health care centers without sharing patient-level information. MATERIALS AND METHODS: For each of the centers from which we want to borrow information to improve the prediction performance for the target population, a penalized Cox model is fitted to estimate feature coefficients for the center. Using estimated feature coefficients and the covariance matrix of the target population, we then obtain a SurvMaximin estimated set of feature coefficients for the target population. The target population can be an entire cohort comprised of all centers, corresponding to federated learning, or a single center, corresponding to transfer learning. RESULTS: Simulation studies and a real-world international electronic health records application study, with 15 participating health care centers across three countries (France, Germany, and the U.S.), show that the proposed SurvMaximin algorithm achieves comparable or higher accuracy compared with the estimator using only the information of the target site and other existing methods. The SurvMaximin estimator is robust to variations in sample sizes and estimated feature coefficients between centers, which amounts to significantly improved estimates for target sites with fewer observations. CONCLUSIONS: The SurvMaximin method is well suited for both federated and transfer learning in the high-dimensional survival analysis setting. SurvMaximin only requires a one-time summary information exchange from participating centers. Estimated regression vectors can be very heterogeneous. SurvMaximin provides robust Cox feature coefficient estimates without outcome information in the target population and is privacy-preserving.
Subject(s)
Algorithms , Electronic Health Records , Humans , Privacy , Proportional Hazards Models , Survival Analysis
ABSTRACT
Background Models predicting atrial fibrillation (AF) risk, such as Cohorts for Heart and Aging Research in Genomic Epidemiology AF (CHARGE-AF), have not performed as well in electronic health records. Natural language processing (NLP) may improve models by using narrative electronic health record text. Methods and Results From a primary care network, we included patients aged ≥65 years with visits between 2003 and 2013 in development (n=32 960) and internal validation cohorts (n=13 992). An external validation cohort from a separate network from 2015 to 2020 included 39 051 patients. Model features were defined using electronic health record codified data and narrative data with NLP. We developed 2 models to predict 5-year AF incidence using (1) codified+NLP data and (2) codified data only and evaluated model performance. The analysis included 2839 incident AF cases in the development cohort and 1057 and 2226 cases in internal and external validation cohorts, respectively. The C-statistic was greater (P<0.001) in codified+NLP model (0.744 [95% CI, 0.735-0.753]) compared with codified-only (0.730 [95% CI, 0.720-0.739]) in the development cohort. In internal validation, the C-statistic of codified+NLP was modestly higher (0.735 [95% CI, 0.720-0.749]) compared with codified-only (0.729 [95% CI, 0.715-0.744]; P=0.06) and CHARGE-AF (0.717 [95% CI, 0.703-0.731]; P=0.002). Codified+NLP and codified-only were well calibrated, whereas CHARGE-AF underestimated AF risk. In external validation, the C-statistic of codified+NLP (0.750 [95% CI, 0.740-0.760]) remained higher (P<0.001) than codified-only (0.738 [95% CI, 0.727-0.748]) and CHARGE-AF (0.735 [95% CI, 0.725-0.746]). Conclusions Estimation of 5-year risk of AF can be modestly improved using NLP to incorporate narrative electronic health record data.
Subject(s)
Atrial Fibrillation , Natural Language Processing , Atrial Fibrillation/diagnosis , Atrial Fibrillation/epidemiology , Cohort Studies , Electronic Health Records , Humans , Incidence , Risk Assessment/methods
ABSTRACT
OBJECTIVE: The growing availability of electronic health records (EHR) data opens opportunities for integrative analysis of multi-institutional EHR to produce generalizable knowledge. A key barrier to such integrative analyses is the lack of semantic interoperability across different institutions due to coding differences. We propose a Multiview Incomplete Knowledge Graph Integration (MIKGI) algorithm to integrate information from multiple sources with partially overlapping EHR concept codes to enable translations between healthcare systems. METHODS: The MIKGI algorithm combines knowledge graph information from (i) embeddings trained from the co-occurrence patterns of medical codes within each EHR system and (ii) semantic embeddings of the textual strings of all medical codes obtained from the Self-Aligning Pretrained BERT (SAPBERT) algorithm. Due to the heterogeneity in the coding across healthcare systems, each EHR source provides partial coverage of the available codes. MIKGI synthesizes the incomplete knowledge graphs derived from these multi-source embeddings by minimizing a spherical loss function that combines the pairwise directional similarities of embeddings computed from all available sources. MIKGI outputs harmonized semantic embedding vectors for all EHR codes, which improves the quality of the embeddings and enables direct assessment of both similarity and relatedness between any pair of codes from multiple healthcare systems. RESULTS: With EHR co-occurrence data from Veterans Affairs (VA) healthcare and Mass General Brigham (MGB), the MIKGI algorithm produces high-quality embeddings for a variety of downstream tasks, including detecting known similar or related entity pairs and mapping VA local codes to the relevant EHR codes used at MGB. Based on the cosine similarity of the MIKGI trained embeddings, the AUC was 0.918 for detecting similar entity pairs and 0.809 for detecting related pairs. For cross-institutional medical code mapping, the top 1 and top 5 accuracy were 91.0% and 97.5% when mapping medication codes at VA to RxNorm medication codes at MGB, and 59.1% and 75.8% when mapping VA local laboratory codes to the LOINC hierarchy. When trained with 500 labels, the lab code mapping attained top 1 and top 5 accuracy of 77.7% and 87.9%. MIKGI also attained the best performance in selecting VA local lab codes for desired laboratory tests and COVID-19 related features for COVID EHR studies. Compared to existing methods, MIKGI attained the most robust performance, with accuracy the highest or near the highest across all tasks. CONCLUSIONS: The proposed MIKGI algorithm can effectively integrate incomplete summary data from biomedical text and EHR data to generate harmonized embeddings for EHR codes for knowledge graph modeling and cross-institutional translation of EHR codes.
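The top 1 and top 5 accuracy figures above rest on ranking candidate target codes by cosine similarity to each source code's embedding. The sketch below illustrates that evaluation step only, under stated assumptions: the three-dimensional vectors, the code names (`S_glu`, `T_glucose`, and so on), and the gold mapping are invented for illustration and do not come from MIKGI, VA, or MGB data, and the embedding training itself is not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k(query_vec, targets, k):
    """Names of the k target codes most similar to query_vec."""
    ranked = sorted(targets, key=lambda name: cosine(query_vec, targets[name]),
                    reverse=True)
    return ranked[:k]


def top_k_accuracy(sources, targets, gold, k):
    """Share of source codes whose gold target appears in their top-k list."""
    hits = sum(gold[name] in top_k(vec, targets, k)
               for name, vec in sources.items())
    return hits / len(sources)


# Hypothetical embeddings for illustration only.
targets = {
    "T_glucose": [1.0, 0.0, 0.0],
    "T_sodium": [0.0, 1.0, 0.0],
    "T_potassium": [0.0, 0.8, 0.6],
}
sources = {"S_glu": [0.9, 0.1, 0.0], "S_na": [0.1, 0.7, 0.7]}
gold = {"S_glu": "T_glucose", "S_na": "T_sodium"}

acc_top1 = top_k_accuracy(sources, targets, gold, 1)
acc_top2 = top_k_accuracy(sources, targets, gold, 2)
print(acc_top1, acc_top2)
```

In the toy data, `S_na` sits closer to `T_potassium` than to its gold target `T_sodium`, so it counts as a miss at k = 1 and a hit at k = 2. The same effect explains why top 5 accuracy is never lower than top 1 accuracy.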
Subject(s)
COVID-19 , Electronic Health Records , Algorithms , Humans , Logical Observation Identifiers Names and Codes , Pattern Recognition, Automated
ABSTRACT
Analysis of health data typically requires a data analyst to develop queries in structured query language (SQL). Because the SQL queries are created manually, they are prone to errors. In addition, accurate implementation of the queries depends on effective communication with clinical experts, which makes the analysis even more error-prone. As a potential resolution, we explore an alternative approach in which the analysis is performed through a graphical interface that automatically generates the SQL queries. This approach allows clinical experts to perform complex queries on the data directly, despite their unfamiliarity with SQL syntax. The interface provides an intuitive understanding of the query logic, which makes the analysis transparent and comprehensible to the clinical study staff, thereby enhancing the transparency and validity of the analysis. This study demonstrates the feasibility of using a user-friendly interface that automatically generates SQL for analysis of health data. It outlines challenges that should inform the design of user-friendly tools to improve transparency and reproducibility of data analysis.
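As a rough illustration of the idea, the sketch below turns a list of structured filters, of the kind a graphical form might collect, into a parameterized SQL query and runs it with Python's built-in `sqlite3` module. The abstract does not describe the tool's internals, so the `patients` table, its columns, the allow-lists, and the `build_query` helper are all hypothetical.

```python
import sqlite3

# Hypothetical schema: names a form could expose to the user.
TABLE = "patients"
ALLOWED_COLUMNS = {"age", "sex", "diagnosis_code"}
ALLOWED_OPERATORS = {"=", "!=", "<", "<=", ">", ">="}


def build_query(filters):
    """Turn (column, operator, value) filters into SQL plus bound values.

    Column and operator names are checked against allow-lists; values are
    passed as parameters and never pasted into the SQL text.
    """
    clauses, params = [], []
    for column, op, value in filters:
        if column not in ALLOWED_COLUMNS or op not in ALLOWED_OPERATORS:
            raise ValueError(f"unsupported filter: {column} {op}")
        clauses.append(f"{column} {op} ?")
        params.append(value)
    where = " AND ".join(clauses) if clauses else "1=1"
    return f"SELECT COUNT(*) FROM {TABLE} WHERE {where}", params


# Made-up rows in an in-memory database, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (age INTEGER, sex TEXT, diagnosis_code TEXT)")
conn.executemany(
    "INSERT INTO patients VALUES (?, ?, ?)",
    [(70, "F", "I48"), (55, "M", "I48"), (81, "F", "E11")],
)

# What a user might select in the form: age >= 65 AND diagnosis_code = I48.
sql, params = build_query([("age", ">=", 65), ("diagnosis_code", "=", "I48")])
count = conn.execute(sql, params).fetchone()[0]
print(sql, params, count)
```

Showing the generated SQL next to the selected filters is one way such an interface could make the query logic inspectable by clinical staff. Binding values as parameters avoids a class of errors that hand-written string concatenation invites.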