ABSTRACT
Post-Acute Sequelae of SARS-CoV-2 infection (PASC), also known as Long-COVID, encompasses a variety of complex and varied outcomes following COVID-19 infection that are still poorly understood. We clustered over 600 million condition diagnoses from 14 million patients available through the National COVID Cohort Collaborative (N3C), generating hundreds of highly detailed clinical phenotypes. Assessing patient clinical trajectories using these clusters allowed us to identify individual conditions and phenotypes strongly increased after acute infection. We found many conditions increased in COVID-19 patients compared to controls, and using a novel method to associate patients with clusters over time, we additionally found phenotypes specific to patient sex, age, wave of infection, and PASC diagnosis status. While many of these results reflect known PASC symptoms, the resolution provided by this unprecedented data scale suggests avenues for improved diagnostics and mechanistic understanding of this multifaceted disease.
ABSTRACT
BACKGROUND: A wealth of clinically relevant information is only obtainable within unstructured clinical narratives, leading to great interest in clinical natural language processing (NLP). While a multitude of approaches to NLP exist, current algorithm development approaches have limitations that can slow the development process. These limitations are exacerbated when the task is emergent, as is the case currently for NLP extraction of signs and symptoms of COVID-19 and postacute sequelae of SARS-CoV-2 infection (PASC). OBJECTIVE: This study aims to highlight the current limitations of existing NLP algorithm development approaches that are exacerbated by NLP tasks surrounding emergent clinical concepts and to illustrate our approach to addressing these issues through the use case of developing an NLP system for the signs and symptoms of COVID-19 and PASC. METHODS: We used 2 preexisting studies on PASC as a baseline to determine a set of concepts that should be extracted by NLP. This concept list was then used in conjunction with the Unified Medical Language System to autonomously generate an expanded lexicon to weakly annotate a training set, which was then reviewed by a human expert to generate a fine-tuned NLP algorithm. The annotations from a fully human-annotated test set were then compared with NLP results from the fine-tuned algorithm. The NLP algorithm was then deployed to 10 additional sites that were also running our NLP infrastructure. Of these 10 sites, 5 were used to conduct a federated evaluation of the NLP algorithm. RESULTS: An NLP algorithm consisting of 12,234 unique normalized text strings corresponding to 2366 unique concepts was developed to extract COVID-19 or PASC signs and symptoms. An unweighted mean dictionary coverage of 77.8% was found for the 5 sites. CONCLUSIONS: The evolutionary and time-critical nature of the PASC NLP task significantly complicates existing approaches to NLP algorithm development. 
In this work, we present a hybrid approach using the Open Health Natural Language Processing Toolkit aimed at addressing these needs with a dictionary-based weak labeling step that minimizes the need for additional expert annotation while still preserving the fine-tuning capabilities of expert involvement.
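As a rough illustration of the dictionary-based weak labeling step and of an unweighted mean across sites, the following Python sketch matches normalized text strings from a small lexicon against a clinical note. The lexicon entries, concept identifiers, and per-site coverage values are hypothetical placeholders rather than data from the study, and the matching logic is far simpler than what the Open Health Natural Language Processing Toolkit provides.

```python
import re

# Hypothetical lexicon: normalized text string -> concept identifier (placeholders).
LEXICON = {
    "shortness of breath": "C_DYSPNEA",
    "dyspnea": "C_DYSPNEA",
    "fatigue": "C_FATIGUE",
    "loss of smell": "C_ANOSMIA",
}

def normalize(text):
    """Lowercase and collapse runs of non-alphanumeric characters to one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def weak_label(note, lexicon=LEXICON):
    """Return sorted (string, concept) pairs whose lexicon string occurs in the note."""
    padded = " " + normalize(note) + " "
    hits = []
    for term, concept in lexicon.items():
        # Padding with spaces restricts matches to whole-word boundaries.
        if " " + term + " " in padded:
            hits.append((term, concept))
    return sorted(hits)

def unweighted_mean(values):
    """Plain average: each site counts equally regardless of its size."""
    return sum(values) / len(values)

note = "Patient reports Fatigue and shortness-of-breath; denies loss of taste."
labels = weak_label(note)
print(labels)

site_coverage = [0.70, 0.80, 0.90]  # hypothetical per-site coverage values
print(unweighted_mean(site_coverage))
```

A matcher this simple ignores negation and context (it would also label "denies fatigue"), which is one reason a weakly annotated set would still benefit from the expert review step described above.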
ABSTRACT
BACKGROUND: Pancreatic ductal adenocarcinoma (PDAC) is an aggressive tumor. Prognosis is poor and survival is low in patients diagnosed with this disease, with a survival rate of ~12% at 5 years. Immunotherapy, including adoptive T cell transfer therapy, has not impacted the outcomes in patients with PDAC, due in part to the hostile tumor microenvironment (TME) which limits T cell trafficking and persistence. We posit that murine models serve as useful tools to study the fate of T cell therapy. Currently, genetically engineered mouse models (GEMMs) for PDAC are considered a "gold-standard" as they recapitulate many aspects of human disease. However, these models have limitations, including marked tumor variability across individual mice and the cost of colony maintenance. METHODS: Using flow cytometry and immunohistochemistry, we characterized the immunological features and trafficking patterns of adoptively transferred T cells in orthotopic PDAC (C57BL/6) models using two mouse cell lines, KPC-Luc and MT-5, isolated from C57BL/6 KPC-GEMM (KrasLSL-G12D/+p53-/- and KrasLSL-G12D/+p53LSL-R172H/+, respectively). RESULTS: The MT-5 orthotopic model best recapitulates the cellular and stromal features of the TME in the PDAC GEMM. In contrast, far more host immune cells infiltrate the KPC-Luc tumors, which have less stroma, although CD4+ and CD8+ T cells were similarly detected in the MT-5 tumors compared with KPC-GEMM in mice. Interestingly, we found that chimeric antigen receptor (CAR) T cells redirected to recognize mesothelin on these tumors that signal via CD3ζ and 41BB (Meso-41BBζ-CAR T cells) infiltrated the tumors of mice bearing stroma-devoid KPC-Luc orthotopic tumors, but not MT-5 tumors. CONCLUSIONS: Our data establish for the first time a reproducible and realistic clinical system useful for modeling stroma-rich and stroma-devoid PDAC tumors. 
These models can support in-depth study of how to overcome the barriers that limit the antitumor activity of adoptively transferred T cells.
Subject(s)
Carcinoma, Pancreatic Ductal , Pancreatic Neoplasms , Humans , Animals , Mice , Mice, Inbred C57BL , Proto-Oncogene Proteins p21(ras) , CD8-Positive T-Lymphocytes , Tumor Suppressor Protein p53 , Pancreatic Neoplasms/therapy , Carcinoma, Pancreatic Ductal/therapy , Tumor Microenvironment
ABSTRACT
Introduction: With persistent incidence, incomplete vaccination rates, confounding respiratory illnesses, and few therapeutic interventions available, COVID-19 continues to be a burden on the pediatric population. During a surge, it is difficult for hospitals to direct limited healthcare resources effectively. While the overwhelming majority of pediatric infections are mild, there have been life-threatening exceptions that illuminated the need to proactively identify pediatric patients at risk of severe COVID-19 and other respiratory infectious diseases. However, a nationwide capability for developing validated computational tools to identify pediatric patients at risk using real-world data does not exist. Methods: HHS ASPR BARDA sought, through the power of competition in a challenge, to create computational models to address two clinically important questions using the National COVID Cohort Collaborative: (1) Of pediatric patients who test positive for COVID-19 in an outpatient setting, who are at risk for hospitalization? (2) Of pediatric patients who test positive for COVID-19 and are hospitalized, who are at risk for needing mechanical ventilation or cardiovascular interventions? Results: This challenge was the first, multi-agency, coordinated computational challenge carried out by the federal government as a response to a public health emergency. Fifty-five computational models were evaluated across both tasks and two winners and three honorable mentions were selected. Conclusion: This challenge serves as a framework for how the government, research communities, and large data repositories can be brought together to source solutions when resources are strapped during a pandemic.
ABSTRACT
Bulk analyses of pancreatic ductal adenocarcinoma (PDAC) samples are complicated by the tumor microenvironment (TME), i.e. signals from fibroblasts, endocrine, exocrine, and immune cells. Despite this, we and others have established tumor and stroma subtypes with prognostic significance. However, understanding of underlying signals driving distinct immune and stromal landscapes is still incomplete. Here we integrate 92 single cell RNA-seq samples from seven independent studies to build a reproducible PDAC atlas with a focus on tumor-TME interdependence. Patients with activated stroma are synonymous with higher myofibroblastic and immunogenic fibroblasts, and furthermore show increased M2-like macrophages and regulatory T-cells. Contrastingly, patients with 'normal' stroma show M1-like recruitment, elevated effector and exhausted T-cells. To aid interoperability of future studies, we provide a pretrained cell type classifier and an atlas of subtype-based signaling factors that we also validate in mouse data. Ultimately, this work leverages the heterogeneity among single-cell studies to create a comprehensive view of the orchestra of signaling interactions governing PDAC.
Subject(s)
Carcinoma, Pancreatic Ductal , Pancreatic Neoplasms , Animals , Mice , Tumor Microenvironment , Pancreatic Neoplasms/genetics , Carcinoma, Pancreatic Ductal/genetics , Fibroblasts
ABSTRACT
Gene expression analysis of samples with mixed cell types provides only limited insight into the characteristics of specific tissues. In silico deconvolution can be applied to extract cell type specific expression, thus avoiding prohibitively expensive techniques such as cell sorting or single-cell sequencing. Non-negative matrix factorization (NMF) is a deconvolution method shown to be useful for gene expression data, in part due to its constraint of non-negativity. Unlike other methods, NMF provides the capability to deconvolve without prior knowledge of the components of the model. However, NMF is not guaranteed to provide a globally unique solution. In this work, we present FaStaNMF, a method that balances achieving global stability of the NMF results, which is essential for inter-experiment and inter-lab reproducibility, with accuracy and speed. Results: FaStaNMF was applied to four datasets with known ground truth, created based on publicly available data or by using our simulation infrastructure, RNAGinesis. We assessed FaStaNMF on three criteria (speed, accuracy, and stability), and it compared favorably to the standard approach of achieving reproducible results with NMF. We expect that FaStaNMF can be applied successfully to a wide array of biological data, such as different tumor/immune and other disease microenvironments.
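To make the deconvolution idea concrete, here is a minimal sketch of plain NMF using the classic multiplicative update rules, written in pure Python. It is not FaStaNMF: the function names, toy matrix, iteration count, and seed are illustrative assumptions. Running it with different seeds can produce different factor pairs, which is the non-uniqueness the abstract refers to.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(a):
    """Swap rows and columns."""
    return [list(col) for col in zip(*a)]

def frob_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))

def nmf(v, k, iters=200, seed=0, eps=1e-9):
    """Factor non-negative V (genes x samples) into W (genes x k) and H (k x samples)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), elementwise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        # W <- W * (V H^T) / (W H H^T), elementwise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return w, h

# Toy mixture: two hypothetical cell-type signatures mixed into four samples.
signatures = [[5.0, 0.0], [0.0, 4.0], [2.0, 2.0]]            # genes x cell types
proportions = [[1.0, 0.7, 0.3, 0.0], [0.0, 0.3, 0.7, 1.0]]   # cell types x samples
v = matmul(signatures, proportions)

w, h = nmf(v, k=2)
print(frob_error(v, w, h))
```

The multiplicative form keeps every entry of W and H non-negative as long as the initial values are non-negative, which is why no explicit clipping step is needed.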
ABSTRACT
Long COVID, or complications arising from COVID-19 weeks after infection, has become a central concern for public health experts. The United States National Institutes of Health founded the RECOVER initiative to better understand long COVID. We used electronic health records available through the National COVID Cohort Collaborative to characterize the association between SARS-CoV-2 vaccination and long COVID diagnosis. Among patients with a COVID-19 infection between August 1, 2021 and January 31, 2022, we defined two cohorts using distinct definitions of long COVID, either a clinical diagnosis (n = 47,404) or a previously described computational phenotype (n = 198,514), to compare unvaccinated individuals to those with a complete vaccine series prior to infection. Evidence of long COVID was monitored through June or July of 2022, depending on patients' data availability. We found that vaccination was consistently associated with lower odds and rates of long COVID clinical diagnosis and high-confidence computationally derived diagnosis after adjusting for sex, demographics, and medical history.
Subject(s)
COVID-19 , Post-Acute COVID-19 Syndrome , United States/epidemiology , Humans , COVID-19/epidemiology , COVID-19/prevention & control , COVID-19 Vaccines , Cohort Studies , SARS-CoV-2 , Vaccination
ABSTRACT
BACKGROUND: AKI is associated with mortality in patients hospitalized with coronavirus disease 2019 (COVID-19); however, its incidence, geographic distribution, and temporal trends since the start of the pandemic are understudied. METHODS: Electronic health record data were obtained from 53 health systems in the United States in the National COVID Cohort Collaborative. We selected hospitalized adults diagnosed with COVID-19 between March 6, 2020, and January 6, 2022. AKI was determined with serum creatinine and diagnosis codes. Time was divided into 16-week periods (P1-6) and geographical regions into Northeast, Midwest, South, and West. Multivariable models were used to analyze the risk factors for AKI or mortality. RESULTS: Of a total cohort of 336,473, 129,176 (38%) patients had AKI. Fifty-six thousand three hundred and twenty-two (17%) lacked a diagnosis code but had AKI based on the change in serum creatinine. Similar to patients coded for AKI, these patients had higher mortality compared with those without AKI. The incidence of AKI was highest in P1 (47%; 23,097/48,947), lower in P2 (37%; 12,102/32,513), and relatively stable thereafter. Compared with the Midwest, the Northeast, South, and West had higher adjusted odds of AKI in P1. Subsequently, the South and West regions continued to have the highest relative AKI odds. In multivariable models, AKI defined by either serum creatinine or diagnostic code and the severity of AKI was associated with mortality. CONCLUSIONS: The incidence and distribution of COVID-19-associated AKI changed since the first wave of the pandemic in the United States. PODCAST: This article contains a podcast at https://dts.podtrac.com/redirect.mp3/www.asn-online.org/media/podcast/CJASN/2023_08_08_CJN0000000000000192.mp3.
Subject(s)
Acute Kidney Injury , COVID-19 , Adult , Humans , COVID-19/complications , COVID-19/epidemiology , Retrospective Studies , Creatinine , Risk Factors , Acute Kidney Injury/diagnosis , Hospital Mortality
ABSTRACT
Machine learning (ML)-driven computable phenotypes are among the most challenging to share and reproduce. Despite this difficulty, the urgent public health considerations around Long COVID make it especially important to ensure the rigor and reproducibility of Long COVID phenotyping algorithms such that they can be made available to a broad audience of researchers. As part of the NIH Researching COVID to Enhance Recovery (RECOVER) Initiative, researchers with the National COVID Cohort Collaborative (N3C) devised and trained an ML-based phenotype to identify patients highly probable to have Long COVID. Supported by RECOVER, N3C and NIH's All of Us study partnered to reproduce the output of N3C's trained model in the All of Us data enclave, demonstrating model extensibility in multiple environments. This case study in ML-based phenotype reuse illustrates how open-source software best practices and cross-site collaboration can de-black-box phenotyping algorithms, prevent unnecessary rework, and promote open science in informatics.
Subject(s)
Boxing , COVID-19 , Population Health , Humans , Electronic Health Records , Post-Acute COVID-19 Syndrome , Reproducibility of Results , Machine Learning , Phenotype
ABSTRACT
STUDY OBJECTIVES: Obstructive sleep apnea (OSA) has been associated with more severe acute coronavirus disease-2019 (COVID-19) outcomes. We assessed OSA as a potential risk factor for Post-Acute Sequelae of SARS-CoV-2 (PASC). METHODS: We assessed the impact of preexisting OSA on the risk for probable PASC in adults and children using electronic health record data from multiple research networks. Three research networks within the REsearching COVID to Enhance Recovery initiative (PCORnet Adult, PCORnet Pediatric, and the National COVID Cohort Collaborative [N3C]) employed a harmonized analytic approach to examine the risk of probable PASC in COVID-19-positive patients with and without a diagnosis of OSA prior to pandemic onset. Unadjusted odds ratios (ORs) were calculated as well as ORs adjusted for age group, sex, race/ethnicity, hospitalization status, obesity, and preexisting comorbidities. RESULTS: Across networks, the unadjusted OR for probable PASC associated with a preexisting OSA diagnosis in adults and children ranged from 1.41 to 3.93. Adjusted analyses found an attenuated association that remained significant among adults only. Multiple sensitivity analyses with expanded inclusion criteria and covariates yielded results consistent with the primary analysis. CONCLUSIONS: Adults with preexisting OSA were found to have significantly elevated odds of probable PASC. This finding was consistent across data sources, approaches for identifying COVID-19-positive patients, and definitions of PASC. Patients with OSA may be at elevated risk for PASC after SARS-CoV-2 infection and should be monitored for post-acute sequelae.
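For readers less familiar with the measure, an unadjusted odds ratio can be computed directly from a 2x2 table of exposure (preexisting OSA) against outcome (probable PASC). The sketch below uses hypothetical counts that are not taken from the study, and the Wald interval is one common approximation chosen here for illustration. The adjusted ORs reported above would instead come from a regression model with the listed covariates.

```python
import math

def odds_ratio(exposed_cases, exposed_noncases, unexposed_cases, unexposed_noncases):
    """Unadjusted odds ratio from the four cells of a 2x2 table."""
    return (exposed_cases * unexposed_noncases) / (exposed_noncases * unexposed_cases)

def wald_ci(a, b, c, d, z=1.96):
    """Approximate 95% confidence interval computed on the log-odds scale."""
    log_or = math.log(odds_ratio(a, b, c, d))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return math.exp(log_or - z * se), math.exp(log_or + z * se)

# Hypothetical counts, not from the study:
# a = OSA with PASC, b = OSA without PASC, c = no OSA with PASC, d = no OSA without PASC.
a, b, c, d = 30, 70, 20, 80
print(odds_ratio(a, b, c, d))
print(wald_ci(a, b, c, d))
```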
Subject(s)
COVID-19 , Sleep Apnea, Obstructive , Adult , Humans , Child , COVID-19/complications , COVID-19/diagnosis , COVID-19/epidemiology , Electronic Health Records , Post-Acute COVID-19 Syndrome , SARS-CoV-2 , Disease Progression , Risk Factors , Sleep Apnea, Obstructive/complications , Sleep Apnea, Obstructive/diagnosis , Sleep Apnea, Obstructive/epidemiology
ABSTRACT
Pancreatic ductal adenocarcinoma (PDAC) is an aggressive disease for which potent therapies have limited efficacy. Several studies have described the transcriptomic landscape of PDAC tumors to provide insight into potentially actionable gene expression signatures to improve patient outcomes. Despite centralization efforts from multiple organizations and increased transparency requirements from funding agencies and publishers, analysis of public PDAC data remains difficult. Bioinformatic pitfalls litter public transcriptomic data, such as subtle inclusion of low-purity and non-adenocarcinoma cases. These pitfalls can introduce non-specificity to gene signatures without appropriate data curation, which can negatively impact findings. To reduce barriers to analysis, we have created pdacR ( http://pdacR.bmi.stonybrook.edu , github.com/rmoffitt/pdacR), an open-source software package and web-tool with annotated datasets from landmark studies and an interface for user-friendly analysis in clustering, differential expression, survival, and dimensionality reduction. Using this tool, we present a multi-dataset analysis of PDAC transcriptomics that confirms the basal-like/classical model over alternatives.
Subject(s)
Carcinoma, Pancreatic Ductal , Pancreatic Neoplasms , Humans , Prognosis , Pancreatic Neoplasms/pathology , Carcinoma, Pancreatic Ductal/genetics , Carcinoma, Pancreatic Ductal/pathology , Gene Expression Profiling , Pancreatic Neoplasms
ABSTRACT
Healthcare datasets obtained from Electronic Health Records have proven to be extremely useful for assessing associations between patients' predictors and outcomes of interest. However, these datasets often suffer from missing values in a high proportion of cases, whose removal may introduce severe bias. Several multiple imputation algorithms have been proposed to attempt to recover the missing information under an assumed missingness mechanism. Each algorithm presents strengths and weaknesses, and there is currently no consensus on which multiple imputation algorithm works best in a given scenario. Furthermore, the selection of each algorithm's parameters and data-related modeling choices are also both crucial and challenging. In this paper we propose a novel framework to numerically evaluate strategies for handling missing data in the context of statistical analysis, with a particular focus on multiple imputation techniques. We demonstrate the feasibility of our approach on a large cohort of type-2 diabetes patients provided by the National COVID Cohort Collaborative (N3C) Enclave, where we explored the influence of various patient characteristics on outcomes related to COVID-19. Our analysis included classic multiple imputation techniques as well as simple complete-case Inverse Probability Weighted models. Extensive experiments show that our approach can effectively highlight the most promising and performant missing-data handling strategy for our case study. Moreover, our methodology allowed a better understanding of the behavior of the different models and of how it changed as we modified their parameters. Our method is general and can be applied to different research fields and on datasets containing heterogeneous types.
Subject(s)
COVID-19 , Humans , Algorithms , Research Design , Bias , Probability
ABSTRACT
Importance: Characterizing the effect of vaccination on long COVID allows for better healthcare recommendations. Objective: To determine if, and to what degree, vaccination prior to COVID-19 is associated with eventual long COVID onset among those with a documented COVID-19 infection. Design, Setting, and Participants: Retrospective cohort study of adults with evidence of COVID-19 between August 1, 2021 and January 31, 2022 based on electronic health records from eleven healthcare institutions taking part in the NIH Researching COVID to Enhance Recovery (RECOVER) Initiative, a project of the National COVID Cohort Collaborative (N3C). Exposures: Pre-COVID-19 receipt of a complete vaccine series versus no pre-COVID-19 vaccination. Main Outcomes and Measures: Two approaches to the identification of long COVID were used. In the clinical diagnosis cohort (n=47,752), ICD-10 diagnosis codes or evidence of a healthcare encounter at a long COVID clinic were used. In the model-based cohort (n=199,498), a computable phenotype was used. The association between pre-COVID vaccination and long COVID was estimated using IPTW-adjusted logistic regression and Cox proportional hazards. Results: In both cohorts, when adjusting for demographics and medical history, pre-COVID vaccination was associated with a reduced risk of long COVID (clinic-based cohort: HR, 0.66; 95% CI, 0.55-0.80; OR, 0.69; 95% CI, 0.59-0.82; model-based cohort: HR, 0.62; 95% CI, 0.56-0.69; OR, 0.70; 95% CI, 0.65-0.75). Conclusions and Relevance: Long COVID has become a central concern for public health experts. Prior studies have considered the effect of vaccination on the prevalence of future long COVID symptoms, but ours is the first to thoroughly characterize the association between vaccination and clinically diagnosed or computationally derived long COVID. Our results bolster the growing consensus that vaccines retain protective effects against long COVID even in breakthrough infections.
Key Points: Question: Does vaccination prior to COVID-19 onset change the risk of long COVID diagnosis? Findings: Four observational analyses of EHRs showed a statistically significant reduction in long COVID risk associated with pre-COVID vaccination (first cohort: HR, 0.66; 95% CI, 0.55-0.80; OR, 0.69; 95% CI, 0.59-0.82; second cohort: HR, 0.62; 95% CI, 0.56-0.69; OR, 0.70; 95% CI, 0.65-0.75). Meaning: Vaccination prior to COVID onset has a protective association with long COVID even in the case of breakthrough infections.
ABSTRACT
Background: Acute kidney injury (AKI) is associated with mortality in patients hospitalized with COVID-19; however, its incidence, geographic distribution, and temporal trends since the start of the pandemic are understudied. Methods: Electronic health record data were obtained from 53 health systems in the United States (US) in the National COVID Cohort Collaborative (N3C). We selected hospitalized adults diagnosed with COVID-19 between March 6th, 2020, and January 6th, 2022. AKI was determined with serum creatinine (SCr) and diagnosis codes. Time was divided into 16-week periods (P1-6) and geographical regions into Northeast, Midwest, South, and West. Multivariable models were used to analyze the risk factors for AKI or mortality. Results: Out of a total cohort of 306,061, 126,478 (41.0%) patients had AKI. Among these, 17.9% lacked a diagnosis code but had AKI based on the change in SCr. Similar to patients coded for AKI, these patients had higher mortality compared to those without AKI. The incidence of AKI was highest in P1 (49.3%), reduced in P2 (40.6%), and relatively stable thereafter. Compared to the Midwest, the Northeast, South, and West had higher adjusted AKI incidence in P1; subsequently, the South and West regions continued to have the highest relative incidence. In multivariable models, both AKI, defined by either SCr or diagnostic code, and the severity of AKI were associated with mortality. Conclusions: Uncoded cases of COVID-19-associated AKI are common and associated with mortality. The incidence and distribution of COVID-19-associated AKI have changed since the first wave of the pandemic in the US.
ABSTRACT
Tumor-infiltrating lymphocytes (TILs) have been established as a robust prognostic biomarker in breast cancer, with emerging utility in predicting treatment response in the adjuvant and neoadjuvant settings. In this study, the role of TILs in predicting overall survival and progression-free interval was evaluated in two independent cohorts of breast cancer from the Cancer Genome Atlas (TCGA BRCA) and the Carolina Breast Cancer Study (UNC CBCS). We utilized machine learning and computer vision algorithms to characterize TIL infiltrates in digital whole-slide images (WSIs) of breast cancer stained with hematoxylin and eosin (H&E). Multiple parameters were used to characterize the global abundance and spatial features of TIL infiltrates. Univariate and multivariate analyses show that large aggregates of peritumoral and intratumoral TILs (forests) were associated with longer survival, whereas the absence of intratumoral TILs (deserts) is associated with increased risk of recurrence. Patients with two or more high-risk spatial features were associated with significantly shorter progression-free interval (PFI). This study demonstrates the practical utility of Pathomics in evaluating the clinical significance of the abundance and spatial patterns of distribution of TIL infiltrates as important biomarkers in breast cancer.
ABSTRACT
OBJECTIVE: The goals of this study were to harmonize data from electronic health records (EHRs) into common units, and impute units that were missing. MATERIALS AND METHODS: We used the National COVID Cohort Collaborative (N3C) table of laboratory measurement data, comprising over 3.1 billion patient records and over 19 000 unique measurement concepts in the Observational Medical Outcomes Partnership (OMOP) common-data-model format from 55 data partners. We grouped ontologically similar OMOP concepts together for 52 variables relevant to COVID-19 research, and developed a unit-harmonization pipeline comprising (1) selecting a canonical unit for each measurement variable, (2) arriving at a formula for conversion, (3) obtaining clinical review of each formula, (4) applying the formula to convert data values in each unit into the target canonical unit, and (5) removing any harmonized value that fell outside of accepted value ranges for the variable. For data with missing units for all the results within a lab test for a data partner, we compared values with pooled values of all data partners, using the Kolmogorov-Smirnov test. RESULTS: Of the concepts without missing values, we harmonized 88.1% of the values, and imputed units for 78.2% of records where units were absent (41% of contributors' records lacked units). DISCUSSION: The harmonization and inference methods developed herein can serve as a resource for initiatives aiming to extract insight from heterogeneous EHR collections. Unique properties of centralized data are harnessed to enable unit inference. CONCLUSION: The pipeline we developed for the pooled N3C data enables use of measurements that would otherwise be unavailable for analysis.
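The unit-inference idea can be sketched as follows: convert the unitless values under each candidate unit, compare the result with pooled reference values already expressed in the canonical unit using the two-sample Kolmogorov-Smirnov statistic, and keep the candidate with the smallest statistic. Choosing by minimum statistic is an assumption made for illustration; the abstract states only that the Kolmogorov-Smirnov test was used. The measurement, values, and candidate units below are likewise hypothetical.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

def infer_unit(values, pooled_canonical, factors_to_canonical):
    """Pick the candidate unit whose converted values best match the pooled data."""
    scores = {
        unit: ks_statistic([v * factor for v in values], pooled_canonical)
        for unit, factor in factors_to_canonical.items()
    }
    return min(scores, key=scores.get), scores

# Hypothetical example: body weight with kilograms as the canonical unit.
pooled_kg = [55.0, 62.0, 70.0, 78.0, 85.0, 93.0]   # pooled values from other partners
unitless = [130.0, 150.0, 170.0, 190.0]            # one partner's values, unit missing
factors = {"kg": 1.0, "lb": 0.45359237, "g": 0.001}

unit, scores = infer_unit(unitless, pooled_kg, factors)
print(unit, scores)
```

In practice a threshold on the statistic or its p-value would likely be needed so that no unit is assigned when none of the candidates fits well; that refinement is omitted here.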
Subject(s)
COVID-19 , Electronic Health Records , Cohort Studies , Data Collection , Humans
ABSTRACT
OBJECTIVE: In response to COVID-19, the informatics community united to aggregate as much clinical data as possible to characterize this new disease and reduce its impact through collaborative analytics. The National COVID Cohort Collaborative (N3C) is now the largest publicly available HIPAA limited dataset in US history with over 6.4 million patients and is a testament to a partnership of over 100 organizations. MATERIALS AND METHODS: We developed a pipeline for ingesting, harmonizing, and centralizing data from 56 contributing data partners using 4 federated Common Data Models. N3C data quality (DQ) review involves both automated and manual procedures. In the process, several DQ heuristics were discovered in our centralized context, both within the pipeline and during downstream project-based analysis. Feedback to the sites led to many local and centralized DQ improvements. RESULTS: Beyond well-recognized DQ findings, we discovered 15 heuristics relating to source Common Data Model conformance, demographics, COVID tests, conditions, encounters, measurements, observations, coding completeness, and fitness for use. Of 56 sites, 37 sites (66%) demonstrated issues through these heuristics. These 37 sites demonstrated improvement after receiving feedback. DISCUSSION: We encountered site-to-site differences in DQ which would have been challenging to discover using federated checks alone. We have demonstrated that centralized DQ benchmarking reveals unique opportunities for DQ improvement that will support improved research analytics locally and in aggregate. CONCLUSION: By combining rapid, continual assessment of DQ with a large volume of multisite data, it is possible to support more nuanced scientific questions with the scale and rigor that they require.
Subject(s)
COVID-19 , Cohort Studies , Data Accuracy , Health Insurance Portability and Accountability Act , Humans , United States
ABSTRACT
BACKGROUND: Numerous publications describe the clinical manifestations of post-acute sequelae of SARS-CoV-2 (PASC or "long COVID"), but they are difficult to integrate because of heterogeneous methods and the lack of a standard for denoting the many phenotypic manifestations. Patient-led studies are of particular importance for understanding the natural history of COVID-19, but integration is hampered because they often use different terms to describe the same symptom or condition. This significant disparity in patient versus clinical characterization motivated the proposed ontological approach to specifying manifestations, which will improve capture and integration of future long COVID studies. METHODS: The Human Phenotype Ontology (HPO) is a widely used standard for exchange and analysis of phenotypic abnormalities in human disease but has not yet been applied to the analysis of COVID-19. FINDINGS: We identified 303 articles published before April 29, 2021, curated 59 relevant manuscripts that described clinical manifestations in 81 cohorts three weeks or more following acute COVID-19, and mapped 287 unique clinical findings to HPO terms. We present layperson synonyms and definitions that can be used to link patient self-report questionnaires to standard medical terminology. Long COVID clinical manifestations are not assessed consistently across studies, and most manifestations have been reported with a wide range of synonyms by different authors. Across at least 10 cohorts, authors reported 31 unique clinical features corresponding to HPO terms; the most commonly reported feature was Fatigue (median 45.1%) and the least commonly reported was Nausea (median 3.9%), but the reported percentages varied widely between studies. INTERPRETATION: Translating long COVID manifestations into computable HPO terms will improve analysis, data capture, and classification of long COVID patients.
If researchers, clinicians, and patients share a common language, then studies can be compared/pooled more effectively. Furthermore, mapping lay terminology to HPO will help patients assist clinicians and researchers in creating phenotypic characterizations that are computationally accessible, thereby improving the stratification, diagnosis, and treatment of long COVID. FUNDING: U24TR002306; UL1TR001439; P30AG024832; GBMF4552; R01HG010067; UL1TR002535; K23HL128909; UL1TR002389; K99GM145411.
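The normalization step described above (different lay or author terms collapsed onto one HPO term, then a median prevalence taken across cohorts) can be sketched roughly as follows. This is only an illustration, not the authors' pipeline: the lookup table `LAY_TO_HPO`, the helper names, and the cohort percentages are all hypothetical, and the HPO identifiers shown should be checked against the current HPO release before any real use.

```python
from statistics import median

# Hypothetical synonym table: lay or author wording -> (HPO id, HPO label).
# The identifiers are given for illustration; verify them against the HPO release in use.
LAY_TO_HPO = {
    "tiredness": ("HP:0012378", "Fatigue"),
    "exhaustion": ("HP:0012378", "Fatigue"),
    "fatigue": ("HP:0012378", "Fatigue"),
    "feeling sick": ("HP:0002018", "Nausea"),
    "nausea": ("HP:0002018", "Nausea"),
}


def normalize(term):
    """Return (hpo_id, hpo_label) for a reported term, or None if it is unmapped."""
    return LAY_TO_HPO.get(term.strip().lower())


def median_prevalence(reports):
    """Group (reported_term, percent) pairs by HPO label and take the median percent."""
    by_label = {}
    for term, percent in reports:
        hit = normalize(term)
        if hit is None:
            continue  # unmapped terms are skipped here; a real curation would flag them
        by_label.setdefault(hit[1], []).append(percent)
    return {label: median(values) for label, values in by_label.items()}


# Made-up cohort reports, one (term, percent of patients) pair per cohort.
reports = [
    ("Tiredness", 40.0),
    ("exhaustion", 60.0),
    ("Fatigue", 50.0),
    ("feeling sick", 4.0),
    ("nausea", 2.0),
    ("brain fog", 30.0),  # not in the hypothetical table, so it is ignored
]
result = median_prevalence(reports)
print(result)
```

With a shared vocabulary, the three differently worded fatigue reports land in one group, which is what makes a cross-study median meaningful in the first place.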
Subject(s)
COVID-19/complications, COVID-19/pathology, COVID-19/diagnosis, Humans, SARS-CoV-2, Post-Acute COVID-19 Syndrome
ABSTRACT
Importance: The National COVID Cohort Collaborative (N3C) is a centralized, harmonized, high-granularity electronic health record repository that is the largest, most representative COVID-19 cohort to date. This multicenter data set can support robust evidence-based development of predictive and diagnostic tools and inform clinical care and policy. Objectives: To evaluate COVID-19 severity and risk factors over time and assess the use of machine learning to predict clinical severity. Design, Setting, and Participants: In a retrospective cohort study of 1,926,526 US adults with SARS-CoV-2 infection (polymerase chain reaction >99% or antigen <1%) and adult patients without SARS-CoV-2 infection who served as controls from 34 medical centers nationwide between January 1, 2020, and December 7, 2020, patients were stratified using a World Health Organization COVID-19 severity scale and demographic characteristics. Differences between groups over time were evaluated using multivariable logistic regression. Random forest and XGBoost models were used to predict severe clinical course (death, discharge to hospice, invasive ventilatory support, or extracorporeal membrane oxygenation). Main Outcomes and Measures: Patient demographic characteristics and COVID-19 severity using the World Health Organization COVID-19 severity scale and differences between groups over time using multivariable logistic regression. Results: The cohort included 174,568 adults who tested positive for SARS-CoV-2 (mean [SD] age, 44.4 [18.6] years; 53.2% female) and 1,133,848 adult controls who tested negative for SARS-CoV-2 (mean [SD] age, 49.5 [19.2] years; 57.1% female). Of the 174,568 adults with SARS-CoV-2, 32,472 (18.6%) were hospitalized, and 6565 (20.2%) of those had a severe clinical course (invasive ventilatory support, extracorporeal membrane oxygenation, death, or discharge to hospice). 
Of the hospitalized patients, mortality was 11.6% overall and decreased from 16.4% in March to April 2020 to 8.6% in September to October 2020 (P = .002 for monthly trend). Using 64 inputs available on the first hospital day, this study predicted a severe clinical course using random forest and XGBoost models (area under the receiver operating characteristic curve = 0.87 for both) that were stable over time. The factor most strongly associated with clinical severity was pH; this result was consistent across machine learning methods. In a separate multivariable logistic regression model built for inference, age (odds ratio [OR], 1.03 per year; 95% CI, 1.03-1.04), male sex (OR, 1.60; 95% CI, 1.51-1.69), liver disease (OR, 1.20; 95% CI, 1.08-1.34), dementia (OR, 1.26; 95% CI, 1.13-1.41), African American (OR, 1.12; 95% CI, 1.05-1.20) and Asian (OR, 1.33; 95% CI, 1.12-1.57) race, and obesity (OR, 1.36; 95% CI, 1.27-1.46) were independently associated with higher clinical severity. Conclusions and Relevance: This cohort study found that COVID-19 mortality decreased over time during 2020 and that patient demographic characteristics and comorbidities were associated with higher clinical severity. The machine learning models accurately predicted ultimate clinical severity using commonly collected clinical data from the first 24 hours of a hospital admission.
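As a reading aid, the percentages reported in this abstract can be recomputed from the stated counts, and the per-year odds ratio for age can be rescaled to a longer interval. The sketch below is illustrative only: the variable names are ours, and the per-decade figure assumes that age enters the logistic model linearly on the log-odds scale and uses the rounded published value of 1.03, so it is an approximation and not a number reported by the study.

```python
# Counts as stated in the abstract.
positives = 174_568      # adults who tested positive for SARS-CoV-2
hospitalized = 32_472    # of those, hospitalized
severe = 6_565           # of the hospitalized, severe clinical course

# Share of positives who were hospitalized, and share of hospitalized with a severe course.
hosp_rate = 100 * hospitalized / positives
severe_rate = 100 * severe / hospitalized

# Rescaling a per-year odds ratio: under a linear log-odds term for age,
# the odds ratio for k additional years is the per-year ratio raised to the power k.
or_per_year = 1.03       # rounded value from the abstract
or_per_decade = or_per_year ** 10

print(f"hospitalized: {hosp_rate:.1f}% of positives")
print(f"severe course: {severe_rate:.1f}% of hospitalized")
print(f"approximate odds ratio per 10 years of age: {or_per_decade:.2f}")
```

Note that the 20.2% figure uses hospitalized patients as its denominator, not all patients who tested positive.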