ABSTRACT
Large language models (LLMs) hold great promise in summarizing medical evidence. Most recent studies focus on the application of proprietary LLMs. Using proprietary LLMs introduces multiple risk factors, including a lack of transparency and vendor dependency. While open-source LLMs allow better transparency and customization, their performance falls short compared to the proprietary ones. In this study, we investigated to what extent fine-tuning open-source LLMs can further improve their performance. Utilizing a benchmark dataset, MedReview, consisting of 8161 pairs of systematic reviews and summaries, we fine-tuned three widely used open-source LLMs, namely PRIMERA, LongT5, and Llama-2. Overall, the performance of all open-source models improved after fine-tuning. The performance of fine-tuned LongT5 is close to that of GPT-3.5 in zero-shot settings. Furthermore, smaller fine-tuned models sometimes even demonstrated superior performance compared to larger zero-shot models. These trends of improvement were observed in both a human evaluation and a larger-scale GPT4-simulated evaluation.
ABSTRACT
Background: To achieve scientific goals, researchers often require integration of data from a primary electronic health record (EHR) system and one or more ancillary EHR systems used during the same patient care encounter. Although studies have demonstrated approaches for linking patient identity records across different EHR systems, little is known about linking patient encounter records across primary and ancillary EHR systems. Objectives: We compared a patients-first approach versus an encounters-first approach for linking patient encounter records across multiple EHR systems. Methods: We conducted a retrospective observational study of 348,904 patients with 533,283 encounters from 2010 to 2020 across our institution's primary EHR system and an ancillary EHR system used in perioperative settings. For the patients-first approach and the encounters-first approach, we measured the number of patient and encounter links created as well as runtime. Results: While the patients-first approach linked 43% of patients and 49% of encounters, the encounters-first approach linked 98% of patients and 100% of encounters. The encounters-first approach was 20 times faster than the patients-first approach for linking patients and 33% slower for linking encounters. Conclusion: Findings suggest that common patient and encounter identifiers shared among EHR systems via automated interfaces may be clinically useful but not "research-ready" and thus require an encounters-first linkage approach to enable secondary use for scientific purposes. Based on our search, this study is among the first to demonstrate approaches for linking patient encounters across multiple EHR systems. Enterprise data warehouse for research efforts elsewhere may benefit from an encounters-first approach.
ABSTRACT
OBJECTIVES: Healthcare organizations, including Clinical and Translational Science Awards (CTSA) hubs funded by the National Institutes of Health, seek to enable secondary use of electronic health record (EHR) data through an enterprise data warehouse for research (EDW4R), but optimal approaches are unknown. In this qualitative study, our goal was to understand EDW4R impact, sustainability, demand management, and accessibility. MATERIALS AND METHODS: We engaged a convenience sample of informatics leaders from CTSA hubs (n = 21) for semi-structured interviews and completed a directed content analysis of interview transcripts. RESULTS: EDW4R have created institutional capacity for single- and multi-center studies, democratized access to EHR data for investigators from multiple disciplines, and enabled the learning health system. Bibliometrics have been challenging due to investigator non-compliance, but one hub's requirement to link all study protocols with funding records enabled quantifying an EDW4R's multi-million dollar impact. Sustainability of EDW4R has relied on multiple funding sources with a general shift away from the CTSA grant toward institutional and industry support. To address EDW4R demand, institutions have expanded staff, used different governance approaches, and provided investigator self-service tools. EDW4R accessibility can benefit from improved tools incorporating user-centered design, increased data literacy among scientists, expansion of informaticians in the workforce, and growth of team science. DISCUSSION: As investigator demand for EDW4R has increased, approaches to tracking impact, ensuring sustainability, and improving accessibility of EDW4R resources have varied. CONCLUSION: This study adds to understanding of how informatics leaders seek to support investigators using EDW4R across the CTSA consortium and potentially elsewhere.
Subject(s)
Electronic Health Records , Translational Biomedical Research , United States , Data Warehousing , Humans , Interviews as Topic , National Institutes of Health (U.S.) , Qualitative Research
ABSTRACT
BACKGROUND: A commercial federated network called TriNetX has connected electronic health record (EHR) data from academic medical centers (AMCs) with biopharmaceutical sponsors in a privacy-preserving manner to promote sponsor-initiated clinical trials. Little is known about how AMCs have implemented TriNetX to support clinical trials. FINDINGS: At our AMC over a six-year period, TriNetX integrated into existing institutional workflows enabled 402 requests for sponsor-initiated clinical trials, 14% (n = 56) of which local investigators expressed interest in conducting. Although clinical trials administrators indicated TriNetX yielded unique study opportunities, measuring the impact of institutional participation in the network was challenging due to the lack of a common trial identifier shared across TriNetX, sponsor, and our institution. CONCLUSION: To the best of our knowledge, this study is among the first to describe integration of a federated network of EHR data into institutional workflows for sponsor-initiated clinical trials. This case report may inform efforts at other institutions.
Subject(s)
Academic Medical Centers , Electronic Health Records , Humans
ABSTRACT
OBJECTIVE: Generation of automated clinical notes has been posited as a strategy to mitigate physician burnout. In particular, an automated narrative summary of a patient's hospital stay could supplement the hospital course section of the discharge summary that inpatient physicians document in electronic health record (EHR) systems. In the current study, we developed and evaluated an automated method for summarizing the hospital course section using encoder-decoder sequence-to-sequence transformer models. MATERIALS AND METHODS: We fine-tuned BERT and BART models and optimized for factuality through constraining beam search, which we trained and tested using EHR data from patients admitted to the neurology unit of an academic medical center. RESULTS: The approach demonstrated good ROUGE scores with an R-2 of 13.76. In a blind evaluation, 2 board-certified physicians rated 62% of the automated summaries as meeting the standard of care, which suggests the method may be useful clinically. DISCUSSION AND CONCLUSION: To our knowledge, this study is among the first to demonstrate an automated method for generating a discharge summary hospital course that approaches a quality level of what a physician would write.
Subject(s)
Electronic Health Records , Patient Discharge , Humans , Software , Inpatients , Hospitals
ABSTRACT
Enterprise data warehouses for research (EDW4R) are a critical component of National Institutes of Health Clinical and Translational Science Award (CTSA) hubs. EDW4R operations have unique needs that require specialized skills and collaborations across multiple domains, which limits the ability to apply existing models of information technology (IT) performance. Because of this uniqueness, we developed a new EDW4R maturity model based on a prior qualitative study of operational practices for supporting EDW4Rs at CTSA hubs. In a pilot study, respondents from fifteen CTSA hubs completed the novel EDW4R maturity index survey by rating 33 maturity statements across 6 categories using a 5-point Likert scale. Of the six categories, respondents rated workforce as most mature (4.17 [3.67-4.42]) and relationship with enterprise IT as the least mature (3.00 [2.80-3.80]). Our pilot of a novel maturity index shows a baseline quantitative measure of EDW4R functions across fifteen CTSA hubs. The maturity index may be useful to faculty and staff currently leading an EDW4R by creating opportunities to explore the index in local context and in comparison to other institutions.
ABSTRACT
OBJECTIVES: Health care systems are primarily collecting patient-reported outcomes (PROs) for research and clinical care using proprietary, institution- and disease-specific tools for remote assessment. The purpose of this study was to conduct a Reach, Effectiveness, Adoption, Implementation, and Maintenance (RE-AIM) evaluation of a scalable electronic PRO (ePRO) reporting and visualization system in a single-arm study. METHODS: The "mi.symptoms" ePRO system was designed using gerontechnological design principles to ensure high usability among older adults. The system enables longitudinal reporting of disease-agnostic ePROs and includes patient-facing PRO visualizations. We conducted an evaluation of the implementation of the system guided by the RE-AIM framework. Quantitative data were analyzed using basic descriptive statistics, and qualitative data were analyzed using directed content analysis. RESULTS: Reach-the total reach of the study was 70 participants (median age: 69, 31% female, 17% Black or African American, 27% reported not having enough financial resources). Effectiveness-half (51%) of participants completed the 2-week follow-up survey and 36% completed all follow-up surveys. Adoption-the desire for increased self-knowledge, the value of tracking symptoms, and altruism motivated participants to adopt the tool. Implementation-the predisposing factor was access to, and comfort with, computers. Three enabling factors were incorporation into routines, multimodal nudges, and ease of use. Maintenance-reinforcing factors were perceived usefulness of viewing symptom reports with the tool and understanding the value of sustained symptom tracking in general. CONCLUSION: Challenges in ePRO reporting, particularly sustained patient engagement, remain. Nonetheless, freely available, scalable, disease-agnostic systems may pave the road toward inclusion of a more diverse range of health systems and patients in ePRO collection and use.
Subject(s)
Patient Reported Outcome Measures , Software , Humans , Female , Aged , Male , Delivery of Health Care , Surveys and Questionnaires , Electronics
ABSTRACT
Obtaining reliable data on patient mortality is a critical challenge facing observational researchers seeking to conduct studies using real-world data. As these analyses are conducted more broadly using newly-available sources of real-world evidence, missing data can serve as a rate-limiting factor. We conducted a comparison of mortality data sources from different stakeholder perspectives - academic medical center (AMC) informatics service providers, AMC research coordinators, industry analytics professionals, and academics - to understand the strengths and limitations of differing mortality data sources: locally generated data from sites conducting research, data provided by governmental sources, and commercially available data sets. Researchers seeking to conduct observational studies using extant data should consider these factors in sourcing outcomes data for their populations of interest.
Subject(s)
Academic Medical Centers , Information Sources , Humans
ABSTRACT
Optimal solutions for abstractive summarization of electronic health record content have yet to be discovered. Although studies have applied state-of-the-art transformers in the clinical domain to radiology reports and information extraction, little is known of transformers' performance with the hospital course section of the discharge summary. This paper compares two summarization approaches for automating the hospital course section within the discharge summary: (1) a truncation approach that uses all clinical notes and (2) a day-to-day approach that segments the notes per clinical day. We pair both approaches with different transformer encoder-decoder-based models (BART, BERT2GPT2, ClinicalBERT2GPT2, and ClinicalBERT2ClinicalBERT) and evaluate the transformers that work best for each approach using ROUGE metrics. The results demonstrate that the day-to-day approach can overcome the limitations of long-form document summarization for the patient clinical record.
ABSTRACT
OBJECTIVE: Both academic medical centers and biomedical research sponsors need to understand impact of scientific funding to determine value. For the National Institutes of Health (NIH) Clinical and Translational Science Award (CTSA) hubs, tracking research activities can be complex, often involving multiple institutions and continually changing federal reporting requirements. Existing research administrative systems are institution-specific and tend to focus only on parts of a greater whole. The goal of this case report is to describe a comprehensive data model that addresses this gap. MATERIALS AND METHODS: Web-based Center Administrative Management Program (WebCAMP) has been developed over a period of over 15 years in the context of CTSA hubs, with the recent addition of T32 programs. Its data model centers around the key concepts of people, projects, resources (inputs), and outcomes (outputs). RESULTS: The WebCAMP data model and associated toolset for biomedical research administration integrates multiple components of the research enterprise, has been used by our CTSA hub for over 15 years and has been adopted by more than 20 other CTSA hubs. DISCUSSION: To the best of our knowledge, this study is among the first to describe a comprehensive data model for biomedical research administration. Opportunities for future work include improved grant tracking through the development of a universal identifier that spans public and private funders, and a more generic outcomes tracking model able to rapidly incorporate new outcome types. CONCLUSION: We propose that the WebCAMP data model, or a derivative of it, could serve as a future standard for research administrative data warehousing.
Subject(s)
Awards and Prizes , Biomedical Research , Academic Medical Centers , Humans , National Institutes of Health (U.S.) , Translational Biomedical Research , United States
ABSTRACT
OBJECTIVE: Among National Institutes of Health Clinical and Translational Science Award (CTSA) hubs, effective approaches for enterprise data warehouses for research (EDW4R) development, maintenance, and sustainability remain unclear. The goal of this qualitative study was to understand CTSA EDW4R operations within the broader contexts of academic medical centers and technology. MATERIALS AND METHODS: We performed a directed content analysis of transcripts generated from semistructured interviews with informatics leaders from 20 CTSA hubs. RESULTS: Respondents referred to services provided by health system, university, and medical school information technology (IT) organizations as "enterprise information technology (IT)." Seventy-five percent of respondents stated that the team providing EDW4R service at their hub was separate from enterprise IT; strong relationships between EDW4R teams and enterprise IT were critical for success. Managing challenges of EDW4R staffing was made easier by executive leadership support. Data governance appeared to be a work in progress, as most hubs reported complex and incomplete processes, especially for commercial data sharing. Although nearly all hubs (n = 16) described use of cloud computing for specific projects, only 2 hubs reported using a cloud-based EDW4R. Respondents described EDW4R cloud migration facilitators, barriers, and opportunities. DISCUSSION: Descriptions of approaches to how EDW4R teams at CTSA hubs work with enterprise IT organizations, manage workforces, make decisions about data, and approach cloud computing provide insights for institutions seeking to leverage patient data for research. CONCLUSION: Identification of EDW4R best practices is challenging, and this study helps identify a breadth of viable options for CTSA hubs to consider when implementing EDW4R services.
Subject(s)
Data Warehousing , Translational Biomedical Research , Cloud Computing , Humans , Information Technology , Workforce
ABSTRACT
OBJECTIVES: Computable social risk factor phenotypes derived from routinely collected structured electronic health record (EHR) or health information exchange (HIE) data may represent a feasible and robust approach to measuring social factors. This study convened an expert panel to identify and assess the quality of individual EHR and HIE structured data elements that could be used as components in future computable social risk factor phenotypes. STUDY DESIGN: Technical expert panel. METHODS: A 2-round Delphi technique included 17 experts with an in-depth knowledge of available EHR and/or HIE data. The first-round identification sessions followed a nominal group approach to generate candidate data elements that may relate to socioeconomics, cultural context, social relationships, and community context. In the second-round survey, panelists rated each data element according to overall data quality and likelihood of systematic differences in quality across populations (ie, bias). RESULTS: Panelists identified a total of 89 structured data elements. About half of the data elements (n = 45) were related to socioeconomic characteristics. The panelists identified a diverse set of data elements. Elements used in reimbursement-related processes were generally rated as higher quality. Panelists noted that several data elements may be subject to implicit bias or reflect biased systems of care, which may limit their utility in measuring social factors. CONCLUSIONS: Routinely collected structured data within EHR and HIE systems may reflect patient social risk factors. Identifying and assessing available data elements serves as a foundational step toward developing future computable social factor phenotypes.
Subject(s)
Health Information Exchange , Delphi Technique , Electronic Health Records , Humans , Risk Factors
ABSTRACT
OBJECTIVE: Obtaining electronic patient data, especially from electronic health record (EHR) systems, for clinical and translational research is difficult. Multiple research informatics systems exist but navigating the numerous applications can be challenging for scientists. This article describes Architecture for Research Computing in Health (ARCH), our institution's approach for matching investigators with tools and services for obtaining electronic patient data. MATERIALS AND METHODS: Supporting the spectrum of studies from populations to individuals, ARCH delivers a breadth of scientific functions-including but not limited to cohort discovery, electronic data capture, and multi-institutional data sharing-that manifest in specific systems-such as i2b2, REDCap, and PCORnet. Through a consultative process, ARCH staff align investigators with tools with respect to study design, data sources, and cost. Although most ARCH services are available free of charge, advanced engagements require fee for service. RESULTS: Since 2016 at Weill Cornell Medicine, ARCH has supported over 1200 unique investigators through more than 4177 consultations. Notably, ARCH infrastructure enabled critical coronavirus disease 2019 response activities for research and patient care. DISCUSSION: ARCH has provided a technical, regulatory, financial, and educational framework to support the biomedical research enterprise with electronic patient data. Collaboration among informaticians, biostatisticians, and clinicians has been critical to rapid generation and analysis of EHR data. CONCLUSION: A suite of tools and services, ARCH helps match investigators with informatics systems to reduce time to science. ARCH has facilitated research at Weill Cornell Medicine and may provide a model for informatics and research leaders to support scientists elsewhere.
Subject(s)
Biomedical Research , COVID-19 , Electronic Health Records , Electronics , Humans , Information Storage and Retrieval , Research Personnel
ABSTRACT
Background: In the global effort to prevent death by suicide, many academic medical institutions are implementing natural language processing (NLP) approaches to detect suicidality from unstructured clinical text in electronic health records (EHRs), with the hope of targeting timely, preventative interventions to individuals most at risk of suicide. Despite the international need, the development of these NLP approaches in EHRs has been largely local and not shared across healthcare systems. Methods: In this study, we developed a process to share NLP approaches that were individually developed at King's College London (KCL), UK and Weill Cornell Medicine (WCM), US - two academic medical centers based in different countries with vastly different healthcare systems. We tested and compared the algorithms' performance on manually annotated clinical notes (KCL: n = 4,911 and WCM: n = 837). Results: After a successful technical porting of the NLP approaches, our quantitative evaluation determined that independently developed NLP approaches can detect suicidality at another healthcare organization with a different EHR system, clinical documentation processes, and culture, yet do not achieve the same level of success as at the institution where the NLP algorithm was developed (KCL approach: F1-score 0.85 vs. 0.68, WCM approach: F1-score 0.87 vs. 0.72). Limitations: Independent NLP algorithm development and patient cohort selection at the two institutions compromised direct comparability. Conclusions: Shared use of these NLP approaches is a critical step forward towards improving data-driven algorithms for early suicide risk identification and timely prevention.
ABSTRACT
OBJECTIVE: In response to COVID-19, the informatics community united to aggregate as much clinical data as possible to characterize this new disease and reduce its impact through collaborative analytics. The National COVID Cohort Collaborative (N3C) is now the largest publicly available HIPAA limited dataset in US history with over 6.4 million patients and is a testament to a partnership of over 100 organizations. MATERIALS AND METHODS: We developed a pipeline for ingesting, harmonizing, and centralizing data from 56 contributing data partners using 4 federated Common Data Models. N3C data quality (DQ) review involves both automated and manual procedures. In the process, several DQ heuristics were discovered in our centralized context, both within the pipeline and during downstream project-based analysis. Feedback to the sites led to many local and centralized DQ improvements. RESULTS: Beyond well-recognized DQ findings, we discovered 15 heuristics relating to source Common Data Model conformance, demographics, COVID tests, conditions, encounters, measurements, observations, coding completeness, and fitness for use. Of 56 sites, 37 sites (66%) demonstrated issues through these heuristics. These 37 sites demonstrated improvement after receiving feedback. DISCUSSION: We encountered site-to-site differences in DQ which would have been challenging to discover using federated checks alone. We have demonstrated that centralized DQ benchmarking reveals unique opportunities for DQ improvement that will support improved research analytics locally and in aggregate. CONCLUSION: By combining rapid, continual assessment of DQ with a large volume of multisite data, it is possible to support more nuanced scientific questions with the scale and rigor that they require.
Subject(s)
COVID-19 , Cohort Studies , Data Accuracy , Health Insurance Portability and Accountability Act , Humans , United States
ABSTRACT
INTRODUCTION: Data extraction from electronic health record (EHR) systems occurs through manual abstraction, automated extraction, or a combination of both. While each method has its strengths and weaknesses, both are necessary for retrospective observational research as well as sudden clinical events, like the COVID-19 pandemic. Assessing the strengths, weaknesses, and potentials of these methods is important to continue to understand optimal approaches to extracting clinical data. We set out to assess automated and manual techniques for collecting medication use data in patients with COVID-19 to inform future observational studies that extract data from the electronic health record (EHR). MATERIALS AND METHODS: For 4,123 COVID-positive patients hospitalized and/or seen in the emergency department at an academic medical center between 03/03/2020 and 05/15/2020, we compared medication use data of 25 medications or drug classes collected through manual abstraction and automated extraction from the EHR. Quantitatively, we assessed concordance using Cohen's kappa to measure interrater reliability, and qualitatively, we audited observed discrepancies to determine causes of inconsistencies. RESULTS: For the 16 inpatient medications, 11 (69%) demonstrated moderate or better agreement; 7 of those demonstrated strong or almost perfect agreement. For 9 outpatient medications, 3 (33%) demonstrated moderate agreement, but none achieved strong or almost perfect agreement. We audited 12% of all discrepancies (716/5,790) and, in those audited, observed three principal categories of error: human error in manual abstraction (26%), errors in the extract-transform-load (ETL) or mapping of the automated extraction (41%), and abstraction-query mismatch (33%). CONCLUSION: Our findings suggest many inpatient medications can be collected reliably through automated extraction, especially when abstraction instructions are designed with data architecture in mind. 
We discuss quality issues, concerns, and improvements for institutions to consider when crafting an approach. During crises, institutions must decide how to allocate limited resources. We show that automated extraction of medications is feasible and make recommendations on how to improve future iterations.
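The concordance measure named in this abstract, Cohen's kappa, can be illustrated with a minimal Python sketch. The two label lists are invented toy data (1 = medication recorded, 0 = not recorded), not values from the study, and the function name `cohens_kappa` is hypothetical. Kappa compares observed agreement with the agreement expected by chance from each source's marginal label frequencies.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: share of items on which both sources agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of marginal frequencies, summed over labels.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[lab] * counts_b[lab] for lab in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both sources used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy data (assumed for illustration only): one entry per patient.
manual = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]      # manual abstraction
automated = [1, 1, 0, 0, 0, 0, 0, 1, 1, 0]   # automated extraction
kappa = cohens_kappa(manual, automated)
print(round(kappa, 3))
```

In this toy case the sources agree on 8 of 10 patients (0.80 observed) while chance agreement is 0.52, so kappa is about 0.58, lower than the raw agreement rate suggests.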
Subject(s)
COVID-19 , Pharmaceutical Preparations , Data Collection , Electronic Health Records , Humans , Pandemics , Reproducibility of Results , Retrospective Studies , SARS-CoV-2
ABSTRACT
PURPOSE: Typically stored as unstructured notes, surgical pathology reports contain data elements valuable to cancer research that require labor-intensive manual extraction. Although studies have described natural language processing (NLP) of surgical pathology reports to automate information extraction, efforts have focused on specific cancer subtypes rather than across multiple oncologic domains. To address this gap, we developed and evaluated an NLP method to extract tumor staging and diagnosis information across multiple cancer subtypes. METHODS: The NLP pipeline was implemented on an open-source framework called Leo. We used a total of 555,681 surgical pathology reports of 329,076 patients to develop the pipeline and evaluated our approach on subsets of reports from patients with breast, prostate, colorectal, and randomly selected cancer subtypes. RESULTS: Averaged across all four cancer subtypes, the NLP pipeline achieved an accuracy of 1.00 for International Classification of Diseases, Tenth Revision codes, 0.89 for T staging, 0.90 for N staging, and 0.97 for M staging. It achieved an F1 score of 1.00 for International Classification of Diseases, Tenth Revision codes, 0.88 for T staging, 0.90 for N staging, and 0.24 for M staging. CONCLUSION: The NLP pipeline was developed to extract tumor staging and diagnosis information across multiple cancer subtypes to support the research enterprise in our institution. Although it was not possible to demonstrate generalizability of our NLP pipeline to other institutions, other institutions may find value in adopting a similar NLP approach-and reusing code available at GitHub-to support the oncology research enterprise with elements extracted from surgical pathology reports.
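The results above report high accuracy alongside a much lower F1 score for M staging. The abstract does not explain the gap; one common way such a gap can arise is class imbalance, sketched below in Python with invented confusion-matrix counts that are not taken from the study. The function name `accuracy_and_f1` is hypothetical.

```python
def accuracy_and_f1(tp, fp, fn, tn):
    """Return (accuracy, F1) for one class from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return accuracy, 0.0
    return accuracy, 2 * precision * recall / (precision + recall)


# Invented counts: 100 reports, only 3 of which truly carry the rare label.
accuracy, f1 = accuracy_and_f1(tp=1, fp=1, fn=2, tn=96)
print(f"accuracy={accuracy:.2f}, f1={f1:.2f}")
```

Because true negatives dominate the total, accuracy stays at 0.97 even though only one of the three rare cases is found, while F1, which ignores true negatives, drops to 0.40.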
Subject(s)
Surgical Pathology , Humans , Information Storage and Retrieval , Male , Natural Language Processing , Neoplasm Staging , Research Report
ABSTRACT
OBJECTIVE: Social determinants of health (SDoH) are nonclinical dispositions that impact patient health risks and clinical outcomes. Leveraging SDoH in clinical decision-making can potentially improve diagnosis, treatment planning, and patient outcomes. Despite increased interest in capturing SDoH in electronic health records (EHRs), such information is typically locked in unstructured clinical notes. Natural language processing (NLP) is the key technology to extract SDoH information from clinical text and expand its utility in patient care and research. This article presents a systematic review of the state-of-the-art NLP approaches and tools that focus on identifying and extracting SDoH data from unstructured clinical text in EHRs. MATERIALS AND METHODS: A broad literature search was conducted in February 2021 using 3 scholarly databases (ACL Anthology, PubMed, and Scopus) following Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. A total of 6402 publications were initially identified, and after applying the study inclusion criteria, 82 publications were selected for the final review. RESULTS: Smoking status (n = 27), substance use (n = 21), homelessness (n = 20), and alcohol use (n = 15) are the most frequently studied SDoH categories. Homelessness (n = 7) and other less-studied SDoH (eg, education, financial problems, social isolation and support, family problems) are mostly identified using rule-based approaches. In contrast, machine learning approaches are popular for identifying smoking status (n = 13), substance use (n = 9), and alcohol use (n = 9). CONCLUSION: NLP offers significant potential to extract SDoH data from narrative clinical notes, which in turn can aid in the development of screening tools, risk prediction models, and clinical decision support systems.
Subject(s)
Electronic Health Records , Natural Language Processing , Data Management , Humans , Machine Learning , Social Determinants of Health
ABSTRACT
COVID-19-associated respiratory failure offers the unprecedented opportunity to evaluate the differential host response to a uniform pathogenic insult. Understanding whether there are distinct subphenotypes of severe COVID-19 may offer insight into its pathophysiology. Sequential Organ Failure Assessment (SOFA) score is an objective and comprehensive measure of dysfunction severity across six organ systems, i.e., cardiovascular, central nervous system, coagulation, liver, renal, and respiration. Our aim was to identify and characterize distinct subphenotypes of COVID-19 critical illness defined by the post-intubation trajectory of SOFA score. Intubated COVID-19 patients at two hospitals in New York City were leveraged as development and validation cohorts. Patients were grouped into mild, intermediate, and severe strata by their baseline post-intubation SOFA. Hierarchical agglomerative clustering was performed within each stratum to detect subphenotypes based on similarities amongst SOFA score trajectories evaluated by Dynamic Time Warping. Distinct worsening and recovering subphenotypes were identified within each stratum, which had distinct 7-day post-intubation SOFA progression trends. Patients in the worsening subphenotypes had a higher mortality than those in the recovering subphenotypes within each stratum (mild stratum, 29.7% vs. 10.3%, p = 0.033; intermediate stratum, 29.3% vs. 8.0%, p = 0.002; severe stratum, 53.7% vs. 22.2%, p < 0.001). Pathophysiologic biomarkers associated with progression were distinct at each stratum, including findings suggestive of inflammation in low baseline severity of illness versus hemophagocytic lymphohistiocytosis in higher baseline severity of illness. The findings suggest that there are clear worsening and recovering subphenotypes of COVID-19 respiratory failure after intubation, which are more predictive of outcomes than baseline severity of illness.
Distinct progression biomarkers at differential baseline severity of illness suggest a heterogeneous pathobiology in the progression of COVID-19 respiratory failure.
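The clustering step described in this abstract (hierarchical agglomerative clustering over Dynamic Time Warping distances between SOFA trajectories) can be sketched in plain Python as follows. This is a toy illustration, not the authors' implementation: the four 7-day trajectories are invented, the absolute-difference step cost and average linkage are assumptions (the abstract does not specify either), and the function names are hypothetical.

```python
def dtw_distance(a, b):
    """Dynamic Time Warping distance with absolute-difference step cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible warping moves.
            cost[i][j] = step + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]


def agglomerative_clusters(series, n_clusters):
    """Merge the closest clusters (average linkage) until n_clusters remain."""
    k = len(series)
    dist = {
        (i, j): dtw_distance(series[i], series[j])
        for i in range(k)
        for j in range(i + 1, k)
    }

    def average_linkage(c1, c2):
        pairs = [dist[(min(i, j), max(i, j))] for i in c1 for j in c2]
        return sum(pairs) / len(pairs)

    clusters = [[i] for i in range(k)]
    while len(clusters) > n_clusters:
        _, x, y = min(
            (average_linkage(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]


# Invented daily SOFA scores for four hypothetical patients.
trajectories = [
    [6, 7, 8, 9, 10, 11, 12],  # worsening
    [6, 6, 8, 9, 11, 11, 13],  # worsening
    [6, 5, 4, 4, 3, 2, 2],     # recovering
    [7, 6, 5, 4, 3, 3, 2],     # recovering
]
clusters = agglomerative_clusters(trajectories, n_clusters=2)
print(clusters)
```

On this toy input the two rising trajectories and the two falling trajectories end up in separate clusters, even though all four start from similar baseline scores.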