ABSTRACT
This paper presents a comprehensive workflow for integrating revolving events into the transitive sequential pattern mining (tSPM+) algorithm and the Machine Learning for Health Outcomes (MLHO) framework, highlighting best practices and pitfalls in their application. We focus on feature engineering and visualization techniques and demonstrate their efficacy in capturing temporal relationships. Applied to an EGFR lung cancer cohort, our approach yields reliable temporal insights even in a small dataset. This work underscores the importance of temporal nuances in healthcare data analysis, paving the way for improved disease understanding and patient care.
Subjects
Algorithms, Data Mining, Lung Neoplasms, Machine Learning, Lung Neoplasms/therapy, Humans, Data Mining/methods, Workflow
ABSTRACT
Scalable identification of patients with the post-acute sequelae of COVID-19 (PASC) is challenging because reproducible precision phenotyping algorithms are lacking and because the PASC diagnosis code (ICD-10 U09.9) has suboptimal accuracy, carries demographic biases, and underestimates prevalence. In a retrospective case-control study, we developed a precision phenotyping algorithm for identifying research cohorts of patients with PASC, which we define as a diagnosis of exclusion. We used longitudinal electronic health records (EHR) data from over 295 thousand patients from 14 hospitals and 20 community health centers in Massachusetts. The algorithm employs an attention mechanism to exclude sequelae that prior conditions can explain. We performed independent chart reviews to tune and validate the algorithm. Compared to the U09.9 diagnosis code, our PASC phenotyping algorithm improves precision and prevalence estimation and reduces bias in identifying Long COVID patients. It identified a PASC research cohort of over 24 thousand patients (compared to about 6 thousand when using the U09.9 diagnosis code), with a precision of 79.9 percent (compared to 77.8 percent for the U09.9 diagnosis code). Our estimated prevalence of PASC was 22.8 percent, which is close to the national estimates for the region. We also provide an in-depth analysis of clinical attributes, including identified lingering effects by organ, comorbidity profiles, and temporal differences in the risk of PASC. The PASC phenotyping method presented in this study offers higher precision than the U09.9 diagnosis code, estimates the prevalence of PASC without underestimating it, and exhibits less bias in identifying Long COVID patients. The PASC cohort derived from our algorithm will serve as a basis for studying Long COVID's genetic, metabolomic, and clinical intricacies, overcoming the constraints of recent PASC cohort studies, which were limited by their size and available outcome data.