Results 1 - 2 of 2
1.
medRxiv; 2024 May 13.
Article in English | MEDLINE | ID: mdl-38798420

ABSTRACT

Background: Initial insights into oncology clinical trial outcomes are often gleaned manually from conference abstracts. We aimed to develop an automated system to extract safety and efficacy information from study abstracts with high precision and fine granularity, transforming them into computable data for timely clinical decision-making.

Methods: We collected clinical trial abstracts from key conferences and PubMed (2012-2023). The SEETrials system was developed with four modules: preprocessing, prompt modeling, knowledge ingestion, and postprocessing. We evaluated the system's performance qualitatively and quantitatively and assessed its generalizability across different cancer types: multiple myeloma (MM), breast cancer, lung cancer, lymphoma, and leukemia. Furthermore, the efficacy and safety of innovative therapies, including CAR-T, bispecific antibodies, and antibody-drug conjugates (ADC), in MM were analyzed across a large set of clinical trial studies.

Results: SEETrials achieved high precision (0.958), recall (sensitivity) (0.944), and F1 score (0.951) across 70 data elements present in the MM trial studies. Generalizability tests on four additional cancers yielded precision, recall, and F1 scores within the 0.966-0.986 range. Variation in the distribution of safety- and efficacy-related entities was observed across diverse therapies, with certain adverse events more common in specific treatments. Comparative performance analysis using overall response rate (ORR) and complete response (CR) highlighted differences among therapies: CAR-T (ORR: 88%, 95% CI: 84-92%; CR: 95%, 95% CI: 53-66%), bispecific antibodies (ORR: 64%, 95% CI: 55-73%; CR: 27%, 95% CI: 16-37%), and ADC (ORR: 51%, 95% CI: 37-65%; CR: 26%, 95% CI: 1-51%). Notable study heterogeneity was identified (I² heterogeneity index scores > 75%) across several outcome entities analyzed within therapy subgroups.
Conclusion: SEETrials demonstrated highly accurate data extraction and versatility across different therapeutics and various cancer domains. Its automated processing of large datasets facilitates nuanced data comparisons, promoting the swift and effective dissemination of clinical insights.
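The I² heterogeneity index reported above can be computed from per-study effect estimates with Cochran's Q. The sketch below uses made-up response rates and sample sizes purely for illustration; it is not the authors' pipeline.

```python
# Sketch of the I^2 heterogeneity index used to flag between-study
# variability (the abstract reports I^2 > 75% for several outcomes).
# All rates and sample sizes here are hypothetical.

def i_squared(effects, variances):
    """Cochran's Q and the I^2 statistic for a set of study effects."""
    weights = [1.0 / v for v in variances]                      # inverse-variance weights
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0         # % of variance beyond chance
    return q, i2

# Hypothetical ORR estimates (proportions) with binomial variances p(1-p)/n
rates = [0.88, 0.64, 0.51, 0.90]
ns = [120, 80, 60, 150]
variances = [p * (1 - p) / n for p, n in zip(rates, ns)]
q, i2 = i_squared(rates, variances)
print(q, i2)
```

With effects this dispersed, I² lands well above the 75% threshold the study uses to flag notable heterogeneity.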

2.
medRxiv; 2024 May 06.
Article in English | MEDLINE | ID: mdl-38633810

ABSTRACT

Background: Large language models (LLMs) have shown promising performance in various healthcare domains, but their effectiveness in identifying specific clinical conditions in real medical records is less explored. This study evaluates LLMs for detecting signs of cognitive decline in real electronic health record (EHR) clinical notes, comparing their error profiles with traditional models. The insights gained will inform strategies for performance enhancement.

Methods: This study, conducted at Mass General Brigham in Boston, MA, analyzed clinical notes from the four years prior to a 2019 diagnosis of mild cognitive impairment in patients aged 50 and older. We used a random annotated sample of 4,949 note sections, filtered with keywords related to cognitive functions, for model development. For testing, a random annotated sample of 1,996 note sections without keyword filtering was utilized. We developed prompts for two LLMs, Llama 2 and GPT-4, on HIPAA-compliant cloud-computing platforms using multiple approaches (e.g., both hard and soft prompting and error analysis-based instructions) to select the optimal LLM-based method. Baseline models included a hierarchical attention-based neural network and XGBoost. Subsequently, we constructed an ensemble of the three models using a majority vote approach.

Results: GPT-4 demonstrated superior accuracy and efficiency compared to Llama 2, but did not outperform traditional models. The ensemble model outperformed the individual models, achieving a precision of 90.3%, a recall of 94.2%, and an F1-score of 92.2%. Notably, the ensemble model showed a significant improvement in precision, increasing from a range of 70%-79% to above 90%, compared to the best-performing single model. Error analysis revealed that 63 samples were incorrectly predicted by at least one model; however, only 2 cases (3.2%) were mutual errors across all models, indicating diverse error profiles among them.
Conclusions: LLMs and traditional machine learning models trained using local EHR data exhibited diverse error profiles. The ensemble of these models was found to be complementary, enhancing diagnostic performance. Future research should investigate integrating LLMs with smaller, localized models and incorporating medical data and domain knowledge to enhance performance on specific tasks.
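The three-model majority vote described above can be sketched in a few lines. The model names and per-sample labels below are hypothetical stand-ins, not the study's actual predictions.

```python
from collections import Counter

# Minimal sketch of a three-model majority-vote ensemble, as described
# in the abstract. All predictions here are hypothetical.

def majority_vote(predictions):
    """Return the label chosen by most models for each sample."""
    return [Counter(sample).most_common(1)[0][0] for sample in zip(*predictions)]

# Per-sample binary labels from three hypothetical models
llm_preds = [1, 0, 1, 1, 0]        # e.g. a GPT-4-style prompt classifier
hier_attn_preds = [1, 0, 0, 1, 0]  # e.g. a hierarchical attention network
xgb_preds = [1, 1, 1, 1, 0]        # e.g. an XGBoost baseline
ensemble = majority_vote([llm_preds, hier_attn_preds, xgb_preds])
print(ensemble)  # → [1, 0, 1, 1, 0]
```

With an odd number of binary voters, every sample gets a strict majority, so no tie-breaking rule is needed; this complementarity of diverse error profiles is what drives the precision gain the study reports.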
