Results 1 - 2 of 2
1.
medRxiv; 2024 Feb 13.
Article in English | MEDLINE | ID: mdl-38405784

ABSTRACT

Importance: Large language models (LLMs) are becoming crucial for medical tasks, and ensuring their reliability is vital to avoid false results. Our study assesses two state-of-the-art LLMs (ChatGPT and LlaMA-2) for extracting clinical information, focusing on cognitive tests such as the MMSE and CDR.

Objective: To evaluate the performance of ChatGPT and LlaMA-2 in extracting MMSE and CDR scores, including their associated dates.

Methods: Our data consisted of 135,307 clinical notes (Jan 12, 2010 to May 24, 2023) mentioning MMSE, CDR, or MoCA. After applying inclusion criteria, 34,465 notes remained, of which 765 were processed by ChatGPT (GPT-4) and LlaMA-2, and 22 experts reviewed the responses. ChatGPT successfully extracted MMSE and CDR instances with dates from 742 notes. We used 20 notes for fine-tuning and for training the reviewers. The remaining 722 notes were assigned to reviewers, 309 of which were assigned to two reviewers simultaneously. Inter-rater agreement (Fleiss' Kappa), precision, recall, true/false-negative rates, and accuracy were calculated. Our study follows the TRIPOD reporting guidelines for model validation.

Results: For MMSE information extraction, ChatGPT (vs. LlaMA-2) achieved an accuracy of 83% (vs. 66.4%), a sensitivity of 89.7% (vs. 69.9%), a true-negative rate of 96% (vs. 60.0%), and a precision of 82.7% (vs. 62.2%). For CDR information extraction, precision was markedly lower, with an accuracy of 87.1% (vs. 74.5%), a sensitivity of 84.3% (vs. 39.7%), a true-negative rate of 99.8% (vs. 98.4%), and a precision of 48.3% (vs. 16.1%). We qualitatively evaluated the MMSE errors of ChatGPT and LlaMA-2 on the double-reviewed notes. LlaMA-2's errors included 27 cases of total hallucination, 19 cases of reporting another score instead of the MMSE, 25 missed scores, and 23 cases of reporting only a wrong date. In comparison, ChatGPT's errors included only 3 cases of total hallucination, 17 cases of reporting the wrong test instead of the MMSE, and 19 cases of reporting a wrong date.

Conclusions: In this diagnostic/prognostic study of ChatGPT and LlaMA-2 for extracting cognitive exam dates and scores from clinical notes, ChatGPT exhibited high accuracy and outperformed LlaMA-2. LLMs could benefit dementia research and clinical care by identifying patients eligible for treatment initiation or clinical trial enrollment. Rigorous evaluation of LLMs is crucial to understanding their capabilities and limitations.
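As a rough illustration of the evaluation described in this abstract, the Python sketch below computes precision, sensitivity (recall), true-negative rate, and accuracy from paired gold-standard and model labels, plus Fleiss' Kappa (via statsmodels) for double-reviewed notes. All labels here are invented placeholders, not data from the study, and the binary correct/incorrect encoding of the expert adjudications is an assumption.

    # Minimal sketch of the reported evaluation metrics, assuming binary
    # per-note adjudications (1 = extraction correct, 0 = incorrect).
    # The data below is illustrative, not taken from the study.
    from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

    def confusion_metrics(y_true, y_pred):
        """Precision, sensitivity, true-negative rate, and accuracy
        from paired gold-standard and model labels."""
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
        tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
        return {
            "precision": tp / (tp + fp),
            "sensitivity": tp / (tp + fn),
            "true_negative_rate": tn / (tn + fp),
            "accuracy": (tp + tn) / (tp + tn + fp + fn),
        }

    # Hypothetical gold-standard vs. LLM labels, one per note.
    gold = [1, 1, 0, 1, 0, 1, 0, 0, 1, 1]
    llm = [1, 1, 0, 0, 0, 1, 1, 0, 1, 1]
    print(confusion_metrics(gold, llm))

    # Inter-rater agreement on double-reviewed notes: one row per note,
    # one column per reviewer, values are the categorical judgments.
    ratings = [[1, 1], [1, 0], [0, 0], [1, 1], [0, 0]]
    table, _ = aggregate_raters(ratings)
    print(fleiss_kappa(table))

The metric definitions mirror the quantities reported in the Results: precision = TP/(TP+FP), sensitivity = TP/(TP+FN), true-negative rate = TN/(TN+FP), and accuracy = (TP+TN)/total.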

2.
JMIR Form Res; 7: e41223, 2023 Apr 12.
Article in English | MEDLINE | ID: mdl-36821760

ABSTRACT

BACKGROUND: The introduction of electronic workflows has allowed raw, uncontextualized clinical data to flow into medical documentation. As a result, many electronic notes have become replete with "noise" and depleted of clinically significant "signal." There is an urgent need to develop and implement innovative approaches to electronic clinical documentation that improve note quality and reduce unnecessary bloat.

OBJECTIVE: This study aims to describe the development and impact of a novel set of templates designed to change the flow of information in medical documentation.

METHODS: This is a multihospital, nonrandomized, prospective improvement study conducted on the inpatient general internal medicine service across 3 hospital campuses at the New York University Langone Health System. A group of physician leaders representing each campus met biweekly for 6 months. The output of these meetings included (1) a conceptualization of the note bloat problem as a dysfunction in information flow, (2) a set of guiding principles for organizational documentation improvement, (3) the design and build of novel electronic templates that reduced the flow of extraneous information into provider notes by providing link-outs to best-practice data visualizations, and (4) a documentation improvement curriculum for inpatient medicine providers. Prior to go-live, pragmatic usability testing was performed with the new progress note template, and the overall user experience was measured using the System Usability Scale (SUS). Primary outcome measures after go-live included the template utilization rate and note length in characters.

RESULTS: In usability testing among 22 medicine providers, the new progress note template averaged a usability score of 90.6 out of 100 on the SUS. A total of 77% (17/22) of providers strongly agreed that the new template was easy to use, and 64% (14/22) strongly agreed that they would like to use the template frequently. In the 3 months after template implementation, general internal medicine providers wrote 67% (51,431/76,647) of all inpatient notes with the new templates. During this period, the organization saw a 46% (2768/6191), 47% (3505/7819), and 32% (3427/11,226) reduction in note length for general medicine progress notes, consults, and history and physical notes, respectively, compared with a baseline measurement period prior to the interventions.

CONCLUSIONS: A bundled intervention that included the deployment of novel templates for inpatient general medicine providers significantly reduced average note length on the clinical service. Templates designed to reduce the flow of extraneous information into provider notes performed well during usability testing, and they were rapidly adopted across all hospital campuses. Further research is needed to assess the impact of the novel templates on note quality, provider efficiency, and patient outcomes.
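For context on the SUS result above, the sketch below applies the standard SUS scoring rule: each odd-numbered item contributes (response - 1), each even-numbered item contributes (5 - response), and the 0-40 raw sum is multiplied by 2.5 to yield a 0-100 score. The survey responses shown are invented placeholders, not the study's data.

    # Minimal sketch of standard System Usability Scale (SUS) scoring.
    def sus_score(responses):
        """Score one completed SUS questionnaire: ten 1-5 Likert ratings.
        Odd-numbered items are positively worded, even-numbered negatively."""
        if len(responses) != 10 or not all(1 <= r <= 5 for r in responses):
            raise ValueError("SUS needs ten responses on a 1-5 scale")
        total = sum(
            (r - 1) if i % 2 == 0 else (5 - r)  # 0-based index: even = odd item
            for i, r in enumerate(responses)
        )
        return total * 2.5  # scale the 0-40 raw sum to 0-100

    # Hypothetical questionnaires from three providers.
    surveys = [
        [5, 1, 5, 2, 4, 1, 5, 1, 5, 2],
        [4, 2, 4, 1, 5, 2, 4, 1, 4, 1],
        [5, 1, 4, 2, 5, 1, 5, 2, 5, 1],
    ]
    scores = [sus_score(s) for s in surveys]
    print(sum(scores) / len(scores))  # mean SUS across respondents

A study-level figure like the reported 90.6 would be the mean of such per-respondent scores across all 22 testers.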
