Enhancing Early Detection of Cognitive Decline in the Elderly: A Comparative Study Utilizing Large Language Models in Clinical Notes.

Du, Xinsong; Novoa-Laurentiev, John; Plasek, Joseph M; Chuang, Ya-Wen; Wang, Liqin; Marshall, Gad; Mueller, Stephanie K; Chang, Frank; Datta, Surabhi; Paek, Hunki; Lin, Bin; Wei, Qiang; Wang, Xiaoyan; Wang, Jingqi; Ding, Hao; Manion, Frank J; Du, Jingcheng; Bates, David W; Zhou, Li

Affiliation

Du X; Division of General Internal Medicine and Primary Care, Brigham and Women's Hospital, Boston, Massachusetts 02115.
Novoa-Laurentiev J; Department of Medicine, Harvard Medical School, Boston, Massachusetts 02115.
Plasek JM; Division of General Internal Medicine and Primary Care, Brigham and Women's Hospital, Boston, Massachusetts 02115.
Chuang YW; Division of General Internal Medicine and Primary Care, Brigham and Women's Hospital, Boston, Massachusetts 02115.
Wang L; Department of Medicine, Harvard Medical School, Boston, Massachusetts 02115.
Marshall G; Division of Nephrology, Taichung Veterans General Hospital, Taichung, Taiwan, 407219.
Mueller SK; Division of General Internal Medicine and Primary Care, Brigham and Women's Hospital, Boston, Massachusetts 02115.
Chang F; Department of Medicine, Harvard Medical School, Boston, Massachusetts 02115.
Datta S; Department of Medicine, Harvard Medical School, Boston, Massachusetts 02115.
Paek H; Department of Neurology, Brigham and Women's Hospital, Boston, Massachusetts 02115.
Lin B; Division of General Internal Medicine and Primary Care, Brigham and Women's Hospital, Boston, Massachusetts 02115.
Wei Q; Department of Medicine, Harvard Medical School, Boston, Massachusetts 02115.
Wang X; Division of General Internal Medicine and Primary Care, Brigham and Women's Hospital, Boston, Massachusetts 02115.
Wang J; Intelligent Medical Objects, Rosemont, Illinois, 60018.
Ding H; Intelligent Medical Objects, Rosemont, Illinois, 60018.
Manion FJ; Intelligent Medical Objects, Rosemont, Illinois, 60018.
Du J; Intelligent Medical Objects, Rosemont, Illinois, 60018.
Bates DW; Intelligent Medical Objects, Rosemont, Illinois, 60018.
Zhou L; Intelligent Medical Objects, Rosemont, Illinois, 60018.

medRxiv; 2024 May 06.

Article in En | MEDLINE | ID: mdl-38633810

ABSTRACT

Background:

Large language models (LLMs) have shown promising performance in various healthcare domains, but their effectiveness in identifying specific clinical conditions in real medical records is less explored. This study evaluates LLMs for detecting signs of cognitive decline in real electronic health record (EHR) clinical notes, comparing their error profiles with traditional models. The insights gained will inform strategies for performance enhancement.

Methods:

This study, conducted at Mass General Brigham in Boston, MA, analyzed clinical notes from the four years prior to a 2019 diagnosis of mild cognitive impairment in patients aged 50 and older. For model development, we used a random annotated sample of 4,949 note sections, filtered with keywords related to cognitive functions. For testing, we used a random annotated sample of 1,996 note sections without keyword filtering. We developed prompts for two LLMs, Llama 2 and GPT-4, on HIPAA-compliant cloud-computing platforms, using multiple approaches (e.g., hard prompting, soft prompting, and error analysis-based instructions) to select the optimal LLM-based method. Baseline models included a hierarchical attention-based neural network and XGBoost. Subsequently, we constructed an ensemble of the three models (the selected LLM-based method and the two baseline models) using a majority vote approach.
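
As an illustration of the majority vote approach described above, the following minimal Python sketch combines binary predictions from three models. The function names and the example labels are hypothetical and invented for illustration; they are not taken from the study's code or data.

```python
from collections import Counter


def majority_vote(votes):
    """Return the label that most models predicted for one note section."""
    label, _count = Counter(votes).most_common(1)[0]
    return label


def ensemble_predict(per_model_predictions):
    """Combine per-model prediction lists into one ensemble prediction list.

    per_model_predictions holds one list per model, all of the same length.
    """
    return [majority_vote(votes) for votes in zip(*per_model_predictions)]


# Hypothetical labels (1 = section indicates cognitive decline, 0 = it does not).
llm_preds = [1, 0, 1, 1, 0]
attention_nn_preds = [1, 1, 0, 1, 0]
xgb_preds = [0, 0, 1, 1, 0]

ensemble_preds = ensemble_predict([llm_preds, attention_nn_preds, xgb_preds])
print(ensemble_preds)
```

With three models and binary labels a tie cannot occur, so no tie-breaking rule is needed in this sketch.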

Results:

GPT-4 demonstrated superior accuracy and efficiency compared to Llama 2, but did not outperform traditional models. The ensemble model outperformed the individual models, achieving a precision of 90.3%, a recall of 94.2%, and an F1-score of 92.2%. Notably, the ensemble model showed a significant improvement in precision: the individual models ranged from 70% to 79%, whereas the ensemble exceeded 90%. Error analysis revealed that 63 samples were incorrectly predicted by at least one model; however, only 2 cases (3.2%) were misclassified by all models, indicating diverse error profiles among them.
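
The reported figures can be cross-checked with a short Python sketch. It assumes the standard definition of the F1-score as the harmonic mean of precision and recall; the function name is illustrative, and only numbers stated in the abstract are used.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall (standard F1 definition)."""
    return 2 * precision * recall / (precision + recall)


# Ensemble precision and recall as reported in the abstract.
ensemble_f1 = f1_from_precision_recall(0.903, 0.942)
print(round(ensemble_f1 * 100, 1))  # 92.2

# Share of the 63 misclassified samples that every model got wrong.
shared_error_share = 2 / 63
print(round(shared_error_share * 100, 1))  # 3.2
```

Both values agree with the 92.2% F1-score and the 3.2% shared-error figure given above.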

Conclusions:

LLMs and traditional machine learning models trained using local EHR data exhibited diverse error profiles. The models were complementary, and combining them in an ensemble enhanced diagnostic performance. Future research should investigate integrating LLMs with smaller, localized models and incorporating medical data and domain knowledge to enhance performance on specific tasks.

Key words

Alzheimer Disease; Cognitive Dysfunction; Dementia; Early Diagnosis; Electronic Health Records; Natural Language Processing; Neurobehavioral Manifestations

Full text: 1 Collection: 01-internacional Database: MEDLINE Language: En Journal: MedRxiv Year: 2024 Document type: Article
