Quality of Answers of Generative Large Language Models vs Peer Patients for Interpreting Lab Test Results for Lay Patients: Evaluation Study.
ArXiv; 2024 Jan 23. Article in En | MEDLINE | ID: mdl-38529075
ABSTRACT
Background:
Even though patients have easy access to their electronic health records and lab test results through patient portals, lab results are often confusing and hard to understand. Many patients turn to online forums or question-and-answer (Q&A) sites to seek advice from their peers. However, the quality of answers to health-related questions on social Q&A sites varies significantly, and not all responses are accurate or reliable. Large language models (LLMs) such as ChatGPT have opened a promising avenue for patients to get their questions answered.
Objective:
We aim to assess the feasibility of using LLMs to generate relevant, accurate, helpful, and safe responses to lab test-related questions asked by patients, and to identify potential issues that can be mitigated with augmentation approaches.
Methods:
We first collected lab test result-related questions and answers from Yahoo! Answers and selected 53 Q&A pairs for this study. Using the LangChain framework and the ChatGPT web portal, we generated responses to the 53 questions from four LLMs: GPT-4, Meta LLaMA 2, MedAlpaca, and ORCA_mini. We assessed the similarity of their answers using standard QA similarity-based evaluation metrics, including ROUGE, BLEU, METEOR, and BERTScore. We also used an LLM-based evaluator to judge whether a target model's response was of higher quality than the baseline model's in terms of relevance, correctness, helpfulness, and safety. Finally, medical experts manually evaluated all responses to seven selected questions on the same four aspects.
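The similarity-scoring step can be reproduced with off-the-shelf metric implementations. Below is a minimal sketch, assuming Hugging Face's `evaluate` package; the paper does not specify its exact tooling, and the example answer strings are hypothetical.

```python
# A minimal sketch of the QA similarity evaluation, assuming Hugging Face's
# `evaluate` package (an assumption; the paper does not name its tooling).
# GPT-4's answer serves as the reference, as in the study.
import evaluate

rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")
meteor = evaluate.load("meteor")
bertscore = evaluate.load("bertscore")

def similarity_scores(candidates: list[str], references: list[str]) -> dict:
    """Score candidate answers against reference answers (GPT-4 output)."""
    return {
        "rouge": rouge.compute(predictions=candidates, references=references),
        # BLEU accepts multiple references per prediction, hence the nesting.
        "bleu": bleu.compute(
            predictions=candidates, references=[[r] for r in references]
        ),
        "meteor": meteor.compute(predictions=candidates, references=references),
        "bertscore": bertscore.compute(
            predictions=candidates, references=references, lang="en"
        ),
    }

# Hypothetical usage: compare one LLaMA 2 answer with the GPT-4 reference.
print(similarity_scores(
    candidates=["A TSH of 5.2 mIU/L is slightly above the usual range..."],
    references=["Your TSH is mildly elevated; discuss it with your doctor..."],
))
```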
Results:
Regarding the similarity of the responses from the four LLMs, with GPT-4's output used as the reference answer, the responses from LLaMA 2 were the most similar, followed by ORCA_mini and MedAlpaca. Human answers from the Yahoo data scored lowest and were thus the least similar to the GPT-4-generated answers. Both the win rate and the medical expert evaluation showed that GPT-4's responses scored better than all other LLM responses and the human responses on all four aspects (relevance, correctness, helpfulness, and safety). However, LLM responses occasionally suffered from a lack of interpretation within the patient's medical context, incorrect statements, and a lack of references.
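The win rate above comes from pairwise LLM-based judging of a target model against a baseline. Below is a minimal sketch of such a comparison, assuming the OpenAI Python client; the study's actual judge prompt and judge model are not given here, so the wording is illustrative only.

```python
# A minimal sketch of a pairwise LLM-judge "win rate", assuming the OpenAI
# Python client. The judge prompt below is a hypothetical reconstruction,
# not the study's actual prompt.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are judging two answers to a patient's lab-test question.
Question: {question}

Answer A: {answer_a}
Answer B: {answer_b}

Considering relevance, correctness, helpfulness, and safety, decide which
answer is better overall. Reply with exactly one word: "A", "B", or "tie"."""

def judge_pair(question: str, answer_a: str, answer_b: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, answer_a=answer_a, answer_b=answer_b)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

def win_rate(questions, target_answers, baseline_answers) -> float:
    """Fraction of questions on which the target (Answer A) beats the baseline."""
    wins = sum(
        judge_pair(q, t, b) == "A"
        for q, t, b in zip(questions, target_answers, baseline_answers)
    )
    return wins / len(questions)
```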
Conclusions:
By evaluating LLMs' responses to patients' lab test result-related questions, we found that, compared with the other three LLMs and the human answers from the Q&A website, GPT-4's responses were more accurate, helpful, relevant, and safe. However, in some cases GPT-4's responses were inaccurate and not individualized. We identified a number of ways to improve the quality of LLM responses, including prompt engineering, prompt augmentation, retrieval-augmented generation, and response evaluation.
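Of the improvement strategies named, retrieval-augmented generation is the most mechanical to sketch. The example below assumes LangChain's OpenAI integrations and a FAISS index; the indexed reference snippets are hypothetical placeholders, not the study's actual knowledge source.

```python
# A minimal retrieval-augmented generation sketch, assuming LangChain's
# OpenAI integrations and a FAISS vector store. The reference snippets are
# hypothetical placeholders for a vetted lab-test knowledge source.
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

reference_texts = [
    "TSH normal reference range for adults: 0.4 to 4.0 mIU/L.",
    "A1C below 5.7% is normal; 5.7%-6.4% indicates prediabetes.",
]

vectorstore = FAISS.from_texts(reference_texts, OpenAIEmbeddings())
llm = ChatOpenAI(model="gpt-4", temperature=0)

def answer_with_context(question: str) -> str:
    # Retrieve the most relevant snippets and ground the prompt in them,
    # addressing the abstract's "lack of references" issue.
    docs = vectorstore.as_retriever(search_kwargs={"k": 2}).invoke(question)
    context = "\n".join(d.page_content for d in docs)
    prompt = (
        "Using only the reference material below, answer the patient's "
        f"lab-test question and cite the material used.\n\n{context}\n\n"
        f"Question: {question}"
    )
    return llm.invoke(prompt).content

print(answer_with_context("My TSH came back at 5.2. Should I be worried?"))
```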
Full text:
1
Collection:
01-internacional
Database:
MEDLINE
Health context:
1_ASSA2030
Health problem:
1_sistemas_informacao_saude
Language:
En
Journal:
ArXiv
Year:
2024
Document type:
Article
Country of affiliation:
United States