Reasoning with large language models for medical question answering.
Lucas, Mary M; Yang, Justin; Pomeroy, Jon K; Yang, Christopher C.
Affiliation
  • Lucas MM; College of Computing and Informatics, Drexel University, Philadelphia, PA 19104, United States.
  • Yang J; Department of Computer Science, University of Maryland, College Park, MD 20742, United States.
  • Pomeroy JK; College of Computing and Informatics, Drexel University, Philadelphia, PA 19104, United States.
  • Yang CC; Penn Medicine, Philadelphia, PA 19104, United States.
J Am Med Inform Assoc ; 31(9): 1964-1975, 2024 Sep 01.
Article in En | MEDLINE | ID: mdl-38960731
ABSTRACT

OBJECTIVES:

To investigate approaches to reasoning with large language models (LLMs) and to propose a new prompting approach, ensemble reasoning, to improve medical question answering performance with refined reasoning and reduced inconsistency.

MATERIALS AND METHODS:

We used multiple-choice questions from the USMLE Sample Exam question files to evaluate our proposed ensemble reasoning approach on 2 closed-source commercial LLMs and 1 open-source clinical LLM.

RESULTS:

On GPT-3.5 turbo and Med42-70B, our proposed ensemble reasoning approach outperformed zero-shot chain-of-thought with self-consistency on Steps 1, 2, and 3 questions (+3.44%, +4.00%, and +2.54%, and +2.3%, +5.00%, and +4.15%, respectively). With GPT-4 turbo, results were mixed, with ensemble reasoning again outperforming zero-shot chain-of-thought with self-consistency on Step 1 questions (+1.15%). In all cases, our approach improved the consistency of responses. A qualitative analysis of the model's reasoning showed that the ensemble reasoning approach produces correct and helpful reasoning.

CONCLUSION:

The proposed iterative ensemble reasoning approach has the potential to improve the performance of LLMs in medical question answering tasks, particularly for less powerful models such as GPT-3.5 turbo and Med42-70B. Additionally, the findings show that our approach helps refine the reasoning generated by the LLM and thereby improves consistency, even with the more powerful GPT-4 turbo. We also identify the potential and need for human-artificial intelligence teaming to improve reasoning beyond the limits of the model.
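The zero-shot chain-of-thought with self-consistency baseline referenced above can be sketched as follows. This is not the authors' implementation: `sample_cot_answer` is a hypothetical stand-in for a real LLM call (e.g., to GPT-3.5 turbo) that returns one sampled completion's final letter choice; here it returns a random choice so the example is self-contained. The core of self-consistency, a majority vote over independently sampled answers, is shown as described in the self-consistency literature.

```python
from collections import Counter
import random

def sample_cot_answer(question: str, choices: list[str],
                      rng: random.Random) -> str:
    """Placeholder for one sampled chain-of-thought completion's final
    answer. In practice this would prompt an LLM with 'Let's think step
    by step' at a nonzero temperature and parse out the chosen option."""
    return rng.choice(choices)

def self_consistency_answer(question: str, choices: list[str],
                            n_samples: int = 5, seed: int = 0) -> str:
    """Sample n_samples answers independently and return the majority
    vote, which is the self-consistency decoding strategy."""
    rng = random.Random(seed)
    votes = Counter(sample_cot_answer(question, choices, rng)
                    for _ in range(n_samples))
    answer, _count = votes.most_common(1)[0]
    return answer

# Hypothetical USMLE-style multiple-choice item for illustration only.
question = "Which vitamin deficiency causes scurvy?"
choices = ["A", "B", "C", "D"]
print(self_consistency_answer(question, choices))
```

Voting across samples reduces the response inconsistency that the article reports; the proposed ensemble reasoning approach builds on this by iteratively refining the reasoning itself rather than only aggregating final answers.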

Full text: 1 Database: MEDLINE Main subject: Natural Language Processing Limits: Humans Language: En Journal: J Am Med Inform Assoc Journal subject: INFORMATICA MEDICA Year: 2024 Type: Article Affiliation country: United States
