Assessing ChatGPT's Mastery of Bloom's Taxonomy Using Psychosomatic Medicine Exam Questions: Mixed-Methods Study.

Herrmann-Werner, Anne; Festl-Wietek, Teresa; Holderried, Friederike; Herschbach, Lea; Griewatz, Jan; Masters, Ken; Zipfel, Stephan; Mahling, Moritz

Affiliation

Herrmann-Werner A; Tübingen Institute for Medical Education, Faculty of Medicine, University of Tübingen, Tübingen, Germany.
Festl-Wietek T; Department of Psychosomatic Medicine and Psychotherapy, University Hospital Tübingen, Tübingen, Germany.
Holderried F; Tübingen Institute for Medical Education, Faculty of Medicine, University of Tübingen, Tübingen, Germany.
Herschbach L; Tübingen Institute for Medical Education, Faculty of Medicine, University of Tübingen, Tübingen, Germany.
Griewatz J; University Department of Anesthesiology and Intensive Care Medicine, University Hospital Tübingen, Tübingen, Germany.
Masters K; Tübingen Institute for Medical Education, Faculty of Medicine, University of Tübingen, Tübingen, Germany.
Zipfel S; Tübingen Institute for Medical Education, Faculty of Medicine, University of Tübingen, Tübingen, Germany.
Mahling M; Medical Education and Informatics Department, College of Medicine and Health Sciences, Sultan Qaboos University, Muscat, Oman.

J Med Internet Res; 26: e52113, 2024 Jan 23.

Article in English | MEDLINE | ID: mdl-38261378

ABSTRACT

BACKGROUND:

Large language models such as GPT-4 (Generative Pre-trained Transformer 4) are being increasingly used in medicine and medical education. However, these models are prone to "hallucinations" (ie, outputs that seem convincing while being factually incorrect). It is currently unknown how these errors by large language models relate to the different cognitive levels defined in Bloom's taxonomy.

OBJECTIVE:

This study aims to explore how GPT-4 performs in terms of Bloom's taxonomy using psychosomatic medicine exam questions.

METHODS:

We used a large data set of psychosomatic medicine multiple-choice questions (N=307) with real-world results derived from medical school exams. GPT-4 answered the multiple-choice questions using 2 distinct prompt versions: detailed and short. The answers were analyzed using both a quantitative and a qualitative approach. Focusing on incorrectly answered questions, we categorized reasoning errors according to the hierarchical framework of Bloom's taxonomy.

RESULTS:

GPT-4's performance in answering exam questions yielded a high success rate: 93% (284/307) for the detailed prompt and 91% (278/307) for the short prompt. Questions answered correctly by GPT-4 had a statistically significantly higher difficulty than questions answered incorrectly (P=.002 for the detailed prompt and P<.001 for the short prompt). Independent of the prompt, GPT-4's lowest exam performance was 78.9% (15/19), thereby always surpassing the "pass" threshold. Our qualitative analysis of incorrect answers, based on Bloom's taxonomy, showed that errors were primarily in the "remember" (29/68) and "understand" (23/68) cognitive levels; specific issues arose in recalling details, understanding conceptual relationships, and adhering to standardized guidelines.
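
As a quick arithmetic check on the counts reported in the results, the following minimal Python sketch recomputes the percentages from the raw fractions. The helper name `percent` and the dictionary layout are illustrative assumptions and are not part of the study; only the counts themselves come from the abstract.

```python
# Counts copied from the abstract: (correct answers, total questions).
reported_scores = {
    "detailed prompt": (284, 307),
    "short prompt": (278, 307),
    "lowest single exam": (15, 19),
}

# Error counts per Bloom level as given in the abstract; the abstract
# reports 68 categorized errors in total and does not break down the rest.
reported_errors = {"remember": 29, "understand": 23}
TOTAL_ERRORS = 68


def percent(numerator, denominator, digits=1):
    """Return numerator/denominator as a percentage rounded to `digits` places."""
    return round(100 * numerator / denominator, digits)


for label, (correct, total) in reported_scores.items():
    print(f"{label}: {correct}/{total} = {percent(correct, total)}%")

for level, count in reported_errors.items():
    print(f"{level}: {count}/{TOTAL_ERRORS} = {percent(count, TOTAL_ERRORS)}%")

# Errors not attributed to "remember" or "understand" in the abstract.
other_errors = TOTAL_ERRORS - sum(reported_errors.values())
print(f"other levels: {other_errors}/{TOTAL_ERRORS}")
```

Rounded to whole numbers, the first two values match the 93% and 91% quoted in the results, and the "remember" and "understand" levels together account for 52 of the 68 categorized errors.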

CONCLUSIONS:

GPT-4 demonstrated a remarkable success rate when confronted with psychosomatic medicine multiple-choice exam questions, aligning with previous findings. When evaluated through Bloom's taxonomy, our data revealed that GPT-4 occasionally ignored specific facts (remember), provided illogical reasoning (understand), or failed to apply concepts to a new situation (apply). These errors, which were confidently presented, could be attributed to inherent model biases and the tendency to generate outputs that maximize likelihood.

Subject(s)

Education, Medical; Medicine; Psychosomatic Medicine; Humans; Research Design

Keywords

Bloom's taxonomy; ChatGPT; GPT-4; Generative Pre-trained Transformer 4; LLM; MCQ; NLP; answer; artificial intelligence; assessment; classification; error; exam; examination; generative; language model; learning outcome; medical education; medical exam; multiple-choice question; natural language processing; psychosomatic; question; response; taxonomy

Full text: 1 | Database: MEDLINE | Main subject: Psychosomatic Medicine / Education, Medical / Medicine | Type of study: Prognostic_studies / Qualitative_research | Language: En | Journal: J Med Internet Res | Journal subject: MEDICAL INFORMATICS | Year: 2024 | Document type: Article
