Comparison of the Performance of GPT-3.5 and GPT-4 With That of Medical Students on the Written German Medical Licensing Examination: Observational Study.
Meyer, Annika; Riese, Janik; Streichert, Thomas.
Affiliation
  • Meyer A; Institute for Clinical Chemistry, University Hospital Cologne, Cologne, Germany.
  • Riese J; Department of General Surgery, Visceral, Thoracic and Vascular Surgery, University Hospital Greifswald, Greifswald, Germany.
  • Streichert T; Institute for Clinical Chemistry, University Hospital Cologne, Cologne, Germany.
JMIR Med Educ; 10: e50965, 2024 Feb 08.
Article in English | MEDLINE | ID: mdl-38329802
ABSTRACT

BACKGROUND:

The potential of artificial intelligence (AI)-based large language models, such as ChatGPT, has gained significant attention in the medical field. This enthusiasm is driven not only by recent breakthroughs and improved accessibility but also by the prospect of democratizing medical knowledge and promoting equitable health care. However, ChatGPT's performance is substantially influenced by the input language. Given the growing public trust in this AI tool relative to traditional sources of information, investigating its medical accuracy across different languages is particularly important.

OBJECTIVE:

This study aimed to compare the performance of GPT-3.5 and GPT-4 with that of medical students on the written German medical licensing examination.

METHODS:

To assess GPT-3.5's and GPT-4's medical proficiency, we used 937 original multiple-choice questions from 3 written German medical licensing examinations in October 2021, April 2022, and October 2022.
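The abstract does not specify how the questions were presented to the models, so the following is a minimal sketch of how such multiple-choice scoring could be scripted against the OpenAI Chat Completions API; the question format, prompt wording, and model name are illustrative assumptions, not the authors' protocol.

    # Minimal sketch: scoring a chat model on multiple-choice exam questions.
    # Assumptions (not from the study): question format, prompt wording, model name.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def ask_model(question: dict, model: str = "gpt-4") -> str:
        options = "\n".join(f"{k}) {v}" for k, v in question["options"].items())
        prompt = (f"{question['stem']}\n{options}\n"
                  "Answer with the letter of the single best option.")
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,  # deterministic choice for reproducible scoring
        )
        return response.choices[0].message.content.strip()[0].upper()

    def exam_score(questions: list[dict], model: str) -> float:
        # Fraction of keyed answers matched; the study reports scores as percentages.
        return sum(ask_model(q, model) == q["key"] for q in questions) / len(questions)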

RESULTS:

GPT-4 achieved an average score of 85% and ranked in the 92.8th, 99.5th, and 92.6th percentiles among medical students who took the same examinations in October 2021, April 2022, and October 2022, respectively. This represents a substantial improvement of 27% compared to GPT-3.5, which only passed 1 out of the 3 examinations. While GPT-3.5 performed well in psychiatry questions, GPT-4 exhibited strengths in internal medicine and surgery but showed weakness in academic research.
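For readers relating raw scores to the percentile ranks reported above, a percentile rank is simply the share of student scores falling below the model's score; the sketch below uses invented placeholder scores, not study data.

    # Sketch: percentile rank of a model score within a student score distribution.
    def percentile_rank(model_score: float, student_scores: list[float]) -> float:
        below = sum(s < model_score for s in student_scores)
        return 100 * below / len(student_scores)

    # Hypothetical student scores for illustration only:
    students = [0.62, 0.71, 0.78, 0.80, 0.83, 0.88, 0.90]
    print(f"{percentile_rank(0.85, students):.1f}th percentile")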

CONCLUSIONS:

The study results highlight ChatGPT's remarkable improvement from moderate (GPT-3.5) to high competency (GPT-4) in answering medical licensing examination questions in German. While its predecessor, GPT-3.5, was imprecise and inconsistent, GPT-4 demonstrates considerable potential to improve medical education and patient care, provided that medically trained users critically evaluate its results. As AI tools may plausibly replace search engines in the future, further studies with nonprofessional questions are needed to assess the safety and accuracy of ChatGPT for the general population.

Full text: 1 Database: MEDLINE Main subject: Medical Students / Artificial Intelligence / Medical Education / Educational Assessment Study type: Observational_studies / Prognostic_studies Limit: Humans Language: En Publication year: 2024 Document type: Article