Mixed methods assessment of the influence of demographics on medical advice of ChatGPT.

Andreadis, Katerina; Newman, Devon R; Twan, Chelsea; Shunk, Amelia; Mann, Devin M; Stevens, Elizabeth R

Affiliation

Andreadis K; Department of Population Health, NYU Grossman School of Medicine, New York, NY 10016, United States.
Newman DR; Department of Population Health, NYU Grossman School of Medicine, New York, NY 10016, United States.
Twan C; Brown University, Providence, RI 02912, United States.
Shunk A; Department of Population Health, NYU Grossman School of Medicine, New York, NY 10016, United States.
Mann DM; Department of Population Health, NYU Grossman School of Medicine, New York, NY 10016, United States.
Stevens ER; Department of Population Health, NYU Grossman School of Medicine, New York, NY 10016, United States.

J Am Med Inform Assoc; 31(9): 2002-2009, 2024 Sep 01.

Article in English | MEDLINE | ID: mdl-38679900

ABSTRACT

OBJECTIVES:

To evaluate demographic biases in diagnostic accuracy and health advice between generative artificial intelligence (AI) (ChatGPT GPT-4) and traditional symptom checkers like WebMD.

MATERIALS AND METHODS:

Combination symptom and demographic vignettes were developed for the 27 most common symptom complaints. Standardized prompts, written from a patient perspective, with varying demographic permutations of age, sex, and race/ethnicity were entered into ChatGPT (GPT-4) between July and August 2023. In total, 3 runs of 540 ChatGPT prompts were compared to the corresponding WebMD Symptom Checker output using a mixed-methods approach. In addition to diagnostic correctness, the associated text generated by ChatGPT was analyzed for readability (using Flesch-Kincaid Grade Level) and qualitative aspects like disclaimers and demographic tailoring.
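The readability measure named above, the Flesch-Kincaid Grade Level, is computed as 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The following is a minimal sketch of that formula, not the authors' tooling: the function names are hypothetical, and the syllable counter is a rough vowel-group heuristic assumed here for illustration (dedicated readability software typically uses more careful syllable rules, so scores may differ slightly).

```python
import re


def count_syllables(word: str) -> int:
    """Approximate syllables as vowel groups (assumed heuristic, not exact)."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing 'e' as silent, except in endings such as "-le" or "-ee".
    if word.endswith("e") and not word.endswith(("le", "ee")) and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text: str) -> float:
    """Return the Flesch-Kincaid Grade Level of a block of English text."""
    # Sentences are split on runs of terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Words are runs of letters, keeping apostrophes (e.g. "ChatGPT's").
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (
        0.39 * len(words) / len(sentences)
        + 11.8 * syllables / len(words)
        - 15.59
    )
```

Under this sketch, calling `flesch_kincaid_grade(response_text)` on each generated response would give one score per prompt, which could then be compared across demographic groups; a score of roughly 13 or higher corresponds to college-level text.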

RESULTS:

ChatGPT matched WebMD in 91% of diagnoses, with a 24% top diagnosis match rate. Diagnostic accuracy was not significantly different across demographic groups, including age, race/ethnicity, and sex. ChatGPT's urgent care recommendations and demographic tailoring were presented significantly more to 75-year-olds versus 25-year-olds (P < .01) but were not statistically different among race/ethnicity and sex groups. The GPT text was suitable for college students, with no significant demographic variability.

DISCUSSION:

The use of non-health-tailored generative AI, like ChatGPT, for simple symptom-checking functions provides comparable diagnostic accuracy to commercially available symptom checkers and does not demonstrate significant demographic bias in this setting. The text accompanying differential diagnoses, however, suggests demographic tailoring that could potentially introduce bias.

CONCLUSION:

These results highlight the need for continued rigorous evaluation of AI-driven medical platforms, focusing on demographic biases to ensure equitable care.

Subject(s)

Artificial Intelligence; Humans; Female; Male; Demography; Sociodemographic Factors; Adult

Keywords

ChatGPT; artificial intelligence; bias; digital health; large language model; symptom checker

Full text: 1; Collection: 01-international; Database: MEDLINE; Main subject: Artificial Intelligence; Limits: Adult / Female / Humans / Male; Language: English; Journal: J Am Med Inform Assoc; Journal subject: MEDICAL INFORMATICS; Year: 2024; Document type: Article; Country of affiliation: United States; Country of publication: United Kingdom