Results 1 - 3 of 3
1.
Acad Med; 99(2): 192-197, 2024 Feb 01.
Article in English | MEDLINE | ID: mdl-37934828

ABSTRACT

PURPOSE: In late 2022 and early 2023, reports that ChatGPT could pass the United States Medical Licensing Examination (USMLE) generated considerable excitement, and media coverage suggested that ChatGPT has credible medical knowledge. This report analyzes the extent to which an artificial intelligence (AI) agent's performance on these sample items can generalize to performance on an actual USMLE examination, using ChatGPT as an illustration.
METHOD: As in earlier investigations, analyses were based on publicly available USMLE sample items. Each item was submitted to ChatGPT (version 3.5) 3 times to evaluate stability. Responses were scored following rules that match operational practice, and a preliminary analysis explored the characteristics of items that ChatGPT answered correctly. The study was conducted between February and March 2023.
RESULTS: For the full sample of items, ChatGPT scored above 60% correct except for one replication for Step 3. Response success varied across replications for 76 items (20%). There was a modest correspondence with item difficulty: ChatGPT was more likely to respond correctly to items that examinees found easier. ChatGPT performed significantly worse (P < .001) on items relating to practice-based learning.
CONCLUSIONS: Achieving 60% accuracy is only an approximate indicator of meeting the passing standard; a direct comparison would require statistical adjustments. Hence, this assessment can only suggest consistency with the passing standards for Steps 1 and 2 Clinical Knowledge, with further limitations in extrapolating this inference to Step 3. These limitations stem from differences in item difficulty and from the exclusion of the simulation component of Step 3 from the evaluation, limitations that would apply to any AI system evaluated on the Step 3 sample items. It is crucial to note that responses from large language models vary notably across repeated inquiries, underscoring the need for expert validation to ensure their utility as a learning tool.


Subjects
Artificial Intelligence, Knowledge, Humans, Computer Simulation, Language, Learning
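
For readers who want to reproduce this kind of item-level summary, the sketch below shows one way to compute percent correct per replication, flag items whose outcome varied across the three submissions, and correlate model accuracy with examinee-based item difficulty. It is a minimal illustration, not the study's scoring pipeline; the item IDs, replication outcomes, and p-values are hypothetical.

```python
# Minimal sketch: summarizing repeated LLM responses to sample exam items.
# The table layout (three scored replications per item plus an examinee-based
# difficulty value) is a hypothetical stand-in for the study's data.
import pandas as pd

# 0/1 = incorrect/correct for each of three submissions of the same item;
# p_value = proportion of examinees answering the item correctly (higher = easier).
items = pd.DataFrame({
    "item_id": ["A1", "A2", "A3", "A4", "A5"],
    "rep1":    [1, 1, 0, 1, 0],
    "rep2":    [1, 0, 0, 1, 1],
    "rep3":    [1, 1, 0, 1, 1],
    "p_value": [0.92, 0.61, 0.35, 0.88, 0.47],
})

reps = items[["rep1", "rep2", "rep3"]]

# Percent correct per replication (the 60% figure in the abstract is of this kind).
print("percent correct by replication:", reps.mean().round(3).to_dict())

# Items whose outcome varied across the three submissions (response instability).
items["unstable"] = reps.nunique(axis=1) > 1
print("unstable items:", items.loc[items["unstable"], "item_id"].tolist())

# Correspondence between model success and examinee-based item easiness.
items["model_prop_correct"] = reps.mean(axis=1)
print("corr(model accuracy, item easiness):",
      round(items["model_prop_correct"].corr(items["p_value"]), 3))
```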
2.
Acad Med; 93(4): 636-641, 2018 Apr.
Article in English | MEDLINE | ID: mdl-29028636

ABSTRACT

PURPOSE: Increasing criticism of maintenance of certification (MOC) examinations has prompted certifying boards to explore alternative assessment formats. The purpose of this study was to examine the effect of allowing test takers to access reference material while completing their MOC Part III standardized examination.
METHOD: Item response data were obtained from 546 physicians who completed a medical subspecialty MOC examination between 2013 and 2016. To investigate whether accessing references was related to better performance, an analysis of covariance was conducted on the MOC examination scores with reference access (access or no access) as the between-groups factor and scores from the physicians' initial certification examination as a covariate. Descriptive analyses were conducted to investigate how the new feature of accessing references influenced time management within the test day.
RESULTS: Physicians scored significantly higher when references were allowed (mean = 534.44, standard error = 6.83) than when they were not (mean = 472.75, standard error = 4.87), F(1, 543) = 60.18, P < .001, ω² = 0.09. However, accessing references affected pacing behavior: physicians were 13.47 times more likely to finish with less than a minute of test time remaining per section when reference material was accessible.
CONCLUSIONS: Permitting references led to an increase in performance but also to a decline in the perception that the test's time limits are sufficient. Implications of allowing references are discussed, including physician time management, impact on the construct assessed by the test, and the importance of providing validity evidence for all test design decisions.


Subjects
Attitude of Health Personnel, Physicians, Specialty Boards, Analysis of Variance, Certification, Clinical Competence, Continuing Medical Education, Humans, Time Factors, United States
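
The analysis of covariance described above can be sketched as a linear model with reference access as the between-groups factor and the initial certification score as the covariate. The snippet below is a hedged illustration on simulated data; the column names, effect sizes, and random seed are assumptions, not the study's dataset or results.

```python
# ANCOVA sketch in the spirit of the design above: MOC score as the outcome,
# reference access as the between-groups factor, initial certification score
# as the covariate. All data below are simulated for illustration.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 546
df = pd.DataFrame({
    "ref_access": rng.integers(0, 2, n),        # 0 = no access, 1 = access
    "initial_score": rng.normal(500, 50, n),    # initial certification exam score
})
df["moc_score"] = (400 + 0.2 * df["initial_score"]
                   + 60 * df["ref_access"] + rng.normal(0, 50, n))

# ANCOVA as a linear model: factor + covariate, then a Type II ANOVA table.
model = smf.ols("moc_score ~ C(ref_access) + initial_score", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))

# The C(ref_access) coefficient estimates the covariate-adjusted group difference.
print(model.params)
```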
3.
Educ Psychol Meas; 75(4): 610-633, 2015 Aug.
Article in English | MEDLINE | ID: mdl-29795835

ABSTRACT

In educational testing, differential item functioning (DIF) statistics must be accurately estimated to ensure the appropriate items are flagged for inspection or removal. This study showed how using the Rasch model to estimate DIF may introduce considerable bias in the results when there are large group differences in ability (impact) and the data follow a three-parameter logistic model. With large group ability differences, difficult non-DIF items appeared to favor the focal group and easy non-DIF items appeared to favor the reference group. Correspondingly, the effect sizes for DIF items were biased. These effects were mitigated when data were coded as missing for item-examinee encounters in which the person measure was considerably lower than the item location. Explanation of these results is provided by illustrating how the item response function becomes differentially distorted by guessing depending on the groups' ability distributions. In terms of practical implications, results suggest that measurement practitioners should not trust the DIF estimates from the Rasch model when there is a large difference in ability and examinees are potentially able to answer items correctly by guessing, unless data from examinees poorly matched to the item difficulty are coded as missing.
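
The mechanism described above (guessing inflating the observed success of low-ability examinees on difficult items) and the proposed mitigation (coding poorly matched responses as missing) can be illustrated with a small simulation. This is a sketch of the data-generating logic only, not the authors' full Rasch DIF analysis; the item parameters, group ability distributions, and mismatch threshold are illustrative choices.

```python
# Sketch: responses generated under a 3PL model (with guessing) for two groups
# differing in ability, plus the proposed fix of treating responses as missing
# when an examinee is poorly matched to the item location.
import numpy as np

rng = np.random.default_rng(1)

def p_3pl(theta, a, b, c):
    """Three-parameter logistic IRF: c + (1 - c) / (1 + exp(-a(theta - b)))."""
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

# Reference group is more able than the focal group (large impact).
theta_ref = rng.normal(1.0, 1.0, 5000)
theta_foc = rng.normal(-1.0, 1.0, 5000)

a, b, c = 1.0, 2.0, 0.25            # a difficult non-DIF item with guessing
x_ref = rng.binomial(1, p_3pl(theta_ref, a, b, c))
x_foc = rng.binomial(1, p_3pl(theta_foc, a, b, c))
print("observed proportion correct (ref, foc):",
      round(x_ref.mean(), 3), round(x_foc.mean(), 3))

# Mitigation: code responses as missing when the person is far below the item
# location, where correct answers are mostly guesses. In practice the rule
# would use estimated person measures; true abilities are used here for brevity.
mismatch = 1.5                      # illustrative threshold in logits
keep_ref = theta_ref > b - mismatch
keep_foc = theta_foc > b - mismatch
print("proportion correct after masking (ref, foc):",
      round(x_ref[keep_ref].mean(), 3), round(x_foc[keep_foc].mean(), 3))
```

On such a difficult item, most of the focal group's correct responses sit near the guessing level, which a model without a guessing parameter can misread as the item favoring the focal group, consistent with the pattern reported in the abstract; masking poorly matched encounters removes most of those guess-driven successes.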
