Results 1 - 20 of 11,974
2.
J Cell Physiol; 239(8): e31258, 2024 Aug.
Article in English | MEDLINE | ID: mdl-38595027

ABSTRACT

Qualifying exams and thesis committees are crucial components of a PhD candidate's journey. However, many candidates have trouble navigating these milestones and knowing what to expect. This article provides advice on meeting the requirements of the qualifying exam, understanding its format and components, choosing effective preparation strategies, retaking the qualifying exam, if necessary, and selecting a thesis committee, all while maintaining one's mental health. This comprehensive guide addresses components of the graduate school process that are often neglected.


Subjects
Education, Graduate; Humans; Education, Graduate/methods; Academic Dissertations as Topic; Educational Measurement/methods
3.
Radiology; 312(3): e240153, 2024 Sep.
Article in English | MEDLINE | ID: mdl-39225605

ABSTRACT

Background Recent advancements, including image processing capabilities, present new potential applications of large language models such as ChatGPT (OpenAI), a generative pretrained transformer, in radiology. However, baseline performance of ChatGPT in radiology-related tasks is understudied. Purpose To evaluate the performance of GPT-4 with vision (GPT-4V) on radiology in-training examination questions, including those with images, to gauge the model's baseline knowledge in radiology. Materials and Methods In this prospective study, conducted between September 2023 and March 2024, the September 2023 release of GPT-4V was assessed using 386 retired questions (189 image-based and 197 text-only questions) from the American College of Radiology Diagnostic Radiology In-Training Examinations. Nine question pairs were identified as duplicates; only the first instance of each duplicate was considered in ChatGPT's assessment. A subanalysis assessed the impact of different zero-shot prompts on performance. Statistical analysis included χ2 tests of independence to ascertain whether the performance of GPT-4V varied between question types or subspecialties. The McNemar test was used to evaluate performance differences between the prompts, with Benjamini-Hochberg adjustment of the P values conducted to control the false discovery rate (FDR). A P value threshold of less than .05 denoted statistical significance. Results GPT-4V correctly answered 246 (65.3%) of the 377 unique questions, with significantly higher accuracy on text-only questions (81.5%, 159 of 195) than on image-based questions (47.8%, 87 of 182) (χ2 test, P < .001). Subanalysis revealed differences between prompts on text-based questions, where chain-of-thought prompting outperformed long instruction by 6.1% (McNemar, P = .02; FDR = 0.063), basic prompting by 6.8% (P = .009, FDR = 0.044), and the original prompting style by 8.9% (P = .001, FDR = 0.014). No differences were observed between prompts on image-based questions, with P values of .27 to >.99. Conclusion While GPT-4V demonstrated a level of competence in text-based questions, it showed deficits in interpreting radiologic images. © RSNA, 2024 See also the editorial by Deng in this issue.
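
For readers unfamiliar with the paired statistics named above, the sketch below shows a McNemar test on paired per-question outcomes and a Benjamini-Hochberg false-discovery-rate adjustment. It is not the authors' analysis code: the correctness data are simulated, and the raw prompt-comparison P values are copied from the abstract purely to illustrate the adjustment step.

```python
# Minimal sketch, assuming simulated data and the statsmodels implementations.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
n_questions = 195  # text-only questions, as in the abstract

# Hypothetical per-question correctness (1 = correct) under two prompt styles.
chain_of_thought = rng.integers(0, 2, n_questions)
basic_prompt = rng.integers(0, 2, n_questions)

# Paired 2x2 table: rows = chain-of-thought correct/incorrect, cols = basic prompt.
table = np.array([
    [np.sum((chain_of_thought == 1) & (basic_prompt == 1)),
     np.sum((chain_of_thought == 1) & (basic_prompt == 0))],
    [np.sum((chain_of_thought == 0) & (basic_prompt == 1)),
     np.sum((chain_of_thought == 0) & (basic_prompt == 0))],
])
print(f"McNemar P value: {mcnemar(table, exact=False, correction=True).pvalue:.3f}")

# Benjamini-Hochberg adjustment across the three prompt comparisons reported above
# (raw P values taken from the abstract; the published FDR values may have been
# computed over a different family of tests).
reject, p_adj, _, _ = multipletests([0.02, 0.009, 0.001], alpha=0.05, method="fdr_bh")
print("FDR-adjusted P values:", np.round(p_adj, 3))
```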


Subjects
Educational Measurement; Radiology; Humans; Prospective Studies; Radiology/education; Educational Measurement/methods; Clinical Competence; United States; Internship and Residency; Education, Medical, Graduate/methods
4.
Radiology; 311(2): e232715, 2024 May.
Article in English | MEDLINE | ID: mdl-38771184

ABSTRACT

Background ChatGPT (OpenAI) can pass a text-based radiology board-style examination, but its stochasticity and confident language when it is incorrect may limit utility. Purpose To assess the reliability, repeatability, robustness, and confidence of GPT-3.5 and GPT-4 (ChatGPT; OpenAI) through repeated prompting with a radiology board-style examination. Materials and Methods In this exploratory prospective study, 150 radiology board-style multiple-choice text-based questions, previously used to benchmark ChatGPT, were administered to default versions of ChatGPT (GPT-3.5 and GPT-4) on three separate attempts (separated by ≥1 month and then 1 week). Accuracy and answer choices between attempts were compared to assess reliability (accuracy over time) and repeatability (agreement over time). On the third attempt, regardless of answer choice, ChatGPT was challenged three times with the adversarial prompt, "Your answer choice is incorrect. Please choose a different option," to assess robustness (ability to withstand adversarial prompting). ChatGPT was prompted to rate its confidence from 1-10 (with 10 being the highest level of confidence and 1 being the lowest) on the third attempt and after each challenge prompt. Results Neither version showed a difference in accuracy over three attempts: for the first, second, and third attempt, accuracy of GPT-3.5 was 69.3% (104 of 150), 63.3% (95 of 150), and 60.7% (91 of 150), respectively (P = .06); and accuracy of GPT-4 was 80.6% (121 of 150), 78.0% (117 of 150), and 76.7% (115 of 150), respectively (P = .42). Though both GPT-4 and GPT-3.5 had only moderate intrarater agreement (κ = 0.78 and 0.64, respectively), the answer choices of GPT-4 were more consistent across three attempts than those of GPT-3.5 (agreement, 76.7% [115 of 150] vs 61.3% [92 of 150], respectively; P = .006). After challenge prompt, both changed responses for most questions, though GPT-4 did so more frequently than GPT-3.5 (97.3% [146 of 150] vs 71.3% [107 of 150], respectively; P < .001). Both rated "high confidence" (≥8 on the 1-10 scale) for most initial responses (GPT-3.5, 100% [150 of 150]; and GPT-4, 94.0% [141 of 150]) as well as for incorrect responses (ie, overconfidence; GPT-3.5, 100% [59 of 59]; and GPT-4, 77% [27 of 35], respectively; P = .89). Conclusion Default GPT-3.5 and GPT-4 were reliably accurate across three attempts, but both had poor repeatability and robustness and were frequently overconfident. GPT-4 was more consistent across attempts than GPT-3.5 but more influenced by an adversarial prompt. © RSNA, 2024 Supplemental material is available for this article. See also the editorial by Ballard in this issue.
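
The repeatability analysis described above (agreement of answer choices across repeated attempts) can be illustrated with a small sketch. The answer data below are simulated and the scikit-learn kappa implementation is an assumed convenience, not the study's method.

```python
# Rough sketch with invented answer choices; percent agreement and Cohen's kappa
# summarize how often a model repeats its earlier choice.
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(1)
choices = np.array(list("ABCD"))
n_questions = 150  # as in the abstract

attempt_1 = rng.choice(choices, n_questions)
# Simulate a model that keeps its earlier answer roughly 75% of the time.
attempt_3 = np.where(rng.random(n_questions) < 0.75,
                     attempt_1, rng.choice(choices, n_questions))

agreement = np.mean(attempt_1 == attempt_3)
kappa = cohen_kappa_score(attempt_1, attempt_3)
print(f"Percent agreement: {agreement:.1%}, Cohen's kappa: {kappa:.2f}")
```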


Assuntos
Inteligência Artificial , Competência Clínica , Avaliação Educacional , Radiologia , Humanos , Avaliação Educacional/métodos , Estudos Prospectivos , Reprodutibilidade dos Testes , Conselhos de Especialidade Profissional
5.
J Gen Intern Med; 39(10): 1795-1802, 2024 Aug.
Article in English | MEDLINE | ID: mdl-38289461

ABSTRACT

BACKGROUND: While some prior studies of work-based assessment (WBA) numeric ratings have not shown gender differences, they have been unable to account for the true performance of the resident or explore narrative differences by gender. OBJECTIVE: To explore gender differences in WBA ratings as well as narrative comments (when scripted performance was known). DESIGN: Secondary analysis of WBAs obtained from a randomized controlled trial of a longitudinal rater training intervention in 2018-2019. Participating faculty (n = 77) observed standardized resident-patient encounters and subsequently completed rater assessment forms (RAFs). SUBJECTS: Participating faculty in longitudinal rater training. MAIN MEASURES: Gender differences in mean entrustment ratings (4-point scale) were assessed with multivariable regression (adjusted for scripted performance, rater and resident demographics, and the interaction between study arm and time period [pre- versus post-intervention]). Using pre-specified natural language processing categories (masculine, feminine, agentic, and communal words), multivariable linear regression was used to determine associations of word use in the narrative comments with resident gender, race, and skill level, faculty demographics, and the interaction between the study arm and the time period (pre- versus post-intervention). KEY RESULTS: Across 1527 RAFs, there were significant differences in entrustment ratings between women and men standardized residents (2.29 versus 2.54, respectively, p < 0.001) after correction for resident skill level. Compared with men, feminine terms were more common in comments on what the resident did poorly for women residents (β = 0.45, CI 0.12-0.78, p = 0.01). This persisted despite adjusting for the faculty's entrustment ratings. There were no other significant linguistic differences by gender. CONCLUSIONS: Contrasting prior studies, we found entrustment rating differences in a simulated WBA which persisted after adjusting for the resident's scripted performance. There were also linguistic differences by gender after adjusting for entrustment ratings, with feminine terms being used more frequently in some, but not all, narrative comments about women.
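
The adjustment described above can be pictured as a simple regression. The sketch below is illustrative only: the data are simulated, the 0.25-point gender effect is invented, and a plain OLS model stands in for the study's fuller multivariable model.

```python
# Illustrative sketch: entrustment rating regressed on standardized-resident
# gender while adjusting for the scripted performance level.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 1527  # number of rater assessment forms in the abstract
df = pd.DataFrame({
    "gender": rng.choice(["woman", "man"], n),
    "scripted_level": rng.choice([1, 2, 3], n),   # hypothetical skill script
})
# Simulate ratings with a small gender gap on a 4-point entrustment scale.
df["rating"] = (1.0 + 0.5 * df["scripted_level"]
                + np.where(df["gender"] == "man", 0.25, 0.0)
                + rng.normal(0, 0.5, n)).clip(1, 4)

model = smf.ols("rating ~ C(gender) + C(scripted_level)", data=df).fit()
print(model.params)   # the C(gender)[T.woman] term is the adjusted gender gap
```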


Subjects
Clinical Competence; Internship and Residency; Humans; Female; Male; Clinical Competence/standards; Sex Factors; Narration; Adult; Educational Measurement/methods
6.
World J Urol; 42(1): 445, 2024 Jul 26.
Article in English | MEDLINE | ID: mdl-39060792

ABSTRACT

BACKGROUND AND OBJECTIVE: In the transformative era of artificial intelligence, its integration into various spheres, especially healthcare, has been promising. The objective of this study was to analyze the performance of ChatGPT, as an open-source Large Language Model (LLM), in its different versions on the recent European Board of Urology (EBU) in-service assessment questions. DESIGN AND SETTING: We presented multiple-choice questions from the official EBU test books to ChatGPT-3.5 and ChatGPT-4 for the following exams: exam 1 (2017-2018), exam 2 (2019-2020) and exam 3 (2021-2022). An exam was passed with ≥60% correct answers. RESULTS: ChatGPT-4 provided significantly more correct answers in all exams compared to the prior version 3.5 (exam 1: ChatGPT-3.5 64.3% vs. ChatGPT-4 81.6%; exam 2: 64.5% vs. 80.5%; exam 3: 56% vs. 77%, p < 0.001, respectively). Test exam 3 was the only exam ChatGPT-3.5 did not pass. Within the different subtopics, there were no significant differences in the proportion of correct answers provided by ChatGPT-3.5. Concerning ChatGPT-4, the percentage in test exam 3 was significantly decreased in the subtopics Incontinence (exam 1: 81.6% vs. exam 3: 53.6%; p = 0.026) and Transplantation (exam 1: 77.8% vs. exam 3: 0%; p = 0.020). CONCLUSION: Our findings indicate that ChatGPT, especially ChatGPT-4, has the general ability to answer complex medical questions and might pass FEBU exams. Nevertheless, there is still an indispensable need for human validation of LLM answers, especially concerning health care issues.


Subjects
Urology; Europe; Educational Measurement/methods; Specialty Boards; Humans
7.
World J Urol; 42(1): 250, 2024 Apr 23.
Article in English | MEDLINE | ID: mdl-38652322

ABSTRACT

PURPOSE: To compare ChatGPT-4 and ChatGPT-3.5's performance on the Taiwan urology board examination (TUBE), focusing on answer accuracy, explanation consistency, and uncertainty management tactics to minimize score penalties from incorrect responses across 12 urology domains. METHODS: 450 multiple-choice questions from the TUBE (2020-2022) were presented to the two models. Three urologists assessed the correctness and consistency of each response. Accuracy quantifies the proportion of correct answers; consistency assesses the logic and coherence of explanations across all responses. A penalty-reduction experiment with prompt variations was also conducted. Univariate logistic regression was applied for subgroup comparison. RESULTS: ChatGPT-4 showed strengths in urology, achieving an overall accuracy of 57.8%, with annual accuracies of 64.7% (2020), 58.0% (2021), and 50.7% (2022), significantly surpassing ChatGPT-3.5 (33.8%, OR = 2.68, 95% CI [2.05-3.52]). It could have passed the TUBE written exams if judged solely on accuracy but failed in the final score due to penalties. ChatGPT-4 displayed a declining accuracy trend over time. Variability in accuracy across the 12 urological domains was noted, with more frequently updated knowledge domains showing lower accuracy (53.2% vs. 62.2%, OR = 0.69, p = 0.05). A high consistency rate of 91.6% in explanations across all domains indicates reliable delivery of coherent and logical information. The simple prompt outperformed strategy-based prompts in accuracy (60% vs. 40%, p = 0.016), highlighting ChatGPT's inability to accurately self-assess uncertainty and its tendency towards overconfidence, which may hinder medical decision-making. CONCLUSIONS: ChatGPT-4's high accuracy and consistent explanations in the urology board examination demonstrate its potential in medical information processing. However, its limitations in self-assessment and overconfidence necessitate caution in its application, especially for inexperienced users. These insights call for ongoing advancement of urology-specific AI tools.


Subjects
Educational Measurement; Urology; Taiwan; Educational Measurement/methods; Clinical Competence; Humans; Specialty Boards
8.
J Surg Res; 299: 329-335, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38788470

ABSTRACT

INTRODUCTION: Chat Generative Pretrained Transformer (ChatGPT) is a large language model capable of generating human-like text. This study sought to evaluate ChatGPT's performance on Surgical Council on Resident Education (SCORE) self-assessment questions. METHODS: General surgery multiple-choice questions were randomly selected from the SCORE question bank and presented to ChatGPT (GPT-3.5, April-May 2023), and its responses were recorded. RESULTS: ChatGPT correctly answered 123 of 200 questions (62%). ChatGPT scored lowest on biliary (2/8 questions correct, 25%), surgical critical care (3/10, 30%), general abdomen (1/3, 33%), and pancreas (1/3, 33%) topics. ChatGPT scored higher on biostatistics (4/4 correct, 100%), fluid/electrolytes/acid-base (4/4, 100%), and small intestine (8/9, 89%) questions. ChatGPT answered questions with thorough and structured support for its answers. It scored 56% on ethics questions and provided coherent explanations regarding end-of-life discussions, communication with coworkers and patients, and informed consent. For many questions answered incorrectly, ChatGPT provided cogent, yet factually incorrect descriptions, including anatomy and steps of operations. In two instances, it gave a correct explanation but chose the wrong answer. It did not answer two questions, stating it needed additional information to determine the next best step in treatment. CONCLUSIONS: ChatGPT answered 62% of SCORE questions correctly. It performed better on questions requiring standard recall but struggled with higher-level questions that required complex clinical decision making, despite providing detailed rationales for its responses. Due to its mediocre performance on this question set and its sometimes confidently worded, yet factually inaccurate responses, caution should be used when interpreting ChatGPT's answers to general surgery questions.


Assuntos
Cirurgia Geral , Internato e Residência , Humanos , Cirurgia Geral/educação , Avaliação Educacional/métodos , Avaliação Educacional/estatística & dados numéricos , Estados Unidos , Competência Clínica/estatística & dados numéricos , Conselhos de Especialidade Profissional
9.
J Surg Res; 300: 191-197, 2024 Aug.
Article in English | MEDLINE | ID: mdl-38824849

ABSTRACT

INTRODUCTION: There is no consensus regarding optimal curricula to teach cognitive elements of general surgery. The American Board of Surgery In-Training Exam (ABSITE) aims to measure trainees' progress in attaining this knowledge. Resources like question banks (QBs), Surgical Council on Resident Education (SCORE) curriculum, and didactic conferences have mixed findings related to ABSITE performance and are often evaluated in isolation. This study characterized relationships between multiple learning methods and ABSITE performance to elucidate the relative educational value of learning strategies. METHODS: Use and score of QB, SCORE use, didactic conference attendance, and ABSITE percentile score were collected at an academic general surgery residency program from 2017 to 2022. QB data were available in the years 2017-2018 and 2021-2022 during institutional subscription to the same platform. Given differences in risk of qualifying exam failure, groups of ≤30th and >30th percentile were analyzed. Linear quantile mixed regressions and generalized linear mixed models determined factors associated with ABSITE performance. RESULTS: Linear quantile mixed regressions revealed a relationship between ABSITE performance and QB questions completed (1.5 percentile per 100 questions, P < 0.001) and QB score (1.2 percentile per 1% score, P < 0.001), but not with SCORE use and didactic attendance. Performers >30th percentile had a significantly higher QB score. CONCLUSIONS: Use and score of QB had a significant relationship with ABSITE performance, while SCORE use and didactic attendance did not. Performers >30th percentile completed a median 1094 QB questions annually with a score of 65%. Results emphasize success of QB use as an active learning strategy, while passive learning methods warrant further evaluation.
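
As an illustration of the modelling approach named above, the sketch below fits a plain median (quantile) regression to simulated data. It drops the random effects of the authors' linear quantile mixed models and all coefficients are invented, so it only demonstrates the reporting style (percentile points per 100 questions), not the study's results.

```python
# Simplified sketch: median regression of ABSITE percentile on question-bank use.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 200
df = pd.DataFrame({
    "qb_questions": rng.integers(0, 2000, n),   # questions completed per year
    "qb_score": rng.uniform(40, 80, n),         # question-bank percent correct
})
df["absite_pct"] = (10 + 0.015 * df["qb_questions"] + 0.9 * df["qb_score"]
                    + rng.normal(0, 12, n)).clip(1, 99)

median_fit = smf.quantreg("absite_pct ~ qb_questions + qb_score", df).fit(q=0.5)
# Express the slope per 100 questions, matching the abstract's reporting style.
print(f"{100 * median_fit.params['qb_questions']:.2f} percentile points per 100 questions")
print(f"{median_fit.params['qb_score']:.2f} percentile points per 1% QB score")
```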


Assuntos
Avaliação Educacional , Cirurgia Geral , Internato e Residência , Humanos , Avaliação Educacional/métodos , Avaliação Educacional/estatística & dados numéricos , Cirurgia Geral/educação , Internato e Residência/métodos , Estados Unidos , Competência Clínica/estatística & dados numéricos , Currículo , Conselhos de Especialidade Profissional , Aprendizagem , Educação de Pós-Graduação em Medicina/métodos
10.
Scand J Gastroenterol; 59(8): 989-995, 2024 Aug.
Article in English | MEDLINE | ID: mdl-38742832

ABSTRACT

BACKGROUND AND AIM: To explore the feasibility of a standardized training and assessment system for magnetically controlled capsule gastroscopy (MCCG). METHODS: The results of 90 trainees who underwent the standardized training and assessment system for the MCCG at the First Affiliated Hospital of Xi'an Jiaotong University from May 2020 to November 2023 were retrospectively analyzed. The trainees were divided into three groups according to their medical backgrounds: doctor, nurse, and non-medical groups. The training and assessment system adopted a '7 + 2' mode: seven days of training plus two days of theoretical and operational assessment. The passing rates of the theoretical, operational, and total assessments were the primary outcomes. Satisfaction with, and mastery of, the MCCG were also assessed. RESULTS: Ninety trainees were assessed; the theoretical assessment passing rate was 100% in all three groups. The operational and total assessment passing rates were 100% (25/25), 97.92% (47/48), and 94.12% (16/17) for the doctor, nurse, and non-medical groups, respectively, with no significant difference (χ2 = 1.741, p = 0.419). No bleeding or perforation occurred during the procedures. Approximately 96.00% (24/25), 95.83% (46/48), and 94.12% (16/17) of the doctor, nurse, and non-medical groups, respectively, anonymously expressed great satisfaction, with no statistically significant difference (χ2 = 0.565, p = 1.000). Follow-up ranged from 4 to 36 months, and 87 trainees (96.67%) had mastered operation of the MCCG in their daily work. CONCLUSIONS: Standardized training and assessment of magnetically controlled capsule endoscopists is effective and feasible. Additionally, a strict assessment system and long-term communication and learning can improve teaching effectiveness.
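
For reference, the group comparison above can be approximated with a chi-square test on the reported pass/fail counts. The sketch below uses scipy and the counts given in the abstract; the resulting statistic may differ slightly from the reported χ2 = 1.741 depending on how the original test was computed.

```python
# Quick sketch using the pass/fail counts from the abstract (25/25, 47/48, 16/17).
# Note the small expected counts; an exact test could be preferable.
from scipy.stats import chi2_contingency

#              passed  failed
table = [[25, 0],   # doctor group
         [47, 1],   # nurse group
         [16, 1]]   # non-medical group
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.3f}, p = {p:.3f}, dof = {dof}")
```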


Subjects
Capsule Endoscopy; Clinical Competence; Gastroscopy; Humans; Gastroscopy/education; Gastroscopy/methods; Retrospective Studies; Female; Male; Capsule Endoscopy/methods; Capsule Endoscopy/education; Adult; Feasibility Studies; Educational Measurement/methods; Magnetics; China
11.
Surg Endosc; 38(7): 3547-3555, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38814347

ABSTRACT

INTRODUCTION: The variety of robotic surgery systems, training modalities, and assessment tools within robotic surgery training is extensive. This systematic review aimed to comprehensively overview different training modalities and assessment methods for teaching and assessing surgical skills in robotic surgery, with a specific focus on comparing objective and subjective assessment methods. METHODS: A systematic review was conducted following the PRISMA guidelines. The electronic databases PubMed, EMBASE, and Cochrane were searched from inception until February 1, 2022. Included studies consisted of robotic-assisted surgery training (e.g., box training, virtual reality training, cadaver training and animal tissue training) with an assessment method (objective or subjective), such as assessment forms, virtual reality scores, peer-to-peer feedback or time recording. RESULTS: The search identified 1591 studies. After abstract screening and full-text examination, 209 studies were identified that focused on robotic surgery training and included an assessment tool. The majority of the studies utilized the da Vinci Surgical System, with dry lab training being the most common approach, followed by the da Vinci Surgical Skills Simulator. The most frequently used assessment methods included simulator scoring systems (e.g., the dVSS score) and assessment forms (e.g., GEARS and OSATS). CONCLUSION: This systematic review provides an overview of training modalities and assessment methods in robotic-assisted surgery. Dry lab training on the da Vinci Surgical System and training on the da Vinci Skills Simulator are the predominant approaches. However, focused training on tissue handling, manipulation, and force interaction is lacking, despite the absence of haptic feedback. Future research should focus on developing universal objective assessment and feedback methods to address these limitations as the field continues to evolve.


Assuntos
Competência Clínica , Procedimentos Cirúrgicos Robóticos , Procedimentos Cirúrgicos Robóticos/educação , Humanos , Treinamento por Simulação/métodos , Avaliação Educacional/métodos , Realidade Virtual , Animais , Cadáver
12.
Surg Endosc; 38(9): 5086-5095, 2024 Sep.
Article in English | MEDLINE | ID: mdl-39020120

ABSTRACT

BACKGROUND: Simulation is increasingly being explored as an assessment modality. This study sought to develop and collate validity evidence for a novel simulation-based assessment of operative competence. We describe the approach to assessment design, development, pilot testing, and validity investigation. METHODS: Eight procedural stations were generated using both virtual reality and bio-hybrid models. Content was identified from a previously conducted Delphi consensus study of trainers. Trainee performance was scored using an equally weighted Objective Structured Assessment of Technical Skills (OSATS) tool and a modified Procedure-Based Assessment (PBA) tool. Validity evidence was analyzed in accordance with Messick's validity framework. Both 'junior' (ST2-ST4) and 'senior' (ST5-ST8) trainees were included to allow for comparative analysis. RESULTS: Thirteen trainees were assessed by ten assessors across eight stations. Inter-station reliability was high (α = 0.81), and inter-rater reliability was acceptable (intraclass correlation coefficient, 0.77). A significant difference in mean station score was observed between junior and senior trainees (44.82 vs 58.18, p = .004), while overall mean scores were moderately correlated with increasing training year (rs = .74, p = .004; Kendall's tau-b = .57, p = .009). A pass-fail score generated using borderline regression methodology resulted in all 'senior' trainees passing and 4 of 6 junior trainees failing the assessment. CONCLUSION: This study reports validity evidence for a novel simulation-based assessment designed to assess the operative competence of higher specialist trainees in general surgery.
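
Two of the psychometric steps named above are easy to sketch: Cronbach's alpha across stations and a borderline-regression pass mark. The simulated scores, the 1-5 global rating scale, and the choice of '3' as the borderline grade are assumptions for illustration, not details taken from the study.

```python
# Illustrative sketch with simulated scores.
import numpy as np

rng = np.random.default_rng(4)
n_trainees, n_stations = 13, 8
ability = rng.normal(50, 12, n_trainees)
scores = ability[:, None] + rng.normal(0, 8, (n_trainees, n_stations))

def cronbach_alpha(x):
    """x: trainees x stations matrix of station scores."""
    k = x.shape[1]
    item_vars = x.var(axis=0, ddof=1).sum()
    total_var = x.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

print(f"Cronbach's alpha across stations: {cronbach_alpha(scores):.2f}")

# Borderline regression for one station: regress station score on a 1-5 global
# rating and read off the score predicted at the assumed "borderline" grade of 3.
global_rating = rng.integers(1, 6, n_trainees)
station_score = 10 * global_rating + rng.normal(0, 5, n_trainees)
slope, intercept = np.polyfit(global_rating, station_score, 1)
pass_mark = slope * 3 + intercept
print(f"Borderline-regression pass mark: {pass_mark:.1f}")
```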


Subjects
Clinical Competence; Educational Measurement; General Surgery; Simulation Training; Humans; General Surgery/education; Simulation Training/methods; Reproducibility of Results; Educational Measurement/methods; Education, Medical, Graduate/methods; Virtual Reality; Pilot Projects; Delphi Technique; Computer Simulation
13.
Anesth Analg; 138(5): 1081-1093, 2024 May 01.
Article in English | MEDLINE | ID: mdl-37801598

ABSTRACT

BACKGROUND: In 2018, a set of entrustable professional activities (EPAs) and procedural skills assessments were developed for anesthesiology training, but they did not assess all the Accreditation Council for Graduate Medical Education (ACGME) milestones. The aims of this study were to (1) remap the 2018 EPA and procedural skills assessments to the revised ACGME Anesthesiology Milestones 2.0, (2) develop new assessments that, combined with the original assessments, create a system of assessment addressing all level 1 to 4 milestones, and (3) provide evidence for the validity of the assessments. METHODS: Using a modified Delphi process, a panel of anesthesiology education experts remapped the original assessments developed in 2018 to the Anesthesiology Milestones 2.0 and developed new assessments to create a system that assessed all level 1 through 4 milestones. Following a 24-month pilot at 7 institutions, the number of EPA and procedural skill assessments and the mean scores were computed at the end of the academic year. Milestone achievement and subcompetency data for assessments from a single institution were compared to scores assigned by the institution's clinical competency committee (CCC). RESULTS: New assessment development, 2 months of testing and feedback, and revisions resulted in 5 new EPAs, 11 nontechnical skills assessments (NTSAs), and 6 objective structured clinical examinations (OSCEs). Combined with the original 20 EPAs and procedural skills assessments, the new system of assessment addresses 99% of level 1 to 4 Anesthesiology Milestones 2.0. During the 24-month pilot, aggregate mean EPA and procedural skill scores significantly increased with year in training. System subcompetency scores correlated significantly with 15 of 23 (65.2%) corresponding CCC scores at a single institution, but 8 correlations (36.4%) were <0.30, illustrating poor correlation. CONCLUSIONS: A panel of experts developed a set of EPAs, procedural skill assessments, NTSAs, and OSCEs to form a programmatic system of assessment for anesthesiology residency training in the United States. The method used to develop and pilot test the assessments, the progression of assessment scores with time in training, and the correlation of assessment scores with CCC scoring of milestone achievement provide evidence for the validity of the assessments.


Subjects
Anesthesiology; Internship and Residency; United States; Anesthesiology/education; Education, Medical, Graduate; Educational Measurement/methods; Clinical Competence; Accreditation
14.
Anesth Analg; 139(2): 349-356, 2024 Aug 01.
Article in English | MEDLINE | ID: mdl-38640076

ABSTRACT

BACKGROUND: Over the past decade, artificial intelligence (AI) has expanded significantly with increased adoption across various industries, including medicine. Recently, AI-based large language models such as Generative Pretrained Transformer-3 (GPT-3), Bard, and Generative Pretrained Transformer-4 (GPT-4) have demonstrated remarkable language capabilities. While previous studies have explored their potential in general medical knowledge tasks, here we assess their clinical knowledge and reasoning abilities in a specialized medical context. METHODS: We studied and compared the performance of all 3 models on both the written and oral portions of the comprehensive and challenging American Board of Anesthesiology (ABA) examination, which evaluates candidates' knowledge and competence in anesthesia practice. RESULTS: Our results reveal that only GPT-4 successfully passed the written examination, achieving an accuracy of 78% on the basic section and 80% on the advanced section. In comparison, the less recent or smaller GPT-3 and Bard models scored 58% and 47% on the basic examination, and 50% and 46% on the advanced examination, respectively. Consequently, only GPT-4 was evaluated in the oral examination, with examiners concluding that it had a reasonable possibility of passing the structured oral examination. Additionally, we observe that these models exhibit varying degrees of proficiency across distinct topics, which could serve as an indicator of the relative quality of information contained in the corresponding training datasets. This may also act as a predictor for determining which anesthesiology subspecialty is most likely to witness the earliest integration with AI. CONCLUSIONS: GPT-4 outperformed GPT-3 and Bard on both the basic and advanced sections of the written ABA examination, and actual board examiners considered GPT-4 to have a reasonable possibility of passing the real oral examination; these models also exhibit varying degrees of proficiency across distinct topics.


Assuntos
Anestesiologia , Inteligência Artificial , Competência Clínica , Conselhos de Especialidade Profissional , Anestesiologia/educação , Humanos , Estados Unidos , Avaliação Educacional/métodos , Raciocínio Clínico
15.
Child Dev; 95(1): 242-260, 2024.
Article in English | MEDLINE | ID: mdl-37566438

ABSTRACT

This study used rich individual-level registry data covering the entire Norwegian population to identify students aged 17-21 who either failed a high-stakes exit exam or received the lowest passing grade from 2006 to 2018. Propensity score matching on high-quality observed characteristics was utilized to allow meaningful comparisons (N = 18,052, 64% boys). Results showed a 21% increase in the odds of receiving a psychological diagnosis among students who failed the exam. Adolescents who failed were also at 57% reduced odds of graduating and 44% reduced odds of enrolling in tertiary education 5 years following the exam. Results suggest that failing a high-stakes exam is associated with mental health issues and therefore may impact adolescents more broadly than captured in educational outcomes.
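
To make the matching step concrete, the sketch below estimates propensity scores with logistic regression and performs greedy 1:1 nearest-neighbour matching. The covariates, sample size, and effect sizes are invented, and the scikit-learn estimator is a convenience choice; none of this reproduces the registry analysis above.

```python
# Illustrative propensity score matching sketch on simulated data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
n = 2000
covariates = np.column_stack([
    rng.normal(3.5, 1.0, n),   # prior grade point average (hypothetical)
    rng.integers(0, 2, n),     # sex indicator
    rng.normal(0, 1, n),       # parental education, standardized
])
# Exposure: failing the exit exam, more likely with lower prior grades.
p_fail = 1 / (1 + np.exp(2.0 + 1.2 * (covariates[:, 0] - 3.5)))
failed = rng.random(n) < p_fail

propensity_model = LogisticRegression(max_iter=1000).fit(covariates, failed)
ps = propensity_model.predict_proba(covariates)[:, 1]

# Greedy 1:1 nearest-neighbour matching on the propensity score.
treated = np.where(failed)[0]
controls = set(np.where(~failed)[0])
matches = []
for i in treated:
    j = min(controls, key=lambda c: abs(ps[c] - ps[i]))
    matches.append((i, j))
    controls.remove(j)
print(f"Matched {len(matches)} exam failures to comparison students")
# Outcomes (e.g., odds of a psychological diagnosis) would then be compared
# within the matched pairs.
```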


Subjects
Educational Measurement; Mental Health; Male; Adolescent; Humans; Female; Educational Measurement/methods; Propensity Score; Students; Educational Status
16.
J Clin Densitom; 27(2): 101480, 2024.
Article in English | MEDLINE | ID: mdl-38401238

ABSTRACT

BACKGROUND: Artificial intelligence (AI) large language models (LLMs) such as ChatGPT have demonstrated the ability to pass standardized exams. These models are not trained for a specific task, but instead trained to predict sequences of text from large corpora of documents sourced from the internet. It has been shown that even models trained on this general task can pass exams in a variety of domain-specific fields, including the United States Medical Licensing Examination. We asked whether large language models would perform as well on much narrower subdomain tests designed for medical specialists. Furthermore, we wanted to better understand how progressive generations of GPT (generative pre-trained transformer) models may be evolving in the completeness and sophistication of their responses even while generational training remains general. In this study, we evaluated the performance of two versions of GPT (GPT-3 and GPT-4) on their ability to pass the certification exam given to physicians to work as osteoporosis specialists and become certified clinical densitometrists. The CCD exam has a possible score range of 150 to 400; a score of 300 is required to pass. METHODS: A 100-question multiple-choice practice exam was obtained from a third-party exam preparation website that mimics the accredited certification tests given by the ISCD (International Society for Clinical Densitometry). The exam was administered to two versions of GPT, the free version (GPT Playground) and ChatGPT+, which are based on GPT-3 and GPT-4, respectively (OpenAI, San Francisco, CA). The systems were prompted with the exam questions verbatim. If the response was purely textual and did not specify which of the multiple-choice answers to select, the authors matched the text to the closest answer. Each exam was graded and an estimated ISCD score was provided from the exam website. In addition, each response was evaluated by a rheumatologist CCD and ranked for accuracy using a 5-level scale. The two GPT versions were compared in terms of response accuracy and length. RESULTS: The average response length was 11.6 ± 19 words for GPT-3 and 50.0 ± 43.6 words for GPT-4. GPT-3 answered 62 questions correctly, resulting in a failing ISCD score of 289. However, GPT-4 answered 82 questions correctly with a passing score of 342. GPT-3 scored highest on the "Overview of Low Bone Mass and Osteoporosis" category (72% correct), while GPT-4 scored well above 80% accuracy on all categories except "Imaging Technology in Bone Health" (65% correct). Regarding subjective accuracy, GPT-3 answered 23 questions with nonsensical or totally wrong responses, while GPT-4 had no responses in that category. CONCLUSION: If this had been an actual certification exam, GPT-4 would now have a CCD suffix to its name, even after being trained using general internet knowledge. Clearly, more goes into physician training than can be captured in this exam. However, GPT algorithms may prove to be valuable physician aids in the diagnosis and monitoring of osteoporosis and other diseases.


Subjects
Artificial Intelligence; Certification; Humans; Osteoporosis/diagnosis; Clinical Competence; Educational Measurement/methods; United States
17.
Med Educ; 58(5): 535-543, 2024 May.
Article in English | MEDLINE | ID: mdl-37932950

ABSTRACT

INTRODUCTION: Self-monitoring of clinical decision-making is essential for health care professional practice. Using certainty in responses to assessment items could allow self-monitoring of clinical decision-making by medical students to be tracked over time. This research introduces how aspects of insightfulness, safety and efficiency can be derived from certainty in, and correctness of, multiple-choice question (MCQ) responses. We also show how these measures change over time. METHODS: With each answer on twice-yearly MCQ progress tests, medical students provided their certainty of correctness. An insightful student would be more likely to be correct for answers given with increasing certainty. A safe student would be expected to have a high probability of being correct for answers given with high certainty. An efficient student would be expected to have a sufficiently low probability of being correct when they have no certainty. The system was developed using first principles and data from one cohort of students. A dataset from a second cohort was then used as an independent validation sample. RESULTS: The patterns of the aspects of self-monitoring were similar for both cohorts. Almost all the students met the criteria for insightfulness on all tests. Most students had an undetermined outcome for the safety aspect. When a definitive result for safety was obtained, absence of safety was most prevalent in the middle of the course, while the presence of safety increased later. Most of the students met the criteria for efficiency, with the highest prevalence mid-course, but efficiency was more likely to be absent later. DISCUSSION: Throughout the course, students showed reassuring levels of insightfulness. The results suggest that students may balance safety with efficiency. This may be explained by students learning the positive implications of decisions before the negative implications, making them initially more efficient but later more cautious and safer.
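
One way to picture these three aspects is to tabulate accuracy by certainty level, as in the sketch below. The simulated responses, the four-level certainty scale, and the 80%/30% thresholds are invented placeholders, not the criteria used in the study.

```python
# Heavily simplified sketch of certainty-based self-monitoring measures.
import numpy as np

rng = np.random.default_rng(6)
n_items = 120
certainty = rng.integers(0, 4, n_items)        # 0 = no certainty ... 3 = high
# Simulate a student whose accuracy rises with certainty.
p_correct = 0.25 + 0.2 * certainty
correct = rng.random(n_items) < p_correct

accuracy_by_level = [correct[certainty == c].mean() for c in range(4)]

insightful = all(np.diff(accuracy_by_level) >= 0)   # accuracy rises with certainty
safe = accuracy_by_level[3] >= 0.80                 # reliably correct when highly certain (assumed cutoff)
efficient = accuracy_by_level[0] <= 0.30            # near-chance accuracy at no certainty (assumed cutoff)

print("accuracy by certainty level:", np.round(accuracy_by_level, 2))
print(f"insightful={insightful}, safe={safe}, efficient={efficient}")
```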


Subjects
Educational Measurement; Students, Medical; Humans; Educational Measurement/methods; Learning; Clinical Competence; Clinical Decision-Making
18.
Med Educ; 58(6): 730-736, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38548481

ABSTRACT

OBJECTIVE: This study explored how the Syrian crisis, training conditions, and relocation influenced the National Medical Examination (NME) scores of final-year medical students. METHODS: Results of the NME were used to denote the performance of final-year medical students between 2014 and 2021. The NME is a mandatory standardised test that measures the knowledge and competence of students in various clinical subjects. We categorised the data into two periods: period-I (2014-2018) and period-II (2019-2021). Period-I represents students who trained under hostile circumstances, referring to the devastating effects of a decade-long Syrian crisis. Period-II represents the post-hostilities phase, which is marked by a deepening economic crisis. RESULTS: Collected data included test scores for a total of 18 312 final-year medical students from nine medical schools (from six public and three private universities). NME scores improved significantly in period-II compared with period-I tests (p < 0.0001). Campus location or relocation during the crisis affected the results significantly, with higher scores from students of medical schools located in lower-risk regions compared with those from medical schools located in high-risk regions (p < 0.0001), both during the hostilities and in the post-hostilities phase. Also, students of medical schools relocated to lesser-risk regions scored significantly less than those of medical schools located in high-risk regions (p < 0.0001), but their scores remained inferior to those of students of medical schools that were originally located in lower-risk regions (p < 0.0001). CONCLUSION: The academic performance of final-year medical students can be adversely affected by crises and conflicts, with a clear tendency towards recovery upon crisis resolution. The study underscores the importance of maintaining and safeguarding the infrastructure of educational institutions, especially during times of crisis. Governments and educational authorities should prioritise resource allocation to ensure that medical schools have access to essential services, learning resources, and teaching personnel.


Subjects
Educational Measurement; Students, Medical; Syria; Humans; Educational Measurement/methods; Educational Measurement/standards; Clinical Competence/standards; Schools, Medical; Education, Medical, Undergraduate; Education, Medical
19.
Med Educ; 58(7): 825-837, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38167833

ABSTRACT

BACKGROUND: Assessment of the Core Entrustable Professional Activities for Entering Residency requires direct observation through workplace-based assessments (WBAs). Single-institution studies have demonstrated mixed findings regarding the reliability of WBAs developed to measure student progression towards entrustment. Factors such as faculty development, rater engagement and scale selection have been suggested to improve reliability. The purpose of this investigation was to conduct a multi-institutional generalisability study to determine the influence of specific factors on reliability of WBAs. METHODS: The authors analysed WBA data obtained for clerkship-level students across seven institutions from 2018 to 2020. Institutions implemented a variety of strategies including selection of designated assessors, altered scales and different EPAs. Data were aggregated by these factors. Generalisability theory was then used to examine the internal structure validity evidence of the data. An unbalanced cross-classified random-effects model was used to decompose variance components. A phi coefficient of >0.7 was used as threshold for acceptable reliability. RESULTS: Data from 53 565 WBAs were analysed, and a total of 77 generalisability studies were performed. Most data came from EPAs 1 (n = 17 118, 32%) 2 (n = 10 237, 19.1%), and 6 (n = 6000, 18.5%). Low variance attributed to the learner (<10%) was found for most (59/77, 76%) analyses, resulting in a relatively large number of observations required for reasonable reliability (range = 3 to >560, median = 60). Factors such as DA, scale or EPA were not consistently associated with improved reliability. CONCLUSION: The results from this study describe relatively low reliability in the WBAs obtained across seven sites. Generalisability for these instruments may be less dependent on factors such as faculty development, rater engagement or scale selection. When used for formative feedback, data from these instruments may be useful. However, such instruments do not consistently provide reasonable reliability to justify their use in high-stakes summative entrustment decisions.
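
The reliability figures above come from a decision-study calculation: given variance components from a generalisability study, the phi coefficient for the mean of n observations is the learner variance divided by itself plus the absolute error variance shrunk by n. The sketch below uses made-up components (learner variance under 10% and a single lumped error term) rather than the study's cross-classified estimates.

```python
# Simplified single-facet decision-study sketch with assumed variance components.
var_learner = 0.08   # variance attributable to the learner (<10%, as reported)
var_error = 0.92     # all remaining (absolute error) variance per observation

def phi(n_observations: int) -> float:
    return var_learner / (var_learner + var_error / n_observations)

for n in (1, 10, 30, 60, 120):
    print(f"n = {n:>3}: phi = {phi(n):.2f}")

# Smallest number of observations reaching the 0.70 threshold used in the study.
n_needed = next(n for n in range(1, 1000) if phi(n) >= 0.70)
print(f"observations needed for phi >= 0.70: {n_needed}")
```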


Subjects
Clinical Competence; Educational Measurement; Workplace; Humans; Educational Measurement/methods; Reproducibility of Results; Clinical Competence/standards; Students, Medical/psychology; Competency-Based Education; Internship and Residency; Clinical Clerkship
20.
Med Educ; 58(8): 980-988, 2024 Aug.
Article in English | MEDLINE | ID: mdl-38462812

ABSTRACT

BACKGROUND: Active engagement with feedback is crucial for feedback to be effective and improve students' learning and achievement. Medical students are provided feedback on their development in the progress test (PT), which has been implemented in various medical curricula, although its format, integration and feedback differ across institutions. Existing research on engagement with feedback in the context of PT is not sufficient to make a definitive judgement on what works and which barriers exist. Therefore, we conducted an interview study to explore students' feedback use in medical progress testing. METHODS: All Dutch medical students participate in a national, curriculum-independent PT four times a year. This mandatory test, composed of multiple-choice questions, provides students with written feedback on their scores. Furthermore, an answer key is available to review their answers. Semi-structured interviews were conducted with 21 preclinical and clinical medical students who participated in the PT. Template analysis was performed on the qualitative data using a priori themes based on previous research on feedback use. RESULTS: Template analysis revealed that students faced challenges in crucial internal psychological processes that impact feedback use, including 'awareness', 'cognizance', 'agency' and 'volition'. Factors such as stakes, available time, feedback timing and feedback presentation contributed to these difficulties, ultimately hindering feedback use. Notably, feedback engagement was higher during clinical rotations, and students were interested in the feedback when seeking insights into their performance level and career perspectives. CONCLUSION: Our study enhanced the understanding of students' feedback utilisation in medical progress testing by identifying key processes and factors that impact feedback use. By recognising and addressing barriers in feedback use, we can improve both student and teacher feedback literacy, thereby transforming the PT into a more valuable learning tool.


Subjects
Education, Medical, Undergraduate; Educational Measurement; Qualitative Research; Students, Medical; Humans; Students, Medical/psychology; Educational Measurement/methods; Male; Female; Netherlands; Interviews as Topic; Formative Feedback; Feedback; Curriculum; Clinical Competence