ABSTRACT
BACKGROUND: The aim of the study was to evaluate the effectiveness of a novel e-learning module in teaching the physical exam of the temporomandibular joint (TMJ) in juvenile idiopathic arthritis (JIA). METHODS: An e-learning module was developed to convey the TMJ physical examination maneuvers considered best practice in JIA. Pediatric rheumatology fellows were randomized to two groups: one group received an article describing the physical examination skills, while the second group received both the article and the module. All participants completed a written pre-test, an in-person objective structured clinical examination (OSCE), a written post-test, and a follow-up survey. RESULTS: Twenty-two pediatric rheumatology fellows enrolled, with 11 per group. Written test: the two groups improved equally, although there was a trend toward better definition of maximal incisal opening (MIO) in the module group. OSCE: the mean OSCE score was 11.1 (SD 3.3) in the article group and 13.5 (SD 1.9) in the module group (p = 0.06); significant differences were seen in measuring MIO (p = 0.01), calculating maximal unassisted mouth opening (MUMO; p = 0.01), and assessment of facial symmetry (p = 0.03), all favoring the module. Enjoyment scores were higher in the module group than in the article group (mean 7.7/10 vs. 5.9/10, p = 0.02). The two groups self-reported performing TMJ examinations at comparable rates three months after the intervention. CONCLUSIONS: The study demonstrated that a formalized educational program improved knowledge of the physical exam of the TMJ in JIA. Learners viewing the module were more adept at obtaining quantitative TMJ measurements.
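A minimal sketch of how the reported OSCE group means could be compared from the summary statistics alone; the abstract does not state which test produced p = 0.06, so the Welch t-test below is an assumption used purely for illustration.

```python
# Sketch: compare the two OSCE group means from reported summary statistics.
# The original test is not named in the abstract; Welch's t-test is assumed here.
from scipy import stats

t, p = stats.ttest_ind_from_stats(
    mean1=11.1, std1=3.3, nobs1=11,   # article group
    mean2=13.5, std2=1.9, nobs2=11,   # e-learning module group
    equal_var=False,                  # Welch correction for unequal SDs
)
print(f"t = {t:.2f}, p = {p:.3f}")    # on the order of the reported p = 0.06
```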
Subjects
Arthritis, Juvenile; Physical Examination; Temporomandibular Joint; Humans; Arthritis, Juvenile/diagnosis; Physical Examination/methods; Temporomandibular Joint/physiopathology; Rheumatology/education; Temporomandibular Joint Disorders/diagnosis; Female; Male; Clinical Competence; Educational Measurement/methods; Computer-Assisted Instruction/methods; Child
ABSTRACT
BACKGROUND: Nursing tests are considered a high-stakes method of evaluating students' mastery of learned concepts and a means to prepare students for the licensing examination. After observing a downward trend in nursing program graduates' NCLEX-RN pass rates and an increase in test question issues, and anticipating the NCLEX-RN's change to the Next Generation NCLEX (NGN) format in 2023, the department sought to increase test quality. METHOD: Using shared governance, faculty mentorship, and evidence-based test practices, a testing methodology and process were created by implementing a test review committee. RESULTS: Implementation of the test review committee demonstrated a significant improvement in the quality of tests administered, facilitated improved faculty test development, and increased student preparedness for the 2023 NGN examination, as demonstrated by a 100% pass rate. CONCLUSION: Although implementing the test review committee and changing test practice required continued faculty support and mentorship, the outcome was a standardized, quality testing process that demonstrated student success. [J Nurs Educ. 2024;63(10):679-685.].
Subjects
Education, Nursing, Baccalaureate; Educational Measurement; Humans; Educational Measurement/methods; Peer Review; Students, Nursing/statistics & numerical data; Students, Nursing/psychology; Nursing Education Research; Faculty, Nursing; Mentors
ABSTRACT
BACKGROUND: Artificial intelligence (AI) chatbots have demonstrated proficiency in structured knowledge assessments; however, there is limited research on their performance in scenarios involving diagnostic uncertainty, which requires careful interpretation and complex decision-making. This study aims to evaluate the efficacy of AI chatbots, GPT-4o and Claude-3, in addressing medical scenarios characterized by diagnostic uncertainty relative to Family Medicine residents. METHODS: Questions with diagnostic uncertainty were extracted from the Progress Tests administered by the Department of Family and Community Medicine at the University of Toronto between 2022 and 2023. Diagnostic uncertainty questions were defined as those presenting clinical scenarios where symptoms, clinical findings, and patient histories do not converge on a definitive diagnosis, necessitating nuanced diagnostic reasoning and differential diagnosis. These questions were administered to a cohort of 320 Family Medicine residents in their first (PGY-1) and second (PGY-2) postgraduate years and inputted into GPT-4o and Claude-3. Errors were categorized into statistical, information, and logical errors. Statistical analyses were conducted using a binomial generalized estimating equation model, paired t-tests, and chi-squared tests. RESULTS: Compared to the residents, both chatbots scored lower on diagnostic uncertainty questions (p < 0.01). PGY-1 residents achieved a correctness rate of 61.1% (95% CI: 58.4-63.7), and PGY-2 residents achieved 63.3% (95% CI: 60.7-66.1). In contrast, Claude-3 correctly answered 57.7% (n = 52/90) of questions, and GPT-4o correctly answered 53.3% (n = 48/90). Claude-3 had a longer mean response time (24.0 s, 95% CI: 21.0-32.5 vs. 12.4 s, 95% CI: 9.3-15.3; p < 0.01) and produced longer answers (2001 characters, 95% CI: 1845-2212 vs. 1596 characters, 95% CI: 1395-1705; p < 0.01) compared to GPT-4o. Most errors by GPT-4o were logical errors (62.5%). CONCLUSIONS: While AI chatbots like GPT-4o and Claude-3 demonstrate potential in handling structured medical knowledge, their performance in scenarios involving diagnostic uncertainty remains suboptimal compared to human residents.
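The study's primary analysis used a binomial generalized estimating equation model; the sketch below is only a naive, unadjusted comparison of the two chatbots' reported correctness counts, shown to make the proportions concrete rather than to reproduce the paper's model.

```python
# Illustrative sketch only: a naive two-proportion comparison of chatbot accuracy.
# The study itself used a binomial GEE to account for repeated questions, which
# this simple test does not do.
from statsmodels.stats.proportion import proportions_ztest

correct = [48, 52]   # GPT-4o, Claude-3 (reported correct counts)
total = [90, 90]     # questions answered by each chatbot
z, p = proportions_ztest(correct, total)
print(f"z = {z:.2f}, p = {p:.3f}")
```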
Assuntos
Inteligência Artificial , Competência Clínica , Internato e Residência , Humanos , Incerteza , Medicina de Família e Comunidade/educação , Avaliação Educacional/métodosRESUMO
Background: The creation of large language models (LLMs) such as ChatGPT is an important step in the development of artificial intelligence, and LLMs show great potential in medical education due to their powerful language understanding and generative capabilities. The purpose of this study was to quantitatively evaluate and comprehensively analyze ChatGPT's performance on nursing licensure examination questions from China and the United States, namely the National Nursing Licensure Examination (NNLE) and the National Council Licensure Examination for Registered Nurses (NCLEX-RN). Objective: This study aims to examine how well LLMs respond to NCLEX-RN and NNLE multiple-choice questions (MCQs) across different language inputs, to evaluate whether LLMs can serve as multilingual learning assistants for nursing, and to assess whether they possess a repository of professional knowledge applicable to clinical nursing practice. Methods: First, we compiled 150 NCLEX-RN Practical MCQs, 240 NNLE Theoretical MCQs, and 240 NNLE Practical MCQs. Then, the translation function of ChatGPT 3.5 was used to translate NCLEX-RN questions from English to Chinese and NNLE questions from Chinese to English. Finally, the original version and the translated version of the MCQs were inputted into ChatGPT 4.0, ChatGPT 3.5, and Google Bard. Accuracy rates were compared across LLMs and across language inputs. Results: The accuracy rates of ChatGPT 4.0 for NCLEX-RN practical questions and Chinese-translated NCLEX-RN practical questions were 88.7% (133/150) and 79.3% (119/150), respectively. Although this difference was statistically significant (P=.03), the correct rate was generally satisfactory. Around 71.9% (169/235) of NNLE Theoretical MCQs and 69.1% (161/233) of NNLE Practical MCQs were correctly answered by ChatGPT 4.0. The accuracy of ChatGPT 4.0 in processing NNLE Theoretical MCQs and NNLE Practical MCQs translated into English was 71.5% (168/235; P=.92) and 67.8% (158/233; P=.77), respectively, and there was no statistically significant difference between the results of text input in different languages. ChatGPT 3.5 (NCLEX-RN P=.003, NNLE Theoretical P<.001, NNLE Practical P=.12) and Google Bard (NCLEX-RN P<.001, NNLE Theoretical P<.001, NNLE Practical P<.001) had lower accuracy rates for nursing-related MCQs than ChatGPT 4.0 with English input. For ChatGPT 3.5, accuracy was higher with English input than with Chinese input, and the difference was statistically significant (NCLEX-RN P=.02, NNLE Practical P=.02). Whether the NCLEX-RN and NNLE MCQs were submitted in Chinese or English, ChatGPT 4.0 produced the highest number of unique correct responses and the lowest number of unique incorrect responses among the 3 LLMs. Conclusions: This study, focusing on 618 nursing MCQs including NCLEX-RN and NNLE exams, found that ChatGPT 4.0 outperformed ChatGPT 3.5 and Google Bard in accuracy. It excelled in processing English and Chinese inputs, underscoring its potential as a valuable tool in nursing education and clinical decision-making.
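One plausible way to compare accuracy on the same items under two language inputs is McNemar's test on paired per-item outcomes; the abstract does not state which test produced P=.03, and the per-item vectors below are placeholders, not the study's data.

```python
# Hedged sketch: McNemar's test on paired per-item correctness for the same 150
# questions under English vs. translated Chinese input. The item-level outcomes
# below are simulated placeholders matching only the reported overall rates.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
english_correct = rng.random(150) < 133 / 150   # placeholder: ~88.7% correct
chinese_correct = rng.random(150) < 119 / 150   # placeholder: ~79.3% correct

# 2x2 table: rows = English (correct, incorrect), cols = Chinese (correct, incorrect)
table = np.zeros((2, 2), dtype=int)
for e, c in zip(english_correct, chinese_correct):
    table[int(not e), int(not c)] += 1

print(mcnemar(table, exact=True))               # statistic and p-value
```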
Subjects
Educational Measurement; Licensure, Nursing; China; Humans; Licensure, Nursing/standards; Cross-Sectional Studies; United States; Educational Measurement/methods; Educational Measurement/standards; Artificial Intelligence
ABSTRACT
In competency-based medical education (CBME), which is being embraced globally, the patient-learner-educator encounter occurs in a highly complex context that contributes to a wide range of assessment outcomes. Current and historical barriers to considering context in assessment include the existing post-positivist epistemological stance that values objectivity and validity evidence over the variability introduced by context. This is most evident in standardized testing. While context has always been critical to medical education, its impact on assessment is becoming more pronounced as many aspects of training diversify. This diversity includes an expanding interest beyond individual trainee competence to include the interdependency and collective nature of clinical competence, and the growing awareness that medical education needs to be co-produced among a wider group of stakeholders. In this Eye Opener, we wish to consider: 1) how might we best account for the influence of context in the clinical competence assessment of individuals in medical education? and, by doing so, 2) how could we usher in the next era of assessment that improves our ability to meet the dynamic needs of society and all its stakeholders? The purpose of this Eye Opener is thus two-fold. First, we conceptualize, from a variety of viewpoints, how we might address context in the assessment of competence at the level of the individual learner. Second, we present recommendations that address how to approach the implementation of a more contextualized competence assessment.
Subjects
Clinical Competence; Competency-Based Education; Education, Medical; Humans; Education, Medical/methods; Clinical Competence/standards; Competency-Based Education/methods; Educational Measurement/methods
ABSTRACT
BACKGROUND: Generalizability theory is widely used in psychological and educational measurement. Budget and cost are problems that cannot be neglected in measurement. Under a budget constraint, generalizability theory must consider how to design a measurement program with relatively high reliability and feasibility, which requires the optimal sample size to be estimated by some means. The Lagrange multiplier method is commonly used for estimating the optimal sample size under budget constraints in generalizability theory. Unfortunately, to date, formulas for estimating the optimal sample size for more complex generalizability designs, such as those with four or more facets, have not been derived using the Lagrange multiplier method. PURPOSE: The purpose of this article is to provide a detailed, step-by-step derivation of the formulas for estimating the optimal sample size for three typical complex generalizability designs using the Lagrange multiplier method under budget constraints in generalizability theory, demonstrating new roles for the method. METHOD: This article derived the optimal sample sizes for teaching ability evaluation of college teachers under budget constraints in three generalizability designs, (s:t)×i, (s:t)×(i:v) and (s:t)×(i:v)×o, and explored their practical applications. By estimating the optimal sample size, the designs can be compared to identify the optimal generalizability design, which is more applicable in practice. FINDINGS: (1) Using the Lagrange multiplier method, the optimal sample sizes for students and items under budget constraints in different generalizability designs can be derived. (2) In an example based on teaching ability evaluation of college teachers in China, the designs (s:t)×i, (s:t)×(i:v) and (s:t)×(i:v)×o were used to obtain the optimal sample sizes, which indicates that the Lagrange multiplier method can be used in practice. (3) Under budget constraints, (s:t)×(i:v) is the optimal generalizability design, with an optimal sample size of 17 students for each teacher and 4 items for each dimension. CONCLUSION: The optimal sample size can be derived using the Lagrange multiplier method under budget constraints in generalizability theory. The Lagrange multiplier method is worth recommending for these new roles.
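To make the mechanics concrete, here is a minimal Lagrange-multiplier derivation for a deliberately simplified two-term error variance under a linear budget; the notation is assumed for illustration and this is not the paper's (s:t)×(i:v)×o derivation.

```latex
% Minimal sketch of the Lagrange-multiplier mechanics for a simplified design
% (assumed notation; not the paper's (s:t)x(i:v)xo derivation).
\documentclass{article}
\usepackage{amsmath}
\begin{document}
Minimize $\sigma^2_\delta(n_i,n_o)=\frac{\sigma^2_{pi}}{n_i}+\frac{\sigma^2_{po}}{n_o}$
subject to the budget constraint $c_i n_i + c_o n_o = C$. The Lagrangian is
\[
\mathcal{L}(n_i,n_o,\lambda)=\frac{\sigma^2_{pi}}{n_i}+\frac{\sigma^2_{po}}{n_o}
  +\lambda\,(c_i n_i + c_o n_o - C).
\]
Setting $\partial\mathcal{L}/\partial n_i=\partial\mathcal{L}/\partial n_o=0$ gives
$n_i=\sigma_{pi}/\sqrt{\lambda c_i}$ and $n_o=\sigma_{po}/\sqrt{\lambda c_o}$;
substituting these into the budget constraint and solving for $\lambda$ yields
\[
n_i^{*}=\frac{C\,\sigma_{pi}/\sqrt{c_i}}{\sigma_{pi}\sqrt{c_i}+\sigma_{po}\sqrt{c_o}},
\qquad
n_o^{*}=\frac{C\,\sigma_{po}/\sqrt{c_o}}{\sigma_{pi}\sqrt{c_i}+\sigma_{po}\sqrt{c_o}},
\]
after which the non-integer optima are rounded to feasible sample sizes.
\end{document}
```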
Subjects
Teaching; Sample Size; Humans; Universities; Faculty; Educational Measurement/methods
ABSTRACT
BACKGROUND: Clinical education is a crucial process in which students learn how to use theoretical knowledge in clinical settings. However, there is no standardized assessment tool routinely used to evaluate the clinical performance of physiotherapy students in Turkey. This study aimed to examine the psychometric (validity and reliability) and edumetric (acceptability, feasibility, and educational impact) properties of the Turkish version of the Assessment of Physiotherapy Practices (APP-T). METHODS: This methodological study included 7 clinical educators and 174 4th-grade physiotherapy students at three universities in Turkey. Students were assessed with the APP-T on completion of 4-week clinical placements. Construct validity was examined using exploratory factor analysis. Internal consistency was determined using Cronbach's α coefficient. Test-retest and inter-rater reliability were evaluated with the intra-class correlation coefficient (ICC). For the measurement error of the APP-T, standard error of measurement (SEM) and minimum detectable change (MDC) values were calculated. After the administration of the APP-T was completed, clinical educators were asked to provide feedback on the acceptability, applicability, and educational impact of the APP-T. RESULTS: Exploratory factor analysis demonstrated that the 20 items of the APP-T represent a single dominant dimension explaining 76.28% of the variance. Excellent test-retest and inter-rater reliability were indicated by ICC = 0.94 and ICC = 0.77, respectively, and good internal consistency by Cronbach's α = 0.935. The MDC90 values for test-retest and inter-rater reliability were 3.11 and 6.86, respectively. Clinical educators provided generally positive feedback on the acceptability, feasibility, and educational impact of the APP-T. CONCLUSIONS: The current findings provide evidence for universities and clinical educators that the APP-T has sufficient psychometric and educational properties for evaluating the clinical performance of physiotherapy students in Turkey.
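A short sketch of the SEM and MDC90 arithmetic typically used with ICCs of this kind; the abstract does not report the APP-T score SD, so the value below is a placeholder and only the formulas are the point.

```python
# Sketch of the standard SEM / MDC90 arithmetic commonly used in studies like
# this one. The APP-T score SD is not reported in the abstract; the value below
# is a hypothetical placeholder.
import math

sd_total = 10.0          # placeholder SD of total APP-T scores (hypothetical)
icc_test_retest = 0.94   # reported test-retest ICC
icc_inter_rater = 0.77   # reported inter-rater ICC

for label, icc in [("test-retest", icc_test_retest), ("inter-rater", icc_inter_rater)]:
    sem = sd_total * math.sqrt(1 - icc)      # standard error of measurement
    mdc90 = 1.645 * math.sqrt(2) * sem       # minimum detectable change at 90% confidence
    print(f"{label}: SEM = {sem:.2f}, MDC90 = {mdc90:.2f}")
```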
Subjects
Clinical Competence; Educational Measurement; Psychometrics; Humans; Turkey; Reproducibility of Results; Female; Male; Educational Measurement/methods; Clinical Competence/standards; Physical Therapy Modalities/education; Physical Therapy Modalities/standards; Adult; Physical Therapy Specialty/education
ABSTRACT
Background: Self-directed learning in the workplace should lead to the mastery of predefined learning objectives, with subjective competence judgements steering learning and promoting acceptance of feedback. Rotations should support self-directed learning in basic training by allowing junior physicians (JPs) to apply basic clinical competencies in various internal medicine and surgical departments. Aim: The study hypothesises that rotations support self-directed learning, as measured by self-judgements and supervisor judgements. Additionally, it describes JPs' willingness to reflect on their learning needs at the end of their basic training. Methods: This longitudinal study comprises 147 pseudonymised logbooks completed by JPs from three Vienna Healthcare Group (WIGEV) clinics. The logbook accompanies JPs' training, requiring them and their supervising specialist physicians to rate their training goal completion (10-level percentage scale) in training months 2, 3, 5, 6, 8 and 9. In addition, in months 3, 6 and 9, the JPs document the level of competence (knowledge, experience and proficiency) they feel they have achieved for each learning objective specified by the Austrian Medical Association (ÖÄK). Results: The self-judged level of training goal completion shows a multi-peaked distribution with an increasing trend; the supervisors' judgement of JPs' level of training goal completion runs almost parallel. The share of learning objectives for which the required level of competence is judged not yet mastered decreases throughout the training. In the 9th month of training, approximately one quarter of the JPs indicated a need to learn in ≥10% of the learning objectives, independent of the training clinic. Conclusions: After switching departments, JPs downgraded their rating of training goal completion. Rotation supports realistic self-judgement, as competencies must be applied and reassessed in a new context. Most JPs consider the required level of competence per learning objective to be mastered at the end of basic training, yet they remain prepared to reflect critically on their learning needs.
Subjects
Clinical Competence; Education, Medical, Graduate; Humans; Clinical Competence/standards; Austria; Education, Medical, Graduate/methods; Longitudinal Studies; Self-Directed Learning as Topic; Judgment; Female; Male; Internal Medicine/education; Educational Measurement/methods
ABSTRACT
The high performance of generative artificial intelligence (AI) and large language models (LLMs) in examination contexts has triggered an intense debate about their applications, effects and risks. What legal aspects need to be considered when using LLMs in teaching and assessment? What possibilities do language models offer? The use of LLMs is assessed against existing statutes and laws: university statutes, state higher education laws, licensing regulations for doctors, the Copyright Act (UrhG), the General Data Protection Regulation (GDPR), and the AI Regulation (EU AI Act). LLMs and AI offer opportunities but require clear university frameworks. These should define legitimate uses and areas where use is prohibited. Cheating and plagiarism violate good scientific practice and copyright law. Cheating is difficult to detect, and plagiarism by AI is possible; users of the products are responsible. LLMs are effective tools for generating exam questions. Nevertheless, careful review is necessary, as even apparently high-quality products may contain errors. However, the risk of copyright infringement with AI-generated exam questions is low, as copyright law allows up to 15% of protected works to be used for teaching and exams. The grading of exam content is subject to higher education laws and regulations and the GDPR; exclusively computer-based assessment without human review is not permitted. For high-risk applications in education, the EU's AI Regulation will apply in the future. When dealing with LLMs in assessments, evaluation criteria for existing assessments can be adapted, as can assessment programmes, e.g. to reduce the motivation to cheat. LLMs can also become the subject of the examination themselves. Teachers should undergo further training in AI and consider LLMs as an addition.
Subjects
Artificial Intelligence; Artificial Intelligence/legislation & jurisprudence; Humans; Copyright/legislation & jurisprudence; Plagiarism; Educational Measurement/methods; Education, Medical/legislation & jurisprudence; Language
ABSTRACT
Introduction: Clinical reasoning ability is one of the core competencies of physicians and should already be trained during undergraduate medical education. At University Medical Center Göttingen (UMG), medical students can participate in formative key feature examinations in which they work on virtual patient cases in order to apply and deepen the procedural knowledge acquired in lectures and seminars. Problem and objective: While this teaching format is already established in the medical curriculum at the UMG, it has not yet been implemented in the dental curriculum at the same institution. Therefore, the current project aimed to evaluate the feasibility of formative key feature examinations in dental education. Methods: In 2022, new key feature cases focusing on dental-surgical teaching content were created. For pilot testing, the new cases were worked on by two cohorts of dental students via an online learning platform in February 2023. The students were also asked to complete an anonymous online questionnaire in order to evaluate the new teaching format. Results: Overall, the formative key feature examinations were evaluated positively by the dental students, who asked for further dental key feature cases. However, descriptive analyses of item characteristics as well as students' comments in the questionnaire revealed some potential for improvement, so a few cases were partly revised afterwards. Conclusion and outlook: This project shows that formative key feature examinations are feasible in dental education and that dental students can benefit from working on virtual case scenarios. Whether dental students' clinical reasoning competence can be improved by completing formative key feature examinations is being investigated in an ongoing study at the UMG.
Subjects
Clinical Competence; Curriculum; Education, Dental; Educational Measurement; Humans; Education, Dental/methods; Educational Measurement/methods; Surveys and Questionnaires; Students, Dental; Teaching
ABSTRACT
Objectives: Teaching of ECG interpretation frequently relies on visual schemas. However, subsequent student ECG interpretation skills are often poor. Expertise research shows that expert looking patterns frequently deviate from the steps taught in schema learning. The present study made a cardiology expert's gaze while interpreting ECGs visible, through eye-tracking videos with cued retrospective reporting (CRR), and investigated its potential as an additional expert-driven route to improve medical students' ECG interpretation skills. Methods: N=91 medical students participated in the RCT of an ECG e-learning session aimed at medical students' ECG interpretation skill gain, either receiving the newly developed eye-tracking video with CRR audio commentary materials (n=47) or studying via four clinical cases only (n=44). Three outcome scores relating to different aspects of ECG interpretation skills were derived from pre-post MCQ ECG tests. The effect of the EYE-ECG training and additional characteristics (e.g., prior experience, interest) on student ECG interpretation skills was evaluated using t-tests and multivariate linear regression. Results: A small, non-significant advantage of the EYE-ECG training over training as usual was observed, indicating a tendency toward greater knowledge gain. In multivariate regression models, the predictive value of clinical case 1 was an unexpected finding warranting further exploration. Conclusion: Additional gains after only a 9-minute intervention using videos of an expert's real-time gaze pattern combined with hearing their thought processes during ECG interpretation are a promising finding. Furthermore, a number of specific performance characteristics enabling students to best benefit from ECG training were identified, and possible modifications to the learning intervention are suggested.
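A hedged sketch of a multivariate linear regression of post-test performance on group assignment and learner characteristics; the study's actual model specification is not given in the abstract, and the data frame and column names below are hypothetical.

```python
# Hypothetical sketch of the kind of multivariate linear regression described:
# post-test score regressed on group assignment and learner covariates.
# All data and column names are placeholders, not the study's data.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "post_score": [12, 15, 9, 14, 11, 16, 10, 13],   # placeholder outcomes
    "pre_score": [8, 11, 7, 10, 9, 12, 8, 9],
    "group": ["eye", "eye", "control", "eye", "control", "eye", "control", "control"],
    "prior_experience": [1, 0, 0, 1, 1, 1, 0, 0],
    "interest": [7, 9, 5, 8, 6, 9, 4, 6],
})

model = smf.ols("post_score ~ pre_score + C(group) + prior_experience + interest",
                data=df).fit()
print(model.summary())
```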
Subjects
Clinical Competence; Education, Medical, Undergraduate; Electrocardiography; Eye-Tracking Technology; Students, Medical; Humans; Electrocardiography/methods; Clinical Competence/standards; Students, Medical/statistics & numerical data; Education, Medical, Undergraduate/methods; Female; Male; Cues; Educational Measurement/methods; Cardiology/education; Video Recording; Adult; Retrospective Studies
ABSTRACT
Background: The medical workplace presents challenges for workplace-based learning. Structured debriefing of shared clinical experiences has been proposed as a way to take advantage of workplace-based learning in a setting that facilitates deep learning conversations. Objective: To investigate faculty and learner acceptance of private, face-to-face, structured debriefing of performance of entrustable professional activities (EPAs). Methods: During the 2020-2021 academic year, faculty at the University of Colorado (CU) and the University of Utah (UU) debriefed fellow performance of jointly selected EPAs in neonatal-perinatal medicine pertinent to shared 1- to 3-week clinical rotations. Private face-to-face debriefing was structured by a comprehensive EPA-specific list of behavioral anchors describing 3 levels of entrustment/accomplishment. Sessions ended with joint decisions as to level of entrustment/accomplishment and goals for improvement. We used thematic analysis of semistructured fellow interviews and faculty focus groups to identify themes illustrated with representative quotations. Results: We interviewed 17 fellows and 18 faculty. CU participants debriefed after clinical rotations; UU usually debriefed during rotations. Debriefing sessions for 1 to 2 EPAs lasted 20 to 40 minutes. Themes represented in fellow interviews and faculty focus groups suggested that debriefing facilitated formative feedback along with shared understanding of clinical performance and assessment criteria. The standardized format and private conversations supported assessment of aspects of performance for which review might otherwise have been overlooked or avoided. The conversations also provided valuable opportunities for formative discussion of other matters of importance to fellows. Conclusions: Structured debriefing of recently shared clinical experiences fostered formative assessment viewed positively by teachers and learners.
Subjects
Clinical Competence; Competency-Based Education; Faculty, Medical; Internship and Residency; Humans; Colorado; Competency-Based Education/methods; Utah; Education, Medical, Graduate; Focus Groups; Formative Feedback; Educational Measurement/methods
ABSTRACT
BACKGROUND: Although digital health is essential for improving health care, its adoption remains slow due to the lack of literacy in this area. Therefore, it is crucial for health professionals to acquire digital skills and for a digital competence assessment and accreditation model to be implemented to make advances in this field. OBJECTIVE: This study had two objectives: (1) to create a specific map of digital competences for health professionals and (2) to define and test a digital competence assessment and accreditation model for health professionals. METHODS: We took an iterative mixed methods approach, which included a review of the gray literature and consultation with local experts. We used the arithmetic mean and SD in descriptive statistics, P values in hypothesis testing and subgroup comparisons, the greatest lower bound in test diagnosis, and the discrimination index in study instrument analysis. RESULTS: The assessment model designed in accordance with the competence content defined in the map of digital competences and based on scenarios had excellent internal consistency overall (greatest lower bound=0.91). Although most study participants (110/122, 90.2%) reported an intermediate self-perceived digital competence level, we found that the vast majority would not attain a level-2 Accreditation of Competence in Information and Communication Technologies. CONCLUSIONS: Knowing the digital competence level of health professionals based on a defined competence framework should enable such professionals to be trained and updated to meet real needs in their specific professional contexts and, consequently, take full advantage of the potential of digital technologies. These results have informed the Health Plan for Catalonia 2021-2025, thus laying the foundations for creating and offering specific training to assess and certify the digital competence of such professionals.
Subjects
Accreditation; Health Personnel; Humans; Accreditation/standards; Health Personnel/education; Male; Female; Adult; Middle Aged; Surveys and Questionnaires; Educational Measurement/methods; Educational Measurement/standards
ABSTRACT
Recent national calls to transform undergraduate science education have centered on engaging students in scientific practices as a means to help them develop deeper insights into science. The three-dimensional framework for science education encapsulates the goals of these national calls by recommending that instructors integrate scientific practices, crosscutting concepts, and disciplinary core ideas throughout their courses. Prior research has found that introductory undergraduate biology exams contain few three-dimensional items, suggesting that instructors likely face barriers in meeting the goals of national calls. To better understand these putative challenges, we explored factors potentially associated with three-dimensional alignment. Our generalized linear mixed model indicated that instructors who used three-dimensional items on their exams were more likely to use Bloom's Taxonomy when designing their exams and were more likely to write these items using a constructed-response format. We also found that professional development opportunities did not necessarily change the likelihood that an instructor would employ three-dimensional items. We previously identified that few items in our sample fully aligned to scientific practices, making scientific practices the limiting dimension for three-dimensional alignment. Our subsequent analysis here revealed that most biology instructors had exam items that were at least partially aligned to scientific practices. Based on our results, we highlight the significant time and resources that instructors likely need to write and grade constructed-response assessments, suggest that instructors build on items that are mostly aligned to scientific practices to increase their three-dimensional alignment, and propose ways that professional development programs and communities might further support instructors in meeting national calls.
Subjects
Biology; Educational Measurement; Students; Humans; Biology/education; Educational Measurement/methods; Curriculum; Faculty; Universities
ABSTRACT
This article reports the results of an experiment conducted with ChatGPT to see how its performance compares to human performance on tests that require specific knowledge and skills, such as university admission tests. We chose a general undergraduate admission test and two tests for admission to biomedical programs: the Scholastic Assessment Test (SAT), the Cambridge BioMedical Admission Test (BMAT), and the Italian Medical School Admission Test (IMSAT). In particular, we looked closely at the difference in performance between ChatGPT-4 and its predecessor, ChatGPT-3.5, to assess its evolution. The performance of ChatGPT-4 showed a significant improvement over ChatGPT-3.5 and, compared to real students, was on average within the top 10% in the SAT test, while the score in the IMSAT test granted admission to the two highest ranked Italian medical schools. In addition to the performance analysis, we provide a qualitative analysis of incorrect answers and a classification of three different types of logical and computational errors made by ChatGPT-4, which reveal important weaknesses of the model. This provides insight into the skills needed to use these models effectively despite their weaknesses, and also suggests possible applications of our analysis in the field of education.
Subjects
Schools, Medical; Students, Medical; Humans; Educational Measurement/methods; School Admission Criteria; Education, Medical, Undergraduate; Italy; Male; Female; Prospective Studies
ABSTRACT
BACKGROUND: Objective Structured Clinical Examinations (OSCEs) aim to assess medical students' clinical skills rather than just their theoretical knowledge. We report a study of the relationship between the learning styles of second-cycle medical students and their performance in OSCEs. METHODS: During their internship at our hospital, students were invited to complete the LSQ-Fa, a questionnaire designed to identify their preferences among 4 learning styles: active, reflective, theoretical, and pragmatic. In parallel, an evaluation of their clinical skills was conducted through OSCEs. We then analysed the association between learning styles and OSCE performance to better understand how learning preferences influence academic outcomes. RESULTS: Between April 2021 and January 2023, 55 students were evaluated by OSCEs and completed the LSQ-Fa. The average OSCE score was 14.8/20. The theoretical learning style was predominant in 47.3% of students, followed by active (27.3%), reflective (14.6%), and pragmatic (5.4%), with 3 cases of mixed active and theoretical styles (5.4%). No significant association was observed between learning style and OSCE performance (p = 0.28), although students with a pragmatic style recorded a slightly higher average. CONCLUSION: This study highlights the importance of considering diverse learning styles in the design of medical education programs.
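One plausible way to test for an association between a categorical predominant learning style and OSCE scores is a one-way comparison across style groups; the abstract does not name the test behind p = 0.28, and the score lists below are placeholders.

```python
# Hedged sketch: comparing OSCE scores across predominant learning-style groups
# with a Kruskal-Wallis test. Scores below are placeholders, not the study's data.
from scipy import stats

osce_by_style = {
    "theoretical": [15.0, 14.5, 16.0, 13.5],   # placeholder scores out of 20
    "active":      [14.0, 15.5, 13.0],
    "reflective":  [15.5, 14.0],
    "pragmatic":   [16.5, 15.0],
}
h, p = stats.kruskal(*osce_by_style.values())
print(f"Kruskal-Wallis H = {h:.2f}, p = {p:.3f}")
```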
Subjects
Clinical Competence; Educational Measurement; Learning; Students, Medical; Humans; Clinical Competence/standards; Educational Measurement/methods; Students, Medical/psychology; Female; Male; Education, Medical, Undergraduate; Surveys and Questionnaires
ABSTRACT
Background: This research explores the capabilities of ChatGPT-4 in passing the American Board of Family Medicine (ABFM) Certification Examination. Addressing a gap in existing literature, where earlier artificial intelligence (AI) models showed limitations in medical board examinations, this study evaluates the enhanced features and potential of ChatGPT-4, especially in document analysis and information synthesis. Objective: The primary goal is to assess whether ChatGPT-4, when provided with extensive preparation resources and when using sophisticated data analysis, can achieve a score equal to or above the passing threshold for the Family Medicine Board Examinations. Methods: In this study, ChatGPT-4 was embedded in a specialized subenvironment, "AI Family Medicine Board Exam Taker," designed to closely mimic the conditions of the ABFM Certification Examination. This subenvironment enabled the AI to access and analyze a range of relevant study materials, including a primary medical textbook and supplementary web-based resources. The AI was presented with a series of ABFM-type examination questions, reflecting the breadth and complexity typical of the examination. Emphasis was placed on assessing the AI's ability to interpret and respond to these questions accurately, leveraging its advanced data processing and analysis capabilities within this controlled subenvironment. Results: In our study, ChatGPT-4's performance was quantitatively assessed on 300 practice ABFM examination questions. The AI achieved a correct response rate of 88.67% (95% CI 85.08%-92.25%) for the Custom Robot version and 87.33% (95% CI 83.57%-91.10%) for the Regular version. Statistical analysis, including the McNemar test (P=.45), indicated no significant difference in accuracy between the 2 versions. In addition, the chi-square test for error-type distribution (P=.32) revealed no significant variation in the pattern of errors across versions. These results highlight ChatGPT-4's capacity for high-level performance and consistency in responding to complex medical examination questions under controlled conditions. Conclusions: The study demonstrates that ChatGPT-4, particularly when equipped with specialized preparation and when operating in a tailored subenvironment, shows promising potential in handling the intricacies of medical board examinations. While its performance is comparable with the expected standards for passing the ABFM Certification Examination, further enhancements in AI technology and tailored training methods could push these capabilities to new heights. This exploration opens avenues for integrating AI tools such as ChatGPT-4 in medical education and assessment, emphasizing the importance of continuous advancement and specialized training in medical applications of AI.
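A sketch reproducing the reported confidence-interval and McNemar-style comparisons; the correct-answer counts are back-calculated from the reported percentages of 300 questions, and the discordant cells of the paired table are placeholders because the abstract does not report them.

```python
# Sketch: 95% CI for the Custom Robot accuracy and a paired comparison of the two
# ChatGPT-4 configurations. Counts are back-calculated from 88.67% and 87.33% of
# 300; the McNemar table's off-diagonal cells are placeholders.
import numpy as np
from statsmodels.stats.proportion import proportion_confint
from statsmodels.stats.contingency_tables import mcnemar

correct_custom, n = 266, 300                               # 88.67% of 300
low, high = proportion_confint(correct_custom, n, alpha=0.05, method="normal")
print(f"Custom Robot accuracy CI: {low:.4f}-{high:.4f}")   # ~0.8508-0.9225

table = np.array([[258, 8],    # both correct / only Custom correct (placeholder)
                  [4, 30]])    # only Regular correct / both wrong (placeholder)
print(mcnemar(table, exact=True))
```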
Subjects
Artificial Intelligence; Certification; Educational Measurement; Family Practice; Specialty Boards; Family Practice/education; Humans; Educational Measurement/methods; United States; Clinical Competence/standards
ABSTRACT
OBJECTIVES: The advent of artificial intelligence (AI) and large language model (LLM)-based AI applications (LLMAs) has tremendous implications for our society. This study analyzed the performance of LLMAs on solving restorative dentistry and endodontics (RDE) student assessment questions. MATERIALS AND METHODS: 151 questions from an RDE question pool were prepared for prompting using LLMAs from OpenAI (ChatGPT-3.5, -4.0 and -4.0o) and Google (Gemini 1.0). Multiple-choice questions were sorted into four question subcategories, entered into the LLMAs, and the answers recorded for analysis. P-value and chi-square statistical analyses were performed using Python 3.9.16. RESULTS: The total answer accuracy of ChatGPT-4.0o was the highest, followed by ChatGPT-4.0, Gemini 1.0 and ChatGPT-3.5 (72%, 62%, 44% and 25%, respectively), with significant differences between all LLMAs except between the GPT-4.0 models. Performance was highest on the subcategories direct restorations and caries, followed by indirect restorations and endodontics. CONCLUSIONS: Overall, there are large performance differences among LLMAs. Only the ChatGPT-4 models achieved a success ratio that could be used, with caution, to support the dental academic curriculum. CLINICAL RELEVANCE: While LLMAs could support clinicians in answering dental field-related questions, this capacity depends strongly on the employed model. The most performant model, ChatGPT-4.0o, achieved acceptable accuracy rates in some of the subject subcategories analyzed.
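A sketch of the kind of chi-square comparison described; the correct-answer counts are back-calculated from the rounded percentages of 151 questions and are therefore approximate.

```python
# Sketch of a chi-square comparison of per-model accuracy on 151 questions.
# Correct counts are back-calculated from the rounded percentages (approximate).
import numpy as np
from scipy.stats import chi2_contingency

n = 151
approx_correct = {"ChatGPT-4.0o": round(0.72 * n), "ChatGPT-4.0": round(0.62 * n),
                  "Gemini 1.0": round(0.44 * n), "ChatGPT-3.5": round(0.25 * n)}
table = np.array([[c, n - c] for c in approx_correct.values()])   # correct / incorrect
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p:.2e}")
```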
Subjects
Artificial Intelligence; Endodontics; Humans; Endodontics/education; Education, Dental/methods; Educational Measurement/methods; Students, Dental; Dentistry, Operative/education; Clinical Competence; Surveys and Questionnaires
ABSTRACT
Undergraduates are trained in the basic neonatal resuscitation programme (NRP) and evaluated for skill acquisition by the NRP performance evaluation test (PET). Video use improves the validity of assessment when video-information adequacy, intrusiveness to students, and educational purpose are adequately balanced. We evaluated whether there was a difference between the instructor's real-time assessment and a video-based assessment performed by another independent assessor using videos recorded via the SimCapture application. Undergraduates were trained in basic neonatal resuscitation and were evaluated at the end for skill acquisition by the PET as part of a randomized controlled trial (RCT). Video recording of the PET assessment was done with the SimCapture app. Independent assessors evaluated the recorded videos in order to validate the instructors' real-time performance assessments. We conducted an agreement study to evaluate the difference between the real-time instructor assessments and the video-based assessments. We trained 143 students; 139 videos were evaluated. The Bland-Altman plot showed good agreement. For the post-test result, 66 (47.5%) passed and 73 (52.5%) failed according to the real-time assessment, while 61 (43.8%) passed and 78 (56.1%) failed according to the video assessment. The agreement between the two assessments was 94.9%, with a kappa value of 0.898. For indicating the need for positive pressure ventilation (PPV), 138 (99.3%) performed correctly according to the real-time assessment, while 133 (95.6%) performed correctly according to the video assessment, with an agreement of 96.4% but a low kappa of 0.27. We conclude that the instructors' real-time assessments were excellent and did not differ from the assessments obtained from the recorded videos.
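A worked illustration of why agreement can be high (96.4%) while kappa is low (0.27) for the PPV item: when nearly all students perform the step correctly, chance agreement is already very high. The 2×2 cell counts below are back-calculated to be consistent with the reported marginals and agreement; they are not taken directly from the study.

```python
# Worked example of Cohen's kappa for a highly skewed item. Cell counts are
# back-calculated from the reported marginals (138 vs. 133 of 139 correct) and
# the 96.4% agreement; they are placeholders consistent with, not copied from,
# the study.
import numpy as np

# rows: real-time (correct, incorrect); cols: video (correct, incorrect)
table = np.array([[133, 5],
                  [0, 1]], dtype=float)
n = table.sum()
po = np.trace(table) / n                           # observed agreement (~0.96)
pe = (table.sum(1) * table.sum(0)).sum() / n**2    # chance agreement (~0.95)
kappa = (po - pe) / (1 - pe)                       # ~0.28, close to the reported 0.27
print(f"observed = {po:.3f}, chance = {pe:.3f}, kappa = {kappa:.2f}")
```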