Results 1 - 19 of 19
1.
Article in English | MEDLINE | ID: mdl-37665413

ABSTRACT

Recent advances in automated scoring technology have made it practical to replace multiple-choice questions (MCQs) with short-answer questions (SAQs) in large-scale, high-stakes assessments. However, most previous research comparing these formats has used small examinee samples testing under low-stakes conditions, and previous studies have not reported the time required to respond to the two item types. This study compares the difficulty, discrimination, and time requirements of the two formats when examinees responded as part of a large-scale, high-stakes assessment. Seventy-one MCQs were converted to SAQs, and the matched items were randomly assigned to examinees completing a high-stakes assessment of internal medicine; no examinee saw the same item in both formats. Items administered in the SAQ format were generally more difficult than the same items in the MCQ format. The discrimination index for SAQs was modestly higher than that for MCQs, and response times were substantially longer for SAQs. These results support the interchangeability of MCQs and SAQs; when it is important that the examinee generate the response rather than select it, SAQs may be preferred. The results for difficulty and discrimination are consistent with those of previous studies. The results for relative time requirements suggest that, with a fixed testing time, fewer SAQs can be administered, and this limitation may more than offset the higher discrimination that has been reported for SAQs. We additionally examine the extent to which the increased difficulty may directly impact the discrimination of SAQs.
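The time-versus-discrimination trade-off in the conclusion can be made concrete with the Spearman-Brown prophecy formula, a standard psychometric identity rather than a calculation from this study's data. If a single item yields reliability $\rho_1$, a test of $k$ such items yields

$$\rho_k = \frac{k\,\rho_1}{1 + (k - 1)\,\rho_1}.$$

If each SAQ takes, say, twice as long to answer as an MCQ, a fixed testing time halves $k$, and the reliability lost to the smaller item count can exceed the gain from each SAQ's modestly higher discrimination.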

2.
J Biomed Inform ; 98: 103268, 2019 Oct.
Article in English | MEDLINE | ID: mdl-31421211

ABSTRACT

OBJECTIVE: The assessment of written medical examinations is a tedious and expensive process, requiring significant amounts of time from medical experts. Our objective was to develop a natural language processing (NLP) system that can expedite the assessment of unstructured answers in medical examinations by automatically identifying relevant concepts in examinee responses. MATERIALS AND METHODS: Our NLP system, the Intelligent Clinical Text Evaluator (INCITE), is semi-supervised. Learning from a limited set of fully annotated examples, it sequentially applies a series of customized text comparison and similarity functions to determine whether a text span represents an entry in a given reference standard. Combinations of fuzzy matching and set intersection-based methods capture both inexact matches and fragmented concepts. Customizable, dynamic similarity-based matching thresholds allow the system to be tailored to examinee responses of different lengths. RESULTS: INCITE achieved an average F1-score of 0.89 (precision = 0.87, recall = 0.91) against human annotations on held-out evaluation data. Fuzzy text matching, dynamic thresholding, and the incorporation of supervision from annotated data produced the largest performance gains. DISCUSSION: Long and non-standard expressions are difficult for INCITE to detect, but the problem is mitigated by dynamic thresholding (i.e., varying the similarity threshold a text span must meet to count as a match). Annotation variations within exams and disagreements between annotators were the primary causes of false positives. Small amounts of annotated data can significantly improve system performance. CONCLUSIONS: The high performance and interpretability of INCITE should substantially aid the assessment process and help mitigate the impact of manual assessment inconsistencies.
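As a rough illustration of the matching strategy the abstract describes (fuzzy similarity plus set intersection, with a length-dependent threshold), here is a minimal Python sketch. The functions and threshold values are assumptions for illustration, not INCITE's actual components or parameters:

```python
from difflib import SequenceMatcher


def fuzzy_score(span: str, concept: str) -> float:
    """Character-level similarity between a response span and a reference concept."""
    return SequenceMatcher(None, span.lower(), concept.lower()).ratio()


def token_overlap(span: str, concept: str) -> float:
    """Set-intersection score: fraction of the concept's tokens found in the span."""
    concept_tokens = set(concept.lower().split())
    if not concept_tokens:
        return 0.0
    return len(set(span.lower().split()) & concept_tokens) / len(concept_tokens)


def dynamic_threshold(span: str) -> float:
    """Demand stricter fuzzy matches for short spans, looser ones for long spans."""
    return 0.9 if len(span.split()) <= 3 else 0.75


def matches(span: str, concept: str) -> bool:
    """A span counts as evidence of the concept if either matcher clears its bar."""
    return (fuzzy_score(span, concept) >= dynamic_threshold(span)
            or token_overlap(span, concept) >= 0.8)


# A concept embedded in a longer, differently worded response still matches
# via token overlap even though the character-level similarity is low.
print(matches("pt reports severe photophobia", "photophobia"))  # True
```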


Subjects
Education, Medical/methods, Education, Medical/standards, Educational Measurement/methods, Licensure, Medical/standards, Natural Language Processing, Schools, Medical, Algorithms, Clinical Competence/standards, Data Collection, Data Curation/methods, Fuzzy Logic, Humans, Medical Records, Pattern Recognition, Automated, Reproducibility of Results, Software, Unified Medical Language System
3.
BMC Med Educ ; 19(1): 389, 2019 Oct 23.
Article in English | MEDLINE | ID: mdl-31647012

ABSTRACT

BACKGROUND: Examinees often believe that changing answers will lower their scores; however, empirical studies suggest that allowing examinees to change responses may improve their performance on classroom assessments. To date, no studies have examined answer changes during large-scale professional credentialing or licensing examinations. METHODS: In this study, we expand the research on answer changing by analyzing responses from 27,830 examinees who completed the Step 2 Clinical Knowledge (CK) examination between August 2015 and August 2016. RESULTS: The results showed that although 68% of examinees changed at least one item, the overall average number of changes was small. Among examinees who changed answers, approximately 45% increased their scores and approximately 28% decreased their scores. On average, examinees spent the least time on changes from wrong to right, and they were more likely to change answers from wrong to right than from right to wrong. CONCLUSIONS: Consistent with previous studies, these findings support the beneficial effects of answer changing in high-stakes medical examinations and suggest that examinees who are overly cautious about changing answers may put themselves at a disadvantage.
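A minimal Python sketch of the tally such an analysis implies; the data layout and field names are illustrative assumptions, not the study's actual format:

```python
# Classify each answer change by whether it moved the response toward or
# away from the key. Data layout is illustrative.
from collections import Counter


def classify_changes(changes, key):
    """changes: list of (item_id, first_answer, final_answer) tuples.
    key: dict mapping item_id to the correct answer."""
    tally = Counter()
    for item_id, first, final in changes:
        was_right = first == key[item_id]
        is_right = final == key[item_id]
        if not was_right and is_right:
            tally["wrong_to_right"] += 1
        elif was_right and not is_right:
            tally["right_to_wrong"] += 1
        else:
            tally["wrong_to_wrong"] += 1
    return tally


key = {"q1": "B", "q2": "D", "q3": "A"}
changes = [("q1", "A", "B"), ("q2", "D", "C"), ("q3", "C", "D")]
print(classify_changes(changes, key))
# Counter({'wrong_to_right': 1, 'right_to_wrong': 1, 'wrong_to_wrong': 1})
```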


Subjects
Clinical Competence/standards, Educational Measurement/statistics & numerical data, Licensure, Medical/standards, Students, Medical/statistics & numerical data, Health Knowledge, Attitudes, Practice, Humans, Licensure, Medical/trends, Task Performance and Analysis
4.
Med Care ; 55(4): 436-441, 2017 Apr.
Article in English | MEDLINE | ID: mdl-27906769

ABSTRACT

OBJECTIVE: The objective of this study was to identify modifiable factors that improve the reliability of ratings of the severity of health care-associated harm in clinical practice improvement and research. METHODS: A diverse group of clinicians rated 8 types of adverse events: blood product; device or medical/surgical supply; fall; health care-associated infection; medication; perinatal; pressure ulcer; and surgery. We used a generalizability theory framework to estimate the impact of the number of raters, rater experience, and rater provider type on reliability. RESULTS: Pharmacists were slightly more precise and consistent in their ratings than either physicians or nurses. For example, to achieve a high reliability of 0.83, 3 physicians could be replaced by 2 pharmacists without loss in precision of measurement. If only 1 rater were available, approximately 5% of the reviews for severe harm would have been incorrectly categorized. Reliability was greatly improved with 2 reviewers. CONCLUSIONS: We identified factors that influence the reliability of clinician reviews of health care-associated harm. Our novel use of generalizability analyses improved our understanding of how rater differences affect reliability. This approach was useful in optimizing resource utilization when selecting raters to assess harm and may have similar applications in other settings in health care.
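The rater trade-off in the example follows from the standard generalizability (decision-study) relation; the formula below is the textbook expression, not one reported in the paper. For a mean rating over $n_r$ raters,

$$G(n_r) = \frac{\sigma^2_p}{\sigma^2_p + \dfrac{\sigma^2_{pr} + \sigma^2_e}{n_r}},$$

where $\sigma^2_p$ is the variance among the events being rated and $\sigma^2_{pr} + \sigma^2_e$ is rater-linked error. A rater group with smaller error variance (here, pharmacists) reaches the same target $G$, such as 0.83, with fewer raters.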


Assuntos
Atitude do Pessoal de Saúde , Efeitos Colaterais e Reações Adversas Relacionados a Medicamentos , Redução do Dano , Erros Médicos/estatística & dados numéricos , Revisão por Pares , Humanos , Doença Iatrogênica , Estudos Prospectivos , Reprodutibilidade dos Testes , Estados Unidos
5.
Acad Med ; 2024 May 15.
Article in English | MEDLINE | ID: mdl-38753971

ABSTRACT

PROBLEM: Many non-workplace-based assessments do not provide good evidence of a learner's problem representation or ability to provide a rationale for a clinical decision they have made. Exceptions include assessment formats that require resource-intensive administration and scoring. This article reports on research toward building a scalable non-workplace-based assessment format specifically developed to capture evidence of a learner's ability to justify a clinical decision they have made. APPROACH: The authors developed a 2-step item format called SHARP (SHort Answer, Rationale Provision), named for the 2 tasks that make up the item. In collaboration with physician-educators, the authors integrated short-answer questions into a patient medical record-based item starting in October 2021 and arrived at an innovative item format in December 2021. In this format, a test-taker interprets patient medical record data to make a clinical decision, types in their response, and pinpoints the medical record details that justify their answer. In January 2022, a total of 177 fourth-year medical students, representing 20 U.S. medical schools, completed 35 SHARP items in a proof-of-concept study. OUTCOMES: Primary outcomes were item timing, difficulty, reliability, and scoring ease. There was substantial variability in item difficulty, with the average item answered correctly by 44% of students (range, 4%-76%). The estimated reliability (Cronbach α) of the set of SHARP items was 0.76 (95% CI, 0.70-0.80). Item scoring is fully automated, minimizing resource requirements. NEXT STEPS: A larger study is planned to gather additional validity evidence about the item format. That study will allow comparisons between performance on SHARP items and other examinations, examination of group differences in performance, and exploration of possible use cases for formative assessment. Cognitive interviews are also planned to better understand the thought processes of medical students as they work through SHARP items.
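For reference, the reported reliability is Cronbach's α; for $k$ items with item-score variances $\sigma^2_i$ and total-score variance $\sigma^2_X$, the standard formula is

$$\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} \sigma^2_i}{\sigma^2_X}\right),$$

here computed over the $k = 35$ SHARP items.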

6.
Diagnosis (Berl) ; 10(1): 54-60, 2023 Feb 01.
Article in English | MEDLINE | ID: mdl-36409593

ABSTRACT

In this op-ed, we discuss the advantages of leveraging natural language processing (NLP) in the assessment of clinical reasoning. Clinical reasoning is a complex competency that cannot be easily assessed using multiple-choice questions. Constructed-response assessments can more directly measure important aspects of a learner's clinical reasoning ability, but substantial resources are necessary for their use. We provide an overview of INCITE, the Intelligent Clinical Text Evaluator, a scalable NLP-based computer-assisted scoring system that was developed to measure clinical reasoning ability as assessed in the written documentation portion of the now-discontinued USMLE Step 2 Clinical Skills examination. We provide the rationale for building a computer-assisted scoring system that is aligned with the intended use of an assessment. We show how INCITE's NLP pipeline was designed with transparency and interpretability in mind, so that every score produced by the computer-assisted system could be traced back to the text segment it evaluated. We next suggest that, as a consequence of INCITE's transparency and interpretability features, the system may easily be repurposed for formative assessment of clinical reasoning. Finally, we provide the reader with the resources to consider in building their own NLP-based assessment tools.


Subjects
Clinical Competence, Natural Language Processing, Humans, Headache, Clinical Reasoning
7.
Acad Med ; 94(3): 314-316, 2019 Mar.
Article in English | MEDLINE | ID: mdl-30540567

ABSTRACT

The United States Medical Licensing Examination Step 2 Clinical Skills (CS) exam uses physician raters to evaluate patient notes written by examinees. In this Invited Commentary, the authors describe the ways in which the Step 2 CS exam could benefit from adopting a computer-assisted scoring approach that combines physician raters' judgments with computer-generated scores based on natural language processing (NLP). Since 2003, the National Board of Medical Examiners has researched NLP technology to determine whether it offers the opportunity to mitigate challenges associated with human raters while continuing to capitalize on the judgment of physician experts. The authors discuss factors to consider before computer-assisted scoring is introduced into a high-stakes licensure exam context. They suggest that combining physician judgments and computer-assisted scoring can enhance and improve performance-based assessments in medical education and medical regulation.


Subjects
Clinical Competence/standards, Education, Medical, Undergraduate/methods, Educational Measurement/methods, Humans, Licensure, Medical, Natural Language Processing, United States
8.
Acad Med ; 82(10 Suppl): S101-4, 2007 Oct.
Article in English | MEDLINE | ID: mdl-17895671

ABSTRACT

BACKGROUND: Systematic trends in examinee performance across the testing day (sequence effects) could indicate that artifacts of the testing situation have an impact on scores. This research investigated the presence of sequence effects for United States Medical Licensing Examination (USMLE) Step 2 clinical skills (CS) examination components. METHOD: Data from Step 2 CS examinees were analyzed using analysis of covariance and hierarchical linear modeling procedures. RESULTS: Sequence was significant for three of the components: communication and interpersonal skills, data gathering, and documentation. A significant gender × sequence interaction was found for two components. CONCLUSIONS: The presence of sequence effects suggests that scores on early cases are influenced by factors unrelated to the proficiencies of interest. More research is needed to fully understand these effects.
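A minimal sketch of the kind of hierarchical linear model the method section implies, written with statsmodels; the file name, column names, and exact fixed-effect structure are illustrative assumptions, not the study's specification:

```python
# Hierarchical linear model: does a case's position in the testing day
# (sequence) predict its component score, with examinee as the grouping
# factor? Column names are illustrative.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("cs_scores.csv")  # columns: examinee_id, sequence, gender, score

# Random intercept per examinee; fixed effects for sequence and the
# gender x sequence interaction reported in the abstract.
model = smf.mixedlm("score ~ sequence * gender", data=df, groups=df["examinee_id"])
result = model.fit()
print(result.summary())
```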


Subjects
Clinical Competence/standards, Educational Measurement/methods, Faculty, Medical, Licensure, Medical, Students, Medical, Communication, Female, Humans, Interpersonal Relations, Linear Models, Male, Sex Factors, United States
9.
Acad Med ; 81(10 Suppl): S21-4, 2006 Oct.
Article in English | MEDLINE | ID: mdl-17001128

ABSTRACT

BACKGROUND: This research examined the relationships among scores from the United States Medical Licensing Examination (USMLE) Step 1, Step 2 Clinical Knowledge (CK), and the subcomponents of the Step 2 Clinical Skills (CS) examination. METHOD: Correlations and failure rates were produced for first-time takers who tested during the first year of Step 2 CS administration (June 2004 to July 2005). RESULTS: True-score correlations were high between the patient note (PN) and data gathering (DG), moderate between communication and interpersonal skills and DG, and low between the remaining score pairs. There was little overlap between examinees failing Step 2 CK and those failing the different components of Step 2 CS. CONCLUSION: Results suggest that combining DG and PN scores into a single composite score is reasonable and that relatively little redundancy exists between Step 2 CK and CS scores.
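The "true-score correlations" reported here are observed correlations corrected for measurement error via the standard disattenuation formula (shown for orientation; the reliability estimates themselves are in the paper):

$$r_{T_X T_Y} = \frac{r_{XY}}{\sqrt{r_{XX}\, r_{YY}}},$$

where $r_{XY}$ is the observed correlation between two scores and $r_{XX}$, $r_{YY}$ are their reliabilities.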


Subjects
Clinical Competence, Interpersonal Relations, Language, Licensure, Medical, Communication, Foreign Medical Graduates, Humans, United States
10.
Acad Med ; 81(10 Suppl): S56-60, 2006 Oct.
Article in English | MEDLINE | ID: mdl-17001137

ABSTRACT

BACKGROUND: Multivariate generalizability analysis was used to investigate the performance of a commonly used clinical evaluation tool. METHOD: Practicing physicians were trained to use the mini-Clinical Evaluation Exercise (mini-CEX) rating form to rate performances from the United States Medical Licensing Examination Step 2 Clinical Skills examination. RESULTS: Differences in rater stringency made the greatest contribution to measurement error; having more raters rate each examinee, even on fewer occasions, could enhance score stability. Substantial correlated error across the competencies suggests that decisions about one scale unduly influence decisions about the others. CONCLUSIONS: Given the appearance of a halo effect across competencies, score interpretations that assume assessment of distinct dimensions of clinical performance should be made with caution. If the intention is to produce a single composite score by combining results across competencies, the presence of these effects may be less critical.


Subjects
Clinical Competence/standards, Educational Measurement/methods, Physical Examination/methods, Software, Analysis of Variance, Humans, Interviews as Topic
11.
Acad Med ; 79(10 Suppl): S62-4, 2004 Oct.
Article in English | MEDLINE | ID: mdl-15383392

ABSTRACT

PURPOSE: Operational USMLE computer-based case simulation results were examined to determine the extent to which rater reliability and regression model performance met expectations based on preoperational data. METHOD: Operational data came from Step 3 examinations given between 1999 and 2004. Plots of reliability and multiple correlation coefficients were produced. RESULTS: Operational reliabilities increased over the four years but remained lower than the preoperational reliability. Multiple correlation coefficients were somewhat superior to those reported during the preoperational period, suggesting that the operational scoring algorithms have been relatively consistent. CONCLUSIONS: Changes in the rater population, changes in the rating task, and enhancements to the training procedures are several factors that could explain the identified differences between preoperational and operational results. The present findings have important implications for test development and test validity.


Subjects
Clinical Competence, Computer Simulation, Education, Medical, Educational Measurement/methods, Licensure, Medical, Algorithms, Educational Measurement/statistics & numerical data, Humans, Observer Variation, Regression Analysis, Reproducibility of Results
12.
Acad Med ; 78(10 Suppl): S27-9, 2003 Oct.
Article in English | MEDLINE | ID: mdl-14557087

ABSTRACT

PURPOSE: To examine the relationship between performance on a large-scale clinical skills examination (CSE) and a high-stakes multiple-choice examination. METHOD: Two samples were used: (1) 6,372 first-time-taker international medical graduates (IMGs) and (2) 858 fourth-year U.S. medical students. Ninety-seven percent of the IMGs and 70% of the U.S. students had completed Step 2. Correlations were calculated, scatter plots produced, and regression lines estimated. RESULTS: Correlations between the CSE and Step 2 ranged from .16 to .38. The observed relationship between scores confirms that CSE score information is not redundant with MCQ score information, and this result was consistent across samples. CONCLUSIONS: Results suggest that the CSE assesses proficiencies distinct from those assessed by the current USMLE components and therefore provide evidence justifying its inclusion in the medical licensure process.


Subjects
Clinical Competence/statistics & numerical data, Educational Measurement, Licensure, Medical/statistics & numerical data, Foreign Medical Graduates/statistics & numerical data, Humans, Regression Analysis, Students, Medical/statistics & numerical data, United States
15.
Acad Med ; 84(10 Suppl): S86-9, 2009 Oct.
Article in English | MEDLINE | ID: mdl-19907395

ABSTRACT

BACKGROUND: In clinical skills assessment, closely related skills are often combined to form a composite score; for example, history-taking and physical examination scores are typically combined. Interestingly, there is relatively little research to support this practice. METHOD: Multivariate generalizability theory was employed to examine the relationship between history-taking and physical examination scores from the United States Medical Licensing Examination (USMLE) Step 2 Clinical Skills examination. These two proficiencies are currently combined into a single data-gathering score. RESULTS: The physical examination score is less generalizable than the history-taking score, and there is only a modest to moderate relationship between the two proficiencies. CONCLUSIONS: A decision about combining physical examination and history-taking proficiencies into one composite score, as well as the weighting of these components, should be driven by the intended use of the score. The choice of weights makes a substantial difference in the precision of the resulting score.
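The weighting point in the conclusion can be made explicit. Assuming uncorrelated errors for simplicity (an assumption for illustration, not a result from the paper), a composite that puts weight $w$ on history taking (H) and $1 - w$ on physical examination (P) has error variance

$$\sigma^2_E(w) = w^2 \sigma^2_{E_H} + (1 - w)^2 \sigma^2_{E_P},$$

so when $\sigma^2_{E_P}$ is large, as for the less generalizable physical examination score, shifting weight toward H increases the precision of the composite.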


Assuntos
Competência Clínica , Avaliação Educacional , Licenciamento em Medicina , Anamnese , Exame Físico , Análise Multivariada , Estados Unidos
16.
Acad Med ; 84(10 Suppl): S79-82, 2009 Oct.
Article in English | MEDLINE | ID: mdl-19907393

ABSTRACT

BACKGROUND: The 2000 Institute of Medicine report on patient safety brought renewed attention to the issue of preventable medical errors, and specialty boards and the National Board of Medical Examiners were subsequently encouraged to play a role in setting expectations around safety education. This paper examines potentially dangerous actions taken by examinees during the portion of the United States Medical Licensing Examination Step 3 that is particularly well suited to evaluating lapses in physician decision making: the Computer-based Case Simulation (CCS). METHOD: Descriptive statistics and a general linear modeling approach were used to analyze dangerous actions ordered by 25,283 examinees who completed the CCS for the first time between November 2006 and January 2008. RESULTS: More than 20% of examinees ordered at least one dangerous action with the potential to cause significant patient harm. The propensity to order dangerous actions appears to vary across clinical cases. CONCLUSIONS: The CCS format may provide a means of collecting important information about the patient-care situations in which examinees are more likely to commit dangerous actions and about examinees' propensity to order dangerous tests and treatments.


Subjects
Clinical Competence, Computer-Assisted Instruction, Educational Measurement, Licensure, Medical, Medical Errors, Risk Assessment, United States
17.
Acad Med ; 84(10 Suppl): S97-100, 2009 Oct.
Article in English | MEDLINE | ID: mdl-19907399

ABSTRACT

BACKGROUND: Documentation is a subcomponent of the Step 2 Clinical Skills Examination Integrated Clinical Encounter (ICE) component, in which licensed physicians rate examinees on their ability to communicate the findings of the patient encounter, the diagnostic impression, and the initial patient work-up. The main purpose of this research was to examine the impact of modifications to the scoring rubric and rater training protocol on the psychometric characteristics of the documentation scores. METHOD: Following the modifications, the variance structure of the ICE components was modeled using multivariate generalizability theory. RESULTS: The results confirmed the expectation that true-score variance for the documentation subcomponent would increase after adopting the modified training protocol and the more specific rubric. CONCLUSIONS: In general, the results support the commonsense assumption that providing raters with detailed rubrics and comprehensive training improves measurement outcomes. Although the steps taken here were in the right direction, there remains room for improvement, and efforts are currently under way to further refine both the scoring rubrics and rater training.


Subjects
Clinical Competence, Educational Measurement/methods, Educational Measurement/standards, Licensure, Medical, United States
18.
Acad Med ; 83(10 Suppl): S45-8, 2008 Oct.
Article in English | MEDLINE | ID: mdl-18820499

ABSTRACT

BACKGROUND: As with any examination using human raters, human subjectivity may introduce measurement error. An examinee's performance might be scored differently depending on the quality of the preceding performance(s) (contrast effects). This research investigated the presence of contrast effects, within and across test sessions, for the communication and interpersonal skills component of the United States Medical Licensing Examination Step 2 Clinical Skills (CS) examination. METHOD: Data from Step 2 CS examinees were analyzed using hierarchical and general linear modeling procedures. RESULTS: The contrast effect was significant for the communication and interpersonal skills score, both within and across test sessions, and had a nontrivial impact on the overall score. CONCLUSIONS: The presence of contrast effects suggests that an examinee's scores are influenced by the performance of other examinees. More research is needed to fully understand these effects.


Subjects
Clinical Competence, Licensure, Medical, Cohort Studies, Communication, Female, Humans, Linear Models, Male, Observer Variation, Physical Examination, Physician-Patient Relations, Retrospective Studies, United States
19.
Acad Med ; 83(10 Suppl): S41-4, 2008 Oct.
Article in English | MEDLINE | ID: mdl-18820498

ABSTRACT

BACKGROUND: This research examined various sources of measurement error in the documentation score component of the United States Medical Licensing Examination (USMLE) Step 2 Clinical Skills examination. METHOD: A generalizability theory framework was employed to examine the documentation ratings of 847 examinees who completed the USMLE Step 2 Clinical Skills examination during an eight-day period in 2006. Each patient note was scored by two different raters, allowing for a persons-crossed-with-raters-nested-in-cases design. RESULTS: The results suggest that inconsistent performance on the part of raters makes a substantially greater contribution to measurement error than case specificity, and that double scoring the notes significantly increases precision. CONCLUSIONS: The results provide guidance for improving operational scoring of the patient notes. Double scoring may increase the precision of measurement as much as lengthening the test by more than 50%. The study also cautions researchers that, when examining sources of measurement error, inappropriate data-collection designs may result in inaccurate inferences.
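For reference, in the persons-crossed-with-raters-nested-in-cases design, written $p \times (r:c)$, the relative error variance of an examinee's mean score over $n_c$ cases with $n_r$ raters per case is the standard G-theory expression

$$\sigma^2_\delta = \frac{\sigma^2_{pc}}{n_c} + \frac{\sigma^2_{pr:c}}{n_c\, n_r},$$

so doubling $n_r$ from 1 to 2 halves the rater-linked term; when $\sigma^2_{pr:c}$ dominates $\sigma^2_{pc}$, as this study found, double scoring buys precision comparable to adding many more cases.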


Subjects
Clinical Competence, Licensure, Medical, Cohort Studies, Communication, Generalization, Psychological, Humans, Observer Variation, Patient Simulation, Physical Examination, Physician-Patient Relations, Reproducibility of Results, Sensitivity and Specificity, United States