Results 1 - 20 of 20
1.
Article in English | MEDLINE | ID: mdl-37843678

ABSTRACT

Quantitative measures of systematic differences in OSCE scoring across examiners (often termed examiner stringency) can threaten the validity of examination outcomes. Such effects are usually conceptualised and operationalised based solely on checklist/domain scores in a station, and global grades are not often used in this type of analysis. In this work, a large candidate-level exam dataset is analysed to develop a more sophisticated understanding of examiner stringency. Station scores are modelled based on global grades, with each candidate, station and examiner allowed to vary in their ability/stringency/difficulty in the modelling. In addition, examiners are also allowed to vary in how they discriminate across grades; to our knowledge, this is the first time this has been investigated. Results show that examiners contribute strongly to variance in scoring in two distinct ways: via the traditional conception of score stringency (34% of score variance), but also in how they discriminate in scoring across grades (7%). As one might expect, candidate and station account for only a small amount of score variance at the station level once candidate grades are accounted for (3% and 2% respectively), with the remainder being residual (54%). Investigation of impacts on station-level candidate pass/fail decisions suggests that examiner differential stringency effects combine to give false positive (candidates passing in error) and false negative (failing in error) rates in stations of around 5% each, but at the exam level this reduces to 0.4% and 3.3% respectively. This work adds to our understanding of examiner behaviour by demonstrating that examiners can vary in qualitatively different ways in their judgments. For institutions, it emphasises the key message that it is important to sample widely from the examiner pool, via sufficient stations, to ensure OSCE-level decisions are sufficiently defensible. It also suggests that examiner training should include discussion of global grading, and of the combined effect of scoring and grading on candidate outcomes.
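
The variance decomposition described here lends itself to a small simulation. The sketch below (fabricated data; the variance components, grade scale and sample sizes are assumptions, not the paper's) generates station-level scores from candidate, station and examiner effects, including an examiner-specific slope on the global grade, and reports the realised share of score variance from each source.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cand, n_stat, n_exam = 500, 18, 120

# illustrative standard deviations, loosely echoing the variance shares reported above
sd_exam_int, sd_exam_slope = 6.0, 2.5   # examiner stringency, examiner grade discrimination
sd_cand, sd_stat, sd_resid = 1.8, 1.5, 7.0

exam_int   = rng.normal(0, sd_exam_int, n_exam)
exam_slope = rng.normal(0, sd_exam_slope, n_exam)
cand_eff   = rng.normal(0, sd_cand, n_cand)
stat_eff   = rng.normal(0, sd_stat, n_stat)

parts = []
for c in range(n_cand):
    for s in range(n_stat):
        e = rng.integers(n_exam)              # examiner who marks this encounter
        grade = rng.integers(1, 6) - 3        # centred global grade, -2..+2
        parts.append((exam_int[e],            # stringency shift
                      exam_slope[e] * grade,  # discrimination-by-grade shift
                      cand_eff[c], stat_eff[s],
                      rng.normal(0, sd_resid)))

var = np.array(parts).var(axis=0)
for name, v in zip(["examiner stringency", "examiner discrimination",
                    "candidate", "station", "residual"], var):
    print(f"{name:>24s}: {100 * v / var.sum():4.1f}% of station-score variance")
```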

2.
Med Teach ; 45(10): 1163-1169, 2023 Oct.
Article in English | MEDLINE | ID: mdl-37029957

ABSTRACT

INTRODUCTION: Alongside the usual exam-level cut-score requirement, the use of a conjunctive minimum number of stations passed (MNSP) standard in OSCE-type assessments is common practice across some parts of the world. Typically, the MNSP is fixed in advance with little justification, and does not vary from one administration to another in a particular setting, which is not congruent with best assessment practice for high-stakes examinations. In this paper, we investigate empirically four methods of setting such a standard in an examinee-centred (i.e. post hoc) and criterion-based way that allows the standard to vary appropriately with station and test difficulty. METHODS AND RESULTS: Using many administrations (n = 442) of a single exam (PLAB2 in the UK), we show via mixed modelling that the total number of stations passed for each candidate has reliability close to that of the total test score (relative g-coefficients 0.73 and 0.76 respectively). We then argue that calculating the MNSP based on the predicted number of stations passed at the 'main' exam-level cut-score (i.e. for the borderline candidate) is conceptually, theoretically and practically preferred amongst the four approaches considered. Further analysis indicates that this standard does vary from administration to administration, but acts in a secondary way, with approximately a quarter of exam-level candidate failures resulting from application of the MNSP standard alone. CONCLUSION: Collectively, this work suggests that employing the identified approach to setting the MNSP standard is practically possible and, in many settings, more defensible than using a fixed number of stations set in advance.
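
One way to make the preferred approach concrete is sketched below: for a candidate sitting exactly at the exam-level cut-score, predict how many stations they would be expected to pass and take that as the MNSP. This is a hedged illustration with fabricated data and an assumed prediction model (per-station logistic regressions on total score), not the authors' exact procedure.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def mnsp_at_cutscore(total_scores, station_pass, exam_cut):
    """total_scores: (n_candidates,) exam totals;
    station_pass: (n_candidates, n_stations) 0/1 station pass indicators;
    exam_cut: the 'main' exam-level cut-score."""
    X = total_scores.reshape(-1, 1)
    expected_passes = 0.0
    for s in range(station_pass.shape[1]):
        model = LogisticRegression().fit(X, station_pass[:, s])
        # probability that the borderline candidate (score == exam_cut) passes station s
        expected_passes += model.predict_proba([[exam_cut]])[0, 1]
    return int(np.floor(expected_passes))   # the rounding rule is a policy choice

# illustrative fabricated data
rng = np.random.default_rng(1)
ability = rng.normal(0, 1, 400)
station_pass = (ability[:, None] + rng.normal(0, 1, (400, 18)) > -0.5).astype(int)
totals = 60 + 10 * ability + rng.normal(0, 2, 400)
print(mnsp_at_cutscore(totals, station_pass, exam_cut=60))
```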


Subject(s)
Clinical Competence; Educational Measurement; Humans; Reproducibility of Results; Educational Measurement/methods
3.
Med Educ ; 56(10): 1042-1050, 2022 Oct.
Article in English | MEDLINE | ID: mdl-35701388

ABSTRACT

BACKGROUND: Given the widespread use of Multiple Mini Interviews (MMIs), their impact on the selection of candidates and the considerable resources invested in preparing and administering them, it is essential to ensure their quality. Given the variety of station formats used, the degree to which that factor lies within the control of training programmes, and how little we know about it, the lack of attention to format's effect on MMI quality is a considerable oversight. This study assessed the effect of two popular station formats (interview vs. role-play) on the psychometric properties of MMIs. METHODS: We analysed candidate data from the first 8 years of the Integrated French MMIs (IF-MMI) (2010-2017, n = 11,761 applicants), an MMI organised yearly by three francophone universities and administered at four testing sites located in two Canadian provinces. There were 84 role-play and 96 interview stations administered, totalling 180 stations. Mixed-design analyses of variance (ANOVAs) were used to test the effect of station format on candidates' scores and stations' discrimination. Cronbach's alpha coefficients for interview and role-play stations were also compared. Predictive validity of both station formats was estimated with a mixed multiple linear regression model testing the relation between interview and role-play scores and average clerkship performance for those who gained entry to medical school (n = 462). RESULTS: Role-play stations (M = 20.67, standard deviation [SD] = 3.38) had a slightly lower mean score than interview stations (M = 21.36, SD = 3.08), p < 0.01, Cohen's d = 0.2. The correlation between role-play and interview station scores was r = 0.5 (p < 0.01). Discrimination coefficients, Cronbach's alpha and predictive validity statistics did not vary by station format. CONCLUSION: Interview and role-play stations have comparable psychometric properties, suggesting the formats are interchangeable. Programmes should select station format based on the personal qualities for which they are trying to select.
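
For readers unfamiliar with the two headline statistics, the sketch below shows how Cohen's d and Cronbach's alpha are typically computed per format. The data, station counts and effect sizes are fabricated for illustration only.

```python
import numpy as np

def cohens_d(a, b):
    pooled_sd = np.sqrt(((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1))
                        / (len(a) + len(b) - 2))
    return (a.mean() - b.mean()) / pooled_sd

def cronbach_alpha(scores):
    """scores: (n_candidates, n_stations) matrix of station scores."""
    k = scores.shape[1]
    return k / (k - 1) * (1 - scores.var(axis=0, ddof=1).sum()
                          / scores.sum(axis=1).var(ddof=1))

rng = np.random.default_rng(2)
ability = rng.normal(0, 2.0, 1000)[:, None]          # shared candidate ability
interview = 21.4 + ability + rng.normal(0, 2.4, (1000, 8))
role_play = 20.7 + ability + rng.normal(0, 2.7, (1000, 8))

print(f"d = {cohens_d(role_play.ravel(), interview.ravel()):.2f}")
print(f"alpha: interview {cronbach_alpha(interview):.2f}, role-play {cronbach_alpha(role_play):.2f}")
```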


Subject(s)
School Admission Criteria; Schools, Medical; Canada; Humans; Psychometrics; Reproducibility of Results
4.
Adv Health Sci Educ Theory Pract ; 27(2): 457-473, 2022 May.
Article in English | MEDLINE | ID: mdl-35230590

ABSTRACT

Variation in examiner stringency is a recognised problem in many standardised summative assessments of performance such as the OSCE. The stated strength of the OSCE is that such error might largely balance out over the exam as a whole. This study uses linear mixed models to estimate the impact of different factors (examiner, station, candidate and exam) on station-level total domain score and, separately, on a single global grade. The exam data are from 442 separate administrations of an 18-station OSCE for international medical graduates who want to work in the National Health Service in the UK. We find that variation due to examiner is approximately twice as large for domain scores as it is for grades (16% vs. 8%), with smaller residual variance in the former (67% vs. 76%). Combined estimates of exam-level (relative) reliability across all data are 0.75 and 0.69 for domain scores and grades respectively. The correlation between two separate estimates of stringency for individual examiners (one for grades and one for domain scores) is relatively high (r = 0.76), implying that examiners are generally quite consistent in their stringency between these two assessments of performance. Cluster analysis indicates that examiners fall into two broad groups, characterised as hawks or doves on both measures. At the exam level, correcting for examiner stringency produces systematically lower cut-scores under borderline regression standard setting than using the raw marks. In turn, such a correction would produce higher pass rates, although meaningful direct comparisons are challenging to make. As in other studies, this work shows that OSCEs and other standardised performance assessments are subject to substantial variation in examiner stringency, and require sufficient domain sampling to ensure that the quality of pass/fail decision-making is at least adequate. More, perhaps qualitative, work is needed to better understand how examiners might score similarly (or differently) between the awarding of station-level domain scores and global grades. The potential systematic bias of borderline regression evidenced for the first time here, with sources of error producing cut-scores higher than they should be, also needs more investigation.
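
A minimal sketch of the final analytic step described (correlating two stringency estimates and clustering examiners into hawk/dove groups) is given below. It assumes per-examiner stringency estimates are already available, e.g. as random-intercept estimates from two mixed models; the data here are fabricated.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
n_examiners = 500
common = rng.normal(0, 1, n_examiners)                         # shared hawk/dove tendency
stringency_scores = common + rng.normal(0, 0.5, n_examiners)   # from the domain-score model
stringency_grades = common + rng.normal(0, 0.5, n_examiners)   # from the global-grade model

r, _ = pearsonr(stringency_scores, stringency_grades)
X = np.column_stack([stringency_scores, stringency_grades])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(f"r = {r:.2f}; cluster sizes: {np.bincount(labels)}")
```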


Subject(s)
Clinical Competence; Educational Measurement; Humans; Reproducibility of Results; State Medicine
5.
Adv Health Sci Educ Theory Pract ; 26(2): 369-383, 2021 May.
Article in English | MEDLINE | ID: mdl-32876815

ABSTRACT

Variation in examiner stringency is an ongoing problem in many performance settings such as OSCEs, and is usually conceptualised and measured based on the scores/grades examiners award. Under borderline regression, the standard within a station is set using checklist/domain scores and global grades acting in combination. This complexity requires a more nuanced view of what stringency might mean when considering sources of variation in cut-scores in stations. This study uses data from 349 administrations of an 18-station, 36-candidate, single-circuit OSCE for international medical graduates wanting to practise in the UK (PLAB2). The station-level data were gathered over a 34-month period up to July 2019. Linear mixed models are used to estimate and then separate out examiner (n = 547), station (n = 330) and examination (n = 349) effects on borderline regression cut-scores. Examiners are the largest source of variation, accounting for 56% of variance in cut-scores, compared with 6% for stations and < 1% for the exam, with 37% residual. Aggregating to the exam level tends to ameliorate this effect. For 96% of examinations, a 'fair' cut-score, equalising out the variation in examiner stringency that candidates experience, is within one standard error of measurement (SEM) of the actual cut-score. The addition of the SEM to produce the final pass mark generally ensures the public is protected from almost all false positives in the examination caused by examiner cut-score stringency acting in candidates' favour.
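
Because borderline regression cut-scores recur throughout these studies, a minimal sketch of how such a cut-score is commonly computed may help: regress station checklist/domain scores on global grades and read off the predicted score at the borderline grade. The grade scale and borderline grade below are assumptions for illustration, and the data are fabricated.

```python
import numpy as np

def brm_cut_score(checklist_scores, global_grades, borderline_grade=2):
    """Linear regression of checklist score on global grade, evaluated at the
    borderline grade (e.g. grades 1=fail .. 5=excellent, with 2 = borderline)."""
    slope, intercept = np.polyfit(global_grades, checklist_scores, deg=1)
    return intercept + slope * borderline_grade

rng = np.random.default_rng(4)
grades = rng.integers(1, 6, 200)                     # fabricated station data
scores = 10 + 3.5 * grades + rng.normal(0, 2.5, 200)
print(round(brm_cut_score(scores, grades), 1))
```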


Subject(s)
Clinical Competence; Educational Measurement; Cohort Studies; Humans; Physical Examination
7.
Med Teach ; 43(4): 448-455, 2021 Apr.
Article in English | MEDLINE | ID: mdl-33290124

ABSTRACT

INTRODUCTION: Many institutions require candidates to achieve a minimum number of OSCE stations passed (MNSP) in addition to the aggregate pass mark. The stated rationale is usually that this conjunctive standard prevents excessive degrees of compensation across an assessment. However, there is a lack of consideration and discussion of this practice in the medical education literature. METHODS: We consider the motivations for the adoption of the MNSP from the assessment designer's perspective, outlining potential concerns about the complexity of what the OSCE is trying to achieve, particularly around the blueprinting process and the limitations of scoring instruments. We also introduce four potential methods for setting an examinee-centred MNSP standard, and briefly highlight the theoretical advantages and disadvantages of these approaches. DISCUSSION AND CONCLUSION: There are psychometric arguments for and against limiting compensation in OSCEs, but it is clear that many stakeholders value the application of an MNSP standard. This paper adds to the limited literature on this important topic and notes that current MNSP practices are often problematic in high-stakes settings. More empirical work is needed to develop understanding of the impact of the standard-setting methods proposed in this paper on pass/fail decision-making.


Subject(s)
Education, Medical; Educational Measurement; Clinical Competence; Humans; Psychometrics
8.
Med Teach ; 42(9): 1037-1042, 2020 Sep.
Article in English | MEDLINE | ID: mdl-32608303

ABSTRACT

Introduction: There has been a long-running debate about the validity of item-based checklist scoring of performance assessments like OSCEs. In recent years, the conception of a checklist has developed from its dichotomous inception into a more 'key-features' and/or chunked approach, where 'items' have the potential to be weighted differently, but the literature does not always reflect these broader conceptions. Methods: We consider theoretical, design and (clinically trained) assessor issues related to differential item weighting in checklist scoring of OSCE stations. Using empirical evidence, this work also compares candidate decisions and the psychometric quality of different item-weighting approaches (i.e. a simple 'unweighted' scheme versus a differentially weighted one). Results: The impact of different weighting schemes affects approximately 30% of the key borderline group of candidates, and 3% of candidates overall. We also find that measures of overall assessment quality are a little better under the differentially weighted scoring system. Discussion and conclusion: Differentially weighted modern checklists can contribute to valid assessment outcomes, and bring a range of additional benefits to the assessment. Judgment about the weighting of particular items should be considered a key design consideration during station development and must align with clinical assessors' expectations of the relative importance of sub-tasks.
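
A brief sketch of the kind of comparison described (illustrative only; the item weights, cut-score and data are assumptions, not the study's scheme) contrasts an unweighted checklist score with a differentially weighted one and counts candidates whose station pass/fail decision changes.

```python
import numpy as np

rng = np.random.default_rng(5)
n_cand, n_items = 300, 12
items = rng.integers(0, 2, (n_cand, n_items))              # 0/1 checklist items
weights = np.array([3, 3, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1])   # assumed 'key-feature' weights

unweighted = items.mean(axis=1) * 100                       # percentage scores
weighted = items @ weights / weights.sum() * 100

cut = 60.0                                                   # assumed station cut-score
changed = (unweighted >= cut) != (weighted >= cut)
print(f"{changed.mean():.1%} of candidates change station pass/fail under reweighting")
```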


Subject(s)
Checklist; Educational Measurement; Clinical Competence; Humans; Judgment; Psychometrics; Reproducibility of Results
9.
Med Teach ; 42(3): 306-315, 2020 Mar.
Article in English | MEDLINE | ID: mdl-31657266

ABSTRACT

Introduction: Borderline regression (BRM) is considered problematic in small-cohort OSCEs (e.g. n < 50), with institutions often relying instead on item-centred standard-setting approaches, which can be resource intensive and lack defensibility in performance tests. Methods: Through an analysis of post-hoc station- and test-level metrics, we investigate the application of BRM in three different small-cohort OSCE contexts: the exam for international medical graduates wanting to practise in the UK, senior sequential undergraduate exams, and physician associate exams in a large UK medical school. Results: We find that BRM provides robust metrics, and concomitantly defensible cut-scores, in the majority of stations (percentage of problematic stations 5%, 14% and 12% respectively across our three contexts). Where problems occur, this is generally because the relationship between global grades and checklist scores is not strong enough to give confidence in the standard set by BRM in these stations. Conclusion: This work challenges previous assumptions about the application of BRM in small test cohorts. Where there is sufficient spread of ability, BRM will generally provide defensible standards, assuming careful design of station-level scoring instruments. However, extant station cut-scores are a preferred substitute where BRM standard-setting problems do occur.
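
One plausible station-level check implied by this finding is sketched below: the strength of the checklist-score/global-grade relationship that borderline regression relies on. The R-squared and slope thresholds are assumptions for illustration, not the authors' criteria, and the data are fabricated.

```python
import numpy as np

def brm_station_diagnostics(scores, grades):
    slope, intercept = np.polyfit(grades, scores, deg=1)
    predicted = intercept + slope * grades
    ss_res = np.sum((scores - predicted) ** 2)
    ss_tot = np.sum((scores - scores.mean()) ** 2)
    return slope, 1 - ss_res / ss_tot

rng = np.random.default_rng(6)
grades = rng.integers(1, 6, 40)                  # a small cohort
scores = 8 + 4 * grades + rng.normal(0, 3, 40)
slope, r2 = brm_station_diagnostics(scores, grades)
flag = "review station" if (r2 < 0.5 or slope <= 0) else "acceptable"
print(f"slope={slope:.2f}, R^2={r2:.2f} -> {flag}")
```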


Subject(s)
Education, Medical, Undergraduate; Educational Measurement; Clinical Competence; Cohort Studies; Humans; Reproducibility of Results; Schools, Medical
10.
Med Educ ; 53(3): 250-263, 2019 Mar.
Article in English | MEDLINE | ID: mdl-30575092

ABSTRACT

BACKGROUND: Although averaging across multiple examiners' judgements reduces unwanted overall score variability in objective structured clinical examinations (OSCEs), designs involving several parallel circuits of the OSCE require that different examiner cohorts collectively judge performances to the same standard in order to avoid bias. Prior research suggests the potential for important examiner-cohort effects in distributed or national examinations that could compromise fairness or patient safety, but despite their importance, these effects are rarely investigated because fully nested assessment designs make them very difficult to study. We describe initial use of a new method to measure and adjust for examiner-cohort effects on students' scores. METHODS: We developed video-based examiner score comparison and adjustment (VESCA): volunteer students were filmed 'live' on 10 out of 12 OSCE stations. Following the examination, examiners additionally scored station-specific common-comparator videos, producing partial crossing between examiner cohorts. Many-facet Rasch modelling and linear mixed modelling were used to estimate and adjust for examiner-cohort effects on students' scores. RESULTS: After accounting for students' ability, examiner cohorts differed substantially in their stringency or leniency (maximal global score difference of 0.47 out of 7.0 [Cohen's d = 0.96]; maximal total percentage score difference of 5.7% [Cohen's d = 1.06] for the same student ability judged by different examiner cohorts). Corresponding adjustment of students' global and total percentage scores altered the theoretical classification of 6.0% of students for both measures (either pass to fail or fail to pass), whereas 8.6-9.5% of students' scores were altered by at least 0.5 standard deviations of student ability. CONCLUSIONS: Despite typical reliability, the examiner cohort that students encountered had a potentially important influence on their score, emphasising the need for adequate sampling and examiner training. Development and validation of VESCA may offer a means to measure and adjust for potential systematic differences in scoring patterns that could exist between locations in distributed or national OSCE examinations, thereby ensuring equivalence and fairness.
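
A much-simplified sketch of the adjustment idea follows. It is not the published VESCA procedure (which uses many-facet Rasch and mixed models); it only illustrates, with fabricated numbers, how common-comparator video scores create partial crossing from which a cohort offset can be estimated and removed.

```python
import numpy as np

def adjust_for_cohort(live_scores, cohort_ids, video_scores_by_cohort):
    """video_scores_by_cohort: dict cohort_id -> scores that cohort gave to the
    same common-comparator videos, giving a partial crossing between cohorts."""
    grand_mean = np.mean([np.mean(v) for v in video_scores_by_cohort.values()])
    offsets = {c: np.mean(v) - grand_mean for c, v in video_scores_by_cohort.items()}
    return np.array([s - offsets[c] for s, c in zip(live_scores, cohort_ids)])

video = {"A": [5.4, 5.1, 5.6], "B": [4.9, 4.6, 5.0]}   # fabricated comparator ratings
live = np.array([5.2, 4.4, 5.0, 4.1])                  # fabricated live scores
cohorts = ["A", "A", "B", "B"]
print(adjust_for_cohort(live, cohorts, video).round(2))
```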


Subject(s)
Clinical Competence/standards; Education, Medical, Undergraduate/standards; Educational Measurement/methods; Educational Measurement/standards; Observer Variation; Videotape Recording/methods; Education, Medical, Undergraduate/methods; Humans; Reproducibility of Results; Students, Medical
11.
Med Teach ; 40(3): 275-284, 2018 Mar.
Article in English | MEDLINE | ID: mdl-29172861

ABSTRACT

INTRODUCTION: In recent decades, there has been a move towards standardized models of assessment where all students sit the same test (e.g. OSCE). By contrast, in a sequential test the examination is in two parts: a "screening" test (S1) that all candidates take, and a second "test" (S2) which only the weaker candidates sit. This article investigates the diagnostic accuracy of this assessment design and examines failing students' subsequent performance under this model. METHODS: Using recent undergraduate knowledge and performance data, we compare S1 "decisions" to S2 overall pass/fail decisions to assess diagnostic accuracy in a sequential model. We also evaluate the longitudinal performance of failing students using changes in percentile ranks over a full repeated year. FINDINGS: We find a small but important improvement in diagnostic accuracy under a sequential model (of the order of 2-4% of students misclassified under a traditional model). Further, after a resit year, weaker students' rankings relative to their peers improve by 20-30 percentile points. DISCUSSION: These findings provide strong empirical support for the theoretical arguments in favor of a sequential testing model of assessment, particularly that diagnostic accuracy and longitudinal assessment outcomes post-remediation for the weakest students are both improved.
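
An illustrative sketch of the diagnostic-accuracy comparison (fabricated data; the cut-score, score model and decision rule are assumptions, not the study's design): treat a fuller-information S1+S2 decision as the reference and ask how often an S1-only, single-test decision would classify candidates differently.

```python
import numpy as np

rng = np.random.default_rng(7)
ability = rng.normal(0, 1, 1000)
s1 = ability + rng.normal(0, 0.6, 1000)      # screening test score
s2 = ability + rng.normal(0, 0.6, 1000)      # second test (in practice sat only by weaker candidates)
cut = np.quantile(s1, 0.10)                  # assumed pass standard

ref_pass = (s1 + s2) / 2 >= cut              # reference decision using more information
trad_pass = s1 >= cut                        # traditional single-test decision
false_pos = (trad_pass & ~ref_pass).mean()   # pass on S1 alone, fail on fuller information
false_neg = (~trad_pass & ref_pass).mean()
print(f"misclassified by S1 alone: {false_pos + false_neg:.1%}")
```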


Subject(s)
Clinical Competence/standards; Diagnostic Techniques and Procedures/standards; Educational Measurement/methods; Students, Medical; Decision Making; Education, Medical, Undergraduate; Humans
12.
Med Teach ; 39(5): 469-475, 2017 May.
Article in English | MEDLINE | ID: mdl-28440718

ABSTRACT

INTRODUCTION: Many standard-setting procedures focus on the performance of the "borderline" group, defined through expert judgments by assessors. In performance assessments such as Objective Structured Clinical Examinations (OSCEs), these judgments usually apply at the station level. METHODS AND RESULTS: Using largely descriptive approaches, we analyze the assessment profile of OSCE candidates at the end of a five-year undergraduate medical degree program to investigate the consistency of the borderline group across stations. We look specifically at those candidates who are borderline in individual stations, and in the overall assessment. While the borderline group can be clearly defined at the individual station level, our key finding is that the membership of this group varies considerably across stations. DISCUSSION AND CONCLUSIONS: These findings pose challenges for some standard-setting methods, particularly the borderline group and objective borderline methods. They also suggest that institutions should ensure appropriate conjunctive rules are in place to limit compensation in performance between stations and so maximize "diagnostic accuracy". In addition, this work highlights a key benefit of sequential testing formats in OSCEs. In comparison with a traditional, single-test format, sequential models allow assessment of "borderline" candidates across a wider range of content areas, with concomitant improvements in pass/fail decision-making.
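
A compact sketch of one way the key finding could be quantified (an assumption for illustration, not the paper's analysis): define a borderline band per station and measure how much station borderline groups overlap.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(8)
n_cand, n_stat = 300, 16
scores = rng.normal(0, 1, n_cand)[:, None] + rng.normal(0, 1.5, (n_cand, n_stat))

# 'borderline' per station: an assumed band around each station's cut (40th-55th percentile)
lo, hi = np.percentile(scores, [40, 55], axis=0)
borderline = (scores >= lo) & (scores <= hi)

def jaccard(a, b):
    return (a & b).sum() / max((a | b).sum(), 1)

overlaps = [jaccard(borderline[:, i], borderline[:, j])
            for i, j in combinations(range(n_stat), 2)]
print(f"mean pairwise overlap of station borderline groups: {np.mean(overlaps):.2f}")
```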


Subject(s)
Clinical Competence/standards; Decision Making; Education, Medical, Undergraduate/methods; Educational Measurement/methods; Education, Medical, Undergraduate/standards; Humans; Students
13.
Perspect Med Educ ; 6(1): 7-9, 2017 Feb.
Article in English | MEDLINE | ID: mdl-28050883
14.
Med Teach ; 39(1): 58-66, 2017 Jan.
Article in English | MEDLINE | ID: mdl-27670246

ABSTRACT

CONTEXT: There is a growing body of research investigating assessor judgments in complex performance environments such as OSCE examinations. Post hoc analysis can be employed to identify some elements of "unwanted" assessor variance. However, the impact of individual, apparently "extreme" assessors on OSCE quality, assessment outcomes and pass/fail decisions has not been previously explored. This paper uses a range of "case studies" as examples to illustrate the impact that "extreme" examiners can have in OSCEs, and gives pragmatic suggestions for successfully alleviating problems. METHOD AND RESULTS: We used real OSCE assessment data from a number of examinations where, at station level, a single examiner assesses student performance using a global grade and a key-features checklist. Three exemplar case studies, where initial post hoc analysis indicated problematic individual assessor behavior, are considered and discussed in detail, highlighting the impact of both individual examiner behavior and station design on subsequent judgments. CONCLUSIONS: In complex assessment environments, institutions have a duty to maximize the defensibility, quality and validity of the assessment process. A key element of this involves critical analysis, through a range of approaches, of assessor judgments. However, care must be taken when assuming that apparently aberrant examiner behavior is automatically just that.


Subject(s)
Education, Medical/methods; Education, Medical/standards; Educational Measurement/methods; Educational Measurement/standards; Observer Variation; Checklist; Clinical Competence; Humans; Judgment; Psychometrics; Reproducibility of Results
15.
Med Teach ; 38(12): 1267-1277, 2016 Dec.
Article in English | MEDLINE | ID: mdl-27650218

ABSTRACT

INTRODUCTION: It is known that test-centered methods for setting standards in knowledge tests (e.g. Angoff or Ebel) are problematic, with expert judges unable to consistently predict the difficulty of individual items. A different approach is the Cohen method, which benchmarks the difficulty of the test based on the performance of the top candidates. METHODS: This paper investigates the extent to which Ebel (and also Cohen) produces a consistent standard in a knowledge test when comparing adjacent cohorts. The two tests are linked using common anchor items and Rasch analysis to put all items and all candidates on the same scale. RESULTS: The two tests are of a similar standard, but the two cohorts differ in their average abilities. The Ebel method is entirely consistent across the two years, but the Cohen method looks less so, whilst the Rasch equating itself has complications, for example evidence of overall misfit to the Rasch model and changes in difficulty for some anchor items. CONCLUSION: Based on our findings, we advocate a pluralistic and pragmatic approach to standard setting in such contexts, and recommend the use of multiple sources of information to inform the decision about the correct standard.
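
A hedged sketch of the Cohen-style benchmark described above: the pass mark is set as a fixed proportion of the score of a high-performing reference candidate. The 95th percentile and the 0.60 multiplier are one commonly cited choice, used here as assumptions rather than the exact parameters of this paper; the cohort scores are fabricated.

```python
import numpy as np

def cohen_cut_score(total_scores, percentile=95, proportion=0.60):
    reference = np.percentile(total_scores, percentile)   # 'top candidate' benchmark
    return proportion * reference

rng = np.random.default_rng(9)
cohort_scores = rng.normal(65, 10, 300).clip(0, 100)      # fabricated test percentages
print(round(cohen_cut_score(cohort_scores), 1))
```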


Subject(s)
Education, Medical/methods; Education, Medical/standards; Educational Measurement/methods; Educational Measurement/standards; Knowledge; Humans
16.
Med Teach ; 38(2): 181-188, 2016.
Article in English | MEDLINE | ID: mdl-25909810

ABSTRACT

BACKGROUND: The borderline regression method (BRM) is a widely accepted standard-setting method for OSCEs. However, it is unclear whether this method is appropriate for use with small cohorts (e.g. in specialist post-graduate examinations). AIMS AND METHODS: This work uses an innovative application of resampling methods, applied to four pre-existing OSCE data sets (with between 17 and 21 stations) from two institutions, to investigate how the robustness of the BRM changes as the cohort size varies. Using a variety of metrics, the 'quality' of an OSCE is evaluated for cohorts from approximately n = 300 down to n = 15. Estimates of the standard error in station-level and overall pass marks, the R² coefficient and Cronbach's alpha are all calculated as cohort size varies. RESULTS AND CONCLUSION: For larger cohorts (n > 200), the standard error in the overall pass mark is small (less than 0.5%), and for individual stations is of the order of 1-2%. These errors grow as the sample size reduces, with cohorts of fewer than 50 candidates showing unacceptably large standard errors. Alpha and R² also become unstable for small cohorts. The resampling methodology is shown to be robust and has the potential to be more widely applied in standard setting and in medical assessment quality assurance and research.
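
A minimal sketch of the resampling idea: repeatedly draw smaller cohorts (with replacement) from a full station data set and watch how the spread of the borderline-regression pass mark grows as cohort size shrinks. The data, grade scale and borderline grade are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(10)
grades = rng.integers(1, 6, 300)
scores = 10 + 3.5 * grades + rng.normal(0, 2.5, 300)

def brm_cut(scores, grades, borderline_grade=2):
    slope, intercept = np.polyfit(grades, scores, deg=1)
    return intercept + slope * borderline_grade

for n in (300, 100, 50, 15):
    cuts = []
    for _ in range(2000):
        idx = rng.integers(0, len(scores), n)      # resampled cohort of size n
        cuts.append(brm_cut(scores[idx], grades[idx]))
    print(f"n={n:3d}: SE of station pass mark ~ {np.std(cuts):.2f}")
```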


Subject(s)
Bias; Internal Medicine/education; Regression Analysis; Sample Size; Cohort Studies
17.
Nutrients ; 7(6): 4426-4437, 2015 Jun 02.
Article in English | MEDLINE | ID: mdl-26043039

ABSTRACT

With growing evidence for the positive health outcomes associated with a plant-based diet, the study's purpose was to examine the potential of shifting adolescents' food choices towards plant-based foods. Using the real-world setting of a school canteen, a set of small changes to the choice architecture was designed and deployed in a secondary school in Yorkshire, England. Focussing on designated food items (whole fruit, fruit salad, vegetarian daily specials, and sandwiches containing salad), the changes were implemented for six weeks. Data collected on students' food choices (218,796 transactions from 980 students) enabled students' selections to be examined. Students' food choice was compared across three periods: baseline (29 weeks), intervention (six weeks) and post-intervention (three weeks). Selection of designated food items significantly increased during the intervention and post-intervention periods compared to baseline (baseline, 1.4%; intervention, 3.0%; post-intervention, 2.2%), χ²(2) = 68.1, p < 0.001. Logistic regression modelling also revealed an independent effect of the intervention, with students 2.5 times as likely (p < 0.001) to select the designated food items during the intervention period compared to baseline. The study's results point to the influence of choice architecture within secondary school settings, and its potential role in improving adolescents' daily food choices.
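
For readers unfamiliar with the two tests reported, a brief sketch follows: a chi-squared test of selection proportions across the three periods, and a logistic regression giving the odds of selecting a designated item during the intervention relative to baseline. The counts below are fabricated; only the structure mirrors the abstract.

```python
import numpy as np
from scipy.stats import chi2_contingency
import statsmodels.api as sm

# rows: baseline, intervention, post-intervention; cols: designated item chosen / not chosen
table = np.array([[2100, 147900],
                  [1080,  34920],
                  [ 396,  17604]])
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2({dof}) = {chi2:.1f}, p = {p:.3g}")

# logistic regression: outcome = designated item chosen, predictor = intervention period
y = np.concatenate([np.repeat([1, 0], row) for row in table[:2]])   # baseline + intervention
x = np.concatenate([np.zeros(table[0].sum()), np.ones(table[1].sum())])
fit = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
print(f"odds ratio (intervention vs baseline) = {np.exp(fit.params[1]):.2f}")
```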


Subject(s)
Choice Behavior; Dietary Fats/administration & dosage; Feeding Behavior; Food Preferences; Health Promotion/methods; Adolescent; Child; Diet, Vegetarian; England; Food Services; Fruit; Health Behavior; Humans; Logistic Models; Schools; Students; Vegetables
18.
Med Teach ; 37(12): 1106-1113, 2015.
Article in English | MEDLINE | ID: mdl-25683174

ABSTRACT

BACKGROUND: When measuring assessment quality, increasing focus is placed on the value of station-level metrics in the detection and remediation of problems in the assessment. AIMS: This article investigates how disparity between checklist scores and global grades in an Objective Structured Clinical Examination (OSCE) can provide powerful new insights at the station level, and develops metrics to indicate when such disparity is a problem. METHOD: This retrospective study uses OSCE data from multiple examinations to investigate the extent to which these new measurements of disparity complement existing station-level metrics. RESULTS: In stations where existing metrics are poor, the new metrics provide greater understanding of the underlying sources of error. Equally importantly, stations of apparently satisfactory "quality" based on traditional metrics are shown to sometimes have problems of their own, with a tendency for checklist score "performance" to be judged stronger than would be expected from the global grades awarded. CONCLUSIONS: There is an ongoing tension in OSCE assessment between global holistic judgements and the necessarily more reductionist, but arguably more objective, checklist scores. This article develops methods to quantify the disparity between these judgements and illustrates how such analyses can inform ongoing improvement in station quality.
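
A compact sketch of one plausible disparity measure (an assumption for illustration, not necessarily the paper's metric): compare the station pass/fail decision implied by the checklist cut-score with the pass/fail decision implied by the global grade, and report how often each disagrees in each direction.

```python
import numpy as np

rng = np.random.default_rng(11)
grades = rng.integers(1, 6, 250)                        # assumed scale: 1-2 = failing grades
scores = 8 + 4 * grades + rng.normal(0, 3, 250)
cut = np.polyval(np.polyfit(grades, scores, 1), 2.5)    # BRM-style cut between grades 2 and 3

pass_by_score = scores >= cut
pass_by_grade = grades >= 3
score_stronger = (pass_by_score & ~pass_by_grade).mean()   # checklist looks stronger than grade
grade_stronger = (~pass_by_score & pass_by_grade).mean()
print(f"score>grade disparity: {score_stronger:.1%}; grade>score: {grade_stronger:.1%}")
```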


Subject(s)
Clinical Competence; Decision Making; Educational Measurement/methods; Students, Medical; Adult; Education, Medical, Undergraduate; Female; Humans; Male; Problem Solving; Reproducibility of Results; Retrospective Studies; Schools, Medical; Students, Medical/psychology; Students, Medical/statistics & numerical data; United Kingdom; Young Adult
19.
Med Teach ; 35(8): 684-691, 2013 Aug.
Article in English | MEDLINE | ID: mdl-23782043

ABSTRACT

This paper reports on a study that compares estimates of the reliability of a suite of workplace-based assessment forms as employed to formatively assess the progress of trainee obstetricians and gynaecologists. The use of such forms of assessment is growing nationally and internationally in many specialties, but there is little research evidence comparing them by procedure/competency and form-type across an entire specialty. Generalisability theory, combined with a multilevel modelling approach, is used to estimate variance components, G-coefficients and standard errors of measurement across 13 procedures and three form-types (mini-CEX, OSATS and CbD). The main finding is that there are wide variations in the estimates of reliability across the forms, and that the guidance on assessment within the specialty therefore does not always allow for enough forms per trainee to ensure that the reliability of the process is adequate. There is, however, little evidence that reliability varies systematically by form-type. Methodologically, the problems of accurately estimating reliability in these contexts through the calculation of variance components and, crucially, their associated standard errors are considered. The importance of using appropriate methods in such calculations is emphasised, and the unavoidable limitations of research in naturalistic settings are discussed.
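
A short sketch of the generalisability-theory quantities named above: given estimated variance components for trainee and residual error, the relative G-coefficient and SEM for a mean over a given number of forms. The component values below are illustrative assumptions, not the study's estimates.

```python
import math

def g_coefficient(var_trainee, var_error, n_forms):
    """Relative G-coefficient for the mean of n_forms assessments of one trainee."""
    return var_trainee / (var_trainee + var_error / n_forms)

def sem(var_error, n_forms):
    return math.sqrt(var_error / n_forms)

var_trainee, var_error = 0.30, 0.90   # assumed variance components
for n in (2, 4, 8, 12):
    print(f"n_forms={n:2d}: G = {g_coefficient(var_trainee, var_error, n):.2f}, "
          f"SEM = {sem(var_error, n):.2f}")
```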


Subject(s)
Clinical Competence; Educational Measurement/methods; Gynecology/education; Obstetrics/education; Workplace; Communication; Decision Making; Education, Medical, Graduate/methods; Educational Measurement/standards; Gynecology/standards; Humans; Medical History Taking; Obstetrics/standards; Physical Examination; Reproducibility of Results; Retrospective Studies