ABSTRACT
CONSTRUCT: Automatic item generation (AIG) is an alternative method for producing large numbers of test items that integrates cognitive modeling with computer technology to systematically generate multiple-choice questions (MCQs). The purpose of our study is to describe and validate a method of generating plausible but incorrect options, or distractors. Initial applications of AIG demonstrated its effectiveness in producing test items. However, expert review of the initial items identified a key limitation: the generation of implausible distractors may limit the applicability of items in real testing situations. BACKGROUND: Medical educators must develop test items in large quantities to facilitate the continual assessment of student knowledge. Traditional item development processes are time-consuming and resource intensive. Studies have validated the quality of generated items through content expert review. However, no study has yet documented how generated items perform in a test administration, nor has any study validated AIG through examinee responses to generated test items. APPROACH: To validate our refined AIG method for generating plausible distractors, we collected psychometric evidence from a field test of the generated items. A three-step process was used to generate test items in the area of jaundice. At least 455 Canadian and international medical graduates responded to each of the 13 generated items embedded in a high-stakes exam administration. Item difficulty, discrimination, and index-of-discrimination estimates were calculated for the correct option as well as for each distractor. RESULTS: Item analysis results for the correct options suggest that the generated items measured candidate performances across a range of ability levels while providing a consistent level of discrimination for each item. Results for the distractors reveal that the generated items differentiated the low-performing from the high-performing candidates. CONCLUSIONS: Previous research on AIG highlighted how this item development method can be used to produce high-quality stems and correct options for MCQ exams. The purpose of the current study was to describe, illustrate, and evaluate a method for modeling plausible but incorrect options. The evidence provided in this study demonstrates that AIG can produce psychometrically sound test items. More importantly, by adapting the distractors to match the unique features presented in the stem and correct option, the automated generation of MCQs has the potential to produce plausible distractors and yield large numbers of high-quality items for medical education.
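Editor's illustrative aside (not part of the abstract above): the option-level statistics named in the APPROACH and RESULTS sections can be sketched roughly as below. This is a minimal Python sketch with simulated data; the 27% upper/lower grouping, option labels, and response counts are assumptions, not the study's actual analysis or values.

```python
# Sketch of classical item analysis: difficulty (proportion choosing an option),
# point-biserial discrimination, and an index of discrimination contrasting the
# top and bottom scoring groups, computed for the keyed option and each distractor.
import numpy as np

def option_statistics(responses, key, total_scores):
    """responses: chosen options ('A'-'D') for one item; key: correct option;
    total_scores: examinees' total test scores."""
    stats = {}
    upper = total_scores >= np.percentile(total_scores, 73)   # top ~27% group
    lower = total_scores <= np.percentile(total_scores, 27)   # bottom ~27% group
    for option in sorted(set(responses)):
        chose = (responses == option).astype(float)
        p = chose.mean()                                       # proportion selecting the option
        r_pb = np.corrcoef(chose, total_scores)[0, 1]          # point-biserial discrimination
        d_index = chose[upper].mean() - chose[lower].mean()    # index of discrimination
        stats[option] = {"p": p, "r_pb": r_pb, "D": d_index, "keyed": option == key}
    return stats

# Illustrative call with simulated responses to one generated jaundice item.
rng = np.random.default_rng(0)
scores = rng.normal(70, 10, size=455)
choices = rng.choice(list("ABCD"), size=455, p=[0.55, 0.20, 0.15, 0.10])
print(option_statistics(choices, key="A", total_scores=scores))
```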
Subjects
Computer-Assisted Instruction/methods; Education, Medical, Undergraduate/methods; Educational Measurement/methods; Quality Improvement; Automation; Humans; Jaundice/diagnosis; Jaundice/therapy; Models, Educational; Psychometrics
ABSTRACT
CONTEXT: Constructed-response tasks, which range from short-answer tests to essay questions, are included in assessments of medical knowledge because they allow educators to measure students' ability to think, reason, solve complex problems, communicate, and collaborate through their use of writing. However, constructed-response tasks are also costly to administer and challenging to score because they rely on human raters. One alternative to the manual scoring process is to integrate computer technology with writing assessment. The process of scoring written responses using computer programs is known as 'automated essay scoring' (AES). METHODS: An AES system uses a computer program that builds a scoring model by extracting linguistic features from constructed responses that have been pre-scored by human raters and then, using machine learning algorithms, maps those features to the human scores so that the computer can classify (i.e. score or grade) the responses of a new group of students. The accuracy of the score classification can be evaluated using different measures of agreement. RESULTS: Automated essay scoring provides a method for scoring constructed-response tests that complements the current use of selected-response testing in medical education. The method can serve medical educators by providing the summative scores required for high-stakes testing. It can also serve medical students by providing them with detailed feedback as part of a formative assessment process. CONCLUSIONS: Automated essay scoring systems yield scores that consistently agree with those of human raters at a level as high as, if not higher than, the level of agreement among the human raters themselves. The system offers medical educators many benefits for scoring constructed-response tasks, such as improving the consistency of scoring, reducing the time required for scoring and reporting, minimising the costs of scoring, and providing students with immediate feedback on constructed-response tasks.
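Editor's illustrative aside: the AES workflow described in METHODS (feature extraction, supervised mapping to human scores, agreement check) might be sketched as follows. The feature set, model choice, training essays, and kappa-based agreement check are assumptions for illustration, not the system evaluated in the article.

```python
# Minimal AES sketch: lexical features from pre-scored responses, a supervised
# model mapping features to human scores, and an agreement check on new responses.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score
from sklearn.pipeline import make_pipeline

train_essays = ["Neonatal jaundice may reflect increased red cell breakdown.",
                "The patient should be reassured and monitored."]   # pre-scored responses (toy data)
train_scores = [3, 1]                                               # human rater scores

aes_model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),   # simple linguistic (lexical) features
    LogisticRegression(max_iter=1000),               # maps features to score categories
)
aes_model.fit(train_essays, train_scores)

machine_scores = aes_model.predict(["Unconjugated bilirubin rises when hemolysis occurs."])

# Agreement on a held-out set is typically summarized with exact agreement and
# quadratic weighted kappa; the lists below stand in for real holdout scores.
human_holdout, machine_holdout = [3, 1, 2], [3, 2, 2]
print(machine_scores, cohen_kappa_score(human_holdout, machine_holdout, weights="quadratic"))
```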
Subjects
Computer-Assisted Instruction/trends; Education, Medical/methods; Education, Medical/trends; Educational Measurement/methods; Software; Clinical Competence; Humans; Writing
ABSTRACT
Objective structured clinical examinations (OSCEs) are used worldwide for summative examinations but often lack acceptable reliability. Research has shown that the reliability of scores increases if OSCE checklists for medical students include only clinically relevant items. In addition, checklists often omit evidence-based items that high-achieving learners are more likely to use. The purpose of this study was to determine whether limiting checklist items to clinically discriminating items and/or adding missing evidence-based items improved score reliability in an Internal Medicine residency OSCE. Six internists reviewed the traditional checklists of four OSCE stations, classifying items as clinically discriminating or non-discriminating. Two independent reviewers augmented the checklists with missing evidence-based items. We used generalizability theory to calculate the overall reliability of faculty observer checklist scores from 45 first- and second-year residents and to predict how many 10-item stations would be required to reach a Phi coefficient of 0.8. Removing clinically non-discriminating items from the traditional checklist did not affect the number of 10-item stations (15) required to reach a Phi of 0.8. Focusing the checklist on only evidence-based clinically discriminating items increased test score reliability, requiring 11 stations instead of 15 to reach 0.8; adding missing evidence-based clinically discriminating items to the traditional checklist modestly improved reliability (requiring 14 stations instead of 15). Checklists composed of evidence-based clinically discriminating items improved the reliability of checklist scores and reduced the number of stations needed for acceptable reliability. Educators should give preference to evidence-based items over non-evidence-based items when developing OSCE checklists.
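Editor's illustrative aside: the decision-study projection used with generalizability theory (how many stations are needed to reach Phi = 0.8) can be sketched as below for a persons-by-stations design. The variance components are invented placeholders, not the study's estimates.

```python
# Sketch of a G-theory decision study: project the Phi (dependability) coefficient
# as the number of stations increases, then find the smallest n reaching the target.
def phi_coefficient(var_person, var_station, var_residual, n_stations):
    """Phi for a persons-by-stations design with n_stations stations."""
    absolute_error = (var_station + var_residual) / n_stations
    return var_person / (var_person + absolute_error)

def stations_needed(var_person, var_station, var_residual, target=0.80, max_stations=40):
    for n in range(1, max_stations + 1):
        if phi_coefficient(var_person, var_station, var_residual, n) >= target:
            return n
    return None

# Example with assumed variance components (person, station, residual).
print(stations_needed(var_person=0.020, var_station=0.010, var_residual=0.075))
```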
Subjects
Checklist; Clinical Competence/standards; Education, Medical, Graduate; Evidence-Based Practice/standards; Internal Medicine/standards; Internship and Residency/standards; Physical Examination/standards; Canada; Educational Measurement/methods; Humans; Reproducibility of Results; Students, Medical
ABSTRACT
OBJECTIVES: Computerised assessment raises formidable challenges because it requires large numbers of test items. Automatic item generation (AIG) can help address this test development problem because it yields large numbers of new items both quickly and efficiently. To date, however, the quality of the items produced using a generative approach has not been evaluated. The purpose of this study was to determine whether automatic processes yield items that meet the standards of quality appropriate for medical testing. Quality was evaluated firstly by subjecting items created using both AIG and traditional processes to rating by a four-member expert medical panel using indicators of multiple-choice item quality, and secondly by asking the panellists to identify, in a blind review, which items were developed using AIG. METHODS: Fifteen items from the domain of therapeutics were created in each of three experimental test development conditions. The first 15 items were created by content specialists using traditional test development methods (Group 1 Traditional). The second 15 items were created by the same content specialists using AIG methods (Group 1 AIG). The third 15 items were created by a new group of content specialists using traditional methods (Group 2 Traditional). These 45 items were then evaluated for quality by a four-member panel of medical experts and were subsequently categorised as either Traditional or AIG items. RESULTS: Three outcomes were reported: (i) the items produced using traditional and AIG processes were comparable on seven of eight indicators of multiple-choice item quality; (ii) AIG items could be differentiated from Traditional items by the quality of their distractors; and (iii) the overall predictive accuracy of the four expert medical panellists was 42%. CONCLUSIONS: Items generated by AIG methods are, for the most part, equivalent to traditionally developed items from the perspective of expert medical reviewers. Although the AIG method produced comparatively fewer plausible distractors than the traditional method, medical experts could not consistently distinguish AIG items from traditionally developed items in a blind review.
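Editor's illustrative aside: the blind-review accuracy figure reported above is a simple proportion of correct origin guesses, sketched below with invented labels purely to show the arithmetic, not the study's data.

```python
# Tiny sketch of the blind-review check: compare panellists' guesses about item
# origin (AIG vs. Traditional) with the true origin and report accuracy.
true_origin = ["AIG", "Trad", "Trad", "AIG", "Trad", "AIG"]   # invented
panel_guess = ["Trad", "Trad", "AIG", "AIG", "AIG", "Trad"]   # invented

accuracy = sum(t == g for t, g in zip(true_origin, panel_guess)) / len(true_origin)
print(f"Predictive accuracy: {accuracy:.0%} (chance for a two-way guess is 50%)")
```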
Subjects
Computer-Assisted Instruction/standards; Education, Medical, Undergraduate/methods; Educational Measurement/standards; Computer-Assisted Instruction/methods; Education, Medical, Undergraduate/standards; Educational Measurement/methods; Humans; Models, Educational; Research Design; Surveys and Questionnaires
ABSTRACT
OBJECTIVE: Automatic item generation (AIG) is a new area of assessment research in which sets of multiple-choice questions (MCQs) are created using models and computer technology. Although successfully demonstrated in medicine and dentistry, AIG has not been implemented in pharmacy. The objective was to implement AIG to create a set of MCQs appropriate for inclusion in a summative, high-stakes pharmacy examination. METHODS: A three-step process, well established in AIG research, was employed to create the pharmacy MCQs. The first step was developing a cognitive model based on content within the examination blueprint. Second, an item model was developed based on the cognitive model. A process of systematic distractor generation was also incorporated to optimize distractor plausibility. Third, we used computer technology to assemble a set of test items based on the cognitive and item models. A sample of generated items was assessed for quality against Gierl and Lai's eight guidelines of item quality. RESULTS: More than 15,000 MCQs were generated to measure knowledge and skill in patient assessment and treatment of nausea and/or vomiting within the scope of clinical pharmacy. A sample of generated items satisfied the requirements of content-related validity and quality after substantive review. CONCLUSION: This research demonstrates that the AIG process is a viable strategy for creating a test item bank that provides MCQs appropriate for inclusion in a pharmacy licensing examination.
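Editor's illustrative aside: the three-step process (cognitive model, item model, computer-based assembly) with systematic distractor generation might look roughly like the sketch below. The scenario variables, keyed answers, and distractors are invented for illustration and are not the cognitive model used in the pharmacy examination.

```python
# Sketch of an item model whose distractors are tied to the cognitive model:
# each combination of scenario features maps to a keyed action plus matched,
# plausible-but-incorrect alternatives.
cognitive_model = {
    # (symptom duration, red flag present) -> keyed answer and matched distractors
    ("less than 24 hours", "no"): {
        "key": "recommend an oral rehydration solution",
        "distractors": ["recommend dimenhydrinate", "refer to the emergency department"],
    },
    ("more than 48 hours", "yes"): {
        "key": "refer to the emergency department",
        "distractors": ["recommend an oral rehydration solution", "recommend ginger tablets"],
    },
}

stem_template = ("A patient reports nausea and vomiting for {duration}, "
                 "with red-flag findings: {red_flag}. What is the most appropriate action?")

items = []
for (duration, red_flag), model in cognitive_model.items():
    stem = stem_template.format(duration=duration, red_flag=red_flag)
    items.append({"stem": stem, "options": [model["key"]] + model["distractors"], "key": model["key"]})

for item in items:
    print(item["stem"], item["options"], sep="\n", end="\n\n")
```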
Subjects
Education, Medical, Undergraduate; Education, Pharmacy; Pharmacy; Humans; Educational Measurement; Computers
ABSTRACT
CONTEXT: Many tests of medical knowledge, from the undergraduate level to the level of certification and licensure, contain multiple-choice items. Although these items are efficient for measuring examinees' knowledge and skills across diverse content areas, they are time-consuming and expensive to create. Changes in student assessment brought about by new forms of computer-based testing have created demand for large numbers of multiple-choice items, and current approaches to item development cannot meet this demand. METHODS: We present a methodology for developing multiple-choice items based on automatic item generation (AIG) concepts and procedures. We describe a three-stage approach to AIG and illustrate it by generating multiple-choice items for a medical licensure test in the content area of surgery. RESULTS: To generate multiple-choice items, our method requires a three-stage process. Firstly, a cognitive model is created by content specialists. Secondly, item models are developed using the content from the cognitive model. Thirdly, items are generated from the item models using computer software. Using this methodology, we generated 1,248 multiple-choice items from one item model. CONCLUSIONS: Automatic item generation is a process in which models are used to generate items with computer technology. With our method, content specialists identify and structure the content for the test items, and computer technology systematically combines that content to generate new test items. By combining these two activities, items can be generated automatically.
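Editor's illustrative aside: the assembly stage, in which software combines the permissible values of an item model's variables, can be sketched as below. The variables and values are invented toy content, not the surgery item model from which 1,248 items were generated.

```python
# Sketch of combinatorial item assembly from a single item model: every allowed
# combination of variable values is substituted into the stem template.
import itertools

variables = {
    "age": ["25-year-old", "45-year-old", "70-year-old"],
    "sex": ["man", "woman"],
    "finding": ["right lower quadrant pain", "epigastric pain", "left flank pain"],
    "duration": ["6 hours", "24 hours", "3 days"],
}
stem_model = ("A {age} {sex} presents with {finding} of {duration} duration. "
              "What is the most likely diagnosis?")

generated = [stem_model.format(**dict(zip(variables, combo)))
             for combo in itertools.product(*variables.values())]
print(len(generated), "stems generated")   # 3 * 2 * 3 * 3 = 54 in this toy model
```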
Subjects
Clinical Competence; Education, Medical, Undergraduate/methods; Educational Measurement/methods; General Surgery/education; Computer-Assisted Instruction/methods; Computer-Assisted Instruction/standards; Computers; Education, Medical, Undergraduate/standards; Educational Measurement/standards; Humans; Models, Educational
ABSTRACT
Writing a high-quality multiple-choice test item is a complex process. Creating plausible but incorrect options for each item poses significant challenges for the content specialist because this task is often undertaken without a systematic method. In the current study, we describe and demonstrate a systematic method for creating plausible but incorrect options, also called distractors, based on students' misconceptions. These misconceptions are extracted from labeled written responses. A total of 1,515 written responses from Grade 10 students to an existing constructed-response item in Biology were used to demonstrate the method. Using latent Dirichlet allocation, a topic modeling procedure commonly used in machine learning and natural language processing, 22 plausible misconceptions were identified in the students' written responses and used to produce a list of plausible distractors. These distractors, in turn, were used as part of new multiple-choice items. Implications for item development are discussed.
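Editor's illustrative aside: the misconception-mining step might be sketched as below, fitting latent Dirichlet allocation to written responses and inspecting topic terms as candidate distractor content. The toy responses and the number of topics are assumptions; the study worked with 1,515 labeled responses and identified 22 misconception topics.

```python
# Sketch of topic modeling on students' written responses to surface recurring
# misconceptions that can be reworded as plausible distractors.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

responses = [
    "photosynthesis happens only at night in the roots",
    "plants breathe in oxygen and breathe out carbon dioxide all day",
    "chlorophyll stores the sunlight as sugar directly",
    # ... in practice, the full set of labeled written responses
]

vectorizer = CountVectorizer(stop_words="english")
doc_term = vectorizer.fit_transform(responses)

lda = LatentDirichletAllocation(n_components=3, random_state=0)
lda.fit(doc_term)

terms = vectorizer.get_feature_names_out()
for topic_idx, weights in enumerate(lda.components_):
    top_terms = [terms[i] for i in weights.argsort()[::-1][:5]]
    print(f"Topic {topic_idx} (candidate misconception): {', '.join(top_terms)}")
```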
ABSTRACT
Computerized testing provides many benefits to support formative assessment. However, the advent of computerized formative testing has also raised formidable new challenges, particularly in the area of item development. Large numbers of diverse, high-quality test items are required because items are continuously administered to students. Hence, hundreds of items are needed to develop the banks necessary for computerized formative testing. One promising approach for addressing this test development challenge is automatic item generation, a relatively new but rapidly evolving research area in which cognitive and psychometric modeling practices are used to produce items with the aid of computer technology. The purpose of this study is to describe a new method for generating both the items and the rationales required to solve them, thereby producing the feedback needed for computerized formative testing. The method for rationale generation is demonstrated and evaluated in the medical education domain.
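Editor's illustrative aside: generating an item together with the rationale that becomes formative feedback might look roughly like the sketch below, where each model value carries both stem content and a short explanation of the keyed answer. The clinical content is invented for illustration and is not the study's generation method.

```python
# Sketch of paired item-and-rationale generation from one small model: each entry
# supplies the stem content, the keyed answer, and the rationale used as feedback.
finding_model = {
    "conjugated hyperbilirubinemia with dilated bile ducts": {
        "key": "biliary obstruction",
        "rationale": "Dilated ducts on ultrasound point to a post-hepatic, obstructive cause.",
    },
    "unconjugated hyperbilirubinemia with normal liver enzymes": {
        "key": "hemolysis",
        "rationale": "Isolated unconjugated bilirubin with normal enzymes suggests increased red cell breakdown.",
    },
}

for finding, model in finding_model.items():
    stem = f"A patient with jaundice has {finding}. What is the most likely cause?"
    feedback = f"Correct answer: {model['key']}. Rationale: {model['rationale']}"
    print(stem, feedback, sep="\n", end="\n\n")
```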
ABSTRACT
We present a framework for technology-enhanced scoring of bilingual clinical decision-making (CDM) questions using an open-source scoring technology, and we evaluate the strength of the proposed framework using operational data from the Medical Council of Canada Qualifying Examination. Candidates' responses to six write-in CDM questions were used to develop a three-stage automated scoring framework. In Stage 1, linguistic features were extracted from the CDM responses. In Stage 2, supervised machine learning techniques were employed to develop the scoring models. In Stage 3, responses to the six English and French CDM questions were scored using the scoring models from Stage 2. Of the 8,007 English and French CDM responses, 7,643 were accurately scored, with an agreement rate of 95.4% between human and computer scoring. This represents an improvement of 5.4% over the human inter-rater agreement. Our framework yielded scores similar to those of expert physician markers and could be used for clinical competency assessment.
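Editor's illustrative aside: the Stage 3 evaluation reduces to comparing machine-assigned scores with human scores, for example via exact agreement and Cohen's kappa, as sketched below with invented scores rather than the examination data.

```python
# Sketch of the human-computer agreement check: exact agreement rate plus kappa.
from sklearn.metrics import cohen_kappa_score

human_scores   = [1, 0, 1, 1, 0, 1, 0, 1]   # invented example scores
machine_scores = [1, 0, 1, 0, 0, 1, 0, 1]

exact_agreement = sum(h == m for h, m in zip(human_scores, machine_scores)) / len(human_scores)
print(f"Human-computer exact agreement: {exact_agreement:.1%}")
print(f"Cohen's kappa: {cohen_kappa_score(human_scores, machine_scores):.2f}")
```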
Subjects
Clinical Competence; Educational Measurement/methods; Educational Measurement/standards; Electronic Data Processing/standards; Translating; Canada; Clinical Decision-Making; Humans; Licensure, Medical; Reproducibility of Results
ABSTRACT
Test items for dentistry examinations are often written individually by content experts. This approach to item development is expensive because it requires the time and effort of many content experts yet yields relatively few items. The aim of this study was to describe and illustrate how items can be generated using a systematic approach. Automatic item generation (AIG) is an alternative method that allows a small number of content experts to produce large numbers of items by integrating their domain expertise with computer technology. This article describes and illustrates how three approaches to modeling item content (item cloning, cognitive modeling, and image-anchored modeling) can be used to generate large numbers of multiple-choice test items for examinations in dentistry. Test items can be generated by combining the expertise of two content specialists with computer technology through AIG. A total of 5,467 new items were created during this study. From substituting item content, to modeling appropriate responses based on a cognitive model of correct responses, to generating items linked to specific graphical findings, AIG has the potential to meet increasing demands for test items. Further, the methods described in this study can be generalized and applied to many other item types. Future research applications for AIG in dental education are discussed.
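Editor's illustrative aside: the simplest of the three approaches named above, item cloning, amounts to substituting compatible content into a parent item while looking up the keyed answer for each variant, as in the minimal sketch below. The parent stem, substituted content, and keys are invented for illustration only.

```python
# Sketch of item cloning: fill a fixed slot in a parent stem with alternative
# content and pair each clone with its keyed answer.
parent_stem = ("A radiograph of {tooth} shows a periapical radiolucency. "
               "What is the most likely diagnosis?")

clone_content = {  # substituted content -> keyed answer (illustrative)
    "a non-vital mandibular first molar": "chronic apical periodontitis",
    "a vital maxillary lateral incisor with an intact crown": "periapical cemento-osseous dysplasia",
}

clones = [{"stem": parent_stem.format(tooth=tooth), "key": key}
          for tooth, key in clone_content.items()]
for clone in clones:
    print(clone["stem"], "->", clone["key"])
```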