ABSTRACT
Psychometric methods for accurate and timely detection of item compromise have been a long-standing topic. While Bayesian methods can incorporate prior knowledge or expert inputs as additional information for item compromise detection, they have not been employed in item compromise detection itself. The current study proposes a two-phase Bayesian change-point framework for both stationary and real-time detection of changes in each item's compromise status. In Phase I, a stationary Bayesian change-point model for compromise detection is fitted to the observed responses over a specified time-frame. The model produces parameter estimates for the change-point of each item from uncompromised to compromised, as well as structural parameters accounting for the post-change response distribution. Using the post-change model identified in Phase I, the Shiryaev procedure for sequential testing is employed in Phase II for real-time monitoring of item compromise. The proposed methods are evaluated in terms of parameter recovery, detection accuracy, and detection efficiency under various simulation conditions and in a real data example. The proposed method also showed superior detection accuracy and efficiency compared to the cumulative sum procedure.
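The Phase II monitoring idea can be illustrated with a minimal sketch of the Shiryaev posterior recursion for a single item. The sketch makes several assumptions that are not in the abstract: responses are treated as Bernoulli with a known pre-change correct rate `p0` and post-change rate `p1`, the change-point has a geometric prior with parameter `rho`, and the function name `shiryaev_monitor` is hypothetical. It is not the authors' implementation.

```python
def shiryaev_monitor(responses, p0, p1, rho, threshold=0.95):
    """Sequentially update P(change has occurred | data) for one item.

    responses : iterable of 0/1 scores in the order they arrive
    p0, p1    : assumed correct-response rates before / after compromise
    rho       : assumed geometric prior probability of a change at each step
    threshold : flag the item once the posterior reaches this value
    Returns (stopping index or None, list of posterior probabilities).
    """
    pi = 0.0  # posterior probability that the change has already occurred
    posteriors = []
    for n, x in enumerate(responses, start=1):
        # Likelihood of the new response under each regime
        f0 = p0 if x == 1 else 1.0 - p0
        f1 = p1 if x == 1 else 1.0 - p1
        # Prior probability of "changed" before seeing x (geometric prior)
        prior_changed = pi + (1.0 - pi) * rho
        numerator = prior_changed * f1
        pi = numerator / (numerator + (1.0 - prior_changed) * f0)
        posteriors.append(pi)
        if pi >= threshold:
            return n, posteriors  # raise the alarm at response n
    return None, posteriors


if __name__ == "__main__":
    # A run of correct answers on an item that is normally answered
    # correctly half of the time pushes the posterior toward 1.
    stop, post = shiryaev_monitor([1] * 30, p0=0.5, p1=0.9, rho=0.05)
    print(stop, round(post[-1], 3))
```

In this sketch the post-change rate `p1` stands in for the post-change model that the abstract says is estimated in Phase I.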
Subjects
Knowledge, Bayes Theorem, Computer Simulation, Psychometrics/methods
ABSTRACT
Cognitive diagnosis models have become popular in educational assessment and are used to provide more individualized feedback about a student's specific strengths and weaknesses than traditional total scores. However, if the testing data are contaminated by certain biases or aberrant response patterns, such predictions may not be accurate. The current research objective is to develop a new person-fit method that is based on machine learning and improves the functionality of existing person-fit methods. Various simulations were designed under three aberrant conditions: cheating, sleeping and random guessing. Simulation results showed that the new method was more powerful and effective than previous methods, especially for short-length tests.
Subjects
Educational Measurement, Machine Learning, Computer Simulation, Educational Measurement/methods, Educational Status, Humans, Research Design
ABSTRACT
E-learning systems are capable of providing more adaptive and efficient learning experiences for learners than traditional classroom settings. A key component of such systems is the learning policy, an algorithm that designs learning paths: it selects learning materials for learners based on information such as the learners' current progress and skills and the content of the learning materials. In this article, the authors address the problem of finding the optimal learning policy. To this end, a model for learners' hierarchical skills in the E-learning system is first developed. Based on the hierarchical skill model and the classical cognitive diagnosis model, a framework to model various mastery levels related to hierarchical skills is further developed. The optimal learning path in consideration of the hierarchical structure of skills is found by applying a model-free reinforcement learning method, which does not require any assumption about learners' learning transition processes. The effectiveness of the proposed framework is demonstrated via simulation studies.
ABSTRACT
As one of the most influential international large-scale educational assessments, the Program for International Student Assessment (PISA) provides a valuable platform for horizontal comparisons across international education systems. The cognitive diagnostic model, a newly developed evaluation theory, can integrate measurement goals into the cognitive process model through cognitive analysis, which provides a better understanding of students' mastery of fine-grained knowledge points. On the basis of the mathematical measurement framework of PISA 2012, 11 attributes were formed from three dimensions in this study. Twelve test items, with item responses from 24,512 students in 10 countries, were selected, and the analyses were divided into several steps. First, the relationships between the 11 attributes and the 12 test items were classified to form a Q matrix. Second, the cognitive model of the PISA mathematics test was established. The linear logistic model (LLM), which showed better model fit, was selected as the parameter evaluation model through model comparisons. By analyzing the knowledge states of these countries and the prerequisite relations among the attributes, this study explored the different learning trajectories of students in the content field. The results showed that students from Australia, Canada, the United Kingdom, and Russia shared similar main learning trajectories, while Finland and Japan were consistent with each other in their main learning trajectories. The primary learning trajectories of the United States and China were the same. Furthermore, the learning trajectory for Singapore was the most complicated, as it showed a diverse learning process, whereas the trajectories in the United States and Saudi Arabia were relatively simple.
This study identified the differences in students' mastery of the 11 cognitive attributes from the three dimensions of content, process, and context across the 10 countries, which provides a reference for further understanding of the PISA test results in other countries and offers some evidence for a deeper understanding of the strengths and weaknesses of mathematics education in various countries.
ABSTRACT
Cognitive diagnostic computerized adaptive testing (CD-CAT) aims to obtain more useful diagnostic information by taking advantage of computerized adaptive testing (CAT). Cognitive diagnosis models (CDMs) have been developed to classify examinees into the correct proficiency classes so as to get more efficient remediation, whereas CAT tailors optimal items to the examinee's mastery profile. The item selection method is the key factor of the CD-CAT procedure. In recent years, a large number of parametric/nonparametric item selection methods have been proposed. In this article, the authors proposed a series of stratified item selection methods in CD-CAT, which are combined with posterior-weighted Kullback-Leibler (PWKL), nonparametric item selection (NPS), and weighted nonparametric item selection (WNPS) methods, and named S-PWKL, S-NPS, and S-WNPS, respectively. Two different types of stratification indices were used: original versus novel. The performances of the proposed item selection methods were evaluated via simulation studies and compared with the PWKL, NPS, and WNPS methods without stratification. Manipulated conditions included calibration sample size, item quality, number of attributes, number of strata, and data generation models. Results indicated that the S-WNPS and S-NPS methods performed similarly, and both outperformed the S-PWKL method. Item selection methods with novel stratification indices performed slightly better than those with original stratification indices, and those without stratification performed the worst.
ABSTRACT
Students who wish to learn a specific skill have increasing access to a growing number of online courses and open-source educational repositories of instructional tools, including videos, slides, and exercises. Navigating these tools is time-consuming and the search itself can hinder the learning of the skill. Educators are hence interested in aiding students by selecting the optimal content sequence for individual learners, specifically which skill one should learn next and which material one should use to study. Such adaptive selection would rely on pre-knowledge of how the learners' and the instructional tools' characteristics jointly affect the probability of acquiring a certain skill. Building upon previous research on Latent Transition Analysis and Learning Trajectories, we propose a multilevel logistic hidden Markov model for learning based on cognitive diagnosis models, where the probability that a learner acquires the target skill depends not only on the general difficulty of the skill and the learner's mastery of other skills in the curriculum but also on the effectiveness of the particular learning tool and its interaction with mastery of other skills, captured by random slopes and intercepts for each learning tool. A Bayesian modeling framework and an MCMC algorithm for parameter estimation are proposed and evaluated using a simulation study.
Subjects
Cognition, Bayes Theorem, Curriculum, Humans, Learning
ABSTRACT
Content balancing is one of the most important issues in computerized classification testing. To adapt to variable-length forms, special treatments are needed to successfully control content constraints without knowledge of test length during the test. To this end, we propose the notions of 'look-ahead' and 'step size' to adaptively control content constraints in each item selection step. The step size gives a prediction of the number of items to be selected at the current stage, that is, how far we will look ahead. Two look-ahead content balancing (LA-CB) methods, one with a constant step size and another with an adaptive step size, are proposed as feasible solutions to balancing content areas in variable-length computerized classification testing. The proposed LA-CB methods are compared with conventional item selection methods in variable-length tests and are examined with different classification methods. Simulation results show that, integrated with heuristic item selection methods, the proposed LA-CB methods result in fewer constraint violations and can maintain higher classification accuracy. In addition, the LA-CB method with an adaptive step size outperforms that with a constant step size in content management. Furthermore, the LA-CB methods generate higher test efficiency while using the sequential probability ratio test classification method.
Subjects
Educational Measurement/methods, Psychological Models, Computer Simulation, Computers, Humans, Statistical Models
ABSTRACT
The average test overlap rate is often computed and reported as a measure of test security risk or item pool utilization of a computerized adaptive test (CAT). Despite the prevalent use of this sample statistic in both literature and operations, its sampling distribution has never been known nor studied in earnest. In response, a proof is presented for the asymptotic distribution of a linear transformation of the average test overlap rate in fixed-length CAT. The theoretical results enable the estimation of standard error and construction of confidence intervals. Moreover, a practical simulation study demonstrates the statistical comparison of average test overlap rates between two CAT designs with different exposure control methods.
Subjects
Algorithms, Computer Simulation, Computers, Educational Measurement, Humans
ABSTRACT
Attribute hierarchy is a common assumption in the educational context, where the mastery of one attribute is assumed to be a prerequisite to the mastery of another one. The attribute hierarchy can be incorporated through a restricted Q matrix that implies the specified structure. The latent class-based cognitive diagnostic models (CDMs) usually do not assume a hierarchical structure among attributes, which means all profiles of attributes are possible in a population of interest. This study investigates how different estimation methods affect the classification accuracy of a family of CDMs when they are combined with a restricted Q-matrix design. A simulation study is used to explain the misclassification caused by an unrestricted estimation procedure. The advantages of the restricted estimation procedure utilizing attribute hierarchies for increased classification accuracy are further illustrated through a real data analysis of a syllogistic reasoning diagnostic assessment. This research can provide guidelines for educational and psychological researchers and practitioners who use CDMs to analyze data with a restricted Q-matrix design, and make them aware that classification results may be contaminated if attribute hierarchies are ignored.
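To make the effect of an attribute hierarchy concrete, the following minimal sketch enumerates the attribute profiles that remain permissible once prerequisite relations are imposed. The function name `permissible_profiles` and the encoding of prerequisites as `(a, b)` index pairs are illustrative assumptions, not part of the study.

```python
from itertools import product


def permissible_profiles(n_attributes, prerequisites):
    """List the 0/1 attribute profiles consistent with a hierarchy.

    prerequisites : iterable of (a, b) pairs (0-indexed), meaning that
                    attribute a must be mastered before attribute b.
    """
    profiles = []
    for profile in product((0, 1), repeat=n_attributes):
        # A profile is allowed only if no attribute is mastered
        # without its prerequisite also being mastered.
        if all(profile[a] >= profile[b] for a, b in prerequisites):
            profiles.append(profile)
    return profiles


if __name__ == "__main__":
    # Linear hierarchy A1 -> A2 -> A3: only 4 of the 2**3 = 8 profiles remain.
    print(permissible_profiles(3, [(0, 1), (1, 2)]))
    # [(0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 1, 1)]
```

An unrestricted estimation procedure would spread probability over all 8 profiles in this example, including the 4 that the hierarchy rules out.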
ABSTRACT
For item selection in cognitive diagnostic computerized adaptive testing (CD-CAT), ideally, a single item selection index should be created to simultaneously regulate precision, exposure status, and attribute balancing. For this purpose, in this study, we first proposed an attribute-balanced item selection criterion, namely, the standardized weighted deviation global discrimination index (SWDGDI), and subsequently formulated the constrained progressive index (CP_SWDGDI) by casting the SWDGDI in a progressive algorithm. A simulation study revealed that the SWDGDI method was effective in balancing attribute coverage and the CP_SWDGDI method was able to simultaneously balance attribute coverage and item pool usage while maintaining acceptable estimation precision. This research also demonstrates the advantage of a relatively low number of attributes in CD-CAT applications.
ABSTRACT
The compatibility of computerized adaptive testing (CAT) with response revision has been a topic of debate in psychometrics for many years. The problem is to provide test takers opportunities to change their answers during the test, while discouraging deceptive strategies and preserving the statistical efficiency of the traditional CAT. The estimation approach proposed in Wang et al. (Stat Sin 27(4):1987-2010, 2017), based on the nominal response model, allows test takers to provide more than one answer to each item during the test, all of which contribute to the interim and final ability estimation. This approach is here reformulated, extended to incorporate a larger class of polytomous and dichotomous item response theory models, and investigated with simulation studies under different test-taking strategies.
Subjects
Educational Measurement, Statistical Models, Psychometrics, Software, Algorithms, Humans, Psychological Theory
ABSTRACT
A monotone relationship between a true score (τ) and a latent trait level (θ) has been a key assumption for many psychometric applications. The monotonicity property in dichotomous response models is evident as a result of a transformation via a test characteristic curve. Monotonicity in polytomous models, in contrast, is not immediately obvious because item response functions are determined by a set of response category curves, which are conceivably non-monotonic in θ. The purpose of the present note is to demonstrate strict monotonicity in ordered polytomous item response models. Five models that are widely used in operational assessments are considered for proof: the generalized partial credit model (Muraki, 1992, Applied Psychological Measurement, 16, 159), the nominal model (Bock, 1972, Psychometrika, 37, 29), the partial credit model (Masters, 1982, Psychometrika, 47, 147), the rating scale model (Andrich, 1978, Psychometrika, 43, 561), and the graded response model (Samejima, 1972, A general model for free-response data (Psychometric Monograph no. 18). Psychometric Society, Richmond). The study asserts that the item response functions in these models strictly increase in θ and thus there exists strict monotonicity between τ and θ under certain specified conditions. This conclusion validates the practice of customarily using τ in place of θ in applied settings and provides theoretical grounds for one-to-one transformations between the two scales.
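The monotonicity claim can be checked numerically for one of the listed models. The sketch below evaluates the expected item score under the generalized partial credit model on a grid of θ values and confirms that it increases. The parameter values, the function names, and the assumption of a positive discrimination `a` are illustrative choices, not taken from the note, and a numerical check is of course not a substitute for the proof.

```python
import math


def gpcm_expected_score(theta, a, thresholds):
    """Expected score of one GPCM item with categories 0..len(thresholds)."""
    # Cumulative logits: z_0 = 0, z_k = sum over v <= k of a * (theta - b_v)
    z = [0.0]
    for b in thresholds:
        z.append(z[-1] + a * (theta - b))
    z_max = max(z)  # subtract the maximum for numerical stability
    weights = [math.exp(v - z_max) for v in z]
    total = sum(weights)
    return sum(k * w for k, w in enumerate(weights)) / total


def true_score(theta, items):
    """True score tau(theta): sum of expected item scores over the test."""
    return sum(gpcm_expected_score(theta, a, b) for a, b in items)


if __name__ == "__main__":
    # Two hypothetical items; the first has deliberately disordered thresholds.
    items = [(1.2, [0.5, -0.5, 1.0]), (0.8, [-1.0, 0.3])]
    grid = [-4 + 0.1 * i for i in range(81)]
    taus = [true_score(t, items) for t in grid]
    print(all(later > earlier for earlier, later in zip(taus, taus[1:])))
```

For this model the derivative of the expected item score with respect to θ equals `a` times the conditional variance of the item score, which is why the check succeeds whenever `a` is positive.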
Subjects
Theoretical Models, Psychometrics/methods, Humans, Likelihood Functions, Psychological Theory
ABSTRACT
Item compromise persists in undermining the integrity of testing, even secure administrations of computerized adaptive testing (CAT) with sophisticated item exposure controls. In ongoing efforts to tackle this perennial security issue in CAT, a couple of recent studies investigated sequential procedures for detecting compromised items, in which a significant increase in the proportion of correct responses for each item in the pool is monitored in real time using moving averages. In addition to actual responses, response times are valuable information with tremendous potential to reveal items that may have been leaked. Specifically, examinees that have preknowledge of an item would likely respond more quickly to it than those who do not. Therefore, the current study proposes several augmented methods for the detection of compromised items, all involving simultaneous monitoring of changes in both the proportion correct and average response time for every item using various moving average strategies. Simulation results with an operational item pool indicate that, compared to the analysis of responses alone, utilizing response times can afford marked improvements in detection power with fewer false positives.
Subjects
Computers, Reaction Time, Algorithms, Computer Simulation, Humans, Psychometrics
ABSTRACT
Multidimensional computerized adaptive testing (MCAT) has received increasing attention over the past few years in educational measurement. Like all other formats of CAT, item replenishment is an essential part of MCAT for its item bank maintenance and management, which governs retiring overexposed or obsolete items over time and replacing them with new ones. Moreover, calibration precision of the new items will directly affect the estimation accuracy of examinees' ability vectors. In unidimensional CAT (UCAT) and cognitive diagnostic CAT, online calibration techniques have been developed to effectively calibrate new items. However, there has been very little discussion of online calibration in MCAT in the literature. Thus, this paper proposes new online calibration methods for MCAT based upon some popular methods used in UCAT. Three representative methods, Method A, the 'one EM cycle' method and the 'multiple EM cycles' method, are generalized to MCAT. Three simulation studies were conducted to compare the three new methods by manipulating three factors (test length, item bank design, and level of correlation between coordinate dimensions). The results showed that all the new methods were able to recover the item parameters accurately, and the adaptive online calibration designs showed some improvements compared to the random design under most conditions.
Subjects
Algorithms, Calibration/standards, Computer Simulation, Educational Measurement/methods, Educational Measurement/standards, Psychometrics/standards, Statistical Models, Online Systems, Psychometrics/methods, Reproducibility of Results, Sensitivity and Specificity
ABSTRACT
An informational distance/divergence-based approach is proposed to detect the presence of parameter drift in multidimensional computerized adaptive testing (MCAT). The study presents significance testing procedures for identifying changes in multidimensional item response functions (MIRFs) over time based on informational distance/divergence measures that capture the discrepancy between two probability functions. To approximate the MIRFs from the observed response data, the k-nearest neighbors algorithm is used with the random search method. A simulation study suggests that the distance/divergence-based drift measures perform effectively in identifying the instances of parameter drift in MCAT. They showed moderate power with small samples of 500 examinees and excellent power when the sample size was as large as 1,000. The proposed drift measures also adequately controlled for Type I error at the nominal level under the null hypothesis.
ABSTRACT
Cognitive diagnostic computerized adaptive testing (CD-CAT) purports to obtain useful diagnostic information with the great efficiency brought by CAT technology. Most of the existing CD-CAT item selection algorithms are evaluated when test length is fixed and relatively long, but some applications of CD-CAT, such as interim assessment, require obtaining the cognitive pattern with a short test. The mutual information (MI) algorithm proposed by Wang is the first endeavor to accommodate this need. To reduce the computational burden, Wang provided a simplified scheme, but at the price of a scale/sign change in the original index. As a result, it is very difficult to combine it with some popular constraint management methods. The current study proposes two high-efficiency algorithms, the posterior-weighted cognitive diagnostic model (CDM) discrimination index (PWCDI) and the posterior-weighted attribute-level CDM discrimination index (PWACDI), by modifying the CDM discrimination index. They can be considered extensions of the Kullback-Leibler (KL) and posterior-weighted KL (PWKL) methods. A pre-calculation strategy has also been developed to address the computational issue. Simulation studies indicate that the newly developed methods can produce results comparable with or better than those of MI and PWKL in both short and long tests. The other major advantage is that the computational issue has been addressed more elegantly than in MI: PWCDI and PWACDI can run as fast as PWKL. More importantly, they do not suffer from the scale/sign change problem that affects MI and, thus, can be used together with constraint management methods in a straightforward manner.
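For orientation, the PWKL baseline that the proposed indices extend can be sketched as follows, using the form in which the index is commonly written in the CD-CAT literature: the Kullback-Leibler divergence between the response distribution at the current profile estimate and at each candidate profile, weighted by the posterior probability of that candidate. The DINA response model, the parameter values, and the function names are assumptions made for illustration. The sketch does not implement PWCDI or PWACDI, whose definitions are not given in the abstract.

```python
import math
from itertools import product


def dina_prob(alpha, q, slip, guess):
    """P(correct) under an assumed DINA model for profile alpha and q-vector q."""
    has_all_required = all(a >= qk for a, qk in zip(alpha, q))
    return 1.0 - slip if has_all_required else guess


def pwkl(alpha_hat, q, slip, guess, posterior):
    """Posterior-weighted KL index of one item at the current estimate alpha_hat.

    posterior : dict mapping each candidate profile (tuple) to its posterior weight
    Requires 0 < slip < 1 and 0 < guess < 1 so that the logarithms are defined.
    """
    p_hat = dina_prob(alpha_hat, q, slip, guess)
    index = 0.0
    for alpha_c, weight in posterior.items():
        p_c = dina_prob(alpha_c, q, slip, guess)
        # KL divergence between the two Bernoulli response distributions
        kl = (p_hat * math.log(p_hat / p_c)
              + (1.0 - p_hat) * math.log((1.0 - p_hat) / (1.0 - p_c)))
        index += weight * kl
    return index


if __name__ == "__main__":
    profiles = list(product((0, 1), repeat=2))
    uniform = {profile: 1.0 / len(profiles) for profile in profiles}
    # The item with smaller slip and guess parameters receives the larger index.
    print(pwkl((1, 1), (1, 1), 0.1, 0.1, uniform))
    print(pwkl((1, 1), (1, 1), 0.3, 0.3, uniform))
```

In an adaptive test the index would be computed for every eligible item and the item with the largest value administered next.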
ABSTRACT
The paper provides a survey of 18 years' progress that my colleagues, students (both former and current) and I have made in a prominent research area in psychometrics: Computerized Adaptive Testing (CAT). We start with a historical review of the establishment of a large sample foundation for CAT. It is worth noting that the asymptotic results were derived under the framework of Martingale Theory, a very theoretical perspective of Probability Theory, which may seem unrelated to educational and psychological testing. In addition, we address a number of issues that emerged from large scale implementation and show how theoretical work can help solve these problems. Finally, we propose that CAT technology can be very useful to support individualized instruction on a mass scale. We show that even paper and pencil based tests can be made adaptive to support classroom teaching.
Subjects
Computers, Educational Measurement/methods, Psychometrics/methods, Algorithms, Humans, Statistical Models, Probability Theory
ABSTRACT
Recently, multistage testing (MST) has been adopted by several important large-scale testing programs and become popular among practitioners and researchers. Stemming from the decades of history of computerized adaptive testing (CAT), the rapidly growing MST alleviates several major problems of earlier CAT applications. Nevertheless, MST is only one among all possible solutions to these problems. This article presents a new adaptive testing design, "on-the-fly assembled multistage adaptive testing" (OMST), which combines the benefits of CAT and MST and offsets their limitations. Moreover, OMST also provides some unique advantages over both CAT and MST. A simulation study was conducted to compare OMST with MST and CAT, and the results demonstrated the promising features of OMST. Finally, the "Discussion" section provides suggestions on possible future adaptive testing designs based on the OMST framework, which could provide great flexibility for adaptive tests in the digital future and open an avenue for all types of hybrid designs based on the different needs of specific tests.
ABSTRACT
With the advent of web-based technology, online testing is becoming a mainstream mode in large-scale educational assessments. Most online tests are administered continuously in a testing window, which may pose test security problems because examinees who take the test earlier may share information with those who take the test later. Researchers have proposed various statistical indices to assess test security, and one of the most often used indices is the average test-overlap rate, which was further generalized to the item pooling index (Chang & Zhang, 2002, 2003). These indices, however, are all defined as means (that is, the expected proportion of common items among examinees), and they were originally proposed for computerized adaptive testing (CAT). Recently, multistage testing (MST) has become a popular alternative to CAT. The unique features of MST make it important to report not only the mean, but also the standard deviation (SD) of the test overlap rate, as we advocate in this paper. The standard deviation of the test overlap rate adds important information to the test security profile, because for the same mean, a large SD reflects that certain groups of examinees share more common items than other groups. In this study, we analytically derived the lower bounds of the SD under MST, with the results under CAT as a benchmark. It is shown that when the mean overlap rate is the same between MST and CAT, the SD of test overlap tends to be larger in MST. A simulation study was conducted to provide empirical evidence. We also compared the security of MST under the single-pool versus the multiple-pool designs; both analytical and simulation studies show that the non-overlapping multiple-pool design will slightly increase the security risk.
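As a small illustration of the two summary statistics discussed above, the sketch below computes the pairwise test overlap rates for a set of fixed-length tests and reports their mean and SD. The function name, the representation of each administered test as a set of item identifiers, and the use of the population SD over all examinee pairs are assumptions of this sketch. The paper's analytical lower bounds are not reproduced here.

```python
from itertools import combinations
from statistics import mean, pstdev


def overlap_rates(tests):
    """Pairwise overlap rates for fixed-length tests.

    tests : list of sets of item identifiers, one set per examinee,
            all of the same length.
    Returns one rate per pair of examinees: shared items / test length.
    """
    length = len(tests[0])
    if any(len(t) != length for t in tests):
        raise ValueError("all tests must have the same length")
    return [len(a & b) / length for a, b in combinations(tests, 2)]


if __name__ == "__main__":
    # Three hypothetical examinees, four items each.
    administered = [{1, 2, 3, 4}, {1, 2, 5, 6}, {1, 7, 8, 9}]
    rates = overlap_rates(administered)
    print(rates)          # [0.5, 0.25, 0.25]
    print(mean(rates))    # average test overlap rate
    print(pstdev(rates))  # SD of the test overlap rate across pairs
```

Two designs with the same mean can differ in SD: a larger SD indicates that some pairs of examinees share many more items than others.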