Results 1 - 20 of 20
1.
Behav Res Methods ; 54(5): 2114-2145, 2022 Oct.
Article in English | MEDLINE | ID: mdl-34910286

ABSTRACT

In social sciences, the study of group differences concerning latent constructs is ubiquitous. These constructs are generally measured by means of scales composed of ordinal items. In order to compare these constructs across groups, one crucial requirement is that they are measured equivalently or, in technical jargon, that measurement invariance (MI) holds across the groups. This study compared the performance of scale- and item-level approaches based on multiple group categorical confirmatory factor analysis (MG-CCFA) and multiple group item response theory (MG-IRT) in testing MI with ordinal data. In general, the results of the simulation studies showed that MG-CCFA-based approaches outperformed MG-IRT-based approaches when testing MI at the scale level, whereas, at the item level, the best performing approach depends on the tested parameter (i.e., loadings or thresholds). That is, when testing loadings equivalence, the likelihood ratio test provided the best trade-off between true-positive rate and false-positive rate, whereas, when testing thresholds equivalence, the χ2 test outperformed the other testing strategies. In addition, the performance of MG-CCFA's fit measures, such as RMSEA and CFI, seemed to depend largely on the length of the scale, especially when MI was tested at the item level. General caution is recommended when using these measures, especially when MI is tested for each item individually.
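For reference, the categorical CFA measurement model underlying MG-CCFA can be sketched in generic notation (the symbols below are illustrative and not taken from the abstract): each ordinal item response x_{ij} is assumed to arise from an underlying continuous response,

y^*_{ij} = \lambda_j \eta_i + \varepsilon_{ij}, \qquad x_{ij} = c \iff \tau_{j,c} < y^*_{ij} \le \tau_{j,c+1},

so that testing MI amounts to testing whether the loadings \lambda_j and the thresholds \tau_{j,c} can be constrained to be equal across groups.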


Subjects
Factor Analysis, Humans, Psychometrics/methods
2.
Behav Res Methods ; 50(6): 2325-2344, 2018 12.
Article in English | MEDLINE | ID: mdl-29322400

ABSTRACT

This article proposes a general mixture item response theory (IRT) framework that allows for classes of persons to differ with respect to the type of processes underlying the item responses. Through the use of mixture models, nonnested IRT models with different structures can be estimated for different classes, and class membership can be estimated for each person in the sample. If researchers are able to provide competing measurement models, this mixture IRT framework may help them deal with some violations of measurement invariance. To illustrate this approach, we consider a two-class mixture model, where a person's responses to Likert-scale items containing a neutral middle category are either modeled using a generalized partial credit model, or through an IRTree model. In the first model, the middle category ("neither agree nor disagree") is taken to be qualitatively similar to the other categories, and is taken to provide information about the person's endorsement. In the second model, the middle category is taken to be qualitatively different and to reflect a nonresponse choice, which is modeled using an additional latent variable that captures a person's willingness to respond. The mixture model is studied using simulation studies and is applied to an empirical example.
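As a point of reference, the generalized partial credit model used for the first class can be written in its standard form (generic notation):

P(X_{ij} = k \mid \theta_i) = \frac{\exp\left(\sum_{v=1}^{k} a_j(\theta_i - b_{jv})\right)}{\sum_{c=0}^{m_j} \exp\left(\sum_{v=1}^{c} a_j(\theta_i - b_{jv})\right)}, \qquad k = 0, \ldots, m_j,

with the empty sum for k = 0 equal to zero. The IRTree alternative instead routes responses through binary decision nodes, the first of which models the choice between giving a substantive response and selecting the neutral category.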


Subjects
Decision Making, Statistical Models, Psychometrics, Test Taking Skills/psychology, Humans, Psychological Framing, Task Performance and Analysis, Weights and Measures
3.
Educ Psychol Meas ; 84(1): 145-170, 2024 Feb.
Article in English | MEDLINE | ID: mdl-38250509

ABSTRACT

Extreme response style (ERS), the tendency of participants to select extreme item categories regardless of the item content, has frequently been found to decrease the validity of Likert-type questionnaire results. For this reason, various item response theory (IRT) models have been proposed to model ERS and correct for it. Comparisons of these models are, however, rare in the literature, especially in the context of cross-cultural comparisons, where ERS is even more relevant due to cultural differences between groups. To remedy this issue, the current article examines two frequently used IRT models that can be estimated using standard software: a multidimensional nominal response model (MNRM) and an IRTree model. Studying conceptual differences between these models reveals that they differ substantially in their conceptualization of ERS. These differences result in different category probabilities between the models. To evaluate the impact of these differences in a multigroup context, a simulation study is conducted. Our results show that when the groups differ in their average ERS, the IRTree model and the MNRM can drastically differ in their conclusions about the size and presence of differences in the substantive trait between these groups. An empirical example is given, and implications for the future use of both models and the conceptualization of ERS are discussed.
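A common way to write such a multidimensional nominal response model with a substantive trait θ and an ERS trait η is (a generic formulation; the article's exact parameterization may differ):

P(X_{ij} = k \mid \theta_i, \eta_i) = \frac{\exp(a_{jk}\theta_i + s_k \eta_i + c_{jk})}{\sum_{l} \exp(a_{jl}\theta_i + s_l \eta_i + c_{jl})},

where the ERS scoring weights s_k are, for example, fixed to 1 for the two extreme categories and to 0 otherwise.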

4.
Educ Psychol Meas ; 83(3): 433-472, 2023 Jun.
Article in English | MEDLINE | ID: mdl-37187696

ABSTRACT

Assessing the measurement model (MM) of self-report scales is crucial to obtain valid measurements of individuals' latent psychological constructs. This entails evaluating the number of measured constructs and determining which construct is measured by which item. Exploratory factor analysis (EFA) is the most-used method to evaluate these psychometric properties, where the number of measured constructs (i.e., factors) is assessed and, afterward, rotational freedom is resolved to interpret these factors. This study assessed the effects of an acquiescence response style (ARS) on EFA for unidimensional and multidimensional (un)balanced scales. Specifically, we evaluated (a) whether ARS is captured as an additional factor, (b) the effect of different rotation approaches on the recovery of the content and ARS factors, and (c) the effect of extracting the additional ARS factor on the recovery of the factor loadings. ARS was often captured as an additional factor in balanced scales when it was strong. For these scales, failing to extract this additional ARS factor, or rotating to simple structure when extracting it, harmed the recovery of the original MM by introducing bias in loadings and cross-loadings. These issues were avoided by using informed rotation approaches (i.e., target rotation), where (part of) the rotation target is specified according to a priori expectations about the MM. Not extracting the additional ARS factor did not affect the loading recovery in unbalanced scales. Researchers should consider the potential presence of ARS when assessing the psychometric properties of balanced scales and use informed rotation approaches when suspecting that an additional factor is an ARS factor.
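A minimal numerical sketch of the informed (target) rotation idea, assuming a fully specified target matrix and an orthogonal (Procrustes) criterion; real target rotation typically lets part of the target remain unspecified and may allow oblique factors. The function name, loadings, and target values below are hypothetical.

import numpy as np

def procrustes_target_rotation(loadings, target):
    """Orthogonally rotate `loadings` toward `target` in the least-squares sense."""
    u, _, vt = np.linalg.svd(loadings.T @ target)
    rotation = u @ vt                      # orthogonal rotation matrix
    return loadings @ rotation, rotation

# Hypothetical example: 6 balanced items, 1 content factor + 1 ARS factor.
# The ARS column of the target is fixed to ones for every item.
unrotated = np.random.default_rng(1).normal(size=(6, 2))   # placeholder loadings
target = np.column_stack([[1, 1, 1, -1, -1, -1],            # expected content signs
                          [1, 1, 1, 1, 1, 1]])              # ARS loads on all items
rotated, T = procrustes_target_rotation(unrotated, target)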

5.
Br J Math Stat Psychol ; 74 Suppl 1: 176-198, 2021 07.
Article in English | MEDLINE | ID: mdl-33351188

ABSTRACT

With advances in computerized tests, it has become commonplace to register not just the accuracy of the responses provided to the items, but also the response time. The idea that for each response both response accuracy and response time are indicative of ability has explicitly been incorporated in the signed residual time (SRT) model (Maris & van der Maas, 2012, Psychometrika, 77, 615-633), which assumes that fast correct responses are indicative of a higher level of ability than slow correct responses. While the SRT model allows one to gain more information about ability than is possible based on considering only response accuracy, measurement may be confounded if persons show differences in their response speed that cannot be explained by ability, for example due to differences in response caution. In this paper we propose an adapted version of the SRT model that makes it possible to model person differences in overall speed, while maintaining the idea of the SRT model that the speed at which individual responses are given may be indicative of ability. We propose a two-dimensional SRT model that considers dichotomized response time, which allows one to model differences between fast and slow responses. The model includes both an ability and a speed parameter, and allows one to correct the estimates of ability for possible differences in overall speed. The performance of the model is evaluated through simulation, and the relevance of including the speed parameter is studied in the context of an empirical example from formative educational assessment.
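The scoring rule of the SRT model that this extension builds on can be summarized, up to scaling, as (generic notation, with d the item time limit):

S_{pi} \propto (2X_{pi} - 1)\,(d - T_{pi}),

so that fast correct responses receive the highest scores and fast incorrect responses the lowest; the proposed two-dimensional model additionally works with dichotomized response time so that overall speed can be separated from ability.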


Subjects
Individuality, Statistical Models, Computer Simulation, Humans, Psychometrics, Reaction Time
6.
Front Psychol ; 12: 579128, 2021.
Article in English | MEDLINE | ID: mdl-33815190

ABSTRACT

Log-file data from computer-based assessments can provide useful collateral information for estimating student abilities. In turn, this can improve traditional approaches that only consider response accuracy. Based on the amounts of time students spent on 10 mathematics items from the PISA 2012, this study evaluated the overall changes in and measurement precision of ability estimates and explored country-level heterogeneity when combining item responses and time-on-task measurements using a joint framework. Our findings suggest a notable increase in precision with the incorporation of response times and indicate differences between countries in how respondents approached items as well as in their response processes. Results also showed that additional information could be captured through differences in the modeling structure when response times were included. However, such information may not reflect the testing objective.

7.
Psychometrika ; 84(4): 1018-1046, 2019 12.
Article in English | MEDLINE | ID: mdl-31463656

ABSTRACT

While standard joint models for response time and accuracy commonly assume the relationship between response time and accuracy to be fully explained by the latent variables of the model, this assumption of conditional independence is often violated in practice. If such violations are present, taking these residual dependencies between response time and accuracy into account may both improve the fit of the model to the data and improve our understanding of the response processes that led to the observed responses. In this paper, we propose a framework for the joint modeling of response time and accuracy data that allows for differences in the processes leading to correct and incorrect responses. Extensions of the standard hierarchical model (van der Linden in Psychometrika 72:287-308, 2007. https://doi.org/10.1007/s11336-006-1478-z) are considered that allow some or all item parameters in the measurement model of speed to differ depending on whether a correct or an incorrect response was obtained. The framework also allows one to consider models that include two speed latent variables, which explain the patterns observed in the response times of correct and of incorrect responses, respectively. Model selection procedures are proposed and evaluated based on a simulation study, and a simulation study investigating parameter recovery is presented. An application of the modeling framework to empirical data from an international large-scale assessment is considered to illustrate the relevance of modeling possible differences between the processes leading to correct and incorrect responses.
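For orientation, the standard hierarchical model being extended combines a lognormal model for response time with an IRT model for accuracy; in generic notation (not quoted from the article):

\ln T_{pi} \sim N(\beta_i - \tau_p, \sigma_i^2), \qquad P(X_{pi} = 1 \mid \theta_p) = \operatorname{logit}^{-1}\big(\alpha_i(\theta_p - b_i)\big),

with the person parameters (\theta_p, \tau_p) and the item parameters linked through higher-level covariance structures. The extensions discussed above let parameters such as \beta_i and \sigma_i differ depending on whether the response was correct or incorrect.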


Subjects
Computer Simulation, Statistical Models, Psychometrics, Reaction Time/physiology, Algorithms, Statistical Data Interpretation, Humans
8.
Psychometrika ; 84(3): 846-869, 2019 09.
Article in English | MEDLINE | ID: mdl-30793230

ABSTRACT

The assumption of latent monotonicity is made by all common parametric and nonparametric polytomous item response theory models and is crucial for establishing an ordinal level of measurement of the item score. Three forms of latent monotonicity can be distinguished: monotonicity of the cumulative probabilities, of the continuation ratios, and of the adjacent-category ratios. Observable consequences of these different forms of latent monotonicity are derived, and Bayes factor methods for testing these consequences are proposed. These methods allow for the quantification of the evidence both in favor and against the tested property. Both item-level and category-level Bayes factors are considered, and their performance is evaluated using a simulation study. The methods are applied to an empirical example consisting of a 10-item Likert scale to investigate whether a polytomous item scoring rule results in item scores that are of ordinal level measurement.
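The three forms can be stated explicitly (generic notation; X is the item score and θ the latent variable): monotonicity of the cumulative probabilities requires P(X \ge x \mid \theta) to be nondecreasing in θ for every category x; monotonicity of the continuation ratios requires P(X \ge x \mid X \ge x - 1, \theta) to be nondecreasing in θ; and monotonicity of the adjacent-category ratios requires P(X = x \mid X \in \{x - 1, x\}, \theta) to be nondecreasing in θ.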


Subjects
Bayes Theorem, Reaction Time/physiology, Computer Simulation, Dimensional Measurement Accuracy, Female, Feminism/classification, Gender Identity, Humans, Theoretical Models, Nonparametric Statistics
9.
Front Genet ; 10: 837, 2019.
Article in English | MEDLINE | ID: mdl-31681400

ABSTRACT

The often-used A(C)E model that decomposes phenotypic variance into parts due to additive genetic and environmental influences can be extended to a longitudinal model when the trait has been assessed on multiple occasions. This enables inference about the nature (e.g., genetic or environmental) of the covariance among the different measurement points. In the case that the measurement of the phenotype relies on self-report data (e.g., questionnaire data), aggregated scores (e.g., sum-scores) are often used as a proxy for the phenotype. However, earlier research based on the univariate ACE model for a single measurement occasion has shown that this can lead to an underestimation of heritability and that one should instead model the raw item data by integrating an explicit measurement model into the analysis. This has, however, not been translated to the more complex longitudinal case. In this paper, we first present a latent state twin A(C)E model that combines the genetic twin model with an item response theory (IRT) model, as well as its specification in a Bayesian framework. Two simulation studies were conducted to investigate (1) how large the bias is when sum-scores are used in the longitudinal A(C)E model and (2) whether using the latent twin model can overcome the potential bias. Results of the first simulation study (i.e., the AE model) demonstrated that using a sum-score approach leads to underestimated heritability estimates and biased covariance estimates. Surprisingly, the IRT approach also led to bias, but to a much lesser degree. The amount of bias increased in the second simulation study (i.e., the ACE model) under both frameworks, with the IRT approach still being the less biased approach. Since the bias was less severe under the IRT approach than under the sum-score approach, and due to other advantages of latent variable modeling, we still advise researchers to adopt the IRT approach. We further illustrate differences between the traditional sum-score approach and the latent state twin A(C)E model by analyzing data from a two-wave twin study, consisting of the answers of 8,016 twins on a scale developed to measure social attitudes related to conservatism.
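For reference, the variance decomposition behind the A(C)E model can be summarized as (standard behavior-genetic notation, not taken from the article):

\operatorname{Var}(P) = a^2 + c^2 + e^2, \qquad \operatorname{Cov}(P_1, P_2 \mid \mathrm{MZ}) = a^2 + c^2, \qquad \operatorname{Cov}(P_1, P_2 \mid \mathrm{DZ}) = \tfrac{1}{2}a^2 + c^2,

with heritability estimated as h^2 = a^2 / (a^2 + c^2 + e^2); in the latent state twin model, P is the latent trait from the IRT measurement model rather than a sum-score.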

10.
Psychon Bull Rev ; 25(2): 548-559, 2018 04.
Article in English | MEDLINE | ID: mdl-29476482

ABSTRACT

This article explores whether the null hypothesis significance testing (NHST) framework provides a sufficient basis for the evaluation of statistical model assumptions. It is argued that while NHST-based tests can provide some degree of confirmation for the model assumption that is evaluated (formulated as the null hypothesis), these tests do not inform us of the degree of support that the data provide for the null hypothesis, nor of the extent to which the null hypothesis should be considered plausible after having taken the data into account. Addressing the prior plausibility of the model assumption is unavoidable if the goal is to determine how plausible it is that the model assumption holds. Without assessing the prior plausibility of the model assumptions, it remains fully uncertain whether the model of interest gives an adequate description of the data and thus whether it can be considered valid for the application at hand. Although addressing the prior plausibility is difficult, ignoring it is not an option if we want to claim that the inferences of our statistical model can be relied upon.
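The argument can be made concrete with the standard identity relating prior and posterior plausibility (generic Bayesian notation):

\frac{P(H_0 \mid \text{data})}{P(H_1 \mid \text{data})} = BF_{01} \times \frac{P(H_0)}{P(H_1)},

so even a large Bayes factor BF_{01} in favor of the assumption yields high posterior plausibility only when the prior plausibility P(H_0) is not too low.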


Subjects
Statistical Data Interpretation, Statistical Models, Research Design, Bayes Theorem, Humans
11.
Front Psychol ; 9: 964, 2018.
Article in English | MEDLINE | ID: mdl-29951023

ABSTRACT

In many applications of high- and low-stakes ability tests, a non-negligible proportion of respondents may fail to reach the end of the test within the specified time limit. Since some item responses will be missing for respondents who ran out of time, this raises the question of how best to deal with these missing responses for the purpose of obtaining an optimal assessment of ability. Commonly, researchers consider three general solutions: ignore the missing responses, treat them as incorrect, or treat the responses as missing but model the missingness mechanism. This paper approaches the issue of dealing with not-reached items from a measurement perspective and considers the question of what the operationalization of ability should be in maximum performance tests that work with effective time limits. We argue that the target ability that the test attempts to measure is maximum performance when operating at the test-indicated speed, and that the test instructions should be taken to imply that respondents should operate at this target speed. The phenomenon of the speed-ability trade-off informs us that the ability measured by the test will depend on this target speed, as different speed levels will result in different levels of performance on the same set of items. Crucially, since respondents with not-reached items worked at a speed below this target speed, the level of ability that they were able to display on the items that they did reach is higher than the level of ability that they would have displayed had they worked at the target speed (i.e., higher than their level on the target ability). Thus, statistical methods that attempt to obtain unbiased estimates of the ability displayed on the reached items will result in biased estimates of the target ability. The practical implications are studied in a simulation study in which different methods of dealing with not-reached items are contrasted, which shows that current methods result in biased estimates of the target ability when a speed-ability trade-off is present. The paper concludes with a discussion of ways in which the issue can be resolved.

12.
Br J Math Stat Psychol ; 71(1): 13-38, 2018 02.
Article in English | MEDLINE | ID: mdl-28635139

ABSTRACT

By considering information about response time (RT) in addition to response accuracy (RA), joint models for RA and RT such as the hierarchical model (van der Linden, 2007) can improve the precision with which ability is estimated over models that only consider RA. The hierarchical model, however, assumes that only the person's speed is informative of ability. This assumption of conditional independence between RT and ability given speed may be violated in practice, and ignores collateral information about ability that may be present in the residual RTs. We propose a posterior predictive check for evaluating the assumption of conditional independence between RT and ability given speed. Furthermore, we propose an extension of the hierarchical model that contains cross-loadings between ability and RT, which enables one to take additional collateral information about ability into account beyond what is possible in the standard hierarchical model. A Bayesian estimation procedure is proposed for the model. Using simulation studies, the performance of the model is evaluated in terms of parameter recovery, and the possible gain in precision over the standard hierarchical model and an RA-only model is considered. The model is applied to data from a high-stakes educational test.
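One generic way to introduce such a cross-loading, sketched here for illustration (the article's exact parameterization may differ), is to let ability enter the response time part of the hierarchical model:

\ln T_{pi} = \beta_i - \tau_p - \lambda_i \theta_p + \varepsilon_{pi},

so that a nonzero \lambda_i implies that residual response times carry collateral information about \theta_p beyond what is captured by the speed parameter \tau_p.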


Subjects
Computer Simulation, Statistical Models, Psychometrics/methods, Reaction Time/physiology, Algorithms, Bayes Theorem, Statistical Data Interpretation, Educational Measurement, Humans, Language, Netherlands, Reproducibility of Results
13.
Appl Psychol Meas ; 42(7): 553-570, 2018 Oct.
Article in English | MEDLINE | ID: mdl-30237646

ABSTRACT

Reliability is usually estimated for a test score, but it can also be estimated for item scores. Item-score reliability can be useful to assess an item's contribution to the test score's reliability, to identify unreliable scores in aberrant item-score patterns in person-fit analysis, and to select the most reliable item from a test to use as a single-item measure. Four methods for estimating item-score reliability were discussed: the Molenaar-Sijtsma method (method MS), Guttman's method λ6, the latent class reliability coefficient (method LCRC), and the correction for attenuation (method CA). A simulation study was used to compare the methods with respect to median bias, variability (interquartile range [IQR]), and percentage of outliers. The simulation study consisted of six conditions: standard, polytomous items, unequal α parameters, two-dimensional data, long test, and small sample size. Methods MS and CA were the most accurate. Method LCRC showed almost unbiased results, but large variability. Method λ6 consistently underestimated item-score reliability, but showed a smaller IQR than the other methods.
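A minimal sketch of Guttman's λ6 in its familiar test-score form (the article adapts this logic to single item scores, which is not reproduced here); `scores` is a hypothetical persons-by-items matrix and the function name is illustrative.

import numpy as np

def guttman_lambda6(scores):
    """Test-score lambda-6: 1 minus summed item error variance over total-score variance."""
    scores = np.asarray(scores, dtype=float)
    n_items = scores.shape[1]
    total_var = scores.sum(axis=1).var(ddof=1)
    error_var = 0.0
    for j in range(n_items):
        y = scores[:, j]
        X = np.column_stack([np.ones(len(y)), np.delete(scores, j, axis=1)])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)     # regress item j on the remaining items
        r2 = 1 - ((y - X @ beta) ** 2).sum() / ((y - y.mean()) ** 2).sum()
        error_var += y.var(ddof=1) * (1 - r2)            # item error variance
    return 1 - error_var / total_var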

14.
Educ Psychol Meas ; 78(6): 998-1020, 2018 Dec.
Article in English | MEDLINE | ID: mdl-30542214

ABSTRACT

Reliability is usually estimated for a total score, but it can also be estimated for item scores. Item-score reliability can be useful to assess the repeatability of an individual item score in a group. Three methods to estimate item-score reliability are discussed, known as method MS, method λ6, and method CA. The item-score reliability methods are compared with four well-known and widely accepted item indices: the item-rest correlation, the item-factor loading, the item scalability, and the item discrimination. Realistic values for item-score reliability in empirical data sets are monitored to obtain an impression of the values to be expected in other empirical data sets. The relations between the three item-score reliability methods and the four well-known item indices are investigated. Tentatively, a minimum value for the item-score reliability methods to be used in item analysis is recommended.
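For comparison, the corrected item-total (item-rest) correlation used as one of the benchmark indices can be computed as follows; `scores` is a hypothetical persons-by-items matrix and the function name is illustrative.

import numpy as np

def item_rest_correlations(scores):
    """Correlation of each item with the sum of the remaining items."""
    scores = np.asarray(scores, dtype=float)
    total = scores.sum(axis=1)
    return np.array([np.corrcoef(scores[:, j], total - scores[:, j])[0, 1]
                     for j in range(scores.shape[1])])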

15.
Front Psychol ; 9: 2298, 2018.
Article in English | MEDLINE | ID: mdl-30687144

ABSTRACT

This study investigates the usefulness of item-score reliability as a criterion for item selection in test construction. Methods MS, λ6, and CA were investigated as item-assessment methods in item selection and compared to the corrected item-total correlation, which was used as a benchmark. An ideal ordering for adding items to the test (bottom-up procedure) or omitting items from the test (top-down procedure) was defined based on the population test-score reliability. The orderings the four item-assessment methods produced in samples were compared to the ideal ordering, and the degree of resemblance was expressed by means of Kendall's τ. To investigate the concordance of the orderings across 1,000 replicated samples, Kendall's W was computed for each item-assessment method. The results showed that for both the bottom-up and the top-down procedures, item-assessment method CA and the corrected item-total correlation most closely resembled the ideal ordering. Generally, all item-assessment methods resembled the ideal ordering better, and the concordance of the orderings was greater, for larger sample sizes and greater variance of the item discrimination parameters.
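A small sketch of the comparison criterion described above: the resemblance between a sample-based item ordering and the ideal ordering expressed through Kendall's τ. The orderings below are hypothetical illustrations.

from scipy.stats import kendalltau

ideal_order  = [0, 3, 1, 4, 2, 5]   # hypothetical ideal ranks of six items
method_order = [0, 3, 2, 4, 1, 5]   # hypothetical ranks produced by one assessment method
tau, p_value = kendalltau(ideal_order, method_order)
print(f"Kendall's tau = {tau:.2f}")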

16.
Br J Math Stat Psychol ; 70(2): 257-279, 2017 May.
Article in English | MEDLINE | ID: mdl-27618470

ABSTRACT

It is becoming more feasible and common to register response times in the application of psychometric tests. Researchers thus have the opportunity to jointly model response accuracy and response time, which provides users with more relevant information. The most common choice is to use the hierarchical model (van der Linden, 2007, Psychometrika, 72, 287), which assumes conditional independence between response time and accuracy, given a person's speed and ability. However, this assumption may be violated in practice if, for example, persons vary their speed or differ in their response strategies, leading to conditional dependence between response time and accuracy and confounding measurement. We propose six nested hierarchical models for response time and accuracy that allow for conditional dependence, and discuss their relationship to existing models. Unlike existing approaches, the proposed hierarchical models allow for various forms of conditional dependence in the model and allow the effect of continuous residual response time on response accuracy to be item-specific, person-specific, or both. Estimation procedures for the models are proposed, as well as two information criteria that can be used for model selection. Parameter recovery and usefulness of the information criteria are investigated using simulation, indicating that the procedure works well and is likely to select the appropriate model. Two empirical applications are discussed to illustrate the different types of conditional dependence that may occur in practice and how these can be captured using the proposed hierarchical models.
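The general form of the conditional dependence allowed by these models can be sketched as follows (generic notation; the six nested models differ in which effects are included and whether they carry an item index, a person index, or both):

\operatorname{logit} P(X_{pi} = 1 \mid \theta_p, \tilde{t}_{pi}) = \alpha_i(\theta_p - b_i) + \delta_{pi}\,\tilde{t}_{pi},

where \tilde{t}_{pi} is the residual (log) response time after accounting for speed and time intensity, and \delta_{pi} may reduce to \delta_i, \delta_p, or a single constant \delta.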


Subjects
Statistical Models, Psychometrics/statistics & numerical data, Reaction Time, Choice Behavior, Humans
17.
Psychometrika ; 82(4): 1126-1148, 2017 12.
Article in English | MEDLINE | ID: mdl-27738955

ABSTRACT

The assumption of conditional independence between response time and accuracy given speed and ability is commonly made in response time modelling. However, this assumption may be violated in some cases, meaning that the relationship between the response time and the response accuracy of the same item cannot be fully explained by the correlation between overall speed and ability. We propose to explicitly model the residual dependence between time and accuracy by incorporating the effects of the residual response time on the intercept and the slope parameter of the IRT model for response accuracy. We present an empirical example of a violation of conditional independence from a low-stakes educational test and show that our new model reveals interesting phenomena about the dependence of the item properties on whether the response is relatively fast or slow. For more difficult items, responding slowly is associated with a higher probability of a correct response, whereas for easier items, responding more slowly is associated with a lower probability of a correct response. Moreover, for many of the items, slower responses were less informative about ability because their discrimination parameters decrease with residual response time.


Subjects
Psychological Models, Statistical Models, Reaction Time, Computer Simulation, Humans, Sample Size
18.
Front Psychol ; 8: 202, 2017.
Article in English | MEDLINE | ID: mdl-28261136

ABSTRACT

With the widespread use of computerized tests in educational measurement and cognitive psychology, registration of response times has become feasible in many applications. Considering these response times helps provide a more complete picture of the performance and characteristics of persons beyond what is available based on response accuracy alone. Statistical models such as the hierarchical model (van der Linden, 2007) have been proposed that jointly model response time and accuracy. However, these models make restrictive assumptions about the response processes (RPs) that may not be realistic in practice, such as the assumption that the association between response time and accuracy is fully explained by taking speed and ability into account (conditional independence). Assuming conditional independence forces one to ignore that many relevant individual differences may play a role in the RPs beyond overall speed and ability. In this paper, we critically consider the assumption of conditional independence and the important ways in which it may be violated in practice from a substantive perspective. We consider both conditional dependences that may arise when all persons attempt to solve the items in similar ways (homogeneous RPs) and those that may be due to persons differing in fundamental ways in how they deal with the items (heterogeneous processes). The paper provides an overview of what we can learn from observed conditional dependences. We argue that explaining and modeling these differences in the RPs is crucial to increase both the validity of measurement and our understanding of the relevant RPs.

19.
Psychometrika ; 80(4): 880-96, 2015 Dec.
Article in English | MEDLINE | ID: mdl-26377889

ABSTRACT

The assumption of latent monotonicity in item response theory models for dichotomous data cannot be evaluated directly, but observable consequences such as manifest monotonicity facilitate the assessment of latent monotonicity in real data. Standard methods for evaluating manifest monotonicity typically produce a test statistic that is geared toward falsification, which can only provide indirect support in favor of manifest monotonicity. We propose the use of Bayes factors to quantify the degree of support available in the data in favor of manifest monotonicity or against manifest monotonicity. Through the use of informative hypotheses, this procedure can also be used to determine the support for manifest monotonicity over substantively or statistically relevant alternatives to manifest monotonicity, rendering the procedure highly flexible. The performance of the procedure is evaluated using a simulation study, and the application of the procedure is illustrated using empirical data.
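For the dichotomous case, manifest monotonicity with respect to the restscore can be stated as (generic notation): with restscore R_{pi} = \sum_{j \ne i} X_{pj}, the proportion P(X_i = 1 \mid R_i = r) should be nondecreasing in r; the Bayes factor approach quantifies the evidence for this order restriction relative to its violations.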


Subjects
Bayes Theorem, Algorithms, Statistical Models, Psychometrics/statistics & numerical data
20.
Psychometrika ; 78(1): 83-97, 2013 Jan.
Article in English | MEDLINE | ID: mdl-25107519

ABSTRACT

Most dichotomous item response models share the assumption of latent monotonicity, which states that the probability of a positive response to an item is a nondecreasing function of a latent variable intended to be measured. Latent monotonicity cannot be evaluated directly, but it implies manifest monotonicity across a variety of observed scores, such as the restscore, a single item score, and in some cases the total score. In this study, we show that manifest monotonicity can be tested by means of the order-constrained statistical inference framework. We propose a procedure that uses this framework to determine whether manifest monotonicity should be rejected for specific items. This approach provides a likelihood ratio test for which the p-value can be approximated through simulation. A simulation study is presented that evaluates the Type I error rate and power of the test, and the procedure is applied to empirical data.


Subjects
Psychometrics/methods, Statistics as Topic/methods, Humans