ABSTRACT
We introduce the special section on nonparametric item response theory (IRT) in Quality of Life Research. Starting from the well-known Rasch model, we provide a brief overview of nonparametric IRT models and discuss the assumptions, the properties, and the investigation of goodness of fit. We provide references to more detailed texts to help readers become acquainted with nonparametric IRT models. In addition, we show how the rather diverse papers in the special section fit into the nonparametric IRT framework. Finally, we illustrate the application of nonparametric IRT models using data from a questionnaire measuring activity limitations in walking. The real-data example shows the quality of the scale and its constituent items with respect to dimensionality, local independence, monotonicity, and invariant item ordering.
Subjects
Quality of Life, Humans, Psychometrics, Quality of Life/psychology, Surveys and Questionnaires
ABSTRACT
PURPOSE: Mokken scale analysis (MSA) is an attractive scaling procedure for ordinal data. MSA is frequently used in health-related quality of life research. Two of MSA's prime features are the scalability coefficients and the automated item selection procedure (AISP). The AISP partitions a (large) set of items into scales based on the observed item scores; the resulting scales can be used as measurement instruments. Two issues exist in MSA: First, point estimates, standard errors, and test statistics for scalability coefficients are inappropriate for clustered item scores, which are omnipresent in quality of life research data. Second, the AISP insufficiently takes sampling fluctuation of Mokken's scalability coefficients into account. METHODS: We solved both issues by providing point estimates and standard errors for the scalability coefficients for clustered data and by implementing a Wald-based significance test in the AISP algorithm, resulting in a test-guided AISP (T-AISP) that is available for both nonclustered and clustered test scores. RESULTS: We integrated the T-AISP into a two-step, test-guided MSA for scale construction, to guide the analysis for nonclustered and clustered data. The first step is to perform a T-AISP and select the final scale(s). For clustered data, within-group dependency is investigated on the final scale(s). In the second step, the strength of the scale(s) is determined and further analyses are performed. The procedure was demonstrated on clustered item scores obtained from administering a questionnaire on quality of life in schools to 639 students nested in 30 classrooms. CONCLUSIONS: We developed a two-step, test-guided MSA for scale construction that takes into account sampling fluctuation of all scalability coefficients and that can be applied to item scores obtained by a nonclustered or clustered sampling design.
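To make the scalability coefficient concrete, the following is a minimal sketch of the scale coefficient H for dichotomous items only, computed as the ratio of summed observed inter-item covariances to the summed maximum covariances attainable given the item means. The function name `scalability_h` and both data sets are hypothetical illustrations. The sketch covers neither the clustered-data estimates nor the standard errors and Wald test described in the abstract.

```python
from itertools import combinations


def scalability_h(scores):
    """Scale coefficient H for dichotomous (0/1) item scores.

    scores: list of respondents, each a list of 0/1 item scores.
    Raises ZeroDivisionError if no item pair has room to covary
    (for example, when every item is constant).
    """
    n = len(scores)
    k = len(scores[0])
    # Proportion of 1-scores per item (item popularity).
    p = [sum(row[i] for row in scores) / n for i in range(k)]
    cov_sum = 0.0
    cov_max_sum = 0.0
    for i, j in combinations(range(k), 2):
        # Observed proportion of respondents scoring 1 on both items.
        p_ij = sum(row[i] * row[j] for row in scores) / n
        cov_sum += p_ij - p[i] * p[j]
        # Given the marginals, the joint proportion is at most min(p_i, p_j).
        cov_max_sum += min(p[i], p[j]) - p[i] * p[j]
    return cov_sum / cov_max_sum


# Hypothetical data: a perfect Guttman pattern (no errors).
guttman = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [1, 1, 1]]
h_guttman = scalability_h(guttman)

# The same data plus one respondent who violates the item ordering.
with_error = guttman + [[0, 1, 0]]
h_with_error = scalability_h(with_error)
```

In this sketch the error-free pattern gives H = 1, and adding a single respondent who breaks the ordering lowers H, which is the behavior a scalability coefficient is meant to capture.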
Subjects
Quality of Life, Research Design, Algorithms, Humans, Psychometrics, Quality of Life/psychology, Reproducibility of Results, Surveys and Questionnaires
ABSTRACT
INTRODUCTION: Loss of sensation due to diabetes-related neuropathy often leads to diabetic foot ulceration. Several test instruments are used to assess sensation, such as static and moving 2-point discrimination (S2PD, M2PD), monofilaments, and tuning forks. METHODS: Mokken scale analysis was applied to the Rotterdam Diabetic Foot Study data to select hierarchies of tests to construct measurement scales. RESULTS: We developed 39-item and 31-item scales to measure loss of sensation for research purposes and a 13-item scale for clinical practice. All instruments were strongly scalable and reliable. The 39 items can be classified into 5 hierarchically ordered core clusters: S2PD, M2PD, vibration sense, monofilaments, and prior ulcer or amputation. DISCUSSION: Guided by the presented scales, clinicians may better classify the grade of sensory loss in diabetic patients' feet. Thus, a more personalized approach concerning individual recommendations, intervention strategies, and patient information may be applied.
Subjects
Diabetic Foot/diagnosis, Sensory Thresholds, Adult, Aged, Case-Control Studies, Cohort Studies, Diabetic Foot/physiopathology, Diabetic Neuropathies/diagnosis, Diabetic Neuropathies/physiopathology, Female, Humans, Male, Middle Aged, Netherlands, Prospective Studies, Severity of Illness Index, Vibration
ABSTRACT
BACKGROUND: Two important goals when using questionnaires are (a) measurement: the questionnaire is constructed to assign numerical values that accurately represent the test taker's attribute, and (b) prediction: the questionnaire is constructed to give an accurate forecast of an external criterion. Construction methods aimed at measurement prescribe that items should be reliable. In practice, this leads to questionnaires with high inter-item correlations. By contrast, construction methods aimed at prediction typically prescribe that items have a high correlation with the criterion and low inter-item correlations. The latter approach has often been said to produce a paradox concerning the relation between reliability and validity [1-3], because it is often assumed that good measurement is a prerequisite of good prediction. OBJECTIVE: To answer four questions: (1) Why are measurement-based methods suboptimal for questionnaires that are used for prediction? (2) How should one construct a questionnaire that is used for prediction? (3) Do questionnaire-construction methods that optimize measurement and prediction lead to the selection of different items in the questionnaire? (4) Is it possible to construct a questionnaire that can be used for both measurement and prediction? ILLUSTRATIVE EXAMPLE: An empirical data set consisting of scores of 242 respondents on questionnaire items measuring mental health is used to select items by means of two methods: a method that optimizes the predictive value of the scale (i.e., forecasting a clinical diagnosis), and a method that optimizes the reliability of the scale. We show that for the two scales different sets of items are selected and that a scale constructed to meet the one goal does not show optimal performance with reference to the other goal. DISCUSSION: The answers are as follows: (1) Because measurement-based methods tend to maximize inter-item correlations, which reduces predictive validity. (2) By selecting items that correlate highly with the criterion and weakly with the remaining items. (3) Yes, these methods may lead to different item selections. (4) For a single questionnaire: Yes, but it is problematic because reliability cannot be estimated accurately. For a test battery: Yes, but it is very costly. Implications for the construction of patient-reported outcome questionnaires are discussed.
Subjects
Patient Reported Outcome Measures, Surveys and Questionnaires, Female, Humans, Male, Psychometrics, Quality of Life, Reproducibility of Results
ABSTRACT
In a sample of 38 eating disorder (ED) patients who received psychotherapeutic treatment, changes in attachment security and mentalization in relation to symptom reduction were investigated. Attachment security improved in 1 year but was unrelated to improvement of ED or comorbid symptoms. Mentalization did not change significantly in 1 year. Pretreatment mentalization was negatively related to the severity of ED symptoms, trait anxiety, psycho-neuroticism, and self-injurious behavior after 1 year of treatment. We conclude that for ED patients, improving mentalization might increase the effect of treatment on core and comorbid symptoms.
Subjects
Feeding and Eating Disorders/therapy, Object Attachment, Theory of Mind/physiology, Anxiety/psychology, Comorbidity, Feeding and Eating Disorders/psychology, Female, Humans, Self-Injurious Behavior, Time Factors, Treatment Outcome, Young Adult
ABSTRACT
PURPOSE: To investigate whether recovery from an eating disorder is related to pre-treatment attachment and mentalization and/or to improvement of attachment and mentalization during treatment. METHOD: For a sample of 38 anorexia nervosa (AN) and bulimia nervosa (BN) patients receiving treatment, the relations between attachment security, mentalization, co-morbidity and recovery status after 12 months (not recovered or recovered) and after 18 months (persistently ill, relapsed, newly recovered, or persistently recovered) were investigated. Attachment security and mentalization were assessed by the Adult Attachment Interview at the start of the treatment and after 12 months. Besides assessing co-morbidity (for its effect on treatment outcome), we measured psycho-neuroticism and autonomy because of their established relations to both eating disorder symptoms and attachment security. RESULTS: Recovery both at 12 months and at 18 months was related to higher levels of mentalization; for attachment, no significant differences were found between recovered and unrecovered patients. Patients who recovered from AN or BN also improved on co-morbid symptoms: whereas pre-treatment symptom severity was similar, at 12 months recovered patients scored lower on co-morbid personality disorders, anxiety, depression, self-injurious behaviour and psycho-neuroticism than unrecovered patients. Improvement on autonomy (reduced sensitivity to others; greater capacity to manage new situations) in 1 year of treatment was significantly higher in recovered than in unrecovered patients. CONCLUSION: A focus on enhancing mentalization in eating disorder treatment might be useful to increase the chances of successful treatment. Improvement of autonomy might be the mechanism of change in recovering from AN or BN. LEVEL OF EVIDENCE: Level III cohort study.
Subjects
Feeding and Eating Disorders/therapy, Object Attachment, Theory of Mind/physiology, Adult, Anxiety/psychology, Depression/psychology, Feeding and Eating Disorders/psychology, Female, Humans, Neuropsychological Tests, Treatment Outcome, Young Adult
ABSTRACT
OBJECTIVE: To investigate the relationships of attachment security and mentalization with core and co-morbid symptoms in eating disorder patients. METHOD: We compared 51 eating disorder patients at the start of intensive treatment and 20 healthy controls on attachment, mentalization, eating disorder symptoms, depression, anxiety, personality disorders, psycho-neuroticism, autonomy problems and self-injurious behavior, using the Adult Attachment Interview, the SCID-I and II and several questionnaires. RESULTS: Compared with the controls, the eating disorder patients showed a higher prevalence of insecure attachment; eating disorder patients more often than controls received the AAI classification Unresolved for loss or abuse. They also had a lower level of mentalization and more autonomy problems. In the patient group, eating disorder symptoms, depression, anxiety, psycho-neuroticism and autonomy problems were related neither to attachment security nor to mentalization; self-injurious behavior was associated with lower attachment security and lower mentalization; borderline personality disorder was related to lower mentalization. In the control group, no relations were found between attachment, mentalization and psychopathologic variables. DISCUSSION: Eating disorder patients' low level of mentalization suggests the usefulness of Mentalization Based Treatment techniques for eating disorder treatment, especially in cases of self-injurious behavior and/or co-morbid borderline personality disorder.
Subjects
Feeding and Eating Disorders, Object Attachment, Psychological Stress, Theory of Mind, Adolescent, Adult, Comorbidity, Female, Humans, Interviews as Topic, Netherlands, Personality Disorders, Self Report, Young Adult
ABSTRACT
We discuss reliability definitions from the perspectives of classical test theory, factor analysis, and generalizability theory. For each method, we discuss the rationale, the estimation of reliability, and the goodness of fit of the model that defines the reliability coefficient to the data. Similarities and differences in the three approaches are highlighted. Finally, we provide a computational example using generated data to illustrate the differences among the different reliability methods.
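As a small illustration of estimating reliability from the classical test theory perspective, the sketch below computes coefficient alpha from item scores. The abstract does not name a specific coefficient, so the choice of alpha is an assumption made here for illustration, and the function name `coefficient_alpha` and the scores are hypothetical.

```python
from statistics import pvariance


def coefficient_alpha(scores):
    """Coefficient alpha from a respondents-by-items list of item scores.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of total score)
    """
    k = len(scores[0])
    # Variance of each item across respondents.
    item_vars = [pvariance([row[i] for row in scores]) for i in range(k)]
    # Variance of the total (sum) score across respondents.
    total_var = pvariance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical scores of five respondents on three items.
example = [[1, 2, 2], [2, 2, 3], [3, 4, 3], [4, 4, 5], [5, 5, 5]]
alpha = coefficient_alpha(example)
```

Because alpha depends only on the ratio of the summed item variances to the total-score variance, using population or sample variances throughout gives the same value.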
Subjects
Statistical Models, Nursing Research, Reproducibility of Results, Factor Analysis, Humans, Psychometrics
ABSTRACT
We respond to three commentaries on our discussion article on different conceptions of test score reliability. First, we discuss the use of standard errors for reliability estimates. Second, we discuss the importance of not confusing issues pertaining to the dimensionality of the test data (closely related to construct validity) with the degree to which measurement values are repeatable under the same circumstances (i.e., the reliability issue). Third, we discuss a new reliability estimation method that is almost unbiased irrespective of the dimensionality of the test data.
Subjects
Statistical Models, Nursing Research, Reproducibility of Results, Humans
ABSTRACT
Categorical marginal models (CMMs) are flexible tools for modelling dependent or clustered categorical data, when the dependencies themselves are not of interest. A major limitation of maximum likelihood (ML) estimation of CMMs is that the size of the contingency table increases exponentially with the number of variables, so even for a moderate number of variables, say between 10 and 20, ML estimation can become computationally infeasible. An alternative method, which retains the optimal asymptotic efficiency of ML, is maximum empirical likelihood (MEL) estimation. However, we show that MEL tends to break down for large, sparse contingency tables. As a solution, we propose a new method, which we call maximum augmented empirical likelihood (MAEL) estimation and which involves augmentation of the empirical likelihood support with a number of well-chosen cells. Simulation results show good finite sample performance for very large contingency tables.
Subjects
Likelihood Functions, Psychometrics, Computer Simulation
ABSTRACT
BACKGROUND: The health action process approach (HAPA) model is a promising basis for increasing how often parents brush their children's teeth, and thereby for improving the children's oral health. A validated HAPA questionnaire is needed as one of the measures of the effects of such an intervention. OBJECTIVES: The aim of this study was to evaluate whether our data, based on a translated and adapted version of the HAPA-based questionnaire on dental flossing, supported the constructs of the HAPA model. If so, a further aim was to assess whether these constructs could be measured reliably. METHODS: In this cross-sectional study, 269 questionnaires filled out in dental offices by parents of children 1-10 years old were analysed. Scale validation was performed according to the 6-step protocol of Dima, including Mokken scale analyses (MSA), graded response model (GRM), factor analyses and reliability measures. Pearson correlation coefficients were calculated to identify divergent validity and test-retest reliability. RESULTS: MSA showed a unidimensional, medium total scale. Three items were removed based on this analysis. The total scale with the remaining 26 items did not fit the GRM. Factor analysis extracted five factors and two components for the total scale. The separate subscales, except the 'intention' construct, fitted the MSA and did not fit the GRM. The data fitted a seven-factor model better than a one-factor model. Reliability measures varied from acceptable to excellent, but were poor for 'action control'. Test-retest reliability (r's 0.60-0.83) was questionable to good. CONCLUSION: Our results did not fully support the constructs of the HAPA model. To support the HAPA constructs, modifications to the subscales risk perceptions, intention, action planning, action control and self-reported behaviour are suggested. With these adjustments, the reliability and validity of the questionnaire could be significantly improved.
Subjects
Cognition, Parents, Humans, Child, Infant, Preschool Child, Netherlands, Reproducibility of Results, Cross-Sectional Studies, Parents/psychology, Surveys and Questionnaires
ABSTRACT
AIMS: To demonstrate the principles and application of Mokken scaling. BACKGROUND: The history and development of Mokken scaling are described, some examples of applications are given, and some recent developments of the method are summarised. DESIGN: Secondary analysis of data obtained by cross-sectional survey methods, including self-report and observation. METHODS: Data from the Edinburgh Feeding Evaluation in Dementia scale and the Townsend Functional Ability Scale were analysed using the Mokken scaling procedure within the 'R' statistical package. Specifically, invariant item ordering (the extent to which the order of the items in terms of difficulty was the same for all respondents whatever their total scale score) was studied. RESULTS: The Edinburgh Feeding Evaluation in Dementia scale and the Townsend Functional Ability Scale showed no violations of invariant item ordering, although only the Townsend Functional Ability Scale showed a medium accuracy. CONCLUSION: Mokken scaling is an established method for item response theory analysis with wide application in the social sciences. It provides psychometricians with an additional tool in the development of questionnaires and in the study of individuals and their responses to latent traits. Specifically, with regard to the analyses conducted in this study, the Edinburgh Feeding Evaluation in Dementia scale requires further development and study across different levels of severity of dementia and feeding difficulty. RELEVANCE TO CLINICAL PRACTICE: Good scales are required for assessment in clinical practice and the present paper shows how a relatively recently developed method for analysing Mokken scales can contribute to this. The two scales used as examples for analysis are highly clinically relevant.
Subjects
Activities of Daily Living, Dementia/physiopathology, Research Design, Cross-Sectional Studies, Humans, Reproducibility of Results, Software
ABSTRACT
Several intraclass correlation coefficients (ICCs) are available to assess the interrater reliability (IRR) of observational measurements. Selecting an ICC is complicated, and existing guidelines have three major limitations. First, they do not discuss incomplete designs, in which raters partially vary across subjects. Second, they provide no coherent perspective on the error variance in an ICC, clouding the choice between the available coefficients. Third, the distinction between fixed and random raters is often misunderstood. Based on generalizability theory (GT), we provide updated guidelines on selecting an ICC for IRR, which are applicable to both complete and incomplete observational designs. We challenge conventional wisdom about ICCs for IRR by claiming that raters should seldom (if ever) be considered fixed. Also, we clarify how to interpret ICCs in the case of unbalanced and incomplete designs. We explain four choices a researcher needs to make when selecting an ICC for IRR, and guide researchers through these choices by means of a flowchart, which we apply to three empirical examples from clinical and developmental domains. In the Discussion, we provide guidance in reporting, interpreting, and estimating ICCs, and propose future directions for research into the ICCs for IRR. (PsycInfo Database Record (c) 2023 APA, all rights reserved).
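The role of the error variance in an ICC can be illustrated with a minimal sketch for the simplest case only: a complete, fully crossed design with random raters, where variance components are estimated from two-way ANOVA mean squares. The function name `icc_single_rater` and the ratings are hypothetical, and the sketch does not cover the incomplete or unbalanced designs that the abstract addresses. The two coefficients differ only in whether rater variance counts as error.

```python
def icc_single_rater(ratings):
    """Return (agreement, consistency) ICCs for a single rater.

    ratings: list of subjects, each a list with one rating per rater
    (complete design: every rater rates every subject).
    """
    n = len(ratings)      # number of subjects
    k = len(ratings[0])   # number of raters
    grand = sum(map(sum, ratings)) / (n * k)
    subj_means = [sum(row) / k for row in ratings]
    rater_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares for the two-way layout without replication.
    ss_subj = k * sum((m - grand) ** 2 for m in subj_means)
    ss_rater = n * sum((m - grand) ** 2 for m in rater_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_resid = ss_total - ss_subj - ss_rater

    ms_subj = ss_subj / (n - 1)
    ms_rater = ss_rater / (k - 1)
    ms_resid = ss_resid / ((n - 1) * (k - 1))

    # Variance components from expected mean squares; negative estimates become 0.
    var_subj = max((ms_subj - ms_resid) / k, 0.0)
    var_rater = max((ms_rater - ms_resid) / n, 0.0)
    var_resid = ms_resid

    # Agreement: rater variance is part of the error.
    agreement = var_subj / (var_subj + var_rater + var_resid)
    # Consistency: systematic rater differences are ignored.
    consistency = var_subj / (var_subj + var_resid)
    return agreement, consistency


# Illustrative ratings: six subjects (rows) rated by four raters (columns).
ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
icc_agreement, icc_consistency = icc_single_rater(ratings)
```

In these illustrative ratings the raters differ strongly in their mean level, so the agreement coefficient comes out well below the consistency coefficient.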
ABSTRACT
Current interrater reliability (IRR) coefficients ignore the nested structure of multilevel observational data, resulting in biased estimates of both subject- and cluster-level IRR. We used generalizability theory to provide a conceptualization and estimation method for IRR of continuous multilevel observational data. We explain how generalizability theory decomposes the variance of multilevel observational data into subject-, cluster-, and rater-related components, which can be estimated using Markov chain Monte Carlo (MCMC) estimation. We explain how IRR coefficients for each level can be derived from these variance components, and how they can be estimated as intraclass correlation coefficients (ICC). We assessed the quality of MCMC point and interval estimates with a simulation study, and showed that small numbers of raters were the main source of bias and inefficiency of the ICCs. In a follow-up simulation, we showed that a planned missing data design can diminish most estimation difficulties in these conditions, yielding a useful approach to estimating multilevel interrater reliability for most social and behavioral research. We illustrated the method using data on student-teacher relationships. All software code and data used for this article are available on the Open Science Framework: https://osf.io/bwk5t/. (PsycInfo Database Record (c) 2022 APA, all rights reserved).
Subjects
Behavioral Research, Research Design, Bias, Humans, Monte Carlo Method, Reproducibility of Results
ABSTRACT
Respondents may use satisficing (i.e., nonoptimal) strategies when responding to self-report questionnaires. These satisficing strategies become more likely with decreasing motivation and/or cognitive ability (Krosnick, 1991). Considering that cognitive deficits are characteristic of depressive and anxiety disorders, depressed and anxious patients may be prone to satisficing. Using data from the Netherlands Study of Depression and Anxiety (N = 2,945), we studied the relationship between depression and anxiety, cognitive symptoms, and satisficing strategies on the NEO Five-Factor Inventory. Results showed that respondents with either an anxiety disorder or a comorbid anxiety and depression disorder used satisficing strategies substantially more often than healthy respondents. Cognitive symptom severity partly mediated the effect of anxiety disorder and comorbid anxiety disorder on satisficing. The results suggest that depressed and anxious patients produce relatively low-quality self-report data, partly due to cognitive symptoms. Future research should investigate the degree of satisficing across different mental health care assessment contexts.
Subjects
Anxiety/psychology, Cognition Disorders/psychology, Depression/psychology, Psychiatric Status Rating Scales/statistics & numerical data, Self Report/statistics & numerical data, Adolescent, Adult, Aged, Cognition, Data Accuracy, Female, Humans, Male, Mental Health, Middle Aged, Netherlands, Young Adult
ABSTRACT
For the construction of tests and questionnaires that require multiple raters (e.g., a child behaviour checklist completed by both parents), a novel ordinal scaling technique called two-level Mokken scale analysis is currently being developed further. The technique uses within-rater and between-rater coefficients to assess the scalability of the test. These coefficients are generalizations of Mokken's scalability coefficients. In this paper we derived standard errors for the two-level coefficients and for their ratios. The coefficients, the estimates, the estimated standard errors and the software implementation are discussed and illustrated using a real-data example, and a small-scale simulation study demonstrates the accuracy of the estimates.
Subjects
Statistical Models, Psychometrics/methods, Child, Child Behavior, Computer Simulation, Humans, Probability, Software, Nonparametric Statistics, Surveys and Questionnaires/statistics & numerical data
ABSTRACT
Two-level Mokken scale analysis is a generalization of Mokken scale analysis for multi-rater data. The bias of estimated scalability coefficients for two-level Mokken scale analysis, the bias of their estimated standard errors, and the coverage of the confidence intervals have been investigated under various testing conditions. It was found that the estimated scalability coefficients were unbiased in all tested conditions. For estimating standard errors, the delta method and the cluster bootstrap were compared. The cluster bootstrap structurally underestimated the standard errors of the scalability coefficients, with low coverage values. Except for unequal numbers of raters across subjects and small sets of items, the delta method standard error estimates had negligible bias and good coverage. Post hoc simulations showed that the cluster bootstrap does not correctly reproduce the sampling distribution of the scalability coefficients, and an adapted procedure was suggested. In addition, the delta method standard errors can be slightly improved if the harmonic mean is used for unequal numbers of raters per subject rather than the arithmetic mean.
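The difference between the two means mentioned in the last sentence can be seen in a short sketch. The rater counts below are hypothetical, and the abstract does not state how the mean enters the delta method standard errors, so the sketch only shows how the two summaries of unequal numbers of raters diverge.

```python
from statistics import harmonic_mean, mean

# Hypothetical numbers of raters for six subjects (unequal across subjects).
raters_per_subject = [2, 2, 3, 5, 8, 10]

# Arithmetic mean: every count contributes in proportion to its size.
arithmetic = mean(raters_per_subject)

# Harmonic mean: dominated by the subjects with few raters.
harmonic = harmonic_mean(raters_per_subject)
```

The harmonic mean never exceeds the arithmetic mean, and the gap grows as the rater counts become more unequal.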
ABSTRACT
Test authors report sample reliability values but rarely consider the sampling error and related confidence intervals. This study investigated the truth of this conjecture for 116 tests with 1,024 reliability estimates (105 pertaining to test batteries and 919 to tests measuring a single attribute) obtained from an online database. Based on 90% confidence intervals, approximately 20% of the initial quality assessments had to be downgraded. For 95% confidence intervals, the percentage was approximately 23%. The results demonstrated that reported reliability values cannot be trusted without considering their estimation precision.
Subjects
Confidence Intervals, Psychological Tests/standards, Reproducibility of Results, Belgium, Factual Databases, Humans, Netherlands
ABSTRACT
This study investigates the usefulness of item-score reliability as a criterion for item selection in test construction. Methods MS, λ6, and CA were investigated as item-assessment methods in item selection and compared to the corrected item-total correlation, which was used as a benchmark. An ideal ordering to add items to the test (bottom-up procedure) or omit items from the test (top-down procedure) was defined based on the population test-score reliability. The orderings the four item-assessment methods produced in samples were compared to the ideal ordering, and the degree of resemblance was expressed by means of Kendall's τ. To investigate the concordance of the orderings across 1,000 replicated samples, Kendall's W was computed for each item-assessment method. The results showed that for both the bottom-up and the top-down procedures, item-assessment method CA and the corrected item-total correlation most closely resembled the ideal ordering. Generally, all item-assessment methods resembled the ideal ordering better, and concordance of the orderings was greater, for larger sample sizes and greater variance of the item discrimination parameters.
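Two of the quantities named above can be sketched briefly: the corrected item-total correlation (each item correlated with the total score of the remaining items) and Kendall's τ between two item orderings. The function names and data are hypothetical. The sketch uses tau-a, which assumes rankings without ties, and does not implement methods MS, λ6, or CA.

```python
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if x or y is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def corrected_item_total(scores):
    """Correlation of each item with the rest score (total minus that item)."""
    k = len(scores[0])
    result = []
    for i in range(k):
        item = [row[i] for row in scores]
        rest = [sum(row) - row[i] for row in scores]
        result.append(pearson(item, rest))
    return result


def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a between two rankings of the same items (no ties)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(rank_a)), 2):
        sign = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(rank_a) * (len(rank_a) - 1) / 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores of five respondents on three items.
scores = [[1, 2, 2], [2, 2, 3], [3, 4, 3], [4, 4, 5], [5, 5, 5]]
item_rest_correlations = corrected_item_total(scores)

# Hypothetical ideal ordering of four items versus a sample ordering
# in which the second and third items are swapped.
tau = kendall_tau([1, 2, 3, 4], [1, 3, 2, 4])
```

With one swapped pair out of six, five pairs are concordant and one is discordant, so τ = 4/6.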