Results 1 - 20 of 33
1.
Psychol Methods ; 29(3): 603-605, 2024 Jun.
Article in English | MEDLINE | ID: mdl-39311828

ABSTRACT

Linde et al. (2021) compared the "two one-sided tests", the "highest density interval-region of practical equivalence", and the "interval Bayes factor" approaches to establishing equivalence in terms of power and Type I error rate using typical decision thresholds. They found that the interval Bayes factor approach exhibited higher power but also a higher Type I error rate than the other approaches. In response, Campbell and Gustafson (2022) showed that the performances of the three approaches can approximate one another when they are calibrated to have the same Type I error rate. In this article, we argue that these results have little bearing on how these approaches are used in practice; a concrete example is used to highlight this important point. (PsycInfo Database Record (c) 2024 APA, all rights reserved).


Subject(s)
Bayes Theorem , Humans , Psychology/methods , Psychology/standards , Data Interpretation, Statistical
2.
Article in English | MEDLINE | ID: mdl-38379504

ABSTRACT

Several new models based on item response theory have recently been suggested to analyse intensive longitudinal data. One of these new models is the time-varying dynamic partial credit model (TV-DPCM; Castro-Alvarez et al., Multivariate Behavioral Research, 2023, 1), which is a combination of the partial credit model and the time-varying autoregressive model. The model allows the study of the psychometric properties of the items and the modelling of nonlinear trends at the latent state level. However, there is a severe lack of tools to assess the fit of the TV-DPCM. In this paper, we propose and develop several test statistics and discrepancy measures based on the posterior predictive model checking method (PPMC; Rubin, The Annals of Statistics, 1984, 12, 1151) to assess the fit of the TV-DPCM. Simulated and empirical data are used to study the performance of the PPMC method and to illustrate its effectiveness.
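The general PPMC recipe referenced here can be sketched in a few lines: draw parameters from the posterior, simulate replicated data sets, and compare a discrepancy measure computed on the replications with the one computed on the observed data. The sketch below is a generic illustration under assumed inputs (the simulate and discrepancy functions are placeholders), not the specific statistics developed in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def ppmc_pvalue(posterior_draws, observed, simulate, discrepancy):
    """Generic posterior predictive check: proportion of replicated data
    sets whose discrepancy is at least as extreme as the observed one."""
    d_obs = discrepancy(observed)
    d_rep = np.array([discrepancy(simulate(p)) for p in posterior_draws])
    return float(np.mean(d_rep >= d_obs))

# Toy illustration: does an i.i.d. normal model reproduce the lag-1
# autocorrelation of a trending (nonstationary) series? It should not,
# so the posterior predictive p-value should be extreme.
def lag1_autocorr(x):
    x = np.asarray(x, dtype=float)
    return np.corrcoef(x[:-1], x[1:])[0, 1]

observed = np.cumsum(rng.normal(size=200))              # random-walk series
draws = [(observed.mean(), observed.std())] * 500       # stand-in for posterior draws
simulate = lambda p: rng.normal(p[0], p[1], size=observed.size)
print(ppmc_pvalue(draws, observed, simulate, lag1_autocorr))
```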

3.
Multivariate Behav Res ; 59(1): 78-97, 2024.
Article in English | MEDLINE | ID: mdl-37318274

ABSTRACT

The accessibility of electronic devices and the novel statistical methodologies now available have allowed researchers to comprehend psychological processes at the individual level. However, there are still great challenges to overcome because, in many cases, the collected data are more complex than the available models are able to handle. For example, most methods assume that the variables in the time series are measured on an interval scale, which is not the case when Likert-scale items are used. Ignoring the scale of the variables can be problematic and bias the results. Additionally, most methods also assume that the time series are stationary, which is rarely the case. To tackle these disadvantages, we propose a model that combines the partial credit model (PCM) from the item response theory framework and the time-varying autoregressive model (TV-AR), a popular model used to study psychological dynamics. The proposed model, referred to as the time-varying dynamic partial credit model (TV-DPCM), makes it possible to appropriately analyze multivariate polytomous data and nonstationary time series. We test the performance and accuracy of the TV-DPCM in a simulation study. Lastly, by means of an example, we show how to fit the model to empirical data and interpret the results.
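As a rough sketch (with notation chosen here for illustration; the paper's exact parameterization may differ), the two building blocks can be written as

\[
P(X_{it} = x \mid \theta_t) = \frac{\exp \sum_{k=1}^{x} (\theta_t - \beta_{ik})}{\sum_{r=0}^{m_i} \exp \sum_{k=1}^{r} (\theta_t - \beta_{ik})}, \qquad
\theta_t = \mu_t + \phi\,(\theta_{t-1} - \mu_{t-1}) + \varepsilon_t, \quad \varepsilon_t \sim N(0, \sigma^2),
\]

where the first expression is the partial credit model for item i at time t (with the empty sum for x = 0 set to zero), and the second is the autoregressive latent process whose time-varying mean \(\mu_t\) captures the nonstationary trend.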


Subject(s)
Models, Statistical , Time Factors , Computer Simulation , Data Collection
4.
Appl Psychol Meas ; 47(5-6): 420-437, 2023 Sep.
Article in English | MEDLINE | ID: mdl-37810540

ABSTRACT

Aberrant responding on tests and surveys has been shown to affect the psychometric properties of scales and the statistical analyses based on those scales in cumulative model contexts. This study extends prior research by comparing the effects of four types of aberrant responding on model fit in both cumulative and ideal point model contexts, using graded partial credit (GPCM) and generalized graded unfolding (GGUM) models. When fitting models to data, model misfit can be a function of both misspecification and aberrant responding. Results demonstrate how varying levels of aberrant data can severely impact model fit for both cumulative and ideal point data. Specifically, longstring responses have a stronger impact on dimensionality for both ideal point and cumulative data, while random responding tends to have the most negative impact on model fit according to information criteria (AIC, BIC). The results also indicate that ideal point models such as the GGUM may be able to fit cumulative data as well as the cumulative model itself (GPCM), whereas cumulative models may not provide sufficient fit for data simulated using an ideal point model.
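For reference, the information criteria used to compare model fit are the standard ones,

\[
\mathrm{AIC} = 2k - 2\ln \hat{L}, \qquad \mathrm{BIC} = k \ln n - 2\ln \hat{L},
\]

where \(\hat{L}\) is the maximized likelihood, k the number of free parameters, and n the sample size; lower values indicate better penalized fit.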

5.
Psychol Methods ; 28(3): 740-755, 2023 Jun.
Article in English | MEDLINE | ID: mdl-34735173

ABSTRACT

Some important research questions require the ability to find evidence for two conditions being practically equivalent. This is impossible to accomplish within the traditional frequentist null hypothesis significance testing framework; hence, other methodologies must be utilized. We explain and illustrate three approaches for finding evidence for equivalence: The frequentist two one-sided tests procedure, the Bayesian highest density interval region of practical equivalence procedure, and the Bayes factor interval null procedure. We compare the classification performances of these three approaches for various plausible scenarios. The results indicate that the Bayes factor interval null approach compares favorably to the other two approaches in terms of statistical power. Critically, compared with the Bayes factor interval null procedure, the two one-sided tests and the highest density interval region of practical equivalence procedures have limited discrimination capabilities when the sample size is relatively small: Specifically, in order to be practically useful, these two methods generally require over 250 cases within each condition when rather large equivalence margins of approximately .2 or .3 are used; for smaller equivalence margins even more cases are required. Because of these results, we recommend that researchers rely more on the Bayes factor interval null approach for quantifying evidence for equivalence, especially for studies that are constrained on sample size. (PsycInfo Database Record (c) 2023 APA, all rights reserved).
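For readers unfamiliar with the frequentist arm of this comparison, the sketch below illustrates the two one-sided tests procedure for two independent groups using standard scipy routines; the data, the pooled-variance test, and the equivalence margin of 0.2 standard deviations are illustrative assumptions, not a reproduction of the simulations in the article.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(0.00, 1.0, 300)   # condition A
y = rng.normal(0.05, 1.0, 300)   # condition B (true effect well inside the margin)

# Pooled-variance setup and an equivalence margin of +/- 0.2 pooled SDs.
n1, n2 = len(x), len(y)
df = n1 + n2 - 2
sp = np.sqrt(((n1 - 1) * x.var(ddof=1) + (n2 - 1) * y.var(ddof=1)) / df)
se = sp * np.sqrt(1 / n1 + 1 / n2)
margin = 0.2 * sp
diff = x.mean() - y.mean()

# Two one-sided tests: reject both H0: diff <= -margin and H0: diff >= +margin.
p_lower = 1 - stats.t.cdf((diff + margin) / se, df)
p_upper = stats.t.cdf((diff - margin) / se, df)
p_tost = max(p_lower, p_upper)
print(f"TOST p = {p_tost:.3f} -> equivalence claimed: {p_tost < .05}")
```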


Subject(s)
Research Design , Humans , Bayes Theorem , Sample Size
6.
Psychol Methods ; 28(3): 558-579, 2023 Jun.
Article in English | MEDLINE | ID: mdl-35298215

ABSTRACT

The last 25 years have shown a steady increase in attention for the Bayes factor as a tool for hypothesis evaluation and model selection. The present review highlights the potential of the Bayes factor in psychological research. We discuss six types of applications: Bayesian evaluation of point null, interval, and informative hypotheses, Bayesian evidence synthesis, Bayesian variable selection and model averaging, and Bayesian evaluation of cognitive models. We elaborate on what each application entails, give illustrative examples, and provide an overview of key references and software with links to other applications. The article concludes with a discussion of the opportunities and pitfalls of Bayes factor applications and a sketch of corresponding future research lines. (PsycInfo Database Record (c) 2023 APA, all rights reserved).
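The single identity underlying all six application types is the updating of prior to posterior model odds by the Bayes factor:

\[
\underbrace{\frac{p(\mathcal{H}_0 \mid \text{data})}{p(\mathcal{H}_1 \mid \text{data})}}_{\text{posterior odds}}
=
\underbrace{\frac{p(\text{data} \mid \mathcal{H}_0)}{p(\text{data} \mid \mathcal{H}_1)}}_{\text{Bayes factor } BF_{01}}
\times
\underbrace{\frac{p(\mathcal{H}_0)}{p(\mathcal{H}_1)}}_{\text{prior odds}}.
\]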


Subject(s)
Bayes Theorem , Behavioral Research , Psychology , Humans , Behavioral Research/methods , Psychology/methods , Software , Research Design
7.
Psychon Bull Rev ; 30(2): 534-552, 2023 Apr.
Article in English | MEDLINE | ID: mdl-36085233

ABSTRACT

In classical statistics, there is a close link between null hypothesis significance testing (NHST) and parameter estimation via confidence intervals. For the Bayesian counterpart, however, a link between null hypothesis Bayesian testing (NHBT) and Bayesian estimation via a posterior distribution is less straightforward, but it does exist and has recently been reiterated by Rouder, Haaf, and Vandekerckhove (2018). It hinges on a combination of a point mass probability and a probability density function as prior (denoted the spike-and-slab prior). In the present paper, it is first carefully explained how the spike-and-slab prior is defined, and how results can be derived for which proofs were not given in Rouder, Haaf, and Vandekerckhove (2018). Next, it is shown that this spike-and-slab prior can be approximated by a pure probability density function with a rectangular peak around the center towering high above the remainder of the density function. Finally, we indicate how this 'hill-and-chimney' prior may in turn be approximated by fully continuous priors. In this way, it is shown that NHBT results can be approximated well by results from estimation using a strongly peaked prior, and it is noted that the estimation itself offers more than merely the posterior odds on which NHBT is based. Thus, it complies with the strong APA requirement of not just mentioning testing results but also offering effect size information. It also offers a transparent perspective on the NHBT approach, which employs a prior with a strong peak around the chosen point null hypothesis value.
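In the notation used here for illustration (the paper's own symbols may differ), the spike-and-slab prior on an effect size \(\delta\) and its 'hill-and-chimney' approximation can be written as

\[
\delta \sim \pi_0\, \Delta_0 + (1 - \pi_0)\, g(\delta)
\quad\approx\quad
\pi_0\, \mathrm{Uniform}(-\epsilon, \epsilon) + (1 - \pi_0)\, g(\delta),
\]

where \(\Delta_0\) is a point mass (the spike) at \(\delta = 0\), g is a continuous slab density (e.g., a Cauchy distribution centered at zero), \(\pi_0\) is the prior probability of the null, and the narrow uniform component with small \(\epsilon\) forms the rectangular 'chimney' towering above the 'hill' g.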


Subject(s)
Research Design , Humans , Bayes Theorem , Likelihood Functions
8.
Psychol Methods ; 27(3): 466-475, 2022 Jun.
Article in English | MEDLINE | ID: mdl-35901398

ABSTRACT

In 2019 we wrote an article (Tendeiro & Kiers, 2019) in Psychological Methods on null hypothesis Bayesian testing and its workhorse, the Bayes factor. Recently, van Ravenzwaaij and Wagenmakers (2021) offered a response to our piece, also in this journal. Although we do welcome their thought-provoking remarks on our article, we ended up concluding that there were too many "issues" in van Ravenzwaaij and Wagenmakers (2021) that warrant a rebuttal. In this article we both defend the main premises of our original article and put the contribution of van Ravenzwaaij and Wagenmakers (2021) under critical appraisal. Our hope is that this exchange between scholars decisively contributes toward a better understanding among psychologists of null hypothesis Bayesian testing in general and of the Bayes factor in particular. (PsycInfo Database Record (c) 2022 APA, all rights reserved).


Subject(s)
Research Design , Bayes Theorem , Data Interpretation, Statistical
9.
J Exp Psychol Appl ; 28(1): 166-178, 2022 Mar.
Article in English | MEDLINE | ID: mdl-34138620

ABSTRACT

Robust scientific evidence shows that human performance predictions are more valid when information is combined mechanically (with a decision rule) rather than holistically (in the decision maker's mind). Yet, information is often combined holistically in practice. One reason is that decision makers lack knowledge of evidence-based decision making. In a performance prediction task, we tested whether watching an educational video on evidence-based decision making increased decision makers' use of a decision rule and their prediction accuracy, both immediately after the manipulation and a month later. Furthermore, we manipulated whether participants earned incentives for accurate predictions. Existing research has shown that incentives decrease decision-rule use and prediction accuracy. We hypothesized that this is the case for decision makers who did not receive educational information about evidence-based decision making, but that incentives increase decision-rule use and prediction accuracy for participants who received educational information. Our results showed that educational information increased decision-rule use. This resulted in increased prediction accuracy, but only immediately after receiving the educational information. In contrast to the existing literature, incentives slightly increased decision-rule use. We did not find evidence that this effect was larger for educated participants. Providing decision makers with educational information may be effective in increasing decision-rule use in practice. (PsycInfo Database Record (c) 2022 APA, all rights reserved).


Subject(s)
Decision Making , Motivation , Humans
10.
Psychol Methods ; 27(1): 17-43, 2022 Feb.
Article in English | MEDLINE | ID: mdl-34014719

ABSTRACT

Traditionally, researchers have used time series and multilevel models to analyze intensive longitudinal data. However, these models do not directly address traits and states, which conceptualize the stability and variability implicit in longitudinal research, and they do not explicitly take measurement error into account. An alternative that overcomes these drawbacks is to consider structural equation models (state-trait SEMs) for longitudinal data that represent traits and states as latent variables. Most of these models are encompassed in latent state-trait (LST) theory. These state-trait SEMs can become problematic when the number of measurement occasions increases: because they require the data in wide format, the models quickly become overparameterized and lead to nonconvergence issues. For these reasons, multilevel versions of state-trait SEMs have been proposed, which require the data in long format. To study how suitable state-trait SEMs are for intensive longitudinal data, we carried out a simulation study in which we compared the traditional single-level and multilevel versions of three state-trait SEMs: the multistate-singletrait (MSST) model, the common and unique trait-state (CUTS) model, and the trait-state-occasion (TSO) model. Furthermore, we also included an empirical application. Our results indicated that the TSO model performed best in both the simulated and the empirical data. To conclude, we highlight the usefulness of state-trait SEMs for studying the psychometric properties of the questionnaires used in intensive longitudinal research. Yet, these models still have multiple limitations, some of which might be overcome by extending them to more general frameworks. (PsycInfo Database Record (c) 2022 APA, all rights reserved).
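As a hedged sketch of the general idea (notation chosen here; the three models differ in their exact constraints), a trait-state-occasion-type decomposition for item i at occasion t reads

\[
Y_{it} = \lambda_i\, S_t + \varepsilon_{it}, \qquad S_t = T + O_t, \qquad O_t = \beta\, O_{t-1} + \zeta_t,
\]

where T is a stable trait factor, \(O_t\) an occasion-specific factor with autoregressive carry-over \(\beta\), and \(\varepsilon_{it}\) measurement error; the multilevel versions estimate the same decomposition with the data in long format.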


Subject(s)
Models, Theoretical , Humans , Latent Class Analysis , Multilevel Analysis , Psychometrics , Surveys and Questionnaires
11.
Assessment ; 29(7): 1392-1405, 2022 10.
Article in English | MEDLINE | ID: mdl-34041940

ABSTRACT

Functional somatic symptoms (FSS) are physical symptoms that cannot be attributed to underlying pathology. Their severity is often measured with sum scores on questionnaires; however, this may not adequately reflect FSS severity in subgroups of patients. We aimed to identify the items of the somatization section of the Composite International Diagnostic Interview that best discriminate between FSS severity levels, and to assess their functioning in sex and age subgroups. We applied the two-parameter logistic model to 19 items in a population-representative cohort of 962 participants. Subsequently, we examined differential item functioning (DIF). "Localized (muscle) weakness" was the most discriminative item of FSS severity. "Abdominal pain" consistently showed DIF by sex, with males reporting it at higher FSS severity. There was no consistent DIF by age; however, "Joint pain" showed poor discrimination of FSS severity in older adults. These findings could be helpful for the development of better assessment instruments for FSS, which can improve both future research and clinical care.
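The two-parameter logistic model used here specifies, for item i and latent FSS severity \(\theta\),

\[
P(X_i = 1 \mid \theta) = \frac{1}{1 + \exp\left[-a_i(\theta - b_i)\right]},
\]

where \(a_i\) is the discrimination and \(b_i\) the severity (difficulty) parameter; DIF is present when these parameters differ across groups (e.g., sex or age) for respondents at the same \(\theta\).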


Subject(s)
Medically Unexplained Symptoms , Aged , Cohort Studies , Humans , Male , Models, Statistical , Pain , Psychometrics , Surveys and Questionnaires
12.
Qual Life Res ; 31(1): 49-59, 2022 Jan.
Article in English | MEDLINE | ID: mdl-34476671

ABSTRACT

PURPOSE: In Mokken scaling, the Crit index was proposed and is sometimes used as evidence (or lack thereof) of violations of some common model assumptions. The main goal of our study was twofold: to make the formulation of the Crit index explicit and accessible, and to investigate its distribution under various measurement conditions. METHODS: We conducted two simulation studies in the context of dichotomously scored item responses. We manipulated the type of assumption violation, the proportion of violating items, sample size, and quality. False positive rates and power to detect assumption violations were our main outcome variables. Furthermore, we applied the Crit coefficient in a Mokken scale analysis to a set of responses to the General Health Questionnaire (GHQ-12), a self-administered questionnaire for assessing current mental health. RESULTS: We found that the false positive rates of Crit were close to the nominal rate in most conditions, and that the power to detect misfit depended on the sample size, type of violation, and number of assumption-violating items. Overall, in small samples Crit lacked the power to detect misfit, and in larger samples power differed considerably depending on the type of violation and the proportion of misfitting items. Furthermore, our empirical example showed that even in large samples the Crit index may fail to detect assumption violations. DISCUSSION: Even in large samples, the Crit coefficient showed limited usefulness for detecting moderate and severe violations of monotonicity. Our findings are relevant to researchers and practitioners who use Mokken scaling for scale and questionnaire construction and revision.


Subject(s)
Quality of Life , Research Design , Computer Simulation , Humans , Mental Health , Quality of Life/psychology , Surveys and Questionnaires
13.
Psychon Bull Rev ; 29(1): 70-87, 2022 Feb.
Article in English | MEDLINE | ID: mdl-34254263

ABSTRACT

The practice of sequentially testing a null hypothesis as data are collected, until the null hypothesis is rejected, is known as optional stopping. It is well known that optional stopping is problematic in the context of p value-based null hypothesis significance testing: the false-positive rate quickly overcomes the single test's significance level. However, the state of affairs under null hypothesis Bayesian testing, where p values are replaced by Bayes factors, has, perhaps surprisingly, been much less consensual. Rouder (2014) used simulations to defend the use of optional stopping under null hypothesis Bayesian testing. The idea behind these simulations is closely related to the idea of sampling from prior predictive distributions. Deng et al. (2016) and Hendriksen et al. (2020) have provided mathematical evidence that optional stopping under null hypothesis Bayesian testing does hold under some conditions. These papers are, however, exceedingly technical for most researchers in the applied social sciences. In this paper, we provide some mathematical derivations concerning Rouder's approximate simulation results for the two Bayesian hypothesis tests that he considered. The key idea is to consider the probability distribution of the Bayes factor, which is regarded as a random variable across repeated sampling. This paper therefore offers an intuitive perspective on the literature, and we believe it is a valid contribution toward understanding the practice of optional stopping in the context of Bayesian hypothesis testing.
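The frequentist half of the contrast drawn here is easy to demonstrate. The sketch below (a generic illustration, not the derivations of the paper) simulates the classic inflation of the false-positive rate when a one-sample t test is recomputed after every new observation under a true null:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def rejects_under_optional_stopping(n_min=10, n_max=200, alpha=0.05):
    """Sample from N(0, 1) (true null), test after every new observation,
    and report whether the null was ever rejected before reaching n_max."""
    x = list(rng.normal(size=n_min))
    while len(x) < n_max:
        if stats.ttest_1samp(x, 0.0).pvalue < alpha:
            return True
        x.append(rng.normal())
    return stats.ttest_1samp(x, 0.0).pvalue < alpha

rate = np.mean([rejects_under_optional_stopping() for _ in range(1000)])
print(f"False-positive rate under optional stopping: {rate:.2f}")  # well above alpha
```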


Subject(s)
Research Design , Bayes Theorem , Computer Simulation , Humans , Probability
14.
Nat Hum Behav ; 5(11): 1473-1480, 2021 11.
Article in English | MEDLINE | ID: mdl-34764461

ABSTRACT

We argue that statistical practice in the social and behavioural sciences benefits from transparency, a fair acknowledgement of uncertainty and openness to alternative interpretations. Here, to promote such a practice, we recommend seven concrete statistical procedures: (1) visualizing data; (2) quantifying inferential uncertainty; (3) assessing data preprocessing choices; (4) reporting multiple models; (5) involving multiple analysts; (6) interpreting results modestly; and (7) sharing data and code. We discuss their benefits and limitations, and provide guidelines for adoption. Each of the seven procedures finds inspiration in Merton's ethos of science as reflected in the norms of communalism, universalism, disinterestedness and organized scepticism. We believe that these ethical considerations-as well as their statistical consequences-establish common ground among data analysts, despite continuing disagreements about the foundations of statistical inference.
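As a small illustration of recommendation (2), quantifying inferential uncertainty, the sketch below computes a nonparametric bootstrap confidence interval for a mean; the data are simulated and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
scores = rng.normal(100, 15, size=80)        # illustrative sample

# Resample with replacement and collect the resampled means.
boot_means = np.array([
    rng.choice(scores, size=scores.size, replace=True).mean()
    for _ in range(5000)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"mean = {scores.mean():.1f}, 95% bootstrap CI [{lo:.1f}, {hi:.1f}]")
```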


Subject(s)
Statistics as Topic , Data Interpretation, Statistical , Humans , Information Dissemination , Models, Statistical , Research Design/standards , Statistics as Topic/methods , Statistics as Topic/standards , Uncertainty
15.
Assessment ; 28(8): 1960-1970, 2021 12.
Article in English | MEDLINE | ID: mdl-32757771

ABSTRACT

More than 40 questionnaires have been developed to assess functional somatic symptoms (FSS), but there are several methodological issues regarding their measurement. We aimed to identify which items of the somatization subscale of the Symptom Checklist-90 (SCL-90) are most informative and most discriminative between persons at different levels of FSS severity. To this end, item response theory was applied to the somatization scale of the SCL-90, collected from a sample of 82,740 adult participants without somatic conditions in the Lifelines Cohort Study. Sensitivity analyses were performed with all participants who completed the somatization scale. Both analyses showed that Item 11, "feeling weak physically," and Item 12, "heavy feelings in arms or legs," were the most discriminative and informative for measuring severity levels of FSS, regardless of somatic conditions. Clinicians and researchers may pay extra attention to these symptoms to augment the assessment of FSS.
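"Informative" here has a precise meaning: under a dichotomous two-parameter model (shown for illustration; the polytomous analogue applies to Likert-type items), the information an item contributes at severity level \(\theta\) is

\[
I_i(\theta) = a_i^2\, P_i(\theta)\,[1 - P_i(\theta)], \qquad P_i(\theta) = \frac{1}{1 + \exp[-a_i(\theta - b_i)]},
\]

so items with high discrimination \(a_i\) are most informative near their own severity location \(b_i\).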


Subject(s)
Medically Unexplained Symptoms , Adult , Cohort Studies , Humans , Somatoform Disorders/diagnosis , Surveys and Questionnaires
16.
Appl Psychol Meas ; 44(6): 482-496, 2020 Sep.
Article in English | MEDLINE | ID: mdl-32782419

ABSTRACT

Mokken scale analysis is a popular method for evaluating the psychometric quality of clinical and personality questionnaires and their individual items. Although many empirical papers report on the extent to which sets of items form Mokken scales, there is less attention to the effect of violations of commonly used rules of thumb. In this study, the authors investigated the practical consequences of retaining or removing items with psychometric properties that do not comply with these rules of thumb. Using simulated data, they concluded that items with low scalability had some influence on the reliability of test scores, person ordering and selection, and criterion-related validity estimates. Removing the misfitting items from the scale had, in general, a small effect on the outcomes. Although important outcome variables were fairly robust against scale violations in some conditions, the authors conclude that researchers should not rely exclusively on algorithms that allow automatic selection of items. In particular, content validity must be taken into account to build sensible psychometric instruments.
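The rules of thumb at issue are typically thresholds on Loevinger's scalability coefficients (e.g., item H values of at least .30). The sketch below computes the item-pair coefficient H_ij for dichotomous items from its definition as observed covariance divided by the maximum covariance attainable given the item marginals; it is a simplified illustration, not the mokken package implementation.

```python
import numpy as np

def pairwise_H(x, y):
    """Loevinger's H_ij for two dichotomous (0/1) items:
    observed covariance divided by its maximum given the marginals."""
    x, y = np.asarray(x), np.asarray(y)
    p_x, p_y = x.mean(), y.mean()
    cov = (x * y).mean() - p_x * p_y
    cov_max = min(p_x, p_y) - p_x * p_y
    return cov / cov_max

# Toy check: two items driven by the same latent trait.
rng = np.random.default_rng(3)
theta = rng.normal(size=2000)
item1 = (theta + rng.normal(size=2000) > 0.0).astype(int)
item2 = (theta + rng.normal(size=2000) > 0.5).astype(int)
print(round(pairwise_H(item1, item2), 2))    # typically around .4-.5
```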

17.
Int J Methods Psychiatr Res ; 28(4): e1795, 2019 12.
Article in English | MEDLINE | ID: mdl-31264326

ABSTRACT

OBJECTIVES: In this study, we examined the consequences of ignoring violations of the assumptions underlying the use of sum scores in assessing attention problems (AP), and whether psychometrically more refined models improve predictions of relevant outcomes in adulthood. METHODS: Tracking Adolescents' Individual Lives data were used. AP symptom properties were examined using the AP scale of the Child Behavior Checklist at age 11. Consequences of model violations were evaluated in relation to psychopathology, educational attainment, financial status, and ability to form relationships in adulthood. RESULTS: Results showed that symptoms differed with respect to information and difficulty. Moreover, evidence of multidimensionality was found, with two groups of items measuring sluggish cognitive tempo and attention deficit hyperactivity disorder symptoms. Item response theory analyses indicated that a bifactor model fitted these data better than other competing models. In terms of accuracy of predicting functional outcomes, sum scores were robust against violations of assumptions in some situations. Nevertheless, AP scores derived from the bifactor model showed some superiority over sum scores. CONCLUSION: These findings show that more accurate predictions of later-life difficulties can be made if one uses a more suitable psychometric model to assess AP severity in children. This has important implications for research and clinical practice.
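The bifactor structure referred to can be sketched (notation chosen here for illustration, and written in dichotomous form for simplicity) as each symptom loading on a general AP factor plus one specific factor, with all factors orthogonal:

\[
\operatorname{logit} P(X_i = 1 \mid g, s_k) = a_i^{(g)} g + a_i^{(k)} s_k - b_i, \qquad g \perp s_{\mathrm{SCT}} \perp s_{\mathrm{ADHD}},
\]

where g is the general attention-problems factor and \(s_{\mathrm{SCT}}\), \(s_{\mathrm{ADHD}}\) are the specific sluggish-cognitive-tempo and ADHD-symptom factors; the general-factor score is the model-based alternative to the sum score.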


Subject(s)
Attention Deficit Disorder with Hyperactivity/diagnosis , Behavior Rating Scale/standards , Child Behavior Disorders/diagnosis , Psychiatric Status Rating Scales/standards , Psychometrics/standards , Adolescent , Adult , Child , Female , Humans , Longitudinal Studies , Male , Models, Statistical , Severity of Illness Index , Young Adult
18.
Psychol Methods ; 24(6): 774-795, 2019 Dec.
Article in English | MEDLINE | ID: mdl-31094544

ABSTRACT

Null hypothesis significance testing (NHST) has been under scrutiny for decades. The literature shows overwhelming evidence of a large range of problems affecting NHST. One of the proposed alternatives to NHST is using Bayes factors instead of p values. Here we denote the method of using Bayes factors to test point null models as "null hypothesis Bayesian testing" (NHBT). In this article we offer a wide overview of potential issues (limitations or sources of misinterpretation) with NHBT, which is currently missing in the literature. We illustrate many of the shortcomings of NHBT by means of reproducible examples. The article concludes with a discussion of NHBT in particular and testing in general. In particular, we argue that posterior model probabilities should be given more emphasis than Bayes factors, because only the former provide direct answers to the most common research questions under consideration. (PsycINFO Database Record (c) 2019 APA, all rights reserved).
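The conversion the article advocates emphasizing is straightforward: with prior model probabilities \(\pi_0\) and \(1 - \pi_0\), the Bayes factor yields the posterior probability of the null directly,

\[
p(\mathcal{H}_0 \mid \text{data}) = \frac{\pi_0\, BF_{01}}{\pi_0\, BF_{01} + (1 - \pi_0)},
\]

which for equal prior probabilities reduces to \(BF_{01} / (BF_{01} + 1)\); unlike the Bayes factor alone, this quantity answers how plausible the null is now, conditional on the set of candidate models considered.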


Subject(s)
Data Interpretation, Statistical , Models, Statistical , Probability , Research Design , Humans
19.
BMC Med Res Methodol ; 19(1): 71, 2019 03 29.
Article in English | MEDLINE | ID: mdl-30925900

ABSTRACT

BACKGROUND: In clinical trials, study designs may focus on assessment of superiority, equivalence, or non-inferiority, of a new medicine or treatment as compared to a control. Typically, evidence in each of these paradigms is quantified with a variant of the null hypothesis significance test. A null hypothesis is assumed (null effect, inferior by a specific amount, inferior by a specific amount and superior by a specific amount, for superiority, non-inferiority, and equivalence respectively), after which the probabilities of obtaining data more extreme than those observed under these null hypotheses are quantified by p-values. Although ubiquitous in clinical testing, the null hypothesis significance test can lead to a number of difficulties in interpretation of the results of the statistical evidence. METHODS: We advocate quantifying evidence instead by means of Bayes factors and highlight how these can be calculated for different types of research design. RESULTS: We illustrate Bayes factors in practice with reanalyses of data from existing published studies. CONCLUSIONS: Bayes factors for superiority, non-inferiority, and equivalence designs allow for explicit quantification of evidence in favor of the null hypothesis. They also allow for interim testing without the need to employ explicit corrections for multiple testing.


Subject(s)
Algorithms , Bayes Theorem , Evidence-Based Medicine/statistics & numerical data , Outcome Assessment, Health Care/statistics & numerical data , Research Design , Biometry/methods , Evidence-Based Medicine/methods , Humans , Outcome Assessment, Health Care/methods , Therapeutic Equivalency
20.
Appl Psychol Meas ; 43(2): 172-173, 2019 Mar.
Article in English | MEDLINE | ID: mdl-30792563

ABSTRACT

In this article, the newly created GGUM R package is presented. This package finally brings the generalized graded unfolding model (GGUM) to the front stage for practitioners and researchers. It expands the possibilities of fitting this type of item response theory (IRT) model to settings that, up to now, were not possible (thus going beyond the limitations imposed by the widespread GGUM2004 software). The outcome is therefore a unique piece of software, not limited by the dimensions of the data matrix or by the operating system used. It includes various routines for fitting the model, checking model fit, plotting the results, and also interacting with GGUM2004 for those interested. The software should be of interest to all those interested in IRT in general or in ideal point models in particular.
