ABSTRACT
Replication, an important, uncommon, and misunderstood practice, is gaining appreciation in psychology. Achieving replicability is important for making research progress. If findings are not replicable, then prediction and theory development are stifled. If findings are replicable, then interrogation of their meaning and validity can advance knowledge. Assessing replicability can be productive for generating and testing hypotheses by actively confronting current understandings to identify weaknesses and spur innovation. For psychology, the 2010s might be characterized as a decade of active confrontation. Systematic and multi-site replication projects assessed current understandings and observed surprising failures to replicate many published findings. Replication efforts highlighted sociocultural challenges such as disincentives to conduct replications and a tendency to frame replication as a personal attack rather than a healthy scientific practice, and they raised awareness that replication contributes to self-correction. Nevertheless, innovation in doing and understanding replication and its cousins, reproducibility and robustness, has positioned psychology to improve research practices and accelerate progress.
Subjects
Research Design, Humans, Reproducibility of Results
ABSTRACT
A growing body of evidence indicates that the effects reported in many scientific fields may be overestimated or even false. This problem has received considerable attention in the field of psychology, where researchers have even started to speak of a 'replication crisis'. Fortunately, a number of measures to rectify this problem have already been proposed and implemented, some inspired by practices in other scientific fields. In this review, I briefly examine this issue in the field of psychology and suggest some practical tools and strategies that researchers can implement to increase replicability and the overall quality of their scientific research. WHAT THIS PAPER ADDS: Researchers can implement many practical tools and strategies to improve the replicability of their findings. Strategies include improving statistical inference, pre-registration, multisite collaboration, and sharing data. Different scientific fields could benefit from looking at each other's best practices.
Subjects
Health Plan Implementation, Research Design, Researchers, Research, Humans, Reproducibility of Results
ABSTRACT
In determining the need to directly replicate, it is crucial to first verify the original results through independent reanalysis of the data. Original results that appear erroneous and that cannot be reproduced by reanalysis offer little evidence to begin with, thereby diminishing the need to replicate. Sharing data and scripts is essential to ensure reproducibility.
Subjects
Research Design, Reproducibility of Results
ABSTRACT
This study documents reporting errors in a sample of over 250,000 p-values reported in eight major psychology journals from 1985 until 2013, using the new R package "statcheck." statcheck retrieved null-hypothesis significance testing (NHST) results from over half of the articles from this period. In line with earlier research, we found that half of all published psychology papers that use NHST contained at least one p-value that was inconsistent with its test statistic and degrees of freedom. One in eight papers contained a grossly inconsistent p-value that may have affected the statistical conclusion. In contrast to earlier findings, we found that the average prevalence of inconsistent p-values has been stable over the years or has declined. The prevalence of gross inconsistencies was higher in p-values reported as significant than in p-values reported as nonsignificant. This could indicate a systematic bias in favor of significant results. Possible solutions for the high prevalence of reporting inconsistencies could be to encourage sharing data, to let co-authors check results in a so-called "co-pilot model," and to use statcheck to flag possible inconsistencies in one's own manuscript or during the review process.
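To make concrete what an inconsistent p-value means here, the following minimal sketch in base R recomputes a two-sided p-value from a reported t statistic and its degrees of freedom and compares it with the reported value; the numbers are an invented example, and the consistency rules are simplified relative to statcheck's actual checks (which handle rounding, inequalities, and one-sided tests).

# A reported result such as "t(28) = 2.20, p = .03" can be checked by
# recomputing the p-value from the test statistic and degrees of freedom.
t_value    <- 2.20   # reported test statistic (invented example)
df         <- 28     # reported degrees of freedom
reported_p <- 0.03   # reported p-value

computed_p <- 2 * pt(abs(t_value), df, lower.tail = FALSE)  # two-sided p

# Inconsistency: the reported p does not match the recomputed p (here, after
# rounding to two decimals). Gross inconsistency: the mismatch changes the
# significance decision at alpha = .05.
inconsistent <- round(computed_p, 2) != reported_p
gross        <- (computed_p < .05) != (reported_p < .05)
c(computed_p = computed_p, inconsistent = inconsistent, gross = gross)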
Subjects
Behavioral Research/statistics & numerical data, Bias, Humans, Prevalence
ABSTRACT
In order to quantify the relationship between multiple variables, researchers often carry out a mediation analysis. In such an analysis, a mediator (e.g., knowledge of a healthy diet) transmits the effect from an independent variable (e.g., classroom instruction on a healthy diet) to a dependent variable (e.g., consumption of fruits and vegetables). Almost all mediation analyses in psychology use frequentist estimation and hypothesis-testing techniques. A recent exception is Yuan and MacKinnon (Psychological Methods, 14, 301-322, 2009), who outlined a Bayesian parameter estimation procedure for mediation analysis. Here we complete the Bayesian alternative to frequentist mediation analysis by specifying a default Bayesian hypothesis test based on the Jeffreys-Zellner-Siow approach. We further extend this default Bayesian test by allowing a comparison to directional or one-sided alternatives, using Markov chain Monte Carlo techniques implemented in JAGS. All Bayesian tests are implemented in the R package BayesMed (Nuijten, Wetzels, Matzke, Dolan, & Wagenmakers, 2014).
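As a brief illustration of the mediation model being tested, the sketch below estimates the indirect effect a*b from two regressions on simulated toy data; the variable names and numbers are invented, and the BayesMed call at the end is shown only roughly, with argument names assumed.

# Mediation: X -> M (path a) and M -> Y controlling for X (path b).
set.seed(1)
n <- 100
X <- rnorm(n)                      # e.g., classroom instruction (independent)
M <- 0.5 * X + rnorm(n)            # e.g., knowledge of a healthy diet (mediator)
Y <- 0.4 * M + 0.1 * X + rnorm(n)  # e.g., fruit and vegetable intake (dependent)

a <- coef(lm(M ~ X))["X"]          # path a
b <- coef(lm(Y ~ X + M))["M"]      # path b
unname(a * b)                      # point estimate of the indirect effect

# The default Bayesian hypothesis test described above is implemented in the
# BayesMed package; the call is roughly as follows (argument names assumed):
# library(BayesMed)
# jzs_med(independent = X, dependent = Y, mediator = M)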
Subjects
Bayes Theorem, Markov Chains, Monte Carlo Method, Humans, Multivariate Analysis, Negotiating, Research Design
ABSTRACT
Self-report scales are widely used in psychology to compare means in latent constructs across groups, experimental conditions, or time points. However, for these comparisons to be meaningful and unbiased, the scales must demonstrate measurement invariance (MI) across the compared time points or (experimental) groups; MI testing determines whether the latent constructs are measured equivalently across groups or time, which is essential for meaningful comparisons. We conducted a systematic review of 426 psychology articles with openly available data to (a) examine common practices in conducting and reporting MI testing, (b) assess whether we could reproduce the reported MI results, and (c) conduct MI tests for the comparisons that enabled sufficiently powerful MI testing. We identified 96 articles that contained a total of 929 comparisons. Results showed that only 4% of the 929 comparisons underwent MI testing, and the tests were generally poorly reported. None of the reported MI tests were reproducible, and only 26% of the 174 newly performed MI tests reached sufficient (scalar) invariance, with MI failing completely in 58% of tests. Exploratory analyses suggested that in nearly half of the comparisons where configural invariance was rejected, the number of factors differed between groups. These results indicate that MI tests are rarely conducted and poorly reported in psychological studies. We observed frequent violations of MI, suggesting that reported differences between (experimental) groups may not be solely attributable to group differences in the latent constructs. We offer recommendations aimed at improving reporting and computational reproducibility practices in psychology.
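For readers unfamiliar with the procedure, the sketch below shows a standard MI workflow in lavaan, comparing configural, metric (equal loadings), and scalar (equal intercepts) models across groups; it uses lavaan's built-in HolzingerSwineford1939 data and a simple one-factor model purely for illustration, not any of the reviewed articles.

library(lavaan)

# One-factor model for three Holzinger & Swineford (1939) items, compared
# across the two schools in the built-in example data.
model <- 'visual =~ x1 + x2 + x3'

configural <- cfa(model, data = HolzingerSwineford1939, group = "school")
metric     <- cfa(model, data = HolzingerSwineford1939, group = "school",
                  group.equal = "loadings")
scalar     <- cfa(model, data = HolzingerSwineford1939, group = "school",
                  group.equal = c("loadings", "intercepts"))

# Nested model comparisons: non-significant chi-square differences (and small
# changes in fit indices) support metric and scalar invariance, respectively.
lavTestLRT(configural, metric, scalar)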
ABSTRACT
The COVID-19 outbreak has led to an exponential increase of publications and preprints about the virus, its causes, consequences, and possible cures. COVID-19 research has been conducted under high time pressure and has been subject to financial and societal interests. Doing research under such pressure may influence the scrutiny with which researchers perform and write up their studies. Either researchers become more diligent, because of the high-stakes nature of the research, or the time pressure may lead to cutting corners and lower quality output. In this study, we conducted a natural experiment to compare the prevalence of incorrectly reported statistics in a stratified random sample of COVID-19 preprints and a matched sample of non-COVID-19 preprints. Our results show that the overall prevalence of incorrectly reported statistics is 9-10%, but frequentist as well as Bayesian hypothesis tests show no difference in the number of statistical inconsistencies between COVID-19 and non-COVID-19 preprints. In conclusion, the literature suggests that COVID-19 research may on average have more methodological problems than non-COVID-19 research, but our results show that there is no difference in the statistical reporting quality.
ABSTRACT
This opinion piece aims to inform future research funding programs on responsible research practices (RRPs) based on three specific objectives: (1) to sketch the current international discussion on RRPs; (2) to give an overview of current initiatives and the results already obtained regarding RRPs; and (3) to give an overview of potential future needs for research on RRPs. To create the proposed research agenda, we used seven iterative methodological steps (including literature review, ranking, and sorting exercises). We identified six main themes that we believe need attention in future research: (1) responsible evaluation of research and researchers, (2) the influence of open science and transparency on RRPs, (3) research on responsible mentoring, supervision, and role modeling, (4) the effect of education and training on RRPs, (5) checking for reproducibility, and (6) responsible and fair peer review. These themes have in common that they address aspects of research that operate mostly at the level of the scientific system rather than the individual researcher. Some current initiatives are already gathering substantial empirical evidence to start filling these gaps. We believe that, with sufficient support from all relevant stakeholders, more progress can be made.
Subjects
Researchers, Humans, Reproducibility of Results
ABSTRACT
For any scientific report, repeating the original analyses upon the original data should yield the original outcomes. We evaluated analytic reproducibility in 25 Psychological Science articles awarded open data badges between 2014 and 2015. Initially, 16 (64%, 95% confidence interval [43,81]) articles contained at least one 'major numerical discrepancy' (>10% difference) prompting us to request input from original authors. Ultimately, target values were reproducible without author involvement for 9 (36% [20,59]) articles; reproducible with author involvement for 6 (24% [8,47]) articles; not fully reproducible with no substantive author response for 3 (12% [0,35]) articles; and not fully reproducible despite author involvement for 7 (28% [12,51]) articles. Overall, 37 major numerical discrepancies remained out of 789 checked values (5% [3,6]), but original conclusions did not appear affected. Non-reproducibility was primarily caused by unclear reporting of analytic procedures. These results highlight that open data alone is not sufficient to ensure analytic reproducibility.
ABSTRACT
We present the R package and web app statcheck, which automatically detects statistical reporting inconsistencies in primary studies and meta-analyses. Previous research has shown a high prevalence of reported p-values that are inconsistent, meaning that a recalculated p-value, based on the reported test statistic and degrees of freedom, does not match the author-reported p-value. Such inconsistencies affect the reproducibility and evidential value of published findings. statcheck can help researchers identify statistical inconsistencies so that they can correct them. In this paper, we provide an overview of the prevalence and consequences of statistical reporting inconsistencies. We also discuss statcheck in more detail and give an example of how it can be used in a meta-analysis. We end with some recommendations concerning the use of statcheck in meta-analyses and make a case for better reporting standards for statistical results.
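As a sketch of how statcheck might be used when screening primary studies for a meta-analysis, the snippet below runs statcheck() on raw text taken from results sections; the example strings are invented, and the exact columns of the returned data frame differ between statcheck versions, so none are relied on here.

library(statcheck)

# Text copied from the results sections of (hypothetical) primary studies.
results_text <- c(
  "The effect was significant, t(28) = 2.20, p = .03.",
  "The groups did not differ, F(2, 57) = 1.20, p = .31."
)

# statcheck extracts APA-style NHST results, recomputes each p-value from the
# reported test statistic and degrees of freedom, and flags inconsistencies.
checks <- statcheck(results_text)
checks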
Subjects
Meta-Analysis as Topic, Psychology/methods, Research Design, Statistics as Topic, Algorithms, Humans, Statistical Models, Prevalence, Programming Languages, Reproducibility of Results, User-Computer Interface
ABSTRACT
In this meta-study, we analyzed 2442 effect sizes from 131 meta-analyses in intelligence research, published from 1984 to 2014, to estimate the average effect size, median power, and evidence for bias. We found that the average effect size in intelligence research was a Pearson's correlation of 0.26, and the median sample size was 60. Furthermore, across primary studies, we found a median power of 11.9% to detect a small effect, 54.5% to detect a medium effect, and 93.9% to detect a large effect. We documented differences in average effect size and median estimated power between different types of intelligence studies (correlational studies, studies of group differences, experiments, toxicology, and behavior genetics). On average, across all meta-analyses (but not in every meta-analysis), we found evidence for small-study effects, potentially indicating publication bias and overestimated effects. We found no differences in small-study effects between different study types. We also found no convincing evidence for the decline effect, US effect, or citation bias across meta-analyses. We concluded that intelligence research does show signs of low power and publication bias, but that these problems seem less severe than in many other scientific fields.
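To illustrate the kind of power calculation these medians are based on, the sketch below computes power to detect small, medium, and large correlations (Cohen's benchmarks r = .1, .3, .5) at the reported median sample size of 60, using the pwr package; because the paper's medians are taken over studies with varying sample sizes, these illustrative numbers will not match its figures exactly.

library(pwr)

# Power to detect small, medium, and large correlations (Cohen's benchmarks)
# with n = 60, the median sample size reported in the meta-study.
sapply(c(small = 0.1, medium = 0.3, large = 0.5), function(r) {
  pwr.r.test(n = 60, r = r, sig.level = 0.05)$power
})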
ABSTRACT
To determine the reproducibility of psychological meta-analyses, we investigated whether we could reproduce 500 primary study effect sizes drawn from 33 published meta-analyses based on the information given in the meta-analyses, and whether recomputations of primary study effect sizes altered the overall results of the meta-analysis. Results showed that almost half (k = 224) of all sampled primary effect sizes could not be reproduced based on the reported information in the meta-analysis, mostly because of incomplete or missing information on how effect sizes from primary studies were selected and computed. Overall, this led to small discrepancies in the computation of mean effect sizes, confidence intervals and heterogeneity estimates in 13 out of 33 meta-analyses. We provide recommendations to improve transparency in the reporting of the entire meta-analytic process, including the use of preregistration, data and workflow sharing, and explicit coding practices.
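The sketch below shows the kind of recomputation involved: deriving a primary study's standardized mean difference from its reported means, standard deviations, and group sizes with the metafor package, so that it can be compared against the value coded in the meta-analysis; the descriptives are invented.

library(metafor)

# Recompute a primary study's standardized mean difference (Hedges' g) from
# its reported descriptives (invented numbers).
es <- escalc(measure = "SMD",
             m1i = 5.4, sd1i = 1.2, n1i = 40,   # treatment group
             m2i = 4.9, sd2i = 1.1, n2i = 38)   # control group
es  # yi = effect size, vi = sampling variance

# If the value coded in the meta-analysis differs from es$yi, the primary
# study effect size could not be reproduced from the reported information.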
Subjects
Psychology/methods, Confidence Intervals, Meta-Analysis as Topic, Reproducibility of Results
ABSTRACT
Experimental philosophy (x-phi) is a young field of research in the intersection of philosophy and psychology. It aims to make progress on philosophical questions by using experimental methods traditionally associated with the psychological and behavioral sciences, such as null hypothesis significance testing (NHST). Motivated by recent discussions about a methodological crisis in the behavioral sciences, questions have been raised about the methodological standards of x-phi. Here, we focus on one aspect of this question, namely the rate of inconsistencies in statistical reporting. Previous research has examined the extent to which published articles in psychology and other behavioral sciences present statistical inconsistencies in reporting the results of NHST. In this study, we used the R package statcheck to detect statistical inconsistencies in x-phi, and compared rates of inconsistencies in psychology and philosophy. We found that rates of inconsistencies in x-phi are lower than in the psychological and behavioral sciences. From the point of view of statistical reporting consistency, x-phi seems to do no worse, and perhaps even better, than psychological science.
Subjects
Philosophy, Psychology/methods, Statistics as Topic, Behavioral Sciences, Humans, Statistical Models, Motivation, Reference Standards, Reproducibility of Results, Research Design, Social Sciences, Software
ABSTRACT
Previous studies provided mixed findings on peculiarities in p-value distributions in psychology. This paper examined 258,050 test results across 30,710 articles from eight high-impact journals to investigate whether there is a peculiar prevalence of p-values just below .05 (i.e., a bump) in the psychological literature, and whether it has increased over time. We indeed found evidence for a bump just below .05 in the distribution of exactly reported p-values in the journals Developmental Psychology, Journal of Applied Psychology, and Journal of Personality and Social Psychology, but the bump did not increase over the years and disappeared when using recalculated p-values. We found clear and direct evidence for the questionable research practice (QRP) of incorrect rounding of p-values (John, Loewenstein, & Prelec, 2012) in all psychology journals. Finally, we also investigated monotonic excess of p-values, an effect of certain QRPs that has been neglected in previous research, and developed two measures to detect it by modeling the distributions of statistically significant p-values. Using simulations and applying the two measures to the retrieved test results, we argue that, although one of the measures suggests the use of QRPs in psychology, it is difficult to draw general conclusions about QRPs based on modeling of p-value distributions.
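A minimal sketch, assuming one has a table of extracted test results with both the reported p-value and a p-value recomputed from the test statistic and degrees of freedom (column names and toy values invented): it flags results reported as significant whose recomputed p-value exceeds .05 (incorrect rounding) and bins recomputed p-values to look for an excess just below .05.

# Toy table of extracted NHST results; in practice this would come from
# recomputing the p-values of thousands of reported test statistics.
results <- data.frame(
  reported   = c("p < .05", "p = .03", "p < .05", "p = .20"),
  computed_p = c(0.052,      0.036,     0.049,     0.204)
)

# QRP "incorrect rounding of p-value": reported as significant although the
# recomputed p-value is (just) above .05.
subset(results, reported == "p < .05" & computed_p > 0.05)

# A bump just below .05 would appear as an excess of recomputed p-values in
# the (.04, .05] bin relative to neighbouring bins.
table(cut(results$computed_p, breaks = seq(0, 0.25, by = 0.01)))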
ABSTRACT
Statistical analysis is error prone. A best practice for researchers using statistics would therefore be to share data among co-authors, allowing double-checking of executed tasks just as co-pilots do in aviation. To document the extent to which this 'co-piloting' currently occurs in psychology, we surveyed the authors of 697 articles published in six top psychology journals and asked them whether they had collaborated on four aspects of analyzing data and reporting results, and whether the underlying data had been shared between the authors. We received responses for 49.6% of the articles and found that co-piloting of statistical analyses and reporting of results is quite uncommon among psychologists, while data sharing among co-authors is reasonably, though not completely, standard. We then used an automated procedure to study the prevalence of statistical reporting errors in the articles in our sample and examined the relationship between reporting errors and co-piloting. Overall, 63% of the articles contained at least one p-value that was inconsistent with the reported test statistic and the accompanying degrees of freedom, and 20% of the articles contained at least one p-value that was inconsistent to such a degree that it may have affected decisions about statistical significance. Overall, the probability that a given p-value was inconsistent was over 10%. Co-piloting was not associated with reporting errors.
Subjects
Cooperative Behavior, Statistical Data Interpretation, Information Dissemination, Psychology, Data Accuracy, Humans, Periodicals as Topic/standards, Periodicals as Topic/statistics & numerical data, Psychology/standards, Psychology/statistics & numerical data, Research Design, Statistics as Topic/methods, Statistics as Topic/standards
ABSTRACT
BACKGROUND: De Winter and Happee examined whether a science based on selective publishing of significant results can accurately estimate population effects, and whether it does so even more effectively than a science in which all results are published (i.e., a science without publication bias). Based on their simulation study they concluded that "selective publishing yields a more accurate meta-analytic estimation of the true effect than publishing everything, (and that) publishing nonreplicable results while placing null results in the file drawer can be beneficial for the scientific collective" (p. 4). METHODS AND FINDINGS: Using their scenario with a small to medium population effect size, we show that publishing everything is more effective for the scientific collective than selective publishing of significant results. Additionally, we examined a scenario with a null effect, which provides an even more dramatic illustration of the superiority of publishing everything over selective publishing. CONCLUSION: Publishing everything is more effective than reporting only significant outcomes.
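A minimal sketch of the kind of simulation at issue (not the authors' code): two-group studies with a small-to-medium true effect are simulated, and the average effect-size estimate from all studies is compared with the average from only the statistically significant ones, illustrating the overestimation produced by selective publishing.

# Simulate many two-group studies with a small-to-medium true effect.
set.seed(123)
true_d <- 0.3   # population standardized mean difference (illustrative)
n      <- 30    # per-group sample size (illustrative)

sim_study <- function() {
  g1 <- rnorm(n, mean = true_d)
  g2 <- rnorm(n)
  d  <- (mean(g1) - mean(g2)) / sqrt((var(g1) + var(g2)) / 2)  # Cohen's d
  p  <- t.test(g1, g2, var.equal = TRUE)$p.value
  c(d = d, p = p)
}
studies <- as.data.frame(t(replicate(5000, sim_study())))

mean(studies$d)                   # "publish everything": close to true_d
mean(studies$d[studies$p < .05])  # only significant results: overestimates true_d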