ABSTRACT
Collaborative efforts to directly replicate empirical studies in the medical and social sciences have revealed alarmingly low rates of replicability, a phenomenon dubbed the 'replication crisis'. Poor replicability has spurred cultural changes aimed at improving reliability in these disciplines. Given the absence of equivalent replication projects in ecology and evolutionary biology, two inter-related indicators offer the opportunity to retrospectively assess replicability: publication bias and statistical power. This registered report assesses the prevalence and severity of small-study effects (i.e., smaller studies reporting larger effect sizes) and decline effects (i.e., effect sizes decreasing over time) across ecology and evolutionary biology, using 87 meta-analyses comprising 4,250 primary studies and 17,638 effect sizes. Further, we estimate how publication bias might distort the estimation of effect sizes, statistical power, and errors in magnitude (Type M, or exaggeration ratio) and sign (Type S). We show strong evidence for the pervasiveness of both small-study and decline effects in ecology and evolution. Publication bias was widespread and caused meta-analytic means to be overestimated by (at least) 0.12 standard deviations. It also distorted confidence in meta-analytic results: 66% of initially statistically significant meta-analytic means became non-significant after correcting for publication bias. Ecological and evolutionary studies consistently had low statistical power (15%), with a 4-fold exaggeration of effects on average (Type M error rate = 4.4). Notably, publication bias reduced power from 23% to 15% and increased Type M error rates from 2.7 to 4.4 because it creates a non-random sample of effect-size evidence. Sign errors of effect sizes (Type S error) increased from 5% to 8% because of publication bias. Our research provides clear evidence that many published ecological and evolutionary findings are inflated. Our results highlight the importance of designing high-powered empirical studies (e.g., via collaborative team science), promoting and encouraging replication studies, testing and correcting for publication bias in meta-analyses, and adopting open and transparent research practices, such as (pre)registration, data- and code-sharing, and transparent reporting.
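To make the Type M and Type S quantities above concrete, the following is a minimal illustrative sketch (not code from the report) of how statistical power, the exaggeration ratio, and the sign-error rate can be computed for a hypothetical true effect and standard error, in the spirit of Gelman and Carlin's retrodesign approach. It assumes NumPy and SciPy, and all numerical inputs are invented.

```python
# Illustrative sketch: power, Type M (exaggeration ratio), and Type S (sign error)
# for an assumed true effect and standard error. Hypothetical numbers only.
import numpy as np
from scipy.stats import norm

def retrodesign(true_effect, se, alpha=0.05, n_sims=100_000, seed=1):
    z_crit = norm.ppf(1 - alpha / 2)
    # Power: probability that a study with this SE yields |z| > z_crit.
    power = (1 - norm.cdf(z_crit - true_effect / se)
             + norm.cdf(-z_crit - true_effect / se))
    # Simulate estimates and keep only the "statistically significant" ones.
    rng = np.random.default_rng(seed)
    estimates = rng.normal(true_effect, se, n_sims)
    significant = estimates[np.abs(estimates) > z_crit * se]
    type_m = np.mean(np.abs(significant)) / abs(true_effect)        # exaggeration ratio
    type_s = np.mean(np.sign(significant) != np.sign(true_effect))  # sign-error rate
    return power, type_m, type_s

# Hypothetical example: a small true effect (d = 0.2) estimated with SE = 0.15.
power, type_m, type_s = retrodesign(0.2, 0.15)
print(f"power={power:.2f}, Type M={type_m:.1f}, Type S={type_s:.3f}")
```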
Subject(s)
Biology, Bias, Publication Bias, Reproducibility of Results, Retrospective Studies, Meta-Analysis as Topic
ABSTRACT
We test whether large language models (LLMs) can be used to simulate human participants in social-science studies. To do this, we ran replications of 14 studies from the Many Labs 2 replication project with OpenAI's text-davinci-003 model, colloquially known as GPT-3.5. Based on our pre-registered analyses, we find that among the eight studies we could analyse, our GPT sample replicated 37.5% of the original results and 37.5% of the Many Labs 2 results. However, we were unable to analyse the remaining six studies due to an unexpected phenomenon we call the "correct answer" effect: different runs of GPT-3.5 answered nuanced questions probing political orientation, economic preference, judgement, and moral philosophy with zero or near-zero variation in responses, converging on a supposedly "correct" answer. In one exploratory follow-up study, we found that a "correct answer" was robust to changing the demographic details that preceded the prompt. In another, we found that most, but not all, "correct answers" were robust to changing the order of answer choices. One of our most striking findings occurred in our replication of the Moral Foundations Theory survey, in which GPT-3.5 identified as a political conservative in 99.6% of cases and as a liberal in 99.3% of cases in the reverse-order condition. However, both self-reported 'GPT conservatives' and 'GPT liberals' showed right-leaning moral foundations. Our results cast doubt on the validity of using LLMs as a general replacement for human participants in the social sciences. They also raise concerns that a hypothetical AI-led future may be subject to a diminished diversity of thought.
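As an illustration of how the "correct answer" effect described above can be operationalised, here is a minimal sketch with invented data (not the authors' pipeline) that measures response variation across repeated runs of the same prompt by computing the share of the modal answer for each item; a share near 1.0 indicates near-zero variation.

```python
# Illustrative sketch: quantify near-zero response variation across repeated runs
# by the modal-answer share per survey item. All responses below are hypothetical.
from collections import Counter

def modal_share(answers):
    """Fraction of runs that gave the single most common answer."""
    counts = Counter(answers)
    return counts.most_common(1)[0][1] / len(answers)

# Hypothetical answers from 10 independent runs of the same prompt per item.
runs_by_item = {
    "trolley_dilemma": ["pull", "pull", "pull", "pull", "pull",
                        "pull", "pull", "pull", "pull", "pull"],
    "risk_preference": ["A", "B", "A", "A", "B", "A", "A", "B", "A", "A"],
}
for item, answers in runs_by_item.items():
    print(item, f"modal-answer share = {modal_share(answers):.2f}")
```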
Subject(s)
Language, Humans, Morals, Politics, Social Sciences/methods, Thinking/physiology
ABSTRACT
We describe and demonstrate an empirical strategy useful for discovering and replicating empirical effects in psychological science. The method involves the design of a metastudy, in which many independent experimental variables (each a potential moderator of an empirical effect) are indiscriminately randomized. Radical randomization yields rich datasets that can be used to test the robustness of an empirical claim to some of the vagaries and idiosyncrasies of experimental protocols, and it enhances the generalizability of these claims. The strategy is made feasible by advances in hierarchical Bayesian modeling that allow for the pooling of information across unlike experiments and designs, and it is proposed here as a gold standard for both replication research and exploratory research. The practical feasibility of the strategy is demonstrated with a replication of a study on subliminal priming.
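As an illustration of the kind of partial pooling across unlike experiments that the abstract refers to, the sketch below implements a minimal Gibbs sampler for a normal-normal hierarchical model in Python. It is a toy example under stated assumptions (invented effect estimates and standard errors, flat priors on the group mean and between-experiment spread), not the authors' actual model or code.

```python
# Illustrative sketch: partial pooling of effect estimates y_j (with standard
# errors s_j) via a normal-normal hierarchical model, theta_j ~ N(mu, tau^2),
# sampled with a simple Gibbs sampler. Hypothetical inputs only.
import numpy as np

def hierarchical_gibbs(y, s, n_iter=5000, seed=0):
    rng = np.random.default_rng(seed)
    y, s = np.asarray(y, float), np.asarray(s, float)
    J = len(y)
    mu, tau = y.mean(), y.std() + 1e-6
    draws = {"mu": [], "tau": [], "theta": []}
    for _ in range(n_iter):
        # theta_j | mu, tau, y: precision-weighted compromise of data and group mean.
        prec = 1 / s**2 + 1 / tau**2
        theta_mean = (y / s**2 + mu / tau**2) / prec
        theta = rng.normal(theta_mean, np.sqrt(1 / prec))
        # mu | theta, tau ~ N(mean(theta), tau^2 / J) under a flat prior.
        mu = rng.normal(theta.mean(), tau / np.sqrt(J))
        # tau^2 | theta, mu: scaled inverse chi-square with J - 1 degrees of freedom.
        tau = np.sqrt(np.sum((theta - mu) ** 2) / rng.chisquare(J - 1))
        draws["mu"].append(mu)
        draws["tau"].append(tau)
        draws["theta"].append(theta)
    return {k: np.array(v) for k, v in draws.items()}

# Hypothetical effect estimates from eight heterogeneous sub-experiments.
post = hierarchical_gibbs(y=[0.3, 0.1, -0.05, 0.4, 0.2, 0.0, 0.15, 0.25],
                          s=[0.15, 0.2, 0.1, 0.25, 0.2, 0.15, 0.1, 0.2])
print("posterior mean of overall effect:", post["mu"].mean())
```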
Subject(s)
Biomedical Research/standards, Research Design/standards, Bayes Theorem, Data Interpretation, Statistical, Humans, Random Allocation
ABSTRACT
There is growing awareness across the neuroscience community that the replicability of findings about the relationship between brain activity and cognitive phenomena can be improved by conducting studies with high statistical power that adhere to well-defined and standardised analysis pipelines. Inspired by recent efforts from the psychological sciences, and with the desire to examine some of the foundational findings using electroencephalography (EEG), we have launched #EEGManyLabs, a large-scale international collaborative replication effort. Since its discovery in the early 20th century, EEG has had a profound influence on our understanding of human cognition, but there is limited evidence on the replicability of some of the most highly cited discoveries. After a systematic search and selection process, we have identified 27 of the most influential and continually cited studies in the field. We plan to directly test the replicability of key findings from 20 of these studies in teams of at least three independent laboratories. The design and protocol of each replication effort will be submitted as a Registered Report and peer-reviewed prior to data collection. Prediction markets, open to all EEG researchers, will be used as a forecasting tool to examine which findings the community expects to replicate. This project will update our confidence in some of the most influential EEG findings and generate a large open access database that can be used to inform future research practices. Finally, through this international effort, we hope to create a cultural shift towards inclusive, high-powered multi-laboratory collaborations.
Subject(s)
Electroencephalography, Neurosciences, Cognition, Humans, Reproducibility of Results
ABSTRACT
According to a popular model of self-control, willpower depends on a limited resource that can be depleted when we perform a task demanding self-control. This theory has been put to the test in hundreds of experiments showing that completing a task that demands high self-control usually hinders performance in any secondary task that subsequently taxes self-control. Over the last 5 years, the reliability of the empirical evidence supporting this model has been questioned. In the present study, we reanalysed data from a large-scale study, Many Labs 3, to test whether performing a depleting task has any effect on a secondary task that also relies on self-control. Although we used a large sample of more than 2,000 participants for our analyses, we did not find any significant evidence of ego depletion: persistence on an anagram-solving task (a typical measure of self-control) was not affected by previous completion of a Stroop task (a typical depleting task in this literature). Our results suggest that either ego depletion is not a real effect or, alternatively, persistence in anagram solving may not be an optimal measure of it.
ABSTRACT
According to the facial feedback hypothesis, people's affective responses can be influenced by their own facial expression (e.g., smiling, pouting), even when their expression did not result from their emotional experiences. For example, Strack, Martin, and Stepper (1988) instructed participants to rate the funniness of cartoons using a pen that they held in their mouth. In line with the facial feedback hypothesis, when participants held the pen with their teeth (inducing a "smile"), they rated the cartoons as funnier than when they held the pen with their lips (inducing a "pout"). This seminal study of the facial feedback hypothesis has not been replicated directly. This Registered Replication Report describes the results of 17 independent direct replications of Study 1 from Strack et al. (1988), all of which followed the same vetted protocol. A meta-analysis of these studies examined the difference in funniness ratings between the "smile" and "pout" conditions. The original Strack et al. (1988) study reported a rating difference of 0.82 units on a 10-point Likert scale. Our meta-analysis revealed a rating difference of 0.03 units with a 95% confidence interval ranging from -0.11 to 0.16.
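For readers unfamiliar with how such a pooled rating difference and its 95% confidence interval are obtained, the following is a minimal inverse-variance (fixed-effect) meta-analysis sketch with invented lab-level numbers; it is not the Registered Replication Report's data or analysis code, which used a more complete random-effects approach.

```python
# Illustrative sketch: pool per-lab mean differences by inverse-variance weighting
# and report a 95% confidence interval. All inputs below are hypothetical.
import numpy as np

def fixed_effect_meta(diffs, ses):
    """Inverse-variance weighted pooled estimate with a 95% CI."""
    diffs, ses = np.asarray(diffs, float), np.asarray(ses, float)
    w = 1 / ses**2                      # weight each lab by its precision
    pooled = np.sum(w * diffs) / np.sum(w)
    pooled_se = np.sqrt(1 / np.sum(w))
    return pooled, (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se)

# Hypothetical "smile" minus "pout" funniness-rating differences from a few labs.
est, ci = fixed_effect_meta(diffs=[0.10, -0.05, 0.02, 0.08, -0.12],
                            ses=[0.15, 0.12, 0.18, 0.14, 0.16])
print(f"pooled difference = {est:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```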