Results 1 - 20 of 41
1.
R Soc Open Sci ; 11(7): 240125, 2024 Jul.
Article in English | MEDLINE | ID: mdl-39050728

ABSTRACT

Many-analysts studies explore how well an empirical claim withstands plausible alternative analyses of the same dataset by multiple, independent analysis teams. Conclusions from these studies typically rely on a single outcome metric (e.g. effect size) provided by each analysis team. Although informative about the range of plausible effects in a dataset, a single effect size from each team does not provide a complete, nuanced understanding of how analysis choices are related to the outcome. We used the Delphi consensus technique with input from 37 experts to develop an 18-item subjective evidence evaluation survey (SEES) to evaluate how each analysis team views the methodological appropriateness of the research design and the strength of evidence for the hypothesis. We illustrate the usefulness of the SEES in providing richer evidence assessment with pilot data from a previous many-analysts study.

2.
Proc Natl Acad Sci U S A ; 121(32): e2403490121, 2024 Aug 06.
Article in English | MEDLINE | ID: mdl-39078672

ABSTRACT

A typical empirical study involves choosing a sample, a research design, and an analysis path. Variation in such choices across studies leads to heterogeneity in results that introduces an additional layer of uncertainty, limiting the generalizability of published scientific findings. We provide a framework for studying heterogeneity in the social sciences and divide heterogeneity into population, design, and analytical heterogeneity. Our framework suggests that after accounting for heterogeneity, the probability that the tested hypothesis is true for the average population, design, and analysis path can be much lower than implied by nominal error rates of statistically significant individual studies. We estimate each type of heterogeneity from 70 multilab replication studies, 11 prospective meta-analyses of studies employing different experimental designs, and 5 multianalyst studies. In our data, population heterogeneity tends to be relatively small, whereas design and analytical heterogeneity are large. Our results should, however, be interpreted cautiously due to the limited number of studies and the large uncertainty in the heterogeneity estimates. We discuss several ways to parse and account for heterogeneity in the context of different methodologies.
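The core point, that heterogeneity can sharply limit how well a single significant study speaks for the average population, design, and analysis path, can be illustrated with a minimal simulation. The sketch below is an illustration, not the paper's model or estimates: the average effect, the three heterogeneity standard deviations, and the sampling standard error are all assumed values.

import numpy as np

rng = np.random.default_rng(0)
n_sim = 100_000

mu_avg = 0.10                                          # assumed average true effect across all paths
tau_pop, tau_design, tau_analysis = 0.02, 0.10, 0.10   # assumed heterogeneity SDs
se = 0.05                                              # assumed sampling standard error of one study

# Study-specific true effect = average effect plus independent heterogeneity draws
theta = (mu_avg
         + rng.normal(0, tau_pop, n_sim)
         + rng.normal(0, tau_design, n_sim)
         + rng.normal(0, tau_analysis, n_sim))
estimate = theta + rng.normal(0, se, n_sim)

significant = np.abs(estimate / se) > 1.96
wrong_sign = np.sign(estimate) != np.sign(mu_avg)

print("share of simulated studies that are significant:", significant.mean())
print("share of significant studies pointing the wrong way for the average path:",
      wrong_sign[significant].mean())

Under these assumed numbers, a noticeable fraction of significant studies carries the wrong sign for the average path, which is the sense in which nominal error rates overstate the evidence once heterogeneity is taken into account.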

5.
Proc Natl Acad Sci U S A ; 120(23): e2215572120, 2023 Jun 06.
Article in English | MEDLINE | ID: mdl-37252958

ABSTRACT

Does competition affect moral behavior? This fundamental question has been debated among leading scholars for centuries, and more recently, it has been tested in experimental studies yielding a body of rather inconclusive empirical evidence. A potential source of ambivalent empirical results on the same hypothesis is design heterogeneity, that is, variation in true effect sizes across various reasonable experimental research protocols. To provide further evidence on whether competition affects moral behavior and to examine whether the generalizability of a single experimental study is jeopardized by design heterogeneity, we invited independent research teams to contribute experimental designs to a crowd-sourced project. In a large-scale online data collection, 18,123 experimental participants were randomly allocated to 45 randomly selected experimental designs out of 95 submitted designs. We find a small adverse effect of competition on moral behavior in a meta-analysis of the pooled data. The crowd-sourced design of our study allows for a clean identification and estimation of the variation in effect sizes above and beyond what could be expected due to sampling variance. We find substantial design heterogeneity, estimated to be about 1.6 times as large as the average standard error of the effect size estimates of the 45 research designs, indicating that the informativeness and generalizability of results based on a single experimental design are limited. Drawing strong conclusions about the underlying hypotheses in the presence of substantive design heterogeneity requires moving toward much larger data collections on various experimental designs testing the same hypothesis.
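The comparison between design heterogeneity and the average standard error can be made concrete with a standard random-effects calculation. The sketch below applies a generic DerSimonian-Laird estimate to invented per-design numbers; it is not the paper's data or its exact estimator.

import numpy as np

effects = np.array([-0.05, 0.02, -0.12, 0.08, -0.20, -0.01])   # hypothetical per-design effect estimates
ses     = np.array([ 0.04, 0.05,  0.06, 0.04,  0.05,  0.06])   # hypothetical standard errors

w = 1.0 / ses**2
pooled = np.sum(w * effects) / np.sum(w)            # fixed-effect pooled estimate
Q = np.sum(w * (effects - pooled)**2)               # Cochran's Q
C = np.sum(w) - np.sum(w**2) / np.sum(w)
tau2 = max(0.0, (Q - (len(effects) - 1)) / C)       # DerSimonian-Laird between-design variance
tau = np.sqrt(tau2)

print(f"tau (between-design SD): {tau:.3f}")
print(f"average standard error:  {ses.mean():.3f}")
print(f"ratio tau / average SE:  {tau / ses.mean():.2f}")

A ratio well above 1, as reported in the abstract, means the spread in true effects across designs dominates the sampling noise of any single design.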

6.
Lakartidningen ; 120, 2023 05 15.
Article in Swedish | MEDLINE | ID: mdl-37191395

ABSTRACT

Analysis of research data entails many choices. As a result, a space of different analytical strategies is open to researchers. Different justifiable analyses may not give similar results. The method of multiple analysts is a way to study the analytical flexibility and behaviour of researchers under naturalistic conditions, as part of the field known as metascience. Analytical flexibility and risks of bias can be counteracted by open data sharing, pre-registration of analysis plans, and registration of clinical trials in trial registers. These measures are particularly important for retrospective studies where analytical flexibility can be greatest, although pre-registration is less useful in this context. Synthetic datasets can be an alternative to pre-registration when used to decide what analyses should be conducted on real datasets by independent parties. All these strategies help build trustworthiness in scientific reports, and improve the reliability of research findings.
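The synthetic-dataset strategy mentioned above can be sketched in a few lines. The example below is an illustration, not taken from the article: independently permuting each column preserves every variable's marginal distribution but destroys the associations between variables, so an analysis plan can be written, debugged, and fixed by independent parties before anyone touches the real data.

import numpy as np
import pandas as pd

def make_synthetic(df: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Return a copy of df with each column independently shuffled."""
    rng = np.random.default_rng(seed)
    synthetic = df.copy()
    for col in synthetic.columns:
        synthetic[col] = rng.permutation(synthetic[col].to_numpy())
    return synthetic

# Hypothetical usage: plan and test the analysis code on make_synthetic(real_data)
# before running it, unchanged, on real_data itself.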


Subject(s)
Biomedical Research , Humans , Reproducibility of Results , Retrospective Studies
7.
R Soc Open Sci ; 9(9): 220440, 2022 Sep.
Article in English | MEDLINE | ID: mdl-36177198

ABSTRACT

Many publications on COVID-19 were released on preprint servers such as medRxiv and bioRxiv. It is unknown how reliable these preprints are, and which ones will eventually be published in scientific journals. In this study, we use crowdsourced human forecasts to predict publication outcomes and future citation counts for a sample of 400 preprints with high Altmetric scores. Most of these preprints were published within 1 year of upload on a preprint server (70%), with a considerable fraction (45%) appearing in a high-impact journal with a journal impact factor of at least 10. On average, the preprints received 162 citations within the first year. We found that forecasters can predict whether preprints will be published after 1 year and whether the publishing journal has high impact. Forecasts are also informative with respect to Google Scholar citations within 1 year of upload on a preprint server. For both types of assessment, we found statistically significant positive correlations between forecasts and observed outcomes. While the forecasts can help to provide a preliminary assessment of preprints at a faster pace than traditional peer review, it remains to be investigated whether such an assessment is suited to identifying methodological problems in preprints.

8.
Proc Natl Acad Sci U S A ; 119(30): e2120377119, 2022 Jul 26.
Article in English | MEDLINE | ID: mdl-35858443

ABSTRACT

This initiative examined systematically the extent to which a large set of archival research findings generalizes across contexts. We repeated the key analyses for 29 original strategic management effects in the same context (direct reproduction) as well as in 52 novel time periods and geographies; 45% of the direct reproductions returned results matching the original reports, as did 55% of tests in different spans of years and 40% of tests in novel geographies. Some original findings were associated with multiple new tests. Reproducibility was the best predictor of generalizability: for the findings that proved directly reproducible, 84% emerged in other available time periods and 57% emerged in other geographies. Overall, only limited empirical evidence emerged for context sensitivity. In a forecasting survey, independent scientists were able to anticipate which effects would find support in tests in new samples.

9.
Sci Rep ; 12(1): 7575, 2022 05 09.
Article in English | MEDLINE | ID: mdl-35534489

ABSTRACT

Scientists and policymakers seek to choose effective interventions that promote preventative health measures. We evaluated whether academics, behavioral science practitioners, and laypeople (N = 1034) were able to forecast the effectiveness of seven different messages compared to a baseline message for Republicans and Democrats separately. These messages were designed to nudge mask-wearing attitudes, intentions, and behaviors. When examining predictions across political parties, forecasters overestimated the effects more for Democrats than for Republicans and made more accurate predictions for Republicans than for Democrats. These results are partly driven by a lack of nudge effects on Democrats, as reported in Gelfand et al. (J Exp Soc Psychol, 2021). Academics and practitioners made more accurate predictions than laypeople. Although forecasters' predictions were correlated with the observed effects of the nudge interventions, all groups overestimated the observed results. We discuss potential reasons why the forecasts did not perform better and how more accurate forecasts of behavioral intervention outcomes could help save resources and increase the efficacy of interventions.


Subject(s)
Attitude , Politics , Behavior Therapy
10.
Annu Rev Psychol ; 73: 719-748, 2022 01 04.
Article in English | MEDLINE | ID: mdl-34665669

ABSTRACT

Replication (an important, uncommon, and misunderstood practice) is gaining appreciation in psychology. Achieving replicability is important for making research progress. If findings are not replicable, then prediction and theory development are stifled. If findings are replicable, then interrogation of their meaning and validity can advance knowledge. Assessing replicability can be productive for generating and testing hypotheses by actively confronting current understandings to identify weaknesses and spur innovation. For psychology, the 2010s might be characterized as a decade of active confrontation. Systematic and multi-site replication projects assessed current understandings and observed surprising failures to replicate many published findings. Replication efforts highlighted sociocultural challenges, such as disincentives to conduct replications and a tendency to frame replication as a personal attack rather than a healthy scientific practice, and they raised awareness that replication contributes to self-correction. Nevertheless, innovation in doing and understanding replication and its cousins, reproducibility and robustness, has positioned psychology to improve research practices and accelerate progress.


Subject(s)
Research Design , Humans , Reproducibility of Results
11.
Elife ; 10, 2021 11 09.
Article in English | MEDLINE | ID: mdl-34751133

ABSTRACT

Any large dataset can be analyzed in a number of ways, and it is possible that the use of different analysis strategies will lead to different results and conclusions. One way to assess whether the results obtained depend on the analysis strategy chosen is to employ multiple analysts and leave each of them free to follow their own approach. Here, we present consensus-based guidance for conducting and reporting such multi-analyst studies, and we discuss how broader adoption of the multi-analyst approach has the potential to strengthen the robustness of results and conclusions obtained from analyses of datasets in basic and applied research.


Subject(s)
Consensus , Data Analysis , Datasets as Topic , Research
12.
R Soc Open Sci ; 8(7): 181308, 2021 Jul.
Article in English | MEDLINE | ID: mdl-34295507

ABSTRACT

There is evidence that prediction markets are useful tools to aggregate information on researchers' beliefs about scientific results, including the outcome of replications. In this study, we use prediction markets to forecast the results of novel experimental designs that test established theories. We set up prediction markets for hypotheses tested in the Defense Advanced Research Projects Agency's (DARPA) Next Generation Social Science (NGS2) programme. Researchers were invited to bet on whether 22 hypotheses would be supported or not. We define support as a test result in the same direction as hypothesized, with a Bayes factor of at least 10 (i.e. the observed data are at least 10 times more likely under the tested hypothesis than under the null hypothesis). In addition to betting on this binary outcome, we asked participants to bet on the expected effect size (in Cohen's d) for each hypothesis. Our goal was to recruit at least 50 participants to sign up for these markets. Although we met this goal, only 39 participants ended up actually trading. Participants also completed a survey on both the binary result and the effect size. We find that neither prediction markets nor surveys performed well in predicting outcomes for NGS2.
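The Bayes factor criterion can be made concrete with a rough calculation. The sketch below uses a normal approximation with a zero-centred normal prior on the effect size under the alternative; this is a common textbook approximation chosen purely for illustration, and the prior standard deviation and example numbers are assumptions, not the NGS2 analysis.

from math import sqrt
from scipy.stats import norm

def approx_bf10(d_hat: float, se: float, prior_sd: float = 0.5) -> float:
    """Approximate Bayes factor for H1 (effect ~ Normal(0, prior_sd)) vs the point null H0."""
    like_h1 = norm.pdf(d_hat, loc=0.0, scale=sqrt(se**2 + prior_sd**2))   # marginal likelihood under H1
    like_h0 = norm.pdf(d_hat, loc=0.0, scale=se)                          # likelihood under H0
    return like_h1 / like_h0

# Hypothetical example: observed Cohen's d of 0.40 with standard error 0.10
print(approx_bf10(0.40, 0.10))   # a value of 10 or more (in the hypothesized direction) would count as support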

13.
Cortex ; 144: 213-229, 2021 11.
Article in English | MEDLINE | ID: mdl-33965167

ABSTRACT

There is growing awareness across the neuroscience community that the replicability of findings about the relationship between brain activity and cognitive phenomena can be improved by conducting studies with high statistical power that adhere to well-defined and standardised analysis pipelines. Inspired by recent efforts from the psychological sciences, and with the desire to examine some of the foundational findings using electroencephalography (EEG), we have launched #EEGManyLabs, a large-scale international collaborative replication effort. Since its discovery in the early 20th century, EEG has had a profound influence on our understanding of human cognition, but there is limited evidence on the replicability of some of the most highly cited discoveries. After a systematic search and selection process, we have identified 27 of the most influential and continually cited studies in the field. We plan to directly test the replicability of key findings from 20 of these studies in teams of at least three independent laboratories. The design and protocol of each replication effort will be submitted as a Registered Report and peer-reviewed prior to data collection. Prediction markets, open to all EEG researchers, will be used as a forecasting tool to examine which findings the community expects to replicate. This project will update our confidence in some of the most influential EEG findings and generate a large open access database that can be used to inform future research practices. Finally, through this international effort, we hope to create a cultural shift towards inclusive, high-powered multi-laboratory collaborations.


Subject(s)
Electroencephalography , Neurosciences , Cognition , Humans , Reproducibility of Results
14.
PLoS One ; 16(4): e0248780, 2021.
Article in English | MEDLINE | ID: mdl-33852589

ABSTRACT

The reproducibility of published research has become an important topic in science policy. A number of large-scale replication projects have been conducted to gauge the overall reproducibility in specific academic fields. Here, we present an analysis of data from four studies which sought to forecast the outcomes of replication projects in the social and behavioural sciences, using human experts who participated in prediction markets and answered surveys. Because the number of findings replicated and predicted in each individual study was small, pooling the data offers an opportunity to evaluate hypotheses regarding the performance of prediction markets and surveys at higher power. In total, peer beliefs were elicited for the replication outcomes of 103 published findings. We find there is information within the scientific community about the replicability of scientific findings, and that both surveys and prediction markets can be used to elicit and aggregate this information. Our results show prediction markets can predict the outcomes of direct replications with 73% accuracy (n = 103). Both the prediction market prices and the average survey responses are correlated with outcomes (0.581 and 0.564, respectively; both p < .001). We also found a significant relationship between p-values of the original findings and replication outcomes. The dataset is made available through the R package "pooledmaRket" and can be used to further study community beliefs about replication outcomes as elicited in the surveys and prediction markets.
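The two headline metrics, classification accuracy and the correlation between forecasts and outcomes, are simple to compute. The sketch below uses invented market prices and outcomes rather than the pooledmaRket data, and counts a market price above 0.5 as a prediction that the finding replicates.

import numpy as np

prices   = np.array([0.85, 0.30, 0.60, 0.15, 0.72, 0.45, 0.90, 0.20])   # hypothetical final market prices
outcomes = np.array([1,    0,    1,    0,    1,    0,    1,    1])      # 1 = replicated, 0 = did not replicate

predicted = (prices > 0.5).astype(int)
accuracy = (predicted == outcomes).mean()
correlation = np.corrcoef(prices, outcomes)[0, 1]    # point-biserial correlation of prices with outcomes

print(f"accuracy:    {accuracy:.2f}")
print(f"correlation: {correlation:.3f}")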


Subject(s)
Forecasting/methods , Reproducibility of Results , Research/statistics & numerical data , Humans , Research/trends , Surveys and Questionnaires
15.
R Soc Open Sci ; 7(7): 200566, 2020 Jul.
Article in English | MEDLINE | ID: mdl-32874648

ABSTRACT

The Defense Advanced Research Projects Agency (DARPA) programme 'Systematizing Confidence in Open Research and Evidence' (SCORE) aims to generate confidence scores for a large number of research claims from empirical studies in the social and behavioural sciences. The confidence scores will provide a quantitative assessment of how likely a claim will hold up in an independent replication. To create the scores, we follow earlier approaches and use prediction markets and surveys to forecast replication outcomes. Based on an initial set of forecasts for the overall replication rate in SCORE and its dependence on the academic discipline and the time of publication, we show that participants expect replication rates to increase over time. Moreover, they expect replication rates to differ between fields, with the highest replication rate in economics (average survey response 58%), and the lowest in psychology and in education (average survey response of 42% for both fields). These results reveal insights into the academic community's views of the replication crisis, including for research fields for which no large-scale replication studies have been undertaken yet.

16.
Nature ; 582(7810): 84-88, 2020 06.
Article in English | MEDLINE | ID: mdl-32483374

ABSTRACT

Data analysis workflows in many scientific domains have become increasingly complex and flexible. Here we assess the effect of this flexibility on the results of functional magnetic resonance imaging by asking 70 independent teams to analyse the same dataset, testing the same 9 ex-ante hypotheses [1]. The flexibility of analytical approaches is exemplified by the fact that no two teams chose identical workflows to analyse the data. This flexibility resulted in sizeable variation in the results of hypothesis tests, even for teams whose statistical maps were highly correlated at intermediate stages of the analysis pipeline. Variation in reported results was related to several aspects of analysis methodology. Notably, a meta-analytical approach that aggregated information across teams yielded a significant consensus in activated regions. Furthermore, prediction markets of researchers in the field revealed an overestimation of the likelihood of significant findings, even by researchers with direct knowledge of the dataset [2-5]. Our findings show that analytical flexibility can have substantial effects on scientific conclusions, and identify factors that may be related to variability in the analysis of functional magnetic resonance imaging. The results emphasize the importance of validating and sharing complex analysis workflows, and demonstrate the need for performing and reporting multiple analyses of the same data. Potential approaches that could be used to mitigate issues related to analytical variability are discussed.
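The idea of aggregating evidence across teams can be sketched in toy form. The paper's meta-analysis operates voxel-wise on the teams' whole statistical maps; the example below only combines hypothetical team-level z-values for a single hypothesis with an equal-weight Stouffer combination, and all numbers are invented.

import numpy as np
from scipy.stats import norm

team_z = np.array([2.1, 0.8, 3.0, 1.5, -0.2, 2.4, 1.9])   # hypothetical z-values, one per analysis team
stouffer_z = team_z.sum() / np.sqrt(len(team_z))           # equal-weight Stouffer combination
p_value = norm.sf(stouffer_z)                              # one-sided consensus p-value

# NB: real teams analysed the same data, so their results are correlated;
# the paper's aggregation accounts for this, but this toy sketch does not.
print(f"consensus z = {stouffer_z:.2f}, one-sided p = {p_value:.4f}")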


Subject(s)
Data Analysis , Data Science/methods , Data Science/standards , Datasets as Topic , Functional Neuroimaging , Magnetic Resonance Imaging , Research Personnel/organization & administration , Brain/diagnostic imaging , Brain/physiology , Datasets as Topic/statistics & numerical data , Female , Humans , Logistic Models , Male , Meta-Analysis as Topic , Models, Neurological , Reproducibility of Results , Research Personnel/standards , Software
17.
Psychol Bull ; 146(5): 451-479, 2020 05.
Article in English | MEDLINE | ID: mdl-31944796

ABSTRACT

To what extent are research results influenced by subjective decisions that scientists make as they design studies? Fifteen research teams independently designed studies to answer five original research questions related to moral judgments, negotiations, and implicit cognition. Participants from 2 separate large samples (total N > 15,000) were then randomly assigned to complete 1 version of each study. Effect sizes varied dramatically across different sets of materials designed to test the same hypothesis: Materials from different teams rendered statistically significant effects in opposite directions for 4 of 5 hypotheses, with the narrowest range in estimates being d = -0.37 to +0.26. Meta-analysis and a Bayesian perspective on the results revealed overall support for 2 hypotheses and a lack of support for 3 hypotheses. Overall, practically none of the variability in effect sizes was attributable to the skill of the research team in designing materials, whereas considerable variability was attributable to the hypothesis being tested. In a forecasting survey, predictions of other scientists were significantly correlated with study results, both across and within hypotheses. Crowdsourced testing of research hypotheses helps reveal the true consistency of empirical support for a scientific claim. (PsycInfo Database Record (c) 2020 APA, all rights reserved).


Subject(s)
Crowdsourcing , Psychology/methods , Research Design , Adult , Humans , Random Allocation
18.
PLoS One ; 14(12): e0225826, 2019.
Article in English | MEDLINE | ID: mdl-31805105

ABSTRACT

We measure how accurately replication of experimental results can be predicted by black-box statistical models. With data from four large-scale replication projects in experimental psychology and economics, and techniques from machine learning, we train predictive models and study which variables drive predictable replication. The models predict binary replication with a cross-validated accuracy rate of 70% (AUC of 0.77) and estimates of relative effect sizes with a Spearman ρ of 0.38. The accuracy level is similar to market-aggregated beliefs of peer scientists [1, 2]. The predictive power is validated in a pre-registered out-of-sample test of the outcome of [3], where 71% (AUC of 0.73) of replications are predicted correctly and effect size correlations amount to ρ = 0.25. Basic features, such as the sample and effect sizes in original papers and whether reported effects are single-variable main effects or two-variable interactions, are predictive of successful replication. The models presented in this paper are simple tools to produce cheap, prognostic replicability metrics. These models could be useful in institutionalizing the process of evaluating new findings and guiding resources to those direct replications that are likely to be most informative.
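The modelling pipeline described above follows a standard supervised-learning pattern. The sketch below is not the authors' code: it fits a cross-validated logistic regression on simple paper-level features (original sample size, effect size, and an interaction indicator) using entirely made-up data, just to show the shape of the approach and how a cross-validated AUC is obtained.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 200
sample_size = rng.integers(20, 400, n)
effect_size = rng.uniform(0.05, 0.8, n)
is_interaction = rng.integers(0, 2, n)
X = np.column_stack([np.log(sample_size), effect_size, is_interaction])

# Fake ground truth: larger original samples and effects, and main effects rather
# than interactions, replicate more often in this simulated world.
logit = -2.0 + 0.4 * np.log(sample_size) + 2.0 * effect_size - 0.8 * is_interaction
replicated = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))

probs = cross_val_predict(LogisticRegression(max_iter=1000), X, replicated,
                          cv=10, method="predict_proba")[:, 1]
print("cross-validated AUC:", round(roc_auc_score(replicated, probs), 2))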


Subject(s)
Laboratories , Research , Social Sciences , Algorithms , Models, Statistical , ROC Curve , Regression Analysis , Reproducibility of Results
19.
Sci Data ; 6(1): 106, 2019 07 01.
Article in English | MEDLINE | ID: mdl-31263104

ABSTRACT

There is an ongoing debate about the replicability of neuroimaging research. It was suggested that one of the main reasons for the high rate of false positive results is the many degrees of freedom researchers have during data analysis. In the Neuroimaging Analysis Replication and Prediction Study (NARPS), we aim to provide the first scientific evidence on the variability of results across analysis teams in neuroscience. We collected fMRI data from 108 participants during two versions of the mixed gambles task, which is often used to study decision-making under risk. For each participant, the dataset includes an anatomical (T1 weighted) scan and fMRI as well as behavioral data from four runs of the task. The dataset is shared through OpenNeuro and is formatted according to the Brain Imaging Data Structure (BIDS) standard. Data pre-processed with fMRIprep and quality control reports are also publicly shared. This dataset can be used to study decision-making under risk and to test replicability and interpretability of previous results in the field.
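Because the dataset is shared in BIDS format, it can be queried programmatically once downloaded from OpenNeuro. The sketch below uses the pybids package; the local path and subject label are placeholders rather than values taken from the dataset description.

from bids import BIDSLayout

layout = BIDSLayout("/data/narps_bids")          # placeholder path to the downloaded dataset
print(len(layout.get_subjects()), "subjects found")

# Anatomical (T1-weighted) scan and functional runs for one subject (labels are placeholders)
t1 = layout.get(subject="001", suffix="T1w", extension=".nii.gz", return_type="filename")
bold = layout.get(subject="001", suffix="bold", extension=".nii.gz", return_type="filename")
print(t1)
print(len(bold), "functional (BOLD) runs")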


Subject(s)
Brain/diagnostic imaging , Neuroimaging , Brain/physiology , Brain Mapping , Humans , Image Processing, Computer-Assisted/methods , Magnetic Resonance Imaging/methods , Predictive Value of Tests
20.
J Econ Sci Assoc ; 5(2): 149-169, 2019.
Article in English | MEDLINE | ID: mdl-31894199

ABSTRACT

Many studies report on the association between 2D:4D, a putative marker for prenatal testosterone exposure, and economic preferences. However, most of these studies have limited sample sizes and test multiple hypotheses (without preregistration). In this study, we mainly replicate the common specifications found in the literature for the associations between the 2D:4D ratio and risk taking, willingness to compete, and dictator game giving, each analysed separately. In a sample of 330 women, we find no robust associations between any of these economic preferences and 2D:4D. We find no evidence of a statistically significant relation for 16 of the 18 regressions we run. The two regression specifications that are statistically significant have not previously been reported, and the associations are not in the expected direction; they are therefore unlikely to represent real effects.
