Search | VHL Regional Portal

1.

Validating a forced-choice method for eliciting quality-of-reasoning judgments.

Marcoci, Alexandru; Webb, Margaret E; Rowe, Luke; Barnett, Ashley; Primoratz, Tamar; Kruger, Ariel; Karvetski, Christopher W; Stone, Benjamin; Diamond, Michael L; Saletta, Morgan; van Gelder, Tim; Tetlock, Philip E; Dennis, Simon.

Behav Res Methods ; 2023 Oct 13.

Article in English | MEDLINE | ID: mdl-37833511

ABSTRACT

In this paper we investigate the criterion validity of forced-choice comparisons of the quality of written arguments with normative solutions. Across two studies, novices and experts assessing quality of reasoning through a forced-choice design were both able to choose arguments supporting more accurate solutions: 62.2% (SE = 1%) of the time for novices and 74.4% (SE = 1%) for experts. Both groups were also able to choose arguments produced by larger teams, up to 82% of the time for novices and 85% for experts. Inter-rater reliability was high, namely 70.58% (95% CI = 1.18) agreement for novices and 80.98% (95% CI = 2.26) for experts. We also explored two methods for increasing efficiency. We found that the number of comparative judgments needed could be substantially reduced with little accuracy loss by leveraging transitivity and producing quality-of-reasoning assessments using an AVL tree method. Moreover, a regression model trained to predict scores based on automatically derived linguistic features of participants' judgments achieved a high correlation with the objective accuracy scores of the arguments in our dataset. Despite the inherent subjectivity involved in evaluating differing quality of reasoning, the forced-choice paradigm allows even novice raters to perform beyond chance and can provide a valid, reliable, and efficient method for producing quality-of-reasoning assessments at scale.
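The efficiency claim in this abstract rests on transitivity: if A is judged better than B and B better than C, the A versus C comparison need not be collected. The sketch below illustrates that idea only. It uses binary insertion into a sorted list as a simpler stand-in for a balanced tree such as an AVL tree, and it is not the authors' implementation. The function names, the argument IDs, and the quality scores are all hypothetical, and the judge is assumed to be noiseless.

```python
from itertools import combinations


def rank_by_binary_insertion(items, prefer):
    """Order items from worst to best using pairwise forced choices.

    prefer(a, b) must return True when a is judged better than b.
    Returns (ranking, number_of_comparisons_used).
    """
    ranking = []
    comparisons = 0
    for item in items:
        lo, hi = 0, len(ranking)
        # Binary search for the insertion point; transitivity lets each
        # judgment rule out half of the remaining candidates.
        while lo < hi:
            mid = (lo + hi) // 2
            comparisons += 1
            if prefer(item, ranking[mid]):
                lo = mid + 1  # item is better: search the upper half
            else:
                hi = mid      # item is not better: search the lower half
        ranking.insert(lo, item)
    return ranking, comparisons


# Hypothetical data: eight argument IDs with made-up quality scores.
quality = {"A": 0.2, "B": 0.9, "C": 0.5, "D": 0.7,
           "E": 0.1, "F": 0.4, "G": 0.8, "H": 0.3}


def noiseless_judge(a, b):
    """Assumed rater that always prefers the higher-quality argument."""
    return quality[a] > quality[b]


ranking, used = rank_by_binary_insertion(list(quality), noiseless_judge)
all_pairs = len(list(combinations(quality, 2)))
print(ranking)
print(f"{used} comparisons used, versus {all_pairs} for every pair")
```

With real raters the judgments are noisy, so a practical version would need repeated or aggregated judgments at each step; that trade-off is outside what this sketch shows.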

2.

Big STEM collaborations should include humanities and social science.

Marcoci, Alexandru; Thresher, Ann C; Martens, Niels C M; Galison, Peter; Doeleman, Sheperd S; Johnson, Michael D.

Nat Hum Behav ; 7(8): 1229-1230, 2023 08.

Article in English | MEDLINE | ID: mdl-37563305

Subject(s)

Humanities , Social Sciences , Humans

3.

Predicting and reasoning about replicability using structured groups.

Wintle, Bonnie C; Smith, Eden T; Bush, Martin; Mody, Fallon; Wilkinson, David P; Hanea, Anca M; Marcoci, Alexandru; Fraser, Hannah; Hemming, Victoria; Thorn, Felix Singleton; McBride, Marissa F; Gould, Elliot; Head, Andrew; Hamilton, Daniel G; Kambouris, Steven; Rumpff, Libby; Hoekstra, Rink; Burgman, Mark A; Fidler, Fiona.

R Soc Open Sci ; 10(6): 221553, 2023 Jun.

Article in English | MEDLINE | ID: mdl-37293358

ABSTRACT

This paper explores judgements about the replicability of social and behavioural sciences research and what drives those judgements. Using a mixed methods approach, it draws on qualitative and quantitative data elicited from groups using a structured approach called the IDEA protocol ('investigate', 'discuss', 'estimate' and 'aggregate'). Five groups of five people with relevant domain expertise evaluated 25 research claims that were subject to at least one replication study. Participants assessed the probability that each of the 25 research claims would replicate (i.e. that a replication study would find a statistically significant result in the same direction as the original study) and described the reasoning behind those judgements. We quantitatively analysed possible correlates of predictive accuracy, including self-rated expertise and updating of judgements after feedback and discussion. We qualitatively analysed the reasoning data to explore the cues, heuristics and patterns of reasoning used by participants. Participants achieved 84% classification accuracy in predicting replicability. Those who engaged in a greater breadth of reasoning provided more accurate replicability judgements. Some reasons were more commonly invoked by more accurate participants, such as 'effect size' and 'reputation' (e.g. of the field of research). There was also some evidence of a relationship between statistical literacy and accuracy.

4.

Predicting reliability through structured expert elicitation with the repliCATS (Collaborative Assessments for Trustworthy Science) process.

Fraser, Hannah; Bush, Martin; Wintle, Bonnie C; Mody, Fallon; Smith, Eden T; Hanea, Anca M; Gould, Elliot; Hemming, Victoria; Hamilton, Daniel G; Rumpff, Libby; Wilkinson, David P; Pearson, Ross; Singleton Thorn, Felix; Ashton, Raquel; Willcox, Aaron; Gray, Charles T; Head, Andrew; Ross, Melissa; Groenewegen, Rebecca; Marcoci, Alexandru; Vercammen, Ans; Parker, Timothy H; Hoekstra, Rink; Nakagawa, Shinichi; Mandel, David R; van Ravenzwaaij, Don; McBride, Marissa; Sinnott, Richard O; Vesk, Peter; Burgman, Mark; Fidler, Fiona.

PLoS One ; 18(1): e0274429, 2023.

Article in English | MEDLINE | ID: mdl-36701303

ABSTRACT

As replications of individual studies are resource intensive, techniques for predicting the replicability are required. We introduce the repliCATS (Collaborative Assessments for Trustworthy Science) process, a new method for eliciting expert predictions about the replicability of research. This process is a structured expert elicitation approach based on a modified Delphi technique applied to the evaluation of research claims in social and behavioural sciences. The utility of processes to predict replicability is their capacity to test scientific claims without the costs of full replication. Experimental data supports the validity of this process, with a validation study producing a classification accuracy of 84% and an Area Under the Curve of 0.94, meeting or exceeding the accuracy of other techniques used to predict replicability. The repliCATS process provides other benefits. It is highly scalable, able to be deployed for both rapid assessment of small numbers of claims, and assessment of high volumes of claims over an extended period through an online elicitation platform, having been used to assess 3000 research claims over an 18 month period. It is available to be implemented in a range of ways and we describe one such implementation. An important advantage of the repliCATS process is that it collects qualitative data that has the potential to provide insight in understanding the limits of generalizability of scientific claims. The primary limitation of the repliCATS process is its reliance on human-derived predictions with consequent costs in terms of participant fatigue although careful design can minimise these costs. The repliCATS process has potential applications in alternative peer review and in the allocation of effort for replication studies.

Subject(s)

Behavioral Sciences , Data Accuracy , Humans , Reproducibility of Results , Costs and Cost Analysis , Peer Review

5.

Reimagining peer review as an expert elicitation process.

Marcoci, Alexandru; Vercammen, Ans; Bush, Martin; Hamilton, Daniel G; Hanea, Anca; Hemming, Victoria; Wintle, Bonnie C; Burgman, Mark; Fidler, Fiona.

BMC Res Notes ; 15(1): 127, 2022 Apr 05.

Article in English | MEDLINE | ID: mdl-35382867

ABSTRACT

Journal peer review regulates the flow of ideas through an academic discipline and thus has the power to shape what a research community knows, actively investigates, and recommends to policymakers and the wider public. We might assume that editors can identify the 'best' experts and rely on them for peer review. But decades of research on both expert decision-making and peer review suggest they cannot. In the absence of a clear criterion for demarcating reliable, insightful, and accurate expert assessors of research quality, the best safeguard against unwanted biases and uneven power distributions is to introduce greater transparency and structure into the process. This paper argues that peer review would therefore benefit from applying a series of evidence-based recommendations from the empirical literature on structured expert elicitation. We highlight individual and group characteristics that contribute to higher quality judgements, and elements of elicitation protocols that reduce bias, promote constructive discussion, and enable opinions to be objectively and transparently aggregated.

Subject(s)

Peer Review

6.

Pre-screening workers to overcome bias amplification in online labour markets.

Vercammen, Ans; Marcoci, Alexandru; Burgman, Mark.

PLoS One ; 16(3): e0249051, 2021.

Article in English | MEDLINE | ID: mdl-33755712

ABSTRACT

Groups have access to more diverse information and typically outperform individuals on problem solving tasks. Crowdsolving utilises this principle to generate novel and/or superior solutions to intellective tasks by pooling the inputs from a distributed online crowd. However, it is unclear whether this particular instance of "wisdom of the crowd" can overcome the influence of potent cognitive biases that habitually lead individuals to commit reasoning errors. We empirically test the prevalence of cognitive bias on a popular crowdsourcing platform, examining susceptibility to bias of online panels at the individual and aggregate levels. We then investigate the use of the Cognitive Reflection Test, notable for its predictive validity for both susceptibility to cognitive biases in test settings and real-life reasoning, as a screening tool to improve collective performance. We find that systematic biases in crowdsourced answers are not as prevalent as anticipated, but when they occur, biases are amplified with increasing group size, as predicted by the Condorcet Jury Theorem. The results further suggest that pre-screening individuals with the Cognitive Reflection Test can substantially enhance collective judgement and improve crowdsolving performance.
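The amplification effect attributed here to the Condorcet Jury Theorem can be made concrete with a short calculation. The sketch below assumes the theorem's textbook setting, which the abstract does not spell out: an odd number of voters who judge independently and share the same probability `p` of answering correctly. The function name and the example values of `p` are illustrative assumptions, not figures from the study.

```python
from math import comb


def majority_correct_probability(p, n):
    """Probability that a simple majority of n independent voters is correct.

    p is each voter's probability of being correct; n must be odd so
    that ties cannot occur.
    """
    if n % 2 == 0:
        raise ValueError("n must be odd to avoid ties")
    # Sum the binomial probabilities of k correct votes for every k
    # that constitutes a majority.
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


# Illustrative values only: p = 0.6 stands for an item most individuals
# get right, p = 0.4 for an item where a bias misleads most individuals.
for p in (0.6, 0.4):
    for n in (1, 3, 11, 51):
        print(f"p={p}, n={n}: {majority_correct_probability(p, n):.3f}")
```

When `p` is above 0.5 the majority becomes more reliable as the group grows; when a bias pushes `p` below 0.5 the same mechanism makes the majority less reliable, which is the amplification the abstract describes.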

Subject(s)

Job Application , Problem Solving , Adult , Bias , Crowdsourcing , Female , Humans , Judgment , Male , Middle Aged , Neuropsychological Tests , Surveys and Questionnaires

7.

Judgement aggregation in scientific collaborations: The case for waiving expertise.

Marcoci, Alexandru; Nguyen, James.

Stud Hist Philos Sci ; 84: 66-74, 2020 12.

Article in English | MEDLINE | ID: mdl-33218467

ABSTRACT

The fragmentation of academic disciplines forces individuals to specialise. In doing so, they become experts over their narrow area of research. However, ambitious scientific projects, such as the search for gravitational waves, require them to come together and collaborate across disciplinary borders. How should scientists with expertise in different disciplines treat each others' expert claims? An intuitive answer is that the collaboration should defer to the opinions of experts. In this paper we show that under certain seemingly innocuous assumptions, this intuitive answer gives rise to an impossibility result when it comes to aggregating the beliefs of experts to deliver the beliefs of a collaboration as a whole. We then argue that when experts' beliefs come into conflict, they should waive their expert status.

Subject(s)

Judgment , Humans

8.

Better Together: Reliable Application of the Post-9/11 and Post-Iraq US Intelligence Tradecraft Standards Requires Collective Analysis.

Marcoci, Alexandru; Burgman, Mark; Kruger, Ariel; Silver, Elizabeth; McBride, Marissa; Thorn, Felix Singleton; Fraser, Hannah; Wintle, Bonnie C; Fidler, Fiona; Vercammen, Ans.

Front Psychol ; 9: 2634, 2018.

Article in English | MEDLINE | ID: mdl-30666222

ABSTRACT

Background: The events of 9/11 and the October 2002 National Intelligence Estimate on Iraq's Continuing Programs for Weapons of Mass Destruction precipitated fundamental changes within the United States Intelligence Community. As part of the reform, analytic tradecraft standards were revised and codified into a policy document - Intelligence Community Directive (ICD) 203 - and an analytic ombudsman was appointed in the newly created Office for the Director of National Intelligence to ensure compliance across the intelligence community. In this paper we investigate the untested assumption that the ICD203 criteria can facilitate reliable evaluations of analytic products. Methods: Fifteen independent raters used a rubric based on the ICD203 criteria to assess the quality of reasoning of 64 analytical reports generated in response to hypothetical intelligence problems. We calculated the intra-class correlation coefficients for single and group-aggregated assessments. Results: Despite general training and rater calibration, the reliability of individual assessments was poor. However, aggregate ratings showed good to excellent reliability. Conclusion: Given that real problems will be more difficult and complex than our hypothetical case studies, we advise that groups of at least three raters are required to obtain reliable quality control procedures for intelligence products. Our study sets limits on assessment reliability and provides a basis for further evaluation of the predictive validity of intelligence reports generated in compliance with the tradecraft standards.
