Assessing the Risk of Bias in Randomized Clinical Trials With Large Language Models.
Lai, Honghao; Ge, Long; Sun, Mingyao; Pan, Bei; Huang, Jiajie; Hou, Liangying; Yang, Qiuyu; Liu, Jiayi; Liu, Jianing; Ye, Ziying; Xia, Danni; Zhao, Weilong; Wang, Xiaoman; Liu, Ming; Talukdar, Jhalok Ronjan; Tian, Jinhui; Yang, Kehu; Estill, Janne.
Affiliation
  • Lai H; Department of Health Policy and Management, School of Public Health, Lanzhou University, Lanzhou, China.
  • Ge L; Evidence-Based Social Science Research Center, School of Public Health, Lanzhou University, Lanzhou, China.
  • Sun M; Department of Health Policy and Management, School of Public Health, Lanzhou University, Lanzhou, China.
  • Pan B; Evidence-Based Social Science Research Center, School of Public Health, Lanzhou University, Lanzhou, China.
  • Huang J; Key Laboratory of Evidence Based Medicine and Knowledge Translation of Gansu Province, Lanzhou, China.
  • Hou L; Evidence-Based Nursing Center, School of Nursing, Lanzhou University, Lanzhou, China.
  • Yang Q; Evidence-Based Medicine Center, School of Basic Medical Sciences, Lanzhou University, Lanzhou, China.
  • Liu J; College of Nursing, Gansu University of Chinese Medicine, Lanzhou, China.
  • Liu J; Evidence-Based Medicine Center, School of Basic Medical Sciences, Lanzhou University, Lanzhou, China.
  • Ye Z; Department of Health Research Methods, Evidence, and Impact, McMaster University, Ontario, Canada.
  • Xia D; Department of Health Policy and Management, School of Public Health, Lanzhou University, Lanzhou, China.
  • Zhao W; Evidence-Based Social Science Research Center, School of Public Health, Lanzhou University, Lanzhou, China.
  • Wang X; Department of Health Policy and Management, School of Public Health, Lanzhou University, Lanzhou, China.
  • Liu M; Evidence-Based Social Science Research Center, School of Public Health, Lanzhou University, Lanzhou, China.
  • Talukdar JR; College of Nursing, Gansu University of Chinese Medicine, Lanzhou, China.
  • Tian J; Department of Health Policy and Management, School of Public Health, Lanzhou University, Lanzhou, China.
  • Yang K; Evidence-Based Social Science Research Center, School of Public Health, Lanzhou University, Lanzhou, China.
  • Estill J; Department of Health Policy and Management, School of Public Health, Lanzhou University, Lanzhou, China.
JAMA Netw Open; 7(5): e2412687, 2024 May 01.
Article in En | MEDLINE | ID: mdl-38776081
ABSTRACT
Importance:

Large language models (LLMs) may facilitate the labor-intensive process of systematic reviews. However, the exact methods and reliability remain uncertain.

Objective:

To explore the feasibility and reliability of using LLMs to assess risk of bias (ROB) in randomized clinical trials (RCTs).

Design, Setting, and Participants:

A survey study was conducted between August 10, 2023, and October 30, 2023. Thirty RCTs were selected from published systematic reviews.

Main Outcomes and Measures:

A structured prompt was developed to guide ChatGPT (LLM 1) and Claude (LLM 2) in assessing the ROB in these RCTs using a modified version of the Cochrane ROB tool developed by the CLARITY group at McMaster University. Each RCT was assessed twice by both models, and the results were documented. The results were compared with an assessment by 3 experts, which was considered a criterion standard. Correct assessment rates, sensitivity, specificity, and F1 scores were calculated to reflect accuracy, both overall and for each domain of the Cochrane ROB tool; consistent assessment rates and Cohen κ were calculated to gauge consistency; and assessment time was calculated to measure efficiency. Performance between the 2 models was compared using risk differences.
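The accuracy and consistency metrics named above can be illustrated with a minimal Python sketch. This is not the authors' code; it assumes each ROB domain judgment is reduced to a binary label (1 = high risk, treated as the positive class), with the expert assessment as the gold standard and the two repeated LLM runs compared for agreement:

```python
# Toy sketch (not from the article): accuracy metrics for one ROB domain,
# with 1 = "high risk" as the positive class.

def confusion(gold, pred):
    """Count true/false positives and negatives against the gold standard."""
    tp = sum(g == 1 and p == 1 for g, p in zip(gold, pred))
    tn = sum(g == 0 and p == 0 for g, p in zip(gold, pred))
    fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))
    fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))
    return tp, tn, fp, fn

def metrics(gold, pred):
    """Correct assessment rate, sensitivity, specificity, and F1 score."""
    tp, tn, fp, fn = confusion(gold, pred)
    accuracy = (tp + tn) / len(gold)          # correct assessment rate
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return accuracy, sensitivity, specificity, f1

def cohen_kappa(a, b):
    """Chance-corrected agreement between two repeated binary assessments."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n        # observed agreement
    pa1, pb1 = sum(a) / n, sum(b) / n
    pe = pa1 * pb1 + (1 - pa1) * (1 - pb1)            # chance agreement
    return (po - pe) / (1 - pe)
```

In practice the study computed these per domain of the Cochrane ROB tool as well as overall, so `gold` and `pred` would be the vectors of judgments for all 30 RCTs within one domain.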

Results:

Both models demonstrated high correct assessment rates. LLM 1 reached a mean correct assessment rate of 84.5% (95% CI, 81.5%-87.3%), and LLM 2 reached a significantly higher rate of 89.5% (95% CI, 87.0%-91.8%); the risk difference between the 2 models was 0.05 (95% CI, 0.01-0.09). In most domains, domain-specific correct rates were around 80% to 90%; however, sensitivity below 0.80 was observed in domains 1 (random sequence generation), 2 (allocation concealment), and 6 (other concerns). Domains 4 (missing outcome data), 5 (selective outcome reporting), and 6 had F1 scores below 0.50. The consistent rates between the 2 assessments were 84.0% for LLM 1 and 87.3% for LLM 2. Cohen κ exceeded 0.80 in 7 domains for LLM 1 and in 8 domains for LLM 2. The mean (SD) assessment time was 77 (16) seconds for LLM 1 and 53 (12) seconds for LLM 2.
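The reported risk difference between the two models' correct assessment rates can be sketched with a standard Wald 95% confidence interval. The counts below are made-up denominators chosen only to reproduce the reported proportions (84.5% vs 89.5%), not the study's actual data:

```python
import math

def risk_difference_ci(x1, n1, x2, n2, z=1.96):
    """Wald 95% CI for the difference of two proportions (p2 - p1)."""
    p1, p2 = x1 / n1, x2 / n2
    rd = p2 - p1
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return rd, rd - z * se, rd + z * se

# Hypothetical counts matching the abstract's proportions:
rd, lo, hi = risk_difference_ci(845, 1000, 895, 1000)
```

With these illustrative counts the point estimate is 0.05, matching the abstract; the exact interval bounds depend on the true denominators and CI method used by the authors.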

Conclusions:

In this survey study of applying LLMs for ROB assessment, LLM 1 and LLM 2 demonstrated substantial accuracy and consistency in evaluating RCTs, suggesting their potential as supportive tools in systematic review processes.
Full text: 1 Collection: 01-internacional Database: MEDLINE Main subject: Bias / Randomized Controlled Trials as Topic Limits: Humans Language: En Journal: JAMA Netw Open Year: 2024 Document type: Article Affiliation country: China Country of publication: United States