Search | VHL Regional Portal

Sensitivity and Specificity of Using GPT-3.5 Turbo Models for Title and Abstract Screening in Systematic Reviews and Meta-analyses.

Tran, Viet-Thi; Gartlehner, Gerald; Yaacoub, Sally; Boutron, Isabelle; Schwingshackl, Lukas; Stadelmaier, Julia; Sommer, Isolde; Alebouyeh, Farzaneh; Afach, Sivem; Meerpohl, Joerg; Ravaud, Philippe.

Ann Intern Med ; 177(6): 791-799, 2024 Jun.

Article in English | MEDLINE | ID: mdl-38768452

ABSTRACT

BACKGROUND: Systematic reviews are performed manually despite the exponential growth of scientific literature. OBJECTIVE: To investigate the sensitivity and specificity of GPT-3.5 Turbo, from OpenAI, as a single reviewer, for title and abstract screening in systematic reviews. DESIGN: Diagnostic test accuracy study. SETTING: Unannotated bibliographic databases from 5 systematic reviews representing 22 665 citations. PARTICIPANTS: None. MEASUREMENTS: A generic prompt framework to instruct GPT to perform title and abstract screening was designed. The output of the model was compared with decisions from authors under 2 rules. The first rule balanced sensitivity and specificity, for example, to act as a second reviewer. The second rule optimized sensitivity, for example, to reduce the number of citations to be manually screened. RESULTS: Under the balanced rule, sensitivities ranged from 81.1% to 96.5% and specificities ranged from 25.8% to 80.4%. Across all reviews, GPT identified 7 of 708 citations (1%) missed by humans that should have been included after full-text screening at the cost of 10 279 of 22 665 false-positive recommendations (45.3%) that would require reconciliation during the screening process. Under the sensitive rule, sensitivities ranged from 94.6% to 99.8% and specificities ranged from 2.2% to 46.6%. Limiting manual screening to citations not ruled out by GPT could reduce the number of citations to screen from 127 of 6334 (2%) to 1851 of 4077 (45.4%), at the cost of missing from 0 to 1 of 26 citations (3.8%) at the full-text level. LIMITATIONS: Time needed to fine-tune prompt. Retrospective nature of the study, convenient sample of 5 systematic reviews, and GPT performance sensitive to prompt development and time. CONCLUSION: The GPT-3.5 Turbo model may be used as a second reviewer for title and abstract screening, at the cost of additional work to reconcile added false positives. It also showed potential to reduce the number of citations before screening by humans, at the cost of missing some citations at the full-text level. PRIMARY FUNDING SOURCE: None.
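
The sensitivity and specificity figures in this abstract follow the usual diagnostic-accuracy definitions, with the human authors' decisions as the reference standard. A minimal sketch of that arithmetic is given here; the function name `screening_accuracy` and the counts passed to it are hypothetical illustrations and are not taken from the study.

```python
def screening_accuracy(tp: int, fn: int, tn: int, fp: int) -> tuple[float, float]:
    """Return (sensitivity, specificity) for a screening tool.

    tp: citations included by both the model and the reference standard
    fn: citations included by the reference standard but excluded by the model
    tn: citations excluded by both
    fp: citations excluded by the reference standard but included by the model
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative reference decision")
    sensitivity = tp / (tp + fn)  # share of truly relevant citations the model kept
    specificity = tn / (tn + fp)  # share of truly irrelevant citations the model ruled out
    return sensitivity, specificity


# Hypothetical counts, chosen only to illustrate the calculation.
sens, spec = screening_accuracy(tp=90, fn=10, tn=600, fp=300)
print(f"sensitivity={sens:.1%}, specificity={spec:.1%}")
```

A rule tuned for high sensitivity pushes `fn` down at the price of a larger `fp`, which is the trade-off the two rules in the abstract describe.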

Subject(s)

Meta-Analysis as Topic , Sensitivity and Specificity , Humans , Abstracting and Indexing , Review Literature as Topic , Systematic Reviews as Topic

Psychometric properties and domains covered by patient-reported outcome measures used in trials assessing interventions for chronic pain.

Alebouyeh, Farzaneh; Boutron, Isabelle; Ravaud, Philippe; Tran, Viet-Thi.

J Clin Epidemiol ; 170: 111362, 2024 Jun.

Article in English | MEDLINE | ID: mdl-38615827

ABSTRACT

OBJECTIVES: To identify the patient-reported outcome measures (PROMs) used in clinical trials assessing interventions for chronic pain, and to describe their psychometric properties and the clinical domains they cover. STUDY DESIGN AND SETTING: We identified phase 3 or 4 interventional trials on adult participants (aged >18 years) that were registered in clinicaltrials.gov between January 1, 2021 and December 31, 2022, and that provided "chronic pain" as a keyword condition. We excluded diagnostic studies and phase 1 or 2 trials. In each trial, one reviewer extracted all outcomes registered and identified those captured using PROMs. For each PROM used in more than 1% of identified trials, two reviewers assessed whether it covered the six important clinical domains from the Initiative on Methods, Measurement, and Pain Assessment in Clinical Trials (IMMPACT): pain, emotional functioning, physical functioning, participant ratings of global improvement and satisfaction with treatment, symptoms and adverse events, and participant disposition (eg, adherence to medication). Second, reviewers searched PubMed for both the initial publication and latest review reporting the psychometric properties of each PROM and assessed their content validity, structural validity, internal consistency, reliability, measurement error, hypotheses testing, criterion validity, and responsiveness using published criteria from the literature. RESULTS: In total, 596 trials assessing 4843 outcomes were included in the study (median sample size 60, interquartile range 40-100). Trials evaluated behavioral (22%), device-based (21%), and drug-based (10%) interventions. Of 495 unique PROMs, 55 were used in more than 1% of trials (16 were generic pain measures; 8 were pain measures for specific diseases; and 30 were measures of other symptoms or consequences of pain). About 50% of PROMs had more than 50% of psychometric properties rated as sufficient. Scales often focused on a single clinical domain. Only 25% of trials measured at least three clinical domains from IMMPACT. CONCLUSION: Half of PROMs used in trials assessing interventions for chronic pain had sufficient psychometric properties for more than 50% of criteria assessed. Few PROMs assessed more than one important clinical domain. Only 25% of trials measured more than 3/6 clinical domains considered important by IMMPACT.

Subject(s)

Chronic Pain , Patient Reported Outcome Measures , Psychometrics , Humans , Chronic Pain/therapy , Psychometrics/methods , Adult , Pain Measurement/methods , Male , Female , Middle Aged , Reproducibility of Results , Clinical Trials as Topic

Mutagenicity Assessment of Drinking Water in Combination with Flavored Black Tea Bags: a Cross Sectional Study in Tehran.

Alebouyeh, Farzaneh; Bidgoli, Sepideh Arbabi; Ziarati, Parisa; Heshmati, Masoomeh; Qomi, Mahnaz.

Asian Pac J Cancer Prev ; 16(17): 7479-84, 2015.

Article in English | MEDLINE | ID: mdl-26625748

ABSTRACT

Diseases related to water impurities may present as major public health burdens. The present study aimed to assess the mutagenicity of drinking water from different zones of Tehran, and to evaluate possible health risks from making tea with tea bags, using the Ames mutagenicity test with the TA 100, TA 98 and YG1029 strains. For this purpose, 450 water samples were collected over the period of July to December 2014 from 5 different zones of Tehran. Except for one sample, no mutagenic potential was detected during these two seasons and the MI scores were almost normal (≤ 1-1.6) in the TA 100, TA 98 and YG1029 strains. Although no mutagenic effects were observed in TA 98 and TA 100 in the test samples of the three evaluated tea bag brands, one sample from a local company showed mutagenic effects in the YG1029 strain (MI=1.7-1.9 and 2) after prolonged (10-15 min.) steeping. Despite the mild mutagenic effect discovered for one of the brands, this cross-sectional study showed relative safety of water samples and black tea bags in Tehran. Given the sensitivity of YG1029 to the mutagenic potential of water and black tea, even without metabolic activation by S9 fraction, this metabolizer strain could be considered sensitive and applicable to food samples for quantitative analysis of mutagens.
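
The abstract reports MI scores without defining them. In Ames testing, a mutagenicity index is commonly computed as the mean number of revertant colonies on the test plates divided by the mean number on the negative-control plates. The sketch below assumes that definition; the function name `mutagenicity_index` and the colony counts are hypothetical and do not come from the study.

```python
from statistics import mean


def mutagenicity_index(sample_counts: list[int], control_counts: list[int]) -> float:
    """Ratio of mean revertant colonies per test plate to mean per control plate.

    Assumption: this is the common ratio definition of MI; the abstract
    does not state which formula the authors used.
    """
    control_mean = mean(control_counts)
    if control_mean == 0:
        raise ValueError("control plates must show at least one spontaneous revertant")
    return mean(sample_counts) / control_mean


# Hypothetical triplicate plate counts, for illustration only.
mi = mutagenicity_index(sample_counts=[38, 42, 40], control_counts=[20, 22, 18])
print(f"MI = {mi:.1f}")
```

How an MI value is interpreted depends on the cutoff a laboratory adopts; the abstract describes scores up to about 1.6 as almost normal and scores of 1.7 to 2 as mutagenic for YG1029.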

Subject(s)

Drinking Water/adverse effects , Drinking Water/analysis , Mutagenicity Tests/methods , Tea/adverse effects , Water Pollutants/analysis , Cross-Sectional Studies , Humans , Iran , Mutagens/analysis , Salmonella typhimurium/drug effects , Salmonella typhimurium/growth & development
