Comparing the Efficacy and Efficiency of Human and Generative AI: Qualitative Thematic Analyses.

Prescott, Maximo R; Yeager, Samantha; Ham, Lillian; Rivera Saldana, Carlos D; Serrano, Vanessa; Narez, Joey; Paltin, Dafna; Delgado, Jorge; Moore, David J; Montoya, Jessica

Affiliation

Prescott MR; HIV Neurobehavioral Research Program, University of California, San Diego, San Diego, CA, United States.
Yeager S; San Diego State University/University of California San Diego Joint Doctoral Program in Clinical Psychology, San Diego, CA, United States.
Ham L; HIV Neurobehavioral Research Program, University of California, San Diego, San Diego, CA, United States.
Rivera Saldana CD; HIV Neurobehavioral Research Program, University of California, San Diego, San Diego, CA, United States.
Serrano V; San Diego State University/University of California San Diego Joint Doctoral Program in Clinical Psychology, San Diego, CA, United States.
Narez J; HIV Neurobehavioral Research Program, University of California, San Diego, San Diego, CA, United States.
Paltin D; Department of Medicine, University of California, San Diego, San Diego, CA, United States.
Delgado J; HIV Neurobehavioral Research Program, University of California, San Diego, San Diego, CA, United States.
Moore DJ; San Diego State University/University of California San Diego Joint Doctoral Program in Clinical Psychology, San Diego, CA, United States.
Montoya J; HIV Neurobehavioral Research Program, University of California, San Diego, San Diego, CA, United States.

JMIR AI ; 3: e54482, 2024 Aug 02.

Article in English | MEDLINE | ID: mdl-39094113

ABSTRACT

BACKGROUND:

Qualitative methods are highly beneficial to the dissemination and implementation of new digital health interventions; however, these methods can be time intensive and can slow dissemination when timely knowledge from the data sources is needed in ever-changing health systems. Recent advancements in generative artificial intelligence (GenAI) and its underlying large language models (LLMs) may provide an opportunity to expedite the qualitative analysis of textual data, but their efficacy and reliability remain unknown.

OBJECTIVE:

The primary objectives of our study were to evaluate the consistency in themes, reliability of coding, and time needed for inductive and deductive thematic analyses between GenAI (ie, ChatGPT and Bard) and human coders.

METHODS:

The qualitative data for this study consisted of 40 brief SMS text message reminder prompts used in a digital health intervention for promoting antiretroviral medication adherence among people with HIV who use methamphetamine. Inductive and deductive thematic analyses of these SMS text messages were conducted by 2 independent teams of human coders. An independent human analyst conducted analyses following both approaches using ChatGPT and Bard. The consistency in themes (or the extent to which the themes were the same) and reliability (or agreement in coding of themes) between methods were compared.
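The two comparison measures named above can be illustrated with a minimal sketch. It is not the study's procedure: the theme labels, message IDs, and both function names are hypothetical, themes are matched here by exact label (whereas judging whether two themes are "the same" would normally require human review), and the agreement denominator (every code application made by either coder) is an assumption, since the abstract does not state how the denominators were formed.

```python
def theme_consistency(reference_themes, candidate_themes):
    """Return (matched, total): how many reference themes also appear
    among the candidate themes. Matching by exact label is an assumption."""
    reference = set(reference_themes)
    matched = reference & set(candidate_themes)
    return len(matched), len(reference)


def percent_agreement(codes_a, codes_b):
    """Return (agreed, total) over (message_id, theme) code applications.
    'total' counts every application made by either coder (assumed definition)."""
    a, b = set(codes_a), set(codes_b)
    return len(a & b), len(a | b)


# Hypothetical themes and codes, invented purely for illustration.
human_themes = ["reminder", "encouragement", "health benefit"]
ai_themes = ["reminder", "health benefit", "routine"]

human_codes = {(1, "reminder"), (2, "encouragement"),
               (3, "health benefit"), (3, "reminder")}
ai_codes = {(1, "reminder"), (2, "reminder"), (3, "health benefit")}

matched, total_themes = theme_consistency(human_themes, ai_themes)
agreed, total_codes = percent_agreement(human_codes, ai_codes)
print(f"consistency: {matched}/{total_themes}")
print(f"agreement: {agreed}/{total_codes}")
```

With these invented inputs, 2 of 3 human themes are matched and 2 of 5 code applications agree.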

RESULTS:

The themes generated by GenAI (both ChatGPT and Bard) were consistent with 71% (5/7) of the themes identified by human analysts following inductive thematic analysis. The consistency in themes was lower between humans and GenAI following a deductive thematic analysis procedure (ChatGPT 6/12, 50%; Bard 7/12, 58%). The percentage agreement (or intercoder reliability) for these congruent themes between human coders and GenAI ranged from fair to moderate (ChatGPT, inductive 31/66, 47%; ChatGPT, deductive 22/59, 37%; Bard, inductive 20/54, 37%; Bard, deductive 21/58, 36%). In general, ChatGPT and Bard performed similarly to each other across both types of qualitative analyses in terms of consistency of themes (inductive 6/6, 100%; deductive 5/6, 83%) and reliability of coding (inductive 23/62, 37%; deductive 22/47, 47%). On average, GenAI required significantly less overall time than human coders when conducting qualitative analysis (mean 20, SD 3.5 min vs mean 567, SD 106.5 min).
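As a quick arithmetic check, the fractions reported in the RESULTS paragraph can be recomputed. The sketch below copies the counts and percentages from the text; the dictionary labels, the `whole_percent` helper, and the rounding to the nearest whole percent are assumptions made for illustration.

```python
# (numerator, denominator, percentage as printed in the abstract)
reported = {
    "themes: GenAI vs human, inductive": (5, 7, 71),
    "themes: ChatGPT vs human, deductive": (6, 12, 50),
    "themes: Bard vs human, deductive": (7, 12, 58),
    "coding: ChatGPT vs human, inductive": (31, 66, 47),
    "coding: ChatGPT vs human, deductive": (22, 59, 37),
    "coding: Bard vs human, inductive": (20, 54, 37),
    "coding: Bard vs human, deductive": (21, 58, 36),
    "themes: ChatGPT vs Bard, inductive": (6, 6, 100),
    "themes: ChatGPT vs Bard, deductive": (5, 6, 83),
    "coding: ChatGPT vs Bard, inductive": (23, 62, 37),
    "coding: ChatGPT vs Bard, deductive": (22, 47, 47),
}


def whole_percent(numerator, denominator):
    """Ratio expressed as a percentage rounded to the nearest whole number."""
    return round(100 * numerator / denominator)


# Labels whose recomputed percentage differs from the printed one.
mismatches = [
    label
    for label, (num, den, printed) in reported.items()
    if whole_percent(num, den) != printed
]

# Mean analysis times in minutes, as reported.
genai_minutes, human_minutes = 20, 567
time_ratio = human_minutes / genai_minutes

print("mismatches:", mismatches)
print(f"human-to-GenAI mean time ratio: about {time_ratio:.0f} to 1")
```

Every recomputed percentage matches the printed value, and the reported means imply that human coding took roughly 28 times as long as the GenAI-assisted analysis.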

CONCLUSIONS:

The consistency in the themes generated by human coders and GenAI suggests that these technologies hold promise in reducing the resource intensiveness of qualitative thematic analysis; however, the relatively lower reliability in coding between them suggests that hybrid approaches are necessary. Human coders appeared to be better than GenAI at identifying nuanced and interpretative themes. Future studies should consider how these technologies can best be used in collaboration with human coders to improve the efficiency of qualitative research in hybrid approaches while also mitigating the potential ethical risks that they may pose.

Keywords

Bard; ChatGPT; GenAI; digital health; generative artificial intelligence; qualitative research; thematic analysis

Full text: 1 Database: MEDLINE Language: En Publication year: 2024 Document type: Article
