Comparing the Quality of Domain-Specific Versus General Language Models for Artificial Intelligence-Generated Differential Diagnoses in PICU Patients.
Akhondi-Asl, Alireza; Yang, Youyang; Luchette, Matthew; Burns, Jeffrey P; Mehta, Nilesh M; Geva, Alon.
Affiliation
  • Akhondi-Asl A; Division of Critical Care Medicine, Department of Anesthesiology, Critical Care and Pain Medicine, Boston Children's Hospital, Boston, MA.
  • Yang Y; Perioperative and Critical Care-Center for Outcomes (PC-CORE), Boston Children's Hospital, Boston, MA.
  • Luchette M; Department of Anaesthesia, Harvard Medical School, Boston, MA.
  • Burns JP; Computational Health Informatics Program, Boston Children's Hospital, Boston, MA.
  • Mehta NM; Division of Critical Care Medicine, Department of Anesthesiology, Critical Care and Pain Medicine, Boston Children's Hospital, Boston, MA.
  • Geva A; Perioperative and Critical Care-Center for Outcomes (PC-CORE), Boston Children's Hospital, Boston, MA.
Pediatr Crit Care Med ; 25(6): e273-e282, 2024 Jun 01.
Article in En | MEDLINE | ID: mdl-38329382
ABSTRACT

OBJECTIVES:

Generative language models (LMs) are being evaluated in a variety of tasks in healthcare, but pediatric critical care studies are scant. Our objective was to evaluate the utility of generative LMs in the pediatric critical care setting and to determine whether domain-adapted LMs can outperform much larger general-domain LMs in generating a differential diagnosis from the admission notes of PICU patients.

DESIGN:

Single-center retrospective cohort study.

SETTING:

Quaternary 40-bed PICU.

PATIENTS:

Notes from all patients admitted to the PICU between January 2012 and April 2023 were used for model development. One hundred thirty randomly selected admission notes were used for evaluation.

INTERVENTIONS:

None.

MEASUREMENTS AND MAIN RESULTS:

Five experts in critical care used a 5-point Likert scale to independently evaluate the overall quality of differential diagnoses 1) written by the clinician in the original notes, 2) generated by two general LMs (BioGPT-Large and LLaMa-65B), and 3) generated by two fine-tuned models (fine-tuned BioGPT-Large and fine-tuned LLaMa-7B). Differences among differential diagnoses were compared using mixed methods regression models. We used 1,916,538 notes from 32,454 unique patients for model development and validation. The mean quality scores of the differential diagnoses generated by the clinicians and by fine-tuned LLaMa-7B, the best-performing LM, were 3.43 and 2.88, respectively (absolute difference 0.54 units [95% CI, 0.37-0.72], p < 0.001). Fine-tuned LLaMa-7B performed better than LLaMa-65B (absolute difference 0.23 units [95% CI, 0.06-0.41], p = 0.009) and BioGPT-Large (absolute difference 0.86 units [95% CI, 0.69-1.0], p < 0.001). The differential diagnoses generated by clinicians and by fine-tuned LLaMa-7B were ranked highest in quality in 144 (55%) and 74 (29%) cases, respectively.

CONCLUSIONS:

A smaller LM fine-tuned using notes of PICU patients outperformed much larger models trained on general-domain data. Currently, LMs remain inferior but may serve as an adjunct to human clinicians in real-world tasks using real-world data.
Subjects

Full text: 1 Collections: 01-international Database: MEDLINE Main subject: Artificial Intelligence / Intensive Care Units, Pediatric Study type: Diagnostic_studies / Observational_studies / Prognostic_studies Limits: Adolescent / Child / Child, preschool / Female / Humans / Infant / Male Language: En Publication year: 2024 Document type: Article