Large language models encode clinical knowledge.

Singhal, Karan; Azizi, Shekoofeh; Tu, Tao; Mahdavi, S Sara; Wei, Jason; Chung, Hyung Won; Scales, Nathan; Tanwani, Ajay; Cole-Lewis, Heather; Pfohl, Stephen; Payne, Perry; Seneviratne, Martin; Gamble, Paul; Kelly, Chris; Babiker, Abubakr; Schärli, Nathanael; Chowdhery, Aakanksha; Mansfield, Philip; Demner-Fushman, Dina; Agüera Y Arcas, Blaise; Webster, Dale; Corrado, Greg S; Matias, Yossi; Chou, Katherine; Gottweis, Juraj; Tomasev, Nenad; Liu, Yun; Rajkomar, Alvin; Barral, Joelle; Semturs, Christopher; Karthikesalingam, Alan; Natarajan, Vivek

Affiliation

Singhal K; Google Research, Mountain View, CA, USA. karansinghal@google.com.
Azizi S; Google Research, Mountain View, CA, USA. shekazizi@google.com.
Tu T; Google Research, Mountain View, CA, USA.
Mahdavi SS; Google Research, Mountain View, CA, USA.
Wei J; Google Research, Mountain View, CA, USA.
Chung HW; Google Research, Mountain View, CA, USA.
Scales N; Google Research, Mountain View, CA, USA.
Tanwani A; Google Research, Mountain View, CA, USA.
Cole-Lewis H; Google Research, Mountain View, CA, USA.
Pfohl S; Google Research, Mountain View, CA, USA.
Payne P; Google Research, Mountain View, CA, USA.
Seneviratne M; Google Research, Mountain View, CA, USA.
Gamble P; Google Research, Mountain View, CA, USA.
Kelly C; Google Research, Mountain View, CA, USA.
Babiker A; Google Research, Mountain View, CA, USA.
Schärli N; Google Research, Mountain View, CA, USA.
Chowdhery A; Google Research, Mountain View, CA, USA.
Mansfield P; Google Research, Mountain View, CA, USA.
Demner-Fushman D; National Library of Medicine, Bethesda, MD, USA.
Agüera Y Arcas B; Google Research, Mountain View, CA, USA.
Webster D; Google Research, Mountain View, CA, USA.
Corrado GS; Google Research, Mountain View, CA, USA.
Matias Y; Google Research, Mountain View, CA, USA.
Chou K; Google Research, Mountain View, CA, USA.
Gottweis J; Google Research, Mountain View, CA, USA.
Tomasev N; DeepMind, London, UK.
Liu Y; Google Research, Mountain View, CA, USA.
Rajkomar A; Google Research, Mountain View, CA, USA.
Barral J; Google Research, Mountain View, CA, USA.
Semturs C; Google Research, Mountain View, CA, USA.
Karthikesalingam A; Google Research, Mountain View, CA, USA. alankarthi@google.com.
Natarajan V; Google Research, Mountain View, CA, USA. natviv@google.com.

Nature; 620(7972): 172-180, 2023 Aug.

Article in En | MEDLINE | ID: mdl-37438534

ABSTRACT

Large language models (LLMs) have demonstrated impressive capabilities, but the bar for clinical applications is high. Attempts to assess the clinical knowledge of models typically rely on automated evaluations based on limited benchmarks. Here, to address these limitations, we present MultiMedQA, a benchmark combining six existing medical question answering datasets spanning professional medicine, research and consumer queries and a new dataset of medical questions searched online, HealthSearchQA. We propose a human evaluation framework for model answers along multiple axes including factuality, comprehension, reasoning, possible harm and bias. In addition, we evaluate Pathways Language Model (PaLM, a 540-billion parameter LLM) and its instruction-tuned variant, Flan-PaLM, on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA, MedMCQA, PubMedQA and Measuring Massive Multitask Language Understanding (MMLU) clinical topics), including 67.6% accuracy on MedQA (US Medical Licensing Exam-style questions), surpassing the prior state of the art by more than 17%. However, human evaluation reveals key gaps. To resolve this, we introduce instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars. The resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians. We show that comprehension, knowledge recall and reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine. Our human evaluations reveal limitations of today's models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful LLMs for clinical applications.

Subjects

Benchmarking; Computer Simulation; Knowledge; Medicine; Natural Language Processing; Bias; Clinical Competence; Comprehension; Datasets as Topic; Licensure; Medicine/methods; Medicine/standards; Patient Safety; Physicians

Full text: 1 Collections: 01-international Database: MEDLINE Main subject: Computer Simulation / Natural Language Processing / Knowledge / Benchmarking / Medicine Type of study: Prognostic_studies Language: En Journal: Nature Publication year: 2023 Document type: Article Affiliation country: United States
