Results 1 - 3 of 3
1.
Nature; 620(7972): 172-180, 2023 Aug.
Article in English | MEDLINE | ID: mdl-37438534

ABSTRACT

Large language models (LLMs) have demonstrated impressive capabilities, but the bar for clinical applications is high. Attempts to assess the clinical knowledge of models typically rely on automated evaluations based on limited benchmarks. Here, to address these limitations, we present MultiMedQA, a benchmark combining six existing medical question answering datasets spanning professional medicine, research and consumer queries and a new dataset of medical questions searched online, HealthSearchQA. We propose a human evaluation framework for model answers along multiple axes including factuality, comprehension, reasoning, possible harm and bias. In addition, we evaluate Pathways Language Model (PaLM, a 540-billion parameter LLM) and its instruction-tuned variant, Flan-PaLM, on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA, MedMCQA, PubMedQA and Measuring Massive Multitask Language Understanding (MMLU) clinical topics), including 67.6% accuracy on MedQA (US Medical Licensing Exam-style questions), surpassing the prior state of the art by more than 17%. However, human evaluation reveals key gaps. To resolve this, we introduce instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars. The resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians. We show that comprehension, knowledge recall and reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine. Our human evaluations reveal limitations of today's models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful LLMs for clinical applications.


Subject(s)
Benchmarking, Computer Simulation, Knowledge, Medicine, Natural Language Processing, Bias, Clinical Competence, Comprehension, Datasets as Topic, Licensure, Medicine/methods, Medicine/standards, Patient Safety, Physicians
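The abstract above describes instruction prompt tuning as a parameter-efficient way to align an LLM to a new domain with a few exemplars. The sketch below illustrates the general soft-prompt-tuning idea only; it is not Med-PaLM's implementation. The model name ("gpt2" as a small stand-in), prompt length, learning rate, and the example exemplar are all illustrative assumptions.

```python
# Minimal soft prompt tuning sketch: learn a few prompt vectors while the
# base model stays frozen. Illustrative only; hyperparameters and model are
# assumptions, not details from the paper.
import torch
from torch import nn
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"                      # small stand-in for a large instruction-tuned LLM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.requires_grad_(False)              # base weights are frozen

n_prompt = 20                            # number of learnable soft-prompt vectors
emb = model.get_input_embeddings()
soft_prompt = nn.Parameter(torch.randn(n_prompt, emb.embedding_dim) * 0.02)
opt = torch.optim.AdamW([soft_prompt], lr=1e-3)

def step(question: str, answer: str) -> float:
    """One gradient step on a single (question, answer) exemplar."""
    ids = tok(question + " " + answer, return_tensors="pt").input_ids
    tok_emb = emb(ids)                                        # (1, T, D)
    inputs = torch.cat([soft_prompt.unsqueeze(0), tok_emb], dim=1)
    # Mask the soft-prompt positions out of the language-modelling loss.
    labels = torch.cat(
        [torch.full((1, n_prompt), -100, dtype=torch.long), ids], dim=1
    )
    loss = model(inputs_embeds=inputs, labels=labels).loss
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

print(step("Q: First-line treatment for anaphylaxis?", "A: Intramuscular epinephrine."))
```

Only the soft-prompt vectors receive gradients, which is what makes the approach parameter-efficient: the frozen base model is shared across domains, and only a tiny prompt tensor is stored per task.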
3.
Nat Commun; 13(1): 7456, 2022 Dec 02.
Article in English | MEDLINE | ID: mdl-36460656

ABSTRACT

Physicians write clinical notes with abbreviations and shorthand that are difficult to decipher. Abbreviations can be clinical jargon (writing "HIT" for "heparin induced thrombocytopenia"), ambiguous terms that require expertise to disambiguate (using "MS" for "multiple sclerosis" or "mental status"), or domain-specific vernacular ("cb" for "complicated by"). Here we train machine learning models on public web data to decode such text by replacing abbreviations with their meanings. We report a single translation model that simultaneously detects and expands thousands of abbreviations in real clinical notes with accuracies ranging from 92.1% to 97.1% on multiple external test datasets. The model equals or exceeds the performance of board-certified physicians (97.6% vs 88.7% total accuracy). Our results demonstrate a general method to contextually decipher abbreviations and shorthand that is built without any privacy-compromising data.


Subject(s)
Multiple Sclerosis, Physicians, Thrombocytopenia, Humans, Privacy, Machine Learning, Writing
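The abstract above trains a translation model on public web data, so that abbreviated clinical-style text can be mapped back to its expanded form. One common way to build such training pairs from clean text is reverse substitution: replace known expansions with their shorthand and keep the original sentence as the target. The sketch below shows only that data-generation step; the dictionary, sentences, and substitution probability are toy assumptions, and the downstream sequence-to-sequence training is omitted.

```python
# Illustrative reverse-substitution sketch: abbreviate clean text so the
# original sentence becomes the expansion target. Toy dictionary and corpus;
# not the paper's actual pipeline.
import re
import random

ABBREVIATIONS = {
    "heparin induced thrombocytopenia": "HIT",
    "multiple sclerosis": "MS",
    "complicated by": "cb",
}

def abbreviate(sentence: str, p: float = 0.8) -> str:
    """Randomly replace known expansions with their shorthand."""
    out = sentence
    for expansion, abbrev in ABBREVIATIONS.items():
        if random.random() < p:
            out = re.sub(re.escape(expansion), abbrev, out, flags=re.IGNORECASE)
    return out

def make_pairs(sentences):
    """Each pair is (abbreviated model input, fully expanded target)."""
    return [(abbreviate(s), s) for s in sentences]

corpus = [
    "Course complicated by heparin induced thrombocytopenia.",
    "History of multiple sclerosis, stable on therapy.",
]
for src, tgt in make_pairs(corpus):
    print(src, "->", tgt)
```

A seq2seq model trained on such pairs learns to expand shorthand in context, which is why no private clinical data is needed to build the training set.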