Search | Biblioteca Virtual em Saúde (Virtual Health Library)

Evaluating prompt engineering on GPT-3.5's performance in USMLE-style medical calculations and clinical scenarios generated by GPT-4.

Patel, Dhavalkumar; Raut, Ganesh; Zimlichman, Eyal; Cheetirala, Satya Narayan; Nadkarni, Girish N; Glicksberg, Benjamin S; Apakama, Donald U; Bell, Elijah J; Freeman, Robert; Timsina, Prem; Klang, Eyal.

Sci Rep ; 14(1): 17341, 2024 07 28.

Article in English | MEDLINE | ID: mdl-39069520

ABSTRACT

This study was designed to assess how different prompt engineering techniques, specifically direct prompts, Chain of Thought (CoT), and a modified CoT approach, influence the ability of GPT-3.5 to answer clinical and calculation-based medical questions, particularly those styled like the USMLE Step 1 exams. To achieve this, we analyzed the responses of GPT-3.5 to two distinct sets of questions: a batch of 1000 questions generated by GPT-4, and another set comprising 95 real USMLE Step 1 questions. These questions spanned a range of medical calculations and clinical scenarios across various fields and difficulty levels. Our analysis revealed that there were no significant differences in the accuracy of GPT-3.5's responses when using direct prompts, CoT, or modified CoT methods. For instance, in the USMLE sample, the success rates were 61.7% for direct prompts, 62.8% for CoT, and 57.4% for modified CoT, with a p-value of 0.734. Similar trends were observed in the responses to GPT-4 generated questions, both clinical and calculation-based, with p-values above 0.05 indicating no significant difference between the prompt types. The conclusion drawn from this study is that the use of CoT prompt engineering does not significantly alter GPT-3.5's effectiveness in handling medical calculations or clinical scenario questions styled like those in USMLE exams. This finding is crucial as it suggests that performance of ChatGPT remains consistent regardless of whether a CoT technique is used instead of direct prompts. This consistency could be instrumental in simplifying the integration of AI tools like ChatGPT into medical education, enabling healthcare professionals to utilize these tools with ease, without the necessity for complex prompt engineering.
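The comparison of success rates across the three prompt types can be illustrated with a Pearson chi-square test on a 2 x 3 table. This is only a sketch under stated assumptions: the abstract does not name the statistical test used, and the counts below (58, 59, and 54 correct out of 94) are back-calculated from the reported percentages, not taken from the paper. The function names are hypothetical.

```python
import math


def chi_square_2xk(successes, totals):
    """Pearson chi-square statistic for k groups of success/failure counts."""
    pooled = sum(successes) / sum(totals)  # pooled success rate under H0
    stat = 0.0
    for s, n in zip(successes, totals):
        expected_success = n * pooled
        expected_failure = n * (1 - pooled)
        stat += (s - expected_success) ** 2 / expected_success
        stat += ((n - s) - expected_failure) ** 2 / expected_failure
    return stat


def p_value_df2(stat):
    """Chi-square survival function; exp(-x/2) is exact only for 2 degrees of freedom."""
    return math.exp(-stat / 2)


# ASSUMPTION: counts inferred from 61.7%, 62.8%, 57.4% with n = 94 per prompt type.
successes = [58, 59, 54]  # direct, CoT, modified CoT
totals = [94, 94, 94]

stat = chi_square_2xk(successes, totals)
p = p_value_df2(stat)
print(f"chi-square = {stat:.3f}, p = {p:.3f}")
```

Under these assumed counts the p-value comes out at roughly 0.73, in line with the reported 0.734; the small gap may reflect a different test or different denominators in the original analysis.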

Subjects

Educational Measurement , Humans , Educational Measurement/methods , Medical Licensure , Clinical Competence , United States , Undergraduate Medical Education/methods

Evaluating the accuracy of a state-of-the-art large language model for prediction of admissions from the emergency room.

Glicksberg, Benjamin S; Timsina, Prem; Patel, Dhaval; Sawant, Ashwin; Vaid, Akhil; Raut, Ganesh; Charney, Alexander W; Apakama, Donald; Carr, Brendan G; Freeman, Robert; Nadkarni, Girish N; Klang, Eyal.

J Am Med Inform Assoc ; 31(9): 1921-1928, 2024 Sep 01.

Article in English | MEDLINE | ID: mdl-38771093

ABSTRACT

BACKGROUND: Artificial intelligence (AI) and large language models (LLMs) can play a critical role in emergency room operations by augmenting decision-making about patient admission. However, there are no studies for LLMs using real-world data and scenarios, in comparison to and being informed by traditional supervised machine learning (ML) models. We evaluated the performance of GPT-4 for predicting patient admissions from emergency department (ED) visits. We compared performance to traditional ML models both naively and when informed by few-shot examples and/or numerical probabilities. METHODS: We conducted a retrospective study using electronic health records across 7 NYC hospitals. We trained Bio-Clinical-BERT and XGBoost (XGB) models on unstructured and structured data, respectively, and created an ensemble model reflecting ML performance. We then assessed GPT-4 capabilities in many scenarios: through Zero-shot, Few-shot with and without retrieval-augmented generation (RAG), and with and without ML numerical probabilities. RESULTS: The Ensemble ML model achieved an area under the receiver operating characteristic curve (AUC) of 0.88, an area under the precision-recall curve (AUPRC) of 0.72 and an accuracy of 82.9%. The naïve GPT-4's performance (0.79 AUC, 0.48 AUPRC, and 77.5% accuracy) showed substantial improvement when given limited, relevant data to learn from (ie, RAG) and underlying ML probabilities (0.87 AUC, 0.71 AUPRC, and 83.1% accuracy). Interestingly, RAG alone boosted performance to near peak levels (0.82 AUC, 0.56 AUPRC, and 81.3% accuracy). CONCLUSIONS: The naïve LLM had limited performance but showed significant improvement in predicting ED admissions when supplemented with real-world examples to learn from, particularly through RAG, and/or numerical probabilities from traditional ML models. Its peak performance, although slightly lower than the pure ML model, is noteworthy given its potential for providing reasoning behind predictions. 
Further refinement of LLMs with real-world data is necessary for successful integration as decision-support tools in care settings.
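For readers unfamiliar with the metrics reported above, the following minimal sketch shows how an AUC and an accuracy figure are derived from predicted admission probabilities. The labels and scores are made-up toy values, not study data, and the function names are hypothetical; the AUC is computed as the fraction of (admitted, not admitted) pairs that the model ranks correctly.

```python
def auroc(labels, scores):
    """AUC as the probability that a positive case outranks a negative one."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


def accuracy(labels, scores, threshold=0.5):
    """Share of cases classified correctly at a fixed probability threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    correct = sum(1 for y, y_hat in zip(labels, preds) if y == y_hat)
    return correct / len(labels)


# Toy example (ASSUMPTION, not study data): 1 = admitted, 0 = discharged.
labels = [0, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80]

print("AUC:", auroc(labels, scores))
print("Accuracy:", accuracy(labels, scores))
```

The pairwise form is quadratic in the number of cases, which is fine for illustration; on real cohorts a rank-based implementation from a statistics library would be the usual choice.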

Subjects

Electronic Health Records , Hospital Emergency Service , Patient Admission , Humans , Retrospective Studies , Artificial Intelligence , Natural Language Processing , Machine Learning , Supervised Machine Learning

Artificial intelligence-guided detection of under-recognized cardiomyopathies on point-of-care cardiac ultrasound: a multi-center study.

Oikonomou, Evangelos K; Vaid, Akhil; Holste, Gregory; Coppi, Andreas; McNamara, Robert L; Baloescu, Cristiana; Krumholz, Harlan M; Wang, Zhangyang; Apakama, Donald J; Nadkarni, Girish N; Khera, Rohan.

medRxiv ; 2024 Jun 29.

Article in English | MEDLINE | ID: mdl-38559021

ABSTRACT

Background: Point-of-care ultrasonography (POCUS) enables cardiac imaging at the bedside and in communities but is limited by abbreviated protocols and variation in quality. We developed and tested artificial intelligence (AI) models to automate the detection of underdiagnosed cardiomyopathies from cardiac POCUS. Methods: In a development set of 290,245 transthoracic echocardiographic videos across the Yale-New Haven Health System (YNHHS), we used augmentation approaches and a customized loss function weighted for view quality to derive a POCUS-adapted, multi-label, video-based convolutional neural network (CNN) that discriminates HCM (hypertrophic cardiomyopathy) and ATTR-CM (transthyretin amyloid cardiomyopathy) from controls without known disease. We evaluated the final model across independent, internal and external, retrospective cohorts of individuals who underwent cardiac POCUS across YNHHS and Mount Sinai Health System (MSHS) emergency departments (EDs) (2011-2024) to prioritize key views and validate the diagnostic and prognostic performance of single-view screening protocols. Findings: We identified 33,127 patients (median age 61 [IQR: 45-75] years, n=17,276 [52·2%] female) at YNHHS and 5,624 (57 [IQR: 39-71] years, n=1,953 [34·7%] female) at MSHS with 78,054 and 13,796 eligible cardiac POCUS videos, respectively. An AI-enabled single-view screening approach successfully discriminated HCM (AUROC of 0·90 [YNHHS] & 0·89 [MSHS]) and ATTR-CM (AUROC of 0·92 [YNHHS] & 0·99 [MSHS]). In YNHHS, 40 (58·0%) HCM and 23 (47·9%) ATTR-CM cases had a positive screen at a median of 2·1 [IQR: 0·9-4·5] and 1·9 [IQR: 1·0-3·4] years before clinical diagnosis, respectively.
Moreover, among 24,448 participants without known cardiomyopathy followed over 2·2 [IQR: 1·1-5·8] years, AI-POCUS probabilities in the highest (vs lowest) quintile for HCM and ATTR-CM conferred a 15% (adj.HR 1·15 [95%CI: 1·02-1·29]) and 39% (adj.HR 1·39 [95%CI: 1·22-1·59]) higher age- and sex-adjusted mortality risk, respectively. Interpretation: We developed and validated an AI framework that enables scalable, opportunistic screening of treatable cardiomyopathies wherever POCUS is used. Funding: National Heart, Lung and Blood Institute, Doris Duke Charitable Foundation, BridgeBio.

Radial artery pseudoaneurysm diagnosed by point-of-care ultrasound five days after transradial catheterization: A case report.

Alerhand, Stephen; Apakama, Donald; Nevel, Adam; Nelson, Bret P.

World J Emerg Med ; 9(3): 223-226, 2018.

Article in English | MEDLINE | ID: mdl-29796148

Hip Hop HEALS: Pilot Study of a Culturally Targeted Calorie Label Intervention to Improve Food Purchases of Children.

Williams, Olajide; DeSorbo, Alexandra; Sawyer, Vanessa; Apakama, Donald; Shaffer, Michele; Gerin, William; Noble, James.

Health Educ Behav ; 43(1): 68-75, 2016 Feb.

Article in English | MEDLINE | ID: mdl-26272785

ABSTRACT

OBJECTIVES: We explored the effect of a culturally targeted calorie label intervention on food purchasing behavior of elementary school students. METHOD: We used a quasi-experimental design with two intervention schools and one control school to assess food purchases of third through fifth graders at standardized school food sales before and after the intervention (immediate and delayed) in schools. The intervention comprised three 1-hour assembly-style hip-hop-themed multimedia classes. RESULTS: A mean total of 225 children participated in two baseline preintervention sales with and without calorie labels; 149 children participated in immediate postintervention food sales, while 133 children participated in the delayed sales. No significant change in purchased calories was observed in response to labels alone before the intervention. However, a mean decline in purchased calories of 20% (p < .01) and unhealthy foods (p < .01) was seen immediately following the intervention compared to baseline purchases, and this persisted without significant decay after 7 days and 12 days. CONCLUSION: A 3-hour culturally targeted calorie label intervention may improve food-purchasing behavior of children.

Subjects

Energy Intake , Food Labeling/methods , Food Preferences/psychology , Health Promotion , School Health Services , Child , Culture , Food Services , Humans , Pilot Projects
