
Retrieval-Based Diagnostic Decision Support: Mixed Methods Study.

Abdullahi, Tassallah; Mercurio, Laura; Singh, Ritambhara; Eickhoff, Carsten.

JMIR Med Inform ; 12: e50209, 2024 Jun 19.

Article in English | MEDLINE | ID: mdl-38896468

ABSTRACT

BACKGROUND: Diagnostic errors pose significant health risks and contribute to patient mortality. With the growing accessibility of electronic health records, machine learning models offer a promising avenue for enhancing diagnosis quality. Current research has primarily focused on a limited set of diseases with ample training data, neglecting diagnostic scenarios with limited data availability. OBJECTIVE: This study aims to develop an information retrieval (IR)-based framework that accommodates data sparsity to facilitate broader diagnostic decision support. METHODS: We introduced an IR-based diagnostic decision support framework called CliniqIR. It uses clinical text records, the Unified Medical Language System Metathesaurus, and 33 million PubMed abstracts to classify a broad spectrum of diagnoses independent of training data availability. CliniqIR is designed to be compatible with any IR framework. Therefore, we implemented it using both dense and sparse retrieval approaches. We compared CliniqIR's performance to that of pretrained clinical transformer models such as Clinical Bidirectional Encoder Representations from Transformers (ClinicalBERT) in supervised and zero-shot settings. Subsequently, we combined the strength of supervised fine-tuned ClinicalBERT and CliniqIR to build an ensemble framework that delivers state-of-the-art diagnostic predictions. RESULTS: On a complex diagnosis data set (DC3) without any training data, CliniqIR models returned the correct diagnosis within their top 3 predictions. On the Medical Information Mart for Intensive Care III data set, CliniqIR models surpassed ClinicalBERT in predicting diagnoses with <5 training samples by an average difference in mean reciprocal rank of 0.10. In a zero-shot setting where models received no disease-specific training, CliniqIR still outperformed the pretrained transformer models with a greater mean reciprocal rank of at least 0.10. 
Furthermore, in most conditions, our ensemble framework surpassed the performance of its individual components, demonstrating its enhanced ability to make precise diagnostic predictions. CONCLUSIONS: Our experiments highlight the importance of IR in leveraging unstructured knowledge resources to identify infrequently encountered diagnoses. In addition, our ensemble framework benefits from combining the complementary strengths of the supervised and retrieval-based models to diagnose a broad spectrum of diseases.
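The results above are reported as mean reciprocal rank (MRR). As a minimal sketch of how that metric is usually computed, and not the authors' evaluation code, the following assumes each model outputs a ranked list of candidate diagnoses per case; the function names and the toy diagnosis labels are hypothetical.

```python
from typing import Sequence


def reciprocal_rank(ranked: Sequence[str], correct: str) -> float:
    """Return 1/position of the correct diagnosis, or 0.0 if it is absent."""
    for position, candidate in enumerate(ranked, start=1):
        if candidate == correct:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(
    rankings: Sequence[Sequence[str]], gold: Sequence[str]
) -> float:
    """Average the reciprocal rank over all cases."""
    if len(rankings) != len(gold):
        raise ValueError("rankings and gold must have the same length")
    if not gold:
        raise ValueError("at least one case is required")
    total = sum(reciprocal_rank(r, g) for r, g in zip(rankings, gold))
    return total / len(gold)


# Toy example with made-up labels (illustrative only, not study data).
example_rankings = [
    ["sepsis", "pneumonia", "asthma"],  # correct answer at rank 1
    ["pneumonia", "sepsis"],            # correct answer at rank 2
    ["asthma", "pneumonia"],            # correct answer missing
]
example_gold = ["sepsis", "sepsis", "sepsis"]
example_mrr = mean_reciprocal_rank(example_rankings, example_gold)
```

Under this definition, an MRR difference of 0.10 means the correct diagnosis tends to appear noticeably closer to the top of the ranked list, averaged over cases.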

Learning to Make Rare and Complex Diagnoses With Generative AI Assistance: Qualitative Study of Popular Large Language Models.

Abdullahi, Tassallah; Singh, Ritambhara; Eickhoff, Carsten.

JMIR Med Educ ; 10: e51391, 2024 Feb 13.

Article in English | MEDLINE | ID: mdl-38349725

ABSTRACT

BACKGROUND: Patients with rare and complex diseases often experience delayed diagnoses and misdiagnoses because comprehensive knowledge about these diseases is limited to only a few medical experts. In this context, large language models (LLMs) have emerged as powerful knowledge aggregation tools with applications in clinical decision support and education domains. OBJECTIVE: This study aims to explore the potential of 3 popular LLMs, namely Bard (Google LLC), ChatGPT-3.5 (OpenAI), and GPT-4 (OpenAI), in medical education to enhance the diagnosis of rare and complex diseases while investigating the impact of prompt engineering on their performance. METHODS: We conducted experiments on publicly available complex and rare cases to achieve these objectives. We implemented various prompt strategies to evaluate the performance of these models using both open-ended and multiple-choice prompts. In addition, we used a majority voting strategy to leverage diverse reasoning paths within language models, aiming to enhance their reliability. Furthermore, we compared their performance with the performance of human respondents and MedAlpaca, a generative LLM specifically designed for medical tasks. RESULTS: Notably, all LLMs outperformed the average human consensus and MedAlpaca, with a minimum margin of 5% and 13%, respectively, across all 30 cases from the diagnostic case challenge collection. On the frequently misdiagnosed cases category, Bard tied with MedAlpaca but surpassed the human average consensus by 14%, whereas GPT-4 and ChatGPT-3.5 outperformed MedAlpaca and the human respondents on the moderately often misdiagnosed cases category with minimum accuracy scores of 28% and 11%, respectively. The majority voting strategy, particularly with GPT-4, demonstrated the highest overall score across all cases from the diagnostic complex case collection, surpassing that of other LLMs. 
On the Medical Information Mart for Intensive Care-III data sets, Bard and GPT-4 achieved the highest diagnostic accuracy scores, with multiple-choice prompts scoring 93%, whereas ChatGPT-3.5 and MedAlpaca scored 73% and 47%, respectively. Furthermore, our results demonstrate that there is no one-size-fits-all prompting approach for improving the performance of LLMs and that a single strategy does not universally apply to all LLMs. CONCLUSIONS: Our findings shed light on the diagnostic capabilities of LLMs and the challenges associated with identifying an optimal prompting strategy that aligns with each language model's characteristics and specific task requirements. The significance of prompt engineering is highlighted, providing valuable insights for researchers and practitioners who use these language models for medical training. Furthermore, this study represents a crucial step toward understanding how LLMs can enhance diagnostic reasoning in rare and complex medical cases, paving the way for developing effective educational tools and accurate diagnostic aids to improve patient care and outcomes.
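The majority voting strategy mentioned above can be illustrated with a short sketch. The abstract does not say how answers were normalized or how ties were broken, so the lowercasing, whitespace stripping, and first-seen tie-breaking below are assumptions, and `majority_vote` is a hypothetical helper rather than the authors' implementation.

```python
from collections import Counter
from typing import Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Return the most frequent answer among several sampled responses.

    Assumption: answers are compared after stripping whitespace and
    lowercasing. Ties go to the answer that appeared first, because
    Counter.most_common keeps first-seen order among equal counts.
    """
    if not answers:
        raise ValueError("at least one answer is required")
    normalized = [answer.strip().lower() for answer in answers]
    counts = Counter(normalized)
    return counts.most_common(1)[0][0]


# Toy example: three sampled responses for one case (made-up labels).
sampled = ["Sarcoidosis", "sarcoidosis ", "Tuberculosis"]
voted = majority_vote(sampled)
```

In practice each element of `sampled` would come from a separate model response to the same case, so that the vote aggregates different reasoning paths.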

Subject(s)

Learning , Problem Solving , Humans , Reproducibility of Results , Educational Status , Language

Predicting diarrhoea outbreaks with climate change.

Abdullahi, Tassallah; Nitschke, Geoff; Sweijd, Neville.

PLoS One ; 17(4): e0262008, 2022.

Article in English | MEDLINE | ID: mdl-35439258

ABSTRACT

BACKGROUND: Climate change is expected to exacerbate diarrhoea outbreaks across the developing world, most notably in Sub-Saharan countries such as South Africa. In South Africa, diseases related to diarrhoea outbreaks are a leading cause of morbidity and mortality. In this study, we modelled the impacts of climate change on diarrhoea with various machine learning (ML) methods to predict daily outbreaks of diarrhoea cases in nine South African provinces. METHODS: We applied two deep learning (DL) techniques, Convolutional Neural Networks (CNNs) and Long Short-Term Memory networks (LSTMs), and a Support Vector Machine (SVM) to predict daily diarrhoea cases across the South African provinces by incorporating climate information. Generative Adversarial Networks (GANs) were used to generate synthetic data, which was used to augment the available data set. Furthermore, Relevance Estimation and Value Calibration (REVAC) was used to tune the parameters of the ML methods to optimize the accuracy of their predictions. Sensitivity analysis was also performed to investigate the contribution of the different climate factors to the diarrhoea prediction method. RESULTS: Our results showed that all three ML methods were appropriate for predicting daily diarrhoea cases with respect to the selected climate variables in each South African province. However, the level of accuracy for each method varied across different experiments, with the deep learning methods outperforming the SVM method. Among the deep learning techniques, the CNN method performed best when only the real-world data set was used, while the LSTM method outperformed the other methods when the real-world data set was augmented with synthetic data. Across the provinces, the accuracy of all three ML methods improved by at least 30 percent when data augmentation was implemented. In addition, REVAC improved the accuracy of the CNN method by about 2.5% in each province.
Our parameter sensitivity analysis revealed that the most influential climate variables to be considered when predicting diarrhoea outbreaks in South Africa were precipitation, humidity, evaporation and temperature conditions. CONCLUSIONS: Overall, experiments indicated that the prediction accuracy of our DL methods (Convolutional Neural Networks) was superior, with statistical significance, across most provinces. This study's results have important implications for the development of automated early warning systems for diarrhoea (and related disease) outbreaks across the globe.

Subject(s)

Climate Change , Neural Networks, Computer , Diarrhea/epidemiology , Disease Outbreaks , Humans , Machine Learning
