Results 1 - 20 of 76
1.
Int Dent J ; 2024 Aug 05.
Article in English | MEDLINE | ID: mdl-39107150

ABSTRACT

PURPOSE: Symptom checkers (SCs) are virtual health aids that assist laypersons in self-assessing dental complaints. This study aimed to investigate the triage performance, clinical efficacy, and user-perceived utility of a prototype dental SC, Toothbuddy, in assessing unscheduled dental complaints in Singapore. METHODS: A pilot trial was conducted amongst all unscheduled dental attendees at military dental facilities in Singapore from January to May 2023. The accuracy of Toothbuddy in tele-triaging dental conditions into 3 categories (routine, urgent, and emergency) was determined. Based on the patient-reported symptom input, clinical recommendations were provided to users for each category. Thereafter, all dental attendees were clinically assessed to determine the definitive category. Finally, a user questionnaire assessed the application's functionality, utility, and user satisfaction. Sensitivity and specificity analyses were performed. RESULTS: During the study, 588 patients presented with unscheduled dental visits. Of these cases, 275 (46.8%) were evaluated to be routine dental conditions for which treatment could be delayed or self-managed, 243 (41.3%) required urgent dental care, and 60 (10.2%) required emergency dental intervention. The accuracy of Toothbuddy in identifying the correct category was 79.6% (468/588). Sensitivity and specificity in categorising routine vs non-routine conditions were 94.5% (95% confidence interval, 92.0%-97.1%) and 74.0% (95% confidence interval, 68.8%-79.2%), respectively. The app was generally well received and rated highly. CONCLUSIONS: Preliminary data suggest that Toothbuddy can perform accurate dental self-assessment for a suitable range of common dental concerns, making it a promising platform for virtual advice on spontaneous dental issues. Furthermore, dental facilities are typically not sized to handle the large volume of unplanned dental visits that may occur in a military population. SC apps that help users self-manage minor conditions or safely delay treatment without adversely affecting disease prognosis may preserve the limited capacity of dental facilities for providing acute care and managing true dental emergencies expediently.
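
As a quick illustration of the sensitivity/specificity arithmetic reported above, the sketch below computes the point estimates and normal-approximation (Wald) 95% CIs. The 2x2 counts are approximate reconstructions from the abstract's aggregate figures (treating a non-routine complaint as the positive class), not the trial's raw data.

```python
# Sketch: sensitivity/specificity with Wald-style 95% CIs.
# Counts below are approximate reconstructions, not raw study data.
from math import sqrt

def rate_with_ci(successes, total, z=1.96):
    """Proportion with a normal-approximation (Wald) confidence interval."""
    p = successes / total
    half = z * sqrt(p * (1 - p) / total)
    return p, p - half, p + half

# Positive = non-routine (urgent or emergency) complaint, n = 243 + 60 = 303.
tp, fn = 286, 17   # non-routine cases flagged as non-routine vs missed
tn, fp = 204, 71   # routine cases (n = 275) kept routine vs escalated

sens, sens_lo, sens_hi = rate_with_ci(tp, tp + fn)
spec, spec_lo, spec_hi = rate_with_ci(tn, tn + fp)
print(f"sensitivity {sens:.1%} (95% CI {sens_lo:.1%}-{sens_hi:.1%})")
print(f"specificity {spec:.1%} (95% CI {spec_lo:.1%}-{spec_hi:.1%})")
```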

2.
JMIR Med Inform ; 12: e57162, 2024 Aug 14.
Article in English | MEDLINE | ID: mdl-39149851

ABSTRACT

Background: In recent years, the implementation of artificial intelligence (AI) in health care has been progressively transforming medical fields, with clinical decision support systems (CDSSs) as a notable application. Laboratory tests are vital for accurate diagnoses, but increasing reliance on them presents challenges. The need for effective strategies for managing laboratory test interpretation is evident from the millions of monthly searches on the significance of test results. However, as the potential role of CDSSs in laboratory diagnostics grows, more research is needed to explore this area. Objective: The primary objective of our study was to assess the accuracy and safety of LabTest Checker (LTC), a CDSS designed to support medical diagnoses by analyzing both laboratory test results and patients' medical histories. Methods: This cohort study used a prospective data collection approach. A total of 101 patients aged ≥18 years, in stable condition, and requiring comprehensive diagnosis were enrolled. A panel of blood laboratory tests was conducted for each participant. Participants used LTC for test result interpretation. The accuracy and safety of the tool were assessed by comparing AI-generated suggestions to the recommendations of experienced doctors (consultants), which were considered the gold standard. Results: The system achieved 74.3% accuracy, with 100% sensitivity for emergency cases and 92.3% sensitivity for urgent cases. It potentially reduced unnecessary medical visits by 41.6% (42/101) and achieved 82.9% accuracy in identifying underlying pathologies. Conclusions: This study underscores the transformative potential of AI-based CDSSs in laboratory diagnostics, contributing to enhanced patient care, efficient health care systems, and improved medical outcomes. LTC's performance evaluation highlights the advancements in AI's role in laboratory medicine.

3.
J Med Internet Res ; 26: e56514, 2024 Aug 20.
Article in English | MEDLINE | ID: mdl-39163594

ABSTRACT

BACKGROUND: Emergency departments (EDs) are frequently overcrowded and increasingly used by nonurgent patients. Symptom checkers (SCs) offer on-demand access to disease suggestions and recommended actions, potentially improving overall patient flow. Contrary to the increasing use of SCs, there is a lack of supporting evidence based on direct patient use. OBJECTIVE: This study aimed to compare the diagnostic accuracy, safety, usability, and acceptance of 2 SCs, Ada and Symptoma. METHODS: We conducted a randomized, crossover, head-to-head, double-blinded study of consecutive adult patients presenting to the ED at University Hospital Erlangen. Patients completed both SCs, Ada and Symptoma. The primary outcome was the diagnostic accuracy of the SCs. In total, 6 blinded independent expert raters classified diagnostic concordance of SC suggestions with the final discharge diagnosis as (1) identical, (2) plausible, or (3) diagnostically different. SC suggestions per patient were additionally classified as safe or potentially life-threatening, and the concordance of Ada's and the physician-based triage category was assessed. Secondary outcomes were SC usability (5-point Likert scale: 1=very easy to use to 5=very difficult to use) and SC acceptance (net promoter score, NPS). RESULTS: A total of 450 patients completed the study between April and November 2021. The most common chief complaint was chest pain (160/437, 37%). The identical diagnosis was ranked first (or within the top 5 diagnoses) by Ada and Symptoma in 14% (59/437; 27%, 117/437) and 4% (16/437; 13%, 55/437) of patients, respectively. An identical or plausible diagnosis was ranked first (or within the top 5 diagnoses) by Ada and Symptoma in 58% (253/437; 75%, 329/437) and 38% (164/437; 64%, 281/437) of patients, respectively. Ada and Symptoma did not suggest potentially life-threatening diagnoses in 13% (56/437) and 14% (61/437) of patients, respectively. Ada correctly triaged, undertriaged, and overtriaged 34% (149/437), 13% (58/437), and 53% (230/437) of patients, respectively. A total of 88% (385/437) and 78% (342/437) of participants rated Ada and Symptoma as very easy or easy to use, respectively. Ada's NPS was -34 (55% [239/437] detractors; 21% [93/437] promoters) and Symptoma's NPS was -47 (63% [275/437] detractors; 16% [70/437] promoters). CONCLUSIONS: Ada demonstrated higher diagnostic accuracy than Symptoma, and substantially more patients would recommend Ada and rated it easy to use. The high number of potentially life-threatening diagnoses unrecognized by both SCs and the rate of inappropriate triage advice from Ada are alarming. Overall, the trustworthiness of SC recommendations appears questionable. SC authorization should necessitate rigorous clinical evaluation studies to prevent misdiagnoses, fatal triage advice, and misuse of scarce medical resources. TRIAL REGISTRATION: German Register of Clinical Trials DRKS00024830; https://drks.de/search/en/trial/DRKS00024830.
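
The net promoter score figures can be reproduced directly from the reported counts: NPS is the percentage of promoters minus the percentage of detractors. A minimal sketch (the small discrepancy for Ada, -33 vs the reported -34, comes from the paper subtracting pre-rounded percentages):

```python
# Sketch: net promoter score (NPS) = % promoters - % detractors.
def nps(promoters, detractors, total):
    return 100 * (promoters - detractors) / total

print(round(nps(93, 239, 437)))   # Ada: -33 (reported -34 via rounded percentages)
print(round(nps(70, 275, 437)))   # Symptoma: -47
```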


Subject(s)
Cross-Over Studies , Emergency Service, Hospital , Humans , Emergency Service, Hospital/statistics & numerical data , Double-Blind Method , Male , Female , Middle Aged , Adult , Aged , Triage/methods
4.
J Med Internet Res ; 26: e55542, 2024 Jul 23.
Article in English | MEDLINE | ID: mdl-39042425

ABSTRACT

BACKGROUND: The diagnosis of inflammatory rheumatic diseases (IRDs) is often delayed due to unspecific symptoms and a shortage of rheumatologists. Digital diagnostic decision support systems (DDSSs) have the potential to expedite diagnosis and help patients navigate the health care system more efficiently. OBJECTIVE: The aim of this study was to assess the diagnostic accuracy of a mobile artificial intelligence (AI)-based symptom checker (Ada) and a web-based self-referral tool (Rheport) regarding IRDs. METHODS: A prospective, multicenter, open-label, crossover randomized controlled trial was conducted with patients newly presenting to 3 rheumatology centers. Participants were randomly assigned to complete a symptom assessment using either Ada or Rheport. The primary outcome was the correct identification of IRDs by the DDSSs, defined as the presence of any IRD in the list of suggested diagnoses by Ada or achieving a prespecified threshold score with Rheport. The gold standard was the diagnosis made by rheumatologists. RESULTS: A total of 600 patients were included, among whom 214 (35.7%) were diagnosed with an IRD. The most frequent IRD was rheumatoid arthritis, affecting 69 (11.5%) patients. Rheport's disease suggestion and Ada's top 1 (D1) and top 5 (D5) disease suggestions demonstrated overall diagnostic accuracies of 52%, 63%, and 58%, respectively, for IRDs. Rheport showed a sensitivity of 62% and a specificity of 47% for IRDs. Ada's D1 and D5 disease suggestions showed a sensitivity of 52% and 66%, respectively, and a specificity of 68% and 54%, respectively, concerning IRDs. Ada's diagnostic accuracy regarding individual diagnoses was heterogeneous, and Ada performed considerably better in identifying rheumatoid arthritis than other diagnoses (D1: 42%; D5: 64%). The Cohen κ statistic of Rheport for agreement on any rheumatic disease diagnosis with Ada D1 was 0.15 (95% CI 0.08-0.18) and with Ada D5 was 0.08 (95% CI 0.00-0.16), indicating poor agreement for the presence of any rheumatic disease between the 2 DDSSs. CONCLUSIONS: To our knowledge, this is the largest comparative DDSS trial with actual use of DDSSs by patients. The diagnostic accuracies of both DDSSs for IRDs were not promising in this high-prevalence patient population. DDSSs may lead to a misuse of scarce health care resources. Our results underscore the need for stringent regulation and drastic improvements to ensure the safety and efficacy of DDSSs. TRIAL REGISTRATION: German Register of Clinical Trials DRKS00017642; https://drks.de/search/en/trial/DRKS00017642.
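
For readers unfamiliar with the agreement statistic used here, the sketch below shows a Cohen κ computation with scikit-learn; the per-patient label vectors are synthetic stand-ins, not the study data.

```python
# Sketch: Cohen's kappa for agreement between two DDSSs.
# Label vectors are synthetic placeholders, not study data.
from sklearn.metrics import cohen_kappa_score

# 1 = tool suggests any rheumatic disease for the patient, 0 = it does not.
rheport = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]
ada_d1  = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]

print(cohen_kappa_score(rheport, ada_d1))  # ~0.23 here; near 0 = poor agreement
```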


Subject(s)
Artificial Intelligence , Rheumatology , Humans , Female , Male , Middle Aged , Prospective Studies , Rheumatology/methods , Adult , Cross-Over Studies , Rheumatic Diseases/diagnosis , Internet , Aged , Referral and Consultation/statistics & numerical data
5.
JMIR AI ; 3: e46875, 2024 Apr 29.
Article in English | MEDLINE | ID: mdl-38875676

ABSTRACT

BACKGROUND: Medical self-diagnostic tools (or symptom checkers) are becoming an integral part of digital health and our daily lives, and patients are increasingly using them to identify the underlying causes of their symptoms. As such, it is essential to rigorously investigate and comprehensively report the diagnostic performance of symptom checkers using standard clinical and scientific approaches. OBJECTIVE: This study aims to evaluate and report the accuracies of a few known and new symptom checkers using a standard and transparent methodology, which allows the scientific community to cross-validate and reproduce the reported results, a step much needed in health informatics. METHODS: We propose a 4-stage experimentation methodology that capitalizes on the standard clinical vignette approach to evaluate 6 symptom checkers. To this end, we developed and peer-reviewed 400 vignettes, each approved by at least 5 out of 7 independent and experienced primary care physicians. To establish a frame of reference and interpret the results of symptom checkers accordingly, we further compared the best-performing symptom checker against 3 primary care physicians with an average experience of 16.6 (SD 9.42) years. To measure accuracy, we used 7 standard metrics, including M1 as a measure of a symptom checker's or a physician's ability to return a vignette's main diagnosis at the top of their differential list, F1-score as a trade-off measure between recall and precision, and Normalized Discounted Cumulative Gain (NDCG) as a measure of a differential list's ranking quality, among others. RESULTS: The diagnostic accuracies of the 6 tested symptom checkers vary significantly. For instance, the ranges (differences between the best-performing and worst-performing symptom checkers) in M1, F1-score, and NDCG were 65.3%, 39.2%, and 74.2%, respectively. The same was observed among the participating human physicians, whose M1, F1-score, and NDCG ranges were 22.8%, 15.3%, and 21.3%, respectively. When compared against each other, physicians outperformed the best-performing symptom checker by an average of 1.2% using F1-score, whereas the best-performing symptom checker outperformed physicians by averages of 10.2% and 25.1% using M1 and NDCG, respectively. CONCLUSIONS: The performance variation between symptom checkers is substantial, suggesting that symptom checkers cannot be treated as a single entity. On a different note, the best-performing symptom checker was an artificial intelligence (AI)-based one, shedding light on the promise of AI in improving the diagnostic capabilities of symptom checkers, especially as AI keeps advancing rapidly.
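
NDCG may be the least familiar of the listed metrics. A hedged sketch, assuming binary relevance (1 for the vignette's main diagnosis, 0 otherwise): with a single relevant item, the ideal DCG is 1, so NDCG reduces to 1/log2(rank + 1).

```python
# Sketch: NDCG for a ranked differential list with binary relevance.
from math import log2

def ndcg_at_k(relevance, k):
    dcg = sum(rel / log2(i + 2) for i, rel in enumerate(relevance[:k]))
    ideal = sorted(relevance, reverse=True)
    idcg = sum(rel / log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg else 0.0

# Main diagnosis returned at rank 3 of a 5-item differential list:
print(ndcg_at_k([0, 0, 1, 0, 0], k=5))  # 1 / log2(4) = 0.5
```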

6.
J Med Internet Res ; 26: e50344, 2024 Jun 05.
Article in English | MEDLINE | ID: mdl-38838309

ABSTRACT

The growing prominence of artificial intelligence (AI) in mobile health (mHealth) has given rise to a distinct subset of apps that provide users with diagnostic information using their inputted health status and symptom information: AI-powered symptom checker apps (AISympCheck). While these apps may potentially increase access to health care, they raise consequential ethical and legal questions. This paper highlights notable concerns with AI usage in the health care system: the further entrenchment of preexisting biases and issues with professional accountability. To provide an in-depth analysis of the issues of bias and the complications of professional obligations and liability, we focus on 2 mHealth apps as examples, Babylon and Ada. We selected these 2 apps as they were both widely distributed during the COVID-19 pandemic and make prominent claims about their use of AI for the purpose of assessing user symptoms. First, bias entrenchment often originates from the data used to train AI systems, causing the AI to replicate these inequalities through a "garbage in, garbage out" phenomenon. Users of these apps are also unlikely to be demographically representative of the larger population, leading to distorted results. Second, professional accountability poses a substantial challenge given the vast diversity and lack of regulation surrounding the reliability of AISympCheck apps. It is unclear whether these apps should be subject to safety reviews, who is responsible for app-mediated misdiagnosis, and whether these apps ought to be recommended by physicians. With the rapidly increasing number of apps, there remains little guidance available for health professionals. Professional bodies and advocacy organizations have a particularly important role to play in addressing these ethical and legal gaps. Implementing technical safeguards within these apps could mitigate bias, AIs could be trained with primarily neutral data, and apps could be subject to a system of regulation to allow users to make informed decisions. In our view, it is critical that these legal concerns be considered throughout the design and implementation of these potentially disruptive technologies. Entrenched bias and professional responsibility, while operating in different ways, are ultimately exacerbated by the unregulated nature of mHealth.


Subject(s)
Artificial Intelligence , COVID-19 , Mobile Applications , Telemedicine , Humans , Artificial Intelligence/ethics , Bias , SARS-CoV-2 , Pandemics , Social Responsibility
7.
J Med Internet Res ; 26: e58157, 2024 Jun 27.
Article in English | MEDLINE | ID: mdl-38809606

ABSTRACT

BACKGROUND: Symptom-checkers have become important tools for self-triage, assisting patients in determining the urgency of medical care. To be safe and effective, these tools must be validated, particularly to avoid potentially hazardous undertriage without producing inefficient overtriage. Only limited safety data from studies with small sample sizes have been available so far. OBJECTIVE: The objective of our study was to prospectively investigate the safety of patients' self-triage in a large patient sample. We used SMASS (Swiss Medical Assessment System; in4medicine, Inc) pathfinder, a symptom-checker based on a computerized transparent neural network. METHODS: We recruited 2543 patients into this single-center, prospective clinical trial conducted at the cantonal hospital of Baden, Switzerland. Patients with an Emergency Severity Index of 1-2 were treated by the team of the emergency department, while those with an index of 3-5 were seen at the walk-in clinic by general physicians. We compared the triage recommendation obtained by the patients' self-triage with the assessment of clinical urgency made by 3 successive interdisciplinary panels of physicians (panels A, B, and C). Using the Clopper-Pearson CI, we prespecified that, to confirm the symptom-checker's safety, the upper confidence bound for the probability of a potentially hazardous undertriage should lie below 1%. A potentially hazardous undertriage was defined as a triage in which either all (consensus criterion) or the majority (majority criterion) of the experts of the last panel (panel C) rated the triage of the symptom-checker as "rather likely" or "likely" to be life-threatening or harmful. RESULTS: Of the 2543 patients, 1227 (48.25%) were female and 1316 (51.75%) male. None of the patients met the prespecified consensus criterion for a potentially hazardous undertriage, resulting in an upper 95% confidence bound of 0.1184%. Further, 4 cases met the majority criterion, resulting in an upper 95% confidence bound for the probability of a potentially hazardous undertriage of 0.3616%. The 2-sided 95% Clopper-Pearson CI for the probability of overtriage (n=450 cases, 17.69%) was 16.23% to 19.24%, which is considerably lower than the figures reported in the literature. CONCLUSIONS: The symptom-checker proved to be a safe triage tool, avoiding potentially hazardous undertriage in a real-life clinical setting of emergency consultations at a walk-in clinic or emergency department without causing undesirable overtriage. Our data suggest the symptom-checker may be safely used in clinical routine. TRIAL REGISTRATION: ClinicalTrials.gov NCT04055298; https://clinicaltrials.gov/study/NCT04055298.
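
The exact (Clopper-Pearson) bounds quoted above can be reproduced with the beta-quantile formulation. A minimal sketch, assuming a one-sided 95% bound and the full enrolment of n=2543 as the denominator (the trial's analysed denominator may differ slightly after exclusions, which would account for small deviations from the reported 0.1184% and 0.3616%):

```python
# Sketch: exact (Clopper-Pearson) upper confidence bound for a binomial
# proportion, via the beta-distribution quantile.
from scipy.stats import beta

def cp_upper(events, n, alpha=0.05):
    return beta.ppf(1 - alpha, events + 1, n - events)

print(f"{cp_upper(0, 2543):.4%}")  # consensus criterion, 0 events: ~0.12%
print(f"{cp_upper(4, 2543):.4%}")  # majority criterion, 4 events: ~0.36%
```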


Subject(s)
Emergency Service, Hospital , Triage , Adult , Aged , Female , Humans , Male , Middle Aged , Emergency Service, Hospital/statistics & numerical data , Patient Safety/statistics & numerical data , Prospective Studies , Switzerland , Triage/methods
8.
JMIR Form Res ; 8: e49907, 2024 May 31.
Article in English | MEDLINE | ID: mdl-38820578

ABSTRACT

BACKGROUND: The rapid growth of web-based symptom checkers (SCs) is not matched by advances in quality assurance. Currently, there are no widely accepted criteria for assessing SCs' performance. Vignette studies are widely used to evaluate SCs, measuring the accuracy of outcome. Accuracy behaves as a composite metric, as it is affected by a number of individual SC- and tester-dependent factors. In contrast to clinical studies, vignette studies have a small number of testers. Hence, measuring accuracy alone in vignette studies may not provide a reliable assessment of performance due to tester variability. OBJECTIVE: This study aims to investigate the impact of tester variability on the outcome accuracy of SCs, using clinical vignettes. It further aims to investigate the feasibility of measuring isolated aspects of performance. METHODS: Healthily's SC was assessed using 114 vignettes by 3 groups of 3 testers who processed vignettes under different instructions: free interpretation of vignettes (free testers), specified chief complaints (partially free testers), and specified chief complaints with strict instructions for answering additional symptoms (restricted testers). κ statistics were calculated to assess agreement on the top outcome condition and recommended triage. Crude and adjusted accuracy was measured against a gold standard. Adjusted accuracy was calculated using only the results of consultations identical to the vignette, following a review and selection process. A feasibility study for assessing the symptom comprehension of SCs was performed using different variations of 51 chief complaints across 3 SCs. RESULTS: Intertester agreement on the most likely condition and triage was, respectively, 0.49 and 0.51 for the free tester group, 0.66 and 0.66 for the partially free group, and 0.72 and 0.71 for the restricted group. For the restricted group, accuracy ranged from 43.9% to 57% for individual testers, averaging 50.6% (SD 5.35%). Adjusted accuracy was 56.1%. Assessing symptom comprehension was feasible for all 3 SCs. Comprehension scores ranged from 52.9% to 68%. CONCLUSIONS: We demonstrated that improving the standardization of the vignette testing process significantly improves agreement of outcome between testers. However, significant variability remained due to uncontrollable tester-dependent factors, reflected by varying outcome accuracy. Tester-dependent factors, combined with a small number of testers, limit the reliability and generalizability of outcome accuracy when used as a composite measure in vignette studies. Measuring and reporting different aspects of SC performance in isolation provides a more reliable assessment of SC performance. We developed an adjusted accuracy measure using a review and selection process to assess data algorithm quality. In addition, we demonstrated that symptom comprehension with different input methods can be feasibly compared. Future studies reporting accuracy need to apply vignette testing standardization and isolated metrics.
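
As an illustration of how intertester agreement among the 3 testers in each group can be quantified, the sketch below uses Fleiss' κ (a generalisation of Cohen's κ to more than 2 raters) via statsmodels; the ratings are synthetic placeholders, not study data.

```python
# Sketch: Fleiss' kappa for agreement among 3 testers.
# Rows = vignettes, columns = testers, values = coded top outcome condition.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

ratings = np.array([
    [0, 0, 0],
    [1, 1, 2],
    [2, 2, 2],
    [0, 1, 0],
    [1, 1, 1],
])
table, _ = aggregate_raters(ratings)  # subjects x categories count matrix
print(fleiss_kappa(table))
```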

9.
JMIR Form Res ; 8: e53985, 2024 May 17.
Article in English | MEDLINE | ID: mdl-38758588

ABSTRACT

BACKGROUND: Artificial intelligence (AI) symptom checker models should be trained using real-world patient data to improve their diagnostic accuracy. Given that AI-based symptom checkers are currently used in clinical practice, their performance should improve over time. However, longitudinal evaluations of the diagnostic accuracy of these symptom checkers are limited. OBJECTIVE: This study aimed to assess longitudinal changes in the accuracy of differential diagnosis lists created by an AI-based symptom checker used in the real world. METHODS: This was a single-center, retrospective, observational study. Patients who visited an outpatient clinic without an appointment between May 1, 2019, and April 30, 2022, and who were admitted to a community hospital in Japan within 30 days of their index visit were considered eligible. We only included patients who underwent an AI-based symptom-checker assessment at the index visit and whose diagnosis was finally confirmed during follow-up. Final diagnoses were categorized as common or uncommon, and all cases were categorized as typical or atypical. The primary outcome measure was the accuracy of the differential diagnosis list created by the AI-based symptom checker, defined as the presence of the final diagnosis in the list of 10 differential diagnoses created by the symptom checker. To assess the change in the symptom checker's diagnostic accuracy over 3 years, we used a chi-square test to compare the primary outcome over 3 periods: from May 1, 2019, to April 30, 2020 (first year); from May 1, 2020, to April 30, 2021 (second year); and from May 1, 2021, to April 30, 2022 (third year). RESULTS: A total of 381 patients were included. Common diseases comprised 257 (67.5%) cases, and typical presentations were observed in 298 (78.2%) cases. Overall, the differential diagnosis list created by the AI-based symptom checker was accurate in 172 (45.1%) cases, and this did not differ across the 3 years (first year: 97/219, 44.3%; second year: 32/72, 44.4%; and third year: 43/90, 47.7%; P=.85). The accuracy of the differential diagnosis list created by the symptom checker was low in those with uncommon diseases (30/124, 24.2%) and atypical presentations (12/83, 14.5%). In the multivariate logistic regression model, common disease (P<.001; odds ratio 4.13, 95% CI 2.50-6.98) and typical presentation (P<.001; odds ratio 6.92, 95% CI 3.62-14.2) were significantly associated with the accuracy of the differential diagnosis list created by the symptom checker. CONCLUSIONS: A 3-year longitudinal survey of the diagnostic accuracy of differential diagnosis lists developed by an AI-based symptom checker, implemented in real-world clinical practice settings, showed no improvement over time. Uncommon diseases and atypical presentations were independently associated with lower diagnostic accuracy. In the future, symptom checkers should be trained to recognize uncommon conditions.
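
The year-over-year comparison can be reconstructed from the counts in the abstract. A small sketch of the chi-square test (hits vs misses per year):

```python
# Sketch: chi-square test of diagnostic accuracy across the 3 study years,
# using the hit/miss counts reported in the abstract.
from scipy.stats import chi2_contingency

table = [
    [97, 219 - 97],   # first year: 97/219 correct
    [32, 72 - 32],    # second year: 32/72 correct
    [43, 90 - 43],    # third year: 43/90 correct
]
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2={chi2:.2f}, dof={dof}, p={p:.2f}")  # p ~ .85: no change over time
```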

10.
J Telemed Telecare ; : 1357633X241245161, 2024 Apr 22.
Article in English | MEDLINE | ID: mdl-38646705

ABSTRACT

INTRODUCTION: Online symptom checkers are a way to address patient concerns and potentially offload a burdened healthcare system. However, the safety outcomes of self-triage are unknown, so we reviewed the triage recommendations and outcomes of our institution's depression symptom checker. METHODS: We examined endpoint recommendations and follow-up encounters within seven days afterward during 2 December 2021 to 13 December 2022. For patients with an emergency department visit or hospitalization within seven days of self-triaging, the electronic health record was manually reviewed to determine whether the visit was related to depression, suicidal ideation, or suicide attempt. Charts were also reviewed for deaths within seven days of self-triage. RESULTS: There were 287 unique encounters from 263 unique patients. In 86.1% (247/287) of encounters, the endpoint was an instruction to call nurse triage; in 3.1% (9/287), the instruction was to seek emergency care. Only 20.2% (58/287) followed the recommendations given. Of the 229 patients who did not follow the endpoint recommendations, 121 (52.8%) had some type of follow-up within seven days. Nearly 11% (31/287) were triaged to endpoints not requiring urgent contact, and 9.1% (26/287) to an endpoint that did not require any healthcare team input. No patients died during the study period. CONCLUSIONS: Most patients did not follow the recommendations for follow-up care, although ultimately most patients did receive care within seven days. Self-triage appears to appropriately sort patients with depressed mood to emergency care. Online self-triage tools for depression have the potential to safely offload some work from clinic personnel.

11.
J Am Med Inform Assoc ; 31(9): 2002-2009, 2024 Sep 01.
Article in English | MEDLINE | ID: mdl-38679900

ABSTRACT

OBJECTIVES: To evaluate demographic biases in diagnostic accuracy and health advice between generative artificial intelligence (AI) (ChatGPT GPT-4) and traditional symptom checkers like WebMD. MATERIALS AND METHODS: Combination symptom and demographic vignettes were developed for the 27 most common symptom complaints. Standardized prompts, written from a patient perspective, with varying demographic permutations of age, sex, and race/ethnicity, were entered into ChatGPT (GPT-4) between July and August 2023. In total, 3 runs of 540 ChatGPT prompts were compared to the corresponding WebMD Symptom Checker output using a mixed-methods approach. In addition to diagnostic correctness, the associated text generated by ChatGPT was analyzed for readability (using the Flesch-Kincaid Grade Level) and qualitative aspects like disclaimers and demographic tailoring. RESULTS: ChatGPT matched WebMD in 91% of diagnoses, with a 24% top-diagnosis match rate. Diagnostic accuracy was not significantly different across demographic groups, including age, race/ethnicity, and sex. ChatGPT's urgent care recommendations and demographic tailoring were presented significantly more often to 75-year-olds versus 25-year-olds (P < .01) but were not statistically different among race/ethnicity and sex groups. The GPT text was written at a reading level suitable for college students, with no significant demographic variability. DISCUSSION: The use of non-health-tailored generative AI, like ChatGPT, for simple symptom-checking functions provides diagnostic accuracy comparable to commercially available symptom checkers and does not demonstrate significant demographic bias in this setting. The text accompanying the differential diagnoses, however, suggests demographic tailoring that could potentially introduce bias. CONCLUSION: These results highlight the need for continued rigorous evaluation of AI-driven medical platforms, focusing on demographic biases to ensure equitable care.
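
For context, the Flesch-Kincaid Grade Level is a closed-form readability score: 0.39 x (words/sentences) + 11.8 x (syllables/words) - 15.59. A hedged sketch using a crude vowel-group syllable heuristic, so the output is approximate:

```python
# Sketch: Flesch-Kincaid Grade Level with a naive syllable counter.
import re

def syllables(word):
    # Crude heuristic: count groups of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade(text):
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syl = sum(syllables(w) for w in words)
    return 0.39 * len(words) / sentences + 11.8 * syl / len(words) - 15.59

sample = ("Chest pain with shortness of breath may signal a heart attack. "
          "Seek emergency care immediately.")
print(round(fk_grade(sample), 1))
```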


Subject(s)
Artificial Intelligence , Humans , Female , Male , Demography , Sociodemographic Factors , Adult
12.
Digit Health ; 10: 20552076241231555, 2024.
Article in English | MEDLINE | ID: mdl-38434790

ABSTRACT

Background: Symptom checker apps (SCAs) offer symptom classification and low-threshold self-triage for laypeople. They are already in use despite their poor accuracy and concerns that they may negatively affect primary care. This study assesses the extent to which SCAs are used by medical laypeople in Germany and which software is most popular. We examined associations between satisfaction with the general practitioner (GP) and SCA use, as well as between the number of GP visits and SCA use. Furthermore, we assessed the reasons for intentional non-use. Methods: We conducted a survey comprising standardised and open-ended questions. Quantitative data were weighted, and open-ended responses were examined using thematic analysis. Results: This study included 850 participants. The SCA usage rate was 8%, and approximately 50% of SCA non-users were uninterested in trying SCAs. The most commonly used SCAs were NetDoktor and Ada. Surprisingly, SCAs were most frequently used in the age group of 51-55 years. No significant associations were found between SCA usage and satisfaction with the GP, or between SCA usage and the number of GP visits. Thematic analysis revealed skepticism regarding the results and recommendations of SCAs and discrepancies between users' requirements and the features of the apps. Conclusion: SCAs are still widely unknown in the German population and have been sparsely used so far. Many participants were not interested in trying SCAs, and we found no positive or negative associations between SCAs and primary care.

13.
Cureus ; 16(2): e53899, 2024 Feb.
Article in English | MEDLINE | ID: mdl-38465163

ABSTRACT

Introduction With the expanding awareness and use of AI-powered chatbots, it seems possible that an increasing number of people could use them to assess and evaluate their medical symptoms. If chatbots that have not previously undergone a thorough medical evaluation for this specific use are used for this purpose, various risks might arise. The aim of this study is to analyze and compare the performance of popular chatbots in differentiating between severe and less critical medical symptoms described from a patient's perspective, and to examine the variations in substantive medical assessment accuracy and empathetic communication style among the chatbots' responses. Materials and methods Our study compared three different AI-supported chatbots: OpenAI's ChatGPT 3.5, Microsoft's Bing Chat, and Inflection's Pi AI. Three exemplary case reports for medical emergencies, as well as three cases without an urgent reason for an emergency medical admission, were constructed and analyzed. Each case report was accompanied by identical questions concerning the most likely suspected diagnosis and the urgency of an immediate medical evaluation. The respective answers of the chatbots were qualitatively compared with each other regarding the medical accuracy of the differential diagnoses mentioned and the conclusions drawn, as well as regarding patient-oriented and empathetic language. Results All examined chatbots were capable of providing medically plausible and probable diagnoses and of classifying situations as acute or less critical. However, their responses varied slightly in the level of their urgency assessment. Clear differences could be seen in the level of detail of the differential diagnoses, the overall length of the answers, and how the chatbot dealt with the challenge of being confronted with medical issues. All given answers were comparable in terms of empathy level and comprehensibility. Conclusion Even AI chatbots that are not designed for medical applications already offer substantial guidance in assessing typical medical emergency indications but should always be provided with a disclaimer. In responding to medical queries, characteristic differences emerge among chatbots in the extent and style of their respective answers. Given the lack of medical supervision of many established chatbots, subsequent studies and experience are essential to clarify whether more extensive use of these chatbots for medical concerns will have a positive impact on healthcare or rather pose major medical risks.

14.
BMC Med Ethics ; 25(1): 17, 2024 02 16.
Article in English | MEDLINE | ID: mdl-38365749

ABSTRACT

BACKGROUND: Symptom checker apps (SCAs) are mobile or online applications for lay people that usually have two main functions: symptom analysis and recommendations. SCAs ask users questions about their symptoms via a chatbot, give a list of possible causes, and provide a recommendation, such as seeing a physician. However, it is unclear whether the actual performance of an SCA corresponds to the users' experiences. This qualitative study investigates the subjective perspectives of SCA users to close the empirical gap identified in the literature and answers the following main research question: How do individuals (healthy users and patients) experience the usage of SCAs, including their attitudes, expectations, motivations, and concerns regarding their SCA use? METHODS: A qualitative interview study was chosen to clarify the relatively unknown experience of SCA use. Semi-structured qualitative interviews with SCA users were carried out by two researchers in tandem via video call. Qualitative content analysis was selected as the methodology for the data analysis. RESULTS: Fifteen interviews with SCA users were conducted and seven main categories identified: (1) Attitudes towards findings and recommendations, (2) Communication, (3) Contact with physicians, (4) Expectations (prior to use), (5) Motivations, (6) Risks, and (7) SCA use for others. CONCLUSIONS: The aspects identified in the analysis emphasise the specific perspective of SCA users and, at the same time, the immense scope of different experiences. Moreover, the study reveals ethical issues, such as relational aspects, that are often overlooked in debates on mHealth. Both empirical and ethical research is needed, as awareness of the subjective experience of those affected is an essential component in the responsible development and implementation of health apps such as SCAs. TRIAL REGISTRATION: German Clinical Trials Register (DRKS): DRKS00022465. 07/08/2020.


Subject(s)
Mobile Applications , Physicians , Telemedicine , Humans , Qualitative Research , Communication
15.
BMC Med Inform Decis Mak ; 24(1): 21, 2024 Jan 23.
Article in English | MEDLINE | ID: mdl-38262993

ABSTRACT

BACKGROUND: Symptom checker applications (SCAs) may help laypeople classify their symptoms and receive recommendations on medically appropriate actions. Further research is necessary to estimate the influence of user characteristics, attitudes, and (e)health-related competencies. OBJECTIVE: The objective of this study is to identify meaningful predictors of SCA use considering user characteristics. METHODS: An explorative cross-sectional survey was conducted to investigate German citizens' demographics, eHealth literacy, hypochondria, self-efficacy, and affinity for technology using German-language validated questionnaires. A total of 869 participants were eligible for inclusion in the study. The n=67 SCA users were matched 1:1 with non-users, so a sample of n=134 participants was assessed in the main analysis. A four-step analysis was conducted involving explorative predictor selection, model comparisons, and parameter estimates for selected predictors, including sensitivity and post hoc analyses. RESULTS: Hypochondria and self-efficacy were identified as meaningful predictors of SCA use. Hypochondria showed a consistent and significant effect across all analyses (OR 1.24-1.26, 95% CI 1.1-1.4). Self-efficacy (OR 0.64-0.93, 95% CI 0.3-1.4) showed inconsistent and nonsignificant results, leaving its role in SCA use unclear. Over half of the SCA users in our sample met the classification for hypochondria (cut-off of 5 on the WI). CONCLUSIONS: Hypochondria emerged as a significant predictor of SCA use with a consistently stable effect, yet according to the literature, individuals with this trait may be less likely to benefit from SCAs despite their greater likelihood of using them. These users could be further unsettled by risk-averse triage and unlikely but serious diagnosis suggestions. TRIAL REGISTRATION: The study was registered in the German Clinical Trials Register (DRKS) DRKS00022465, DERR1- https://doi.org/10.2196/34026 .
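
A minimal sketch of the kind of logistic model that yields such odds ratios, with synthetic placeholder data (the variable scales and coefficients are assumptions, not the study's):

```python
# Sketch: logistic regression of SCA use on hypochondria and self-efficacy.
# Data are synthetic placeholders; the "true" coefficients are assumptions.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "hypochondria": rng.normal(5, 2, 134),     # e.g. a WI-like score
    "self_efficacy": rng.normal(30, 5, 134),
})
true_logit = 0.22 * df["hypochondria"] - 0.05 * df["self_efficacy"]
df["sca_use"] = (rng.random(134) < 1 / (1 + np.exp(-true_logit))).astype(int)

X = sm.add_constant(df[["hypochondria", "self_efficacy"]])
fit = sm.Logit(df["sca_use"], X).fit(disp=0)
print(np.exp(fit.params))      # odds ratios per unit increase
print(np.exp(fit.conf_int()))  # 95% CIs on the OR scale
```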


Subject(s)
Mobile Applications , Humans , Cross-Sectional Studies , Language , Phenotype , Probability
16.
Stud Health Technol Inform ; 310: 514-518, 2024 Jan 25.
Article in English | MEDLINE | ID: mdl-38269862

ABSTRACT

We assessed the safety of a new clinical decision support system (CDSS) for nurses on Australia's national consumer helpline. The accuracy and safety of triage advice were assessed by testing the CDSS with 78 standardised patient vignettes (48 published and 30 proprietary). Testing was undertaken in two cycles using the CDSS vendor's online evaluation tool (Cycle 1: 47 vignettes; Cycle 2: 41 vignettes). Safety equivalence was examined by testing the existing CDSS with the 47 vignettes from Cycle 1. The new CDSS triaged 66% of vignettes correctly, compared to 57% by the existing CDSS. 15% of vignettes were overtriaged by the new CDSS, compared to 28% by the existing CDSS. 19% of vignettes were undertriaged by the new CDSS, compared to 15% by the existing CDSS. The overall performance of the new CDSS appears consistent with, and comparable to, current studies. The new CDSS is at least as safe as the existing CDSS.


Subject(s)
Decision Support Systems, Clinical , Humans , Expert Systems , Software , Triage
17.
Rheumatol Int ; 44(1): 173-180, 2024 Jan.
Article in English | MEDLINE | ID: mdl-37316631

ABSTRACT

Patients with axial spondyloarthritis (axSpA) suffer one of the longest diagnostic delays among all rheumatic diseases. Telemedicine (TM) may reduce this diagnostic delay by providing easy access to care. Diagnostic rheumatology telehealth studies are scarce and largely limited to traditional synchronous approaches such as resource-intensive video and telephone consultations. The aim of this study was to investigate a stepwise asynchronous telemedicine-based diagnostic approach in patients with suspected axSpA. Patients with suspected axSpA completed a fully automated digital symptom assessment using two symptom checkers (SCs), bechterew-check and Ada. Secondly, a hybrid stepwise asynchronous TM approach was investigated. Three physicians and two medical students were given sequential access to SC symptom reports, laboratory results, and imaging results. After each step, participants had to state whether axSpA was present (yes/no) and rate their perceived decision confidence. Results were compared to the final diagnosis of the treating rheumatologist. 17 (47.2%) of the 36 included patients were diagnosed with axSpA. The diagnostic accuracy of bechterew-check, Ada, TM students, and TM physicians was 47.2%, 58.3%, 76.4%, and 88.9%, respectively. Access to imaging results significantly increased the sensitivity of TM physicians (p < 0.05). Mean diagnostic confidence for false axSpA classifications was not significantly lower than for correct axSpA classifications, for both students and physicians. This study underscores the potential of asynchronous physician-based telemedicine for patients with suspected axSpA. Similarly, the results highlight the need for sufficient information, especially imaging results, to ensure a correct diagnosis. Further studies are needed to investigate other rheumatic diseases and telediagnostic approaches.


Subject(s)
Axial Spondyloarthritis , Rheumatic Diseases , Spondylarthritis , Spondylitis, Ankylosing , Telemedicine , Humans , Spondylarthritis/diagnosis , Pilot Projects , Delayed Diagnosis , Spondylitis, Ankylosing/diagnosis
18.
JMIR Mhealth Uhealth ; 11: e46718, 2023 12 05.
Article in English | MEDLINE | ID: mdl-38051574

ABSTRACT

BACKGROUND: Reproductive health conditions such as endometriosis, uterine fibroids, and polycystic ovary syndrome (PCOS) affect a large proportion of women and people who menstruate worldwide. Prevalence estimates for these conditions range from 5% to 40% of women of reproductive age. Long diagnostic delays, up to 12 years, are common and contribute to health complications and increased health care costs. Symptom checker apps provide users with information and tools to better understand their symptoms and thus have the potential to reduce the time to diagnosis for reproductive health conditions. OBJECTIVE: This study aimed to evaluate the agreement between clinicians and 3 symptom checkers (developed by Flo Health UK Limited) in assessing symptoms of endometriosis, uterine fibroids, and PCOS using vignettes. We also aimed to present a robust example of vignette case creation, review, and classification in the context of predeployment testing and validation of digital health symptom checker tools. METHODS: Independent general practitioners were recruited to create clinical case vignettes of simulated users for the purpose of testing each condition's symptom checker; the vignettes created for each condition contained a mixture of condition-positive and condition-negative outcomes. A second panel of general practitioners then reviewed, approved, and modified (if necessary) each vignette. A third group of general practitioners reviewed each vignette case and designated a final classification. Vignettes were then entered into the symptom checkers by a fourth, different group of general practitioners. The outcomes of each symptom checker were then compared with the final classification of each vignette to produce accuracy metrics, including percent agreement, sensitivity, specificity, positive predictive value, and negative predictive value. RESULTS: A total of 24 cases were created per condition. Overall, exact matches between the vignette general practitioner classification and the symptom checker outcome occurred in 83% (n=20) of cases for endometriosis, 83% (n=20) for uterine fibroids, and 88% (n=21) for PCOS. For each symptom checker, sensitivity was reported as 81.8% for endometriosis, 84.6% for uterine fibroids, and 100% for PCOS; specificity was reported as 84.6% for endometriosis, 81.8% for uterine fibroids, and 75% for PCOS; positive predictive value was reported as 81.8% for endometriosis, 84.6% for uterine fibroids, and 80% for PCOS; and negative predictive value was reported as 84.6% for endometriosis, 81.8% for uterine fibroids, and 100% for PCOS. CONCLUSIONS: The single-condition symptom checkers showed high levels of agreement with general practitioner classification for endometriosis, uterine fibroids, and PCOS. Given the long delays in diagnosis for many reproductive health conditions, which lead to increased medical costs and potential health complications for individuals and health care providers, innovative health apps and symptom checkers hold the potential to improve care pathways.
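
The PCOS figures are internally consistent with a single 2x2 table, which makes for a compact worked example. The counts below are inferred from the reported percentages (24 vignettes, 88% exact match, 100% sensitivity, 75% specificity), so treat them as a reconstruction rather than published raw data:

```python
# Sketch: accuracy metrics for the PCOS checker from a reconstructed 2x2 table.
tp, fp, fn, tn = 12, 3, 0, 9

sensitivity = tp / (tp + fn)                  # 12/12 = 100%
specificity = tn / (tn + fp)                  # 9/12  = 75%
ppv = tp / (tp + fp)                          # 12/15 = 80%
npv = tn / (tn + fn)                          # 9/9   = 100%
agreement = (tp + tn) / (tp + fp + fn + tn)   # 21/24 ~ 88%
print(sensitivity, specificity, ppv, npv, agreement)
```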


Subject(s)
Endometriosis , Leiomyoma , Humans , Female , Endometriosis/diagnosis , Endometriosis/complications , Reproductive Health , Leiomyoma/diagnosis , Leiomyoma/complications , Prevalence
19.
JMIR Mhealth Uhealth ; 11: e49995, 2023 10 03.
Article in English | MEDLINE | ID: mdl-37788063

ABSTRACT

BACKGROUND: Diagnosis is a core component of effective health care, but misdiagnosis is common and can put patients at risk. Diagnostic decision support systems can play a role in improving diagnosis by physicians and other health care workers. Symptom checkers (SCs) have been designed to improve diagnosis and triage (ie, which level of care to seek) by patients. OBJECTIVE: The aim of this study was to evaluate the performance of the new large language model ChatGPT (versions 3.5 and 4.0), the widely used WebMD SC, and an SC developed by Ada Health in the diagnosis and triage of patients with urgent or emergent clinical problems compared with the final emergency department (ED) diagnoses and physician reviews. METHODS: We used previously collected, deidentified, self-report data from 40 patients presenting to an ED for care who used the Ada SC to record their symptoms prior to seeing the ED physician. Deidentified data were entered into ChatGPT versions 3.5 and 4.0 and WebMD by a research assistant blinded to diagnoses and triage. Diagnoses from all 4 systems were compared with the previously abstracted final diagnoses in the ED, as well as with diagnoses and triage recommendations from three independent board-certified ED physicians who had blindly reviewed the self-report clinical data from Ada. Diagnostic accuracy was calculated as the proportion of the diagnoses from ChatGPT, Ada SC, WebMD SC, and the independent physicians that matched at least one ED diagnosis (stratified as top 1 or top 3). Triage accuracy was calculated as the number of recommendations from ChatGPT, WebMD, or Ada that agreed with at least 2 of the independent physicians or were rated "unsafe" or "too cautious." RESULTS: Overall, 30 and 37 cases had sufficient data for diagnostic and triage analysis, respectively. The rate of top-1 diagnosis matches for Ada, ChatGPT 3.5, ChatGPT 4.0, and WebMD was 9 (30%), 12 (40%), 10 (33%), and 12 (40%), respectively, with a mean rate of 47% for the physicians. The rate of top-3 diagnostic matches for Ada, ChatGPT 3.5, ChatGPT 4.0, and WebMD was 19 (63%), 19 (63%), 15 (50%), and 17 (57%), respectively, with a mean rate of 69% for the physicians. The distribution of triage results for Ada was 62% (n=23) agree, 14% (n=5) unsafe, and 24% (n=9) too cautious; that for ChatGPT 3.5 was 59% (n=22) agree, 41% (n=15) unsafe, and 0% (n=0) too cautious; that for ChatGPT 4.0 was 76% (n=28) agree, 22% (n=8) unsafe, and 3% (n=1) too cautious; and that for WebMD was 70% (n=26) agree, 19% (n=7) unsafe, and 11% (n=4) too cautious. The unsafe triage rate for ChatGPT 3.5 (41%) was significantly higher (P=.009) than that of Ada (14%). CONCLUSIONS: ChatGPT 3.5 had high diagnostic accuracy but a high unsafe triage rate. ChatGPT 4.0 had the poorest diagnostic accuracy, but a lower unsafe triage rate and the highest triage agreement with the physicians. The Ada and WebMD SCs performed better overall than ChatGPT. Unsupervised patient use of ChatGPT for diagnosis and triage is not recommended without improvements to triage accuracy and extensive clinical evaluation.
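
The top-1/top-3 scoring described in the methods amounts to a simple membership test over the ranked differential. A small sketch with illustrative lists (not patient data):

```python
# Sketch: top-k diagnostic match scoring.
def top_k_match(differential, ed_diagnoses, k):
    return any(dx in ed_diagnoses for dx in differential[:k])

differential = ["renal colic", "appendicitis", "diverticulitis"]
ed_final = {"appendicitis"}
print(top_k_match(differential, ed_final, k=1))  # False
print(top_k_match(differential, ed_final, k=3))  # True
```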


Subject(s)
Physicians , Triage , Humans , Triage/methods , Emergency Service, Hospital , Health Personnel , Self Report
20.
Telemed Rep ; 4(1): 292-306, 2023.
Article in English | MEDLINE | ID: mdl-37817871

ABSTRACT

Objective: To complete a review of the literature on patient experience and satisfaction as it relates to the potential for virtual triage (VT), or symptom checkers, to enhance and enable improvements in these important health care delivery objectives. Methods: Review and synthesis of the literature on patient experience and satisfaction as informed by emerging evidence indicating the potential for VT to favorably impact these clinical care objectives and outcomes. Results/Conclusions: VT enhances potential clinical effectiveness through early detection and referral, can reduce avoidable care delivery caused by late clinical presentation, and can divert primary care needs to more clinically appropriate outpatient settings rather than high-acuity emergency departments. Delivery of earlier, faster, and more acuity-appropriate care, as well as patient avoidance of excess care acuity (and associated cost), offers promise as a contributor to improved patient experience and satisfaction. The application of digital triage as a front door to health care delivery organizations offers care engagement that can reduce patients' need to visit a medical facility for low-acuity conditions more suitable for self-care, thus avoiding unpleasant queues and reducing microbiological and other patient risks associated with visits to medical facilities. VT also offers an opportunity for providers to make patients' health care experiences more personalized.
