Pesquisa | Biblioteca Virtual em Saúde

1.

Accuracy of Large Language Models in Answering Ophthalmology Board-Style Questions: A Meta-Analysis.

Wu, Jo-Hsuan; Nishida, Takashi; Liu, T Y Alvin.

Asia Pac J Ophthalmol (Phila) ; : 100106, 2024 Oct 05.

Artigo em Inglês | MEDLINE | ID: mdl-39374807

RESUMO

PURPOSE: To evaluate the accuracy of large language models (LLMs) in answering ophthalmology board-style questions. DESIGN: Meta-analysis. METHODS: Literature search was conducted using PubMed and Embase in March 2024. We included full-length articles and research letters published in English that reported the accuracy of LLMs in answering ophthalmology board-style questions. Data on LLM performance, including the number of questions submitted and correct responses generated, were extracted for each question set from individual studies. Pooled accuracy was calculated using a random-effects model. Subgroup analyses were performed based on the LLMs used and specific ophthalmology topics addressed. RESULTS: Among the 14 studies retrieved, 13 (93%) tested LLMs on multiple ophthalmology topics. ChatGPT-3.5, ChatGPT-4, Bard, and Bing Chat were assessed in 12 (86%), 11 (79%), 4 (29%), and 4 (29%) studies, respectively. The overall pooled accuracy of LLMs was 0.65 (95% CI: 0.61-0.69). Among the different LLMs, ChatGPT-4 achieved the highest pooled accuracy at 0.74 (95% CI: 0.73-0.79), while ChatGPT-3.5 recorded the lowest at 0.52 (95% CI: 0.51-0.54). LLMs performed best in "pathology" (0.78 [95% CI: 0.70-0.86]) and worst in "fundamentals and principles of ophthalmology" (0.52 [95% CI: 0.48-0.56]). CONCLUSIONS: The overall accuracy of LLMs in answering ophthalmology board-style questions was acceptable but not exceptional, with ChatGPT-4 and Bing Chat being top-performing models. Performance varied significantly based on specific ophthalmology topics tested. Inconsistent performances are of concern, highlighting the need for future studies to include ophthalmology board-style questions with images to more comprehensively examine the competency of LLMs.

2.

Clinical Reasoning Skills Among Second-Phase Medical Students in West Bengal, India: An Exploratory Study.

Mukhopadhyay, Diptakanti; Choudhari, Sonali G.

Cureus ; 16(9): e68839, 2024 Sep.

Artigo em Inglês | MEDLINE | ID: mdl-39376810

RESUMO

Introduction Proper application of clinical reasoning skills is essential to reduce diagnostic and management errors. Explicit inclusion of training and assessment of clinical reasoning skills is the demand of time. The study intended to measure the clinical reasoning skills of second-phase undergraduate students in a medical college in West Bengal, India, and its distribution across several individual variables. Methods The clinical reasoning skills of 105 undergraduate medical students were assessed in a cross-sectional exploratory study using key feature questions (KFQs) with the partial credit scoring system. Six case vignettes aligned to the core competencies in the subject of pharmacology, pathology, and microbiology were designed and validated by the subject material experts for this purpose. The responses of the participants were collected through Google Forms (Google, Mountain View, CA) after obtaining written informed consent. The scores obtained in all KFQs were added and expressed in percentage of the maximum attainable score. Results The mean (±SD) clinical reasoning score of the participants was 42.5 (±12.6). Only 29.6% of respondents scored ≥ 50. Students with higher subjective economic status (p-value = 0.01) and perceived autonomy (p-value < 0.001) were more likely to have higher clinical reasoning scores. The marks obtained in previous summative examinations were significantly correlated with clinical reasoning scores. Conclusion Average score < 50.0 and inability to score ≥ 50.0 by more than two-thirds of the participants reflected the deficit in the clinical reasoning skills of second-phase MBBS students. The association of clinical reasoning skills with economic status, autonomy, and previous academic performances needs further exploration.

3.

Amplifying Their Voice: Inclusive Healthcare Provider Perspectives to Improve Advancement, Resilience, and Retention.

Breuner, Cora; Moore, Emily; Walsh, Elaine; Hilman, Stephanie; Mitzel, Julia; Thomas, Anita; Walker-Harding, Leslie.

Cureus ; 16(8): e66028, 2024 Aug.

Artigo em Inglês | MEDLINE | ID: mdl-39221340

RESUMO

BACKGROUND AND OBJECTIVES: Addressing the issues of workplace advancement, resilience, and retention within medicine is crucial for creating a culture of equity, respect, and inclusivity especially towards women and nonbinary (WNB) providers including advanced practice providers (APPs), most notably those from marginalized groups. This also directly impacts healthcare quality, patient outcomes, and overall patient and employee satisfaction. The purpose of this study was to amplify the voices on challenges faced by WNB providers within a pediatric academic healthcare organization, to rank workplace interventions addressing advancement, resilience, and retention highlighting urgency towards addressing these issues, and, lastly, to provide suggestions on how to improve inclusivity. METHODS: Participants were self-identified WNB providers employed by a pediatric healthcare organization and its affiliated medical university. An eligibility screener was completed by 150 qualified respondents, and 40 WNBs actually participated in study interviews. Interviews were conducted using a semi-structured interview guide to rank interventions targeted at improving equity, with time allotted for interviewees to discuss their personal lives and how individual circumstances impacted their professional experiences. RESULTS: WNB providers called for efficient workflows and reducing uncompensated job demands. Support for family responsibilities, flexible financial/compensation models, and improved job resources all were endorsed similarly. Participants ranked direct supervisor and leader support substantially lower than other interventions. Conclusions: Career mentorship and academic support for WNB individuals are recognized interventions for advancement and retention but were not ranked as top priorities. Respondents focused on personal supports as they relate to family, job resources, and flexible compensation models. Future studies should focus on implementing realistic expectations and structures that support whole lives including professional ambitions, time with family, personal pursuits, and self-care.

4.

Comparing suicide risk screening strategies in Spanish-speaking pediatric patients.

Papávero, Eliana Belén; Rodante, Demian Emanuel; Ingratta, Adriana Virginia; Gorrini, Antonio; Ralli, Eugenia; Rodante, Eliana; Arismendi, Mariana; Lowry, Nathan J; Ryan, Patrick; Bridge, Jeffrey A; Horowitz, Lisa; Daray, Federico Manuel.

Gen Hosp Psychiatry ; 91: 18-24, 2024 Aug 17.

Artigo em Inglês | MEDLINE | ID: mdl-39260188

RESUMO

BACKGROUND: Suicide and suicidal behaviors pose significant global public health challenges, especially among young individuals. Effective screening strategies are crucial for addressing this crisis, with depression screening and suicide-specific tools being common approaches. This study compares their effectiveness by evaluating the Ask Suicide-Screening Questions (ASQ) against item 9 of the Patient Health Questionnaire-A (PHQ-A). METHODS: This study is a secondary analysis of the Argentinean-Spanish version of the ASQ validation study, an observational, cross-sectional, and multicenter study conducted in medical settings in Buenos Aires, Argentina. A convenience sample of pediatric outpatients/inpatients aged 10 to 18 years completed the ASQ, PHQ-A, and Suicide Ideation Questionnaire (SIQ) along with clinical and sociodemographic questions. RESULTS: A sample of 267 children and adolescents were included in this secondary analysis. Results show that the ASQ exhibited higher sensitivity (95.1%; 95% CI: 83% - 99%) compared to PHQ-A item 9 (73.1%; 95% CI: 57% - 85%), and superior performance in identifying suicide risk in youth. LIMITATIONS: The study included a convenience sampling and was geographically restricted to Buenos Aires, Argentina. The study also lacked longitudinal follow-up to assess the predictive validity of these screening tools for suicide risk. CONCLUSION: The study highlights the ASQ's effectiveness in identifying suicide risk among youth, emphasizing the importance of specialized screening tools over depression screening tools alone for accurate risk assessment in this population.

5.

Performance of ChatGPT 4.0 on Japan's National Physical Therapist Examination: A Comprehensive Analysis of Text and Visual Question Handling.

Sawamura, Shogo; Kohiyama, Kengo; Takenaka, Takahiro; Sera, Tatsuya; Inoue, Tadatoshi; Nagai, Takashi.

Cureus ; 16(8): e67347, 2024 Aug.

Artigo em Inglês | MEDLINE | ID: mdl-39310431

RESUMO

INTRODUCTION: ChatGPT 4.0, a large-scale language model (LLM) developed by OpenAI, has demonstrated the capability to pass Japan's national medical examination and other medical assessments. However, the impact of imaging-based questions and different question types on its performance has not been thoroughly examined. This study evaluated ChatGPT 4.0's performance on Japan's national examination for physical therapists, particularly its ability to handle complex questions involving images and tables. The study also assessed the model's potential in the field of rehabilitation and its performance with Japanese language inputs. METHODS: The evaluation utilized 1,000 questions from the 54th to 58th national exams for physical therapists in Japan, comprising 160 general questions and 40 practical questions per exam. All questions were input in Japanese and included additional information such as images or tables. The answers generated by ChatGPT were then compared with the official correct answers. ANALYSIS: ChatGPT's performance was evaluated based on accuracy rates using various criteria: general and practical questions were analyzed with Fisher's exact test, A-type (single correct answer) and X2-type (two correct answers) questions, text-only questions versus questions with images and tables, and different question lengths using Student's t-test. RESULTS: ChatGPT 4.0 met the passing criteria with an overall accuracy of 73.4%. The accuracy rates for general and practical questions were 80.1% and 46.6%, respectively. No significant difference was found between the accuracy rates for A-type (74.3%) and X2-type (67.4%) questions. However, a significant difference was observed between the accuracy rates for text-only questions (80.5%) and questions with images and tables (35.4%). DISCUSSION: The results indicate that ChatGPT 4.0 satisfies the passing criteria for the national exam and demonstrates adequate knowledge and application skills. However, its performance on practical questions and those with images and tables is lower, indicating areas for improvement. The effective handling of Japanese inputs suggests its potential use in non-English-speaking regions. CONCLUSION: ChatGPT 4.0 can pass the national examination for physical therapists, particularly with text-based questions. However, improvements are needed for specialized practical questions and those involving images and tables. The model shows promise for supporting clinical rehabilitation and medical education in Japanese-speaking contexts, though further enhancements are required for a comprehensive application.

6.

Snapshot of Examination Usage in Emergency Medicine Clerkships.

Alley, William D; Husain, Iltifat; Briggs, Blake; Story, David.

Cureus ; 16(8): e67301, 2024 Aug.

Artigo em Inglês | MEDLINE | ID: mdl-39310468

RESUMO

Objective Emergency Medicine (EM) clerkships often use a written exam to assess the knowledge gained over the course of an EM rotation in medical school. Clerkship Directors (CDs) may choose the National Board of Medical Examiners (NBME) EM Advanced Clinical Science Subject Exam (ACE), the Society for Academic Emergency Medicine (SAEM) M4 exam, which has two versions, the SAEM M3 exam, or departmental exams. There are currently no published guidelines or consensus regarding their utility. This survey-based study was designed to collect data regarding current practices of EM clerkship exam usage to analyze trends and variability in what exams are used and how. Methods The authors designed a cross-sectional observational survey to collect data from EM CDs on exam utilization in clerkships. The survey population consisted of clerkship directors, assistant clerkship directors, or faculty familiar assessments in their EM clerkship. Initial dissemination was by electronic distribution to subscribers of the Clerkship Directors in Emergency Medicine (CDEM) list-serve on the SAEM website. Subsequently, contact information of CD's from institutions that had not responded was obtained by manual search of the Emergency Medicine Residents' Association (EMRA) Match website and individual correspondence was sent at regular intervals. Data obtained include clerkship characteristics, exam used, weight of the exam relative to the overall grade, and alternatives if the preferred exam was previously taken. Results Eighty-seven programs (42% response rate) completed the survey between August 2019 and February 2021. Of the 87 responses, 71 (82%) were completed by a CD. Forty-six (53%) institutions required an EM rotation. Students were tested in 34 (74%) required EM clerkships and 48 (69%) out of 70 EM electives. In required rotations that used an exam, 20 (59%) used the NBME EM ACE, while 28 of 46 (61%) of EM electives that reported an exam used the SAEM M4 Exam. Five (15%) of the required clerkships used a departmental exam. Of clerkships requiring an exam, 46 (57%) weighed the score at 11-30% of the final grade. Data for extramural rotations mirrored that of EM electives. One-third of respondents indicated they do not inquire about previously taken exams. Conclusion This survey demonstrates significant variability in the type of exam, the weighting of the score, and alternatives if the preferred exam was previously taken. The lack of a consistent approach in how these exams are used in determining students' final EM grades diminishes the reliability of the EM clerkship grade as a factor used by residency directors in choosing future residents. Further research on optimal usage of these exams is needed.

7.

ChatGPT is capable of providing satisfactory responses to frequently asked questions regarding total shoulder arthroplasty.

Yeramosu, Teja; Johns, William L; Onor, Gabriel; Menendez, Mariano E; Namdari, Surena; Hammoud, Sommer.

Shoulder Elbow ; 16(4): 407-412, 2024 Jul.

Artigo em Inglês | MEDLINE | ID: mdl-39318407

RESUMO

Background: The rising prominence of artificial intelligence in healthcare has revolutionized patient access to medical information. This cross-sectional study sought to assess if ChatGPT could satisfactorily address common patient questions about total shoulder arthroplasty (TSA). Methods: Ten commonly encountered questions in TSA practice were selected and posed to ChatGPT. Each response was assessed for accuracy and clarity using the Mika et al. scoring system, which ranges from "excellent response not requiring clarification" to "unsatisfactory response requiring substantial clarification," and a modified DISCERN score. The readability was further evaluated using the Flesch Reading Ease Score and the Flesch-Kincaid Grade Level. Results: The mean Mika et al. score was 2.93, corresponding to an overall subjective rating of "satisfactory but requiring moderate clarification." The mean DISCERN score was 46.60, which is considered "fair." The readability analysis suggested that the responses were at a college-graduate level, higher than the recommended level for patient educational materials. Discussion: Our results suggest that ChatGPT has the potential to supplement the collaborative decision-making process between patients and experienced orthopedic surgeons for TSA-related inquiries. Ultimately, while tools like ChatGPT can enhance traditional patient education methods, they should not replace direct consultations with medical professionals.

8.

Accuracy assessment of ChatGPT responses to frequently asked questions regarding anterior cruciate ligament surgery.

Villarreal-Espinosa, Juan Bernardo; Berreta, Rodrigo Saad; Allende, Felicitas; Garcia, José Rafael; Ayala, Salvador; Familiari, Filippo; Chahla, Jorge.

Knee ; 51: 84-92, 2024 Sep 05.

Artigo em Inglês | MEDLINE | ID: mdl-39241674

RESUMO

BACKGROUND: The emergence of artificial intelligence (AI) has allowed users to have access to large sources of information in a chat-like manner. Thereby, we sought to evaluate ChatGPT-4 response's accuracy to the 10 patient most frequently asked questions (FAQs) regarding anterior cruciate ligament (ACL) surgery. METHODS: A list of the top 10 FAQs pertaining to ACL surgery was created after conducting a search through all Sports Medicine Fellowship Institutions listed on the Arthroscopy Association of North America (AANA) and American Orthopaedic Society of Sports Medicine (AOSSM) websites. A Likert scale was used to grade response accuracy by two sports medicine fellowship-trained surgeons. Cohen's kappa was used to assess inter-rater agreement. Reproducibility of the responses over time was also assessed. RESULTS: Five of the 10 responses received a 'completely accurate' grade by two-fellowship trained surgeons with three additional replies receiving a 'completely accurate' status by at least one. Moreover, inter-rater reliability accuracy assessment revealed a moderate agreement between fellowship-trained attending physicians (weighted kappa = 0.57, 95% confidence interval 0.15-0.99). Additionally, 80% of the responses were reproducible over time. CONCLUSION: ChatGPT can be considered an accurate additional tool to answer general patient questions regarding ACL surgery. None the less, patient-surgeon interaction should not be deferred and must continue to be the driving force for information retrieval. Thus, the general recommendation is to address any questions in the presence of a qualified specialist.

9.

Retrospective study assessing student utilization of optional practice questions on pharmacy calculations final examination performance.

McMillen, Andrew; Brosch, Henry; Zakhary, Kirolos; Juzkiw, Stefanie; Fredrickson, Liz; Tromp, Katherine M; Fujiwara, Ryoichi.

Curr Pharm Teach Learn ; 16(12): 102203, 2024 Sep 18.

Artigo em Inglês | MEDLINE | ID: mdl-39298994

RESUMO

OBJECTIVE: Pharmacists are often the last line of defense from medical errors caused by inaccurate calculations. Effective teaching and assessment of pharmaceutical calculations is essential in preparing students for successful pharmacy careers. This study aimed to elucidate the potential benefit of self-testing practice questions on final examination performance in a first-year pharmaceutical calculations course. METHODS: One-hundred and sixteen students across the class of 2026 and 2027 were given access to 110 online practice calculation questions eight days prior to the final examination. Retrospective analysis using Pearson's Correlation Coefficient and an Unpaired t-test was used to assess the effect of self-study practice questions on exam performance. RESULTS: A correlation between higher quiz scores and enhanced final examination scores was observed for both the class of 2026 and 2027. A greater number of attempts on practice quiz questions correlated with a higher score on the final examination for the class of 2026, but not the class of 2027. Also, an earlier first access date was associated with higher final examination scores specifically for the class of 2026. CONCLUSION: This retrospective study was conducted to evaluate the use of practice calculation questions on final examination performance, and results reveal that the utilization of practice calculation questions positively correlates with improved final examination performance, notably observed in the class of 2026 but not in 2027. These findings suggest the potential efficacy of this preparatory method across various pharmaceutical courses and other calculation-based disciplines internationally.

10.

Comparison of mistakes on multiple-choice question and fill-in-the-blank examinations: A retrospective analysis.

He, Xiaohua; Zhang, Niu.

J Chiropr Educ ; 2024 Sep 13.

Artigo em Inglês | MEDLINE | ID: mdl-39265994

RESUMO

OBJECTIVE: The objective was to compare the average number of mistakes made on multiple-choice (MCQ) and fill-in-the-blank (FIB) questions in anatomy lab exams. METHODS: The study was conducted retrospectively; every exam had both MCQs and FIBs. The study cohorts were divided into 3 tiers based on the number and percentage of mistakes in answering sheets: low (21-32, >40%), middle (11-20, 40%-20%), and high (1-9, <20%) tiers. The study used an independent 2-sample t test to compare the number of mistakes between MCQs and FIBs overall and per tier and a 1-way analysis of variance to compare the number of mistakes in both formats across the 3 tiers. RESULTS: The results show that there was a significant difference in the number of mistakes between the 2 formats overall with more mistakes found on FIBs (p < .001). The number of mistakes made in the high and middle tiers had a statistical difference, being higher on MCQs (p < .001). There was no significant difference in the number of mistakes made in the low tier between formats (p > .05). Furthermore, the study found significant differences in the number of mistakes made on MCQs and FIBs across the 3 tiers, being highest in the low-tier group (p < .001). CONCLUSION: There were fewer mistakes on the MCQ than the FIB format in exams. It also suggests that, in the low tier answering sheets, both formats could be used to identify students at academic risk who need more attention.

11.

Quality Measurement of Consumer Health Questions: Content and Language Perspectives.

Alasmari, Ashwag; Zhou, Lina.

J Med Internet Res ; 26: e48257, 2024 Sep 12.

Artigo em Inglês | MEDLINE | ID: mdl-39265162

RESUMO

BACKGROUND: Health information consumers increasingly rely on question-and-answer (Q&A) communities to address their health concerns. However, the quality of questions posted significantly impacts the likelihood and relevance of received answers. OBJECTIVE: This study aims to improve our understanding of the quality of health questions within web-based Q&A communities. METHODS: We develop a novel framework for defining and measuring question quality within web-based health communities, incorporating content- and language-based variables. This framework leverages k-means clustering and establishes automated metrics to assess overall question quality. To validate our framework, we analyze questions related to kidney disease from expert-curated and community-based Q&A platforms. Expert evaluations confirm the validity of our quality construct, while regression analysis helps identify key variables. RESULTS: High-quality questions were more likely to include demographic and medical information than lower-quality questions (P<.001). In contrast, asking questions at the various stages of disease development was less likely to reflect high-quality questions (P<.001). Low-quality questions were generally shorter with lengthier sentences than high-quality questions (P<.01). CONCLUSIONS: Our findings empower consumers to formulate more effective health information questions, ultimately leading to better engagement and more valuable insights within web-based Q&A communities. Furthermore, our findings provide valuable insights for platform developers and moderators seeking to enhance the quality of user interactions and foster a more trustworthy and informative environment for health information exchange.

Assuntos

Informação de Saúde ao Consumidor , Humanos , Informação de Saúde ao Consumidor/normas , Idioma , Internet , Inquéritos e Questionários/normas

12.

Proceedings of the 1st biannual bridging the gaps in lung cancer conference.

Florez, Narjust; Patel, Sandip P; Wakelee, Heather; Bazhenova, Lyudmila; Massarelli, Erminia; Salgia, Ravi; Stiles, Brendon; Peters, Solange; Malhotra, Jyoti; Gadgeel, Shirish M; Nieva, Jorge J; Afkhami, Michelle; Hirsch, Fred R; Gubens, Matthew; Cascone, Tina; Levy, Benjamin; Sabari, Joshua; Husain, Hatim; Ma, Patrick C; Backhus, Leah M; Iyengar, Puneeth; Lee, Percy; Miller, Russell; Sands, Jacob; Kim, Edward.

Oncologist ; 2024 Sep 05.

Artigo em Inglês | MEDLINE | ID: mdl-39237103

RESUMO

Lung cancer is the leading cause of cancer death in the US and globally. The mortality from lung cancer has been declining, due to a reduction in incidence and advances in treatment. Although recent success in developing targeted and immunotherapies for lung cancer has benefitted patients, it has also expanded the complexity of potential treatment options for health care providers. To aid in reducing such complexity, experts in oncology convened a conference (Bridging the Gaps in Lung Cancer) to identify current knowledge gaps and controversies in the diagnosis, treatment, and outcomes of various lung cancer scenarios, as described here. Such scenarios relate to biomarkers and testing in lung cancer, small cell lung cancer, EGFR mutations and targeted therapy in non-small cell lung cancer (NSCLC), early-stage NSCLC, KRAS/BRAF/MET and other genomic alterations in NSCLC, and immunotherapy in advanced NSCLC.

13.

Introducing the high-context communication style interview protocol to detect deception in pairs.

Leal, Sharon; Vrij, Aldert; Ashkenazi, Tzachi; Vernham, Zarah; Fisher, Ronald P; Palena, Nicola.

Acta Psychol (Amst) ; 249: 104440, 2024 Sep.

Artigo em Inglês | MEDLINE | ID: mdl-39167909

RESUMO

In four experiments, we examined whether pairs of truth tellers could be distinguished from pairs of lie tellers by taking advantage of the fact that only pairs of truth tellers can refer to shared events by using brief expressions (high-context communication style). In Experiments 1 and 2, pairs of friends and pairs of strangers pretending to be friends answered (i) questions they likely had expected to be asked (e.g., 'How did you first meet'?) and (ii) unexpected questions (e.g., 'First, describe a shared event in a few words. Then elaborate on it'). Pairs were interviewed individually (Experiment 1, N = 134 individuals) or collectively (Experiment 2, N = 130 individuals). Transcripts were coded for the verbal cues details, complications, plausibility, predictability, and overlap (Experiment 1 only) or repetitions (Experiment 2 only). In two lie detection experiments observers read the individual transcripts in Experiment 3 (N = 146) or the collective transcripts in Experiment 4 (N = 138). The verbal cues were more diagnostic of veracity and observers were better at distinguishing between truths and lies in the unexpected than in the expected questions condition, but only when the pair members were interviewed individually.

Assuntos

Comunicação , Enganação , Detecção de Mentiras , Humanos , Feminino , Masculino , Adulto , Adulto Jovem , Sinais (Psicologia) , Entrevistas como Assunto

14.

ChatGPT-3.5 Versus Google Bard: Which Large Language Model Responds Best to Commonly Asked Pregnancy Questions?

Khromchenko, Keren; Shaikh, Sameeha; Singh, Meghana; Vurture, Gregory; Rana, Rima A; Baum, Jonathan D.

Cureus ; 16(7): e65543, 2024 Jul.

Artigo em Inglês | MEDLINE | ID: mdl-39188430

RESUMO

Large language models (LLM) have been widely used to provide information in many fields, including obstetrics and gynecology. Which model performs best in providing answers to commonly asked pregnancy questions is unknown. A qualitative analysis of Chat Generative Pre-Training Transformer Version 3.5 (ChatGPT-3.5) (OpenAI, Inc., San Francisco, California, United States) and Bard, recently renamed Google Gemini (Google LLC, Mountain View, California, United States), was performed in August of 2023. Each LLM was queried on 12 commonly asked pregnancy questions and asked for their references. Review and grading of the responses and references for both LLMs were performed by the co-authors individually and then as a group to formulate a consensus. Query responses were graded as "acceptable" or "not acceptable" based on correctness and completeness in comparison to American College of Obstetricians and Gynecologists (ACOG) publications, PubMed-indexed evidence, and clinical experience. References were classified as "verified," "broken," "irrelevant," "non-existent," and "no references." Grades of "acceptable" were given to 58% of ChatGPT-3.5 responses (seven out of 12) and 83% of Bard responses (10 out of 12). In regard to references, ChatGPT-3.5 had reference issues in 100% of its references, and Bard had discrepancies in 8% of its references (one out of 12). When comparing ChatGPT-3.5 responses between May 2023 and August 2023, a change in "acceptable" responses was noted: 50% versus 58%, respectively. Bard answered more questions correctly than ChatGPT-3.5 when queried on a small sample of commonly asked pregnancy questions. ChatGPT-3.5 performed poorly in terms of reference verification. The overall performance of ChatGPT-3.5 remained stable over time, with approximately one-half of responses being "acceptable" in both May and August of 2023. Both LLMs need further evaluation and vetting before being accepted as accurate and reliable sources of information for pregnant women.

15.

Assessing ChatGPT's Capability for Multiple Choice Questions Using RaschOnline: Observational Study.

Chow, Julie Chi; Cheng, Teng Yun; Chien, Tsair-Wei; Chou, Willy.

JMIR Form Res ; 8: e46800, 2024 Aug 08.

Artigo em Inglês | MEDLINE | ID: mdl-39115919

RESUMO

BACKGROUND: ChatGPT (OpenAI), a state-of-the-art large language model, has exhibited remarkable performance in various specialized applications. Despite the growing popularity and efficacy of artificial intelligence, there is a scarcity of studies that assess ChatGPT's competence in addressing multiple-choice questions (MCQs) using KIDMAP of Rasch analysis-a website tool used to evaluate ChatGPT's performance in MCQ answering. OBJECTIVE: This study aims to (1) showcase the utility of the website (Rasch analysis, specifically RaschOnline), and (2) determine the grade achieved by ChatGPT when compared to a normal sample. METHODS: The capability of ChatGPT was evaluated using 10 items from the English tests conducted for Taiwan college entrance examinations in 2023. Under a Rasch model, 300 simulated students with normal distributions were simulated to compete with ChatGPT's responses. RaschOnline was used to generate 5 visual presentations, including item difficulties, differential item functioning, item characteristic curve, Wright map, and KIDMAP, to address the research objectives. RESULTS: The findings revealed the following: (1) the difficulty of the 10 items increased in a monotonous pattern from easier to harder, represented by logits (-2.43, -1.78, -1.48, -0.64, -0.1, 0.33, 0.59, 1.34, 1.7, and 2.47); (2) evidence of differential item functioning was observed between gender groups for item 5 (P=.04); (3) item 5 displayed a good fit to the Rasch model (P=.61); (4) all items demonstrated a satisfactory fit to the Rasch model, indicated by Infit mean square errors below the threshold of 1.5; (5) no significant difference was found in the measures obtained between gender groups (P=.83); (6) a significant difference was observed among ability grades (P<.001); and (7) ChatGPT's capability was graded as A, surpassing grades B to E. CONCLUSIONS: By using RaschOnline, this study provides evidence that ChatGPT possesses the ability to achieve a grade A when compared to a normal sample. It exhibits excellent proficiency in answering MCQs from the English tests conducted in 2023 for the Taiwan college entrance examinations.

16.

Modeling Evasive Response Bias in Randomized Response: Cheater Detection Versus Self-protective No-Saying.

Sayed, Khadiga H A; Cruyff, Maarten J L F; van der Heijden, Peter G M.

Psychometrika ; 2024 Aug 30.

Artigo em Inglês | MEDLINE | ID: mdl-39212867

RESUMO

Randomized response is an interview technique for sensitive questions designed to eliminate evasive response bias. Since this elimination is only partially successful, two models have been proposed for modeling evasive response bias: the cheater detection model for a design with two sub-samples with different randomization probabilities and the self-protective no sayers model for a design with multiple sensitive questions. This paper shows the correspondence between these models, and introduces models for the new, hybrid "ever/last year" design that account for self-protective no saying and cheating. The model for one set of ever/last year questions has a degree of freedom that can be used for the inclusion of a response bias parameter. Models with multiple degrees of freedom are introduced for extensions of the design with a third randomized response question and a second set of ever/last year questions. The models are illustrated with two surveys on doping use. We conclude with a discussion of the pros and cons of the ever/last year design and its potential for future research.

17.

Faculty Perception of Scenario-Based MCQs, SAQs, and MEQs in Medical Education at an Apex Institute.

Singh, Veena K; Tiwari, Meenakshi; Singh, Shruti; Kumar, Santosh.

Med Sci Educ ; 34(4): 865-871, 2024 Aug.

Artigo em Inglês | MEDLINE | ID: mdl-39099861

RESUMO

Purpose: This study explores the current knowledge and overall awareness of the faculty at an Apex institute about the use and difficulties of scenario-based multiple-choice questions (SB-MCQs), short-answer questions (SB-SAQs), and modified essay questions (SB-MEQs) in the assessment of the undergraduate and postgraduate students. Objectives: To assess faculty perception of awareness and use of SB-MCQs, SB-SAQs, and SB-MEQs and to understand the challenges encountered while designing scenario-based questions (SBQs) and the ways to overcome them. Study Procedure: The tool used for data collection was a Google form questionnaire designed with a total of 16 questions-12 Likert-scale format items and four open-ended questions. The quantitative data collected as a response to close-ended questions was analyzed by descriptive statistics and percentage values. For qualitative data, thematic analysis was done for open-ended questions. Conclusion: The study showed that the faculty has the motivation and agreeability to switch over from traditional questions to scenario-based questions but constant training in the form of regular faculty development programs and workshops is required for its effective implementation. On the administrative level, challenges like lack of sufficient faculty and proper inter-departmental integration for designing scenarios must be addressed. Supplementary Information: The online version contains supplementary material available at 10.1007/s40670-024-02052-6.

18.

Assessing Health-Related Quality of Life in Children With Spina Bifida in Lithuania.

Ali, Faris; Bakaniene, Indre; Dafalla, Hytham; Prasauskiene, Audrone.

Cureus ; 16(7): e63742, 2024 Jul.

Artigo em Inglês | MEDLINE | ID: mdl-39099921

RESUMO

Introduction In recent years, more emphasis has been placed on improving the health-related quality of life (HRQOL) in children with spina bifida (SB). Chronic disability is understood to impact various aspects of the person's life, family, and social functioning, in addition to the specific needs of the disease. The HRQOL is done to assess the patient's quality of life (QOL) in various domains including physical and mental. Back in the 1900s, few children survived SB, whereas today, they almost have normal life expectancy. By understanding the contributing factors to the quality of life (QOL), more targeted interventions can be put in place in order to maximize the psychological and social well-being of these patients. Aim The aim of this study is to estimate the health-related quality of life (HRQOL) in Lithuanian children with spina bifida (SB) in relation to comorbidities, level of lesions, and mobility. Objectives The objectives of this study are to investigate the HRQOL of Lithuanian children with SB born between 1999 and 2012; to analyze the relation between the HRQOL and its comorbidities, including hydrocephalus, Chiari II malformation, incontinence, and epilepsy; and to determine the relationship of health variables, the level of lesions, and mobility to the HRQOL. Methods This was a quantitative cross-sectional study on children with spina bifida across Lithuania to assess the HRQOL. Subjects were chosen and interviewed from various cities including Kaunas, Vilnius, Marijampole, Gargzdai, Birzai, Panevezys, Palanga, and Alytus. A questionnaire was used as an instrument to measure the HRQOL. The level of lesions, comorbidities, and other health variables were obtained from the medical files and directly from the patient's history. Results Regarding the HRQOL, our study population showed the highest scores in the emotional, medical, intellectual, and social domains. The lowest sub-scores were in recreational, vocational, environmental, and then physical domains. We also found that certain comorbidities including hydrocephalus, epilepsy, and incontinence negatively affected the QOL. In our study group, we also found that the ambulatory group scored significantly higher in the overall QOL. However, when comparing the level of lesions to the HRQOL, we found no statistically significant difference. Conclusion Positive results were obtained regarding the medical, emotional, intellectual, and social aspects of patients with SB in Lithuania as they scored high in this domain. However, the environmental and vocational domains scored low, suggesting that further examination needs to be carried in these domains. We concluded that having various comorbidities including hydrocephalus and incontinence has negative impacts on the QOL. Patients who suffered from epilepsy had a statistically significant lower QOL. No significant difference was found in the association between the level of lesion and the QOL in our study.

19.

Attorneys' Questions About Time in Criminal Cases of Alleged Child Sexual Abuse.

Cameron, McKenna N; Merriwether, Ella P; Katzman, Jacqueline; Stolzenberg, Stacia N; Evans, Angela D; McWilliams, Kelly.

Child Maltreat ; : 10775595241271426, 2024 Aug 07.

Artigo em Inglês | MEDLINE | ID: mdl-39110439

RESUMO

In cases of alleged child sexual abuse, information about the timing of events is often needed. However, published developmental laboratory research has demonstrated that children struggle to provide accurate and reliable testimony about time and there is currently a lack of field research examining how attorneys actually question child witnesses about time in court. The current study analyzed 130 trial transcripts from cases of alleged child sexual abuse containing a child witness between the ages of 5-17 years old to determine the frequency, style, and content of attorneys' questions and child responses about time. We found that attorneys primarily ask closed-ended temporal location questions (i.e., asking when an event took place using a temporal construct such as day, month, and year) to child witnesses. Additionally, children, of all ages, rarely said "I don't know" or expressed uncertainty in response to temporal questions. These findings are concerning as researchers find that children tend to struggle with temporally locating past events.

20.

Predicting neuroticism with open-ended response using natural language processing.

Yoon, Seowon; Jang, Jihee; Son, Gaeun; Park, Soohyun; Hwang, Jueun; Choeh, Joon Yeon; Choi, Kee-Hong.

Front Psychiatry ; 15: 1437569, 2024.

Artigo em Inglês | MEDLINE | ID: mdl-39149156

RESUMO

Introduction: With rapid advancements in natural language processing (NLP), predicting personality using this technology has become a significant research interest. In personality prediction, exploring appropriate questions that elicit natural language is particularly important because questions determine the context of responses. This study aimed to predict levels of neuroticism-a core psychological trait known to predict various psychological outcomes-using responses to a series of open-ended questions developed based on the five-factor model of personality. This study examined the model's accuracy and explored the influence of item content in predicting neuroticism. Methods: A total of 425 Korean adults were recruited and responded to 18 open-ended questions about their personalities, along with the measurement of the Five-Factor Model traits. In total, 30,576 Korean sentences were collected. To develop the prediction models, the pre-trained language model KoBERT was used. Accuracy, F1 Score, Precision, and Recall were calculated as evaluation metrics. Results: The results showed that items inquiring about social comparison, unintended harm, and negative feelings performed better in predicting neuroticism than other items. For predicting depressivity, items related to negative feelings, social comparison, and emotions showed superior performance. For dependency, items related to unintended harm, social dominance, and negative feelings were the most predictive. Discussion: We identified items that performed better at neuroticism prediction than others. Prediction models developed based on open-ended questions that theoretically aligned with neuroticism exhibited superior predictive performance.

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

Assuntos

RESUMO

RESUMO

Assuntos

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

ENVIAR RESULTADO:

SELEÇÃO DE REFERÊNCIAS

DETALHE DA PESQUISA