Búsqueda | BVS Bolivia

The impact of systematically repairing multiple choice questions with low discrimination on assessment reliability: an interrupted time series analysis.

Desy, Janeve; Harvey, Adrian; Weeks, Sarah; Busche, Kevin D; Martin, Kerri; Paget, Michael; Naugler, Christopher; McLaughlin, Kevin.

Can Med Educ J ; 15(3): 52-56, 2024 Jul.

Artículo en Inglés | MEDLINE | ID: mdl-39114792

RESUMEN

At our centre, we introduced a continuous quality improvement (CQI) initiative during academic year 2018-19 targeting for repair multiple choice question (MCQ) items with discrimination index (D) < 0.1. The purpose of this study was to assess the impact of this initiative on reliability/internal consistency of our assessments. Our participants were medical students during academic years 2015-16 to 2020-21 and our data were summative MCQ assessments during this time. Since the goal was to systematically review and improve summative assessments in our undergraduate program on an ongoing basis, we used interrupted time series analysis to assess the impact on reliability. Between 2015-16 and 2017-18 there was a significant negative trend in the mean alpha coefficient for MCQ exams (regression coefficient -0.027 [-0.008, -0.047], p = 0.024). In the academic year following the introduction of our initiative (2018-19) there was a significant increase in the mean alpha coefficient (regression coefficient 0.113 [0.063, 0.163], p = 0.010) which was then followed by a significant positive post-intervention trend (regression coefficient 0.056 [0.037, 0.075], p = 0.006). In conclusion, our CQI intervention resulted in an immediate and progressive improvement reliability of our MCQ assessments.

Dans notre centre, nous avons introduit une initiative d'amélioration continue de la qualité (ACQ) au cours de l'année académique 2018-19 ciblant la correction des questions à choix multiples (QCM) dont l'indice de discrimination (D) est < 0,1. Le but de cette étude était d'évaluer l'impact de cette initiative sur la fiabilité/cohérence interne de nos évaluations. Nos participants étaient des étudiants en médecine au cours des années académiques 2015-16 à 2020-21 et nos données provenaient d'évaluations sommatives par QCM au cours de cette période. Comme l'objectif était de revoir et d'améliorer systématiquement les évaluations sommatives dans notre programme prégradué sur une base continue, nous avons utilisé une analyse basée sur des séries chronologies interrompues pour évaluer l'impact sur la fiabilité. Entre 2015-16 et 2017-18, il y a eu une tendance négative significative dans le coefficient alpha moyen pour les examens utilisant des QCM (coefficient de régression -0,027 [-0,008, -0,047], p = 0,024). Au cours de l'année académique suivant l'introduction de notre initiative (2018-19), il y a eu une augmentation significative du coefficient alpha moyen (coefficient de régression 0,113 [0,063, 0,163], p = 0,010) qui a été suivie d'une tendance positive significative après l'intervention (coefficient de régression 0,056 [0,037, 0,075], p = 0,006). En conclusion, notre intervention d'ACQ a entraîné une amélioration immédiate et progressive de la fiabilité de nos évaluations par QCM.

Asunto(s)

Evaluación Educacional , Análisis de Series de Tiempo Interrumpido , Humanos , Reproducibilidad de los Resultados , Evaluación Educacional/métodos , Educación de Pregrado en Medicina/métodos , Mejoramiento de la Calidad , Estudiantes de Medicina

Giving partial credit during a multiple-choice question assessment reappraisal does not make the assessment process fairer.

Desy, Janeve; Harvey, Adrian; Martin, Kerri; Naugler, Christopher; McLaughlin, Kevin.

Can Med Educ J ; 15(2): 95-96, 2024 May.

Artículo en Inglés | MEDLINE | ID: mdl-38827902

Asunto(s)

Evaluación Educacional , Humanos , Evaluación Educacional/métodos

Assessing the research landscape and clinical utility of large language models: a scoping review.

Park, Ye-Jean; Pillai, Abhinav; Deng, Jiawen; Guo, Eddie; Gupta, Mehul; Paget, Mike; Naugler, Christopher.

BMC Med Inform Decis Mak ; 24(1): 72, 2024 Mar 12.

Artículo en Inglés | MEDLINE | ID: mdl-38475802

RESUMEN

IMPORTANCE: Large language models (LLMs) like OpenAI's ChatGPT are powerful generative systems that rapidly synthesize natural language responses. Research on LLMs has revealed their potential and pitfalls, especially in clinical settings. However, the evolving landscape of LLM research in medicine has left several gaps regarding their evaluation, application, and evidence base. OBJECTIVE: This scoping review aims to (1) summarize current research evidence on the accuracy and efficacy of LLMs in medical applications, (2) discuss the ethical, legal, logistical, and socioeconomic implications of LLM use in clinical settings, (3) explore barriers and facilitators to LLM implementation in healthcare, (4) propose a standardized evaluation framework for assessing LLMs' clinical utility, and (5) identify evidence gaps and propose future research directions for LLMs in clinical applications. EVIDENCE REVIEW: We screened 4,036 records from MEDLINE, EMBASE, CINAHL, medRxiv, bioRxiv, and arXiv from January 2023 (inception of the search) to June 26, 2023 for English-language papers and analyzed findings from 55 worldwide studies. Quality of evidence was reported based on the Oxford Centre for Evidence-based Medicine recommendations. FINDINGS: Our results demonstrate that LLMs show promise in compiling patient notes, assisting patients in navigating the healthcare system, and to some extent, supporting clinical decision-making when combined with human oversight. However, their utilization is limited by biases in training data that may harm patients, the generation of inaccurate but convincing information, and ethical, legal, socioeconomic, and privacy concerns. We also identified a lack of standardized methods for evaluating LLMs' effectiveness and feasibility. CONCLUSIONS AND RELEVANCE: This review thus highlights potential future directions and questions to address these limitations and to further explore LLMs' potential in enhancing healthcare delivery.

Asunto(s)

Toma de Decisiones Clínicas , Medicina Basada en la Evidencia , Humanos , Instituciones de Salud , Lenguaje , MEDLINE

Evaluating the coding accuracy of type 2 diabetes mellitus among patients with non-alcoholic fatty liver disease.

Lee, Seungwon; Shaheen, Abdel Aziz; Campbell, David J T; Naugler, Christopher; Jiang, Jason; Walker, Robin L; Quan, Hude; Lee, Joon.

BMC Health Serv Res ; 24(1): 218, 2024 Feb 16.

Artículo en Inglés | MEDLINE | ID: mdl-38365631

RESUMEN

BACKGROUND: Non-alcoholic fatty liver disease (NAFLD) describes a spectrum of chronic fattening of liver that can lead to fibrosis and cirrhosis. Diabetes has been identified as a major comorbidity that contributes to NAFLD progression. Health systems around the world make use of administrative data to conduct population-based prevalence studies. To that end, we sought to assess the accuracy of diabetes International Classification of Diseases (ICD) coding in administrative databases among a cohort of confirmed NAFLD patients in Calgary, Alberta, Canada. METHODS: The Calgary NAFLD Pathway Database was linked to the following databases: Physician Claims, Discharge Abstract Database, National Ambulatory Care Reporting System, Pharmaceutical Information Network database, Laboratory, and Electronic Medical Records. Hemoglobin A1c and diabetes medication details were used to classify diabetes groups into absent, prediabetes, meeting glycemic targets, and not meeting glycemic targets. The performance of ICD codes among these groups was compared to this standard. Within each group, the total numbers of true positives, false positives, false negatives, and true negatives were calculated. Descriptive statistics and bivariate analysis were conducted on identified covariates, including demographics and types of interacted physicians. RESULTS: A total of 12,012 NAFLD patients were registered through the Calgary NAFLD Pathway Database and 100% were successfully linked to the administrative databases. Overall, diabetes coding showed a sensitivity of 0.81 and a positive predictive value of 0.87. False negative rates in the absent and not meeting glycemic control groups were 4.5% and 6.4%, respectively, whereas the meeting glycemic control group had a 42.2% coding error. Visits to primary and outpatient services were associated with most encounters. CONCLUSION: Diabetes ICD coding in administrative databases can accurately detect true diabetic cases. However, patients with diabetes who meets glycemic control targets are less likely to be coded in administrative databases. A detailed understanding of the clinical context will require additional data linkage from primary care settings.

Asunto(s)

Diabetes Mellitus Tipo 2 , Enfermedad del Hígado Graso no Alcohólico , Humanos , Diabetes Mellitus Tipo 2/complicaciones , Diabetes Mellitus Tipo 2/diagnóstico , Diabetes Mellitus Tipo 2/epidemiología , Enfermedad del Hígado Graso no Alcohólico/complicaciones , Enfermedad del Hígado Graso no Alcohólico/diagnóstico , Enfermedad del Hígado Graso no Alcohólico/epidemiología , Comorbilidad , Alta del Paciente , Alberta/epidemiología

Automated Paper Screening for Clinical Reviews Using Large Language Models: Data Analysis Study.

Guo, Eddie; Gupta, Mehul; Deng, Jiawen; Park, Ye-Jean; Paget, Michael; Naugler, Christopher.

J Med Internet Res ; 26: e48996, 2024 Jan 12.

Artículo en Inglés | MEDLINE | ID: mdl-38214966

RESUMEN

BACKGROUND: The systematic review of clinical research papers is a labor-intensive and time-consuming process that often involves the screening of thousands of titles and abstracts. The accuracy and efficiency of this process are critical for the quality of the review and subsequent health care decisions. Traditional methods rely heavily on human reviewers, often requiring a significant investment of time and resources. OBJECTIVE: This study aims to assess the performance of the OpenAI generative pretrained transformer (GPT) and GPT-4 application programming interfaces (APIs) in accurately and efficiently identifying relevant titles and abstracts from real-world clinical review data sets and comparing their performance against ground truth labeling by 2 independent human reviewers. METHODS: We introduce a novel workflow using the Chat GPT and GPT-4 APIs for screening titles and abstracts in clinical reviews. A Python script was created to make calls to the API with the screening criteria in natural language and a corpus of title and abstract data sets filtered by a minimum of 2 human reviewers. We compared the performance of our model against human-reviewed papers across 6 review papers, screening over 24,000 titles and abstracts. RESULTS: Our results show an accuracy of 0.91, a macro F1-score of 0.60, a sensitivity of excluded papers of 0.91, and a sensitivity of included papers of 0.76. The interrater variability between 2 independent human screeners was κ=0.46, and the prevalence and bias-adjusted κ between our proposed methods and the consensus-based human decisions was κ=0.96. On a randomly selected subset of papers, the GPT models demonstrated the ability to provide reasoning for their decisions and corrected their initial decisions upon being asked to explain their reasoning for incorrect classifications. CONCLUSIONS: Large language models have the potential to streamline the clinical review process, save valuable time and effort for researchers, and contribute to the overall quality of clinical reviews. By prioritizing the workflow and acting as an aid rather than a replacement for researchers and reviewers, models such as GPT-4 can enhance efficiency and lead to more accurate and reliable conclusions in medical research.

Asunto(s)

Inteligencia Artificial , Investigación Biomédica , Revisiones Sistemáticas como Asunto , Humanos , Consenso , Análisis de Datos , Solución de Problemas , Procesamiento de Lenguaje Natural , Flujo de Trabajo

RESUMEN

Asunto(s)

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

ENVIAR RESULTADO:

SELECCIÓN DE REFERENCIAS

DETALLE DE LA BÚSQUEDA