1.
JAMA Netw Open ; 7(5): e248895, 2024 May 01.
Article in English | MEDLINE | ID: mdl-38713466

ABSTRACT

Importance: The introduction of large language models (LLMs), such as Generative Pre-trained Transformer 4 (GPT-4; OpenAI), has generated significant interest in health care, yet studies evaluating their performance in a clinical setting are lacking. Determination of clinical acuity, a measure of a patient's illness severity and level of required medical attention, is one of the foundational elements of medical reasoning in emergency medicine. Objective: To determine whether an LLM can accurately assess clinical acuity in the emergency department (ED). Design, Setting, and Participants: This cross-sectional study identified all adult ED visits from January 1, 2012, to January 17, 2023, at the University of California, San Francisco, with a documented Emergency Severity Index (ESI) acuity level (immediate, emergent, urgent, less urgent, or nonurgent) and with a corresponding ED physician note. A sample of 10 000 pairs of ED visits with nonequivalent ESI scores, balanced for each of the 10 possible pairs of 5 ESI scores, was selected at random. Exposure: The potential of the LLM to classify acuity levels of patients in the ED based on the ESI across 10 000 patient pairs. Using deidentified clinical text, the LLM was queried to identify the patient with a higher-acuity presentation within each pair based on the patients' clinical history. An earlier LLM was queried to allow comparison with this model. Main Outcomes and Measures: Accuracy score was calculated to evaluate the performance of both LLMs across the 10 000-pair sample. A 500-pair subsample was manually classified by a physician reviewer to compare performance between the LLMs and human classification. Results: From a total of 251 401 adult ED visits, a balanced sample of 10 000 patient pairs was created wherein each pair comprised patients with disparate ESI acuity scores. 
Across this sample, the LLM correctly inferred the patient with higher acuity for 8940 of 10 000 pairs (accuracy, 0.89 [95% CI, 0.89-0.90]). Performance of the comparator LLM (accuracy, 0.84 [95% CI, 0.83-0.84]) was below that of its successor. Among the 500-pair subsample that was also manually classified, LLM performance (accuracy, 0.88 [95% CI, 0.86-0.91]) was comparable with that of the physician reviewer (accuracy, 0.86 [95% CI, 0.83-0.89]). Conclusions and Relevance: In this cross-sectional study of 10 000 pairs of ED visits, the LLM accurately identified the patient with higher acuity when given pairs of presenting histories extracted from patients' first ED documentation. These findings suggest that the integration of an LLM into ED workflows could enhance triage processes while maintaining triage quality and warrants further investigation.
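The headline figure above (8940 correct out of 10 000 pairs) can be checked with a short calculation. The sketch below assumes a Wilson score interval; the abstract does not say which interval method the authors used, so this is only one plausible way to arrive at a 95% CI of this shape. The function name `wilson_interval` is illustrative.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (assumed method, not the authors')."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Five ESI levels give C(5, 2) unordered pairs of non-equal scores.
n_score_pairs = math.comb(5, 2)

# Counts reported in the abstract.
accuracy = 8940 / 10_000
low, high = wilson_interval(8940, 10_000)
print(f"pairs of ESI scores: {n_score_pairs}")
print(f"accuracy = {accuracy:.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```

With 10 000 pairs the interval is narrow, which is why the reported lower bound rounds to the same two decimals as the point estimate.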


Subject(s)
Emergency Service, Hospital; Patient Acuity; Humans; Emergency Service, Hospital/statistics & numerical data; Cross-Sectional Studies; Adult; Male; Female; Middle Aged; Severity of Illness Index; San Francisco
3.
medRxiv ; 2024 Apr 04.
Article in English | MEDLINE | ID: mdl-38633805

ABSTRACT

Importance: Large language models (LLMs) possess a range of capabilities which may be applied to the clinical domain, including text summarization. As ambient artificial intelligence scribes and other LLM-based tools begin to be deployed within healthcare settings, rigorous evaluations of the accuracy of these technologies are urgently needed. Objective: To investigate the performance of GPT-4 and GPT-3.5-turbo in generating Emergency Department (ED) discharge summaries and evaluate the prevalence and type of errors across each section of the discharge summary. Design: Cross-sectional study. Setting: University of California, San Francisco ED. Participants: We identified all adult ED visits from 2012 to 2023 with an ED clinician note. We randomly selected a sample of 100 ED visits for GPT-summarization. Exposure: We investigate the potential of two state-of-the-art LLMs, GPT-4 and GPT-3.5-turbo, to summarize the full ED clinician note into a discharge summary. Main Outcomes and Measures: GPT-3.5-turbo and GPT-4-generated discharge summaries were evaluated by two independent Emergency Medicine physician reviewers across three evaluation criteria: 1) Inaccuracy of GPT-summarized information; 2) Hallucination of information; 3) Omission of relevant clinical information. On identifying each error, reviewers were additionally asked to provide a brief explanation for their reasoning, which was manually classified into subgroups of errors. Results: From 202,059 eligible ED visits, we randomly sampled 100 for GPT-generated summarization and then expert-driven evaluation. In total, 33% of summaries generated by GPT-4 and 10% of those generated by GPT-3.5-turbo were entirely error-free across all evaluated domains. Summaries generated by GPT-4 were mostly accurate, with inaccuracies found in only 10% of cases; however, 42% of the summaries exhibited hallucinations and 47% omitted clinically relevant information.
Inaccuracies and hallucinations were most commonly found in the Plan sections of GPT-generated summaries, while clinical omissions were concentrated in text describing patients' Physical Examination findings or History of Presenting Complaint. Conclusions and Relevance: In this cross-sectional study of 100 ED encounters, we found that LLMs could generate accurate discharge summaries, but were liable to hallucination and omission of clinically relevant information. A comprehensive understanding of the location and type of errors found in GPT-generated clinical text is important to facilitate clinician review of such content and prevent patient harm.

4.
Lancet Digit Health ; 2024 Apr 23.
Article in English | MEDLINE | ID: mdl-38658283

ABSTRACT

With the rapid growth of interest in and use of large language models (LLMs) across various industries, we are facing some crucial and profound ethical concerns, especially in the medical field. The unique technical architecture and purported emergent abilities of LLMs differentiate them substantially from other artificial intelligence (AI) models and natural language processing techniques used, necessitating a nuanced understanding of LLM ethics. In this Viewpoint, we highlight ethical concerns stemming from the perspectives of users, developers, and regulators, notably focusing on data privacy and rights of use, data provenance, intellectual property contamination, and broad applications and plasticity of LLMs. A comprehensive framework and mitigating strategies will be imperative for the responsible integration of LLMs into medical practice, ensuring alignment with ethical principles and safeguarding against potential societal risks.

5.
PLoS One ; 19(4): e0298906, 2024.
Article in English | MEDLINE | ID: mdl-38625909

ABSTRACT

Detecting epistatic drivers of human phenotypes is a considerable challenge. Traditional approaches use regression to sequentially test multiplicative interaction terms involving pairs of genetic variants. For higher-order interactions and genome-wide large-scale data, this strategy is computationally intractable. Moreover, multiplicative terms used in regression modeling may not capture the form of biological interactions. Building on the Predictability, Computability, Stability (PCS) framework, we introduce the epiTree pipeline to extract higher-order interactions from genomic data using tree-based models. The epiTree pipeline first selects a set of variants derived from tissue-specific estimates of gene expression. Next, it uses iterative random forests (iRF) to search training data for candidate Boolean interactions (pairwise and higher-order). We derive significance tests for interactions, based on a stabilized likelihood ratio test, by simulating Boolean tree-structured null (no epistasis) and alternative (epistasis) distributions on hold-out test data. Finally, our pipeline computes PCS epistasis p-values that probabilistically quantify improvement in prediction accuracy via bootstrap sampling on the test set. We validate the epiTree pipeline in two case studies using data from the UK Biobank: predicting red hair and multiple sclerosis (MS). In the case of predicting red hair, epiTree recovers known epistatic interactions surrounding MC1R and novel interactions, representing non-linearities not captured by logistic regression models. In the case of predicting MS, a more complex phenotype than red hair, epiTree rankings prioritize novel interactions surrounding HLA-DRB1, a variant previously associated with MS in several populations. Taken together, these results highlight the potential for epiTree rankings to help reduce the design space for follow-up experiments.


Subject(s)
Epistasis, Genetic; Genome-Wide Association Study; Humans; Genome-Wide Association Study/methods; Phenotype; Multifactorial Inheritance/genetics; Logistic Models; Polymorphism, Single Nucleotide
6.
Clin Pharmacol Ther ; 115(6): 1391-1399, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38459719

ABSTRACT

Outpatient clinical notes are a rich source of information regarding drug safety. However, data in these notes are currently underutilized for pharmacovigilance due to methodological limitations in text mining. Large language models (LLMs) like Bidirectional Encoder Representations from Transformers (BERT) have shown progress in a range of natural language processing tasks but have not yet been evaluated on adverse event (AE) detection. We adapted a new clinical LLM, University of California - San Francisco (UCSF)-BERT, to identify serious AEs (SAEs) occurring after treatment with a non-steroid immunosuppressant for inflammatory bowel disease (IBD). We compared this model to other language models that have previously been applied to AE detection. We annotated 928 outpatient IBD notes corresponding to 928 individual patients with IBD for all SAE-associated hospitalizations occurring after treatment with a non-steroid immunosuppressant. These notes contained 703 SAEs in total, the most common of which was failure of intended efficacy. Out of eight candidate models, UCSF-BERT achieved the highest numerical performance on identifying drug-SAE pairs from this corpus (accuracy 88-92%, macro F1 61-68%), with 5-10% greater accuracy than previously published models. UCSF-BERT was significantly superior at identifying hospitalization events emergent to medication use (P < 0.01). LLMs like UCSF-BERT achieve numerically superior accuracy on the challenging task of SAE detection from clinical notes compared with prior methods. Future work is needed to adapt this methodology to improve model performance and evaluation using multicenter data and newer architectures like Generative pre-trained transformer (GPT). Our findings support the potential value of using large language models to enhance pharmacovigilance.


Subject(s)
Algorithms; Immunosuppressive Agents; Inflammatory Bowel Diseases; Natural Language Processing; Pharmacovigilance; Humans; Pilot Projects; Inflammatory Bowel Diseases/drug therapy; Immunosuppressive Agents/adverse effects; Data Mining/methods; Drug-Related Side Effects and Adverse Reactions/diagnosis; Adverse Drug Reaction Reporting Systems; Electronic Health Records; Female; Male; Hospitalization/statistics & numerical data
7.
Inflamm Bowel Dis ; 2024 Mar 26.
Article in English | MEDLINE | ID: mdl-38533919

ABSTRACT

BACKGROUND: The Mayo endoscopic subscore (MES) is an important quantitative measure of disease activity in ulcerative colitis. Colonoscopy reports in routine clinical care usually characterize ulcerative colitis disease activity using free text description, limiting their utility for clinical research and quality improvement. We sought to develop algorithms to classify colonoscopy reports according to their MES. METHODS: We annotated 500 colonoscopy reports from 2 health systems. We trained and evaluated 4 classes of algorithms. Our primary outcome was accuracy in identifying scorable reports (binary) and assigning an MES (ordinal). Secondary outcomes included learning efficiency, generalizability, and fairness. RESULTS: Automated machine learning models achieved 98% and 97% accuracy on the binary and ordinal prediction tasks, outperforming other models. Binary models trained on the University of California, San Francisco data alone maintained accuracy (96%) on validation data from Zuckerberg San Francisco General. When using 80% of the training data, models remained accurate for the binary task (97% [n = 320]) but lost accuracy on the ordinal task (67% [n = 194]). We found no evidence of bias by gender (P = .65) or area deprivation index (P = .80). CONCLUSIONS: We derived a highly accurate pair of models capable of classifying reports by their MES and recognizing when to abstain from prediction. Our models were generalizable on outside institution validation. There was no evidence of algorithmic bias. Our methods have the potential to enable retrospective studies of treatment effectiveness, prospective identification of patients meeting study criteria, and quality improvement efforts in inflammatory bowel diseases.


Our accurate pair of models automatically classify colonoscopy reports by Mayo endoscopic subscore and abstain from prediction appropriately. Our methods can enable large-scale electronic health record studies of treatment effectiveness, prospective identification of patients for clinical trials, and quality improvement efforts in ulcerative colitis.

8.
J Pediatr Gastroenterol Nutr ; 78(5): 1126-1134, 2024 May.
Article in English | MEDLINE | ID: mdl-38482890

ABSTRACT

OBJECTIVES: Vedolizumab (VDZ) and ustekinumab (UST) are second-line treatments in pediatric patients with ulcerative colitis (UC) refractory to antitumor necrosis factor (anti-TNF) therapy. Pediatric studies comparing the effectiveness of these medications are lacking. Using a registry from ImproveCareNow (ICN), a global research network in pediatric inflammatory bowel disease, we compared the effectiveness of UST and VDZ in anti-TNF refractory UC. METHODS: We performed a propensity-score weighted regression analysis to compare corticosteroid-free clinical remission (CFCR) at 6 months from starting second-line therapy. Sensitivity analyses tested the robustness of our findings to different ways of handling missing outcome data. Secondary analyses evaluated alternative proxies of response and infection risk. RESULTS: Our cohort included 262 patients on VDZ and 74 patients on UST. At baseline, the two groups differed on their mean pediatric UC activity index (PUCAI) (p = 0.03) but were otherwise similar. At Month 6, 28.3% of patients on VDZ and 25.8% of those on UST achieved CFCR (p = 0.76). Our primary model showed no difference in CFCR (odds ratio: 0.81; 95% confidence interval [CI]: 0.41-1.59) (p = 0.54). The time to biologic discontinuation was similar in both groups (hazard ratio: 1.26; 95% CI: 0.76-2.08) (p = 0.36), with the reference group being VDZ, and we found no differences in clinical response, growth parameters, hospitalizations, surgeries, infections, or malignancy risk. Sensitivity analyses supported these findings of similar effectiveness. CONCLUSIONS: UST and VDZ are similarly effective for inducing clinical remission in anti-TNF refractory UC in pediatric patients. Providers should consider safety, tolerability, cost, and comorbidities when deciding between these therapies.


Subject(s)
Antibodies, Monoclonal, Humanized; Colitis, Ulcerative; Gastrointestinal Agents; Ustekinumab; Humans; Colitis, Ulcerative/drug therapy; Ustekinumab/therapeutic use; Female; Male; Child; Antibodies, Monoclonal, Humanized/therapeutic use; Adolescent; Gastrointestinal Agents/therapeutic use; Treatment Outcome; Tumor Necrosis Factor-alpha/antagonists & inhibitors; Remission Induction/methods; Propensity Score; Registries
9.
Clin Pharmacol Ther ; 115(4): 847-859, 2024 Apr.
Article in English | MEDLINE | ID: mdl-38345264

ABSTRACT

Electronic health records (EHRs) provide meaningful knowledge of drug-related adverse events (AEs) that are not captured in standard drug development and postmarketing surveillance. Using variables obtained from EHR data in the University of California San Francisco de-identified Clinical Data Warehouse, we aimed to evaluate the potential of machine learning to predict two hematological AEs, thrombocytopenia and anemia, in a cohort of patients treated with linezolid for 3 or more days. Features for model input were extracted at linezolid initiation (index), and outcomes were characterized from index to 14 days post-treatment. Random forest classification (RFC) was used for AE prediction, and reduced feature models were evaluated using cumulative importance (cImp) for feature selection. Grade 3+ thrombocytopenia and anemia occurred in 31% of 2,171 and 56% of 2,170 evaluable patients, respectively. Of the total 53 features, as few as 7 contributed at least 50% cImp, resulting in prediction accuracies of 70% or higher and area under the receiver operating characteristic curves of 0.886 for grade 3+ thrombocytopenia and 0.759 for grade 3+ anemia. Sensitivity analyses in strictly defined patient subgroups revealed similarly high predictive performance in full and reduced feature models. A logistic regression model with the same 50% cImp features showed similar predictive performance as RFC and good concordance with RFC probability predictions after isotonic calibration, adding interpretability. Collectively, this work demonstrates potential for machine learning prediction of AE risk in real-world patients using few variables regularly available in EHRs, which may aid in clinical decision making and/or monitoring.
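The reduced-feature step described above (keeping the fewest features that together reach a cumulative-importance threshold) can be sketched in a few lines. This is a minimal illustration under stated assumptions: the feature names and importance values are hypothetical, the function name `select_by_cumulative_importance` is invented for the example, and the authors' actual cImp procedure may differ in detail.

```python
def select_by_cumulative_importance(
    importances: dict[str, float], threshold: float = 0.5
) -> list[str]:
    """Return the top-ranked features whose normalised importances sum to >= threshold."""
    total = sum(importances.values())
    if total <= 0:
        raise ValueError("importances must sum to a positive value")
    # Rank features from most to least important.
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    selected: list[str] = []
    running = 0.0
    for name, value in ranked:
        selected.append(name)
        running += value / total
        if running >= threshold:
            break
    return selected


# Hypothetical importances, e.g. as produced by a fitted random forest.
example_importances = {
    "baseline_platelets": 0.30,
    "age": 0.15,
    "creatinine": 0.25,
    "weight": 0.10,
    "hemoglobin": 0.20,
}
reduced = select_by_cumulative_importance(example_importances, threshold=0.5)
print(reduced)
```

In the study, the same idea reduced 53 features to as few as 7 at the 50% threshold; the toy input here only shows the mechanics.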


Subject(s)
Anemia; Thrombocytopenia; Humans; Linezolid/adverse effects; Anemia/chemically induced; Anemia/epidemiology; Thrombocytopenia/chemically induced; Thrombocytopenia/diagnosis; Thrombocytopenia/epidemiology; Logistic Models; San Francisco
10.
Res Sq ; 2024 Feb 06.
Article in English | MEDLINE | ID: mdl-38405831

ABSTRACT

Although supervised machine learning is popular for information extraction from clinical notes, creating large, annotated datasets requires extensive domain expertise and is time-consuming. Meanwhile, large language models (LLMs) have demonstrated promising transfer learning capability. In this study, we explored whether recent LLMs can reduce the need for large-scale data annotations. We curated a manually labeled dataset of 769 breast cancer pathology reports, labeled with 13 categories, to compare zero-shot classification capability of the GPT-4 model and the GPT-3.5 model with supervised classification performance of three model architectures: random forests classifier, long short-term memory networks with attention (LSTM-Att), and the UCSF-BERT model. Across all 13 tasks, the GPT-4 model performed either significantly better than or as well as the best supervised model, the LSTM-Att model (average macro F1 score of 0.83 vs. 0.75). On tasks with a high imbalance between labels, the differences were more prominent. Frequent sources of GPT-4 errors included inferences from multiple samples and complex task design. On complex tasks where large annotated datasets cannot be easily collected, LLMs can reduce the burden of large-scale data labeling. However, if the use of LLMs is prohibitive, the use of simpler supervised models with large annotated datasets can provide comparable results. LLMs demonstrated the potential to speed up the execution of clinical NLP studies by reducing the need for curating large annotated datasets. This may increase the utilization of NLP-based variables and outcomes in observational clinical studies.
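The comparison above is reported as macro F1, which averages per-class F1 scores with equal weight and therefore penalises poor performance on rare labels. The sketch below is a plain-Python illustration of that metric; the labels and predictions are made up for the example and are not data from the study.

```python
from typing import Hashable, Sequence


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 over every label seen in truth or prediction."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("inputs must be non-empty")
    labels = set(y_true) | set(y_pred)
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up imbalanced example: the minority class pulls the macro average down.
truth = ["pos", "pos", "neg", "neg", "neg", "neg"]
preds = ["pos", "neg", "neg", "neg", "neg", "pos"]
score = macro_f1(truth, preds)
print(f"macro F1 = {score:.3f}")
```

Because every class counts equally, a model that does well only on the majority label scores lower on macro F1 than on plain accuracy, which fits the abstract's note that differences were larger on highly imbalanced tasks.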

12.
Lancet Digit Health ; 6(3): e222-e229, 2024 Mar.
Article in English | MEDLINE | ID: mdl-38395542

ABSTRACT

Digital therapeutics (DTx) are a somewhat novel class of US Food and Drug Administration-regulated software that help patients prevent, manage, or treat disease. Here, we use natural language processing to characterise registered DTx clinical trials and provide insights into the clinical development landscape for these novel therapeutics. We identified 449 DTx clinical trials, initiated or expected to be initiated between 2010 and 2030, from ClinicalTrials.gov using 27 search terms, and available data were analysed, including trial durations, locations, MeSH categories, enrolment, and sponsor types. Topic modelling of eligibility criteria, done with BERTopic, showed that DTx trials frequently exclude patients on the basis of age, comorbidities, pregnancy, language barriers, and digital determinants of health, including smartphone or data plan access. Our comprehensive overview of the DTx development landscape highlights challenges in designing inclusive DTx clinical trials and presents opportunities for clinicians and researchers to address these challenges. Finally, we provide an interactive dashboard for readers to conduct their own analyses.


Subject(s)
Natural Language Processing; Smartphone; Humans; Software
13.
J Clin Epidemiol ; 167: 111258, 2024 Mar.
Article in English | MEDLINE | ID: mdl-38219811

ABSTRACT

OBJECTIVES: Natural language processing (NLP) of clinical notes in electronic medical records is increasingly used to extract otherwise sparsely available patient characteristics, to assess their association with relevant health outcomes. Manual data curation is resource intensive and NLP methods make these studies more feasible. However, the methodology of using NLP methods reliably in clinical research is understudied. The objective of this study is to investigate how NLP models could be used to extract study variables (specifically exposures) to reliably conduct exposure-outcome association studies. STUDY DESIGN AND SETTING: In a convenience sample of patients admitted to the intensive care unit of a US academic health system, multiple association studies are conducted, comparing the association estimates based on NLP-extracted vs. manually extracted exposure variables. The association studies varied in NLP model architecture (Bidirectional Encoder Representations from Transformers, Long Short-Term Memory), training paradigm (training a new model, fine-tuning an existing external model), extracted exposures (employment status, living status, and substance use), health outcomes (having a do-not-resuscitate/intubate code, length of stay, and in-hospital mortality), missing data handling (multiple imputation vs. complete case analysis), and the application of measurement error correction (via regression calibration). RESULTS: The study was conducted on 1,174 participants (median [interquartile range] age, 61 [50, 73] years; 60.6% male). Additionally, up to 500 discharge reports of participants from the same health system and 2,528 reports of participants from an external health system were used to train the NLP models. Substantial differences were found between the associations based on NLP-extracted and manually extracted exposures under all settings. The error in association was only weakly correlated with the overall F1 score of the NLP models.
CONCLUSION: Associations estimated using NLP-extracted exposures should be interpreted with caution. Further research is needed to set conditions for reliable use of NLP in medical association studies.


Subject(s)
Intensive Care Units; Natural Language Processing; Humans; Male; Middle Aged; Female; Electronic Health Records
14.
JAMIA Open ; 7(1): ooad112, 2024 Apr.
Article in English | MEDLINE | ID: mdl-38223407

ABSTRACT

Objective: Existing research on social determinants of health (SDoH) predominantly focuses on physician notes and structured data within electronic medical records. This study posits that social work notes are an untapped, potentially rich source for SDoH information. We hypothesize that clinical notes recorded by social workers, whose role is to ameliorate social and economic factors, might provide a complementary information source of data on SDoH compared to physician notes, which primarily concentrate on medical diagnoses and treatments. We aimed to use word frequency analysis and topic modeling to identify prevalent terms and robust topics of discussion within a large cohort of social work notes including both outpatient and in-patient consultations. Materials and methods: We retrieved a diverse, deidentified corpus of 0.95 million clinical social work notes from 181 644 patients at the University of California, San Francisco. We conducted word frequency analysis related to ICD-10 chapters to identify prevalent terms within the notes. We then applied Latent Dirichlet Allocation (LDA) topic modeling analysis to characterize this corpus and identify potential topics of discussion, which was further stratified by note types and disease groups. Results: Word frequency analysis primarily identified medical-related terms associated with specific ICD10 chapters, though it also detected some subtle SDoH terms. In contrast, the LDA topic modeling analysis extracted 11 topics explicitly related to social determinants of health risk factors, such as financial status, abuse history, social support, risk of death, and mental health. The topic modeling approach effectively demonstrated variations between different types of social work notes and across patients with different types of diseases or conditions. 
Discussion: Our findings highlight LDA topic modeling's effectiveness in extracting SDoH-related themes and capturing variations in social work notes, demonstrating its potential for informing targeted interventions for at-risk populations. Conclusion: Social work notes offer a wealth of unique and valuable information on an individual's SDoH. These notes present consistent and meaningful topics of discussion that can be effectively analyzed and utilized to improve patient care and inform targeted interventions for at-risk populations.

15.
J Med Internet Res ; 26: e47430, 2024 Jan 19.
Article in English | MEDLINE | ID: mdl-38241075

ABSTRACT

BACKGROUND: Diabetes mellitus (DM) is a major health concern among children with the widespread adoption of advanced technologies. However, concerns are growing about the transparency, replicability, biasedness, and overall validity of artificial intelligence studies in medicine. OBJECTIVE: We aimed to systematically review the reporting quality of machine learning (ML) studies of pediatric DM using the Minimum Information About Clinical Artificial Intelligence Modelling (MI-CLAIM) checklist, a general reporting guideline for medical artificial intelligence studies. METHODS: We searched the PubMed and Web of Science databases from 2016 to 2020. Studies were included if the use of ML was reported in children with DM aged 2 to 18 years, including studies on complications, screening studies, and in silico samples. In studies following the ML workflow of training, validation, and testing of results, reporting quality was assessed via MI-CLAIM by consensus judgments of independent reviewer pairs. Positive answers to the 17 binary items regarding sufficient reporting were qualitatively summarized and counted as a proxy measure of reporting quality. The synthesis of results included testing the association of reporting quality with publication and data type, participants (human or in silico), research goals, level of code sharing, and the scientific field of publication (medical or engineering), as well as with expert judgments of clinical impact and reproducibility. RESULTS: After screening 1043 records, 28 studies were included. The sample size of the training cohort ranged from 5 to 561. Six studies featured only in silico patients. The reporting quality was low, with great variation among the 21 studies assessed using MI-CLAIM. The number of items with sufficient reporting ranged from 4 to 12 (mean 7.43, SD 2.62). 
The items on research questions and data characterization were reported adequately most often, whereas items on patient characteristics and model examination were reported adequately least often. The representativeness of the training and test cohorts to real-world settings and the adequacy of model performance evaluation were the most difficult to judge. Reporting quality improved over time (r=0.50; P=.02); it was higher than average in prognostic biomarker and risk factor studies (P=.04) and lower in noninvasive hypoglycemia detection studies (P=.006), higher in studies published in medical versus engineering journals (P=.004), and higher in studies sharing any code of the ML pipeline versus not sharing (P=.003). The association between expert judgments and MI-CLAIM ratings was not significant. CONCLUSIONS: The reporting quality of ML studies in the pediatric population with DM was generally low. Important details for clinicians, such as patient characteristics; comparison with the state-of-the-art solution; and model examination for valid, unbiased, and robust results, were often the weak points of reporting. To assess their clinical utility, the reporting standards of ML studies must evolve, and algorithms for this challenging population must become more transparent and replicable.


Subject(s)
Artificial Intelligence; Diabetes Mellitus; Humans; Child; Reproducibility of Results; Machine Learning; Diabetes Mellitus/diagnosis; Checklist
16.
Lancet Digit Health ; 6(1): e12-e22, 2024 Jan.
Article in English | MEDLINE | ID: mdl-38123252

ABSTRACT

BACKGROUND: Large language models (LLMs) such as GPT-4 hold great promise as transformative tools in health care, ranging from automating administrative tasks to augmenting clinical decision making. However, these models also pose a danger of perpetuating biases and delivering incorrect medical diagnoses, which can have a direct, harmful impact on medical care. We aimed to assess whether GPT-4 encodes racial and gender biases that impact its use in health care. METHODS: Using the Azure OpenAI application interface, this model evaluation study tested whether GPT-4 encodes racial and gender biases and examined the impact of such biases on four potential applications of LLMs in the clinical domain-namely, medical education, diagnostic reasoning, clinical plan generation, and subjective patient assessment. We conducted experiments with prompts designed to resemble typical use of GPT-4 within clinical and medical education applications. We used clinical vignettes from NEJM Healer and from published research on implicit bias in health care. GPT-4 estimates of the demographic distribution of medical conditions were compared with true US prevalence estimates. Differential diagnosis and treatment planning were evaluated across demographic groups using standard statistical tests for significance between groups. FINDINGS: We found that GPT-4 did not appropriately model the demographic diversity of medical conditions, consistently producing clinical vignettes that stereotype demographic presentations. The differential diagnoses created by GPT-4 for standardised clinical vignettes were more likely to include diagnoses that stereotype certain races, ethnicities, and genders. Assessment and plans created by the model showed significant association between demographic attributes and recommendations for more expensive procedures as well as differences in patient perception. 
INTERPRETATION: Our findings highlight the urgent need for comprehensive and transparent bias assessments of LLM tools such as GPT-4 for intended use cases before they are integrated into clinical care. We discuss the potential sources of these biases and potential mitigation strategies before clinical implementation. FUNDING: Priscilla Chan and Mark Zuckerberg.


Subject(s)
Medical Education , Health Facilities , Female , Humans , Male , Clinical Decision-Making , Differential Diagnosis , Delivery of Health Care
17.
Res Sq ; 2023 Nov 20.
Article in English | MEDLINE | ID: mdl-38045390

ABSTRACT

The combinatorial effect of genetic variants is often assumed to be additive. Although genetic variation can clearly interact non-additively, methods to uncover epistatic relationships remain in their infancy. We develop low-signal signed iterative random forests to elucidate the complex genetic architecture of cardiac hypertrophy. We derive deep learning-based estimates of left ventricular mass from the cardiac MRI scans of 29,661 individuals enrolled in the UK Biobank. We report epistatic genetic variation including variants close to CCDC141, IGF1R, TTN, and TNKS. Several loci not prioritized by univariate genome-wide association analysis are identified. Functional genomic and integrative enrichment analyses reveal a complex gene regulatory network in which genes mapped from these loci share biological processes and myogenic regulatory factors. Through a network analysis of transcriptomic data from 313 explanted human hearts, we show that these interactions are preserved at the level of the cardiac transcriptome. We assess causality of epistatic effects via RNA silencing of gene-gene interactions in human induced pluripotent stem cell-derived cardiomyocytes. Finally, single-cell morphology analysis using a novel high-throughput microfluidic system shows that cardiomyocyte hypertrophy is non-additively modifiable by specific pairwise interactions between CCDC141 and both TTN and IGF1R. Our results expand the scope of genetic regulation of cardiac structure to epistasis.

18.
medRxiv ; 2023 Nov 12.
Article in English | MEDLINE | ID: mdl-37986977

ABSTRACT

BACKGROUND: Meta-analyses have found anti-TNF drugs to be the best treatment, on average, for Crohn's disease. We performed a subgroup analysis to determine if it is possible to achieve more efficacious outcomes by individualizing treatment selection. METHODS: We obtained participant-level data from 15 trials of FDA-approved treatments (N=5703). We used sequential regression and simulation to model week six disease activity as a function of drug class, demographics, and disease-related features. We performed hypothesis testing to define subgroups based on rank-ordered preferences for treatments. We queried health records from University of California Health (UCH) to estimate the impacts these models could have on practice. We computed the sample size needed to prospectively test a prediction of our models. RESULTS: 45% of the participants (N=2561) showed greater efficacy with at least one drug class (anti-TNF, anti-IL-12/23, anti-integrin) over another. They were classifiable into 6 subgroups, two showing greatest efficacy with anti-TNFs (36%, N=2064). Women over 50 showed superior responses with anti-IL-12/23s. Although they represented only 2% of the trial-based cohort, 25% of Crohn's patients at UCH are women over 50 (N=5647), consistent with potential selection bias in trials. Moreover, 75% of biologic-exposed women over 50 did not receive an anti-IL-12/23 first-line, supporting the potential value of these models. A future trial with 250 patients per arm will have 97% power to confirm the superiority of anti-IL-12/23s over anti-TNFs in these patients. A treatment recommendation tool is available at https://crohnsrx.org. CONCLUSIONS: Personalizing treatment can improve outcomes in Crohn's disease. Future work is needed to confirm these findings and improve representativeness in Crohn's trials.

19.
medRxiv ; 2023 Nov 08.
Article in English | MEDLINE | ID: mdl-37987017

ABSTRACT

The combinatorial effect of genetic variants is often assumed to be additive. Although genetic variation can clearly interact non-additively, methods to uncover epistatic relationships remain in their infancy. We develop low-signal signed iterative random forests to elucidate the complex genetic architecture of cardiac hypertrophy. We derive deep learning-based estimates of left ventricular mass from the cardiac MRI scans of 29,661 individuals enrolled in the UK Biobank. We report epistatic genetic variation including variants close to CCDC141, IGF1R, TTN, and TNKS. Several loci not prioritized by univariate genome-wide association analysis are identified. Functional genomic and integrative enrichment analyses reveal a complex gene regulatory network in which genes mapped from these loci share biological processes and myogenic regulatory factors. Through a network analysis of transcriptomic data from 313 explanted human hearts, we show that these interactions are preserved at the level of the cardiac transcriptome. We assess causality of epistatic effects via RNA silencing of gene-gene interactions in human induced pluripotent stem cell-derived cardiomyocytes. Finally, single-cell morphology analysis using a novel high-throughput microfluidic system shows that cardiomyocyte hypertrophy is non-additively modifiable by specific pairwise interactions between CCDC141 and both TTN and IGF1R. Our results expand the scope of genetic regulation of cardiac structure to epistasis.

20.
JAMA Netw Open ; 6(10): e2336613, 2023 Oct 02.
Article in English | MEDLINE | ID: mdl-37782497

ABSTRACT

Importance: Assessing the relative effectiveness and safety of additional treatments when metformin monotherapy is insufficient remains a limiting factor in improving treatment choices in type 2 diabetes. Objective: To determine whether data from electronic health records across the University of California Health system could be used to assess the comparative effectiveness and safety associated with 4 treatments in diabetes when added to metformin monotherapy. Design, Setting, and Participants: This multicenter, new user, multidimensional propensity score-matched retrospective cohort study with leave-one-medical-center-out (LOMCO) sensitivity analysis used principles of emulating a target trial. Participants included patients with diabetes receiving metformin who were then additionally prescribed either a sulfonylurea, dipeptidyl peptidase-4 inhibitor (DPP4I), sodium-glucose cotransporter-2 inhibitor (SGLT2I), or glucagon-like peptide-1 receptor agonist (GLP1RA) for the first time and followed up over a 5-year monitoring period. Data were analyzed between January 2022 and April 2023. Exposure: Treatment with sulfonylurea, DPP4I, SGLT2I, or GLP1RA added to metformin monotherapy. Main Outcomes and Measures: The main effectiveness outcome was the ability of patients to maintain glycemic control, represented as time to metabolic failure (hemoglobin A1c [HbA1c] ≥7.0%). A secondary effectiveness outcome was assessed by monitoring time to new incidence of any of 28 adverse outcomes, including diabetes-related complications while treated with the assigned drug. Sensitivity analysis included LOMCO. Results: This cohort study included 31 852 patients (16 635 [52.2%] male; mean [SD] age, 61.4 [12.6] years) who were new users of diabetes treatments added on to metformin monotherapy.
Compared with sulfonylurea in random-effect meta-analysis, treatment with SGLT2I (summary hazard ratio [sHR], 0.75 [95% CI, 0.69-0.83]; I2 = 37.5%), DPP4I (sHR, 0.79 [95% CI, 0.75-0.84]; I2 = 0%), and GLP1RA (sHR, 0.62 [95% CI, 0.57-0.68]; I2 = 23.6%) were effective in glycemic control; findings from LOMCO sensitivity analysis were similar. Treatment with SGLT2I showed no significant difference in effectiveness compared with GLP1RA (sHR, 1.26 [95% CI, 1.12-1.42]; I2 = 47.3%; no LOMCO) or DPP4I (sHR, 0.97 [95% CI, 0.90-1.04]; I2 = 0%). Patients treated with DPP4I and SGLT2I had fewer cardiovascular events compared with those treated with sulfonylurea (DPP4I: sHR, 0.84 [95% CI, 0.74-0.96]; I2 = 0%; SGLT2I: sHR, 0.78 [95% CI, 0.62-0.98]; I2 = 0%). Patients treated with a GLP1RA or SGLT2I were less likely to develop chronic kidney disease (GLP1RA: sHR, 0.75 [95% CI, 0.6-0.94]; I2 = 0%; SGLT2I: sHR, 0.77 [95% CI, 0.61-0.97]; I2 = 0%), kidney failure (GLP1RA: sHR, 0.69 [95% CI, 0.56-0.86]; I2 = 9.1%; SGLT2I: sHR, 0.72 [95% CI, 0.59-0.88]; I2 = 0%), or hypertension (GLP1RA: sHR, 0.82 [95% CI, 0.68-0.97]; I2 = 0%; SGLT2I: sHR, 0.73 [95% CI, 0.58-0.92]; I2 = 38.5%) compared with those treated with a sulfonylurea. Patients treated with an SGLT2I, vs a DPP4I, GLP1RA, or sulfonylurea, were less likely to develop indicators of chronic hepatic dysfunction (sHR vs DPP4I, 0.68 [95% CI, 0.49-0.95]; I2 = 0%; sHR vs GLP1RA, 0.66 [95% CI, 0.48-0.91]; I2 = 0%; sHR vs sulfonylurea, 0.60 [95% CI, 0.44-0.81]; I2 = 0%), and those treated with a DPP4I were less likely to develop new incidence of hypoglycemia (sHR, 0.48 [95% CI, 0.36-0.65]; I2 = 22.7%) compared with those treated with a sulfonylurea. Conclusions and Relevance: These findings highlight familiar medication patterns, including those mirroring randomized clinical trials, as well as providing new insights underscoring the value of robust clinical data analytics in swiftly generating evidence to help guide treatment choices in diabetes.


Subject(s)
Type 2 Diabetes Mellitus , Dipeptidyl-Peptidase IV Inhibitors , Metformin , Sodium-Glucose Transporter 2 Inhibitors , Aged , Female , Humans , Male , Middle Aged , Antiviral Agents , Cohort Studies , Type 2 Diabetes Mellitus/drug therapy , Dipeptidyl-Peptidase IV Inhibitors/therapeutic use , Hypoglycemic Agents/therapeutic use , Metformin/therapeutic use , Protease Inhibitors , Retrospective Studies , Sulfonylurea Compounds/therapeutic use , Network Meta-Analysis