ABSTRACT
OBJECTIVE: To develop machine learning models using patient and migraine features that can predict treatment responses to commonly used migraine preventive medications. BACKGROUND: Currently, there is no accurate way to predict response to migraine preventive medications, and the standard trial-and-error approach is inefficient. METHODS: In this cohort study, we analyzed data from the Mayo Clinic Headache database prospectively collected from 2001 to December 2023. Adult patients with migraine completed questionnaires during their initial headache consultation to record detailed clinical features and then at each follow-up to track preventive medication changes and monthly headache days. We included patients treated with at least one of the following migraine preventive medications: topiramate, beta-blockers (propranolol, metoprolol, atenolol, nadolol, timolol), tricyclic antidepressants (amitriptyline, nortriptyline), verapamil, gabapentin, onabotulinumtoxinA, and calcitonin gene-related peptide (CGRP) monoclonal antibodies (mAbs) (erenumab, fremanezumab, galcanezumab, eptinezumab). We pre-trained a deep neural network, "TabNet," using 145 variables, then employed TabNet-embedded data to construct prediction models for each medication to predict binary outcomes (responder vs. non-responder). A treatment responder was defined as having at least a 30% reduction in monthly headache days from baseline. All model performances were evaluated, and metrics were reported in the held-out test set (train 85%, test 15%). SHapley Additive exPlanations (SHAP) were conducted to determine variable importance. RESULTS: Our final analysis included 4260 patients. The responder rate for each medication ranged from 28.7% to 34.9%, and the mean time to treatment outcome for each medication ranged from 151.3 to 209.5 days. 
The CGRP mAb prediction model achieved a high area under the receiver operating characteristic curve (AUC) of 0.825 (95% confidence interval [CI] 0.726, 0.920) and an accuracy of 0.80 (95% CI 0.70, 0.88). The AUCs of prediction models for beta-blockers, tricyclic antidepressants, topiramate, verapamil, gabapentin, and onabotulinumtoxinA were: 0.664 (95% CI 0.579, 0.745), 0.611 (95% CI 0.562, 0.682), 0.605 (95% CI 0.520, 0.688), 0.673 (95% CI 0.569, 0.724), 0.628 (95% CI 0.533, 0.661), and 0.581 (95% CI 0.550, 0.632), respectively. Baseline monthly headache days, age, body mass index (BMI), duration of migraine attacks, responses to previous medication trials, cranial autonomic symptoms, family history of headache, and migraine attack triggers were among the most important variables across all models. A variable could contribute differently across models; for example, lower BMI predicts responsiveness to CGRP mAbs and beta-blockers, while higher BMI predicts responsiveness to onabotulinumtoxinA, topiramate, and gabapentin. CONCLUSION: We developed an accurate prediction model for CGRP mAb treatment response, leveraging detailed migraine features gathered from a headache questionnaire before starting treatment. Employing the same methods, the model performances for other medications were less impressive, though similar to the machine learning models reported in the literature for other diseases. This may be due to CGRP mAbs being migraine-specific. Incorporating medical comorbidities, genomic, and imaging factors might enhance the model performance. We demonstrated that migraine characteristics are important in predicting treatment responses and identified the most crucial predictors for each of the seven types of preventive medications. Our results suggest that precision migraine treatment is feasible.
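The responder definition stated in the methods (at least a 30% reduction in monthly headache days from baseline) can be written as a small function. The Python sketch below is illustrative only: the function name `is_responder` and the treatment of a zero baseline as non-response are assumptions, not details reported by the study.

```python
def is_responder(baseline_mhd: float, followup_mhd: float,
                 threshold: float = 0.30) -> bool:
    """Return True if monthly headache days (MHD) fell by at least `threshold`.

    Assumption (not from the abstract): a baseline of 0 MHD cannot show a
    relative reduction, so it is treated as non-response.
    """
    if baseline_mhd <= 0:
        return False
    # Relative reduction from baseline, e.g. 20 -> 14 days is a 30% reduction.
    reduction = (baseline_mhd - followup_mhd) / baseline_mhd
    return reduction >= threshold
```

Under this definition a patient going from 20 to 14 headache days per month counts as a responder, while one going from 20 to 15 does not.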
Subjects
Machine Learning; Migraine Disorders; Humans; Migraine Disorders/drug therapy; Migraine Disorders/prevention & control; Female; Male; Adult; Middle Aged; Antidepressive Agents, Tricyclic/therapeutic use; Cohort Studies; Precision Medicine; Adrenergic beta-Antagonists/therapeutic use; Topiramate/administration & dosage; Topiramate/pharmacology; Treatment Outcome
ABSTRACT
PURPOSE: Patient-centered outcomes (PCOs) are pivotal in cancer treatment, as they directly reflect patients' quality of life. Although multiple studies suggest that factors affecting breast cancer-related morbidity and survival are influenced by treatment side effects and adherence to long-term treatment, such data are generally only available on a smaller scale or from a single center. The primary challenge with collecting these data is that the outcomes are captured as free text in clinical narratives written by clinicians. MATERIALS AND METHODS: Given the complexity of PCO documentation in these narratives, computerized methods are necessary to unlock the wealth of information buried in unstructured text notes that often document PCOs. Inspired by the success of large language models (LLMs), we examined the adaptability of three LLMs (GPT-2, BioGPT, and PMC-LLaMA) on PCO tasks across three institutions: Mayo Clinic, Emory University Hospital, and Stanford University. We developed an open-source framework for fine-tuning LLMs that can directly extract the five different categories of PCO from the clinic notes. RESULTS: We found that these LLMs without fine-tuning (zero-shot) struggle with challenging PCO extraction tasks, displaying almost random performance, even with some task-specific examples (few-shot learning). The performance of our fine-tuned, task-specific models is notably superior to that of their non-fine-tuned counterparts. Moreover, the fine-tuned GPT-2 model demonstrated significantly better performance than the other two larger LLMs. CONCLUSION: Our findings indicate that although LLMs serve as effective general-purpose models for tasks across various domains, they require fine-tuning when applied to the clinical domain.
Our proposed approach has the potential to lead to more efficient, adaptable models for PCO information extraction, reducing reliance on extensive computational resources while still delivering superior performance for specific tasks.
Subjects
Breast Neoplasms; Natural Language Processing; Humans; Breast Neoplasms/psychology; Female; Patient-Centered Care; Electronic Health Records; Quality of Life; Patient Outcome Assessment
ABSTRACT
Artificial intelligence (AI) is the integration of human tasks into machine processes. The role of AI in kidney cancer evaluation, management, and outcome prediction is constantly evolving. We performed a narrative review using the PubMed electronic database to query AI as a method of analysis in kidney cancer research. Key search terms included: Artificial Intelligence, Supervised/Unsupervised Machine Learning, Deep Learning, Natural Language Processing, Neural Networks, radiomics, pathomics, and kidney or renal neoplasms or cancer. Seventy-two clinically relevant and impactful studies related to imaging, histopathology, and outcomes were identified. We anticipate the incorporation of AI tools into future clinical decision-making for kidney cancer.
ABSTRACT
Accurate prediction of future clinical events such as discharge from hospital can not only improve hospital resource management but also provide an indicator of a patient's clinical condition. In this work, we perform a comparative analysis of deep learning-based fusion strategies against traditional single-source models for prediction of discharge from hospital by fusing information encoded in two diverse but relevant data modalities, i.e., chest X-ray images and tabular electronic health records (EHR). We evaluate multiple fusion strategies, including late, early, and joint fusion, in terms of their efficacy for target prediction compared to EHR-only and image-only predictive models. Results indicated the importance of merging information from the two modalities, as fusion models tended to outperform single-modality models, and showed that the joint fusion scheme was the most effective for target prediction. The joint fusion model merges the two modalities through a branched neural network that is jointly trained in an end-to-end fashion to extract target-relevant information from both modalities.
ABSTRACT
BACKGROUND: Breast imaging clinics in the United States (U.S.) are increasingly implementing breast cancer risk assessment (BCRA) to align with evolving guideline recommendations but with limited uptake of risk-reduction care. Effectively communicating risk information to women is central to implementation efforts, but remains understudied in the U.S. This study aims to characterize, and identify factors associated with, women's interest in and preferences for breast cancer risk communication. METHODS: This is a cross-sectional survey study of U.S. women presenting for a mammogram between January and March of 2021 at a large, tertiary breast imaging clinic. Survey items assessed women's interest in knowing their risk and preferences for risk communication if considered to be at high risk in hypothetical situations. Multivariable logistic regression modeling assessed factors associated with women's interest in knowing their personal risk and preferences for details around exact risk estimates. RESULTS: Among 1119 women, 72.7% were interested in knowing their breast cancer risk. If at high risk, 77% preferred to receive their exact risk estimate and preferred verbal (52.9% phone/47% in-person) vs. written (26.5% online/19.5% letter) communications. Adjusted regression analyses found that those with a primary family history of breast cancer were significantly more interested in knowing their risk (OR 1.5, 95% CI 1.0, 2.1, p = 0.04), while those categorized as "more than one race or other" were significantly less interested in knowing their risk (OR 0.4, 95% CI 0.2, 0.9, p = 0.02). Women 60+ years of age were significantly less likely to prefer exact estimates of their risk (OR 0.6, 95% CI 0.5, 0.98, p < 0.01), while women with greater than a high school education were significantly more likely to prefer exact risk estimates (OR 2.5, 95% CI 1.5, 4.2, p < 0.001). CONCLUSION: U.S. women in this study expressed strong interest in knowing their risk and preferred to receive exact risk estimates verbally if found to be at high risk. Sociodemographic factors and family history influenced women's interest and preferences for risk communication. Breast imaging centers implementing risk assessment should consider strategies tailored to women's preferences to increase interest in risk estimates and improve risk communication.
Subjects
Breast Neoplasms; Mammography; Patient Preference; Humans; Female; Breast Neoplasms/diagnosis; Breast Neoplasms/psychology; Breast Neoplasms/diagnostic imaging; Cross-Sectional Studies; Middle Aged; Patient Preference/statistics & numerical data; Patient Preference/psychology; United States; Adult; Mammography/statistics & numerical data; Mammography/psychology; Risk Assessment/methods; Aged; Communication; Surveys and Questionnaires; Tertiary Care Centers; Health Knowledge, Attitudes, Practice
ABSTRACT
Background: Differential kidney function assessment is an important part of preoperative evaluation of various urological interventions. It is obtained through dedicated nuclear medical imaging and is not yet implemented through conventional imaging. Objective: We assess whether differential kidney function can be obtained through evaluation of contrast-enhanced computed tomography (CT) using a combination of deep learning and (2D and 3D) radiomic features. Methods: All patients who underwent kidney nuclear scanning at Mayo Clinic sites between 2018 and 2022 were collected. CT scans of the kidneys obtained within a 3-month interval before or after the nuclear scans were extracted. Patients who underwent a urological or radiological intervention within this time frame were excluded. A segmentation model was used to segment both kidneys. 2D and 3D radiomics features were extracted and compared between the two kidneys to compute delta radiomics and assess its ability to predict differential kidney function. Performance was reported using receiver operating characteristics, sensitivity, and specificity. Results: Studies from Arizona and Rochester formed our internal dataset (n = 1,159). Studies from Florida were separately processed as an external test set to validate generalizability. We obtained 323 studies from our internal sites and 39 studies from external sites. The best results were obtained by a random forest model trained on 3D delta radiomics features. This model achieved an area under the curve (AUC) of 0.85 and 0.81 on the internal and external test sets, respectively, while specificity and sensitivity were 0.84 and 0.68 on the internal set and 0.70 and 0.65 on the external set. Conclusion: This proposed automated pipeline can derive important differential kidney function information from contrast-enhanced CT and reduce the need for dedicated nuclear scans for early-stage differential kidney functional assessment.
Clinical Impact: We establish a machine learning methodology for assessing differential kidney function from routine CT without the need for expensive and radioactive nuclear medicine scans.
Subjects
Deep Learning; Kidney; Tomography, X-Ray Computed; Humans; Tomography, X-Ray Computed/methods; Kidney/diagnostic imaging; Female; Male; Middle Aged; Aged; Kidney Function Tests/methods; Automation; Image Processing, Computer-Assisted/methods; Radiomics
ABSTRACT
Deep learning may detect biologically important signals embedded in tumor morphologic features that confer distinct prognoses. Tumor morphologic features were quantified to enhance patient risk stratification within DNA mismatch repair (MMR) groups using deep learning. Using a quantitative segmentation algorithm (QuantCRC) that identifies 15 distinct morphologic features, we analyzed 402 resected stage III colon carcinomas [191 deficient (d)-MMR; 189 proficient (p)-MMR] from participants in a phase III trial of FOLFOX-based adjuvant chemotherapy. Results were validated in an independent cohort (176 d-MMR; 1,094 p-MMR). Association of morphologic features with clinicopathologic variables, MMR, KRAS, BRAFV600E, and time-to-recurrence (TTR) was determined. Multivariable Cox proportional hazards models were developed to predict TTR. Tumor morphologic features differed significantly by MMR status. Cancers with p-MMR had more immature desmoplastic stroma. Tumors with d-MMR had increased inflammatory stroma, epithelial tumor-infiltrating lymphocytes (TIL), high-grade histology, mucin, and signet ring cells. Stromal subtype did not differ by BRAFV600E or KRAS status. In p-MMR tumors, multivariable analysis identified tumor-stroma ratio (TSR) as the strongest feature associated with TTR [HRadj 2.02; 95% confidence interval (CI), 1.14-3.57; P = 0.018; 3-year recurrence: 40.2% vs. 20.4%; Q1 vs. Q2-4]. Among d-MMR tumors, extent of inflammatory stroma (continuous HRadj 0.98; 95% CI, 0.96-0.99; P = 0.028; 3-year recurrence: 13.3% vs. 33.4%, Q4 vs. Q1) and N stage were the most robust prognostically. Association of TSR with TTR was independently validated. In conclusion, QuantCRC can quantify morphologic differences within MMR groups in routine tumor sections to determine their relative contributions to patient prognosis, and may elucidate relevant pathophysiologic mechanisms driving prognosis. 
SIGNIFICANCE: A deep learning algorithm can quantify tumor morphologic features that may reflect underlying mechanisms driving prognosis within MMR groups. TSR was the most robust morphologic feature associated with TTR in p-MMR colon cancers. Extent of inflammatory stroma and N stage were the strongest prognostic features in d-MMR tumors. TIL density was not independently prognostic in either MMR group.
Subjects
Colonic Neoplasms; DNA Mismatch Repair; Deep Learning; Neoplasm Recurrence, Local; Tumor Microenvironment; Humans; Colonic Neoplasms/pathology; Colonic Neoplasms/genetics; Male; Neoplasm Recurrence, Local/pathology; Female; Middle Aged; Aged; Prognosis; Antineoplastic Combined Chemotherapy Protocols/therapeutic use; Fluorouracil/therapeutic use; Leucovorin/therapeutic use; Organoplatinum Compounds/therapeutic use; Chemotherapy, Adjuvant
ABSTRACT
Training Large Language Models (LLMs) with billions of parameters on a dataset and publishing the model for public access is currently standard practice. Despite their transformative impact on natural language processing, public LLMs present notable vulnerabilities because their training data are often web-based or crowdsourced and hence can be manipulated by perpetrators. We delve into the vulnerabilities of clinical LLMs to data poisoning attacks, particularly BioGPT, which is trained on publicly available biomedical literature and clinical notes from MIMIC-III. Exploring susceptibility to data poisoning-based attacks on de-identified breast cancer clinical notes, our approach is the first to assess the extent of such attacks, and our findings reveal successful manipulation of LLM outputs. Through this work, we emphasize the urgency of comprehending these vulnerabilities in LLMs and encourage the mindful and responsible usage of LLMs in the clinical domain.
ABSTRACT
OBJECTIVE: To develop a natural language processing (NLP) algorithm that can accurately extract headache frequency from free-text clinical notes. BACKGROUND: Headache frequency, defined as the number of days with any headache in a month (or 4 weeks), remains a key parameter in the evaluation of treatment response to migraine preventive medications. However, due to the variations and inconsistencies in documentation by clinicians, significant challenges exist to accurately extract headache frequency from the electronic health record (EHR) by traditional NLP algorithms. METHODS: This was a retrospective cross-sectional study with patients identified from two tertiary headache referral centers, Mayo Clinic Arizona and Mayo Clinic Rochester. All neurology consultation notes written by 15 specialized clinicians (11 headache specialists and 4 nurse practitioners) between 2012 and 2022 were extracted and 1915 notes were used for model fine-tuning (90%) and testing (10%). We employed four different NLP frameworks: (1) ClinicalBERT (Bidirectional Encoder Representations from Transformers) regression model, (2) Generative Pre-Trained Transformer-2 (GPT-2) Question Answering (QA) model zero-shot, (3) GPT-2 QA model few-shot training fine-tuned on clinical notes, and (4) GPT-2 generative model few-shot training fine-tuned on clinical notes to generate the answer by considering the context of included text. RESULTS: The mean (standard deviation) headache frequency of our training and testing datasets were 13.4 (10.9) and 14.4 (11.2), respectively. The GPT-2 generative model was the best-performing model with an accuracy of 0.92 (0.91, 0.93, 95% confidence interval [CI]) and R2 score of 0.89 (0.87, 0.90, 95% CI), and all GPT-2-based models outperformed the ClinicalBERT model in terms of exact matching accuracy. 
Although the ClinicalBERT regression model had the lowest accuracy of 0.27 (0.26, 0.28), it demonstrated a high R2 score of 0.88 (0.85, 0.89), suggesting the ClinicalBERT model can reasonably predict the headache frequency within a range of ≤ ± 3 days, and its R2 score was higher than that of the GPT-2 QA zero-shot model or the GPT-2 QA few-shot fine-tuned model. CONCLUSION: We developed a robust information extraction model based on a state-of-the-art large language model, a GPT-2 generative model that can extract headache frequency from EHR free-text clinical notes with high accuracy and R2 score. It overcame several challenges related to the different ways clinicians document headache frequency, which traditional NLP models did not easily handle. We also showed that GPT-2-based frameworks outperformed ClinicalBERT in terms of accuracy in extracting headache frequency from clinical notes. To facilitate research in the field, we released the GPT-2 generative model and inference code under an open-source license for community use on GitHub. Additional fine-tuning of the algorithm might be required when applied to different health-care systems for various clinical use cases.
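The contrast reported above, a regression model with low exact-match accuracy but a high R2 score, follows from how the two metrics are defined. The Python sketch below computes both, plus the share of predictions within ± 3 days, on made-up numbers. The function names and the toy data are assumptions for illustration, not the study's evaluation code or results.

```python
def exact_match_accuracy(y_true, y_pred):
    """Fraction of predictions that equal the reference value exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def within_tolerance(y_true, y_pred, tol=3):
    """Fraction of predictions within +/- `tol` days of the reference."""
    return sum(abs(t - p) <= tol for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical headache-day counts (illustrative values only).
reference = [4, 10, 15, 20, 30]
predicted = [5, 9, 15, 22, 28]

accuracy = exact_match_accuracy(reference, predicted)
close = within_tolerance(reference, predicted)
r2 = r2_score(reference, predicted)
```

On this toy data only one of five predictions matches exactly, yet every prediction is within 3 days of the reference and R2 exceeds 0.95, which is the same pattern the abstract describes for the regression model.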
Subjects
Electronic Health Records; Natural Language Processing; Humans; Retrospective Studies; Cross-Sectional Studies; Male; Female; Headache; Adult; Middle Aged; Algorithms
ABSTRACT
BACKGROUND: The Mayo endoscopic subscore (MES) is an important quantitative measure of disease activity in ulcerative colitis. Colonoscopy reports in routine clinical care usually characterize ulcerative colitis disease activity using free text description, limiting their utility for clinical research and quality improvement. We sought to develop algorithms to classify colonoscopy reports according to their MES. METHODS: We annotated 500 colonoscopy reports from 2 health systems. We trained and evaluated 4 classes of algorithms. Our primary outcome was accuracy in identifying scorable reports (binary) and assigning an MES (ordinal). Secondary outcomes included learning efficiency, generalizability, and fairness. RESULTS: Automated machine learning models achieved 98% and 97% accuracy on the binary and ordinal prediction tasks, outperforming other models. Binary models trained on the University of California, San Francisco data alone maintained accuracy (96%) on validation data from Zuckerberg San Francisco General. When using 80% of the training data, models remained accurate for the binary task (97% [n = 320]) but lost accuracy on the ordinal task (67% [n = 194]). We found no evidence of bias by gender (P = .65) or area deprivation index (P = .80). CONCLUSIONS: We derived a highly accurate pair of models capable of classifying reports by their MES and recognizing when to abstain from prediction. Our models were generalizable on outside institution validation. There was no evidence of algorithmic bias. Our methods have the potential to enable retrospective studies of treatment effectiveness, prospective identification of patients meeting study criteria, and quality improvement efforts in inflammatory bowel diseases.
Our accurate pair of models automatically classify colonoscopy reports by Mayo endoscopic subscore and abstain from prediction appropriately. Our methods can enable large-scale electronic health record studies of treatment effectiveness, prospective identification of patients for clinical trials, and quality improvement efforts in ulcerative colitis.
ABSTRACT
PURPOSE: There is a need to improve current risk stratification of stage II colorectal cancer to better inform risk of recurrence and guide adjuvant chemotherapy. We sought to examine whether integration of QuantCRC, a digital pathology biomarker utilizing hematoxylin and eosin-stained slides, provides improved risk stratification over current American Society of Clinical Oncology (ASCO) guidelines. EXPERIMENTAL DESIGN: ASCO and QuantCRC-integrated schemes were applied to a cohort of 398 mismatch-repair proficient (MMRP) stage II colorectal cancers from three large academic medical centers. The ASCO stage II scheme was taken from recent guidelines. The QuantCRC-integrated scheme utilized pT3 versus pT4 and a QuantCRC-derived risk classification. Evaluation of recurrence-free survival (RFS) according to these risk schemes was compared using the log-rank test and HR. RESULTS: Integration of QuantCRC provides improved risk stratification compared with the ASCO scheme for stage II MMRP colorectal cancers. The QuantCRC-integrated scheme placed more stage II tumors in the low-risk group compared with the ASCO scheme (62.5% vs. 42.2%) without compromising excellent 3-year RFS. The QuantCRC-integrated scheme provided larger HR for both intermediate-risk (2.27; 95% CI, 1.32-3.91; P = 0.003) and high-risk (3.27; 95% CI, 1.42-7.55; P = 0.006) groups compared with ASCO intermediate-risk (1.58; 95% CI, 0.87-2.87; P = 0.1) and high-risk (2.24; 95% CI, 1.09-4.62; P = 0.03) groups. The QuantCRC-integrated risk groups remained prognostic in the subgroup of patients that did not receive any adjuvant chemotherapy. CONCLUSIONS: Incorporation of QuantCRC into risk stratification provides a powerful predictor of RFS that has potential to guide subsequent treatment and surveillance for stage II MMRP colorectal cancers.
Subjects
Biomarkers, Tumor; Colorectal Neoplasms; DNA Mismatch Repair; Neoplasm Staging; Humans; Colorectal Neoplasms/pathology; Colorectal Neoplasms/diagnosis; Female; Male; Middle Aged; Risk Assessment/methods; Aged; Prognosis; Neoplasm Recurrence, Local/pathology; Adult
ABSTRACT
BACKGROUND: Artificial intelligence (AI) and machine learning (ML) are poised to have a substantial impact in the health care space. While a plethora of web-based resources exist to teach programming skills and ML model development, there are few introductory curricula specifically tailored to medical students without a background in data science or programming. Programs that do exist are often restricted to a specific specialty. OBJECTIVE: We hypothesized that a 1-month elective for fourth-year medical students, composed of high-quality existing web-based resources and a project-based structure, would empower students to learn about the impact of AI and ML in their chosen specialty and begin contributing to innovation in their field of interest. This study aims to evaluate the success of this elective in improving self-reported confidence scores in AI and ML. We also share our curriculum with other educators who may be interested in its adoption. METHODS: This elective was offered in 2 tracks: technical (for students who were already competent programmers) and nontechnical (with no technical prerequisites, focusing on building a conceptual understanding of AI and ML). Students established a conceptual foundation of knowledge using curated web-based resources and relevant research papers, and were then tasked with completing 3 projects in their chosen specialty: a data set analysis, a literature review, and an AI project proposal. The project-based nature of the elective was designed to be self-guided and flexible to each student's interest area and career goals. Students' success was measured by self-reported confidence in AI and ML skills in pre- and postsurveys. Qualitative feedback on students' experiences was also collected. RESULTS: This web-based, self-directed elective was offered on a pass-or-fail basis each month to fourth-year students at Emory University School of Medicine beginning in May 2021.
As of June 2022, a total of 19 students had successfully completed the elective, representing a wide range of chosen specialties: diagnostic radiology (n=3), general surgery (n=1), internal medicine (n=5), neurology (n=2), obstetrics and gynecology (n=1), ophthalmology (n=1), orthopedic surgery (n=1), otolaryngology (n=2), pathology (n=2), and pediatrics (n=1). Students' self-reported confidence scores for AI and ML rose by 66% after this 1-month elective. In qualitative surveys, students overwhelmingly reported enthusiasm and satisfaction with the course and commented that the self-direction, flexibility, and project-based design of the course were essential. CONCLUSIONS: Course participants were successful in diving deep into applications of AI in their wide-ranging specialties, produced substantial project deliverables, and generally reported satisfaction with their elective experience. The authors are hopeful that a brief, 1-month investment in AI and ML education during medical school will empower this next generation of physicians to pave the way for AI and ML innovation in health care.
Subjects
Artificial Intelligence; Education, Medical; Humans; Curriculum; Internet; Students, Medical
ABSTRACT
Large language models (LLMs) may facilitate and expedite systematic reviews, although how best to integrate LLMs into the review process is unclear. This study evaluates GPT-4 agreement with human reviewers in assessing the risk of bias using the Risk Of Bias In Non-randomised Studies of Interventions (ROBINS-I) tool and proposes a framework for integrating LLMs into systematic reviews. The case study demonstrated that raw per cent agreement was highest for the ROBINS-I domain of 'Classification of Intervention'. The Kendall agreement coefficient was highest for the domains of 'Participant Selection', 'Missing Data' and 'Measurement of Outcomes', suggesting moderate agreement in these domains. Raw agreement about the overall risk of bias across domains was 61% (Kendall coefficient=0.35). The proposed framework for integrating LLMs into systematic reviews consists of four domains: rationale for LLM use, protocol (task definition, model selection, prompt engineering, data entry methods, human role and success metrics), execution (iterative revisions to the protocol) and reporting. We identify five basic task types relevant to systematic reviews: selection, extraction, judgement, analysis and narration. Considering the agreement level with a human reviewer in the case study, pairing artificial intelligence with an independent human reviewer remains required.
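Raw per cent agreement, the simpler of the two agreement measures mentioned above, is the share of items on which two raters give the same judgement. The Python sketch below shows that calculation under stated assumptions: the function name and the example ratings are hypothetical, and the Kendall coefficient is deliberately left out because the abstract does not say which variant was used.

```python
def raw_percent_agreement(rater_a, rater_b):
    """Percentage of items on which two raters give the same judgement."""
    if len(rater_a) != len(rater_b):
        raise ValueError("both raters must judge the same number of items")
    if not rater_a:
        raise ValueError("at least one judgement is required")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return 100.0 * matches / len(rater_a)


# Hypothetical risk-of-bias judgements for four studies (illustrative only).
human = ["low", "moderate", "serious", "moderate"]
model = ["low", "serious", "serious", "moderate"]
agreement = raw_percent_agreement(human, model)
```

Raw agreement does not correct for agreement expected by chance or for how far apart two ordinal judgements are, which is one reason a rank-based coefficient is usually reported alongside it.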
ABSTRACT
Background: We aimed to create an opportunistic screening strategy that uses multitask deep learning to stratify coronary artery calcium (CAC) and associated cardiovascular risk from frontal chest x-rays (CXR) and minimal data from electronic health records (EHR). Methods: In this retrospective study, 2,121 patients with available computed tomography (CT) scans and corresponding CXR images were collected internally (Mayo Enterprise) with calculated CAC scores binned into 3 categories (0, 1-99, and 100+) as ground truths for model training. Results from the internal training were tested on multiple external datasets (domestic (EUH) and foreign (VGHTPE)) with significant racial and ethnic differences and classification performance was compared. Findings: Classification among 0, 1-99, and 100+ CAC scores was moderate on both the internal test and external datasets, reaching an average F1-score of 0.66 for Mayo, 0.62 for EUH and 0.61 for VGHTPE. For the clinically relevant binary task of 0 vs 400+ CAC classification, the performance of our model on the internal test and external datasets reached an average AUROC of 0.84. Interpretation: The fusion model trained on CXR performed better (average AUROC of 0.84 on the internal and external datasets) than existing state-of-the-art models for predicting CAC scores, which were evaluated only on internal data (0.73 AUROC), and its performance was robust on external datasets. Thus, our proposed model may be used as a robust, first-pass opportunistic screening method for cardiovascular risk from regular chest radiographs. For community use, the trained model and the inference code can be downloaded under an academic open-source license from https://github.com/jeong-jasonji/MTL_CAC_classification. Funding: The study was partially supported by National Institutes of Health 1R01HL155410-01A1 award.
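The three ground-truth categories described in the methods (0, 1-99, and 100+) amount to a simple binning of the CT-derived CAC score. The Python sketch below illustrates that step only. The function name `cac_category`, the label strings, and the choice to place any nonzero score below 100 in the middle bin are assumptions, not the authors' preprocessing code.

```python
def cac_category(score: float) -> str:
    """Map a coronary artery calcium (CAC) score to one of three bins.

    Assumption: any nonzero score below 100 (including fractional values)
    falls in the "1-99" bin.
    """
    if score < 0:
        raise ValueError("CAC score cannot be negative")
    if score == 0:
        return "0"
    if score < 100:
        return "1-99"
    return "100+"


# Hypothetical scores for five patients (illustrative values only).
labels = [cac_category(s) for s in (0, 1, 99, 100, 450)]
```

The resulting labels can then serve as classification targets for a model trained on the paired chest radiographs.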
ABSTRACT
BACKGROUND: A major hurdle for the real time deployment of the AI models is ensuring trustworthiness of these models for the unseen population. More often than not, these complex models are black boxes in which promising results are generated. However, when scrutinized, these models begin to reveal implicit biases during the decision making, particularly for the minority subgroups. METHOD: We develop an efficient adversarial de-biasing approach with partial learning by incorporating the existing concept activation vectors (CAV) methodology, to reduce racial disparities while preserving the performance of the targeted task. CAV is originally a model interpretability technique which we adopted to identify convolution layers responsible for learning race and only fine-tune up to that layer instead of fine-tuning the complete network, limiting the drop in performance RESULTS:: The methodology has been evaluated on two independent medical image case-studies - chest X-ray and mammograms, and we also performed external validation on a different racial population. On the external datasets for the chest X-ray use-case, debiased models (averaged AUC 0.87 ) outperformed the baseline convolution models (averaged AUC 0.57 ) as well as the models trained with the popular fine-tuning strategy (averaged AUC 0.81). Moreover, the mammogram models is debiased using a single dataset (white, black and Asian) and improved the performance on an external datasets (averaged AUC 0.8 to 0.86 ) with completely different population (primarily Hispanic patients). CONCLUSION: In this study, we demonstrated that the adversarial models trained only with internal data performed equally or often outperformed the standard fine-tuning strategy with data from an external setting. The adversarial training approach described can be applied regardless of predictor's model architecture, as long as the convolution model is trained using a gradient-based method. 
We release the training code under an academic open-source license: https://github.com/ramon349/JBI2023_TCAV_debiasing.
Subjects
Artificial Intelligence, Clinical Decision-Making, Diagnostic Imaging, Racial Groups, Humans, Mammography, Minority Groups, Bias, Healthcare Disparities
ABSTRACT
BACKGROUND: Constrictive pericarditis (CP) is an uncommon but reversible cause of diastolic heart failure if appropriately identified and treated. However, its diagnosis remains a challenge for clinicians. Artificial intelligence may enhance the identification of CP. OBJECTIVES: The authors proposed a deep learning approach based on transthoracic echocardiography to differentiate CP from restrictive cardiomyopathy. METHODS: Patients with a confirmed diagnosis of CP or cardiac amyloidosis (CA) (as the representative disease of restrictive cardiomyopathy) at Mayo Clinic Rochester from January 2003 to December 2021 were identified to extract baseline demographics. The apical 4-chamber view from transthoracic echocardiography studies was used as input data. The patients were split in a 60:20:20 ratio into training, validation, and held-out test sets for the ResNet50 deep learning model. The model performance (differentiating CP and CA) was evaluated in the test set with the area under the curve. GradCAM was used for model interpretation. RESULTS: A total of 381 patients were identified, including 184 (48.3%) CP and 197 (51.7%) CA cases. The mean age was 68.7 ± 11.4 years, and 72.8% were male. ResNet50 achieved an area under the curve of 0.97 on the 2-class classification task (CP vs CA). The GradCAM heatmap showed activation around the ventricular septal area. CONCLUSIONS: With a standard apical 4-chamber view, our artificial intelligence model provides a platform to facilitate the detection of CP, allowing for improved workflow efficiency and prompt referral for more advanced evaluation and intervention of CP.
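A patient-level 60:20:20 split like the one described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the function name `split_patients`, the random seed, the integer patient IDs, and the rounding rule (fractions truncated, remainder assigned to the test set) are all hypothetical.

```python
import random


def split_patients(patient_ids, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle patient IDs and split them into train, validation, and test lists.

    Splitting by patient keeps all images from one patient in a single subset.
    Train and validation sizes are truncated; the remainder goes to the test set.
    """
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)  # seeded for a reproducible split
    n_train = int(fractions[0] * len(ids))
    n_val = int(fractions[1] * len(ids))
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]
    return train, val, test


# 381 patients, as reported in the abstract; the IDs themselves are hypothetical
train, val, test = split_patients(range(381))
print(len(train), len(val), len(test))
```

The abstract does not say whether the split was stratified by diagnosis; a stratified variant would apply the same procedure separately to the CP and CA groups.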
Subjects
Restrictive Cardiomyopathy, Deep Learning, Constrictive Pericarditis, Humans, Male, Middle Aged, Aged, Aged 80 and Over, Female, Restrictive Cardiomyopathy/diagnostic imaging, Constrictive Pericarditis/diagnostic imaging, Artificial Intelligence, Predictive Value of Tests, Echocardiography, Differential Diagnosis
ABSTRACT
PURPOSE: Liver biopsy was considered the gold standard for diagnosing liver fibrosis; however, with advancements in medical technology and increasing awareness of potential complications, reliance on liver biopsy has diminished. Ultrasound is gaining popularity due to its wider availability and cost-effectiveness. This study examined machine learning / deep learning (ML/DL) models for non-invasive liver fibrosis classification from ultrasound. METHODS: Following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) protocol, we searched five academic databases using a search query. We defined a population, intervention, comparison, outcomes, and study design (PICOS) framework for inclusion. Furthermore, the Joanna Briggs Institute (JBI) checklist for analytical cross-sectional studies was used for quality assessment. RESULTS: Among the 188 screened studies, 17 were selected. The methods were categorized as off-the-shelf (OTS), attention, generative, and ensemble classifiers. Most studies used OTS classifiers that combined pre-trained ML/DL methods with radiomics features to determine fibrosis staging. Although machine learning shows potential for fibrosis classification, there are limited external comparisons of interventions and prospective clinical trials, which limits their applicability. CONCLUSION: With the recent success of ML/DL in biomedical image analysis, automated solutions using ultrasound have been developed for predicting liver diseases. However, their applicability is bounded by the limited and imbalanced retrospective studies with high heterogeneity. This challenge could be addressed by establishing a standard protocol for study design that specifies appropriate populations, interventions, outcomes, and comparisons.
Subjects
Liver Cirrhosis, Machine Learning, Humans, Prospective Studies, Retrospective Studies, Cross-Sectional Studies, Liver Cirrhosis/diagnostic imaging, Liver Cirrhosis/pathology
ABSTRACT
OBJECTIVE: The objective of this article is to describe the effects of patient demographics and examination factors on patient-reported experience in outpatient MRI examinations. METHODS: This institutional review board-waived, HIPAA-compliant quality improvement study evaluated outpatient MRI appointments from March 2021 to January 2022 using a postappointment survey consisting of a 5-point emoji scale and text-based feedback. Patient demographics and examination information were extracted from electronic medical records. Ratings ≤ 3 were categorized as negative, and ratings ≥ 4 were categorized as positive. Continuous variables were analyzed using the Kruskal-Wallis test, and categorical variables were analyzed using Fisher's exact test. A P value less than .05 was considered significant. A natural language processing algorithm was trained and validated to categorize patient feedback. RESULTS: A total of 3,636 patients responded to the survey. Positive ratings had a higher proportion of male respondents compared with negative ratings (47.9% versus 37.0%, P = .004). Examination characteristics were also grouped by positive or negative rating. Patients who endured longer examination time (median 54.0 min versus 44.0 min, P < .001) and longer wait time after check-in (median 61.6 min versus 46.2 min, P < .001) were more likely to give negative ratings. The most common themes of positive free-text feedback included excellent service (84.3%), on-time service (8.4%), and comfortable intravenous line placement (0.4%). The most common negative feedback included long wait times (10.5%), poor communication (8.4%), and physical discomfort during the examination (4.2%). CONCLUSION: Male gender, short examination duration, and on-time start were associated with positive patient ratings.
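The rating dichotomization and the Fisher's exact test on a categorical variable described above can be sketched as follows. This is a minimal standard-library illustration, not the authors' analysis code: the function names and the 2x2 counts are hypothetical, and the two-sided p-value uses the common convention of summing the probabilities of all tables no more likely than the observed one.

```python
from math import comb


def dichotomize(rating):
    """Ratings of 3 or below are negative; ratings of 4 or above are positive."""
    return "negative" if rating <= 3 else "positive"


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability of k in the top-left cell with fixed margins
        return comb(row1, k) * comb(row2, col1 - k) / total_tables

    p_observed = prob(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties
    return sum(prob(k) for k in range(low, high + 1)
               if prob(k) <= p_observed * (1 + 1e-9))


# Hypothetical counts: rows are positive/negative ratings, columns male/female
p_value = fisher_exact_two_sided(48, 52, 37, 63)
print(dichotomize(3), dichotomize(4), p_value < 0.05)
```

In practice a statistics package would typically be used for both this test and the Kruskal-Wallis test applied to the continuous variables.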
Subjects
Outpatients, Patient Satisfaction, Humans, Male, Magnetic Resonance Imaging, Patient Outcome Assessment, Demography
ABSTRACT
OBJECTIVE: This study aims to develop and evaluate a semi-automated workflow using natural language processing (NLP) for sharing positive patient feedback with radiology staff, assessing its efficiency and impact on radiology staff morale. METHODS: The HIPAA-compliant, institutional review board-waived implementation study was conducted from April 2022 to June 2023 and introduced a Patient Praises program to distribute positive patient feedback collected from patient surveys to radiology staff. The study transitioned from an initial manual workflow to a hybrid process using an NLP model trained on 1,034 annotated comments and validated on 260 holdout reports. The times to generate Patient Praises e-mails were compared between manual and hybrid workflows. The impact of Patient Praises on radiology staff was measured using a four-question Likert scale survey and an open text feedback box. The Kruskal-Wallis test and post hoc Dunn's test were performed to evaluate differences in time for different workflows. RESULTS: From April 2022 to June 2023, the radiology department received 10,643 patient surveys. Of those surveys, 95.6% contained positive comments, with 9.6% (n = 978) shared as Patient Praises to staff. After implementation of the hybrid workflow in March 2023, 45.8% of Patient Praises were sent through the hybrid workflow and 54.2% were sent manually. Time efficiency analysis on 30-case subsets revealed that the hybrid workflow without edits was the most efficient, taking a median of 0.7 min per case. A high proportion of staff gave the highest agreement rating (5 on a 5-point Likert scale) that the praises made them feel appreciated (94%) and valued (90%). CONCLUSION: A hybrid workflow incorporating NLP significantly improves time efficiency for the Patient Praises program while increasing feelings of acknowledgment and value among staff.