ABSTRACT
BACKGROUND: Two classes of newer glucose-lowering drugs (GLDs), sodium-glucose cotransporter-2 inhibitors and glucagon-like peptide-1 receptor agonists, improve cardiovascular and renal outcomes among patients with type 2 diabetes (T2D). However, racial and ethnic minority groups carry higher cardiovascular risks yet have lower access to newer GLDs. Contextual-level social determinants of health (SDOH) may be the underlying factors associated with newer GLD adoption. OBJECTIVE: To identify the association between contextual-level SDOH and real-world adoption of newer GLDs among Medicare beneficiaries and to examine the nonstationarity in these associations. METHODS: Data were from a 15% random sample of nationwide Medicare beneficiaries from January 2017 to December 2018. We identified patients with T2D who did not use newer GLDs in the year before the index date (January 1, 2018) and followed the cohort for 1 year to record whether they initiated a newer GLD. We used a geographically weighted multivariable Poisson regression model to determine to what extent the SDOH-newer GLD initiation association (β coefficient) varied geographically. RESULTS: We identified 795,469 eligible Medicare beneficiaries with T2D during the study period. In the study cohort, mean age was 73.1 (SD = 10.5) years; 424,312 (53.3%) were female, 562,994 (70.8%) were non-Hispanic White, 96,891 (12.2%) were non-Hispanic Black, 84,744 (10.6%) were Hispanic, and 29,645 (3.7%) were Asian/Pacific Islander. In nonspatial regression analyses, newer GLD initiation was negatively associated with the percentage of the population reporting non-Hispanic Black race, Hispanic ethnicity, and unemployment, whereas higher county-level median household income was associated with higher initiation. The spatial analysis revealed distinct geographic distributions of the local parameter estimates for each contextual-level SDOH.
CONCLUSIONS: We identified key contextual-level SDOH associated with real-world adoption of newer GLDs and explored their geographic variation through spatially explicit, data-driven analytical approaches. Identifying areas of strong association between SDOH and newer GLD initiation is crucial for policymakers to allocate resources and develop interventions that address structural inequities.
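The geographically weighted regression described above fits a separate, kernel-weighted model at each location so that the SDOH coefficients can vary over space. As a rough illustration of the mechanics (not the authors' implementation), the hypothetical sketch below fits a Poisson GLM at one focal point via iteratively reweighted least squares (IRLS), down-weighting distant observations with a Gaussian kernel; all names and data are invented:

```python
import numpy as np

def gw_poisson_fit(X, y, coords, focal, bandwidth, n_iter=50):
    """Fit a Poisson GLM 'locally' at one focal location: observations are
    down-weighted by a Gaussian kernel on distance, then IRLS is run."""
    d = np.linalg.norm(coords - focal, axis=1)
    w_geo = np.exp(-0.5 * (d / bandwidth) ** 2)   # spatial kernel weights
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        mu = np.exp(eta)
        z = eta + (y - mu) / mu                   # IRLS working response
        W = w_geo * mu                            # kernel x Poisson IRLS weight
        beta = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))
    return beta

# Invented counties: the covariate's effect grows from west to east.
rng = np.random.default_rng(0)
n = 400
coords = rng.uniform(0, 10, size=(n, 2))
x1 = rng.normal(size=n)
local_slope = 0.2 + 0.05 * coords[:, 0]           # nonstationary association
y = rng.poisson(np.exp(0.5 + local_slope * x1))
X = np.column_stack([np.ones(n), x1])

west = gw_poisson_fit(X, y, coords, focal=np.array([1.0, 5.0]), bandwidth=2.0)
east = gw_poisson_fit(X, y, coords, focal=np.array([9.0, 5.0]), bandwidth=2.0)
print(west[1], east[1])   # the local slope estimate should be larger in the east
```

Repeating the fit at every location (as geographically weighted regression does) yields a map of local coefficients, which is what makes the nonstationarity in the SDOH associations visible.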
Subjects
Diabetes Mellitus, Type 2 , Hypoglycemic Agents , Medicare , Social Determinants of Health , Humans , United States/epidemiology , Diabetes Mellitus, Type 2/drug therapy , Female , Male , Aged , Hypoglycemic Agents/therapeutic use , Sodium-Glucose Transporter 2 Inhibitors/therapeutic use , Glucagon-Like Peptide-1 Receptor/agonists , Middle Aged , Aged, 80 and over
ABSTRACT
Pulmonary nodules and nodule characteristics are important indicators of lung nodule malignancy. However, nodule information is often documented as free text in clinical narratives such as radiology reports in electronic health record systems. Natural language processing (NLP) is the key technology for extracting and standardizing patient information from radiology reports into structured data elements. This study aimed to develop an NLP system using state-of-the-art transformer models to extract pulmonary nodules and associated nodule characteristics from radiology reports. We identified a cohort of 3080 patients who underwent low-dose computed tomography (LDCT) at the University of Florida health system and collected their radiology reports. We manually annotated 394 reports as the gold standard. We explored eight pretrained transformer models from three transformer architectures, including bidirectional encoder representations from transformers (BERT), robustly optimized BERT approach (RoBERTa), and A Lite BERT (ALBERT), for clinical concept extraction, relation identification, and negation detection. We examined general transformer models pretrained using general English corpora, transformer models fine-tuned using a clinical corpus, and a large clinical transformer model, GatorTron, which was trained from scratch using 90 billion words of clinical text. We compared the transformer models with two baseline models: a recurrent neural network implemented as a bidirectional long short-term memory network with a conditional random fields layer, and a support vector machine. RoBERTa-mimic achieved the best F1-score of 0.9279 for nodule concept and nodule characteristics extraction. ALBERT-base and GatorTron achieved the best F1-score of 0.9737 in linking nodule characteristics to pulmonary nodules. Seven out of eight transformers achieved the best F1-score of 1.0000 for negation detection. Our end-to-end system achieved an overall F1-score of 0.8869.
This study demonstrated the advantage of state-of-the-art transformer models for pulmonary nodule information extraction from radiology reports. Supplementary Information: The online version contains supplementary material available at 10.1007/s41666-024-00166-5.
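Transformer-based concept extraction of the kind evaluated above is typically cast as token-level BIO tagging, and the tagged output must then be decoded back into concept spans. A minimal, hypothetical decoding step (illustrative only, not from the paper's pipeline) might look like:

```python
def bio_to_spans(tokens, tags):
    """Decode token-level BIO tags into (start, end, label) concept spans.
    `end` is exclusive. A stray I- tag with a new label starts a new span."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-") or (tag.startswith("I-") and label != tag[2:]):
            if start is not None:
                spans.append((start, i, label))   # close the previous span
            start, label = i, tag[2:]
        elif tag == "O":
            if start is not None:
                spans.append((start, i, label))
            start, label = None, None
    if start is not None:
        spans.append((start, len(tags), label))   # span running to the end
    return spans

tokens = ["a", "4", "mm", "nodule", "in", "the", "right", "upper", "lobe"]
tags   = ["O", "B-SIZE", "I-SIZE", "B-NODULE", "O", "O", "B-LOC", "I-LOC", "I-LOC"]
print(bio_to_spans(tokens, tags))
# → [(1, 3, 'SIZE'), (3, 4, 'NODULE'), (6, 9, 'LOC')]
```

Relation identification (e.g., linking a SIZE span to its NODULE span) then operates on these decoded spans rather than on raw tokens.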
ABSTRACT
Recent advancements in large language models (LLMs) such as ChatGPT and LLaMA have hinted at their potential to revolutionize medical applications, yet their application in clinical settings often reveals limitations due to a lack of specialized training on medical-specific data. In response to this challenge, this study introduces Me-LLaMA, a novel medical LLM family that includes foundation models (Me-LLaMA 13/70B) along with their chat-enhanced versions (Me-LLaMA 13/70B-chat), developed through continual pre-training and instruction tuning of LLaMA2 using large medical datasets. Our methodology leverages a comprehensive domain-specific data suite, including a large-scale continual pre-training dataset with 129B tokens, an instruction tuning dataset with 214k samples, and a new medical evaluation benchmark (MIBE) spanning six critical medical tasks with 12 datasets. Our extensive evaluation using the MIBE shows that Me-LLaMA models achieve better overall performance than existing open-source medical LLMs in zero-shot, few-shot, and supervised learning settings. With task-specific instruction tuning, Me-LLaMA models outperform ChatGPT on 7 out of 8 datasets and GPT-4 on 5 out of 8 datasets. In addition, we investigated the catastrophic forgetting problem, and our results show that Me-LLaMA models outperform other open-source medical LLMs in mitigating it. Me-LLaMA is one of the largest open-source medical foundation LLMs trained with both biomedical and clinical data. It exhibits superior performance across both general and medical tasks compared with other open-source medical LLMs, rendering it an attractive choice for medical AI applications. We release our models, datasets, and evaluation scripts at: https://github.com/BIDS-Xu-Lab/Me-LLaMA.
ABSTRACT
This study investigates the impact of clinical trial eligibility criteria on patient survival and serious adverse events (SAEs) in colorectal cancer (CRC) drug trials using real-world data. We utilized the OneFlorida+ network's data repository, conducting a retrospective analysis of CRC patients receiving FDA-approved first-line metastatic treatments. Propensity score matching created balanced case-control groups, which were evaluated using survival analysis and machine learning algorithms to assess the effects of eligibility criteria. Our study included 68,375 patients, with matched case-control groups comprising 1,126 patients each. Survival analysis revealed ethnicity and race, along with specific medical history (eligibility criteria), as significant predictors of survival outcomes. Machine learning models, particularly an XGBoost regressor, were employed to analyze SAEs, indicating that age and study group were notable factors in SAE occurrence. The study's findings highlight the importance of considering patient demographics and medical history in CRC trial designs.
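Propensity score matching of the kind used above pairs each case patient with a control whose estimated probability of treatment is closest, subject to a caliper. A toy sketch with invented data and a hand-rolled logistic model (a stand-in for whatever library the study actually used):

```python
import numpy as np

def propensity_scores(X, treated, lr=0.1, n_iter=2000):
    """Logistic-regression propensity scores P(treated | X), fit by plain
    gradient descent on the log-likelihood."""
    Xb = np.column_stack([np.ones(len(X)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(Xb @ w)))
        w += lr * Xb.T @ (treated - p) / len(X)
    return 1.0 / (1.0 + np.exp(-(Xb @ w)))

def greedy_match(scores, treated, caliper=0.05):
    """1:1 greedy nearest-neighbor matching on the score, without replacement."""
    controls = {i for i in range(len(scores)) if not treated[i]}
    pairs = []
    for i in np.flatnonzero(treated):
        if not controls:
            break
        j = min(controls, key=lambda c: abs(scores[c] - scores[i]))
        if abs(scores[j] - scores[i]) <= caliper:
            pairs.append((i, j))
            controls.remove(j)
    return pairs

# Invented cohort: older patients are more likely to receive the treatment.
rng = np.random.default_rng(1)
n = 1000
age = rng.normal(70, 8, n)
treated = rng.random(n) < 1.0 / (1.0 + np.exp(-(age - 70) / 8))
ps = propensity_scores((age[:, None] - 70) / 8, treated.astype(float))
pairs = greedy_match(ps, treated)
t_idx = [i for i, _ in pairs]
c_idx = [j for _, j in pairs]
print(len(pairs), age[t_idx].mean() - age[c_idx].mean())  # matched age gap shrinks
```

After matching, covariates that drive treatment assignment (here, age) are far better balanced between the groups, which is the precondition for the survival comparison described above.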
ABSTRACT
OBJECTIVE: To solve major clinical natural language processing (NLP) tasks using a unified text-to-text learning architecture based on a generative large language model (LLM) via prompt tuning. METHODS: We formulated 7 key clinical NLP tasks as text-to-text learning and solved them using one unified generative clinical LLM, GatorTronGPT, developed using the GPT-3 architecture and trained with up to 20 billion parameters. We adopted soft prompts (ie, trainable vectors) with a frozen LLM, where the LLM parameters were not updated (ie, frozen) and only the vectors of the soft prompts were updated, known as prompt tuning. We added the soft prompts as a prefix to the input layer, and they were optimized during prompt tuning. We evaluated the proposed method on 7 clinical NLP tasks and compared it with previous task-specific solutions based on transformer models. RESULTS AND CONCLUSION: The proposed approach achieved state-of-the-art performance for 5 out of 7 major clinical NLP tasks using one unified generative LLM. Our approach outperformed previous task-specific transformer models by ~3% for concept extraction and 7% for relation extraction applied to social determinants of health, 3.4% for clinical concept normalization, 3.4%-10% for clinical abbreviation disambiguation, and 5.5%-9% for natural language inference. Our approach also outperformed a previously developed prompt-based machine reading comprehension (MRC) model, GatorTron-MRC, for clinical concept and relation extraction. The proposed approach can deliver the "one model for all" promise from training to deployment using a unified generative LLM.
Subjects
Natural Language Processing , Electronic Health Records , Humans , Machine Learning
ABSTRACT
OBJECTIVE: To develop a soft prompt-based learning architecture for large language models (LLMs), examine prompt tuning using frozen/unfrozen LLMs, and assess their abilities in transfer learning and few-shot learning. METHODS: We developed a soft prompt-based learning architecture and compared 4 strategies: (1) fine-tuning without prompts; (2) hard prompting with unfrozen LLMs; (3) soft prompting with unfrozen LLMs; and (4) soft prompting with frozen LLMs. We evaluated GatorTron, a clinical LLM with up to 8.9 billion parameters, and compared it with 4 existing transformer models for clinical concept and relation extraction on 2 benchmark datasets for adverse drug events and social determinants of health (SDoH). We evaluated the few-shot learning ability and the generalizability for cross-institution applications. RESULTS AND CONCLUSION: When LLMs are unfrozen, GatorTron-3.9B with soft prompting achieves the best strict F1-scores of 0.9118 and 0.8604 for concept extraction, outperforming the traditional fine-tuning and hard prompt-based models by 0.6%~3.1% and 1.2%~2.9%, respectively; GatorTron-345M with soft prompting achieves the best F1-scores of 0.8332 and 0.7488 for end-to-end relation extraction, outperforming the other two models by 0.2%~2% and 0.6%~11.7%, respectively. When LLMs are frozen, small LLMs lag far behind unfrozen models; scaling LLMs up to billions of parameters makes frozen LLMs competitive with unfrozen models. Soft prompting with a frozen GatorTron-8.9B model achieved the best performance for cross-institution evaluation.
We demonstrate that (1) machines can learn soft prompts better than hard prompts composed by humans, (2) frozen LLMs have good few-shot learning ability and generalizability for cross-institution applications, (3) frozen LLMs reduce the computing cost to 2.5%~6% of that of previous methods using unfrozen LLMs, and (4) frozen LLMs require large model sizes (e.g., over several billion parameters) for good performance.
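The soft-prompt mechanics described above can be miniaturized: only a small trainable vector prepended to the input is updated, while every model weight stays frozen. The toy below substitutes a fixed linear "model" for an LLM; it illustrates the gradient flow of prompt tuning, not the GatorTron setup itself:

```python
import numpy as np

rng = np.random.default_rng(0)

# A "frozen LLM" stand-in: a fixed random linear scorer over [prompt ; input].
d_in, d_prompt = 8, 4
w_frozen = rng.normal(size=d_prompt + d_in)           # never updated

X = rng.normal(size=(500, d_in))
y = (X @ w_frozen[d_prompt:] > 1.0).astype(float)     # task the prompt must absorb

def forward(prompt, X):
    logit = w_frozen[:d_prompt] @ prompt + X @ w_frozen[d_prompt:]
    return 1.0 / (1.0 + np.exp(-logit))

def bce(p, y, eps=1e-9):
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

prompt = np.zeros(d_prompt)                           # the only trainable vector
loss_before = bce(forward(prompt, X), y)
for _ in range(500):
    pred = forward(prompt, X)
    grad = np.mean(pred - y) * w_frozen[:d_prompt]    # gradient w.r.t. prompt only
    prompt -= 0.1 * grad                              # w_frozen is never touched
loss_after = bce(forward(prompt, X), y)
print(loss_before, loss_after)                        # loss drops; weights unchanged
```

Because only the prompt vector receives gradients, training touches a tiny fraction of the parameters, which is where the computing-cost reduction reported above comes from.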
Subjects
Natural Language Processing , Humans , Machine Learning , Data Mining/methods , Algorithms , Social Determinants of Health , Drug-Related Side Effects and Adverse Reactions
ABSTRACT
A comprehensive view of factors associated with Alzheimer's disease and related dementias (AD/ADRD) will significantly aid studies that develop new treatments for AD/ADRD and identify high-risk populations and patients for prevention efforts. In our study, we summarized the risk factors for AD/ADRD by reviewing existing meta-analyses and review articles on risk and preventive factors for AD/ADRD. In total, we extracted 477 risk factors in 10 categories from 537 studies. We constructed an interactive knowledge map to disseminate our study results. Most of the risk factors are accessible from structured electronic health records (EHRs), and clinical narratives show promise as an information source. However, evaluating genomic risk factors using real-world data (RWD) remains a challenge, as genetic testing for AD/ADRD is still not common practice and is poorly documented in both structured and unstructured EHRs. Considering the constantly evolving research on AD/ADRD risk factors, literature mining via natural language processing (NLP) methods offers a solution to automatically update our knowledge map.
ABSTRACT
There is enormous enthusiasm, along with concern, about applying large language models (LLMs) to healthcare. Yet current assumptions are based on general-purpose LLMs such as ChatGPT, which were not developed for medical use. This study develops a generative clinical LLM, GatorTronGPT, using 277 billion words of text, including (1) 82 billion words of clinical text from 126 clinical departments and approximately 2 million patients at the University of Florida Health and (2) 195 billion words of diverse general English text. We train GatorTronGPT using a GPT-3 architecture with up to 20 billion parameters and evaluate its utility for biomedical natural language processing (NLP) and healthcare text generation. GatorTronGPT improves biomedical NLP. We apply GatorTronGPT to generate 20 billion words of synthetic text. NLP models trained using synthetic text generated by GatorTronGPT outperform models trained using real-world clinical text. A physicians' Turing test using a 1 (worst) to 9 (best) scale shows no significant differences in linguistic readability (p = 0.22; 6.57 for GatorTronGPT compared with 6.93 for human text) or clinical relevance (p = 0.91; 7.0 for GatorTronGPT compared with 6.97 for human text), and physicians cannot differentiate them (p < 0.001). This study provides insights into the opportunities and challenges of LLMs for medical research and healthcare.
ABSTRACT
OBJECTIVE: To develop a natural language processing (NLP) system to extract medications and the contextual information that helps understand drug changes. This project is part of the 2022 n2c2 challenge. MATERIALS AND METHODS: We developed NLP systems for medication mention extraction, event classification (indicating whether medication changes are discussed), and context classification, which classifies the context of medication changes along 5 orthogonal dimensions. We explored 6 state-of-the-art pretrained transformer models for the three subtasks, including GatorTron, a large language model pretrained using >90 billion words of text (including >80 billion words from >290 million clinical notes identified at the University of Florida Health). We evaluated our NLP systems using annotated data and evaluation scripts provided by the 2022 n2c2 organizers. RESULTS: Our GatorTron models achieved the best F1-scores of 0.9828 for medication extraction (ranked 3rd) and 0.9379 for event classification (ranked 2nd), and the best micro-average accuracy of 0.9126 for context classification. GatorTron outperformed existing transformer models pretrained using smaller general English and clinical text corpora, indicating the advantage of large language models. CONCLUSION: This study demonstrated the advantage of using large transformer models for contextual medication information extraction from clinical narratives.
Subjects
Deep Learning , Natural Language Processing , Information Storage and Retrieval
ABSTRACT
Clinical trials are vital tools for proving the effectiveness and safety of medications. To maximize generalizability, the study sample should represent both the study population and the target population. However, clinical trial design tends to favor the evaluation of drug safety and procedure (i.e., internal validity) without clear knowledge of the penalty to trial generalizability (i.e., external validity). Alzheimer's disease (AD) trials are known to have generalizability issues. Thus, in this study, we explore the effect of eligibility criteria on the AD severity of eligible patients and on severe adverse events (SAEs) among eligible patients.
ABSTRACT
There is increasing interest in developing artificial intelligence (AI) systems to process and interpret electronic health records (EHRs). Natural language processing (NLP) powered by pretrained language models is the key technology for medical AI systems utilizing clinical narratives. However, there are few clinical language models, and the largest of those trained in the clinical domain is comparatively small at 110 million parameters (compared with billions of parameters in the general domain). It is not clear how large clinical language models with billions of parameters can help medical AI systems utilize unstructured EHRs. In this study, we develop from scratch a large clinical language model, GatorTron, using >90 billion words of text (including >82 billion words of de-identified clinical text) and systematically evaluate it on five clinical NLP tasks: clinical concept extraction, medical relation extraction, semantic textual similarity, natural language inference (NLI), and medical question answering (MQA). We examine how (1) scaling up the number of parameters and (2) scaling up the size of the training data benefit these NLP tasks. GatorTron models scale the clinical language model from 110 million to 8.9 billion parameters and improve five clinical NLP tasks (e.g., 9.6% and 9.5% improvement in accuracy for NLI and MQA), which can be applied to medical AI systems to improve healthcare delivery. The GatorTron models are publicly available at: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/clara/models/gatortron_og.
ABSTRACT
Breast cancer screening (BCS) with mammography is a crucial method for improving cancer survival. In this study, we examined the association of Alzheimer's disease (AD) and AD-related dementias (ADRD) diagnosis and race-ethnicity with mammography use in BCS-eligible women. In real-world data from the OneFlorida+ Clinical Research Network, we extracted a cohort of 21,715 BCS-eligible women with ADRD and a matched comparison cohort of 65,145 BCS-eligible women without ADRD. In multivariable regression analysis, BCS-eligible women with ADRD were more likely to undergo mammography than BCS-eligible women without ADRD (odds ratio [OR] = 1.19, 95% confidence interval [CI] = 1.13-1.26). Stratified by race-ethnicity, BCS-eligible Hispanic women with ADRD were more likely to undergo mammography (OR = 1.56, 95% CI = 1.39-1.75), whereas BCS-eligible non-Hispanic Black (OR = 0.72, 95% CI = 0.62-0.83) and non-Hispanic other (OR = 0.65, 95% CI = 0.45-0.93) women with ADRD were less likely to undergo mammography. This study was the first to report the impact of ADRD diagnosis and race-ethnicity on mammography use in BCS-eligible women using real-world data. Our results suggest ADRD patients might be undergoing BCS without detailed guidelines to maximize benefits and avoid harms.
ABSTRACT
This study examined the incidence trends of new-onset type 1 and type 2 diabetes in children and adolescents in Florida before and during the coronavirus disease 2019 (COVID-19) pandemic. In this observational descriptive cohort study, we used a validated computable phenotype to identify incident diabetes cases among individuals <18 years of age in the OneFlorida+ network of the national Patient-Centered Clinical Research Network between January 2017 and June 2021. We conducted an interrupted time series analysis based on the autoregressive integrated moving average model to compare changes in age-adjusted incidence rates of type 1 and type 2 diabetes before and after March 2020, when COVID-19 was declared a national health emergency in the U.S. The age-adjusted incidence rates of both type 1 and type 2 diabetes increased post-COVID-19 for children and adolescents. These results highlight the need for longitudinal cohort studies to examine how the pandemic might influence subsequent diabetes onset in young individuals.
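The study above fit an ARIMA-based interrupted time series model; the simpler segmented-regression form of the same design, shown below on invented monthly counts, makes the estimated level and slope changes at the interruption explicit:

```python
import numpy as np

def its_segmented(y, break_idx):
    """Segmented regression for an interrupted time series:
    y ~ b0 + b1*time + b2*post + b3*(time since break)."""
    t = np.arange(len(y), dtype=float)
    post = (t >= break_idx).astype(float)
    X = np.column_stack([np.ones_like(t), t, post, post * (t - break_idx)])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta   # beta[2] = level change, beta[3] = slope change at the break

# Invented monthly incidence, Jan 2017 = month 0, so March 2020 = month 38;
# the simulated series carries a true +5 level jump at the interruption.
rng = np.random.default_rng(2)
t = np.arange(54)
y = 10 + 0.05 * t + 5.0 * (t >= 38) + rng.normal(0, 0.5, 54)
beta = its_segmented(y, break_idx=38)
print(beta[2])   # estimated level change, close to the simulated +5
```

An ARIMA-based analysis, as in the study, additionally models autocorrelation in the monthly counts, which plain least squares ignores.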
Subjects
COVID-19 , Diabetes Mellitus, Type 2 , Humans , Incidence , Pandemics , COVID-19/epidemiology , Diabetes Mellitus, Type 2/epidemiology , Cohort Studies , Longitudinal Studies , Florida/epidemiology
ABSTRACT
Overly restrictive and poorly designed eligibility criteria reduce the generalizability of results from clinical trials. We conducted a study to identify and quantify the impacts of study traits extracted from eligibility criteria on the age of study populations in Alzheimer's disease (AD) clinical trials. Using machine learning methods and SHapley Additive exPlanations (SHAP) values, we identified 30 and 34 study traits that excluded older patients from AD trials in our 2 generated target populations, respectively. We also found that study traits had different magnitudes of impact on the age distributions of the generated study populations across racial-ethnic groups. To the best of our knowledge, this was the first study to quantify the impact of eligibility criteria on the age of AD trial participants. Our research is a first step in addressing the overly restrictive eligibility criteria in AD clinical trials.
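SHAP values, as used above, attribute a model's output for one instance across its input features. For a model with only a handful of features, Shapley values can be computed exactly by averaging each feature's marginal contribution over all orderings; the sketch below uses an invented trait-scoring function purely for illustration:

```python
from itertools import permutations

def shapley_values(f, x, baseline):
    """Exact Shapley values at point x: average each feature's marginal
    contribution over all orderings; absent features take baseline values."""
    n = len(x)
    phi = [0.0] * n
    perms = list(permutations(range(n)))
    for order in perms:
        current = list(baseline)
        prev = f(current)
        for i in order:
            current[i] = x[i]                 # reveal feature i
            now = f(current)
            phi[i] += (now - prev) / len(perms)
            prev = now
    return phi

# Invented scoring model over 3 binary study traits (linear + one interaction).
f = lambda z: 2.0 * z[0] - 1.5 * z[1] + 0.5 * z[0] * z[2]
x, base = [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]
phi = shapley_values(f, x, base)
print(phi)                        # ≈ [2.25, -1.5, 0.25]
print(sum(phi), f(x) - f(base))   # efficiency: values sum to f(x) - f(baseline)
```

The SHAP library approximates this same quantity efficiently for models with many features, where enumerating all orderings is infeasible.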
Subjects
Alzheimer Disease , Humans , Eligibility Determination , Machine Learning
ABSTRACT
Existing malware detectors on safety-critical devices have difficulty with runtime detection due to performance overhead. In this article, we introduce Propedeutica, a framework for efficient and effective real-time malware detection that leverages the best of conventional machine learning (ML) and deep learning (DL) techniques. In Propedeutica, all software starts execution classified as benign and is monitored by a conventional ML classifier for fast detection. If the software receives a borderline classification from the ML detector (e.g., the software is 50% likely to be benign and 50% likely to be malicious), it is transferred to a more accurate, yet more performance-demanding, DL detector. To address spatial-temporal dynamics and software execution heterogeneity, we introduce a novel DL architecture (DeepMalware) for Propedeutica with multistream inputs. We evaluated Propedeutica with 9115 malware samples and 1338 benign software samples from various categories for the Windows OS. With a borderline interval of [30%, 70%], Propedeutica achieves an accuracy of 94.34% and a false-positive rate of 8.75%, with 41.45% of the samples routed to DeepMalware analysis. Even using only a CPU, Propedeutica can detect malware in less than 0.1 s.
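Propedeutica's routing rule is simple to state: classify with the fast ML model, and escalate to the DL model only when the score falls in the borderline interval. A schematic sketch of that rule (stand-in models, not the actual detectors):

```python
def cascade_classify(sample, fast_model, deep_model, borderline=(0.30, 0.70)):
    """Two-stage detection: the cheap classifier decides confident cases and
    defers borderline scores to the slower, more accurate detector."""
    p = fast_model(sample)
    if borderline[0] <= p <= borderline[1]:
        return deep_model(sample), "deep"   # escalate uncertain samples
    return p >= 0.5, "fast"

# Stand-in models: the fast one emits a raw score, the deep one is reliable.
fast = lambda s: s["score"]
deep = lambda s: s["truth"]

print(cascade_classify({"score": 0.95, "truth": True}, fast, deep))   # (True, 'fast')
print(cascade_classify({"score": 0.55, "truth": True}, fast, deep))   # (True, 'deep')
```

Widening the borderline interval trades throughput for accuracy: more samples take the expensive path, which is why the reported 41.45% escalation rate is tied to the chosen [30%, 70%] interval.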