RESUMO
Human prescription drug labeling contains a summary of the essential scientific information needed for the safe and effective use of the drug and includes the Prescribing Information, FDA-approved patient labeling (Medication Guides, Patient Package Inserts and/or Instructions for Use), and/or carton and container labeling. Drug labeling contains critical information about drug products, such as pharmacokinetics and adverse events. Automatic information extraction from drug labels may facilitate finding the adverse reaction of the drugs or finding the interaction of one drug with another drug. Natural language processing (NLP) techniques, especially recently developed Bidirectional Encoder Representations from Transformers (BERT), have exhibited exceptional merits in text-based information extraction. A common paradigm in training BERT is to pretrain the model on large unlabeled generic language corpora, so that the model learns the distribution of the words in the language, and then fine-tune on a downstream task. In this paper, first, we show the uniqueness of language used in drug labels, which therefore cannot be optimally handled by other BERT models. Then, we present the developed PharmBERT, which is a BERT model specifically pretrained on the drug labels (publicly available at Hugging Face). We demonstrate that our model outperforms the vanilla BERT, ClinicalBERT and BioBERT in multiple NLP tasks in the drug label domain. Moreover, how the domain-specific pretraining has contributed to the superior performance of PharmBERT is demonstrated by analyzing different layers of PharmBERT, and more insight into how it understands different linguistic aspects of the data is gained.
Assuntos
Rotulagem de Medicamentos , Armazenamento e Recuperação da Informação , Humanos , Aprendizagem , Processamento de Linguagem NaturalRESUMO
Product-specific guidances (PSGs) recommended by the United States Food and Drug Administration (FDA) are instrumental to promote and guide generic drug product development. To assess a PSG, the FDA assessor needs to take extensive time and effort to manually retrieve supportive drug information of absorption, distribution, metabolism, and excretion (ADME) from the reference listed drug labeling. In this work, we leveraged the state-of-the-art pre-trained language models to automatically label the ADME paragraphs in the pharmacokinetics section from the FDA-approved drug labeling to facilitate PSG assessment. We applied a transfer learning approach by fine-tuning the pre-trained Bidirectional Encoder Representations from Transformers (BERT) model to develop a novel application of ADME semantic labeling, which can automatically retrieve ADME paragraphs from drug labeling instead of manual work. We demonstrate that fine-tuning the pre-trained BERT model can outperform conventional machine learning techniques, achieving up to 12.5% absolute F1 improvement. To our knowledge, we were the first to successfully apply BERT to solve the ADME semantic labeling task. We further assessed the relative contribution of pre-training and fine-tuning to the overall performance of the BERT model in the ADME semantic labeling task using a series of analysis methods, such as attention similarity and layer-based ablations. Our analysis revealed that the information learned via fine-tuning is focused on task-specific knowledge in the top layers of the BERT, whereas the benefit from the pre-trained BERT model is from the bottom layers.
Assuntos
Rotulagem de Medicamentos , Semântica , Estados Unidos , United States Food and Drug Administration , Idioma , Conhecimento , Processamento de Linguagem NaturalRESUMO
Food effect summarization from New Drug Application (NDA) is an essential component of product-specific guidance (PSG) development and assessment, which provides the basis of recommendations for fasting and fed bioequivalence studies to guide the pharmaceutical industry for developing generic drug products. However, manual summarization of food effect from extensive drug application review documents is time-consuming. Therefore, there is a need to develop automated methods to generate food effect summary. Recent advances in natural language processing (NLP), particularly large language models (LLMs) such as ChatGPT and GPT-4, have demonstrated great potential in improving the effectiveness of automated text summarization, but its ability with regard to the accuracy in summarizing food effect for PSG assessment remains unclear. In this study, we introduce a simple yet effective approach,iterative prompting, which allows one to interact with ChatGPT or GPT-4 more effectively and efficiently through multi-turn interaction. Specifically, we propose a three-turn iterative prompting approach to food effect summarization in which the keyword-focused and length-controlled prompts are respectively provided in consecutive turns to refine the quality of the generated summary. We conduct a series of extensive evaluations, ranging from automated metrics to FDA professionals and even evaluation by GPT-4, on 100 NDA review documents selected over the past five years. We observe that the summary quality is progressively improved throughout the iterative prompting process. Moreover, we find that GPT-4 performs better than ChatGPT, as evaluated by FDA professionals (43% vs. 12%) and GPT-4 (64% vs. 35%). Importantly, all the FDA professionals unanimously rated that 85% of the summaries generated by GPT-4 are factually consistent with the golden reference summary, a finding further supported by GPT-4 rating of 72% consistency. Taken together, these results strongly suggest a great potential for GPT-4 to draft food effect summaries that could be reviewed by FDA professionals, thereby improving the efficiency of the PSG assessment cycle and promoting generic drug product development.
Assuntos
Benchmarking , Medicamentos Genéricos , Idioma , Processamento de Linguagem NaturalRESUMO
BACKGROUND: Circadian misalignment can impair healthcare shift workers' physical and mental health, resulting in sleep deprivation, obesity, and chronic disease. This multidisciplinary research team assessed eating patterns and sleep/physical activity of healthcare workers on three different shifts (day, night, and rotating-shift). To date, no study of real-world shift workers' daily eating and sleep has utilized a largely-objective measurement. METHOD: During this fourteen-day observational study, participants wore two devices (Actiwatch and Bite Technologies counter) to measure physical activity, sleep, light exposure, and eating time. Participants also reported food intake via food diaries on personal mobile devices. RESULTS: In fourteen (5 day-, 5 night-, and 4 rotating-shift) participants, no baseline difference in BMI was observed. Overall, rotating-shift workers consumed fewer calories and had less activity and sleep than day- and night-shift workers. For eating patterns, compared to night- and rotating-shift, day-shift workers ate more frequently during work days. Night workers, however, consumed more calories at work relative to day and rotating workers. For physical activity and sleep, night-shift workers had the highest activity and least sleep on work days. CONCLUSION: This pilot study utilized primarily objective measurement to examine shift workers' habits outside the laboratory. Although no association between BMI and eating patterns/activity/sleep was observed across groups, a small, homogeneous sample may have influenced this. Overall, shift work was associated with 1) increased calorie intake and higher-fat and -carbohydrate diets and 2) sleep deprivation. A larger, more diverse sample can participate in future studies that objectively measure shift workers' real-world habits.
RESUMO
RRM2B plays a crucial role in DNA replication, repair and oxidative stress. While germline RRM2B mutations have been implicated in mitochondrial disorders, its relevance to cancer has not been established. Here, using TCGA studies, we investigated RRM2B alterations in cancer. We found that RRM2B is highly amplified in multiple tumor types, particularly in MYC-amplified tumors, and is associated with increased RRM2B mRNA expression. We also observed that the chromosomal region 8q22.3-8q24, is amplified in multiple tumors, and includes RRM2B, MYC along with several other cancer-associated genes. An analysis of genes within this 8q-amplicon showed that cancers that have both RRM2B-amplified along with MYC have a distinct pattern of amplification compared to cancers that are unaltered or those that have amplifications in RRM2B or MYC only. Investigation of curated biological interactions revealed that gene products of the amplified 8q22.3-8q24 region have important roles in DNA repair, DNA damage response, oxygen sensing, and apoptosis pathways and interact functionally. Notably, RRM2B-amplified cancers are characterized by mutation signatures of defective DNA repair and oxidative stress, and at least RRM2B-amplified breast cancers are associated with poor clinical outcome. These data suggest alterations in RR2MB and possibly the interacting 8q-proteins could have a profound effect on regulatory pathways such as DNA repair and cellular survival, highlighting therapeutic opportunities in these cancers.
RESUMO
Machine learning algorithms can learn mechanisms of antimicrobial resistance from the data of DNA sequence without any a priori information. Interpreting a trained machine learning algorithm can be exploited for validating the model and obtaining new information about resistance mechanisms. Different feature extraction methods, such as SNP calling and counting nucleotide k-mers have been proposed for presenting DNA sequences to the model. However, there are trade-offs between interpretability, computational complexity and accuracy for different feature extraction methods. In this study, we have proposed a new feature extraction method, counting amino acid k-mers or oligopeptides, which provides easier model interpretation compared to counting nucleotide k-mers and reaches the same or even better accuracy in comparison with different methods. Additionally, we have trained machine learning algorithms using different feature extraction methods and compared the results in terms of accuracy, model interpretability and computational complexity. We have built a new feature selection pipeline for extraction of important features so that new AMR determinants can be discovered by analyzing these features. This pipeline allows the construction of models that only use a small number of features and can predict resistance accurately.