ABSTRACT
OBJECTIVES: Vedolizumab (VDZ) and ustekinumab (UST) are second-line treatments in pediatric patients with ulcerative colitis (UC) refractory to antitumor necrosis factor (anti-TNF) therapy. Pediatric studies comparing the effectiveness of these medications are lacking. Using a registry from ImproveCareNow (ICN), a global research network in pediatric inflammatory bowel disease, we compared the effectiveness of UST and VDZ in anti-TNF refractory UC. METHODS: We performed a propensity-score weighted regression analysis to compare corticosteroid-free clinical remission (CFCR) at 6 months from starting second-line therapy. Sensitivity analyses tested the robustness of our findings to different ways of handling missing outcome data. Secondary analyses evaluated alternative proxies of response and infection risk. RESULTS: Our cohort included 262 patients on VDZ and 74 patients on UST. At baseline, the two groups differed on their mean pediatric UC activity index (PUCAI) (p = 0.03) but were otherwise similar. At Month 6, 28.3% of patients on VDZ and 25.8% of those on UST achieved CFCR (p = 0.76). Our primary model showed no difference in CFCR (odds ratio: 0.81; 95% confidence interval [CI]: 0.41-1.59) (p = 0.54). The time to biologic discontinuation was similar in both groups (hazard ratio: 1.26; 95% CI: 0.76-2.08) (p = 0.36), with the reference group being VDZ, and we found no differences in clinical response, growth parameters, hospitalizations, surgeries, infections, or malignancy risk. Sensitivity analyses supported these findings of similar effectiveness. CONCLUSIONS: UST and VDZ are similarly effective for inducing clinical remission in anti-TNF refractory UC in pediatric patients. Providers should consider safety, tolerability, cost, and comorbidities when deciding between these therapies.
Subject(s)
Humanized Monoclonal Antibodies , Ulcerative Colitis , Gastrointestinal Agents , Ustekinumab , Humans , Ulcerative Colitis/drug therapy , Ustekinumab/therapeutic use , Female , Male , Child , Humanized Monoclonal Antibodies/therapeutic use , Adolescent , Gastrointestinal Agents/therapeutic use , Treatment Outcome , Tumor Necrosis Factor-alpha/antagonists & inhibitors , Remission Induction/methods , Propensity Score , Registries
ABSTRACT
BACKGROUND: Although patients have easy access to their electronic health records and laboratory test result data through patient portals, laboratory test results are often confusing and hard to understand. Many patients turn to web-based forums or question-and-answer (Q&A) sites to seek advice from their peers. The quality of answers from social Q&A sites on health-related questions varies significantly, and not all responses are accurate or reliable. Large language models (LLMs) such as ChatGPT have opened a promising avenue for patients to have their questions answered. OBJECTIVE: We aimed to assess the feasibility of using LLMs to generate relevant, accurate, helpful, and unharmful responses to laboratory test-related questions asked by patients and identify potential issues that can be mitigated using augmentation approaches. METHODS: We collected laboratory test result-related Q&A data from Yahoo! Answers and selected 53 Q&A pairs for this study. Using the LangChain framework and ChatGPT web portal, we generated responses to the 53 questions from 5 LLMs: GPT-4, GPT-3.5, LLaMA 2, MedAlpaca, and ORCA_mini. We assessed the similarity of their answers using standard Q&A similarity-based evaluation metrics, including Recall-Oriented Understudy for Gisting Evaluation, Bilingual Evaluation Understudy, Metric for Evaluation of Translation With Explicit Ordering, and Bidirectional Encoder Representations from Transformers Score. We used an LLM-based evaluator to judge whether a target model had higher quality in terms of relevance, correctness, helpfulness, and safety than the baseline model. We performed a manual evaluation with medical experts for all the responses to 7 selected questions on the same 4 aspects. RESULTS: Regarding the similarity of the responses from the 4 LLMs, with the GPT-4 output used as the reference answer, the responses from GPT-3.5 were the most similar, followed by those from LLaMA 2, ORCA_mini, and MedAlpaca. Human answers from Yahoo data were scored the lowest and, thus, as the least similar to GPT-4-generated answers. The results of the win rate and medical expert evaluation both showed that GPT-4's responses achieved better scores than all the other LLM responses and human responses on all 4 aspects (relevance, correctness, helpfulness, and safety). LLM responses occasionally also suffered from lack of interpretation in one's medical context, incorrect statements, and lack of references. CONCLUSIONS: By evaluating LLMs in generating responses to patients' laboratory test result-related questions, we found that, compared to the other 4 LLMs and human answers from a Q&A website, GPT-4's responses were more accurate, helpful, relevant, and safer. There were cases in which GPT-4 responses were inaccurate and not individualized. We identified a number of ways to improve the quality of LLM responses, including prompt engineering, prompt augmentation, retrieval-augmented generation, and response evaluation.
Subject(s)
Artificial Intelligence , Electronic Health Records , Humans , Language
ABSTRACT
sRNAs are important post-transcriptional regulators in bacteria. The current study exploits the potential of next-generation technology combined with computational analyses to develop a whole-genome sRNA-gene network for drug-resistant S. aureus by subjecting public expression profiles to a novel analysis pipeline. Clustering and examination of the resulting global interactome indicated coordinated regulation of numerous processes by various sRNAs, with 9 sRNAs and 10 genes as potential hubs. Ten major sRNA modules were annotated with various functions, among which a major module comprising Rsa sRNAs was predicted to be a central regulatory unit. In addition, sRNA95, a hub molecule associated with this unit, was predicted to be a vulnerable target. Finally, novel associations between transcriptional regulators and sRNAs were mined, yielding some insights into the association between RNAIII and RsaA. To our knowledge, this is the first study in S. aureus to provide insights into global sRNA-gene interactions and to identify potential sRNAs for exploring sRNA-based therapeutic applications.
Subject(s)
Bacterial Proteins/genetics , Bacterial Gene Expression Regulation , Bacterial Genome , Small Untranslated RNA/genetics , RNA-Seq/methods , Staphylococcus aureus/genetics , Bacterial Proteins/metabolism , Computational Biology , Gene Regulatory Networks , Small Untranslated RNA/metabolism , Staphylococcal Infections/genetics , Staphylococcal Infections/microbiology , Staphylococcus aureus/growth & development , Staphylococcus aureus/metabolism , Transcriptome
ABSTRACT
The present study aimed to reveal the molecular mechanism of T-2 toxin-induced cerebral edema by aquaporin-4 (AQP4) blocking and permeation. AQP4 is a class of aquaporin channels that is mainly expressed in the brain, and its structural changes lead to life-threatening complications such as cardio-respiratory arrest, nephritis, and irreversible brain damage. We employed molecular dynamics simulation, text mining, and in vitro and in vivo analysis to study the structural and functional changes induced by the T-2 toxin on AQP4. The action of the toxin leads to disrupted permeation of water and permeation coefficients are found to be affected, from the native (2.49 ± 0.02 × 10⁻¹⁴ cm³/s) to toxin-treated AQP4 (7.68 ± 0.15 × 10⁻¹⁴ cm³/s) channels. Furthermore, the T-2 toxin forms strong electrostatic interactions at the binding site and pushes the key residues (Ala210, Phe77, Arg216, and His201) outward at the selectivity filter. Also, the role of a histidine residue in the AQP4 channel was identified by alchemical transformation and umbrella sampling methods. Alchemical free-energy perturbation energy for H201A → A201H, which was found to be 3.07 ± 0.18 kJ/mol, indicates the structural importance of the histidine residue at 201. In addition, histopathology and expression of AQP4 in the Mus musculus brain tissues show the damaged and altered expression of the protein. Text mining reveals the co-occurrence of genes/proteins associated with the AQP4 expression and T-2 toxin-induced cell apoptosis, which leads to cerebral edema.
Subject(s)
Aquaporin 4/metabolism , Brain Edema/metabolism , Brain/metabolism , T-2 Toxin/metabolism , Animals , Brain/pathology , Brain Edema/pathology , Cell Line , Male , Mice , Molecular Docking Simulation , Molecular Dynamics Simulation , Permeability , Thermodynamics , Water/metabolism
ABSTRACT
Biomedical Named Entity Recognition (Bio-NER) is the crucial initial step in the information extraction process and a major research focus in biomedical text mining. In past years, several models and methodologies have been proposed for the recognition of semantic types related to genes, proteins, chemicals, drugs, and other biologically relevant named entities. In this paper, we implemented a stacked ensemble approach combined with fuzzy matching for biomedical named entity recognition of disease names. The underlying concept of stacked generalization is to combine the outputs of base-level classifiers using a second-level meta-classifier in an ensemble. We used Conditional Random Field (CRF) as the underlying classification method, which makes use of a diverse set of features, mostly domain-specific, orthographic, and morphological. In addition, we used fuzzy string matching to tag rare disease names from our in-house disease dictionary. For fuzzy matching, we incorporated the two best fuzzy search algorithms, Rabin-Karp and Tuned Boyer-Moore. Our proposed approach shows promising results, with F-measures of 94.66%, 89.12%, 84.10%, and 76.71% when evaluated on the training and testing sets of both the NCBI disease and BioCreative V CDR corpora.
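As an illustration of the dictionary lookup step, the following is a minimal sketch of Rabin-Karp search, which compares a rolling hash of each text window against the hash of the dictionary term. It performs exact matching only and is not the authors' implementation; the function name `rabin_karp`, the `base` and `mod` constants, and the sample sentence are assumptions made for this example.

```python
def rabin_karp(text, pattern, base=256, mod=1_000_003):
    """Return every start index where `pattern` occurs in `text` (exact match)."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # Weight of the leading character in a window of length m.
    high = pow(base, m - 1, mod)
    pattern_hash = window_hash = 0
    for i in range(m):
        pattern_hash = (pattern_hash * base + ord(pattern[i])) % mod
        window_hash = (window_hash * base + ord(text[i])) % mod
    hits = []
    for i in range(n - m + 1):
        # Equal hashes may be a collision, so confirm with a direct comparison.
        if pattern_hash == window_hash and text[i:i + m] == pattern:
            hits.append(i)
        if i < n - m:
            # Roll the window: drop text[i], append text[i + m].
            window_hash = ((window_hash - ord(text[i]) * high) * base
                           + ord(text[i + m])) % mod
    return hits


# Hypothetical sentence and dictionary term, for illustration only.
sentence = "patients with cystic fibrosis and fibrosis of the lung"
positions = rabin_karp(sentence, "fibrosis")
```

A fuzzy variant would additionally tolerate small spelling differences; that extension is not shown here.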
Subject(s)
Algorithms , Computational Biology , Data Mining , Disease , Classification , Fuzzy Logic , Genes , Humans , Proteins
ABSTRACT
microRNA (miRNA)-messenger RNA (mRNA or gene) interactions are pivotal in various biological processes, including the regulation of gene expression, cellular differentiation, proliferation, apoptosis, and development, as well as the maintenance of cellular homeostasis and pathogenesis of numerous diseases, such as cancer, cardiovascular diseases, neurological disorders, and metabolic conditions. Understanding the mechanisms of miRNA-mRNA interactions can provide insights into disease mechanisms and potential therapeutic targets. However, extracting these interactions efficiently from a huge collection of published articles in PubMed is challenging. In the current study, we annotated a miRNA-mRNA Interaction Corpus (MMIC) and used it for evaluating the performance of a variety of machine learning (ML) models, deep learning-based transformer (DLT) models, and large language models (LLMs) in extracting the miRNA-mRNA interactions mentioned in PubMed. We used the genomics approaches for validating the extracted miRNA-mRNA interactions. Among the ML, DLT, and LLM models, PubMedBERT showed the highest precision, recall, and F-score, with all equal to 0.783. Among the LLM models, the performance of Llama-2 is better when compared to others. Llama 2 achieved 0.56 precision, 0.86 recall, and 0.68 F-score in a zero-shot experiment and 0.56 precision, 0.87 recall, and 0.68 F-score in a three-shot experiment. Our study shows that Llama 2 achieves better recall than ML and DLT models and leaves space for further improvement in terms of precision and F-score.
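The F-scores quoted above are consistent with the usual harmonic mean of precision and recall. A minimal sketch, assuming the standard F1 definition (the abstract does not state which F-measure variant was used); the function name `f_score` and its `beta` parameter are illustrative:

```python
def f_score(precision, recall, beta=1.0):
    """F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Values quoted in the abstract for Llama 2.
zero_shot = round(f_score(0.56, 0.86), 2)   # 0.68
three_shot = round(f_score(0.56, 0.87), 2)  # 0.68
```

Because the harmonic mean is pulled toward the smaller of the two inputs, the high recall (0.86 to 0.87) cannot offset the lower precision (0.56), which is why the F-score stays at 0.68.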
ABSTRACT
About 1 in 9 adults over 65 has Alzheimer's disease (AD), many of whom also have multiple other chronic conditions such as hypertension and diabetes, necessitating careful monitoring through laboratory tests. Understanding the patterns of laboratory tests in this population aids our understanding and management of these chronic conditions along with AD. In this study, we used a unimodal cosinor model to assess the seasonality of lab tests using electronic health record (EHR) data from 34,303 AD patients from the OneFlorida+ Clinical Research Consortium. We observed significant seasonal fluctuations, higher in winter, in lab tests such as glucose, neutrophils per 100 white blood cells (WBC), and WBC. Notably, certain leukocyte types like eosinophils, lymphocytes, and monocytes are elevated during summer, likely reflecting seasonal respiratory diseases and allergens. Seasonality is more pronounced in older patients and varies by gender. Our findings suggest that recognizing these patterns and adjusting reference intervals for seasonality would allow healthcare providers to enhance diagnostic precision, tailor care, and potentially improve patient outcomes.
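A single-component cosinor model is commonly written as y(t) = M + A·cos(2πt/τ + φ), where M is the mesor, A the amplitude, τ the period, and φ the acrophase. The sketch below fits it by ordinary least squares after rewriting it as y = M + β·cos(ωt) + γ·sin(ωt). It is an illustration only, not the study's code: the function name `fit_cosinor`, the 12-month period, and the synthetic series are all assumptions.

```python
import math


def fit_cosinor(times, values, period=12.0):
    """Least-squares fit of y = M + A*cos(2*pi*t/period + phi).

    Returns (mesor, amplitude, peak_time), with peak_time in the units of `times`.
    """
    w = 2 * math.pi / period
    # Design matrix rows: intercept, cos(wt), sin(wt).
    rows = [(1.0, math.cos(w * t), math.sin(w * t)) for t in times]
    # Normal equations (X^T X) coef = X^T y.
    a = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    b = [sum(r[i] * y for r, y in zip(rows, values)) for i in range(3)]
    # Gaussian elimination with partial pivoting on the 3x3 system.
    for c in range(3):
        p = max(range(c, 3), key=lambda r: abs(a[r][c]))
        a[c], a[p] = a[p], a[c]
        b[c], b[p] = b[p], b[c]
        for r in range(c + 1, 3):
            f = a[r][c] / a[c][c]
            for k in range(c, 3):
                a[r][k] -= f * a[c][k]
            b[r] -= f * b[c]
    # Back substitution.
    coef = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        coef[i] = (b[i] - sum(a[i][k] * coef[k] for k in range(i + 1, 3))) / a[i][i]
    mesor, beta, gamma = coef
    amplitude = math.hypot(beta, gamma)
    acrophase = math.atan2(-gamma, beta)      # phi in the model above
    peak_time = (-acrophase / w) % period     # time at which the cosine peaks
    return mesor, amplitude, peak_time


# Synthetic monthly series (not study data): mean 100, amplitude 5, peak at month 1.
months = list(range(24))
series = [100 + 5 * math.cos(2 * math.pi * (t - 1) / 12) for t in months]
mesor, amplitude, peak_month = fit_cosinor(months, series)
```

In practice a statistics package would also report a significance test for the amplitude; this sketch returns only the point estimates.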
ABSTRACT
Background: Even though patients have easy access to their electronic health records and lab test results data through patient portals, lab results are often confusing and hard to understand. Many patients turn to online forums or question and answering (Q&A) sites to seek advice from their peers. However, the quality of answers from social Q&A on health-related questions varies significantly, and not all the responses are accurate or reliable. Large language models (LLMs) such as ChatGPT have opened a promising avenue for patients to get their questions answered. Objective: We aim to assess the feasibility of using LLMs to generate relevant, accurate, helpful, and unharmful responses to lab test-related questions asked by patients and to identify potential issues that can be mitigated with augmentation approaches. Methods: We first collected lab test results related question and answer data from Yahoo! Answers and selected 53 Q&A pairs for this study. Using the LangChain framework and ChatGPT web portal, we generated responses to the 53 questions from four LLMs including GPT-4, Meta LLaMA 2, MedAlpaca, and ORCA_mini. We first assessed the similarity of their answers using standard QA similarity-based evaluation metrics including ROUGE, BLEU, METEOR, BERTScore. We also utilized an LLM-based evaluator to judge whether a target model has higher quality in terms of relevance, correctness, helpfulness, and safety than the baseline model. Finally, we performed a manual evaluation with medical experts for all the responses of seven selected questions on the same four aspects. Results: Regarding the similarity of the responses from 4 LLMs, where GPT-4 output was used as the reference answer, the responses from LLaMa 2 are the most similar ones, followed by LLaMa 2, ORCA_mini, and MedAlpaca. Human answers from Yahoo data were scored lowest and thus least similar to GPT-4-generated answers. 
The results of Win Rate and medical expert evaluation both showed that GPT-4's responses achieved better scores than all the other LLM responses and human responses on all the four aspects (relevance, correctness, helpfulness, and safety). However, LLM responses occasionally also suffer from lack of interpretation in one's medical context, incorrect statements, and lack of references. Conclusions: By evaluating LLMs in generating responses to patients' lab test results related questions, we find that compared to other three LLMs and human answer from the Q&A website, GPT-4's responses are more accurate, helpful, relevant, and safer. However, there are cases that GPT-4 responses are inaccurate and not individualized. We identified a number of ways to improve the quality of LLM responses including prompt engineering, prompt augmentation, retrieval augmented generation, and response evaluation.
ABSTRACT
BACKGROUND: The Mayo endoscopic subscore (MES) is an important quantitative measure of disease activity in ulcerative colitis. Colonoscopy reports in routine clinical care usually characterize ulcerative colitis disease activity using free text description, limiting their utility for clinical research and quality improvement. We sought to develop algorithms to classify colonoscopy reports according to their MES. METHODS: We annotated 500 colonoscopy reports from 2 health systems. We trained and evaluated 4 classes of algorithms. Our primary outcome was accuracy in identifying scorable reports (binary) and assigning an MES (ordinal). Secondary outcomes included learning efficiency, generalizability, and fairness. RESULTS: Automated machine learning models achieved 98% and 97% accuracy on the binary and ordinal prediction tasks, outperforming other models. Binary models trained on the University of California, San Francisco data alone maintained accuracy (96%) on validation data from Zuckerberg San Francisco General. When using 80% of the training data, models remained accurate for the binary task (97% [n = 320]) but lost accuracy on the ordinal task (67% [n = 194]). We found no evidence of bias by gender (P = .65) or area deprivation index (P = .80). CONCLUSIONS: We derived a highly accurate pair of models capable of classifying reports by their MES and recognizing when to abstain from prediction. Our models were generalizable on outside institution validation. There was no evidence of algorithmic bias. Our methods have the potential to enable retrospective studies of treatment effectiveness, prospective identification of patients meeting study criteria, and quality improvement efforts in inflammatory bowel diseases.
Our accurate pair of models automatically classify colonoscopy reports by Mayo endoscopic subscore and abstain from prediction appropriately. Our methods can enable large-scale electronic health record studies of treatment effectiveness, prospective identification of patients for clinical trials, and quality improvement efforts in ulcerative colitis.
ABSTRACT
BACKGROUND: Acute hepatic porphyria (AHP) is a group of rare but treatable conditions associated with diagnostic delays of 15 years on average. The advent of electronic health records (EHR) data and machine learning (ML) may improve the timely recognition of rare diseases like AHP. However, prediction models can be difficult to train given the limited case numbers, unstructured EHR data, and selection biases intrinsic to healthcare delivery. We sought to train and characterize models for identifying patients with AHP. METHODS: This diagnostic study used structured and notes-based EHR data from 2 centers at the University of California, UCSF (2012-2022) and UCLA (2019-2022). The data were split into 2 cohorts (referral and diagnosis) and used to develop models that predict (1) who will be referred for testing of acute porphyria, among those who presented with abdominal pain (a cardinal symptom of AHP), and (2) who will test positive, among those referred. The referral cohort consisted of 747 patients referred for testing and 99 849 contemporaneous patients who were not. The diagnosis cohort consisted of 72 confirmed AHP cases and 347 patients who tested negative. The case cohort was 81% female and 6-75 years old at the time of diagnosis. Candidate models used a range of architectures. Feature selection was semi-automated and incorporated publicly available data from knowledge graphs. Our primary outcome was the F-score on an outcome-stratified test set. RESULTS: The best center-specific referral models achieved an F-score of 86%-91%. The best diagnosis model achieved an F-score of 92%. To further test our model, we contacted 372 current patients who lack an AHP diagnosis but were predicted by our models as potentially having it (≥10% probability of referral, ≥50% of testing positive). However, we were only able to recruit 10 of these patients for biochemical testing, all of whom were negative. 
Nonetheless, post hoc evaluations suggested that these models could identify 71% of cases earlier than their diagnosis date, saving 1.2 years. CONCLUSIONS: ML can reduce diagnostic delays in AHP and other rare diseases. Robust recruitment strategies and multicenter coordination will be needed to validate these models before they can be deployed.
ABSTRACT
Outpatient clinical notes are a rich source of information regarding drug safety. However, data in these notes are currently underutilized for pharmacovigilance due to methodological limitations in text mining. Large language models (LLMs) like Bidirectional Encoder Representations from Transformers (BERT) have shown progress in a range of natural language processing tasks but have not yet been evaluated on adverse event (AE) detection. We adapted a new clinical LLM, University of California - San Francisco (UCSF)-BERT, to identify serious AEs (SAEs) occurring after treatment with a non-steroid immunosuppressant for inflammatory bowel disease (IBD). We compared this model to other language models that have previously been applied to AE detection. We annotated 928 outpatient IBD notes corresponding to 928 individual patients with IBD for all SAE-associated hospitalizations occurring after treatment with a non-steroid immunosuppressant. These notes contained 703 SAEs in total, the most common of which was failure of intended efficacy. Out of eight candidate models, UCSF-BERT achieved the highest numerical performance on identifying drug-SAE pairs from this corpus (accuracy 88-92%, macro F1 61-68%), with 5-10% greater accuracy than previously published models. UCSF-BERT was significantly superior at identifying hospitalization events emergent to medication use (P < 0.01). LLMs like UCSF-BERT achieve numerically superior accuracy on the challenging task of SAE detection from clinical notes compared with prior methods. Future work is needed to adapt this methodology to improve model performance and evaluation using multicenter data and newer architectures like Generative pre-trained transformer (GPT). Our findings support the potential value of using large language models to enhance pharmacovigilance.
Subject(s)
Algorithms , Immunosuppressive Agents , Inflammatory Bowel Diseases , Natural Language Processing , Pharmacovigilance , Humans , Pilot Projects , Inflammatory Bowel Diseases/drug therapy , Immunosuppressive Agents/adverse effects , Data Mining/methods , Drug-Related Side Effects and Adverse Reactions/diagnosis , Adverse Drug Reaction Reporting Systems , Electronic Health Records , Female , Male , Hospitalization/statistics & numerical data
ABSTRACT
Importance: Acute Hepatic Porphyria (AHP) is a group of rare but treatable conditions associated with diagnostic delays of fifteen years on average. The advent of electronic health records (EHR) data and machine learning (ML) may improve the timely recognition of rare diseases like AHP. However, prediction models can be difficult to train given the limited case numbers, unstructured EHR data, and selection biases intrinsic to healthcare delivery. Objective: To train and characterize models for identifying patients with AHP. Design Setting and Participants: This diagnostic study used structured and notes-based EHR data from two centers at the University of California, UCSF (2012-2022) and UCLA (2019-2022). The data were split into two cohorts (referral, diagnosis) and used to develop models that predict: 1) who will be referred for testing of acute porphyria, amongst those who presented with abdominal pain (a cardinal symptom of AHP), and 2) who will test positive, amongst those referred. The referral cohort consisted of 747 patients referred for testing and 99,849 contemporaneous patients who were not. The diagnosis cohort consisted of 72 confirmed AHP cases and 347 patients who tested negative. Cases were female predominant and 6-75 years old at the time of diagnosis. Candidate models used a range of architectures. Feature selection was semi-automated and incorporated publicly available data from knowledge graphs. Main Outcomes and Measures: F-score on an outcome-stratified test set. Results: The best center-specific referral models achieved an F-score of 86-91%. The best diagnosis model achieved an F-score of 92%. To further test our model, we contacted 372 current patients who lack an AHP diagnosis but were predicted by our models as potentially having it (≥ 10% probability of referral, ≥ 50% of testing positive). However, we were only able to recruit 10 of these patients for biochemical testing, all of whom were negative. 
Nonetheless, post hoc evaluations suggested that these models could identify 71% of cases earlier than their diagnosis date, saving 1.2 years. Conclusions and Relevance: ML can reduce diagnostic delays in AHP and other rare diseases. Robust recruitment strategies and multicenter coordination will be needed to validate these models before they can be deployed.
ABSTRACT
Background and Aims: Outpatient clinical notes are a rich source of information regarding drug safety. However, data in these notes are currently underutilized for pharmacovigilance due to methodological limitations in text mining. Large language models (LLM) like BERT have shown progress in a range of natural language processing tasks but have not yet been evaluated on adverse event detection. Methods: We adapted a new clinical LLM, UCSF BERT, to identify serious adverse events (SAEs) occurring after treatment with a non-steroid immunosuppressant for inflammatory bowel disease (IBD). We compared this model to other language models that have previously been applied to AE detection. Results: We annotated 928 outpatient IBD notes corresponding to 928 individual IBD patients for all SAE-associated hospitalizations occurring after treatment with a non-steroid immunosuppressant. These notes contained 703 SAEs in total, the most common of which was failure of intended efficacy. Out of 8 candidate models, UCSF BERT achieved the highest numerical performance on identifying drug-SAE pairs from this corpus (accuracy 88-92%, macro F1 61-68%), with 5-10% greater accuracy than previously published models. UCSF BERT was significantly superior at identifying hospitalization events emergent to medication use (p < 0.01). Conclusions: LLMs like UCSF BERT achieve numerically superior accuracy on the challenging task of SAE detection from clinical notes compared to prior methods. Future work is needed to adapt this methodology to improve model performance and evaluation using multi-center data and newer architectures like GPT. Our findings support the potential value of using large language models to enhance pharmacovigilance.
ABSTRACT
The major outcomes and insights of scientific research and clinical studies end up as publications or clinical records in unstructured text format. Owing to advances in biomedical research, the published literature has grown tremendously in recent years. Scientists and clinical researchers face a big challenge in staying current with this knowledge and in extracting hidden information from millions of published biomedical articles. A potential one-stop automated solution to this problem is biomedical literature mining. One of the long-standing goals in biology is to discover disease-causing genes and their specific roles in personalized precision medicine and drug repurposing. However, empirical approaches and clinical confirmation are expensive and time-consuming. An in silico approach using text mining to identify disease-causing genes can contribute to biomarker discovery. This chapter presents a protocol for combining literature mining and machine learning to predict biomedical discoveries, with a special emphasis on gene-disease relation-based discovery. The protocol is presented as a literature-based discovery (LBD) pipeline for gene-disease-based discovery. The protocol includes our web-based tools: (1) DNER (Disease Named Entity Recognizer) for disease entity recognition, (2) BCCNER (Bidirectional, Contextual clues Named Entity Tagger) for gene/protein entity recognition, (3) DisGeReExT (Disease-Gene Relation Extractor) for statistically validated results and visualization, and (4) a newly introduced deep learning-based method for association discovery. Our proposed deep learning-based method can be generalized and applied to other important biomedical discoveries focusing on entities such as drugs/chemicals or miRNAs.
Subject(s)
Biomedical Research , Machine Learning , Data Mining/methods , Drug Repositioning
ABSTRACT
In biomedicine, facts about relations between entities (disease, gene, drug, etc.) are hidden in a large trove of 30 million scientific publications. Curated information has been shown to play an important role in applications such as drug repurposing and precision medicine. Recently, advances in deep learning have led to a transformer architecture named BERT (Bidirectional Encoder Representations from Transformers). This pretrained language model, trained on the Books Corpus (800M words) and English Wikipedia (2500M words), reported state-of-the-art results in various NLP (Natural Language Processing) tasks, including relation extraction. It is widely accepted that, because of the shift in word distribution, general-domain models perform poorly in information extraction tasks in the biomedical domain. For this reason, the architecture was later adapted to the biomedical domain by training the language models on 28 million scientific articles from PubMed and PubMed Central. This chapter presents a protocol for relation extraction using BERT by discussing the state of the art for BERT versions in the biomedical domain, such as BioBERT. The protocol emphasizes the general BERT architecture, pretraining and fine-tuning, leveraging biomedical information, and finally infusing a knowledge graph into the BERT model layer.
Subject(s)
Information Storage and Retrieval , Natural Language Processing , Language , PubMed , Publications
ABSTRACT
A novel coronavirus (SARS-CoV-2) has caused a major outbreak in humans all over the world. Several proteins interplay during the entry and replication of this virus in humans. Here, we used text mining and a named entity recognition method to identify co-occurrence of the important COVID-19 genes/proteins in the interaction network based on the frequency of interaction. Network analysis revealed a set of genes/proteins, highly dense gene/protein clusters, and sub-networks of angiotensin-converting enzyme 2 (ACE2), helicase, spike (S) protein (trimeric), membrane (M) protein, envelope (E) protein, and the nucleocapsid (N) protein. The isolated proteins were screened against procyanidin, a flavonoid from plants, using molecular docking. Further, molecular dynamics simulations of critical proteins such as ACE2, Mpro, and spike were performed to elucidate the inhibition mechanism. The strong network of hydrogen bonds and hydrophobic interactions, along with van der Waals interactions, inhibits the receptors that are essential to the entry and replication of SARS-CoV-2. The binding energy, which arises largely from van der Waals interactions and was calculated through the molecular mechanics Poisson-Boltzmann surface area method (ACE2 = -50.21 ± 6.3, Mpro = -89.50 ± 6.32, and spike = -23.06 ± 4.39), also confirms the affinity of procyanidin for the critical receptors. Communicated by Ramaswamy H. Sarma.
Subject(s)
COVID-19 , Proanthocyanidins , Data Mining , Humans , Molecular Docking Simulation , Molecular Dynamics Simulation , Protein Binding , SARS-CoV-2 , Coronavirus Spike Glycoprotein/metabolism
ABSTRACT
A wealth of knowledge concerning relations between genes and their associated diseases is present in the biomedical literature. Mining these biological associations from the literature can provide immense support to research ranging from drug-targetable pathways to biomarker discovery. However, the time and cost of manual curation slow this down heavily. In this scenario, biomedical text mining is a crucial technology, and relation extraction shows promising results for exploring research on genes associated with diseases. We addressed this problem from a text mining perspective by developing automatic extraction of gene-disease associations from the literature using joint ensemble learning. In the proposed work, we employ a supervised machine learning approach in which a rich feature set covering conceptual, syntactic, and semantic properties, jointly learned with word embeddings, is used to train an ensemble support vector machine for extracting gene-disease relations from four gold standard corpora. On evaluation, the machine learning approach shows promising results, with F-measures of 85.34%, 83.93%, 87.39%, and 85.57% on the EUADR, GAD, CoMAGC, and PolySearch corpora, respectively. We strongly believe that the presented novel approach, which combines a rich syntactic and semantic feature set with domain-specific word embeddings through ensemble support vector machines and was evaluated on four gold standard corpora, can act as a new baseline for future work in gene-disease relation extraction from the literature.
Subject(s)
Data Mining/methods , Disease/genetics , Information Storage and Retrieval/methods , Semantics , Support Vector Machine , Humans , PubMed
ABSTRACT
BACKGROUND AND OBJECTIVES: Travel to elevations above 2500 m is associated with the risk of developing one or more forms of acute altitude illness, such as acute mountain sickness (AMS), high altitude cerebral edema (HACE), or high altitude pulmonary edema (HAPE). Our work aims to identify the functional association of genes involved in high altitude diseases. METHOD: We identified the gene networks responsible for high altitude diseases using gene co-occurrence statistics from the literature and network analysis. First, we mined the PubMed literature on high-altitude diseases and extracted the co-occurring gene pairs. Next, we ranked the gene pairs by their co-occurrence frequency. Finally, we created a gene association network using statistical measures to explore potential relationships. RESULTS: Network analysis revealed that EPO, ACE, IL6, and TNF are among the top five genes found to co-occur with 20 or more genes, while the association between the EPAS1 and EGLN1 genes is strongly substantiated. CONCLUSION: The network constructed in this study proposes a large number of genes that work in toto in high altitude conditions. Overall, the results provide a good reference for further study of the genetic relationships in high altitude diseases.
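The three METHOD steps (extract co-occurring gene pairs, rank them by frequency, build a network) can be illustrated with a minimal sketch. It assumes gene symbols have already been tagged per abstract; the toy input, the function names `cooccurrence_counts` and `build_network`, and the `min_count` threshold are hypothetical, and raw counts stand in for the unspecified statistical measures used in the study.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical toy input: the set of gene symbols tagged in each abstract.
# These sets are invented for illustration and are not data from the study.
abstract_genes = [
    {"EPAS1", "EGLN1"},
    {"EPAS1", "EGLN1", "EPO"},
    {"EPO", "IL6", "TNF"},
    {"ACE", "EPO"},
    {"EPAS1", "EGLN1", "ACE"},
]


def cooccurrence_counts(docs):
    """Count how many documents mention each unordered gene pair."""
    counts = Counter()
    for genes in docs:
        # Sorting gives every pair one canonical (a, b) ordering.
        for a, b in combinations(sorted(genes), 2):
            counts[(a, b)] += 1
    return counts


def build_network(pair_counts, min_count=1):
    """Build an undirected adjacency map from pairs seen at least min_count times."""
    network = defaultdict(set)
    for (a, b), n in pair_counts.items():
        if n >= min_count:
            network[a].add(b)
            network[b].add(a)
    return network


pair_counts = cooccurrence_counts(abstract_genes)
ranked_pairs = pair_counts.most_common()  # pairs ranked by co-occurrence frequency
network = build_network(pair_counts)
degree = {gene: len(neighbours) for gene, neighbours in network.items()}

# List genes by the number of distinct genes they co-occur with.
for gene, d in sorted(degree.items(), key=lambda item: (-item[1], item[0])):
    print(gene, d)
```

In this sketch, the degree of a gene corresponds to the number of distinct genes it co-occurs with, which is the quantity the RESULTS refer to when reporting genes that co-occur with 20 or more genes.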
Subject(s)
Altitude Sickness/genetics , Brain Edema/genetics , Data Mining , Gene Regulatory Networks/genetics , Hypertension, Pulmonary/genetics , Humans
ABSTRACT
Tagging biomedical entities such as genes, proteins, cells, and cell lines is the first step and an important prerequisite in biomedical literature mining. In this paper, we describe our hybrid named entity tagging approach, BCC-NER (bidirectional, contextual clues named entity tagger for gene/protein mention recognition). BCC-NER consists of three modules. The first module handles text processing, which includes basic NLP pre-processing, feature extraction, and feature selection. The second module handles training and model building: bidirectional conditional random fields (CRF) parse the text in both directions (forward and backward), and the forward- and backward-trained models are integrated using the margin-infused relaxed algorithm (MIRA). The third and final module handles post-processing to improve performance, using surrounding text features, parenthesis mismatch correction, and a two-tier abbreviation algorithm. Evaluated on the BioCreative II GM test corpus, BCC-NER achieves a precision of 89.95, a recall of 84.15, and an overall F-score of 86.95, which is higher than that of other currently available open source taggers.
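The reported F-score is consistent with the balanced harmonic mean of the reported precision and recall. The following minimal sketch checks that arithmetic; the function name `f_score` and the `beta` parameter are illustrative assumptions and are not part of BCC-NER.

```python
def f_score(precision, recall, beta=1.0):
    """F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall as reported for the BioCreative II GM test corpus.
f1 = f_score(89.95, 84.15)
print(round(f1, 2))
```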