Search | VHL Search Portal

1.

A High-Quality Blue Whale Genome, Segmental Duplications, and Historical Demography.

Bukhman, Yury V; Morin, Phillip A; Meyer, Susanne; Chu, Li-Fang; Jacobsen, Jeff K; Antosiewicz-Bourget, Jessica; Mamott, Daniel; Gonzales, Maylie; Argus, Cara; Bolin, Jennifer; Berres, Mark E; Fedrigo, Olivier; Steill, John; Swanson, Scott A; Jiang, Peng; Rhie, Arang; Formenti, Giulio; Phillippy, Adam M; Harris, Robert S; Wood, Jonathan M D; Howe, Kerstin; Kirilenko, Bogdan M; Munegowda, Chetan; Hiller, Michael; Jain, Aashish; Kihara, Daisuke; Johnston, J Spencer; Ionkov, Alexander; Raja, Kalpana; Toh, Huishi; Lang, Aimee; Wolf, Magnus; Jarvis, Erich D; Thomson, James A; Chaisson, Mark J P; Stewart, Ron.

Mol Biol Evol ; 41(3)2024 Mar 01.

Article in English | MEDLINE | ID: mdl-38376487

ABSTRACT

The blue whale, Balaenoptera musculus, is the largest animal known to have ever existed, making it an important case study in longevity and resistance to cancer. To further this and other blue whale-related research, we report a reference-quality, long-read-based genome assembly of this fascinating species. We assembled the genome from PacBio long reads and utilized Illumina/10×, optical maps, and Hi-C data for scaffolding, polishing, and manual curation. We also provided long read RNA-seq data to facilitate the annotation of the assembly by NCBI and Ensembl. Additionally, we annotated both haplotypes using TOGA and measured the genome size by flow cytometry. We then compared the blue whale genome with other cetaceans and artiodactyls, including vaquita (Phocoena sinus), the world's smallest cetacean, to investigate the blue whale's unique biological traits. We found a dramatic amplification of several genes in the blue whale genome resulting from a recent burst in segmental duplications, though the possible connection between this amplification and giant body size requires further study. We also discovered sites in the insulin-like growth factor-1 gene correlated with body size in cetaceans. Finally, using our assembly to examine the heterozygosity and historical demography of Pacific and Atlantic blue whale populations, we found that the genomes of both populations are highly heterozygous and that their genetic isolation dates to the last interglacial period. Taken together, these results indicate how a high-quality, annotated blue whale genome will serve as an important resource for biology, evolution, and conservation research.

Subject(s)

Balaenoptera , Neoplasms , Animals , Balaenoptera/genetics , Segmental Duplications, Genomic , Genome , Demography , Neoplasms/genetics

2.

Advancing entity recognition in biomedicine via instruction tuning of large language models.

Keloth, Vipina K; Hu, Yan; Xie, Qianqian; Peng, Xueqing; Wang, Yan; Zheng, Andrew; Selek, Melih; Raja, Kalpana; Wei, Chih Hsuan; Jin, Qiao; Lu, Zhiyong; Chen, Qingyu; Xu, Hua.

Bioinformatics ; 40(4)2024 Mar 29.

Article in English | MEDLINE | ID: mdl-38514400

ABSTRACT

MOTIVATION: Large Language Models (LLMs) have the potential to revolutionize the field of Natural Language Processing, excelling not only in text generation and reasoning tasks but also in their ability for zero/few-shot learning, swiftly adapting to new tasks with minimal fine-tuning. LLMs have also demonstrated great promise in biomedical and healthcare applications. However, when it comes to Named Entity Recognition (NER), particularly within the biomedical domain, LLMs fall short of the effectiveness exhibited by fine-tuned domain-specific models. One key reason is that NER is typically conceptualized as a sequence labeling task, whereas LLMs are optimized for text generation and reasoning tasks. RESULTS: We developed an instruction-based learning paradigm that transforms biomedical NER from a sequence labeling task into a generation task. This paradigm is end-to-end and streamlines the training and evaluation process by automatically repurposing pre-existing biomedical NER datasets. We further developed BioNER-LLaMA using the proposed paradigm with LLaMA-7B as the foundational LLM. We conducted extensive testing on BioNER-LLaMA across three widely recognized biomedical NER datasets, consisting of entities related to diseases, chemicals, and genes. The results revealed that BioNER-LLaMA consistently achieved F1-scores that were 5% to 30% higher than those obtained with few-shot learning using GPT-4 on datasets with different biomedical entities. We show that a general-domain LLM can match the performance of rigorously fine-tuned PubMedBERT models and PMC-LLaMA, a biomedical-specific language model. Our findings underscore the potential of our proposed paradigm in developing general-domain LLMs that can rival SOTA performances in multi-task, multi-domain scenarios in biomedical and health applications. AVAILABILITY AND IMPLEMENTATION: Datasets and other resources are available at https://github.com/BIDS-Xu-Lab/BioNER-LLaMA.
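The recasting of NER from sequence labeling into generation can be pictured with a small sketch. It assumes a BIO-tagged dataset and an instruction/input/output record layout; the function names, the instruction wording, and the example sentence are illustrative assumptions, not the format used by BioNER-LLaMA.

```python
def bio_to_spans(tokens, tags):
    """Collect (entity_text, entity_type) pairs from BIO tags."""
    spans, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close any entity that was still open.
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == current_type:
            current.append(token)
        else:
            # "O" tag or an inconsistent "I-" tag ends the open entity.
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        spans.append((" ".join(current), current_type))
    return spans


def to_instruction_example(tokens, tags, entity_type):
    """Build one generation-style record for a single entity type.

    The instruction text is a hypothetical template.
    """
    mentions = [text for text, etype in bio_to_spans(tokens, tags)
                if etype == entity_type]
    return {
        "instruction": (f"List every {entity_type} mention in the text, "
                        "separated by '; '. Answer 'None' if there are none."),
        "input": " ".join(tokens),
        "output": "; ".join(mentions) if mentions else "None",
    }


tokens = ["Aspirin", "reduces", "the", "risk", "of",
          "myocardial", "infarction", "."]
tags = ["B-Chemical", "O", "O", "O", "O", "B-Disease", "I-Disease", "O"]
example = to_instruction_example(tokens, tags, "Disease")
print(example["output"])
```

Because the target is plain text, an existing sequence-labeled corpus can be repurposed for generative fine-tuning without new annotation, which is the property the abstract highlights.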

Subject(s)

Camelids, New World , Deep Learning , Animals , Language , Natural Language Processing

3.

Towards precise PICO extraction from abstracts of randomized controlled trials using a section-specific learning approach.

Hu, Yan; Keloth, Vipina K; Raja, Kalpana; Chen, Yong; Xu, Hua.

Bioinformatics ; 2023 Sep 05.

Article in English | MEDLINE | ID: mdl-37669123

ABSTRACT

MOTIVATION: Automated extraction of participants, intervention, comparison/control, and outcome (PICO) from randomized controlled trial (RCT) abstracts is important for evidence synthesis. Previous studies have demonstrated the feasibility of applying natural language processing (NLP) for PICO extraction. However, the performance is not optimal due to the complexity of PICO information in RCT abstracts and the challenges involved in their annotation. RESULTS: We propose a two-step NLP pipeline to extract PICO elements from RCT abstracts: (i) sentence classification using a prompt-based learning model and (ii) PICO extraction using a named entity recognition (NER) model. First, the sentences in abstracts were categorized into four sections, namely background, methods, results, and conclusions. Next, the NER model was applied to extract the PICO elements from the sentences within the title and methods sections, which include >96% of PICO information. We evaluated our proposed NLP pipeline on three datasets: the EBM-NLPmod dataset, a randomly selected and reannotated dataset of 500 RCT abstracts from the EBM-NLP corpus; a dataset of 150 COVID-19 RCT abstracts; and a dataset of 150 Alzheimer's disease (AD) RCT abstracts. The end-to-end evaluation reveals that our proposed approach achieved an overall micro F1 score of 0.833 on the EBM-NLPmod dataset, 0.928 on the COVID-19 dataset, and 0.899 on the AD dataset when measured at the token level, and an overall micro F1 score of 0.712 on the EBM-NLPmod dataset, 0.850 on the COVID-19 dataset, and 0.805 on the AD dataset when measured at the entity level. AVAILABILITY: Our code and datasets are publicly available at https://github.com/BIDS-Xu-Lab/section_specific_annotation_of_PICO. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
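The gap between the token-level and entity-level scores reported above follows from how the two measures count partial matches. The sketch below is a minimal illustration under assumed conventions (exact span match for entities, end-exclusive token offsets); the spans are invented, and the authors' evaluation script may differ.

```python
def micro_f1(gold, predicted):
    """Micro F1 over two sets of hashable items."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def to_tokens(entities):
    """Expand (start, end, label) spans, end exclusive, into (index, label) pairs."""
    return {(i, label)
            for start, end, label in entities
            for i in range(start, end)}


# Invented example: the predicted Participants span is one token too short.
gold = {(0, 3, "Participants"), (5, 7, "Intervention"), (9, 10, "Outcome")}
pred = {(0, 2, "Participants"), (5, 7, "Intervention"), (9, 10, "Outcome")}

entity_f1 = micro_f1(gold, pred)                       # exact span match only
token_f1 = micro_f1(to_tokens(gold), to_tokens(pred))  # partial credit per token
print(round(entity_f1, 3), round(token_f1, 3))
```

A single boundary error costs a whole entity under exact matching but only one token under token-level scoring, so token-level F1 is typically the higher of the two.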

4.

Serial KinderMiner (SKiM) discovers and annotates biomedical knowledge using co-occurrence and transformer models.

Millikin, Robert J; Raja, Kalpana; Steill, John; Lock, Cannon; Tu, Xuancheng; Ross, Ian; Tsoi, Lam C; Kuusisto, Finn; Ni, Zijian; Livny, Miron; Bockelman, Brian; Thomson, James; Stewart, Ron.

BMC Bioinformatics ; 24(1): 412, 2023 Nov 01.

Article in English | MEDLINE | ID: mdl-37915001

ABSTRACT

BACKGROUND: The PubMed archive contains more than 34 million articles; consequently, it is becoming increasingly difficult for a biomedical researcher to keep up-to-date with different knowledge domains. Computationally efficient and interpretable tools are needed to help researchers find and understand associations between biomedical concepts. The goal of literature-based discovery (LBD) is to connect concepts in isolated literature domains that would normally go undiscovered. This usually takes the form of an A-B-C relationship, where A and C terms are linked through a B term intermediate. Here we describe Serial KinderMiner (SKiM), an LBD algorithm for finding statistically significant links between an A term and one or more C terms through some B term intermediate(s). The development of SKiM is motivated by the observation that there are only a few LBD tools that provide a functional web interface, and that the available tools are limited in one or more of the following ways: (1) they identify a relationship but not the type of relationship, (2) they do not allow the user to provide their own lists of B or C terms, hindering flexibility, (3) they do not allow for querying thousands of C terms (which is crucial if, for instance, the user wants to query connections between a disease and the thousands of available drugs), or (4) they are specific for a particular biomedical domain (such as cancer). We provide an open-source tool and web interface that improves on all of these issues. RESULTS: We demonstrate SKiM's ability to discover useful A-B-C linkages in three control experiments: classic LBD discoveries, drug repurposing, and finding associations related to cancer. Furthermore, we supplement SKiM with a knowledge graph built with transformer machine-learning models to aid in interpreting the relationships between terms found by SKiM. Finally, we provide a simple and intuitive open-source web interface ( https://skim.morgridge.org ) with comprehensive lists of drugs, diseases, phenotypes, and symptoms so that anyone can easily perform SKiM searches. CONCLUSIONS: SKiM is a simple algorithm that can perform LBD searches to discover relationships between arbitrary user-defined concepts. SKiM is generalized for any domain, can perform searches with many thousands of C term concepts, and moves beyond the simple identification of an existence of a relationship; many relationships are given relationship type labels from our knowledge graph.
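The A-B-C search described in this abstract can be sketched on a toy corpus. The abstract does not name the significance test, so the one-sided Fisher's exact test below is an assumption made for illustration, as are the function names, the 0.05 threshold, and the invented documents; this is not the SKiM implementation.

```python
from math import comb


def fisher_right_tail(both, x_only, y_only, neither):
    """One-sided Fisher's exact test: P(co-occurrence count >= observed)."""
    n = both + x_only + y_only + neither
    x_total = both + x_only
    y_total = both + y_only
    # Hypergeometric tail: draw y_total documents out of n, x_total of which contain x.
    tail = sum(comb(x_total, k) * comb(n - x_total, y_total - k)
               for k in range(both, min(x_total, y_total) + 1))
    return tail / comb(n, y_total)


def count_table(docs, x, y):
    """2x2 co-occurrence counts for terms x and y over a list of term sets."""
    both = sum(1 for d in docs if x in d and y in d)
    x_only = sum(1 for d in docs if x in d and y not in d)
    y_only = sum(1 for d in docs if y in d and x not in d)
    return both, x_only, y_only, len(docs) - both - x_only - y_only


def abc_links(docs, a_term, b_terms, c_terms, alpha=0.05):
    """Return (B, C) pairs where both A-B and B-C co-occur significantly."""
    links = []
    for b in b_terms:
        if fisher_right_tail(*count_table(docs, a_term, b)) >= alpha:
            continue  # A and B are not linked; skip this intermediate
        for c in c_terms:
            if fisher_right_tail(*count_table(docs, b, c)) < alpha:
                links.append((b, c))
    return links


# Invented corpus: each document is the set of terms it mentions.
docs = ([{"raynaud", "viscosity"}] * 6 + [{"viscosity", "fish oil"}] * 6
        + [{"aspirin"}] * 4 + [{"other"}] * 16)
print(abc_links(docs, "raynaud", ["viscosity"], ["fish oil", "aspirin"]))
```

In this toy corpus the A term and the C term never appear in the same document, yet they are connected through the B term, which is the kind of indirect link LBD aims to surface.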

Subject(s)

Algorithms , Neoplasms , Humans , PubMed , Knowledge , Knowledge Discovery

5.

Representing and utilizing clinical textual data for real world studies: An OHDSI approach.

Keloth, Vipina K; Banda, Juan M; Gurley, Michael; Heider, Paul M; Kennedy, Georgina; Liu, Hongfang; Liu, Feifan; Miller, Timothy; Natarajan, Karthik; V Patterson, Olga; Peng, Yifan; Raja, Kalpana; Reeves, Ruth M; Rouhizadeh, Masoud; Shi, Jianlin; Wang, Xiaoyan; Wang, Yanshan; Wei, Wei-Qi; Williams, Andrew E; Zhang, Rui; Belenkaya, Rimma; Reich, Christian; Blacketer, Clair; Ryan, Patrick; Hripcsak, George; Elhadad, Noémie; Xu, Hua.

J Biomed Inform ; 142: 104343, 2023 06.

Article in English | MEDLINE | ID: mdl-36935011

ABSTRACT

Clinical documentation in electronic health records contains crucial narratives and details about patients and their care. Natural language processing (NLP) can unlock the information conveyed in clinical notes and reports, and thus plays a critical role in real-world studies. The NLP Working Group at the Observational Health Data Sciences and Informatics (OHDSI) consortium was established to develop methods and tools to promote the use of textual data and NLP in real-world observational studies. In this paper, we describe a framework for representing and utilizing textual data in real-world evidence generation, including representations of information from clinical text in the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM), the workflow and tools that were developed to extract, transform and load (ETL) data from clinical notes into tables in OMOP CDM, as well as current applications and specific use cases of the proposed OHDSI NLP solution at large consortia and individual institutions with English textual data. Challenges faced and lessons learned during the process are also discussed to provide valuable insights for researchers who are planning to implement NLP solutions in real-world studies.

Subject(s)

Data Science , Medical Informatics , Humans , Electronic Health Records , Natural Language Processing , Narration

6.

A haplotype-resolved genome assembly of the Nile rat facilitates exploration of the genetic basis of diabetes.

Toh, Huishi; Yang, Chentao; Formenti, Giulio; Raja, Kalpana; Yan, Lily; Tracey, Alan; Chow, William; Howe, Kerstin; Bergeron, Lucie A; Zhang, Guojie; Haase, Bettina; Mountcastle, Jacquelyn; Fedrigo, Olivier; Fogg, John; Kirilenko, Bogdan; Munegowda, Chetan; Hiller, Michael; Jain, Aashish; Kihara, Daisuke; Rhie, Arang; Phillippy, Adam M; Swanson, Scott A; Jiang, Peng; Clegg, Dennis O; Jarvis, Erich D; Thomson, James A; Stewart, Ron; Chaisson, Mark J P; Bukhman, Yury V.

BMC Biol ; 20(1): 245, 2022 11 08.

Article in English | MEDLINE | ID: mdl-36344967

ABSTRACT

BACKGROUND: The Nile rat (Arvicanthis niloticus) is an important animal model because of its robust diurnal rhythm, a cone-rich retina, and a propensity to develop diet-induced diabetes without chemical or genetic modifications. A closer similarity to humans in these aspects, compared to the widely used Mus musculus and Rattus norvegicus models, holds the promise of better translation of research findings to the clinic. RESULTS: We report a 2.5 Gb, chromosome-level reference genome assembly with fully resolved parental haplotypes, generated with the Vertebrate Genomes Project (VGP). The assembly is highly contiguous, with contig N50 of 11.1 Mb, scaffold N50 of 83 Mb, and 95.2% of the sequence assigned to chromosomes. We used a novel workflow to identify 3613 segmental duplications and quantify duplicated genes. Comparative analyses revealed unique genomic features of the Nile rat, including some that affect genes associated with type 2 diabetes and metabolic dysfunctions. We discuss 14 genes that are heterozygous in the Nile rat or highly diverged from the house mouse. CONCLUSIONS: Our findings reflect the exceptional level of genomic resolution present in this assembly, which will greatly expand the potential of the Nile rat as a model organism.
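The contig and scaffold N50 values quoted above use the standard contiguity statistic: the length L such that sequences of length L or longer contain at least half of all assembled bases. A minimal sketch with invented lengths (not data from the Nile rat assembly):

```python
def n50(lengths):
    """Length L such that sequences of length >= L cover at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")


# Invented contig lengths in Mb; the total is 290, so half is 145.
contigs = [80, 70, 50, 40, 30, 20]
print(n50(contigs))  # 80 + 70 = 150 >= 145, so N50 is 70
```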

Subject(s)

Diabetes Mellitus, Type 2 , Humans , Animals , Haplotypes , Diabetes Mellitus, Type 2/genetics , Murinae , Genome , Genomics

7.

Hypersensitive IFN Responses in Lupus Keratinocytes Reveal Key Mechanistic Determinants in Cutaneous Lupus.

Tsoi, Lam C; Hile, Grace A; Berthier, Celine C; Sarkar, Mrinal K; Reed, Tamra J; Liu, Jianhua; Uppala, Ranjitha; Patrick, Matthew; Raja, Kalpana; Xing, Xianying; Xing, Enze; He, Kevin; Gudjonsson, Johann E; Kahlenberg, J Michelle.

J Immunol ; 202(7): 2121-2130, 2019 04 01.

Article in English | MEDLINE | ID: mdl-30745462

ABSTRACT

Systemic lupus erythematosus (SLE) is a complex autoimmune disease in which 70% of patients experience disfiguring skin inflammation (grouped under the rubric of cutaneous lupus erythematosus [CLE]). There are limited treatment options for SLE and no Food and Drug Administration-approved therapies for CLE. Studies have revealed that IFNs are important mediators for SLE and CLE, but the mechanisms by which IFNs lead to disease are still poorly understood. We aimed to investigate how IFN responses in SLE keratinocytes contribute to development of CLE. A cohort of 72 RNA sequencing samples from 14 individuals (seven SLE and seven healthy controls) were analyzed to study the transcriptomic effects of type I and type II IFNs on SLE versus control keratinocytes. In-depth analysis of the IFN responses was conducted. Bioinformatics and functional assays were conducted to provide implications for the change of IFN response. A significant hypersensitive response to IFNs was identified in lupus keratinocytes, including genes (IFIH1, STAT1, and IRF7) encompassed in SLE susceptibility loci. Binding sites for the transcription factor PITX1 were enriched in genes that exhibit IFN-sensitive responses. PITX1 expression was increased in CLE lesions based on immunohistochemistry, and by using small interfering RNA knockdown, we illustrated that PITX1 was required for upregulation of IFN-regulated genes in vitro. SLE patients exhibit increased IFN signatures in their skin secondary to increased production and a robust, skewed IFN response that is regulated by PITX1. Targeting these exaggerated pathways may prove to be beneficial to prevent and treat hyperinflammatory responses in SLE skin.

Subject(s)

Gene Expression Regulation/immunology , Interferons/immunology , Keratinocytes/immunology , Lupus Erythematosus, Cutaneous/immunology , Paired Box Transcription Factors/immunology , Adult , Female , Humans , Male

8.

Classification of clinically useful sentences in clinical evidence resources.

Morid, Mohammad Amin; Fiszman, Marcelo; Raja, Kalpana; Jonnalagadda, Siddhartha R; Del Fiol, Guilherme.

J Biomed Inform ; 60: 14-22, 2016 Apr.

Article in English | MEDLINE | ID: mdl-26774763

ABSTRACT

Most patient care questions raised by clinicians can be answered by online clinical knowledge resources. However, important barriers still challenge the use of these resources at the point of care. OBJECTIVE: To design and assess a method for extracting clinically useful sentences from synthesized online clinical resources that represent the most clinically useful information for directly answering clinicians' information needs. MATERIALS AND METHODS: We developed a Kernel-based Bayesian Network classification model based on different domain-specific feature types extracted from sentences in a gold standard composed of 18 UpToDate documents. These features included UMLS concepts and their semantic groups, semantic predications extracted by SemRep, patient population identified by a pattern-based natural language processing (NLP) algorithm, and cue words extracted by a feature selection technique. Algorithm performance was measured in terms of precision, recall, and F-measure. RESULTS: The feature-rich approach yielded an F-measure of 74% versus 37% for a feature co-occurrence method (p<0.001). Excluding predication, population, semantic concept or text-based features reduced the F-measure to 62%, 66%, 58% and 69% respectively (p<0.01). The classifier applied to Medline sentences reached an F-measure of 73%, which is equivalent to the performance of the classifier on UpToDate sentences (p=0.62). CONCLUSIONS: The feature-rich approach significantly outperformed general baseline methods. This approach significantly outperformed classifiers based on a single type of feature. Different types of semantic features provided a unique contribution to overall classification performance. The classifier's model and features used for UpToDate generalized well to Medline abstracts.

Subject(s)

Decision Support Systems, Clinical , Information Storage and Retrieval/methods , Supervised Machine Learning , Algorithms , Bayes Theorem , Humans , Language , MEDLINE , Natural Language Processing , Semantics , Terminology as Topic , Unified Medical Language System

9.

Agile text mining for the 2014 i2b2/UTHealth Cardiac risk factors challenge.

Cormack, James; Nath, Chinmoy; Milward, David; Raja, Kalpana; Jonnalagadda, Siddhartha R.

J Biomed Inform ; 58 Suppl: S120-S127, 2015 Dec.

Article in English | MEDLINE | ID: mdl-26209007

ABSTRACT

This paper describes the use of an agile text mining platform (Linguamatics' Interactive Information Extraction Platform, I2E) to extract document-level cardiac risk factors in patient records as defined in the i2b2/UTHealth 2014 challenge. The approach uses a data-driven rule-based methodology with the addition of a simple supervised classifier. We demonstrate that agile text mining allows for rapid optimization of extraction strategies, while post-processing can leverage annotation guidelines, corpus statistics and logic inferred from the gold standard data. We also show how data imbalance in a training set affects performance. Evaluation of this approach on the test data gave an F-Score of 91.7%, one percent behind the top performing system.

Subject(s)

Cardiovascular Diseases/epidemiology , Data Mining/methods , Diabetes Complications/epidemiology , Electronic Health Records/organization & administration , Narration , Natural Language Processing , Aged , Cardiovascular Diseases/diagnosis , Cohort Studies , Comorbidity , Computer Security , Confidentiality , Diabetes Complications/diagnosis , Female , Humans , Incidence , Longitudinal Studies , Male , Middle Aged , Pattern Recognition, Automated/methods , Risk Assessment/methods , United Kingdom/epidemiology , Vocabulary, Controlled

10.

ProNormz--an integrated approach for human proteins and protein kinases normalization.

Subramani, Suresh; Raja, Kalpana; Natarajan, Jeyakumar.

J Biomed Inform ; 47: 131-8, 2014 Feb.

Article in English | MEDLINE | ID: mdl-24144801

ABSTRACT

Recognizing and normalizing protein name mentions in biomedical literature is a challenging task and is important for text mining applications such as protein-protein interactions, pathway reconstruction and many more. In this paper, we present ProNormz, an integrated approach for human proteins (HPs) tagging and normalization. In Homo sapiens, a large number of biological processes are regulated through post-translational phosphorylation by a large gene family called protein kinases. Recognition and normalization of human protein kinases (HPKs) is considered to be important for the extraction of the underlying information on their regulatory mechanisms from biomedical literature. ProNormz distinguishes HPKs from other HPs besides tagging and normalization. To our knowledge, ProNormz is the first normalization system available to distinguish HPKs from other HPs in addition to performing the gene normalization task. ProNormz incorporates a specialized synonyms dictionary for human proteins and protein kinases, a set of 15 string matching rules and a disambiguation module to achieve the normalization. Experimental results on benchmark BioCreative II training and test datasets show that our integrated approach achieves fairly good performance and outperforms more sophisticated semantic similarity and disambiguation systems presented in the BioCreative II GN task. As a freely available web tool, ProNormz is useful to developers as an extensible gene normalization implementation, to researchers as a standard for comparing their innovative techniques, and to biologists for normalization and categorization of HP and HPK mentions in biomedical literature. URL: http://www.biominingbu.org/pronormz.

Subject(s)

Computational Biology/methods , Protein Kinases/chemistry , Proteins/chemistry , Semantics , Algorithms , Data Mining , Databases, Protein , Humans , Internet , Phosphorylation , Protein Processing, Post-Translational , Software

11.

Transcriptional determinants of individualized inflammatory responses at anatomically separate sites.

Tsoi, Lam C; Yang, Jingjing; Liang, Yun; Sarkar, Mrinal K; Xing, Xianying; Beamer, Maria A; Aphale, Abhishek; Raja, Kalpana; Kozlow, Jeffrey H; Getsios, Spiro; Voorhees, John J; Kahlenberg, J Michelle; Elder, James T; Gudjonsson, Johann E.

J Allergy Clin Immunol ; 141(2): 805-808, 2018 02.

Article in English | MEDLINE | ID: mdl-29031600

Subject(s)

Gene Expression Profiling , Gene Expression Regulation/immunology , Psoriasis/immunology , Transcription, Genetic/immunology , Female , Humans , Inflammation/immunology , Inflammation/pathology , Male , Organ Specificity , Psoriasis/pathology

12.

Comorbidity-Guided Text Mining and Omics Pipeline to Identify Candidate Genes and Drugs for Alzheimer's Disease.

Oviya, Iyappan Ramalakshmi; Sankar, Divya; Manoharan, Sharanya; Prabahar, Archana; Raja, Kalpana.

Genes (Basel) ; 15(5)2024 05 11.

Article in English | MEDLINE | ID: mdl-38790243

ABSTRACT

Alzheimer's disease (AD), a multifactorial neurodegenerative disorder, is prevalent among the elderly population. It is a complex trait with mutations in multiple genes. Although the US Food and Drug Administration (FDA) has approved a few drugs for AD treatment, a definitive cure remains elusive. Research efforts persist in seeking improved treatment options for AD. Here, a hybrid pipeline is proposed to apply text mining to identify comorbid diseases for AD and an omics approach to identify the common genes between AD and five comorbid diseases: dementia, type 2 diabetes, hypertension, Parkinson's disease, and Down syndrome. We further identified the pathways and drugs for common genes. The rationale behind this approach is rooted in the fact that elderly individuals often receive multiple medications for various comorbid diseases, and an insight into the genes that are common to comorbid diseases may enhance treatment strategies. We identified seven common genes (PSEN1, PSEN2, MAPT, APP, APOE, NOTCH, and HFE) for AD and five comorbid diseases. We investigated the drugs interacting with these common genes using LINCS gene-drug perturbation. Our analysis unveiled several promising candidates, including MG-132 and Masitinib, which exhibit potential efficacy for both AD and its comorbid diseases. The pipeline can be extended to other diseases.

Subject(s)

Alzheimer Disease , Comorbidity , Data Mining , Alzheimer Disease/genetics , Alzheimer Disease/drug therapy , Humans , Data Mining/methods , Parkinson Disease/genetics , Parkinson Disease/drug therapy , Diabetes Mellitus, Type 2/genetics , Diabetes Mellitus, Type 2/drug therapy , Down Syndrome/genetics , Down Syndrome/drug therapy , Hypertension/genetics , Hypertension/drug therapy

13.

A Study of Biomedical Relation Extraction Using GPT Models.

Zhang, Jeffrey; Wibert, Maxwell; Zhou, Huixue; Peng, Xueqing; Chen, Qingyu; Keloth, Vipina K; Hu, Yan; Zhang, Rui; Xu, Hua; Raja, Kalpana.

AMIA Jt Summits Transl Sci Proc ; 2024: 391-400, 2024.

Article in English | MEDLINE | ID: mdl-38827097

ABSTRACT

Relation Extraction (RE) is a natural language processing (NLP) task for extracting semantic relations between biomedical entities. Recent developments in pre-trained large language models (LLMs) have motivated NLP researchers to use them for various NLP tasks. We investigated GPT-3.5-turbo and GPT-4 on extracting the relations from three standard datasets, EU-ADR, Gene Associations Database (GAD), and ChemProt. Unlike the existing approaches using datasets with masked entities, we used three versions of each dataset for our experiment: a version with masked entities, a second version with the original entities (unmasked), and a third version with abbreviations replaced with the original terms. We developed the prompts for various versions and used the chat completion model from the GPT API. Our approach achieved an F1-score of 0.498 to 0.809 for GPT-3.5-turbo, and a highest F1-score of 0.84 for GPT-4. For certain experiments, the performance of GPT, BioBERT, and PubMedBERT is almost the same.
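The difference between the masked and unmasked dataset versions can be illustrated with a short sketch. The placeholders `@GENE$` and `@DISEASE$`, the prompt wording, and the example sentence are assumptions chosen for illustration; the abstract does not give the prompts or the masking tokens used in the study, and no API call is made here.

```python
def mask_entities(sentence, spans):
    """Replace each (start, end, placeholder) character span; end is exclusive."""
    masked = sentence
    # Work right to left so earlier offsets stay valid after each replacement.
    for start, end, placeholder in sorted(spans, reverse=True):
        masked = masked[:start] + placeholder + masked[end:]
    return masked


def build_prompt(sentence, head, tail):
    """Hypothetical yes/no prompt for a binary relation; wording is illustrative."""
    return (f"Does the sentence state an association between {head} and {tail}? "
            "Answer yes or no.\n"
            f"Sentence: {sentence}")


sentence = "BRCA1 mutations are associated with breast cancer."
gene, disease = "BRCA1", "breast cancer"
spans = [
    (sentence.index(gene), sentence.index(gene) + len(gene), "@GENE$"),
    (sentence.index(disease), sentence.index(disease) + len(disease), "@DISEASE$"),
]

masked = mask_entities(sentence, spans)
masked_prompt = build_prompt(masked, "@GENE$", "@DISEASE$")  # masked version
unmasked_prompt = build_prompt(sentence, gene, disease)      # original entities
print(masked_prompt)
```

The unmasked prompt lets a model draw on what it already knows about the named entities, whereas the masked prompt forces it to rely on sentence context alone, which is one plausible reason to compare the two versions.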

14.

Serial KinderMiner (SKiM) Discovers and Annotates Biomedical Knowledge Using Co-Occurrence and Transformer Models.

Millikin, Robert J; Raja, Kalpana; Steill, John; Lock, Cannon; Tu, Xuancheng; Ross, Ian; Tsoi, Lam C; Kuusisto, Finn; Ni, Zijian; Livny, Miron; Bockelman, Brian; Thomson, James; Stewart, Ron.

bioRxiv ; 2023 Jun 01.

Article in English | MEDLINE | ID: mdl-37397987

ABSTRACT

Background: The PubMed database contains more than 34 million articles; consequently, it is becoming increasingly difficult for a biomedical researcher to keep up-to-date with different knowledge domains. Computationally efficient and interpretable tools are needed to help researchers find and understand associations between biomedical concepts. The goal of literature-based discovery (LBD) is to connect concepts in isolated literature domains that would normally go undiscovered. This usually takes the form of an A-B-C relationship, where A and C terms are linked through a B term intermediate. Here we describe Serial KinderMiner (SKiM), an LBD algorithm for finding statistically significant links between an A term and one or more C terms through some B term intermediate(s). The development of SKiM is motivated by the observation that there are only a few LBD tools that provide a functional web interface, and that the available tools are limited in one or more of the following ways: 1) they identify a relationship but not the type of relationship, 2) they do not allow the user to provide their own lists of B or C terms, hindering flexibility, 3) they do not allow for querying thousands of C terms (which is crucial if, for instance, the user wants to query connections between a disease and the thousands of available drugs), or 4) they are specific for a particular biomedical domain (such as cancer). We provide an open-source tool and web interface that improves on all of these issues. Results: We demonstrate SKiM's ability to discover useful A-B-C linkages in three control experiments: classic LBD discoveries, drug repurposing, and finding associations related to cancer. Furthermore, we supplement SKiM with a knowledge graph built with transformer machine-learning models to aid in interpreting the relationships between terms found by SKiM. Finally, we provide a simple and intuitive open-source web interface ( https://skim.morgridge.org ) with comprehensive lists of drugs, diseases, phenotypes, and symptoms so that anyone can easily perform SKiM searches. Conclusions: SKiM is a simple algorithm that can perform LBD searches to discover relationships between arbitrary user-defined concepts. SKiM is generalized for any domain, can perform searches with many thousands of C term concepts, and moves beyond the simple identification of an existence of a relationship; many relationships are given relationship type labels from our knowledge graph.

15.

Biomedical Literature Mining and Its Components.

Raja, Kalpana.

Methods Mol Biol ; 2496: 1-16, 2022.

Article in English | MEDLINE | ID: mdl-35713856

ABSTRACT

Published biomedical articles are the best source of knowledge for understanding the importance of biomedical entities such as diseases and drugs, and their role in different patient population groups. The volume of biomedical literature available and being published is increasing at an exponential rate with the use of large-scale experimental techniques. Manual extraction of such information is becoming extremely difficult because of the huge volume of biomedical literature available. Alternatively, text mining approaches receive much interest within biomedicine by providing automatic extraction of such information in a more structured format from unstructured biomedical text. Here, a text mining protocol to extract the patient population information, to identify the disease and drug mentions in PubMed titles and abstracts, and a simple information retrieval approach to retrieve a list of relevant documents for a user query are presented. The text mining protocol presented in this chapter is useful for retrieving information on drugs for patients with a specific disease. The protocol covers three major text mining tasks, namely, information retrieval, information extraction, and knowledge discovery.

Subject(s)

Data Mining , Publications , Data Mining/methods , Humans , PubMed

16.

Integrated Approaches to Identify miRNA Biomarkers Associated with Cognitive Dysfunction in Multiple Sclerosis Using Text Mining, Gene Expression, Pathways, and GWAS.

Prabahar, Archana; Raja, Kalpana.

Diagnostics (Basel) ; 12(8)2022 Aug 08.

Article in English | MEDLINE | ID: mdl-36010264

ABSTRACT

Multiple sclerosis (MS), a chronic autoimmune disorder, affects the central nervous system of many young adults. More than half of MS patients develop cognition problems. Although several genomic and transcriptomic studies are currently reported in MS cognitive impairment, a comprehensive repository dealing with all the experimental data is still underdeveloped. In this study, we combined text mining, gene regulation, pathway analysis, and genome-wide association studies (GWAS) to identify miRNA biomarkers to explore the cognitive dysfunction in MS, and to understand the genomic etiology of the disease. We first identified the dysregulated miRNAs associated with MS and cognitive dysfunction using PubTator (text mining), HMDD (experimental associations), miR2Disease, and PhenomiR database (differentially expressed miRNAs). Our results suggest that miRNAs such as hsa-mir-148b-3p, hsa-mir-7b-5p, and hsa-mir-7a-5p are commonly associated with MS and cognitive dysfunction. Next, we retrieved GWAS signals from GWAS Catalog, and performed an enrichment analysis of association signals in genes/miRNAs and their association networks. Then, we identified susceptible genetic loci, rs17119 (chromosome 6; p = 1 × 10⁻¹⁰), rs1843938 (chromosome 7; p = 1 × 10⁻¹⁰), and rs11637611 (chromosome 15; p = 1.00 × 10⁻¹⁵), associated with significant genetic risk. Lastly, we conducted a pathway analysis for the susceptible genetic variants and identified novel risk pathways. The ECM receptor signaling pathway (p = 3.98 × 10⁻⁸) and PI3K/Akt signaling pathway (p = 5.98 × 10⁻⁵) were found to be associated with differentially expressed miRNA biomarkers.

17.

A Simple Computational Approach to Identify Potential Drugs for Multiple Sclerosis and Cognitive Disorders from Expert Curated Resources.

Raja, Kalpana; Prabahar, Archana; Arputhanatham, Shyam Sundar.

Methods Mol Biol ; 2496: 111-121, 2022.

Article in English | MEDLINE | ID: mdl-35713861

ABSTRACT

Multiple sclerosis, a disease of the central nervous system, leads to potential disability. In the USA, one million cases were diagnosed with multiple sclerosis in 2019. Multiple sclerosis is identified as one of the diseases causing global burden. Cognitive disorder is highly prevalent, affecting 43-70% of multiple sclerosis patients. However, treating cognitive disorder in multiple sclerosis patients is mostly ignored, and this leads to several complications. We utilized various expert curated resources to identify potential drugs for multiple sclerosis and cognitive disorder, with specific focus on identifying drugs that are capable of treating both conditions. We used simple text mining techniques to compile two databases, a disease-drug association database and a gene-drug interaction database, from various existing standard resources. Our study suggests four drugs, Baclofen, Levodopa, Minocycline, and Vitamin B12, for treating both multiple sclerosis and cognitive disorder. In addition, our approach suggests six drugs for multiple sclerosis and 10 drugs for cognitive disorder. We obtained pharmacologist opinion on the drugs suggested for each condition and provided literature evidence for our claim. Here, we present our computational approach as a protocol such that it can be applied to other comorbid diseases that have not gained much attention so far.

Subject(s)

Cognition Disorders , Multiple Sclerosis , Cognition , Cognition Disorders/etiology , Data Mining/methods , Databases, Factual , Humans , Multiple Sclerosis/complications , Multiple Sclerosis/drug therapy

18.

A Text Mining Protocol for Predicting Drug-Drug Interaction and Adverse Drug Reactions from PubMed Articles.

Shukkoor, Mohamed Saleem Abdul; Raja, Kalpana; Baharuldin, Mohamad Taufik Hidayat.

Methods Mol Biol ; 2496: 237-258, 2022.

Article in English | MEDLINE | ID: mdl-35713868

ABSTRACT

Drug-drug interactions (DDIs) and adverse drug reactions (ADRs) occur during the pharmacotherapy of multiple comorbidities and in susceptible individuals. DDIs and ADRs limit the therapeutic outcomes in pharmacotherapy. DDIs and ADRs have a significant impact on patients' lives and health care costs. Hence, knowledge of DDIs and ADRs is required for providing better clinical outcomes to patients. Various approaches have been developed by the scientific community to document and report the occurrences of DDIs and ADRs through scientific publications. Due to the enormously increasing number of publications and the requirement of updated information on DDIs and ADRs, manual retrieval of data is time consuming and laborious. Various automated techniques have been developed to get information on DDIs and ADRs. One such technique is text mining of DDIs and ADRs from published biomedical literature in PubMed. Here, we present a recently developed text mining protocol for predicting DDIs and ADRs from PubMed abstracts.

Subject(s)

Drug-Related Side Effects and Adverse Reactions , Comorbidity , Data Mining/methods , Drug Interactions , Drug-Related Side Effects and Adverse Reactions/diagnosis , Humans , PubMed

19.

A Text Mining Protocol for Extracting Drug-Drug Interaction and Adverse Drug Reactions Specific to Patient Population, Pharmacokinetics, Pharmacodynamics, and Disease.

Shukkoor, Mohamed Saleem Abdul; Baharuldin, Mohamad Taufik Hidayat; Raja, Kalpana.

Methods Mol Biol ; 2496: 259-282, 2022.

Article in English | MEDLINE | ID: mdl-35713869

ABSTRACT

Drug-drug interactions (DDIs) and adverse drug reactions (ADRs) are experienced by many patients, especially the elderly population, due to their multiple comorbidities and polypharmacy. Databases such as PubMed contain hundreds of abstracts with DDI and ADR information. PubMed is being updated every day with thousands of abstracts. Therefore, manually retrieving the data and extracting the relevant information is a tedious task. Hence, automated text mining approaches are required to retrieve DDI and ADR information from PubMed. Recently we developed a hybrid approach for predicting DDI and ADR information from PubMed. There are many other existing approaches for retrieving DDI and ADR information from PubMed. However, none of the approaches are meant for retrieving DDI and ADR information specific to patient population, gender, pharmacokinetics, and pharmacodynamics. Here, we present a text mining protocol which is based on our recent work for retrieving DDI and ADR information specific to patient population, gender, pharmacokinetics, and pharmacodynamics from PubMed.

Subject(s)

Drug-Related Side Effects and Adverse Reactions , Aged , Data Mining/methods , Databases, Factual , Drug Interactions , Humans , PubMed

20.

Extracting Significant Comorbid Diseases from MeSH Index of PubMed.

Anand, Dheepa; Manoharan, Sharanya; Iyyappan, Oviya Ramalakshmi; Anand, Sadhanha; Raja, Kalpana.

Methods Mol Biol ; 2496: 283-299, 2022.

Article in English | MEDLINE | ID: mdl-35713870

ABSTRACT

Text mining is an important research area to be explored for understanding disease associations and gaining insight into disease comorbidities. The reason for comorbid occurrence in any patient may be genetic or molecular interference from any other processes. Comorbidity and multimorbidity may be technically different, yet they are inseparable in studies. Their associations overlap in nature and hence can be integrated for a more rational approach. The association rule generally used to determine comorbidity may also be helpful in novel knowledge prediction or may even serve as an important tool of assessment in surgical cases. Another approach of interest may be to utilize biological vocabulary resources like UMLS/MeSH across patient health information and analyze the interrelationship between different health conditions. The protocol presented here can be utilized for understanding disease associations and analyzing them at an extensive level.

Subject(s)

Abstracting and Indexing , Medical Subject Headings , Data Mining , Humans , Natural Language Processing , PubMed
