Search | Nursing VHL Search Portal

1.

The gender agency gap in fiction writing (1850 to 2010).

Stuhler, Oscar.

Proc Natl Acad Sci U S A ; 121(29): e2319514121, 2024 Jul 16.

Article in English | MEDLINE | ID: mdl-38976724

ABSTRACT

Works of fiction play a crucial role in the production of cultural stereotypes. Concerning gender, a widely held presumption is that many such works ascribe agency to men and passivity to women. However, large-scale diachronic analyses of this notion have been lacking. This paper provides an assessment of agency attributions in 87,531 fiction works written between 1850 and 2010. It introduces a syntax-based approach for extracting networks of character interactions. Agency is then formalized as a dyadic property: Does a character primarily serve as an agent acting upon the other character or as recipient acted upon by the other character? Findings indicate that female characters are more likely to be passive in cross-gender relationships than their male counterparts. This difference, the gender agency gap, has declined since the 19th century but persists into the 21st. Male authors are especially likely to attribute less agency to female characters. Moreover, certain kinds of actions, especially physical and villainous ones, have more pronounced gender disparities.

Subject(s)

Writing , Female , Male , Humans , History, 19th Century , History, 20th Century , History, 21st Century , Literature , Gender Identity
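
The dyadic view of agency in the abstract above can be illustrated with a small sketch. It assumes, purely for illustration, that extracted interactions are (agent, recipient) pairs and that a character's agency within a pair is the share of the pair's interactions in which that character is the agent. The function name `agent_share`, the character names, and the toy data are hypothetical and not taken from the paper.

```python
from collections import Counter

def agent_share(interactions, a, b):
    """Share of interactions between a and b in which a is the agent.

    `interactions` is a list of (agent, recipient) tuples. Returns None
    when the two characters never interact.
    """
    counts = Counter(interactions)
    a_on_b = counts[(a, b)]  # times a acts upon b
    b_on_a = counts[(b, a)]  # times b acts upon a
    total = a_on_b + b_on_a
    if total == 0:
        return None
    return a_on_b / total

# Toy data (hypothetical): Ben acts upon Anna three times, Anna acts upon Ben once.
interactions = [("Anna", "Ben"), ("Ben", "Anna"), ("Ben", "Anna"), ("Ben", "Anna")]
share = agent_share(interactions, "Anna", "Ben")
print(share)
```

A share below 0.5 would mark the character as primarily a recipient in that relationship; the study's actual measure may be defined differently.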

2.

GPT is an effective tool for multilingual psychological text analysis.

Rathje, Steve; Mirea, Dan-Mircea; Sucholutsky, Ilia; Marjieh, Raja; Robertson, Claire E; Van Bavel, Jay J.

Proc Natl Acad Sci U S A ; 121(34): e2308950121, 2024 Aug 20.

Article in English | MEDLINE | ID: mdl-39133853

ABSTRACT

The social and behavioral sciences have been increasingly using automated text analysis to measure psychological constructs in text. We explore whether GPT, the large-language model (LLM) underlying the AI chatbot ChatGPT, can be used as a tool for automated psychological text analysis in several languages. Across 15 datasets (n = 47,925 manually annotated tweets and news headlines), we tested whether different versions of GPT (3.5 Turbo, 4, and 4 Turbo) can accurately detect psychological constructs (sentiment, discrete emotions, offensiveness, and moral foundations) across 12 languages. We found that GPT (r = 0.59 to 0.77) performed much better than English-language dictionary analysis (r = 0.20 to 0.30) at detecting psychological constructs as judged by manual annotators. GPT performed nearly as well as, and sometimes better than, several top-performing fine-tuned machine learning models. Moreover, GPT's performance improved across successive versions of the model, particularly for lesser-spoken languages, and became less expensive. Overall, GPT may be superior to many existing methods of automated text analysis, since it achieves relatively high accuracy across many languages, requires no training data, and is easy to use with simple prompts (e.g., "is this text negative?") and little coding experience. We provide sample code and a video tutorial for analyzing text with the GPT application programming interface. We argue that GPT and other LLMs help democratize automated text analysis by making advanced natural language processing capabilities more accessible, and may help facilitate more cross-linguistic research with understudied languages.

Subject(s)

Multilingualism , Humans , Language , Machine Learning , Natural Language Processing , Emotions , Social Media

3.

Literature-based predictions of Mendelian disease therapies.

Deisseroth, Cole A; Lee, Won-Seok; Kim, Jiyoen; Jeong, Hyun-Hwan; Dhindsa, Ryan S; Wang, Julia; Zoghbi, Huda Y; Liu, Zhandong.

Am J Hum Genet ; 110(10): 1661-1672, 2023 Oct 05.

Article in English | MEDLINE | ID: mdl-37741276

ABSTRACT

In the effort to treat Mendelian disorders, correcting the underlying molecular imbalance may be more effective than symptomatic treatment. Identifying treatments that might accomplish this goal requires extensive and up-to-date knowledge of molecular pathways, including drug-gene and gene-gene relationships. To address this challenge, we present "parsing modifiers via article annotations" (PARMESAN), a computational tool that searches PubMed and PubMed Central for information to assemble these relationships into a central knowledge base. PARMESAN then predicts putatively novel drug-gene relationships, assigning an evidence-based score to each prediction. We compare PARMESAN's drug-gene predictions to all of the drug-gene relationships displayed by the Drug-Gene Interaction Database (DGIdb) and show that higher-scoring relationship predictions are more likely to match the directionality (up- versus down-regulation) indicated by this database. PARMESAN had more than 200,000 drug predictions scoring above 8 (as one example cutoff), for more than 3,700 genes. Among these predicted relationships, 210 were registered in DGIdb and 201 (96%) had matching directionality. This publicly available tool provides an automated way to prioritize drug screens to target the most-promising drugs to test, thereby saving time and resources in the development of therapeutics for genetic disorders.

Subject(s)

PubMed , Humans , Databases, Factual

4.

IDPpub: Illuminating the Dark Phosphoproteome Through PubMed Mining.

Savage, Sara R; Zhang, Yaoyun; Jaehnig, Eric J; Liao, Yuxing; Shi, Zhiao; Pham, Huy Anh; Xu, Hua; Zhang, Bing.

Mol Cell Proteomics ; 23(1): 100682, 2024 Jan.

Article in English | MEDLINE | ID: mdl-37993103

ABSTRACT

Global phosphoproteomics experiments quantify tens of thousands of phosphorylation sites. However, data interpretation is hampered by our limited knowledge on functions, biological contexts, or precipitating enzymes of the phosphosites. This study establishes a repository of phosphosites with associated evidence in biomedical abstracts, using deep learning-based natural language processing techniques. Our model for illuminating the dark phosphoproteome through PubMed mining (IDPpub) was generated by fine-tuning BioBERT, a deep learning tool for biomedical text mining. Trained using sentences containing protein substrates and phosphorylation site positions from 3000 abstracts, the IDPpub model was then used to extract phosphorylation sites from all MEDLINE abstracts. The extracted proteins were normalized to gene symbols using the National Center for Biotechnology Information gene query, and sites were mapped to human UniProt sequences using ProtMapper and mouse UniProt sequences by direct match. Precision and recall were calculated using 150 curated abstracts, and utility was assessed by analyzing the CPTAC (Clinical Proteomics Tumor Analysis Consortium) pan-cancer phosphoproteomics datasets and the PhosphoSitePlus database. Using 10-fold cross validation, pairs of correct substrates and phosphosite positions were extracted with an average precision of 0.93 and recall of 0.94. After entity normalization and site mapping to human reference sequences, an independent validation achieved a precision of 0.91 and recall of 0.77. The IDPpub repository contains 18,458 unique human phosphorylation sites with evidence sentences from 58,227 abstracts and 5918 mouse sites in 14,610 abstracts. This included evidence sentences for 1803 sites identified in CPTAC studies that are not covered by manually curated functional information in PhosphoSitePlus. Evaluation results demonstrate the potential of IDPpub as an effective biomedical text mining tool for collecting phosphosites. Moreover, the repository (http://idppub.ptmax.org), which can be automatically updated, can serve as a powerful complement to existing resources.

Subject(s)

Data Mining , Natural Language Processing , Humans , Data Mining/methods , Databases, Factual , PubMed
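
The precision and recall figures reported in the abstract above can be read against a small worked sketch. It assumes, for illustration only, that extractions and curated annotations are compared as sets of (substrate, site) pairs; the function name `precision_recall` and the toy pairs are hypothetical and are not drawn from the IDPpub repository or its evaluation code.

```python
def precision_recall(predicted, gold):
    """Precision and recall of predicted pairs against curated gold pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Precision: fraction of extracted pairs that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: fraction of curated pairs that were extracted.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Toy example (hypothetical): 4 curated pairs, 3 extracted pairs, 2 in common.
gold = {("AKT1", "S473"), ("AKT1", "T308"), ("TP53", "S15"), ("MAPK1", "T185")}
predicted = {("AKT1", "S473"), ("TP53", "S15"), ("TP53", "S20")}
precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Under this reading, the drop in recall from 0.94 to 0.77 after normalization and site mapping would mean that more curated pairs go unmatched once extracted names and positions must also map to reference sequences.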

5.

ChatGPT outperforms crowd workers for text-annotation tasks.

Gilardi, Fabrizio; Alizadeh, Meysam; Kubli, Maël.

Proc Natl Acad Sci U S A ; 120(30): e2305016120, 2023 Jul 25.

Article in English | MEDLINE | ID: mdl-37463210

ABSTRACT

Many NLP applications require manual text annotations for a variety of tasks, notably to train classifiers or evaluate the performance of unsupervised models. Depending on the size and degree of complexity, the tasks may be conducted by crowd workers on platforms such as MTurk as well as trained annotators, such as research assistants. Using four samples of tweets and news articles (n = 6,183), we show that ChatGPT outperforms crowd workers for several annotation tasks, including relevance, stance, topics, and frame detection. Across the four datasets, the zero-shot accuracy of ChatGPT exceeds that of crowd workers by about 25 percentage points on average, while ChatGPT's intercoder agreement exceeds that of both crowd workers and trained annotators for all tasks. Moreover, the per-annotation cost of ChatGPT is less than $0.003, about thirty times cheaper than MTurk. These results demonstrate the potential of large language models to drastically increase the efficiency of text classification.

6.

IK-DDI: a novel framework based on instance position embedding and key external text for DDI extraction.

Dou, Mingliang; Ding, Jiaqi; Chen, Genlang; Duan, Junwen; Guo, Fei; Tang, Jijun.

Brief Bioinform ; 24(3), 2023 May 19.

Article in English | MEDLINE | ID: mdl-36932655

ABSTRACT

Determining drug-drug interactions (DDIs) is an important part of pharmacovigilance and has a vital impact on public health. Compared with drug trials, obtaining DDI information from scientific articles is a faster, lower-cost, but still highly credible approach. However, current DDI text extraction methods consider the instances generated from articles to be independent and ignore the potential connections between different instances in the same article or sentence. Effective use of external text data could improve prediction accuracy, but existing methods cannot extract key information from external data accurately and reasonably, resulting in low utilization of external data. In this study, we propose a DDI extraction framework, instance position embedding and key external text for DDI (IK-DDI), which adopts instance position embedding and key external text to extract DDI information. The proposed framework integrates the article-level and sentence-level position information of the instances into the model to strengthen the connections between instances generated from the same article or sentence. Moreover, we introduce a comprehensive similarity-matching method that uses string and word sense similarity to improve the matching accuracy between the target drug and external text. Furthermore, the key sentence search method is used to obtain key information from external data. Therefore, IK-DDI can make full use of the connection between instances and the information contained in external text data to improve the efficiency of DDI extraction. Experimental results show that IK-DDI outperforms existing methods on both macro-averaged and micro-averaged metrics, which suggests our method provides a complete framework that can be used to extract relationships between biomedical entities and process external text data.

Subject(s)

Data Mining , Pharmacovigilance , Data Mining/methods , Drug Interactions , Benchmarking , Drug Delivery Systems

7.

Extracting structured information from unstructured histopathology reports using generative pre-trained transformer 4 (GPT-4).

Truhn, Daniel; Loeffler, Chiara Ml; Müller-Franzes, Gustav; Nebelung, Sven; Hewitt, Katherine J; Brandner, Sebastian; Bressem, Keno K; Foersch, Sebastian; Kather, Jakob Nikolas.

J Pathol ; 262(3): 310-319, 2024 Mar.

Article in English | MEDLINE | ID: mdl-38098169

ABSTRACT

Deep learning applied to whole-slide histopathology images (WSIs) has the potential to enhance precision oncology and alleviate the workload of experts. However, developing these models necessitates large amounts of data with ground truth labels, which can be both time-consuming and expensive to obtain. Pathology reports are typically unstructured or poorly structured texts, and efforts to implement structured reporting templates have been unsuccessful, as these efforts lead to perceived extra workload. In this study, we hypothesised that large language models (LLMs), such as the generative pre-trained transformer 4 (GPT-4), can extract structured data from unstructured plain language reports using a zero-shot approach without requiring any re-training. We tested this hypothesis by utilising GPT-4 to extract information from histopathological reports, focusing on two extensive sets of pathology reports for colorectal cancer and glioblastoma. We found a high concordance between LLM-generated structured data and human-generated structured data. Consequently, LLMs could potentially be employed routinely to extract ground truth data for machine learning from unstructured pathology reports in the future. © 2023 The Authors. The Journal of Pathology published by John Wiley & Sons Ltd on behalf of The Pathological Society of Great Britain and Ireland.

Subject(s)

Glioblastoma , Precision Medicine , Humans , Machine Learning , United Kingdom

8.

RDscan: Extracting RNA-disease relationship from the literature based on pre-training model.

Zhang, Yang; Yang, Yu; Ren, Liping; Ning, Lin; Zou, Quan; Luo, Nanchao; Zhang, Yinghui; Liu, Ruijun.

Methods ; 228: 48-54, 2024 Aug.

Article in English | MEDLINE | ID: mdl-38789016

ABSTRACT

With the rapid advancements in molecular biology and genomics, a multitude of connections between RNA and diseases has been unveiled, making the efficient and accurate extraction of RNA-disease (RD) relationships from extensive biomedical literature crucial for advancing research in this field. This study introduces RDscan, a novel text mining method developed based on the pre-training and fine-tuning strategy, aimed at automatically extracting RD-related information from a vast corpus of literature using pre-trained biomedical large language models (LLMs). Initially, we constructed a dedicated RD corpus by manually curating from literature, comprising 2,082 positive and 2,000 negative sentences, alongside an independent test dataset (comprising 500 positive and 500 negative sentences) for training and evaluating RDscan. Subsequently, by fine-tuning the Bioformer and BioBERT pre-trained models, RDscan demonstrated exceptional performance in text classification and named entity recognition (NER) tasks. In 5-fold cross-validation, RDscan significantly outperformed traditional machine learning methods (Support Vector Machine, Logistic Regression and Random Forest). In addition, we have developed an accessible webserver that assists users in extracting RD relationships from text. In summary, RDscan represents the first text mining tool specifically designed for RD relationship extraction, and is poised to emerge as an invaluable tool for researchers dedicated to exploring the intricate interactions between RNA and diseases. The RDscan webserver is freely available at https://cellknowledge.com.cn/RDscan/.

Subject(s)

Data Mining , RNA , Data Mining/methods , RNA/genetics , Humans , Machine Learning , Disease/genetics , Support Vector Machine , Software

9.

Detecting fake-review buyers using network structure: Direct evidence from Amazon.

He, Sherry; Hollenbeck, Brett; Overgoor, Gijs; Proserpio, Davide; Tosyali, Ali.

Proc Natl Acad Sci U S A ; 119(47): e2211932119, 2022 Nov 22.

Article in English | MEDLINE | ID: mdl-36378645

ABSTRACT

Online reviews significantly impact consumers' decision-making process and firms' economic outcomes and are widely seen as crucial to the success of online markets. Firms, therefore, have a strong incentive to manipulate ratings using fake reviews. This presents a problem that academic researchers have tried to solve for over two decades and on which platforms expend a large amount of resources. Nevertheless, the prevalence of fake reviews is arguably higher than ever. To combat this, we collect a dataset of reviews for thousands of Amazon products and develop a general and highly accurate method for detecting fake reviews. A unique difference between previous datasets and ours is that we directly observe which sellers buy fake reviews. Thus, while prior research has trained models using laboratory-generated reviews or proxies for fake reviews, we are able to train a model using actual fake reviews. We show that products that buy fake reviews are highly clustered in the product reviewer network. Therefore, features constructed from this network are highly predictive of which products buy fake reviews. We show that our network-based approach is also successful at detecting fake review buyers even without ground truth data, as unsupervised clustering methods can accurately identify fake review buyers by identifying clusters of products that are closely connected in the network. While text or metadata can be manipulated to evade detection, network-based features are more costly to manipulate because these features result directly from the inherent limitations of buying reviews from online review marketplaces, making our detection approach more robust to manipulation.

Subject(s)

Commerce , Text Messaging , Consumer Behavior , Motivation

10.

Biomedical semantic text summarizer.

Kirmani, Mahira; Kour, Gagandeep; Mohd, Mudasir; Sheikh, Nasrullah; Khan, Dawood Ashraf; Maqbool, Zahid; Wani, Mohsin Altaf; Wani, Abid Hussain.

BMC Bioinformatics ; 25(1): 152, 2024 Apr 16.

Article in English | MEDLINE | ID: mdl-38627652

ABSTRACT

BACKGROUND: Text summarization is a challenging problem in Natural Language Processing, which involves condensing the content of textual documents without losing their overall meaning and information content. In the domain of bio-medical research, summaries are critical for efficient data analysis and information retrieval. While several bio-medical text summarizers exist in the literature, they often miss out on an essential text aspect: text semantics. RESULTS: This paper proposes a novel extractive summarizer that preserves text semantics by utilizing bio-semantic models. We evaluate our approach using ROUGE on a standard dataset and compare it with three state-of-the-art summarizers. Our results show that our approach outperforms existing summarizers. CONCLUSION: The usage of semantics can improve summarizer performance and lead to better summaries. Our summarizer has the potential to aid in efficient data analysis and information retrieval in the field of biomedical research.

Subject(s)

Algorithms , Biomedical Research , Semantics , Information Storage and Retrieval , Natural Language Processing

11.

VAIV bio-discovery service using transformer model and retrieval augmented generation.

Kim, Seonho; Yoon, Juntae.

BMC Bioinformatics ; 25(1): 273, 2024 Aug 21.

Article in English | MEDLINE | ID: mdl-39169321

ABSTRACT

BACKGROUND: There has been a considerable advancement in AI technologies like LLM and machine learning to support biomedical knowledge discovery. MAIN BODY: We propose a novel biomedical neural search service called 'VAIV Bio-Discovery', which supports enhanced knowledge discovery and document search on unstructured text such as PubMed. It mainly handles information related to chemical compounds/drugs, genes/proteins, diseases, and their interactions (chemical compounds/drugs-proteins/genes including drug-target, drug-drug, and drug-disease). To provide comprehensive knowledge, the system offers four search options: basic search, entity search, interaction search, and natural language search. We employ T5slim_dec, which adapts the autoregressive generation task of the T5 (text-to-text transfer transformer) to the interaction extraction task by removing the self-attention layer in the decoder block. It also assists in interpreting research findings by summarizing the retrieved search results for a given natural language query with Retrieval Augmented Generation (RAG). The search engine is built with a hybrid method that combines neural search with the probabilistic search, BM25. CONCLUSION: As a result, our system can better understand the context, semantics and relationships between terms within the document, enhancing search accuracy. This research contributes to the rapidly evolving biomedical field by introducing a new service to access and discover relevant knowledge.

Subject(s)

Natural Language Processing , Data Mining/methods , Knowledge Discovery/methods , PubMed , Search Engine , Machine Learning , Information Storage and Retrieval/methods , Neural Networks, Computer
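
The hybrid of neural search and BM25 mentioned in the abstract above could be combined in several ways; the abstract does not say which. The sketch below is one common option, assumed here for illustration: score documents with Okapi BM25, min-max normalize the lexical and dense scores, and take a weighted sum. The function names, the weight `alpha`, the parameter defaults, and the toy documents are all hypothetical, and the dense scores are placeholder numbers rather than output from any real model.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query (Okapi BM25)."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

def min_max(values):
    """Rescale values to [0, 1]; all-equal inputs map to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def hybrid_scores(lexical, dense, alpha=0.5):
    """Weighted sum of normalized lexical and dense scores (an assumed scheme)."""
    return [alpha * l + (1 - alpha) * d
            for l, d in zip(min_max(lexical), min_max(dense))]

docs = [["aspirin", "inhibits", "cox"],
        ["gene", "expression", "profile"],
        ["aspirin", "aspirin", "dose"]]
lexical = bm25_scores(["aspirin"], docs)
dense = [1.0, 0.0, 0.5]  # placeholder similarity scores, not from a real model
combined = hybrid_scores(lexical, dense)
print(combined)
```

Rank fusion methods that combine result positions instead of scores are another frequent choice and avoid the need to normalize score scales.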

12.

Robustness evaluations of pathway activity inference methods on gene expression data.

Hui, Tay Xin; Kasim, Shahreen; Aziz, Izzatdin Abdul; Fudzee, Mohd Farhan Md; Haron, Nazleeni Samiha; Sutikno, Tole; Hassan, Rohayanti; Mahdin, Hairulnizam; Sen, Seah Choon.

BMC Bioinformatics ; 25(1): 23, 2024 Jan 12.

Article in English | MEDLINE | ID: mdl-38216898

ABSTRACT

BACKGROUND: With the exponential growth of high-throughput technologies, multiple pathway analysis methods have been proposed to estimate pathway activities from gene expression profiles. These pathway activity inference methods can be divided into two main categories: non-Topology-Based (non-TB) and Pathway Topology-Based (PTB) methods. Although some review and survey articles discussed the topic from different aspects, there is a lack of systematic assessment and comparisons on the robustness of these approaches. RESULTS: Thus, this study presents comprehensive robustness evaluations of seven widely used pathway activity inference methods using six cancer datasets based on two assessments. The first assessment seeks to investigate the robustness of pathway activity in pathway activity inference methods, while the second assessment aims to assess the robustness of risk-active pathways and genes predicted by these methods. The mean reproducibility power and total number of identified informative pathways and genes were evaluated. Based on the first assessment, the mean reproducibility power of pathway activity inference methods generally decreased as the number of pathway selections increased. Entropy-based Directed Random Walk (e-DRW) distinctly outperformed other methods in exhibiting the greatest reproducibility power across all cancer datasets. On the other hand, the second assessment shows that no methods provide satisfactory results across datasets. CONCLUSION: However, PTB methods generally appear to perform better in producing greater reproducibility power and identifying potential cancer markers compared to non-TB methods.

Subject(s)

Neoplasms , Humans , Reproducibility of Results , Neoplasms/genetics , Entropy , Gene Expression

13.

GPDminer: a tool for extracting named entities and analyzing relations in biological literature.

Park, Yeon-Ji; Yang, Geun-Je; Sohn, Chae-Bong; Park, Soo Jun.

BMC Bioinformatics ; 25(1): 101, 2024 Mar 06.

Article in English | MEDLINE | ID: mdl-38448845

ABSTRACT

PURPOSE: The expansion of research across various disciplines has led to a substantial increase in published papers and journals, highlighting the necessity for reliable text mining platforms for database construction and knowledge acquisition. This abstract introduces GPDMiner (Gene, Protein, and Disease Miner), a platform designed for the biomedical domain, addressing the challenges posed by the growing volume of academic papers. METHODS: GPDMiner is a text mining platform that utilizes advanced information retrieval techniques. It operates by searching PubMed for specific queries, extracting and analyzing information relevant to the biomedical field. This system is designed to discern and illustrate relationships between biomedical entities obtained from automated information extraction. RESULTS: The implementation of GPDMiner demonstrates its efficacy in navigating the extensive corpus of biomedical literature. It efficiently retrieves, extracts, and analyzes information, highlighting significant connections between genes, proteins, and diseases. The platform also allows users to save their analytical outcomes in various formats, including Excel and images. CONCLUSION: GPDMiner offers a notable additional functionality among the array of text mining tools available for the biomedical field. This tool presents an effective solution for researchers to navigate and extract relevant information from the vast unstructured texts found in biomedical literature, thereby providing distinctive capabilities that set it apart from existing methodologies. Its application is expected to greatly benefit researchers in this domain, enhancing their capacity for knowledge discovery and data management.

Subject(s)

Data Management , Data Mining , Databases, Factual , Knowledge Discovery , PubMed

14.

A randomized controlled trial of an app-based intervention on physical activity and glycemic control in people with type 2 diabetes.

Kim, Gyuri; Kim, Seohyun; Lee, You-Bin; Jin, Sang-Man; Hur, Kyu Yeon; Kim, Jae Hyeon.

BMC Med ; 22(1): 185, 2024 May 01.

Article in English | MEDLINE | ID: mdl-38693528

ABSTRACT

BACKGROUND: We investigated the effects of a physical activity encouragement intervention based on a smartphone personal health record (PHR) application (app) on step count increases, glycemic control, and body weight in patients with type 2 diabetes (T2D). METHODS: In this 12-week, single-center, randomized controlled study with a 12-week extension, patients with T2D who were overweight or obese were randomized in a 1:2 ratio to a group using a smartphone PHR app (control group) or a group using the app and receiving individualized motivational text messages (intervention group) for 12 weeks. During the extension period, the sending of the encouraging text messages to the intervention group was discontinued. The primary outcome was the change in daily step count after 12 weeks, analyzed by independent t-test. The secondary outcomes included HbA1c, fasting glucose, and body weight, analyzed by paired or independent t-test. RESULTS: Of 200 participants, 62 (93.9%) and 118 (88.1%) in the control and intervention group, respectively, completed the 12-week main study. The change in daily step count from baseline to week 12 was not significantly different between the two groups (P = 0.365). Among participants with baseline step counts < 7,500 steps per day, the change in the mean daily step count at week 12 in the intervention group (1,319 ± 3,020) was significantly larger than that in the control group (-139 ± 2,309) (P = 0.009). At week 12, HbA1c in the intervention group (6.7 ± 0.5%) was significantly lower than that in the control group (6.9 ± 0.6%, P = 0.041), and at week 24, changes in HbA1c from baseline were significant in both groups but comparable between groups. The decrease in HbA1c from baseline to week 12 in the intervention group was greater in participants with baseline HbA1c ≥ 7.5% (-0.81 ± 0.84%) compared with those with baseline HbA1c < 7.5% (-0.22 ± 0.39%) (P for interaction = 0.014). A significant reduction in body weight from baseline to week 24 was observed in both groups without significant between-group differences (P = 0.370). CONCLUSIONS: App-based individualized motivational intervention for physical activity did not increase daily step count from baseline to week 12, and the changes in HbA1c levels from baseline to week 12 were comparable. TRIAL REGISTRATION: ClinicalTrials.gov (NCT03407222).

Subject(s)

Diabetes Mellitus, Type 2 , Glycemic Control , Mobile Applications , Humans , Diabetes Mellitus, Type 2/therapy , Male , Middle Aged , Female , Glycemic Control/methods , Aged , Exercise/physiology , Adult , Blood Glucose/metabolism , Glycated Hemoglobin/metabolism , Glycated Hemoglobin/analysis , Body Weight/physiology , Smartphone , Text Messaging

15.

BioGPT: generative pre-trained transformer for biomedical text generation and mining.

Luo, Renqian; Sun, Liai; Xia, Yingce; Qin, Tao; Zhang, Sheng; Poon, Hoifung; Liu, Tie-Yan.

Brief Bioinform ; 23(6), 2022 Nov 19.

Article in English | MEDLINE | ID: mdl-36156661

ABSTRACT

Pre-trained language models have attracted increasing attention in the biomedical domain, inspired by their great success in the general natural language domain. Among the two main branches of pre-trained language models in the general language domain, i.e. BERT (and its variants) and GPT (and its variants), the first one has been extensively studied in the biomedical domain, such as BioBERT and PubMedBERT. While they have achieved great success on a variety of discriminative downstream biomedical tasks, the lack of generation ability constrains their application scope. In this paper, we propose BioGPT, a domain-specific generative Transformer language model pre-trained on large-scale biomedical literature. We evaluate BioGPT on six biomedical natural language processing tasks and demonstrate that our model outperforms previous models on most tasks. Especially, we get 44.98%, 38.42% and 40.76% F1 score on BC5CDR, KD-DTI and DDI end-to-end relation extraction tasks, respectively, and 78.2% accuracy on PubMedQA, creating a new record. Our case study on text generation further demonstrates the advantage of BioGPT on biomedical literature to generate fluent descriptions for biomedical terms.

Subject(s)

Data Mining , Natural Language Processing

16.

e-TSN: an interactive visual exploration platform for target-disease knowledge mapping from literature.

Feng, Ziyan; Shen, Zihao; Li, Honglin; Li, Shiliang.

Brief Bioinform ; 23(6), 2022 Nov 19.

Article in English | MEDLINE | ID: mdl-36347537

ABSTRACT

Target discovery and identification processes are driven by the increasing amount of biomedical data. The vast numbers of unstructured texts of biomedical publications provide a rich source of knowledge for drug target discovery research and demand the development of specific algorithms or tools to facilitate finding disease genes and proteins. Text mining is a method that can automatically mine helpful information related to drug target discovery from massive biomedical literature. However, there is a substantial lag between biomedical publications and the subsequent abstraction of information extracted by text mining to databases. The knowledge graph is introduced to integrate heterogeneous biomedical data. Here, we describe e-TSN (Target significance and novelty explorer, http://www.lilab-ecust.cn/etsn/), a knowledge visualization web server integrating the largest database of associations between targets and diseases from the full scientific literature by constructing significance and novelty scoring methods based on bibliometric statistics. The platform aims to visualize target-disease knowledge graphs to assist in prioritizing candidate disease-related proteins. Approved drugs and associated bioactivities for each target of interest are also provided to facilitate the visualization of drug-target relationships. In summary, e-TSN is a fast and customizable visualization resource for investigating and analyzing the intricate target-disease networks, which could help researchers understand the mechanisms underlying complex disease phenotypes and improve the drug discovery and development efficiency, especially for the unexpected outbreak of infectious disease pandemics like COVID-19.

Subject(s)

COVID-19 , Humans , Data Mining/methods , Publications , Knowledge , Algorithms , Proteins

17.

DSEATM: drug set enrichment analysis uncovering disease mechanisms by biomedical text mining.

Luo, Zhi-Hui; Zhu, Li-Da; Wang, Ya-Min; Hu Qian, Sheng; Li, Menglu; Zhang, Wen; Chen, Zhen-Xia.

Brief Bioinform ; 23(4), 2022 Jul 18.

Article in English | MEDLINE | ID: mdl-35679594

ABSTRACT

Disease pathogenesis is always a major topic in biomedical research. With the exponential growth of biomedical information, drug effect analysis for specific phenotypes has shown great promise in uncovering disease-associated pathways. However, this method has only been applied to a limited number of drugs. Here, we extracted the data of 4634 diseases, 3671 drugs, 112 809 disease-drug associations and 81 527 drug-gene associations by text mining of 29 168 919 publications. On this basis, we proposed a 'Drug Set Enrichment Analysis by Text Mining (DSEATM)' pipeline and applied it to 3250 diseases, which outperformed the state-of-the-art method. Furthermore, disease pathways enriched by DSEATM were similar to those obtained using the TCGA cancer RNA-seq differentially expressed genes. In addition, the drug number, which showed a remarkable positive correlation of 0.73 with the AUC, plays a determining role in the performance of DSEATM. Taken together, DSEATM is an auspicious and accurate disease research tool that offers fresh insights.

Subject(s)

Biomedical Research , Data Mining , Data Mining/methods , Phenotype

18.

Prediction of biomarker-disease associations based on graph attention network and text representation.

Yang, Minghao; Huang, Zhi-An; Gu, Wenhao; Han, Kun; Pan, Wenying; Yang, Xiao; Zhu, Zexuan.

Brief Bioinform ; 23(5)2022 09 20.

Article in English | MEDLINE | ID: mdl-35901464

ABSTRACT

MOTIVATION: The associations between biomarkers and human diseases play a key role in understanding complex pathology and developing targeted therapies. Wet lab experiments for biomarker discovery are costly, laborious and time-consuming. Computational prediction methods can be used to greatly expedite the identification of candidate biomarkers. RESULTS: Here, we present a novel computational model named GTGenie for predicting the biomarker-disease associations based on graph and text features. In GTGenie, a graph attention network is utilized to characterize diverse similarities of biomarkers and diseases from heterogeneous information resources. Meanwhile, a pretrained BERT-based model is applied to learn the text-based representation of biomarker-disease relations from biomedical literature. The captured graph and text features are then integrated in a bimodal fusion network to model the hybrid entity representation. Finally, inductive matrix completion is adopted to infer the missing entries for reconstructing the relation matrix, with which the unknown biomarker-disease associations are predicted. Experimental results on HMDD, HMDAD and LncRNADisease data sets showed that GTGenie achieves prediction performance competitive with other state-of-the-art methods. AVAILABILITY: The source code of GTGenie and the test data are available at: https://github.com/Wolverinerine/GTGenie.

Subject(s)

Computational Biology , Software , Computational Biology/methods , Humans

19.

Automated assembly of molecular mechanisms at scale from text mining and curated databases.

Bachman, John A; Gyori, Benjamin M; Sorger, Peter K.

Mol Syst Biol ; 19(5): e11325, 2023 05 09.

Article in English | MEDLINE | ID: mdl-36938926

ABSTRACT

The analysis of omic data depends on machine-readable information about protein interactions, modifications, and activities as found in protein interaction networks, databases of post-translational modifications, and curated models of gene and protein function. These resources typically depend heavily on human curation. Natural language processing systems that read the primary literature have the potential to substantially extend knowledge resources while reducing the burden on human curators. However, machine-reading systems are limited by high error rates and commonly generate fragmentary and redundant information. Here, we describe an approach to precisely assemble molecular mechanisms at scale using multiple natural language processing systems and the Integrated Network and Dynamical Reasoning Assembler (INDRA). INDRA identifies full and partial overlaps in information extracted from published papers and pathway databases, uses predictive models to improve the reliability of machine reading, and thereby assembles individual pieces of information into non-redundant and broadly usable mechanistic knowledge. Using INDRA to create high-quality corpora of causal knowledge we show it is possible to extend protein-protein interaction databases and explain co-dependencies in the Cancer Dependency Map.

Subject(s)

Data Mining , Natural Language Processing , Humans , Reproducibility of Results , Databases, Factual

20.

A patient-centered textbook outcome measure effectively discriminates contemporary elective open abdominal aortic aneurysm repair quality.

Felsted, Amy; Beck, Adam W; Banks, Charles Adam; Neal, Dan; Columbo, Jesse A; Robinson, Scott T; Stone, David H; Scali, Salvatore T.

J Vasc Surg ; 2024 Jun 03.

Article in English | MEDLINE | ID: mdl-38838968

ABSTRACT

BACKGROUND: There is persistent controversy surrounding the merit of surgical volume benchmarks being used solely as a sufficient proxy for assessing the quality of open abdominal aortic aneurysm (AAA) repair. Importantly, operative volume quotas may fail to reflect a more nuanced and comprehensive depiction of surgical outcomes most relevant to patients. Accordingly, we herein propose a patient-centered textbook outcome (TO) for AAA repair that is analogous to other large magnitude extirpative operations performed in other surgical specialties, and test its feasibility to discriminate hospital performance using Society for Vascular Surgery (SVS) volume guidelines. METHODS: All elective open infrarenal AAA repairs (OAR) in the SVS-Vascular Quality Initiative were examined (2009-2022). The primary end point was a TO, defined as a composite of no in-hospital complication or reintervention/reoperation, length of stay of ≤10 days, home discharge, and 1-year survival rates. The discriminatory ability of the TO measure was assessed by comparing centers that did or did not meet the SVS annual OAR volume threshold recommendation (high volume ≥10 OARs/year; low volume <10 OARs/year). Logistic regression and multivariable models adjusted for patient and procedure-related differences. RESULTS: A total of 9657 OARs across 198 centers were analyzed (mean age, 69.5 ± 8.4 years; female, 26%; non-White, 12%). A TO was identified in 44% (n = 4293) of the overall cohort. The incidence of individual TO components included no in-hospital complication (61%), no in-hospital reintervention or reoperation (92%), length of stay of ≤10 days (78%), home discharge (76%), and 1-year survival (91%). Median annual center volume was 6 (interquartile range, 3-10) and a majority of centers did not meet the SVS volume suggested threshold (<10 OARs/year, n = 148 [74%]). However, most patients (6265 of 9657 [65%]) underwent OAR in high-volume hospitals. 
When comparing high- and low-volume centers, a TO was more likely to occur in high-volume institutions: ≥10 OARs/year (46%) vs <10 OARs/year (42%; P = .0006). The association of a protective effect for higher center volume remained after risk adjustment (odds ratio, 1.1; 95% confidence interval, 1.05-1.26; P = .003). CONCLUSIONS: TOs for elective OAR reflect a more nuanced and comprehensive patient centered proxy to measure care delivery, consistent with other surgical specialties. Surprisingly, a TO was achieved in <50% of elective AAA cases nationally. Although the likelihood of a TO seems to correlate with SVS center volume recommendations, it more importantly reflects elements which may be prioritized by patients and thus offers insights into further improving real-world AAA care.
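
The composite definition in the METHODS above (no in-hospital complication, no reintervention/reoperation, length of stay of ≤10 days, home discharge, and 1-year survival) can be illustrated with a minimal Python sketch. This is not the authors' code: the record type `Repair`, its field names, and the helper functions are hypothetical, and the actual registry variables are not described in the abstract.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Repair:
    """Hypothetical per-patient record; field names are assumptions."""
    in_hospital_complication: bool
    reintervention_or_reoperation: bool
    length_of_stay_days: int
    discharged_home: bool
    alive_at_one_year: bool


def is_textbook_outcome(r: Repair, max_los_days: int = 10) -> bool:
    # A textbook outcome requires ALL five components to be met at once.
    return (
        not r.in_hospital_complication
        and not r.reintervention_or_reoperation
        and r.length_of_stay_days <= max_los_days
        and r.discharged_home
        and r.alive_at_one_year
    )


def textbook_outcome_rate(repairs: Iterable[Repair]) -> float:
    # Fraction of repairs that meet the composite; 0.0 for an empty cohort.
    repairs = list(repairs)
    if not repairs:
        return 0.0
    return sum(is_textbook_outcome(r) for r in repairs) / len(repairs)


# Illustrative, made-up records (not data from the study).
cohort = [
    Repair(False, False, 7, True, True),    # meets every component
    Repair(True, False, 7, True, True),     # fails: complication
    Repair(False, False, 12, True, True),   # fails: stay longer than 10 days
    Repair(False, False, 10, False, True),  # fails: not discharged home
]
rate = textbook_outcome_rate(cohort)
```

Because the composite is an all-or-none conjunction, its rate can be well below the rate of any single component, which is consistent with the abstract reporting a 44% TO rate although each component was individually met in 61% to 92% of cases.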
