Search | VHL Regional Portal

1.

High-quality gene/disease embedding in a multi-relational heterogeneous graph after a joint matrix/tensor decomposition.

Zhou, Kaiyin; Zhang, Sheng; Wang, Yuxing; Cohen, Kevin Bretonnel; Kim, Jin-Dong; Luo, Qi; Yao, Xinzhi; Zhou, Xingyu; Xia, Jingbo.

J Biomed Inform ; 126: 103973, 2022 02.

Article in English | MEDLINE | ID: mdl-34995810

ABSTRACT

MOTIVATION: Node embedding of biological entity network has been widely investigated for the downstream application scenarios. To embed full semantics of gene and disease, a multi-relational heterogeneous graph is considered in a scenario where uni-relation between gene/disease and other heterogeneous entities are abundant while multi-relation between gene and disease is relatively sparse. After introducing this novel graph format, it is illuminative to design a specific data integration algorithm to fully capture the graph information and bring embeddings with high quality. RESULTS: First, a typical multi-relational triple dataset was introduced, which carried significant association between gene and disease. Second, we curated all human genes and diseases in seven mainstream datasets and constructed a large-scale gene-disease network, which compromising 163,024 nodes and 25,265,607 edges, and relates to 27,165 genes, 2,665 diseases, 15,067 chemicals, 108,023 mutations, 2,363 pathways, and 7.732 phenotypes. Third, we proposed a Joint Decomposition of Heterogeneous Matrix and Tensor (JDHMT) model, which integrated all heterogeneous data resources and obtained embedding for each gene or disease. Forth, a visualized intrinsic evaluation was performed, which investigated the embeddings in terms of interpretable data clustering. Furthermore, an extrinsic evaluation was performed in the form of linking prediction. Both intrinsic and extrinsic evaluation results showed that JDHMT model outperformed other eleven state-of-the-art (SOTA) methods which are under relation-learning, proximity-preserving or message-passing paradigms. Finally, the constructed gene-disease network, embedding results and codes were made available. DATA AND CODES AVAILABILITY: The constructed massive gene-disease network is available at: https://hzaubionlp.com/heterogeneous-biological-network/. The codes are available at: https://github.com/bionlp-hzau/JDHMT.

Subject(s)

Algorithms , Semantics , Learning , Phenotype

2.

Editor's introduction to the special section on the 7th Biomedical Linked Annotation Hackathon (BLAH7).

Kim, Jin-Dong; Cohen, Kevin Bretonnel; Rinaldi, Fabio; Lu, Zhiyong; Park, Hyun-Seok.

Genomics Inform ; 19(3): e20, 2021 Sep.

Article in English | MEDLINE | ID: mdl-34638167

3.

Natural Language Processing for Rapid Response to Emergent Diseases: Case Study of Calcium Channel Blockers and Hypertension in the COVID-19 Pandemic.

Neuraz, Antoine; Lerner, Ivan; Digan, William; Paris, Nicolas; Tsopra, Rosy; Rogier, Alice; Baudoin, David; Cohen, Kevin Bretonnel; Burgun, Anita; Garcelon, Nicolas; Rance, Bastien.

J Med Internet Res ; 22(8): e20773, 2020 Aug 14.

Article in English | MEDLINE | ID: mdl-32759101

ABSTRACT

BACKGROUND: A novel disease poses special challenges for informatics solutions. Biomedical informatics relies for the most part on structured data, which require a preexisting data or knowledge model; however, novel diseases do not have preexisting knowledge models. In an emergent epidemic, language processing can enable rapid conversion of unstructured text to a novel knowledge model. However, although this idea has often been suggested, no opportunity has arisen to actually test it in real time. The current coronavirus disease (COVID-19) pandemic presents such an opportunity. OBJECTIVE: The aim of this study was to evaluate the added value of information from clinical text in response to emergent diseases using natural language processing (NLP). METHODS: We explored the effects of long-term treatment by calcium channel blockers on the outcomes of COVID-19 infection in patients with high blood pressure during in-patient hospital stays using two sources of information: data available strictly from structured electronic health records (EHRs) and data available through structured EHRs and text mining. RESULTS: In this multicenter study involving 39 hospitals, text mining increased the statistical power sufficiently to change a negative result for an adjusted hazard ratio to a positive one. Compared to the baseline structured data, the number of patients available for inclusion in the study increased by 2.95 times, the amount of available information on medications increased by 7.2 times, and the amount of additional phenotypic information increased by 11.9 times. CONCLUSIONS: In our study, use of calcium channel blockers was associated with decreased in-hospital mortality in patients with COVID-19 infection. This finding was obtained by quickly adapting an NLP pipeline to the domain of the novel disease; the adapted pipeline still performed sufficiently to extract useful information. When that information was used to supplement existing structured data, the sample size could be increased sufficiently to see treatment effects that were not previously statistically detectable.

Subject(s)

Betacoronavirus , Calcium Channel Blockers/therapeutic use , Coronavirus Infections/drug therapy , Hypertension/complications , Natural Language Processing , Pneumonia, Viral/drug therapy , COVID-19 , Coronavirus Infections/complications , Data Mining , Electronic Health Records , Humans , Pandemics , Pneumonia, Viral/complications , SARS-CoV-2 , Time Factors , COVID-19 Drug Treatment

4.

Editor's introduction to the special issue of the 6th Biomedical Linked Annotation Hackathon (BLAH6).

Kim, Jin-Dong; Cohen, Kevin Bretonnel; Rinaldi, Fabio; Lu, Zhiyong; Collier, Nigel; Park, Hyun-Seok.

Genomics Inform ; 18(2): e12, 2020 Jun.

Article in English | MEDLINE | ID: mdl-32634866

5.

Introduction to BLAH5 special issue: recent progress on interoperability of biomedical text mining.

Kim, Jin-Dong; Cohen, Kevin Bretonnel; Collier, Nigel; Lu, Zhiyong; Rinaldi, Fabio.

Genomics Inform ; 17(2): e12, 2019 Jun.

Article in English | MEDLINE | ID: mdl-31307127

6.

GOF/LOF knowledge inference with tensor decomposition in support of high order link discovery for gene, mutation and disease.

Zhou, Kai Yin; Wang, Yu Xing; Zhang, Sheng; Gachloo, Mina; Kim, Jin Dong; Luo, Qi; Cohen, Kevin Bretonnel; Xia, Jing Bo.

Math Biosci Eng ; 16(3): 1376-1391, 2019 02 20.

Article in English | MEDLINE | ID: mdl-30947425

ABSTRACT

For discovery of new usage of drugs, the function type of their target genes plays an important role, and the hypothesis of "Antagonist-GOF" and "Agonist-LOF" has laid a solid foundation for supporting drug repurposing. In this research, an active gene annotation corpus was used as training data to predict the gain-of-function or loss-of-function or unknown character of each human gene after variation events. Unlike the design of(entity, predicate, entity) triples in a traditional three way tensor, a four way and a five way tensor, GMFD-/GMAFD-tensor, were designed to represent higher order links among or among part of these entities: genes(G), mutations(M), functions(F), diseases( D) and annotation labels(A). A tensor decomposition algorithm, CP decomposition, was applied to the higher order tensor and to unveil the correlation among entities. Meanwhile, a state-of-the-art baseline tensor decomposition algorithm, RESCAL, was carried on the three way tensor as a comparing method. The result showed that CP decomposition on higher order tensor performed better than RESCAL on traditional three way tensor in recovering masked data and making predictions. In addition, The four way tensor was proved to be the best format for our issue. At the end, a case study reproducing two disease-gene-drug links(Myelodysplatic Syndromes-IL2RA-Aldesleukin, Lymphoma- IL2RA-Aldesleukin) presented the feasibility of our prediction model for drug repurposing.

Subject(s)

Drug Repositioning/economics , Drug Repositioning/methods , Genetic Variation , Machine Learning , Mutation , Algorithms , Cost-Benefit Analysis , Genetic Diseases, Inborn/genetics , Humans , Interleukin-2/analogs & derivatives , Interleukin-2/therapeutic use , Interleukin-2 Receptor alpha Subunit/genetics , Lymphoma/genetics , Models, Genetic , Molecular Sequence Annotation , Myelodysplastic Syndromes/genetics , Recombinant Proteins/therapeutic use , Software

7.

Inter-Annotator Agreement and the Upper Limit on Machine Performance: Evidence from Biomedical Natural Language Processing.

Boguslav, Mayla; Cohen, Kevin Bretonnel.

Stud Health Technol Inform ; 245: 298-302, 2017.

Article in English | MEDLINE | ID: mdl-29295103

ABSTRACT

Human-annotated data is a fundamental part of natural language processing system development and evaluation. The quality of that data is typically assessed by calculating the agreement between the annotators. It is widely assumed that this agreement between annotators is the upper limit on system performance in natural language processing: if humans can't agree with each other about the classification more than some percentage of the time, we don't expect a computer to do any better. We trace the logical positivist roots of the motivation for measuring inter-annotator agreement, demonstrate the prevalence of the widely-held assumption about the relationship between inter-annotator agreement and system performance, and present data that suggest that inter-annotator agreement is not, in fact, an upper bound on language processing system performance.

Subject(s)

Data Curation , Natural Language Processing , Humans , Observer Variation

8.

Crowdsourcing and curation: perspectives from biology and natural language processing.

Hirschman, Lynette; Fort, Karën; Boué, Stéphanie; Kyrpides, Nikos; Islamaj Dogan, Rezarta; Cohen, Kevin Bretonnel.

Database (Oxford) ; 20162016.

Article in English | MEDLINE | ID: mdl-27504010

ABSTRACT

Crowdsourcing is increasingly utilized for performing tasks in both natural language processing and biocuration. Although there have been many applications of crowdsourcing in these fields, there have been fewer high-level discussions of the methodology and its applicability to biocuration. This paper explores crowdsourcing for biocuration through several case studies that highlight different ways of leveraging 'the crowd'; these raise issues about the kind(s) of expertise needed, the motivations of participants, and questions related to feasibility, cost and quality. The paper is an outgrowth of a panel session held at BioCreative V (Seville, September 9-11, 2015). The session consisted of four short talks, followed by a discussion. In their talks, the panelists explored the role of expertise and the potential to improve crowd performance by training; the challenge of decomposing tasks to make them amenable to crowdsourcing; and the capture of biological data and metadata through community editing.Database URL: http://www.mitre.org/publications/technical-papers/crowdsourcing-and-curation-perspectives.

Subject(s)

Crowdsourcing , Data Curation/methods , Metadata , Natural Language Processing

9.

Methodological Issues in Predicting Pediatric Epilepsy Surgery Candidates Through Natural Language Processing and Machine Learning.

Cohen, Kevin Bretonnel; Glass, Benjamin; Greiner, Hansel M; Holland-Bouley, Katherine; Standridge, Shannon; Arya, Ravindra; Faist, Robert; Morita, Diego; Mangano, Francesco; Connolly, Brian; Glauser, Tracy; Pestian, John.

Biomed Inform Insights ; 8: 11-8, 2016.

Article in English | MEDLINE | ID: mdl-27257386

ABSTRACT

OBJECTIVE: We describe the development and evaluation of a system that uses machine learning and natural language processing techniques to identify potential candidates for surgical intervention for drug-resistant pediatric epilepsy. The data are comprised of free-text clinical notes extracted from the electronic health record (EHR). Both known clinical outcomes from the EHR and manual chart annotations provide gold standards for the patient's status. The following hypotheses are then tested: 1) machine learning methods can identify epilepsy surgery candidates as well as physicians do and 2) machine learning methods can identify candidates earlier than physicians do. These hypotheses are tested by systematically evaluating the effects of the data source, amount of training data, class balance, classification algorithm, and feature set on classifier performance. The results support both hypotheses, with F-measures ranging from 0.71 to 0.82. The feature set, classification algorithm, amount of training data, class balance, and gold standard all significantly affected classification performance. It was further observed that classification performance was better than the highest agreement between two annotators, even at one year before documented surgery referral. The results demonstrate that such machine learning methods can contribute to predicting pediatric epilepsy surgery candidates and reducing lag time to surgery referral.

10.

BioC: a minimalist approach to interoperability for biomedical text processing.

Comeau, Donald C; Islamaj Dogan, Rezarta; Ciccarese, Paolo; Cohen, Kevin Bretonnel; Krallinger, Martin; Leitner, Florian; Lu, Zhiyong; Peng, Yifan; Rinaldi, Fabio; Torii, Manabu; Valencia, Alfonso; Verspoor, Karin; Wiegers, Thomas C; Wu, Cathy H; Wilbur, W John.

Database (Oxford) ; 2013: bat064, 2013.

Article in English | MEDLINE | ID: mdl-24048470

ABSTRACT

A vast amount of scientific information is encoded in natural language text, and the quantity of such text has become so great that it is no longer economically feasible to have a human as the first step in the search process. Natural language processing and text mining tools have become essential to facilitate the search for and extraction of information from text. This has led to vigorous research efforts to create useful tools and to create humanly labeled text corpora, which can be used to improve such tools. To encourage combining these efforts into larger, more powerful and more capable systems, a common interchange format to represent, store and exchange the data in a simple manner between different language processing systems and text mining tools is highly desirable. Here we propose a simple extensible mark-up language format to share text documents and annotations. The proposed annotation approach allows a large number of different annotations to be represented including sentences, tokens, parts of speech, named entities such as genes or diseases and relationships between named entities. In addition, we provide simple code to hold this data, read it from and write it back to extensible mark-up language files and perform some sample processing. We also describe completed as well as ongoing work to apply the approach in several directions. Code and data are available at http://bioc.sourceforge.net/. Database URL: http://bioc.sourceforge.net/

Subject(s)

Biomedical Research , Data Mining , Natural Language Processing , Software , Humans

11.

Mining FDA drug labels for medical conditions.

Li, Qi; Deleger, Louise; Lingren, Todd; Zhai, Haijun; Kaiser, Megan; Stoutenborough, Laura; Jegga, Anil G; Cohen, Kevin Bretonnel; Solti, Imre.

BMC Med Inform Decis Mak ; 13: 53, 2013 Apr 24.

Article in English | MEDLINE | ID: mdl-23617267

ABSTRACT

BACKGROUND: Cincinnati Children's Hospital Medical Center (CCHMC) has built the initial Natural Language Processing (NLP) component to extract medications with their corresponding medical conditions (Indications, Contraindications, Overdosage, and Adverse Reactions) as triples of medication-related information ([(1) drug name]-[(2) medical condition]-[(3) LOINC section header]) for an intelligent database system, in order to improve patient safety and the quality of health care. The Food and Drug Administration's (FDA) drug labels are used to demonstrate the feasibility of building the triples as an intelligent database system task. METHODS: This paper discusses a hybrid NLP system, called AutoMCExtractor, to collect medical conditions (including disease/disorder and sign/symptom) from drug labels published by the FDA. Altogether, 6,611 medical conditions in a manually-annotated gold standard were used for the system evaluation. The pre-processing step extracted the plain text from XML file and detected eight related LOINC sections (e.g. Adverse Reactions, Warnings and Precautions) for medical condition extraction. Conditional Random Fields (CRF) classifiers, trained on token, linguistic, and semantic features, were then used for medical condition extraction. Lastly, dictionary-based post-processing corrected boundary-detection errors of the CRF step. We evaluated the AutoMCExtractor on manually-annotated FDA drug labels and report the results on both token and span levels. RESULTS: Precision, recall, and F-measure were 0.90, 0.81, and 0.85, respectively, for the span level exact match; for the token-level evaluation, precision, recall, and F-measure were 0.92, 0.73, and 0.82, respectively. CONCLUSIONS: The results demonstrate that (1) medical conditions can be extracted from FDA drug labels with high performance; and (2) it is feasible to develop a framework for an intelligent database system.

Subject(s)

Adverse Drug Reaction Reporting Systems , Data Mining/methods , Drug Labeling , United States Food and Drug Administration , Humans , Medication Systems , Natural Language Processing , Ohio , United States

12.

Text and data mining for biomedical discovery.

Gonzalez, Graciela; Cohen, Kevin Bretonnel; Greene, Casey S; Hahn, Udo; Kann, Maricel G; Leaman, Robert; Shah, Nigam; Ye, Jieping.

Pac Symp Biocomput ; 2013: 368-72, 2013.

Article in English | MEDLINE | ID: mdl-23424141

ABSTRACT

The biggest challenge for text and data mining is to truly impact the biomedical discovery process, enabling scientists to generate novel hypothesis to address the most crucial questions. Among a number of worthy submissions, we have selected six papers that exemplify advances in text and data mining methods that have a demonstrated impact on a wide range of applications. Work presented in this session includes data mining techniques applied to the discovery of 3-way genetic interactions and to the analysis of genetic data in the context of electronic medical records (EMRs), as well as an integrative approach that combines data from genetic (SNP) and transcriptomic (microarray) sources for clinical prediction. Text mining advances include a classification method to determine whether a published article contains pharmacological experiments relevant to drug-drug interactions, a fine-grained text mining approach for detecting the catalytic sites in proteins in the biomedical literature, and a method for automatically extending a taxonomy of health-related terms to integrate consumer-friendly synonyms for medical terminologies.

Subject(s)

Computational Biology , Data Mining , Computational Biology/methods , Data Mining/methods , Drug Interactions , Electronic Health Records , Humans , Terminology as Topic

13.

A corpus of full-text journal articles is a robust evaluation tool for revealing differences in performance of biomedical natural language processing tools.

Verspoor, Karin; Cohen, Kevin Bretonnel; Lanfranchi, Arrick; Warner, Colin; Johnson, Helen L; Roeder, Christophe; Choi, Jinho D; Funk, Christopher; Malenkiy, Yuriy; Eckert, Miriam; Xue, Nianwen; Baumgartner, William A; Bada, Michael; Palmer, Martha; Hunter, Lawrence E.

BMC Bioinformatics ; 13: 207, 2012 Aug 17.

Article in English | MEDLINE | ID: mdl-22901054

ABSTRACT

BACKGROUND: We introduce the linguistic annotation of a corpus of 97 full-text biomedical publications, known as the Colorado Richly Annotated Full Text (CRAFT) corpus. We further assess the performance of existing tools for performing sentence splitting, tokenization, syntactic parsing, and named entity recognition on this corpus. RESULTS: Many biomedical natural language processing systems demonstrated large differences between their previously published results and their performance on the CRAFT corpus when tested with the publicly available models or rule sets. Trainable systems differed widely with respect to their ability to build high-performing models based on this data. CONCLUSIONS: The finding that some systems were able to train high-performing models based on this corpus is additional evidence, beyond high inter-annotator agreement, that the quality of the CRAFT corpus is high. The overall poor performance of various systems indicates that considerable work needs to be done to enable natural language processing systems to work well when the input is full-text journal articles. The CRAFT corpus provides a valuable resource to the biomedical natural language processing community for evaluation and training of new models for biomedical full text publications.

Subject(s)

Data Mining/methods , Natural Language Processing , Software

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL