A gene-phenotype relationship extraction pipeline from the biomedical literature using a representation learning approach.

Xing, Wenhui; Qi, Junsheng; Yuan, Xiaohui; Li, Lin; Zhang, Xiaoyu; Fu, Yuhua; Xiong, Shengwu; Hu, Lun; Peng, Jing

Affiliation

Xing W; School of Computer Science and Technology, Wuhan University of Technology, Wuhan, China.
Qi J; Department of Plant Science, College of Biological Science, China Agricultural University, Beijing, China.
Yuan X; School of Computer Science and Technology, Wuhan University of Technology, Wuhan, China.
Li L; School of Computer Science and Technology, Wuhan University of Technology, Wuhan, China.
Zhang X; Britton Chance Center for Biomedical Photonics, Wuhan National Laboratory for Optoelectronics-Huazhong University of Science and Technology, Wuhan, China.
Fu Y; School of Computer Science and Technology, Wuhan University of Technology, Wuhan, China.
Xiong S; School of Computer Science and Technology, Wuhan University of Technology, Wuhan, China.
Hu L; School of Computer Science and Technology, Wuhan University of Technology, Wuhan, China.
Peng J; School of Computer Science and Technology, Wuhan University of Technology, Wuhan, China.

Bioinformatics; 34(13): i386-i394, 2018 07 01.

Article in English | MEDLINE | ID: mdl-29950017

ABSTRACT

Motivation: The fundamental challenge of modern genetic analysis is to establish gene-phenotype correlations, which are often reported across a large body of publications. Because the lexical features of genes are relatively regular in text, the main challenge in extracting these relations is phenotype recognition. Because phenotypic descriptions are often study- or author-specific, few lexicons can effectively identify the full range of phenotypic expressions in text, especially for plants.

Results:

We propose a pipeline for extracting phenotypes, genes and their relations from the biomedical literature. Combined with abbreviation revision and sentence template extraction, we improved an unsupervised word-embedding-to-sentence-embedding cascaded approach as the representation learning method for recognizing the wide variety of phenotypic information in the literature. In addition, a dictionary- and rule-based method was applied for gene recognition. Finally, we integrated OLLIE, a well-known information extraction system, to identify gene-phenotype relations. To demonstrate the applicability of the pipeline, we conducted two types of comparison experiments using the model organism Arabidopsis thaliana. In the comparison with state-of-the-art baselines, our approach obtained the best performance (F1-measure of 66.83%). We also applied the pipeline to 481 full-text articles from the TAIR manually curated gene-phenotype relationship dataset to test its validity. The results showed that the pipeline covers 70.94% of the original dataset and adds 373 new relations to expand it.

Availability and implementation: The source code is available at http://www.wutbiolab.cn:82/Gene-Phenotype-Relation-Extraction-Pipeline.zip.

Supplementary information: Supplementary data are available at Bioinformatics online.
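The abstract mentions a dictionary- and rule-based method for gene recognition but gives no details. The following is only a minimal sketch of what such a step could look like, not the authors' implementation. It assumes one rule (the standard Arabidopsis AGI locus identifier format, such as AT1G01010) and a tiny hypothetical symbol dictionary; the names `AGI_LOCUS`, `GENE_SYMBOLS` and `find_genes` are illustrative and do not come from the paper.

```python
import re

# Rule: AGI locus identifiers for Arabidopsis thaliana, e.g. AT1G01010.
# "AT", a chromosome code (1-5, C = chloroplast, M = mitochondrion), "G", five digits.
AGI_LOCUS = re.compile(r"\bAT[1-5CM]G\d{5}\b", re.IGNORECASE)

# Dictionary: hypothetical toy set of gene symbols (assumption for illustration).
# A real pipeline would load these from a curated resource.
GENE_SYMBOLS = {"FLC", "FT", "PHYB"}

# Alphanumeric tokens, used for the dictionary lookup.
TOKEN = re.compile(r"[A-Za-z0-9]+")


def find_genes(sentence):
    """Return sorted (start, end, text) spans of gene mentions in a sentence."""
    # Rule-based matches first.
    spans = {(m.start(), m.end(), m.group()) for m in AGI_LOCUS.finditer(sentence)}
    # Dictionary matches; case-sensitive to limit false positives on common words.
    for m in TOKEN.finditer(sentence):
        if m.group() in GENE_SYMBOLS:
            spans.add((m.start(), m.end(), m.group()))
    return sorted(spans)


if __name__ == "__main__":
    for start, end, text in find_genes("FLC and AT1G01010 were both examined."):
        print(start, end, text)
```

In a fuller pipeline, spans like these would be passed, together with the recognized phenotype phrases, to the relation extraction step.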

Subject(s)

Data Mining/methods; Genetic Association Studies/methods; Software; Databases, Bibliographic; Genotype; Machine Learning; Phenotype; Plants/genetics

Full text: 1 Collection: 01-international Database: MEDLINE Main subject: Software / Genetic Association Studies / Data Mining Language: English Journal: Bioinformatics Journal subject: MEDICAL INFORMATICS Year: 2018 Document type: Article Affiliation country: China
