Search | Secretaria de Estado da Saúde

Analyzing the field of bioinformatics with the multi-faceted topic modeling technique.

Heo, Go Eun; Kang, Keun Young; Song, Min; Lee, Jeong-Hoon.

BMC Bioinformatics ; 18(Suppl 7): 251, 2017 May 31.

Article in English | MEDLINE | ID: mdl-28617229

ABSTRACT

BACKGROUND: Bioinformatics is an interdisciplinary field at the intersection of molecular biology and computing technology. To characterize the field as a convergent domain, researchers have used bibliometrics, augmented with text-mining techniques for content analysis. In previous studies, Latent Dirichlet Allocation (LDA) was the most representative topic modeling technique for identifying the topic structure of subject areas. However, LDA displays only the simple topic structure; it does not reveal the topic structure in relation to metadata such as authors, publication dates, and journals. METHODS: In this paper, we adopt Tang et al.'s Author-Conference-Topic (ACT) model to study the field of bioinformatics from the perspective of keyphrases, authors, and journals. The ACT model is capable of incorporating the paper, author, and conference into the topic distribution simultaneously. To obtain more meaningful results, we use journals and keyphrases instead of conferences and bag-of-words. For analysis, we use PubMed to collect forty-six bioinformatics journals from the MEDLINE database. We conduct time series topic analysis over four periods from 1996 to 2015 to further examine the interdisciplinary nature of bioinformatics. RESULTS: We analyze the ACT model results in each period. Additionally, for further integrated analysis, we conduct a time series analysis among the top-ranked keyphrases, journals, and authors according to their frequency. We also examine the patterns in the top journals by simultaneously identifying the topical probability in each period, as well as the top authors and keyphrases. The results indicate that in recent years diversified topics have become more prevalent and convergent topics have become more clearly represented. CONCLUSION: The results of our analysis imply that, over time, the field of bioinformatics has become more interdisciplinary, with a steady increase in peripheral fields such as conceptual, mathematical, and systems biology. These results are confirmed by integrated analysis of topic distribution as well as top-ranked keyphrases, authors, and journals.
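The frequency-based time series step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' ACT model or their code: it only buckets records into periods and ranks keyphrases by count. The four equal five-year windows, the function names, and the toy records are all assumptions made for illustration; the abstract states only that four periods between 1996 and 2015 were used.

```python
from collections import Counter

# Assumption: four equal five-year windows. The abstract does not give the boundaries.
PERIODS = [(1996, 2000), (2001, 2005), (2006, 2010), (2011, 2015)]


def period_of(year):
    """Return the (start, end) window containing `year`, or None if out of range."""
    for start, end in PERIODS:
        if start <= year <= end:
            return (start, end)
    return None


def top_keyphrases_by_period(records, k=2):
    """Count keyphrase frequency per period and return the k most frequent in each."""
    counts = {p: Counter() for p in PERIODS}
    for year, keyphrases in records:
        p = period_of(year)
        if p is not None:
            counts[p].update(keyphrases)
    return {p: [kp for kp, _ in c.most_common(k)] for p, c in counts.items()}


# Toy records invented for illustration: (publication year, keyphrases).
records = [
    (1998, ["sequence alignment", "protein structure"]),
    (1999, ["sequence alignment"]),
    (2004, ["gene expression", "microarray"]),
    (2005, ["gene expression"]),
    (2012, ["systems biology", "network analysis"]),
    (2014, ["systems biology"]),
]

result = top_keyphrases_by_period(records, k=1)
for period, phrases in result.items():
    print(period, phrases)
```

The same grouping could be applied to journals or authors by swapping the second element of each record.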

Subjects

Computational Biology/methods; Models, Theoretical; Bibliometrics; Data Mining; Databases, Factual; Humans

Establishing a baseline for literature mining human genetic variants and their relationships to disease cohorts.

Verspoor, Karin M; Heo, Go Eun; Kang, Keun Young; Song, Min.

BMC Med Inform Decis Mak ; 16 Suppl 1: 68, 2016 Jul 18.

Article in English | MEDLINE | ID: mdl-27454860

ABSTRACT

BACKGROUND: The Variome corpus, a small collection of published articles about inherited colorectal cancer, includes annotations of 11 entity types and 13 relation types related to the curation of the relationship between genetic variation and disease. Due to the richness of these annotations, the corpus provides a good testbed for evaluation of biomedical literature information extraction systems. METHODS: In this paper, we focus on assessing performance on extracting the relations in the corpus, using gold standard entities as a starting point, to establish a baseline for extraction of relations important for extraction of genetic variant information from the literature. We test the application of the Public Knowledge Discovery Engine for Java (PKDE4J) system, a natural language processing system designed for information extraction of entities and relations in text, on the relation extraction task using this corpus. RESULTS: For the relations which are attested at least 100 times in the Variome corpus, we achieve a Precision-weighted F-score ranging from 0.78 to 0.84, depending on the relation. We find that the PKDE4J system adapted straightforwardly to the range of relation types represented in the corpus; some extensions to the original methodology were required to adapt to the multi-relational classification context. The results are competitive with state-of-the-art relation extraction performance on more heavily studied corpora, although the analysis shows that the Recall of a co-occurrence baseline outweighs the benefit of improved Precision for many relations, indicating the value of simple semantic constraints on relations. CONCLUSIONS: This work represents the first attempt to apply relation extraction methods to the Variome corpus. The results demonstrate that automated methods have good potential to structure the information expressed in the published literature related to genetic variants, connecting mutations to genes, diseases, and patient cohorts. Further development of such approaches will facilitate more efficient biocuration of genetic variant information into structured databases, leveraging the knowledge embedded in the vast publication literature.
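The Recall/Precision trade-off of a co-occurrence baseline can be seen in a small Python sketch. It assumes the simplest form of such a baseline, in which every pair of gold entities sharing a sentence is predicted to be related; the abstract does not specify the exact baseline used. The entity names are placeholders, the function names are hypothetical, and the score computed is the balanced F1 rather than the Precision-weighted F-score reported above, whose weighting is not given here.

```python
from itertools import combinations


def cooccurrence_pairs(sentences):
    """Predict a relation for every unordered pair of entities sharing a sentence."""
    predicted = set()
    for entities in sentences:
        for a, b in combinations(sorted(set(entities)), 2):
            predicted.add(frozenset((a, b)))
    return predicted


def precision_recall_f1(predicted, gold):
    """Standard precision, recall, and balanced F1 over sets of predicted pairs."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Placeholder data: each sentence is the list of gold entity mentions it contains.
sentences = [
    ["GENE_A", "VARIANT_1", "DISEASE_X"],
    ["GENE_B", "DISEASE_Y"],
]
# Placeholder gold relations, stored as unordered pairs.
gold = {
    frozenset(("GENE_A", "VARIANT_1")),
    frozenset(("GENE_B", "DISEASE_Y")),
}

predicted = cooccurrence_pairs(sentences)
precision, recall, f1 = precision_recall_f1(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

On this toy input the baseline recovers every gold pair but also predicts two spurious ones, which is the pattern the abstract points to: high Recall at the cost of Precision.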

Subjects

Colorectal Neoplasms/genetics; Data Mining/methods; Databases, Genetic; Genetic Variation/genetics; Humans

PKDE4J: Entity and relation extraction for public knowledge discovery.

Song, Min; Kim, Won Chul; Lee, Dahee; Heo, Go Eun; Kang, Keun Young.

J Biomed Inform ; 57: 320-32, 2015 Oct.

Article in English | MEDLINE | ID: mdl-26277115

ABSTRACT

Because the enormous number of scientific publications cannot be handled manually, there is rising interest in text-mining techniques for automated information extraction, especially in the biomedical field. Such techniques provide effective means of information search, knowledge discovery, and hypothesis generation. Most previous studies have focused on the design and performance improvement of either named entity recognition or relation extraction. In this paper, we present PKDE4J, a comprehensive text-mining system that integrates dictionary-based entity extraction and rule-based relation extraction in a highly flexible and extensible framework. Starting with Stanford CoreNLP, we developed the system to cope with multiple types of entities and relations. The system also has fairly good performance in terms of accuracy as well as the ability to configure text-processing components. We demonstrate its competitive performance by evaluating it on many corpora and find that it surpasses existing systems, with average F-measures of 85% for entity extraction and 81% for relation extraction.
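The general idea of pairing dictionary-based entity extraction with rule-based relation extraction can be sketched in a few lines of Python. This is not PKDE4J's API or rule set (PKDE4J is a Java system built on Stanford CoreNLP); the dictionary, the trigger phrases, and the function names below are hypothetical and chosen only to show the two stages.

```python
import re

# Hypothetical dictionary mapping lowercase surface forms to entity types.
ENTITY_DICT = {
    "brca1": "GENE",
    "tp53": "GENE",
    "breast cancer": "DISEASE",
}

# Hypothetical trigger phrases mapped to relation labels.
TRIGGERS = {
    "causes": "CAUSES",
    "inhibits": "INHIBITS",
    "is associated with": "ASSOCIATED_WITH",
}


def extract_entities(sentence):
    """Dictionary lookup: return (start, end, text, type) tuples sorted by position."""
    found = []
    lowered = sentence.lower()  # ASCII input assumed, so offsets match the original
    for term, etype in ENTITY_DICT.items():
        for m in re.finditer(r"\b" + re.escape(term) + r"\b", lowered):
            found.append((m.start(), m.end(), sentence[m.start():m.end()], etype))
    return sorted(found)


def extract_relations(sentence):
    """Rule: adjacent entities with a trigger phrase between them form a relation."""
    entities = extract_entities(sentence)
    relations = []
    for left, right in zip(entities, entities[1:]):
        between = sentence[left[1]:right[0]].lower()
        for trigger, label in TRIGGERS.items():
            if trigger in between:
                relations.append((left[2], label, right[2]))
    return relations


relations = extract_relations("BRCA1 is associated with breast cancer.")
print(relations)
```

A real system would add tokenization, normalization, and syntactic constraints; the sketch keeps only the lookup-then-rule structure.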

Subjects

Data Mining; Knowledge Discovery; Knowledge; Periodicals as Topic; Publications

Examining influential factors for acknowledgements classification using supervised learning.

Song, Min; Kang, Keun Young; Timakum, Tatsawan; Zhang, Xinyuan.

PLoS One ; 15(2): e0228928, 2020.

Article in English | MEDLINE | ID: mdl-32059035

ABSTRACT

Acknowledgements have been examined as important elements in measuring the contributions to and intellectual debts of a scientific publication. Unlike previous studies, which were limited in scope of analysis and relied on manual examination, the present study aimed to conduct automatic classification of acknowledgements on large-scale data. To this end, we first created a training dataset for acknowledgements classification by sampling the acknowledgements sections from the entire PubMed Central database. Second, we adopted various supervised learning algorithms to examine which algorithm performed best under which conditions. In addition, we observed the factors affecting classification performance. We investigated the effects of the following three main aspects: classification algorithms, categories, and text representations. The CNN+Doc2Vec algorithm achieved the highest performance, with 93.58% accuracy on the original dataset and 87.93% on the converted dataset. The experimental results indicated that the characteristics of categories and sentence patterns influenced classification performance. Most of the classifiers performed better on the categories of financial, peer interactive communication, and technical support than on the other classes.

Subjects

Publications/classification; Algorithms; Artificial Intelligence; Humans; Machine Learning; Publications/trends; Research Personnel; Supervised Machine Learning

