Discovering themes in biomedical literature using a projection-based algorithm.

Yeganova, Lana; Kim, Sun; Balasanov, Grigory; Wilbur, W John

Affiliation

Yeganova L; National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, USA. lana.yeganova@nih.gov.
Kim S; National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, USA.
Balasanov G; National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, USA.
Wilbur WJ; National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, USA.

BMC Bioinformatics; 19(1): 269, 2018-07-16.

Article in English | MEDLINE | ID: mdl-30012087

ABSTRACT

BACKGROUND: The need to organize any large document collection in a manner that facilitates human comprehension has become crucial with the increasing volume of information available. Two common approaches to provide a broad overview of the information space are document clustering and topic modeling. Clustering aims to group documents or terms into meaningful clusters. Topic modeling, on the other hand, focuses on finding coherent keywords for describing topics appearing in a set of documents. In addition, there have been efforts for clustering documents and finding keywords simultaneously.

RESULTS: We present an algorithm to analyze document collections that is based on a notion of a theme, defined as a dual representation based on a set of documents and key terms. In this work, a novel vector space mechanism is proposed for computing themes. Starting with a single document, the theme algorithm treats terms and documents as explicit components, and iteratively uses each representation to refine the other until the theme is detected. The method heavily relies on an optimization routine that we refer to as the projection algorithm which, under specific conditions, is guaranteed to converge to the first singular vector of a data matrix. We apply our algorithm to a collection of about sixty thousand PubMed® documents examining the subject of Single Nucleotide Polymorphism, evaluate the results and show the effectiveness and scalability of the proposed method.

CONCLUSIONS: This study presents a contribution on theoretical and algorithmic levels, as well as demonstrates the feasibility of the method for large scale applications. The evaluation of our system on benchmark datasets demonstrates that our method compares favorably with the current state-of-the-art methods in computing clusters of documents with coherent topic terms.
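The abstract does not spell out the projection algorithm, so the sketch below does not attempt to reproduce it. It only illustrates the object the abstract says the routine converges to, the first singular vector of a data matrix, using the standard power method on a toy document-term matrix. The function names, the toy counts, and the iteration budget are all assumptions made for illustration.

```python
import math
import random


def first_right_singular_vector(matrix, iterations=200, seed=0):
    """Approximate the first right singular vector by power iteration.

    `matrix` is a list of rows (documents) over columns (terms).
    This is the textbook power method on A^T A, used here as a stand-in;
    it is NOT the paper's projection algorithm.
    """
    rng = random.Random(seed)
    n_cols = len(matrix[0])
    # Strictly positive start vector, so it overlaps the dominant direction
    # of a non-negative count matrix.
    v = [rng.random() + 0.1 for _ in range(n_cols)]
    for _ in range(iterations):
        # u = A v: one score per document, given the current term weights.
        u = [sum(a * x for a, x in zip(row, v)) for row in matrix]
        # w = A^T u: one weight per term, given the current document scores.
        w = [sum(row[j] * u[i] for i, row in enumerate(matrix))
             for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("matrix maps the current vector to zero")
        v = [x / norm for x in w]
    return v


def singular_value(matrix, v):
    """Return ||A v||, the first singular value once v has converged."""
    u = [sum(a * x for a, x in zip(row, v)) for row in matrix]
    return math.sqrt(sum(x * x for x in u))


# Hypothetical counts: documents 0 and 1 share terms 0 and 1,
# document 2 uses only term 2.
toy = [
    [2.0, 2.0, 0.0],
    [2.0, 2.0, 0.0],
    [0.0, 0.0, 1.0],
]
v = first_right_singular_vector(toy)
sigma = singular_value(toy, v)
```

On this toy matrix the weight concentrates equally on the two shared terms and vanishes on the third. Each pass alternates between scoring documents from term weights and re-weighting terms from document scores, which loosely mirrors the dual document/term view described above, though the paper's actual update rule and convergence conditions may differ.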

Subject(s)

Algorithms; Publications; Cluster Analysis; Databases, Genetic; Humans; Polymorphism, Single Nucleotide/genetics

Keywords

First singular vector; Projection algorithm; Theme discovery

Full text: 1 Collection: 01-internacional Database: MEDLINE Main subject: Publications / Algorithms Type of study: Prognostic_studies Limits: Humans Language: En Journal: BMC Bioinformatics Journal subject: MEDICAL INFORMATICS Year: 2018 Document type: Article Affiliation country: United States