Improving the utility of MeSH® terms using the TopicalMeSH representation.
J Biomed Inform
; 61: 77-86, 2016 06.
Article
en En
| MEDLINE
| ID: mdl-27001195
OBJECTIVE: To evaluate whether vector representations encoding latent topic proportions that capture similarities to MeSH terms can improve performance on biomedical document retrieval and classification tasks, compared to using MeSH terms. MATERIALS AND METHODS: We developed the TopicalMeSH representation, which exploits the 'correspondence' between topics generated using latent Dirichlet allocation (LDA) and MeSH terms to create new document representations that combine MeSH terms and latent topic vectors. We used 15 systematic drug review corpora to evaluate performance on information retrieval and classification tasks using this TopicalMeSH representation, compared to using standard encodings that rely on either (1) the original MeSH terms, (2) the text, or (3) their combination. For the document retrieval task, we compared the precision and recall achieved by ranking citations using MeSH and TopicalMeSH representations, respectively. For the classification task, we considered three supervised machine learning approaches, Support Vector Machines (SVMs), logistic regression, and decision trees. We used these to classify documents as relevant or irrelevant using (independently) MeSH, TopicalMeSH, Words (i.e., n-grams extracted from citation titles and abstracts, encoded via bag-of-words representation), a combination of MeSH and Words, and a combination of TopicalMeSH and Words. We also used SVM to compare the classification performance of tf-idf weighted MeSH terms, LDA Topics, a combination of Topics and MeSH, and TopicalMeSH to supervised LDA's classification performance. RESULTS: For the document retrieval task, using the TopicalMeSH representation resulted in higher precision than MeSH in 11 of 15 corpora while achieving the same recall. For the classification task, use of TopicalMeSH features realized a higher F1 score in 14 of 15 corpora when used by SVMs, 12 of 15 corpora using logistic regression, and 12 of 15 corpora using decision trees. TopicalMeSH also had better document classification performance on 12 of 15 corpora when compared to Topics, tf-idf weighted MeSH terms, and a combination of Topics and MeSH using SVMs. Supervised LDA achieved the worst performance in most of the corpora. CONCLUSION: The proposed TopicalMeSH representation (which combines MeSH terms with latent topics) consistently improved performance on document retrieval and classification tasks, compared to using alternative standard representations using MeSH terms alone, as well as, several standard alternative approaches.
Palabras clave
Texto completo:
1
Colección:
01-internacional
Base de datos:
MEDLINE
Asunto principal:
Almacenamiento y Recuperación de la Información
/
Medical Subject Headings
/
Máquina de Vectores de Soporte
Tipo de estudio:
Prognostic_studies
Límite:
Humans
Idioma:
En
Revista:
J Biomed Inform
Asunto de la revista:
INFORMATICA MEDICA
Año:
2016
Tipo del documento:
Article
País de afiliación:
Estados Unidos
Pais de publicación:
Estados Unidos