Comparative Evaluation of Pre-Trained Language Models for Biomedical Information Retrieval.
Stud Health Technol Inform; 316: 827-831, 2024 Aug 22.
Article in English | MEDLINE | ID: mdl-39176920
ABSTRACT
Finding relevant information in the biomedical literature increasingly depends on efficient information retrieval (IR) algorithms. Cross-Encoders, Sentence-BERT, and ColBERT are algorithms based on pre-trained language models that use nuanced but computable vector representations of search queries and documents for IR applications. Here we investigate how well these vectorization algorithms estimate relevance labels of biomedical documents for search queries using the OHSUMED dataset. For our evaluation, we compared the computed scores to the provided labels using boxplots and Spearman's rank correlations. According to these metrics, we found that Sentence-BERT moderately outperformed the alternative vectorization algorithms and that additional fine-tuning based on a subset of OHSUMED labels yielded little additional benefit. Future research might aim to develop a larger dedicated dataset in order to optimize such methods more systematically, and to evaluate the corresponding functions in IR tools with end-users.
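The abstract's evaluation compares model-computed relevance scores against gold relevance labels using Spearman's rank correlation. The following is a minimal, self-contained sketch of that comparison step, not the authors' code: the scores and labels below are hypothetical illustrative values, and the correlation is implemented from the standard rank-based definition rather than via a library.

```python
# Sketch: comparing model-computed relevance scores against relevance labels
# with Spearman's rank correlation. All numeric values are hypothetical.

def rank(values):
    """Assign 1-based average ranks, handling ties (as Spearman's rho requires)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank for the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho = Pearson correlation computed on the ranks."""
    rx, ry = rank(x), rank(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical query-document similarity scores from a bi-encoder, paired with
# OHSUMED-style relevance labels (0 = not, 1 = partially, 2 = definitely relevant).
scores = [0.82, 0.35, 0.64, 0.12, 0.71]
labels = [2, 0, 1, 0, 2]
print(round(spearman(scores, labels), 3))  # → 0.949
```

A higher rho indicates that the model's score ordering agrees more closely with the human relevance ordering, which is how the abstract's per-method comparison can be read.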
Full text: 1
Collections: 01-internacional
Database: MEDLINE
Main subject: Algorithms / Natural Language Processing / Information Storage and Retrieval
Limits: Humans
Language: English
Publication year: 2024
Document type: Article