Comparing NER Approaches on French Clinical Text, with Easy-to-Reuse Pipelines.

Hubert, Thibault; Vaillant, Ghislain; Birot, Olivier; Arias, Camila; Neuraz, Antoine; Coulet, Adrien

Affiliations

Hubert T; Inria, HeKA, PariSanté Campus, Paris, France.
Vaillant G; Centre de Recherche des Cordeliers, Inserm, Université Paris Cité, Sorbonne Université, France.
Birot O; Inria, HeKA, PariSanté Campus, Paris, France.
Arias C; Centre de Recherche des Cordeliers, Inserm, Université Paris Cité, Sorbonne Université, France.
Neuraz A; Inria, HeKA, PariSanté Campus, Paris, France.
Coulet A; Centre de Recherche des Cordeliers, Inserm, Université Paris Cité, Sorbonne Université, France.

Stud Health Technol Inform; 316: 272-276, 2024 Aug 22.

Article in English | MEDLINE | ID: mdl-39176725

ABSTRACT

Named Entity Recognition (NER) is central to leveraging the content of clinical texts in observational studies, since texts contain a large part of the information available in Electronic Health Records (EHRs). However, clinical texts vary widely across healthcare services, institutions, countries, and languages, making it hard to predict how existing tools will perform on a particular corpus. We compared four NER approaches on three French corpora and share our benchmarking pipeline in an open and easy-to-reuse manner, using the medkit Python library. Our pipelines include fine-tuning operations with either one or several of the considered corpora. Our results illustrate the expected superiority of language models over a dictionary-based approach, and call into question the necessity of refining models already trained on biomedical texts. Beyond benchmarking, we believe that sharing reusable and customizable pipelines for comparing fast-evolving Natural Language Processing (NLP) tools is a valuable contribution, because clinical texts themselves can rarely be shared owing to privacy concerns.
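To make the kind of comparison described above concrete, here is a minimal sketch of a dictionary-based tagger scored with exact-match precision, recall, and F1. It is written in plain Python and does not use the medkit API or reproduce the authors' pipeline. The function names, the toy sentence, the lexicon, and the gold annotations are all invented for illustration, and exact span matching is an assumed scoring choice, not one stated in the abstract.

```python
import re
from typing import Dict, Set, Tuple

# An entity is (start offset, end offset, label). Hypothetical representation.
Entity = Tuple[int, int, str]


def dictionary_ner(text: str, lexicon: Dict[str, str]) -> Set[Entity]:
    """Tag every case-insensitive, whole-word occurrence of a lexicon term."""
    found: Set[Entity] = set()
    for term, label in lexicon.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            found.add((match.start(), match.end(), label))
    return found


def exact_match_scores(gold: Set[Entity], predicted: Set[Entity]) -> Dict[str, float]:
    """Entity-level precision, recall and F1; span and label must both match."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def span(text: str, phrase: str, label: str) -> Entity:
    """Build a gold entity from the first occurrence of a phrase in the text."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)


# Toy data, invented for this sketch (not taken from any of the three corpora).
text = "Patient traité par paracétamol pour une fièvre persistante."
lexicon = {"paracétamol": "DRUG", "fièvre": "SYMPTOM", "toux": "SYMPTOM"}
gold = {
    span(text, "paracétamol", "DRUG"),
    span(text, "fièvre persistante", "SYMPTOM"),
}

predicted = dictionary_ner(text, lexicon)
scores = exact_match_scores(gold, predicted)
print(scores)
```

In this toy case the lexicon finds "fièvre" but the gold span is "fièvre persistante", so the prediction counts as an error under exact matching. Boundary mismatches of this kind are one common weakness of dictionary lookups; a relaxed (overlap-based) scorer would be a different design choice with different results.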

Subjects

Electronic Health Records; Natural Language Processing; France; Humans

Keywords

Benchmark; Clinical texts; Named Entity Recognition; Open science

Full text: 1 Collections: 01-international Database: MEDLINE Main subject: Natural Language Processing / Electronic Health Records Limits: Humans Country/Region as subject: Europe Language: En Publication year: 2024 Document type: Article