Annotation-preserving machine translation of English corpora to validate Dutch clinical concept extraction tools.
Seinen, Tom M; Kors, Jan A; van Mulligen, Erik M; Rijnbeek, Peter R.
Affiliation
  • Seinen TM; Department of Medical Informatics, Erasmus University Medical Center, 3015 GD Rotterdam, The Netherlands.
  • Kors JA; Department of Medical Informatics, Erasmus University Medical Center, 3015 GD Rotterdam, The Netherlands.
  • van Mulligen EM; Department of Medical Informatics, Erasmus University Medical Center, 3015 GD Rotterdam, The Netherlands.
  • Rijnbeek PR; Department of Medical Informatics, Erasmus University Medical Center, 3015 GD Rotterdam, The Netherlands.
J Am Med Inform Assoc ; 31(8): 1725-1734, 2024 Aug 01.
Article in English | MEDLINE | ID: mdl-38934643
ABSTRACT

OBJECTIVE:

To explore the feasibility of validating Dutch concept extraction tools using annotated corpora translated from English, focusing on preserving annotations during translation and addressing the scarcity of non-English annotated clinical corpora.

MATERIALS AND METHODS:

Three annotated corpora were standardized and translated from English to Dutch using 2 machine translation services, Google Translate and OpenAI GPT-4, with annotations preserved through a proposed method of embedding annotations in the text before translation. The performance of 2 concept extraction tools, MedSpaCy and MedCAT, was assessed across the corpora in both Dutch and English.
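
The idea of embedding annotations in the text before translation can be illustrated with a minimal sketch. The abstract does not describe the actual markup, so the `<eN>...</eN>` marker format, the function names `embed_annotations` and `extract_annotations`, and the example sentence are all assumptions made for illustration. The translation step is replaced by a hand-written Dutch string, and spans are assumed not to overlap.

```python
import re

# Hypothetical marker format: <eN>...</eN>, where N indexes the annotation.
TAG = re.compile(r"<e(\d+)>(.*?)</e\1>", re.DOTALL)

def embed_annotations(text, spans):
    """Wrap non-overlapping (start, end, label) spans in numbered markers."""
    parts, labels, cursor = [], {}, 0
    for idx, (start, end, label) in enumerate(sorted(spans), start=1):
        parts.append(text[cursor:start])
        parts.append(f"<e{idx}>{text[start:end]}</e{idx}>")
        labels[idx] = label
        cursor = end
    parts.append(text[cursor:])
    return "".join(parts), labels

def extract_annotations(tagged, labels):
    """Strip markers; return the clean text and the recovered spans."""
    parts, spans, cursor, length = [], [], 0, 0
    for match in TAG.finditer(tagged):
        before = tagged[cursor:match.start()]
        mention = match.group(2)
        start = length + len(before)
        parts += [before, mention]
        length = start + len(mention)
        spans.append((start, length, labels[int(match.group(1))]))
        cursor = match.end()
    parts.append(tagged[cursor:])
    return "".join(parts), spans

english = "Denies chest pain and fever."
gold = [(7, 17, "PROBLEM"), (22, 27, "PROBLEM")]
tagged, labels = embed_annotations(english, gold)

# Hand-written stand-in for the output of a translation service.
dutch_tagged = "Ontkent <e1>pijn op de borst</e1> en <e2>koorts</e2>."
dutch, dutch_spans = extract_annotations(dutch_tagged, labels)
mentions = [dutch[s:e] for s, e, _ in dutch_spans]
```

Because the markers travel with the text, the recovered offsets refer to the Dutch sentence even though the mention lengths changed. A real pipeline would also need to handle markers that the translation service drops, reorders, or alters, which this sketch does not attempt.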

RESULTS:

The translation process effectively generated Dutch annotated corpora and the concept extraction tools performed similarly in both English and Dutch. Although there were some differences in how annotations were preserved across translations, these did not affect extraction accuracy. Supervised MedCAT models consistently outperformed unsupervised models, whereas MedSpaCy demonstrated high recall but lower precision.
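
For readers unfamiliar with the metrics, precision and recall over extracted concepts can be computed as in the sketch below. It assumes exact matching on (start, end, label) triples, which the abstract does not specify, and the spans are invented for illustration only; they are not results from the study.

```python
def precision_recall_f1(gold, predicted):
    """Score predicted spans against gold spans using exact matching."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Invented example: every gold span is found, plus one spurious span,
# giving full recall but reduced precision.
gold_spans = [(8, 24, "PROBLEM"), (28, 34, "PROBLEM")]
predicted_spans = gold_spans + [(0, 7, "PROBLEM")]
precision, recall, f1 = precision_recall_f1(gold_spans, predicted_spans)
```

In this invented case two of the three predictions are correct, so precision is 2/3 while recall is 1.0.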

DISCUSSION:

Our validation of Dutch concept extraction tools on corpora translated from English was successful, highlighting the efficacy of our annotation preservation method and the potential for efficiently creating multilingual corpora. Further improvements and comparisons of annotation preservation techniques and strategies for corpus synthesis could lead to more efficient development of multilingual corpora and accurate non-English concept extraction tools.

CONCLUSION:

This study has demonstrated that translated English corpora can be used to validate non-English concept extraction tools. The annotation preservation method used during translation proved effective, and future research can apply this corpus translation method to additional languages and clinical settings.

Full text: 1 Collections: 01-international Database: MEDLINE Main subject: Translation Limit: Humans Country/Region as subject: Europe Language: En Publication year: 2024 Document type: Article
