A pseudonymized corpus of occupational health narratives for clinical entity recognition in Spanish.

Dunstan, Jocelyn; Vakili, Thomas; Miranda, Luis; Villena, Fabián; Aracena, Claudio; Quiroga, Tamara; Vera, Paulina; Viteri Valenzuela, Sebastián; Rocco, Victor.

BMC Med Inform Decis Mak ; 24(1): 204, 2024 Jul 24.

Article in English | MEDLINE | ID: mdl-39049027

ABSTRACT

Despite their high creation cost, annotated corpora are indispensable for robust natural language processing systems. In the clinical field, in addition to annotating medical entities, corpus creators must also remove personally identifiable information (PII). This has become increasingly important in the era of large language models, where unwanted memorization can occur. This paper presents a corpus annotated to anonymize personally identifiable information in 1,787 anamneses of work-related accidents and diseases in Spanish. Additionally, we applied a previously released model for Named Entity Recognition (NER), trained on referrals from primary care physicians, to identify diseases, body parts, and medications in this work-related text. We analyzed in detail the differences between the model's output and the gold standard curated by a physician. Moreover, we compared the performance of the NER model on the original narratives, on narratives where personal information has been masked, and on texts where the personal data is replaced by a similar surrogate value (pseudonymization). Within this publication, we share the annotation guidelines and the annotated corpus.
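The difference between masking and pseudonymization described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the labels `NAME` and `LOCATION`, the surrogate lists, the example sentence, and the helper names `span_of`, `mask`, and `pseudonymize` are all assumptions made for illustration.

```python
import random

# Hypothetical surrogate pools; labels and values are illustrative only,
# not taken from the corpus or its annotation guidelines.
SURROGATES = {
    "NAME": ["María Soto", "Pedro Rojas"],
    "LOCATION": ["Valdivia", "Talca"],
}


def span_of(text, surface, label):
    """Return a (start, end, label) span for the first occurrence of surface."""
    start = text.index(surface)
    return (start, start + len(surface), label)


def _replace_spans(text, spans, make_replacement):
    """Rebuild text, swapping each annotated span for a replacement string."""
    pieces, cursor = [], 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])
        pieces.append(make_replacement(label))
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


def mask(text, spans):
    """Masking: replace each PII span with a placeholder naming its label."""
    return _replace_spans(text, spans, lambda label: f"[{label}]")


def pseudonymize(text, spans, rng):
    """Pseudonymization: replace each PII span with a surrogate of the same type."""
    return _replace_spans(text, spans, lambda label: rng.choice(SURROGATES[label]))


# Invented example sentence, not from the corpus.
text = "Juan Pérez sufrió una caída en Santiago."
spans = [
    span_of(text, "Juan Pérez", "NAME"),
    span_of(text, "Santiago", "LOCATION"),
]

masked = mask(text, spans)
surrogate = pseudonymize(text, spans, random.Random(0))
```

Here `masked` keeps only the entity type, while `surrogate` stays a fluent sentence with realistic but different values, which is the property that lets an NER model be compared across the three text variants.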

Subject(s)

Natural Language Processing , Humans , Spain , Occupational Health , Narration

Digital surveillance in Latin American diseases outbreaks: information extraction from a novel Spanish corpus.

Dellanzo, Antonella; Cotik, Viviana; Lozano Barriga, Daniel Yunior; Mollapaza Apaza, Jonathan Jimmy; Palomino, Daniel; Schiaffino, Fernando; Yanque Aliaga, Alexander; Ochoa-Luna, José.

BMC Bioinformatics ; 23(1): 558, 2022 Dec 23.

Article in English | MEDLINE | ID: mdl-36564712

ABSTRACT

BACKGROUND: To detect threats to public health and to be well prepared for endemic and pandemic illness outbreaks, countries usually rely on event-based surveillance (EBS) and indicator-based surveillance systems. EBS systems are key components of early warning systems and focus on rapidly capturing data to detect threat signals through channels other than traditional surveillance. In this study, we develop Natural Language Processing tools that can be used within EBS systems. In particular, we focus on information extraction techniques that enable digital surveillance to monitor Internet data and social media.

RESULTS: We created an annotated Spanish corpus from ProMED-mail health reports regarding disease outbreaks in Latin America. The corpus has been used to train algorithms for two information extraction tasks: named entity recognition and relation extraction. The algorithms, based on deep learning and rules, have been applied to recognize diseases, hosts, and geographical locations where a disease is occurring, among other entities and relations. In addition, an in-depth analysis of micro-averaged F1 metrics shows the suitability of our approaches for both tasks.

CONCLUSIONS: The annotated corpus and algorithms presented could facilitate the development of automated tools for extracting information from news and health reports written in Spanish. Moreover, this framework could be useful within EBS systems to support the early detection of Latin American disease outbreaks.
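As a reminder of what a micro-averaged F1 score measures, the following sketch pools true positives, false positives, and false negatives over all documents before computing precision and recall. It assumes exact-match scoring on `(start, end, label)` spans, which the abstract does not specify; the function name `micro_f1` and the toy annotations are hypothetical.

```python
def micro_f1(gold, predicted):
    """Micro-averaged F1 over documents.

    gold, predicted: lists (one item per document) of sets of
    (start, end, label) tuples. A prediction counts as correct only if
    it matches a gold span exactly (an assumption of this sketch).
    """
    tp = fp = fn = 0
    for gold_spans, pred_spans in zip(gold, predicted):
        tp += len(gold_spans & pred_spans)   # spans found in both
        fp += len(pred_spans - gold_spans)   # predicted but not in gold
        fn += len(gold_spans - pred_spans)   # in gold but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy annotations for two documents (invented for illustration).
gold = [
    {(0, 6, "DISEASE"), (10, 14, "LOCATION")},
    {(3, 9, "HOST")},
]
predicted = [
    {(0, 6, "DISEASE")},
    {(3, 9, "HOST"), (12, 18, "DISEASE")},
]

score = micro_f1(gold, predicted)  # tp=2, fp=1, fn=1
```

Because counts are pooled before averaging, frequent entity types weigh more heavily in the score than rare ones, unlike a macro average that treats each type equally.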

Subject(s)

Disease Outbreaks , Public Health , Humans , Latin America/epidemiology , Natural Language Processing , Data Mining/methods
