Search | BVS CLAP/SMR-OPAS/OMS

1.

A pseudonymized corpus of occupational health narratives for clinical entity recognition in Spanish.

Dunstan, Jocelyn; Vakili, Thomas; Miranda, Luis; Villena, Fabián; Aracena, Claudio; Quiroga, Tamara; Vera, Paulina; Viteri Valenzuela, Sebastián; Rocco, Victor.

BMC Med Inform Decis Mak ; 24(1): 204, 2024 Jul 24.

Article in English | MEDLINE | ID: mdl-39049027

ABSTRACT

Despite the high creation cost, annotated corpora are indispensable for robust natural language processing systems. In the clinical field, in addition to annotating medical entities, corpus creators must also remove personally identifiable information (PII). This has become increasingly important in the era of large language models where unwanted memorization can occur. This paper presents a corpus annotated to anonymize personally identifiable information in 1,787 anamneses of work-related accidents and diseases in Spanish. Additionally, we applied a previously released model for Named Entity Recognition (NER) trained on referrals from primary care physicians to identify diseases, body parts, and medications in this work-related text. We analyzed the differences between the models and the gold standard curated by a physician in detail. Moreover, we compared the performance of the NER model on the original narratives, in narratives where personal information has been masked, and in texts where the personal data is replaced by another similar surrogate value (pseudonymization). Within this publication, we share the annotation guidelines and the annotated corpus.

Subjects

Natural Language Processing , Humans , Spain , Occupational Health , Narration
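
The abstract of record 1 contrasts two ways of handling personal data: masking it and replacing it with a similar surrogate value (pseudonymization). A minimal sketch of that difference follows. It assumes annotations are given as character spans `(start, end, label)`; the function names, the label set, the example sentence, and the surrogate lists are hypothetical and are not taken from the published corpus or its guidelines.

```python
import random


def mask(text, spans):
    """Replace each annotated (start, end, label) span with a [LABEL] placeholder."""
    out, last = [], 0
    for start, end, label in sorted(spans):
        out.append(text[last:start])   # keep the text before the span
        out.append(f"[{label}]")       # hide the personal data behind its label
        last = end
    out.append(text[last:])
    return "".join(out)


def pseudonymize(text, spans, surrogates, seed=0):
    """Replace each annotated span with a surrogate value of the same type."""
    rng = random.Random(seed)          # fixed seed so the output is reproducible
    out, last = [], 0
    for start, end, label in sorted(spans):
        out.append(text[last:start])
        out.append(rng.choice(surrogates[label]))  # pick a plausible replacement
        last = end
    out.append(text[last:])
    return "".join(out)


# Hypothetical example sentence and annotations.
text = "Paciente Juan Pérez sufre caída en Santiago."
name_start = text.index("Juan Pérez")
city_start = text.index("Santiago")
spans = [
    (name_start, name_start + len("Juan Pérez"), "NAME"),
    (city_start, city_start + len("Santiago"), "LOCATION"),
]
surrogates = {"NAME": ["Pedro Soto"], "LOCATION": ["Temuco"]}

print(mask(text, spans))
print(pseudonymize(text, spans, surrogates))
```

Masking keeps only the entity type, whereas the surrogate version preserves the look of natural text, which is the distinction the abstract says was tested against the NER model.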

2.

Supporting the classification of patients in public hospitals in Chile by designing, deploying and validating a system based on natural language processing.

Villena, Fabián; Pérez, Jorge; Lagos, René; Dunstan, Jocelyn.

BMC Med Inform Decis Mak ; 21(1): 208, 2021 Jul 01.

Article in English | MEDLINE | ID: mdl-34210317

ABSTRACT

BACKGROUND: In Chile, a patient needing a specialty consultation or surgery has to first be referred by a general practitioner, then placed on a waiting list. The Explicit Health Guarantees (GES in Spanish) ensures, by law, the maximum time to solve 85 health problems. Usually, a health professional manually verifies if each referral, written in natural language, corresponds or not to a GES-covered disease. An error in this classification is catastrophic for patients, as it puts them on a non-prioritized waiting list, characterized by prolonged waiting times. METHODS: To support the manual process, we developed and deployed a system that automatically classifies referrals as GES-covered or not using historical data. Our system is based on word embeddings specially trained for clinical text produced in Chile. We used a vector representation of the reason for referral and patient's age as features for training machine learning models using human-labeled historical data. We constructed a ground truth dataset combining classifications made by three healthcare experts, which was used to validate our results. RESULTS: The best performing model over ground truth reached an AUC score of 0.94, with a weighted F1-score of 0.85 (0.87 in precision and 0.86 in recall). During seven months of continuous and voluntary use, the system has amended 87 patient misclassifications. CONCLUSION: This system is a result of a collaboration between technical and clinical experts, and the design of the classifier was custom-tailored for a hospital's clinical workflow, which encouraged the voluntary use of the platform. Our solution can be easily expanded across other hospitals since the registry is uniform in Chile.

Subjects

Medicine , Natural Language Processing , Chile , Public Hospitals , Humans , Machine Learning

3.

[Construction of text resources for automatic identification of clinical information in unstructured narratives]. / Construcción de recursos de texto para la identificación automática de información clínica en narrativas no estructuradas.

Báez, Pablo; Villena, Fabián; Zúñiga, Karen; Jones, Natalia; Fernández, Gustavo; Durán, Manuel; Dunstan, Jocelyn.

Rev Med Chil ; 149(7): 1014-1022, 2021 Jul.

Article in Spanish | MEDLINE | ID: mdl-34751303

ABSTRACT

BACKGROUND: A significant proportion of the clinical record is in free text format, making it difficult to extract key information and make secondary use of patient data. Automatic detection of information within narratives initially requires humans, following specific protocols and rules, to identify medical entities of interest. AIM: To build a linguistic resource of annotated medical entities on texts produced in Chilean hospitals. MATERIAL AND METHODS: A clinical corpus was constructed using 150 referrals in public hospitals. Three annotators identified six medical entities: clinical findings, diagnoses, body parts, medications, abbreviations, and family members. An annotation scheme was designed, and an iterative approach to train the annotators was applied. The F1-Score metric was used to assess the progress of the annotators' agreement during their training. RESULTS: An average F1-Score of 0.73 was observed at the beginning of the project. After the training period, it increased to 0.87. Annotation of clinical findings and body parts showed significant discrepancy, while abbreviations, medications, and family members showed high agreement. CONCLUSIONS: A linguistic resource with annotated medical entities on texts produced in Chilean hospitals was built and made available, working with annotators related to medicine. The iterative annotation approach allowed us to improve performance metrics. The corpus and annotation protocols will be released to the research community.

Subjects

Electronic Data Processing , Chile , Humans
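
The abstract of record 3 reports annotator agreement as an F1-Score. One common way to compute such a figure, sketched here under stated assumptions, is to treat one annotator's entities as the reference and another's as the prediction. The sketch assumes exact matching on `(start, end, label)` tuples; the function name, labels, and example spans are hypothetical, and the paper's actual matching criterion may differ.

```python
def pairwise_f1(ann_a, ann_b):
    """F1 agreement between two annotators' entity sets (exact span and label match)."""
    a, b = set(ann_a), set(ann_b)
    if not a and not b:
        return 1.0                      # both annotators marked nothing: full agreement
    matches = len(a & b)                # entities on which both annotators agree
    precision = matches / len(b) if b else 0.0
    recall = matches / len(a) if a else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations of one referral by two annotators.
annotator_1 = {(0, 5, "Diagnosis"), (10, 15, "Body_Part"), (20, 28, "Medication")}
annotator_2 = {(0, 5, "Diagnosis"), (10, 15, "Finding"), (20, 28, "Medication")}

print(round(pairwise_f1(annotator_1, annotator_2), 2))
```

With exact matching the measure is symmetric, so it does not matter which annotator is taken as the reference; with three annotators the pairwise scores could be averaged.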

4.

[Automatic keyword retrieval from clinical texts: an application of natural language processing to massive data of Chilean suspected diagnosis]. / Obtención automática de palabras clave en textos clínicos: una aplicación de procesamiento del lenguaje natural a datos masivos de sospecha diagnóstica en Chile.

Villena, Fabián; Dunstan, Jocelyn.

Rev Med Chil ; 147(10): 1229-1238, 2019 Oct.

Article in Spanish | MEDLINE | ID: mdl-32186630

ABSTRACT

BACKGROUND: Free text poses a challenge in health data analysis since the lack of structure makes the extraction and integration of information difficult, particularly in the case of massive data. An appropriate machine-interpretation of electronic health records in Chile can unleash knowledge contained in large volumes of clinical texts, expanding clinical management and national research capabilities. AIM: To illustrate the use of a weighted frequency algorithm to find keywords. The search was carried out in the diagnostic suspicion field of the Chilean specialty consultation waiting list, for diseases not covered by the Chilean Explicit Health Guarantees plan. MATERIAL AND METHODS: The waiting lists for a first specialty consultation for the period 2008-2018 were obtained from 17 out of 29 Chilean health services, and a total of 2,592,925 diagnostic suspicions were identified. A natural language processing technique called Term Frequency-Inverse Document Frequency was used for the retrieval of diagnostic suspicion keywords. RESULTS: For each specialty, the four keywords with the highest weighted frequency were determined. Word clouds showing words weighted by their importance were created to obtain a visual representation. These are available at cimt.uchile.cl/lechile/. CONCLUSIONS: The algorithm made it possible to summarize unstructured clinical free-text data, improving its usefulness and accessibility.

Subjects

Data Mining/methods , Diagnostic Techniques and Procedures , Electronic Data Processing/methods , Information Storage and Retrieval/methods , Medical Records , Natural Language Processing , Chile , Humans , Medical Informatics Computing , Medicine , Referral and Consultation/statistics & numerical data , Reproducibility of Results , Time Factors
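
The abstract of record 4 names Term Frequency-Inverse Document Frequency as the weighting used to pick four keywords per specialty. A minimal sketch of that idea follows, assuming that all diagnostic suspicions of a specialty are pooled into one document and that tokens are split on whitespace. The function name, the pooling choice, and the toy data are hypothetical; the abstract does not specify how the authors defined documents or tokens.

```python
import math
from collections import Counter


def top_keywords(groups, k=4):
    """Return the k highest TF-IDF terms per group.

    groups maps a specialty name to a list of free-text strings; each
    specialty is treated as a single pooled document (an assumption).
    """
    # Term counts for each pooled document.
    counts = {
        name: Counter(" ".join(texts).lower().split())
        for name, texts in groups.items()
    }
    n_docs = len(counts)

    # Document frequency: in how many specialties does each term appear?
    df = Counter()
    for term_counts in counts.values():
        df.update(term_counts.keys())

    result = {}
    for name, term_counts in counts.items():
        total = sum(term_counts.values())
        scores = {
            term: (n / total) * math.log(n_docs / df[term])
            for term, n in term_counts.items()
        }
        # Highest score first; ties broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result[name] = ranked[:k]
    return result


# Toy data, not taken from the waiting-list records.
groups = {
    "cardiology": ["dolor toracico", "dolor toracico arritmia"],
    "dermatology": ["dolor lesion piel", "lesion piel"],
}
print(top_keywords(groups, k=2))
```

Terms that occur in every specialty, such as "dolor" in the toy data, receive a weight of zero, which is why the method surfaces specialty-specific vocabulary.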

5.

Correction to: Supporting the classification of patients in public hospitals in Chile by designing, deploying and validating a system based on natural language processing.

Villena, Fabián; Pérez, Jorge; Lagos, René; Dunstan, Jocelyn.

BMC Med Inform Decis Mak ; 21(1): 220, 2021 Jul 20.

Article in English | MEDLINE | ID: mdl-34284760

6.

Training and intrinsic evaluation of lightweight word embeddings for the clinical domain in Spanish.

Chiu, Carolina; Villena, Fabián; Martin, Kinan; Núñez, Fredy; Besa, Cecilia; Dunstan, Jocelyn.

Front Artif Intell ; 5: 970517, 2022.

Article in English | MEDLINE | ID: mdl-36213168

ABSTRACT

Resources for Natural Language Processing (NLP) are less numerous for languages different from English. In the clinical domain, where these resources are vital for obtaining new knowledge about human health and diseases, creating new resources for the Spanish language is imperative. One of the most common approaches in NLP is word embeddings, which are dense vector representations of a word, considering the word's context. This vector representation is usually the first step in various NLP tasks, such as text classification or information extraction. Therefore, in order to enrich Spanish language NLP tools, we built a Spanish clinical corpus from waiting list diagnostic suspicions, a biomedical corpus from medical journals, and term sequences sampled from the Unified Medical Language System (UMLS). These three corpora can be used to compute word embeddings models from scratch using Word2vec and fastText algorithms. Furthermore, to validate the quality of the calculated embeddings, we adapted several evaluation datasets in English, including some tests that have not been used in Spanish to the best of our knowledge. These translations were validated by two bilingual clinicians following an ad hoc validation standard for the translation. Even though contextualized word embeddings nowadays receive enormous attention, their calculation and deployment require specialized hardware and giant training corpora. Our static embeddings can be used in clinical applications with limited computational resources. The validation of the intrinsic test we present here can help groups working on static and contextualized word embeddings. We are releasing the training corpus and the embeddings within this publication.

7.

On the Construction of Multilingual Corpora for Clinical Text Mining.

Villena, Fabián; Eisenmann, Urs; Knaup, Petra; Dunstan, Jocelyn; Ganzinger, Matthias.

Stud Health Technol Inform ; 270: 347-351, 2020 Jun 16.

Article in English | MEDLINE | ID: mdl-32570404

ABSTRACT

The amount of digital data derived from healthcare processes has increased tremendously in recent years. This applies especially to unstructured data, which are often hard to analyze due to the lack of available tools to process and extract information. Natural language processing is often used in medicine, but the majority of tools used by researchers are developed primarily for the English language. For developing and testing natural language processing methods, it is important to have a suitable corpus, specific to the medical domain, that covers the intended target language. To improve the potential of natural language processing research, we developed tools to derive language-specific medical corpora from publicly available text sources. In order to extract medicine-specific unstructured text data, openly available publications from biomedical journals were used in a four-step process: (1) medical journal databases were scraped to download the articles, (2) the articles were parsed and consolidated into a single repository, (3) the content of the repository was described, and (4) the text data and the codes were released. In total, 93 969 articles were retrieved, with a word count of 83 868 501 in three different languages (German, English, and Spanish) from two medical journal databases. Our results show that unstructured text data extraction from openly available medical journal databases for the construction of unified corpora of medical text data can be achieved through web scraping techniques.

Subjects

Data Mining , Multilingualism , Natural Language Processing , Unified Medical Language System

8.

Construcción de recursos de texto para la identificación automática de información clínica en narrativas no estructuradas / Construction of text resources for automatic identification of clinical information in unstructured narratives

Báez, Pablo; Villena, Fabián; Zúñiga, Karen; Jones, Natalia; Fernández, Gustavo; Durán, Manuel; Dunstan, Jocelyn.

Rev. méd. Chile ; 149(7): 1014-1022, Jul. 2021. illus., graphs

Article in Spanish | LILACS | ID: biblio-1389546

ABSTRACT

Background: A significant proportion of the clinical record is in free text format, making it difficult to extract key information and make secondary use of patient data. Automatic detection of information within narratives initially requires humans, following specific protocols and rules, to identify medical entities of interest. Aim: To build a linguistic resource of annotated medical entities on texts produced in Chilean hospitals. Material and Methods: A clinical corpus was constructed using 150 referrals in public hospitals. Three annotators identified six medical entities: clinical findings, diagnoses, body parts, medications, abbreviations, and family members. An annotation scheme was designed, and an iterative approach to train the annotators was applied. The F1-Score metric was used to assess the progress of the annotators' agreement during their training. Results: An average F1-Score of 0.73 was observed at the beginning of the project. After the training period, it increased to 0.87. Annotation of clinical findings and body parts showed significant discrepancy, while abbreviations, medications, and family members showed high agreement. Conclusions: A linguistic resource with annotated medical entities on texts produced in Chilean hospitals was built and made available, working with annotators related to medicine. The iterative annotation approach allowed us to improve performance metrics. The corpus and annotation protocols will be released to the research community.

Subjects

Humans , Electronic Data Processing , Chile

9.

Obtención automática de palabras clave en textos clínicos: una aplicación de procesamiento del lenguaje natural a datos masivos de sospecha diagnóstica en Chile / Automatic keyword retrieval from clinical texts: an application of natural language processing to massive data of Chilean suspected diagnosis

Villena, Fabián; Dunstan, Jocelyn.

Rev. méd. Chile ; 147(10): 1229-1238, Oct. 2019. tables, graphs

Article in Spanish | LILACS | ID: biblio-1058589

ABSTRACT

Background: Free text poses a challenge in health data analysis since the lack of structure makes the extraction and integration of information difficult, particularly in the case of massive data. An appropriate machine-interpretation of electronic health records in Chile can unleash knowledge contained in large volumes of clinical texts, expanding clinical management and national research capabilities. Aim: To illustrate the use of a weighted frequency algorithm to find keywords. The search was carried out in the diagnostic suspicion field of the Chilean specialty consultation waiting list, for diseases not covered by the Chilean Explicit Health Guarantees plan. Material and Methods: The waiting lists for a first specialty consultation for the period 2008-2018 were obtained from 17 out of 29 Chilean health services, and a total of 2,592,925 diagnostic suspicions were identified. A natural language processing technique called Term Frequency-Inverse Document Frequency was used for the retrieval of diagnostic suspicion keywords. Results: For each specialty, the four keywords with the highest weighted frequency were determined. Word clouds showing words weighted by their importance were created to obtain a visual representation. These are available at cimt.uchile.cl/lechile/. Conclusions: The algorithm made it possible to summarize unstructured clinical free-text data, improving its usefulness and accessibility.

Subjects

Humans , Natural Language Processing , Electronic Data Processing/methods , Medical Records , Information Storage and Retrieval/methods , Diagnostic Techniques and Procedures , Data Mining/methods , Referral and Consultation/statistics & numerical data , Time Factors , Medical Informatics Computing , Chile , Reproducibility of Results , Medicine
