Automatic Correction of Real-Word Errors in Spanish Clinical Texts.
Bravo-Candel, Daniel; López-Hernández, Jésica; García-Díaz, José Antonio; Molina-Molina, Fernando; García-Sánchez, Francisco.
Affiliation
  • Bravo-Candel D; Department of Informatics and Systems, Faculty of Computer Science, Campus de Espinardo, University of Murcia, 30100 Murcia, Spain.
  • López-Hernández J; Department of Informatics and Systems, Faculty of Computer Science, Campus de Espinardo, University of Murcia, 30100 Murcia, Spain.
  • García-Díaz JA; Department of Informatics and Systems, Faculty of Computer Science, Campus de Espinardo, University of Murcia, 30100 Murcia, Spain.
  • Molina-Molina F; VÓCALI Sistemas Inteligentes S.L., 30100 Murcia, Spain.
  • García-Sánchez F; Department of Informatics and Systems, Faculty of Computer Science, Campus de Espinardo, University of Murcia, 30100 Murcia, Spain.
Sensors (Basel) ; 21(9)2021 Apr 21.
Article in En | MEDLINE | ID: mdl-33919018
ABSTRACT
Real-word errors are characterized by being actual terms in the dictionary; they can only be detected by taking context into account. Traditional methods to detect and correct such errors are mostly based on counting the frequency of short word sequences in a corpus, from which the probability of a word being a real-word error is computed. State-of-the-art approaches, on the other hand, make use of deep learning models that learn context by extracting semantic features from text. In this work, a deep learning model was implemented for correcting real-word errors in clinical text. Specifically, a Seq2seq Neural Machine Translation model mapped erroneous sentences to their corrections. To this end, different types of error were generated in correct sentences by applying rules. Different Seq2seq models were trained and evaluated on two corpora: the Wikicorpus and a collection of three clinical datasets. The medicine corpus was much smaller than the Wikicorpus because of privacy issues involved in dealing with patient information. Moreover, GloVe and Word2Vec pretrained word embeddings were used to study their effect on performance. Despite the medicine corpus being much smaller than the Wikicorpus, Seq2seq models trained on the medicine corpus performed better than those trained on the Wikicorpus. Nevertheless, a larger amount of clinical text is required to further improve the results.
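The rule-based error generation described in the abstract, in which correct sentences are corrupted to build parallel (erroneous, correct) training pairs for the Seq2seq model, can be sketched as follows. This is a minimal illustration under assumed rules: the confusion sets below are hypothetical examples of real-word substitutions (each alternative is itself a valid dictionary word), not the actual rules used by the authors.

```python
import random

# Hypothetical confusion sets: each word maps to other valid Spanish
# dictionary words a small edit away (illustrative assumption only).
CONFUSION_SETS = {
    "casa": ["caso", "cara", "capa"],
    "dolor": ["color", "calor"],
    "peso": ["pelo", "paso"],
}

def inject_real_word_errors(sentence, p=0.3, rng=None):
    """Replace some words with real-word confusions.

    Returns a (noisy, clean) pair suitable as a Seq2seq training example,
    where the model learns to map the noisy sentence back to the clean one.
    """
    rng = rng or random.Random(0)
    noisy = []
    for tok in sentence.split():
        alts = CONFUSION_SETS.get(tok.lower())
        if alts and rng.random() < p:
            # Swap in a real word from the confusion set.
            noisy.append(rng.choice(alts))
        else:
            noisy.append(tok)
    return " ".join(noisy), sentence

# With p=1.0, every word that has a confusion set is corrupted.
noisy, clean = inject_real_word_errors("el peso del paciente con dolor", p=1.0)
```

Because every substitute is a real dictionary word, a plain spell checker would accept the noisy sentence; only a context-aware model such as the Seq2seq system described above can recover the original.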

Full text: 1 Database: MEDLINE Main subject: Semantics / Language Language: En Year of publication: 2021 Document type: Article