Pseudotext Injection and Advance Filtering of Low-Resource Corpus for Neural Machine Translation.

Adjeisah, Michael; Liu, Guohua; Nyabuga, Douglas Omwenga; Nortey, Richard Nuetey; Song, Jinling

Affiliation

Adjeisah M; School of Computer Science and Technology, Donghua University, Shanghai, China.
Liu G; School of Computer Science and Technology, Donghua University, Shanghai, China.
Nyabuga DO; School of Computer Science and Technology, Donghua University, Shanghai, China.
Nortey RN; School of Information Science and Technology, Donghua University, Shanghai, China.
Song J; School of Mathematics and Information Technology, Hebei Normal University of Science & Technology, Qinhuangdao, Hebei, China.

Comput Intell Neurosci ; 2021: 6682385, 2021.

Article in English | MEDLINE | ID: mdl-33936190

ABSTRACT

Scaling natural language processing (NLP) to low-resource languages to improve machine translation (MT) performance remains an open problem. This research contributes to the field with low-resource English-Twi translation based on filtered synthetic-parallel corpora. It is often difficult to determine what a good-quality corpus looks like in low-resource conditions, particularly where the target corpus is the only sample text of the parallel language. To improve MT performance for such low-resource language pairs, we propose to expand the training data by injecting a synthetic-parallel corpus obtained by translating a monolingual corpus from the target language, based on bootstrapping with different parameter settings. Furthermore, we performed unsupervised measurements on each sentence pair using squared Mahalanobis distances, a filtering technique that predicts sentence parallelism. Additionally, we make extensive use of three different sentence-level similarity metrics after round-trip translation. Experimental results on varying amounts of available parallel corpus demonstrate that injecting a pseudoparallel corpus and extensive filtering with sentence-level similarity metrics significantly improve the original out-of-the-box MT systems for low-resource language pairs. Compared with existing improvements on the same original framework under the same structure, our approach shows substantial gains in BLEU and TER scores.
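As a rough illustration of the squared-Mahalanobis-distance filtering mentioned in the abstract, the sketch below scores each sentence pair by how far its feature vector lies from the corpus mean and drops the most distant pairs. The abstract does not say which features or threshold the authors use, so the two-column feature matrix (for example, a length ratio and a similarity score), the `keep_fraction` parameter, and both function names are assumptions made purely for illustration, not the paper's method.

```python
import numpy as np


def squared_mahalanobis(features: np.ndarray) -> np.ndarray:
    """Squared Mahalanobis distance of each row from the sample mean.

    `features` has one row per sentence pair; the columns are
    hypothetical per-pair features (not specified in the abstract).
    """
    centered = features - features.mean(axis=0)
    cov = np.atleast_2d(np.cov(features, rowvar=False))
    # The pseudo-inverse keeps this usable if the covariance is singular.
    inv_cov = np.linalg.pinv(cov)
    # Row-wise quadratic form: d_i^2 = c_i^T * inv_cov * c_i
    return np.einsum("ij,jk,ik->i", centered, inv_cov, centered)


def filter_pairs(pairs, features, keep_fraction=0.8):
    """Keep pairs whose distance is at or below the given quantile.

    `keep_fraction` is an assumed, illustrative cut-off.
    """
    d2 = squared_mahalanobis(np.asarray(features, dtype=float))
    threshold = np.quantile(d2, keep_fraction)
    return [pair for pair, d in zip(pairs, d2) if d <= threshold]
```

With such a filter, pairs whose features look unlike the bulk of the corpus (a likely sign of a non-parallel or badly translated pair) are removed before training. A fixed chi-squared cut-off is a common alternative to a quantile, but it is equally an assumption here.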

Subjects

Natural Language Processing; Translating; Language; Translations

Full text: 1 | Collections: 01-international | Database: MEDLINE | Main subject: Translating / Natural Language Processing | Study type: Prognostic_studies | Language: English | Year of publication: 2021 | Document type: Article
