Morpheme matching based text tokenization for a scarce resourced language.

Rehman, Zobia; Anwar, Waqas; Bajwa, Usama Ijaz; Xuan, Wang; Chaoying, Zhou

Affiliation

Rehman Z; Department of Computer Science, COMSATS Institute of Information Technology, Abbottabad, Pakistan.

PLoS One ; 8(8): e68178, 2013.

Article in English | MEDLINE | ID: mdl-23990871

ABSTRACT

Text tokenization is a fundamental pre-processing step for almost all information processing applications. The task is nontrivial for scarce-resourced languages such as Urdu, because spaces are used inconsistently between words. This paper proposes a morpheme-matching-based approach to Urdu text tokenization, along with additional algorithms that address boundary detection for compound words, affixation, reduplication, names, and abbreviations. The study achieved 97.28% precision, 93.71% recall, and 95.46% F1-measure when tokenizing a corpus of 57,000 words with a morpheme list of 6,400 entries.
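The abstract does not spell out the matching procedure, so the following is only a minimal sketch of one common way to segment text against a morpheme list: greedy forward longest-match. It is a stand-in technique, not the authors' algorithm, and it ignores the compound-word, affixation, reduplication, name, and abbreviation handling the paper mentions. The function names and the toy Latin-script morpheme set are hypothetical. The second function simply checks that the reported F1-measure is the harmonic mean of the reported precision and recall.

```python
def longest_match_tokenize(text, morphemes):
    """Greedy forward longest-match segmentation (illustrative stand-in only).

    `morphemes` is a set of known morpheme strings; `text` has no reliable spaces.
    """
    max_len = max((len(m) for m in morphemes), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a listed morpheme matches.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in morphemes:
                break
        else:
            # No listed morpheme starts here: emit one character as an unknown token.
            j = i + 1
        tokens.append(text[i:j])
        i = j
    return tokens


def f1_measure(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale)."""
    return 2 * precision * recall / (precision + recall)


# Toy example with hypothetical morphemes (not taken from the paper's list).
toy_morphemes = {"text", "token", "ization", "is", "fun"}
print(longest_match_tokenize("texttokenizationisfun", toy_morphemes))

# Consistency check on the figures reported in the abstract.
print(round(f1_measure(97.28, 93.71), 2))
```

A greedy matcher of this kind can mis-segment when a long morpheme overlaps the start of the next word, which is presumably one reason the paper adds separate algorithms for the harder cases.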

Subjects

Language; Programming Languages; Algorithms; Artificial Intelligence; Information Storage and Retrieval; Likelihood Functions; Names; Reproducibility of Results; Software

Full text: 1 | Database: MEDLINE | Main subject: Programming Languages / Language | Language: English | Publication year: 2013 | Document type: Article
