A Memory-Efficient Encoding Method for Processing Mixed-Type Data on Machine Learning.

Lopez-Arevalo, Ivan; Aldana-Bobadilla, Edwin; Molina-Villegas, Alejandro; Galeana-Zapién, Hiram; Muñiz-Sanchez, Victor; Gausin-Valle, Saul

Affiliation

Lopez-Arevalo I; Centro de Investigación y de Estudios Avanzados del I.P.N., Unidad Tamaulipas, Victoria 87130, Mexico.
Aldana-Bobadilla E; Conacyt-Centro de Investigación y de Estudios Avanzados del I.P.N., Unidad Tamaulipas, Victoria 87130, Mexico.
Molina-Villegas A; Conacyt-Centro de Investigación en Ciencias de Información Geoespacial, Merida 97302, Mexico.
Galeana-Zapién H; Centro de Investigación y de Estudios Avanzados del I.P.N., Unidad Tamaulipas, Victoria 87130, Mexico.
Muñiz-Sanchez V; Centro de Investigación en Matemáticas, Monterrey 66628, Mexico.
Gausin-Valle S; Centro de Investigación y de Estudios Avanzados del I.P.N., Unidad Tamaulipas, Victoria 87130, Mexico.

Entropy (Basel); 22(12), 2020 Dec 09.

Article in English | MEDLINE | ID: mdl-33316972

ABSTRACT

The most common machine-learning methods solve supervised and unsupervised problems on datasets whose features belong to a numerical space. However, many problems involve data in which numerical and categorical features coexist, which makes them challenging to handle. Categorical data must be transformed into numeric form in a preprocessing step. One-hot and feature-hashing encoding have been the most widely used approaches, at the expense of a significant increase in the dimensionality of the dataset. This increase introduces unexpected challenges in dealing with an overabundance of variables and/or noisy data. In this paper we propose a novel encoding approach that maps mixed-type data into an information space, using Shannon's theory to model the amount of information contained in the original data. We evaluated our proposal on ten mixed-type datasets from the UCI repository and two datasets representing real-world problems, obtaining promising results. To demonstrate its performance, we applied the proposal to prepare these datasets for classification, regression, and clustering tasks. We show that our encoding is remarkably superior to one-hot and feature-hashing encoding in terms of memory efficiency, and that it can preserve the information conveyed by the original data.
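
The dimensionality trade-off described in the abstract can be illustrated with a small sketch. This is not the authors' encoding, which the abstract does not specify. It is a hypothetical stand-in that replaces each category with its Shannon self-information, -log2 p(x), with p estimated from relative frequency. The function names and the toy `colors` column are assumptions made for illustration only.

```python
import math
from collections import Counter


def one_hot(values):
    """Return (categories, rows): one 0/1 column per distinct category."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows


def self_information(values):
    """Illustrative stand-in (not the paper's method): map each value
    to -log2(p), where p is the value's relative frequency."""
    counts = Counter(values)
    n = len(values)
    return [-math.log2(counts[v] / n) for v in values]


# Hypothetical toy column with three categories over eight rows.
colors = ["red", "green", "red", "blue", "red", "green", "red", "red"]

categories, hot = one_hot(colors)
info = self_information(colors)

# One-hot stores rows x categories cells; the scalar encoding stores one per row.
one_hot_cells = len(hot) * len(categories)
info_cells = len(info)
print(one_hot_cells, info_cells)
```

One-hot width grows with the number of distinct categories, while the scalar encoding keeps a single column per feature. Note that this simple stand-in maps categories with equal frequency to the same value, so it is only a picture of the memory argument, not a substitute for the proposed method.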

Keywords

categorical data; data preprocessing; machine learning

Full text: 1 Databases: MEDLINE Study type: Prognostic_studies Language: En Journal: Entropy (Basel) Publication year: 2020 Document type: Article Affiliation country: Mexico
