ABSTRACT
This work proposes a sequential methodology for selecting variables in classification problems in which the number of predictors is much larger than the sample size. The methodology includes a Monte Carlo permutation procedure that conditionally tests the null hypothesis of no association between the outcomes and the available predictors. To reduce the computational burden, we propose a new parametric distribution, the Truncated and Zero Inflated Gumbel Distribution. The intended application is finding compact classification models with improved performance for genomic data. Results on real data sets show that the proposed methodology selects compact models with optimized classification performance.
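The permutation idea mentioned in the abstract can be illustrated with a minimal sketch. It is a simplified, marginal test for a single predictor and a binary outcome, not the conditional sequential procedure of the paper, and it does not involve the Truncated and Zero Inflated Gumbel Distribution. The test statistic (absolute difference in class means) and the names `mean_difference` and `permutation_p_value` are assumptions chosen for illustration.

```python
import random


def mean_difference(values, labels):
    """Absolute difference between the class means of one predictor."""
    group0 = [v for v, y in zip(values, labels) if y == 0]
    group1 = [v for v, y in zip(values, labels) if y == 1]
    return abs(sum(group1) / len(group1) - sum(group0) / len(group0))


def permutation_p_value(values, labels, n_permutations=2000, seed=0):
    """Monte Carlo estimate of the p-value under 'no association'.

    Labels are shuffled so that any link between predictor and outcome
    is broken; the p-value is the share of shuffles whose statistic is
    at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = mean_difference(values, labels)
    shuffled = list(labels)  # shuffling keeps the class sizes fixed
    exceed = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if mean_difference(values, shuffled) >= observed:
            exceed += 1
    # Add-one correction keeps the estimate strictly positive.
    return (exceed + 1) / (n_permutations + 1)
```

A small p-value suggests that the predictor is associated with the outcome. With many more predictors than samples, such a test would be repeated for many candidates, which is where the computational cost noted in the abstract arises.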
Subjects
Genomics/statistics & numerical data , Algorithms , Biostatistics/methods , Breast Neoplasms/drug therapy , Breast Neoplasms/genetics , Computer Simulation , Statistical Data Interpretation , Factual Databases/statistics & numerical data , Female , Gene Expression Profiling/statistics & numerical data , Humans , Statistical Models , Monte Carlo Method , Multivariate Analysis , Sample Size
ABSTRACT
Acute leukemia is usually classified into its myeloid and lymphoblastic subtypes according to the morphology of the tumor. Nevertheless, the subtypes may have a similar histopathological appearance, which makes screening difficult. In addition, approximately one-third of acute myeloid leukemias are characterized by aberrant cytoplasmic localization of nucleophosmin (NPMc(+)), most of which have a normal karyotype. This work uses two publicly available DNA microarray datasets to differentiate leukemia subtypes. The datasets were split into training and test sets, and feature selection methods were applied. Artificial neural network classifiers were developed to compare the feature selection methods. For the first dataset, the 50 genes selected using the best classifier were able to correctly classify all patients in the test set. For the second dataset, five genes yielded 97.5% accuracy in the test set.
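The feature selection step can be sketched as a simple filter that ranks genes on the training set only. The abstract does not name the selection methods that were compared, so the signal-to-noise style score below is an assumption, and `signal_to_noise` and `select_top_genes` are hypothetical names. A classifier (an artificial neural network in the paper) would then be trained on the selected columns and evaluated on the held-out test set.

```python
import statistics


def signal_to_noise(column, labels):
    """|mean1 - mean0| / (sd1 + sd0) for one gene; larger means better separation."""
    g0 = [v for v, y in zip(column, labels) if y == 0]
    g1 = [v for v, y in zip(column, labels) if y == 1]
    spread = statistics.pstdev(g0) + statistics.pstdev(g1)
    gap = abs(statistics.fmean(g1) - statistics.fmean(g0))
    if spread > 0:
        return gap / spread
    # Zero spread: perfectly separated if the means differ, uninformative otherwise.
    return float("inf") if gap > 0 else 0.0


def select_top_genes(train_rows, train_labels, k):
    """Return the column indices of the k highest-scoring genes.

    Only training samples are used, so the test set stays untouched
    until the final evaluation.
    """
    n_genes = len(train_rows[0])
    scores = []
    for j in range(n_genes):
        column = [row[j] for row in train_rows]
        scores.append((signal_to_noise(column, train_labels), j))
    # Highest score first; ties are broken by the lower column index.
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [j for _, j in scores[:k]]
```

Ranking on the training samples alone avoids leaking information from the test set into the selected gene list, which would otherwise inflate the reported test accuracy.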