Missing data imputation using classification and regression trees.

Chen, Cheng-Yang; Chang, Yu-Wei

Affiliation

Chen CY; Department of Statistics, National Chengchi University, Taipei, Taiwan.
Chang YW; Department of Statistics, National Chengchi University, Taipei, Taiwan.

PeerJ Comput Sci ; 10: e2119, 2024.

Article in English | MEDLINE | ID: mdl-38983189

ABSTRACT

Background:

Missing data are common when analyzing real data. One popular solution is to impute missing data so that one complete dataset can be obtained for subsequent data analysis. In the present study, we focus on missing data imputation using classification and regression trees (CART).

Methods:

We consider a new perspective on missing data in a CART imputation problem and implement it through resampling algorithms. Several existing CART-based imputation methods are compared through simulation studies to identify which achieve better imputation accuracy under various conditions, and the systematic findings are presented. The methods are further applied to two real datasets, the Hepatitis data and the Credit approval data, for illustration.
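
To make the basic idea of tree-based imputation concrete, here is a minimal sketch in Python. It is not one of the methods compared in the study: it assumes a single predictor `x`, a quantitative variable `y` with missing entries marked as `None`, and a regression tree with only one split. The names `fit_stump` and `impute_with_stump` are hypothetical helpers introduced for illustration.

```python
from statistics import mean


def fit_stump(xs, ys):
    """Fit a one-split regression tree (a depth-1 CART) of y on x.

    Returns (threshold, left_mean, right_mean), choosing the threshold
    that minimizes the summed squared error of the two leaves.
    """
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no valid split between identical x values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        left_mean, right_mean = mean(left), mean(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        raise ValueError("need at least two distinct x values")
    return best[1:]


def impute_with_stump(xs, ys):
    """Replace each missing y (None) with the mean of the leaf its x falls in."""
    observed = [(x, y) for x, y in zip(xs, ys) if y is not None]
    obs_x, obs_y = zip(*observed)
    threshold, left_mean, right_mean = fit_stump(obs_x, obs_y)
    return [
        y if y is not None else (left_mean if x <= threshold else right_mean)
        for x, y in zip(xs, ys)
    ]
```

The tree is fitted on the complete cases only, and each missing value is then filled with the prediction of the leaf that its predictor value falls into. A full CART grows more splits and, for categorical targets, predicts a class rather than a mean.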

Results:

The method that performs the best strongly depends on the correlation between variables. For imputing missing ordinal categorical variables, the rpart package with surrogate variables is recommended under correlations larger than 0 with missing completely at random (MCAR) and missing at random (MAR) conditions. Under missing not at random (MNAR), chi-squared test methods and the rpart package with surrogate variables are suggested. For imputing missing quantitative variables, the iterative imputation method is most recommended under moderate correlation conditions.
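
The three missingness conditions named above differ in what the probability of a value being missing depends on. The following sketch illustrates that distinction on toy data; it is not the simulation design of the study. The function `make_missing`, the cut-point rule, and the `rate` and `seed` parameters are assumptions made for illustration.

```python
import random


def make_missing(rows, mechanism, rate=0.3, seed=0):
    """Return a copy of rows [(x, y), ...] with some y values set to None.

    MCAR: every y is dropped with the same probability.
    MAR:  the drop probability depends only on the always-observed x.
    MNAR: the drop probability depends on the value of y itself.
    """
    rng = random.Random(seed)
    # Upper-middle order statistic, used as an illustrative cut point.
    x_cut = sorted(x for x, _ in rows)[len(rows) // 2]
    y_cut = sorted(y for _, y in rows)[len(rows) // 2]
    out = []
    for x, y in rows:
        if mechanism == "MCAR":
            p = rate
        elif mechanism == "MAR":
            p = 2 * rate if x >= x_cut else 0.0
        elif mechanism == "MNAR":
            p = 2 * rate if y >= y_cut else 0.0
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        out.append((x, None if rng.random() < p else y))
    return out


# Toy data: x is always observed, y is the variable that may go missing.
rows = [(i, (i * 7) % 10) for i in range(20)]
for mechanism in ("MCAR", "MAR", "MNAR"):
    masked = make_missing(rows, mechanism)
    n_missing = sum(y is None for _, y in masked)
    print(mechanism, n_missing)
```

Under MNAR the observed values of `y` are a biased sample of the full variable, which is consistent with the recommended method changing for that condition.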

Keywords

Classification and regression trees; Missing data; Missing data imputation; Resampling

Full text: 1 Collections: 01-international Database: MEDLINE Language: En Journal: PeerJ Comput Sci Publication year: 2024 Document type: Article Affiliation country: Taiwan Publication country: United States
