Comparison of imputation methods for missing production data of dairy cattle.

You, J; Ellis, J L; Adams, S; Sahar, M; Jacobs, M; Tulpan, D

Affiliation

You J; Department of Animal Biosciences, University of Guelph, Guelph, ON, Canada.
Ellis JL; Department of Animal Biosciences, University of Guelph, Guelph, ON, Canada. Electronic address: jellis@uoguelph.ca.
Adams S; Department of Animal Biosciences, University of Guelph, Guelph, ON, Canada.
Sahar M; Department of Animal Biosciences, University of Guelph, Guelph, ON, Canada.
Jacobs M; Trouw Nutrition Innovation Department, Amersfoort, Netherlands.
Tulpan D; Department of Animal Biosciences, University of Guelph, Guelph, ON, Canada.

Animal ; 17 Suppl 5: 100921, 2023 Dec.

Article in English | MEDLINE | ID: mdl-37659911

ABSTRACT

Nowadays, vast amounts of data representing feed intake, growth, and environmental impact of individual animals are being recorded in on-farm settings. Despite their apparent use, data collected in real-world applications often have missing values in one or several variables, due to reasons including human error, machine error, or sampling frequency misalignment across multiple variables. Since incomplete datasets are less valuable for downstream data analysis, it is important to address the missing value problem properly. One option may be to reduce the dataset to a subset that contains only complete data, but considerable data may be lost via this process. The current study aimed to compare imputation methods for the estimation of missing values in a raw dataset of dairy cattle including 454 553 records collected from 629 cows between 2009 and 2020. The dataset was subjected to a cleaning process that reduced its size to 437 075 observations corresponding to 512 cows. Missing values were present in four variables: concentrate DM intake (CDMI, missing percentage = 2.30%), forage DM intake (FDMI, 8.05%), milk yield (MY, 15.12%), and BW (64.33%). After removing all missing values, the resulting dataset (n = 129 353) was randomly sampled five times to create five independent subsets that exhibit the same missing data percentages as the cleaned dataset. Four univariate and nine multivariate imputation methods (eight machine learning methods and the MissForest method) were applied and evaluated on the five repeats, and average imputation performance was reported for each repeat. The results showed that Random Forest was overall the best imputation method for this type of data and had a lower mean squared prediction error and higher concordance correlation coefficient than the other imputation methods for all imputed variables. Random Forest performed particularly well for imputing CDMI, MY, and BW, compared to imputing FDMI.
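The evaluation design described in the abstract (hide known values at the reported missing percentages, impute them, then score the imputations against the hidden truth) can be illustrated with a minimal sketch. This is not the authors' code. The function names, the synthetic milk-yield values, the completely-at-random masking, and the mean-fill baseline are all assumptions made for illustration; the abstract does not state how the masked positions were chosen or which univariate methods were used. The concordance correlation coefficient is computed here with Lin's formula using population moments.

```python
import numpy as np

# Missing percentages reported in the abstract for the cleaned dataset.
MISSING_RATES = {"CDMI": 0.0230, "FDMI": 0.0805, "MY": 0.1512, "BW": 0.6433}


def mask_at_random(values, rate, rng):
    """Hide round(rate * n) entries (assumed completely at random).

    Returns the masked copy (NaN where hidden) and the boolean mask.
    """
    values = np.asarray(values, dtype=float)
    n_missing = int(round(rate * values.size))
    hidden = rng.choice(values.size, size=n_missing, replace=False)
    mask = np.zeros(values.size, dtype=bool)
    mask[hidden] = True
    masked = values.copy()
    masked[mask] = np.nan
    return masked, mask


def mean_impute(masked):
    """Hypothetical univariate baseline: fill NaNs with the observed mean."""
    filled = masked.copy()
    filled[np.isnan(filled)] = np.nanmean(masked)
    return filled


def mspe(observed, predicted):
    """Mean squared prediction error."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean((observed - predicted) ** 2))


def ccc(observed, predicted):
    """Lin's concordance correlation coefficient (population moments)."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    mean_o, mean_p = observed.mean(), predicted.mean()
    cov = np.mean((observed - mean_o) * (predicted - mean_p))
    return float(2 * cov / (observed.var() + predicted.var() + (mean_o - mean_p) ** 2))


# Five repeats on synthetic data standing in for a complete MY column.
rng = np.random.default_rng(0)
complete_my = rng.normal(loc=35.0, scale=6.0, size=1000)  # made-up values
repeat_errors = []
for _ in range(5):
    masked, mask = mask_at_random(complete_my, MISSING_RATES["MY"], rng)
    imputed = mean_impute(masked)
    # Score only the positions that were hidden.
    repeat_errors.append(mspe(complete_my[mask], imputed[mask]))
```

A constant fill such as the mean carries no information about individual records, so its concordance with the hidden values is close to zero. That is the kind of gap the multivariate methods in the study are meant to close.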

Subjects

Milk; Research Design; Humans; Female; Cattle; Animals; Machine Learning; Eating; Farms

Keywords

Big data; Dairy cow; Interpolation; Machine learning; Unavailable values

Full text: 1 | Collections: 01-international | Database: MEDLINE | Main subject: Research Design / Milk | Study type: Prognostic_studies | Limits: Animals / Female / Humans | Language: En | Journal: Animal | Publication year: 2023 | Document type: Article
