Comparing regression, naive Bayes, and random forest methods in the prediction of individual survival to second lactation in Holstein cattle.

van der Heide, E M M; Veerkamp, R F; van Pelt, M L; Kamphuis, C; Athanasiadis, I; Ducro, B J

Affiliation

van der Heide EMM; Wageningen University and Research Animal Breeding and Genomics, PO Box 338, 6700 AH Wageningen, the Netherlands. Electronic address: esther.vanderheide@wur.nl.
Veerkamp RF; Wageningen University and Research Animal Breeding and Genomics, PO Box 338, 6700 AH Wageningen, the Netherlands.
van Pelt ML; Cooperation CRV, Animal Evaluation Unit, PO Box 454, 6800 AL Arnhem, the Netherlands.
Kamphuis C; Wageningen University and Research Information Technology Group, 6706 KN Wageningen, the Netherlands.
Athanasiadis I; Wageningen University and Research Information Technology Group, 6706 KN Wageningen, the Netherlands.
Ducro BJ; Wageningen University and Research Animal Breeding and Genomics, PO Box 338, 6700 AH Wageningen, the Netherlands.

J Dairy Sci; 102(10): 9409-9421, 2019 Oct.

Article in En | MEDLINE | ID: mdl-31447154

ABSTRACT

In this study, we compared multiple logistic regression, a linear method, to naive Bayes and random forest, 2 nonlinear machine-learning methods. We used all 3 methods to predict individual survival to second lactation in dairy heifers. The data set used for prediction contained 6,847 heifers born between January 2012 and June 2013, all with known survival outcomes. Each animal had 50 genomic estimated breeding values available at birth and up to 65 phenotypic variables that accumulated over time. Survival was predicted at 5 moments in life: at birth, at 18 mo, at first calving, at 6 wk after first calving, and at 200 d after first calving. The data sets were randomly split into 70% training and 30% testing sets to evaluate model performance for 20-fold validation. The methods were compared for accuracy, sensitivity, specificity, area under the curve (AUC) value, contrasts between groups for the prediction outcomes, and increase in surviving animals in a practical scenario. At birth and 18 mo, all methods had overlapping performance; no method significantly outperformed the others. At first calving, 6 wk after first calving, and 200 d after first calving, random forest and naive Bayes had overlapping performance, and both machine-learning methods outperformed multiple logistic regression. Overall, naive Bayes had the highest average AUC at all decision points up to 200 d after first calving. Random forest had the highest AUC at 200 d after first calving. All methods obtained similar increases in survival in the practical scenario. Despite this, the methods appeared to predict the survival of individual heifers differently. All methods improved over time, but the changes in mean model outcomes for surviving and non-surviving animals differed by method. Furthermore, the correlations of individual predictions between methods ranged from r = 0.417 to r = 0.700; the lowest correlations were at first calving for all methods.
In short, all 3 methods were able to predict survival at a population level, because all methods improved survival in a practical scenario. However, predictions for individual animals differed considerably between methods.
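
The evaluation measures named in the abstract (a random 70%/30% split, accuracy, sensitivity, specificity, AUC, and the correlation of individual predictions between methods) can be illustrated with a minimal Python sketch. This is not the authors' code. All function names are hypothetical and any data passed in would be illustrative. The sketch assumes that survival is coded 1 and non-survival 0, and that a model outputs a score where higher means more likely to survive. It also assumes that one split would be drawn per repetition with a different seed, which is only one possible reading of "20-fold validation".

```python
import random


def split_indices(n, train_frac=0.7, seed=0):
    """Return (train, test) index lists for one random split of n animals."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    cut = int(round(n * train_frac))
    return idx[:cut], idx[cut:]


def confusion_metrics(y_true, y_pred):
    """Accuracy, sensitivity and specificity for binary labels (1 = survived)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Sensitivity: share of true survivors predicted to survive.
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        # Specificity: share of true non-survivors predicted not to survive.
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }


def auc(y_true, scores):
    """AUC as the probability that a random survivor outscores a random
    non-survivor, with ties counted as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one animal in each class")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def pearson_r(a, b):
    """Pearson correlation between two methods' scores for the same animals."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5
```

Under these assumptions, calling `split_indices(6847, seed=k)` for `k` in `range(20)` would give 20 train/test partitions, and `pearson_r` applied to two methods' scores on the same test animals gives the kind of between-method correlation the abstract reports.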

Subject(s)

Cattle/physiology; Genome/genetics; Machine Learning; Animals; Animals, Newborn; Bayes Theorem; Breeding; Cattle/genetics; Female; Lactation; Parturition/genetics; Pregnancy

Key words

machine learning; naive Bayes; phenotypic prediction; random forest; regression

Full text: 1 | Collection: 01-internacional | Database: MEDLINE | Main subject: Cattle / Genome / Machine Learning | Type of study: Prognostic_studies / Risk_factors_studies | Limits: Animals / Pregnancy | Language: En | Journal: J Dairy Sci | Year: 2019 | Type: Article