Influence of Feature Encoding and Choice of Classifier on Disease Risk Prediction in Genome-Wide Association Studies.
PLoS One; 10(8): e0135832, 2015.
Article in English | MEDLINE | ID: mdl-26285210
Various attempts have been made to predict individual disease risk from genotype data collected in genome-wide association studies (GWAS). However, most studies investigated only one or two classification algorithms and feature encoding schemes. In this study, we applied seven different classification algorithms to GWAS case-control data sets for seven different diseases to create models for disease risk prediction. Further, we used three different encoding schemes for the genotypes of single nucleotide polymorphisms (SNPs) and investigated their influence on the predictive performance of these models. Our study suggests that an additive encoding of the SNP data should be the preferred encoding scheme, as it yielded the best predictive performance for all algorithms and data sets. Furthermore, our results showed that the differences between most state-of-the-art classification algorithms are not statistically significant. Consequently, we recommend preferring algorithms with simple models, such as the linear support vector machine (SVM), as they allow for better subsequent interpretation without significant loss of accuracy.
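To make the notion of a genotype encoding scheme concrete, the following minimal Python sketch turns SNP genotypes into numeric features. The abstract names only the additive encoding; the dominant, recessive, and one-hot schemes shown alongside it are common alternatives included here as assumptions and are not necessarily the ones compared in the study. The two-letter genotype strings, the minor allele symbol "a", and all function names are hypothetical and chosen for illustration only.

```python
# Illustrative sketch only: genotypes are two-letter strings and "a" is
# assumed to be the minor allele. None of these names come from the paper.

def additive(genotype, minor="a"):
    """Additive encoding: number of minor alleles (AA -> 0, Aa -> 1, aa -> 2)."""
    return sum(1 for allele in genotype if allele == minor)

def dominant(genotype, minor="a"):
    """Assumed alternative: 1 if at least one minor allele is present."""
    return 1 if additive(genotype, minor) >= 1 else 0

def recessive(genotype, minor="a"):
    """Assumed alternative: 1 only if both alleles are the minor allele."""
    return 1 if additive(genotype, minor) == 2 else 0

def one_hot(genotype, minor="a"):
    """Assumed alternative: one indicator per genotype class (0, 1, or 2 copies)."""
    count = additive(genotype, minor)
    return [1 if count == k else 0 for k in range(3)]

def encode_sample(genotypes, scheme):
    """Apply one encoding scheme to every SNP of a sample and flatten the result."""
    features = []
    for genotype in genotypes:
        value = scheme(genotype)
        if isinstance(value, list):
            features.extend(value)
        else:
            features.append(value)
    return features

# One hypothetical individual genotyped at three SNPs.
sample = ["AA", "Aa", "aa"]
additive_features = encode_sample(sample, additive)
one_hot_features = encode_sample(sample, one_hot)
```

Under the additive scheme each SNP contributes a single feature, so a linear model such as a linear SVM assigns one weight per SNP, which is one reason such models are comparatively easy to interpret.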
Full text: 1
Databases: MEDLINE
Main subject: Disease / Computational Biology / Genome-Wide Association Study
Study type: Etiology studies / Prognostic studies / Risk factors studies
Limits: Humans
Language: English
Journal: PLoS One
Journal subject: Science / Medicine
Year: 2015
Document type: Article
Affiliation country: Germany