Using a multi-staged strategy based on machine learning and mathematical modeling to predict genotype-phenotype risk patterns in diabetic kidney disease: a prospective case-control cohort analysis.

Leung, Ross K K; Wang, Ying; Ma, Ronald C W; Luk, Andrea O Y; Lam, Vincent; Ng, Maggie; So, Wing Yee; Tsui, Stephen K W; Chan, Juliana C N

Affiliation

Leung RK; Hong Kong Bioinformatics Centre, The Chinese University of Hong Kong, Hong Kong, SAR, China.

BMC Nephrol; 14: 162, 2013 Jul 23.

Article in English | MEDLINE | ID: mdl-23879411

ABSTRACT

BACKGROUND:

Multi-causality and heterogeneity of phenotypes and genotypes characterize complex diseases. In a database with a comprehensive collection of phenotypes and genotypes, we compared the performance of common machine learning methods in generating mathematical models to predict diabetic kidney disease (DKD).

METHODS:

In a prospective cohort of type 2 diabetic patients, we selected 119 subjects with DKD and 554 without DKD at enrolment and after a median follow-up period of 7.8 years for model training, testing and validation using seven machine learning methods (partial least squares regression, the classification and regression tree, the C5.0 decision tree, random forest, naïve Bayes classification, neural network and support vector machine). We used 17 clinical attributes and 70 single nucleotide polymorphisms (SNPs) of 54 candidate genes to build different models. The top attributes selected by the best-performing models were then used to build models with performance comparable to those using the entire dataset.
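As a rough illustration of how such a cohort might be partitioned for training, testing and validation, the sketch below performs a stratified split that keeps the case-to-control ratio similar in every partition. The 60/20/20 fractions, the seed and the function name `stratified_split` are assumptions made for this example; the abstract does not state how the data were actually divided. The sketch also computes the accuracy of always predicting "no DKD", which is a useful floor when comparing models on an imbalanced cohort like this one.

```python
import random

def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split sample indices into partitions, preserving the class ratio in each.

    The fractions are illustrative assumptions, not values from the study.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    parts = [[] for _ in fractions]
    for indices in by_class.values():
        rng.shuffle(indices)
        start = 0
        for i, frac in enumerate(fractions):
            # The last partition takes the remainder so no sample is dropped.
            if i == len(fractions) - 1:
                end = len(indices)
            else:
                end = start + round(frac * len(indices))
            parts[i].extend(indices[start:end])
            start = end
    return parts

# Cohort sizes from the abstract: 119 with DKD (label 1), 554 without (label 0).
labels = [1] * 119 + [0] * 554
train, test, valid = stratified_split(labels)

# Accuracy of always predicting "no DKD"; a useful model should exceed this.
majority_baseline = 554 / len(labels)
```

With roughly 82% of subjects in the control group, raw accuracy alone can flatter a model that rarely predicts DKD, so a split that preserves the class ratio makes the partitions comparable.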

RESULTS:

Age, age of diagnosis, systolic blood pressure and genetic polymorphisms of uteroglobin and lipid metabolism were selected by most methods. Models generated by support vector machine (svmRadial) and random forest (cforest) had the best prediction accuracy whereas models derived from naïve Bayes classifier and partial least squares regression had the least optimal performance. Using 10 clinical attributes (systolic and diastolic blood pressure, age, age of diagnosis, triglyceride, white blood cell count, total cholesterol, waist to hip ratio, LDL cholesterol, and alcohol intake) and 5 genetic attributes (UGB G38A, LIPC -514C > T, APOB Thr71Ile, APOC3 3206T > G and APOC3 1100C > T), selected most often by SVM and cforest, we were able to build high-performance models.
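The idea of keeping the attributes "selected most often" by the best-performing models can be sketched as a simple vote count over each model's list of top attributes. The helper `most_selected`, the `min_votes` threshold and the per-model lists below are hypothetical placeholders for illustration; they are not the study's actual rankings or selection procedure.

```python
from collections import Counter

def most_selected(top_lists, min_votes):
    """Return attributes that appear in at least min_votes per-model top lists."""
    # Each model votes at most once per attribute.
    votes = Counter(attr for attrs in top_lists.values() for attr in set(attrs))
    selected = [attr for attr, n in votes.items() if n >= min_votes]
    # Most-voted first, ties broken alphabetically for a stable order.
    return sorted(selected, key=lambda attr: (-votes[attr], attr))

# Placeholder lists for illustration only; NOT results from the study.
top_lists = {
    "svmRadial": ["age", "systolic_bp", "UGB_G38A", "triglyceride"],
    "cforest": ["age", "systolic_bp", "UGB_G38A", "ldl_cholesterol"],
}
shared = most_selected(top_lists, min_votes=2)
```

A reduced model would then be retrained on only the attributes in `shared` and compared against the model built from the full attribute set.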

CONCLUSIONS:

Amongst different machine learning methods, svmRadial and cforest had the best performance. Genetic polymorphisms related to inflammation and lipid metabolism warrant further investigation for their associations with DKD.

Subjects

Artificial Intelligence; Diabetic Nephropathies/diagnosis; Diabetic Nephropathies/genetics; Genotype; Models, Theoretical; Phenotype; Aged; Case-Control Studies; Cohort Studies; Diabetes Mellitus, Type 2/diagnosis; Diabetes Mellitus, Type 2/genetics; Female; Follow-Up Studies; Humans; Male; Middle Aged; Predictive Value of Tests; Prospective Studies; Risk Factors

Full text: 1 | Database: MEDLINE | Main subject: Phenotype / Artificial Intelligence / Diabetic Nephropathies / Genotype / Models, Theoretical | Type of study: Diagnostic_studies / Etiology_studies / Incidence_studies / Observational_studies / Prognostic_studies / Risk_factors_studies | Limits: Aged / Female / Humans / Male / Middle aged | Language: En | Year of publication: 2013 | Document type: Article
