Enhancing selection of alcohol consumption-associated genes by random forest.

Lyu, Chenglin; Joehanes, Roby; Huan, Tianxiao; Levy, Daniel; Li, Yi; Wang, Mengyao; Liu, Xue; Liu, Chunyu; Ma, Jiantao

Affiliations

Lyu C; Department of Biostatistics, Boston University School of Public Health, Boston, MA 02118, USA.
Joehanes R; Department of Anatomy and Neurobiology, Boston University Chobanian & Avedisian School of Medicine, Boston, MA 02118, USA.
Huan T; Framingham Heart Study and Population Sciences Branch, NHLBI, Framingham, MA 01702, USA.
Levy D; Framingham Heart Study and Population Sciences Branch, NHLBI, Framingham, MA 01702, USA.
Li Y; Framingham Heart Study and Population Sciences Branch, NHLBI, Framingham, MA 01702, USA.
Wang M; Department of Biostatistics, Boston University School of Public Health, Boston, MA 02118, USA.
Liu X; Department of Biostatistics, Boston University School of Public Health, Boston, MA 02118, USA.
Liu C; Department of Biostatistics, Boston University School of Public Health, Boston, MA 02118, USA.
Ma J; Department of Biostatistics, Boston University School of Public Health, Boston, MA 02118, USA.

Br J Nutr; 131(12): 2058-2067, 2024 Jun 28.

Article in English | MEDLINE | ID: mdl-38606596

ABSTRACT

Machine learning methods have been used in identifying omics markers for a variety of phenotypes. We aimed to examine whether a supervised machine learning algorithm can improve identification of alcohol-associated transcriptomic markers. In this study, we analysed array-based, whole-blood derived expression data for 17 873 gene transcripts in 5508 Framingham Heart Study participants. By using the Boruta algorithm, a supervised random forest (RF)-based feature selection method, we selected twenty-five alcohol-associated transcripts. In a testing set (30 % of entire study participants), AUC (area under the receiver operating characteristics curve) of these twenty-five transcripts were 0·73, 0·69 and 0·66 for non-drinkers v. moderate drinkers, non-drinkers v. heavy drinkers and moderate drinkers v. heavy drinkers, respectively. The AUC of the selected transcripts by the Boruta method were comparable to those identified using conventional linear regression models, for example, AUC of 1958 transcripts identified by conventional linear regression models (false discovery rate < 0·2) were 0·74, 0·66 and 0·65, respectively. With Bonferroni correction for the twenty-five Boruta method-selected transcripts and three CVD risk factors (i.e. at P < 6·7e-4), we observed thirteen transcripts were associated with obesity, three transcripts with type 2 diabetes and one transcript with hypertension. For example, we observed that alcohol consumption was inversely associated with the expression of DOCK4, IL4R, and SORT1, and DOCK4 and SORT1 were positively associated with obesity, and IL4R was inversely associated with hypertension. In conclusion, using a supervised machine learning method, the RF-based Boruta algorithm, we identified novel alcohol-associated gene transcripts.
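The Bonferroni threshold quoted in the abstract (P < 6·7e-4) can be reproduced with a short calculation. The sketch below assumes a family-wise error rate of 0·05, which the abstract does not state but which matches the reported cut-off when divided across 25 transcripts and 3 CVD risk factors. All variable and function names are hypothetical and chosen only for illustration.

```python
# Illustrative arithmetic only; names are hypothetical, not from the paper.
n_transcripts = 25    # Boruta-selected transcripts (from the abstract)
n_risk_factors = 3    # obesity, type 2 diabetes, hypertension (from the abstract)
alpha = 0.05          # ASSUMPTION: family-wise error rate, not stated in the abstract

# Bonferroni correction divides alpha by the total number of tests.
n_tests = n_transcripts * n_risk_factors
threshold = alpha / n_tests


def is_significant(p_value, cutoff=threshold):
    """Return True if a p-value passes the Bonferroni-corrected cutoff."""
    return p_value < cutoff


print(n_tests)              # 75
print(f"{threshold:.1e}")   # 6.7e-04
```

Under this assumption, a transcript-risk factor association with P = 1e-4 would pass the corrected threshold, whereas one with P = 1e-3 would not.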

Subjects

Alcohol Drinking; Algorithms; Humans; Alcohol Drinking/genetics; Male; Female; Middle Aged; Machine Learning; Cardiovascular Diseases/genetics; Transcriptome; Adult; Risk Factors; Supervised Machine Learning; Random Forest

Keywords

Alcohol consumption; Boruta; CVD; Gene expression; Machine learning; Random forest

Full text: 1 Database: MEDLINE Main subject: Algorithms / Alcohol Drinking Limits: Adult / Female / Humans / Male / Middle aged Language: En Journal: Br J Nutr Publication year: 2024 Document type: Article Country of affiliation: United States
