Thresholding Gini variable importance with a single-trained random forest: An empirical Bayes approach.

Dunne, Robert; Reguant, Roc; Ramarao-Milne, Priya; Szul, Piotr; Sng, Letitia M F; Lundberg, Mischa; Twine, Natalie A; Bauer, Denis C

Affiliation

Dunne R; Data61, Commonwealth Scientific and Industrial Research Organisation, Sydney, Australia.
Reguant R; Transformational Bioinformatics, Commonwealth Scientific and Industrial Research Organisation, Westmead, Australia.
Ramarao-Milne P; Transformational Bioinformatics, Commonwealth Scientific and Industrial Research Organisation, Westmead, Australia.
Szul P; Data61, Commonwealth Scientific and Industrial Research Organisation, Dutton Park, Australia.
Sng LMF; Transformational Bioinformatics, Commonwealth Scientific and Industrial Research Organisation, Westmead, Australia.
Lundberg M; Transformational Bioinformatics, Commonwealth Scientific and Industrial Research Organisation, Westmead, Australia.
Twine NA; Diamantina Institute, The University of Queensland, St Lucia, Australia.
Bauer DC; Transformational Bioinformatics, Commonwealth Scientific and Industrial Research Organisation, Westmead, Australia.

Comput Struct Biotechnol J ; 21: 4354-4360, 2023.

Article in English | MEDLINE | ID: mdl-37711185

ABSTRACT

Random forests (RFs) are a widely used modelling tool capable of feature selection via a variable importance measure (VIM); however, a threshold is needed to control for false positives. In the absence of a good understanding of the characteristics of VIMs, many current approaches attempt to select features associated with the response by training multiple RFs to generate statistical power via a permutation null, by employing recursive feature elimination, or through a combination of both. However, for high-dimensional datasets these approaches become computationally infeasible. In this paper, we present RFlocalfdr, a statistical approach, built on the empirical Bayes argument of Efron, for thresholding mean decrease in impurity (MDI) importances. It identifies features significantly associated with the response while controlling the false positive rate. Using synthetic data and real-world data in health, we demonstrate that RFlocalfdr has equivalent accuracy to currently published approaches, while being orders of magnitude faster. We show that RFlocalfdr can successfully threshold a dataset of 10^6 data points, establishing its usability for large-scale datasets such as those in genomics. Furthermore, RFlocalfdr is compatible with any RF implementation that returns a VIM and counts, making it a versatile feature selection tool that reduces false discoveries.
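
The empirical Bayes argument mentioned in the abstract can be illustrated with a generic local false discovery rate calculation, fdr(z) = pi0 * f0(z) / f(z), where f is the density of all scores and f0 the density of null scores. The sketch below is not the RFlocalfdr procedure itself. It assumes, purely for illustration, a Gaussian empirical null fitted by median and MAD, a Gaussian kernel density estimate for f, pi0 = 1, a cutoff of 0.2, and synthetic scores standing in for transformed importances. The names `normal_pdf` and `local_fdr` are hypothetical.

```python
import math
import random
import statistics


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def local_fdr(scores, pi0=1.0):
    """Generic local fdr estimate: pi0 * f0(z) / f(z), capped at 1.

    Assumptions (illustrative only): the null is Gaussian and most
    scores are null, so median and MAD describe the null centre and scale.
    """
    n = len(scores)
    # Empirical null from robust statistics of the bulk of the scores.
    mu0 = statistics.median(scores)
    mad = statistics.median([abs(s - mu0) for s in scores])
    sigma0 = 1.4826 * mad  # MAD rescaled to a Gaussian standard deviation
    # Rule-of-thumb bandwidth for the Gaussian kernel density estimate.
    h = 1.06 * sigma0 * n ** (-0.2)
    fdr = []
    for s in scores:
        f = sum(normal_pdf(s, t, h) for t in scores) / n  # mixture density
        f0 = normal_pdf(s, mu0, sigma0)                   # null density
        fdr.append(min(1.0, pi0 * f0 / f))
    return fdr


# Synthetic stand-in for importances: 450 null features, 50 informative ones.
random.seed(0)
null_scores = [random.gauss(0.0, 1.0) for _ in range(450)]
signal_scores = [random.gauss(6.0, 0.5) for _ in range(50)]
scores = null_scores + signal_scores

fdr = local_fdr(scores)
# Keep features whose estimated local fdr falls below the chosen cutoff.
selected = [i for i, v in enumerate(fdr) if v < 0.2]
print(len(selected), "features selected")
```

Only one set of scores is needed here, which mirrors the abstract's point that a single trained forest suffices and no permutation null has to be generated by refitting.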

Keywords

Empirical Bayes; Feature selection; Genetic analysis; Local FDR; Machine learning significance; Random forest

Full text: 1 Collections: 01-internacional Database: MEDLINE Study type: Clinical_trials / Prognostic_studies Language: En Journal: Comput Struct Biotechnol J Publication year: 2023 Document type: Article