Proportion-based normalizations outperform compositional data transformations in machine learning applications.

Yerke, Aaron; Fry Brumit, Daisy; Fodor, Anthony A

Affiliation

Yerke A; Department of Bioinformatics and Genomics, Bioinformatics Building, UNC Charlotte, The University of North Carolina, Charlotte 9331 Robert D. Snyder Rd, Charlotte, USA.
Fry Brumit D; Food Components and Health Laboratory, USDA, ARS, Beltsville Human Nutrition Research Center, Beltsville, USA.
Fodor AA; Department of Bioinformatics and Genomics, Bioinformatics Building, UNC Charlotte, The University of North Carolina, Charlotte 9331 Robert D. Snyder Rd, Charlotte, USA.

Microbiome; 12(1): 45, 2024 Mar 05.

Article in English | MEDLINE | ID: mdl-38443997

ABSTRACT

BACKGROUND:

Normalization, as a pre-processing step, can significantly affect the resolution of machine learning analysis for microbiome studies. There are countless options for normalization scheme selection. In this study, we examined compositionally aware algorithms including the additive log ratio (alr), the centered log ratio (clr), and a recent evolution of the isometric log ratio (ilr) in the form of balance trees made with the PhILR R package. We also looked at compositionally naïve transformations such as raw counts tables and several transformations that are based on relative abundance, such as proportions, the Hellinger transformation, and a transformation based on the logarithm of proportions (which we call "lognorm").
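The transformations named above can be sketched for a single sample as follows. This is a minimal illustration rather than the authors' code: the function names, the pseudocount of 1 used before taking logarithms, and the exact form of `lognorm` (log10 of proportions rescaled to a mean read depth, plus 1) are assumptions made here, since the abstract does not give formulas. The ilr/PhILR transformation is left out because it requires a phylogenetic tree.

```python
import math

def proportions(counts):
    """Relative abundance: each count divided by the sample's read depth."""
    total = sum(counts)
    return [c / total for c in counts]

def hellinger(counts):
    """Hellinger transformation: square root of the proportions."""
    return [math.sqrt(p) for p in proportions(counts)]

def lognorm(counts, mean_depth):
    """Assumed form: log10(proportion * mean_depth + 1)."""
    return [math.log10(p * mean_depth + 1) for p in proportions(counts)]

def clr(counts, pseudocount=1.0):
    """Centered log ratio: log counts minus their mean (log of the geometric mean)."""
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

def alr(counts, ref_index=-1, pseudocount=1.0):
    """Additive log ratio: log counts minus the log of a reference taxon.

    The reference taxon is dropped, so the output has one fewer feature.
    """
    logs = [math.log(c + pseudocount) for c in counts]
    ref = ref_index % len(logs)
    return [v - logs[ref] for i, v in enumerate(logs) if i != ref]

# Hypothetical sample with four taxa and a read depth of 100.
sample = [10, 0, 30, 60]
print(proportions(sample))
print(hellinger(sample))
print(lognorm(sample, mean_depth=100))
print(clr(sample))
print(alr(sample))
```

The proportion-based functions all correct for read depth by dividing by the sample total, whereas `clr` and `alr` express each taxon relative to other taxa in the same sample, which is what makes them compositionally aware.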

RESULTS:

In our evaluation, we used 65 metadata variables culled from four publicly available datasets at the amplicon sequence variant (ASV) level with a random forest machine learning algorithm. We found that different common pre-processing steps in the creation of the balance trees made very little difference in overall performance. Overall, the compositionally aware data transformations (alr, clr, and ilr via PhILR) generally performed slightly worse than, or only as well as, the compositionally naïve transformations. However, relative abundance-based transformations outperformed most other transformations by a small but reliably statistically significant margin.

CONCLUSIONS:

Our results suggest that minimizing the complexity of transformations while correcting for read depth may be a generally preferable strategy in preparing data for machine learning, compared to more sophisticated and more complex transformations that attempt to better correct for compositionality.

Subjects

Algorithms; Microbiota; Machine Learning; Microbiota/genetics

Keywords

Compositional data; High-throughput nucleotide sequencing; Machine learning; Metagenomics; Normalization; PhILR; Random forest; Statistical data interpretation; Transformation

Full text: 1 Collections: 01-international Database: MEDLINE Main subject: Algorithms / Microbiota Language: En Publication year: 2024 Document type: Article
