Feature-specific quantile normalization and feature-specific mean-variance normalization deliver robust bi-directional classification and feature selection performance between microarray and RNAseq data.

Skubleny, Daniel; Ghosh, Sunita; Spratlin, Jennifer; Schiller, Daniel E; Rayat, Gina R

Affiliation

Skubleny D; Department of Surgery, Faculty of Medicine and Dentistry, University of Alberta, Edmonton, AB, T6G 2R3, Canada. skubleny@ualberta.ca.
Ghosh S; Department of Oncology, Faculty of Medicine and Dentistry, University of Alberta, Edmonton, AB, T6G 2R3, Canada.
Spratlin J; Department of Mathematical and Statistical Sciences, Faculty of Science, University of Alberta, Edmonton, AB, T6G 2R3, Canada.
Schiller DE; Department of Oncology, Faculty of Medicine and Dentistry, University of Alberta, Edmonton, AB, T6G 2R3, Canada.
Rayat GR; Department of Surgery, Faculty of Medicine and Dentistry, University of Alberta, Edmonton, AB, T6G 2R3, Canada.

BMC Bioinformatics ; 25(1): 136, 2024 Mar 29.

Article in English | MEDLINE | ID: mdl-38549046

ABSTRACT

BACKGROUND:

Cross-platform normalization seeks to minimize technological bias between microarray and RNAseq whole-transcriptome data. Incorporating multiple gene expression platforms permits external validation of experimental findings, and augments training sets for machine learning models. Here, we compare the performance of Feature Specific Quantile Normalization (FSQN) to a previously used but unvalidated and uncharacterized method we label as Feature Specific Mean Variance Normalization (FSMVN). We evaluate the performance of these methods for bidirectional normalization in the context of nested feature selection.
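The two normalization ideas named above can be illustrated with a minimal sketch. It assumes samples are rows and genes are columns, that FSMVN rescales each gene in the source platform to the mean and standard deviation of the same gene in the target platform, and that FSQN maps each gene's source values onto the empirical quantiles of that gene in the target platform. The function names `fsmvn` and `fsqn` and the simulated data are hypothetical and are not taken from the article or from any published package.

```python
import numpy as np

def fsmvn(source, target):
    """Rescale each gene (column) of `source` to the mean and standard
    deviation of the same gene in `target` (assumed reading of FSMVN)."""
    src_mean, src_sd = source.mean(axis=0), source.std(axis=0, ddof=1)
    tgt_mean, tgt_sd = target.mean(axis=0), target.std(axis=0, ddof=1)
    # Constant genes have zero spread; avoid dividing by zero.
    src_sd = np.where(src_sd == 0, 1.0, src_sd)
    return (source - src_mean) / src_sd * tgt_sd + tgt_mean

def fsqn(source, target):
    """Map each gene (column) of `source` onto the empirical quantiles of
    the same gene in `target` (assumed reading of FSQN)."""
    n_samples, n_genes = source.shape
    out = np.empty((n_samples, n_genes), dtype=float)
    for j in range(n_genes):
        # Rank of each source sample within gene j (ties broken arbitrarily).
        ranks = source[:, j].argsort().argsort()
        # Convert ranks to probabilities strictly inside (0, 1).
        probs = (ranks + 0.5) / n_samples
        # Read off the matching quantiles of the target distribution.
        out[:, j] = np.quantile(target[:, j], probs)
    return out

# Simulated stand-ins for two platforms measuring the same 4 genes.
rng = np.random.default_rng(0)
microarray = rng.normal(8.0, 1.5, size=(50, 4))
rnaseq = rng.normal(3.0, 4.0, size=(40, 4))

rnaseq_mv = fsmvn(rnaseq, microarray)  # RNAseq rescaled toward microarray
rnaseq_qn = fsqn(rnaseq, microarray)   # RNAseq quantile-mapped to microarray
```

Swapping the two arguments normalizes in the opposite direction, which is the sense in which such a procedure is bidirectional.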

RESULTS:

FSQN and FSMVN provided clinically equivalent bidirectional model performance with and without feature selection for colon CMS and breast PAM50 classification. Using principal component analysis, we determine that these methods eliminate batch effects related to technological platforms. Without feature selection, no statistical difference was identified between the performance of FSQN and FSMVN on cross-platform data and their performance on within-platform distributions. Under optimal feature selection conditions, the balanced accuracy of FSQN and FSMVN was statistically equivalent to within-platform distribution performance in multivariable linear regression analysis. FSQN and FSMVN also provided similar performance to within-platform distributions as the number of selected genes used to create models decreased.
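Balanced accuracy, the metric reported above, is conventionally defined as the mean of per-class recall, so that rare subtypes weigh as much as common ones. The sketch below uses that standard definition; the abstract does not state how the authors computed it, and the labels are toy values chosen only for illustration.

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))

# Toy labels: one of two CMS1 samples is misclassified as CMS2.
y_true = ["CMS1", "CMS1", "CMS2", "CMS2", "CMS2", "CMS2"]
y_pred = ["CMS1", "CMS2", "CMS2", "CMS2", "CMS2", "CMS2"]

plain = float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))  # 5/6
balanced = balanced_accuracy(y_true, y_pred)  # (0.5 + 1.0) / 2 = 0.75
```

In this toy case the plain accuracy is 5/6 while the balanced accuracy is 0.75, because the error falls on the smaller class.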

CONCLUSIONS:

In the context of generating supervised machine learning classifiers for molecular subtypes, FSQN and FSMVN are equally effective. Under optimal modeling conditions, FSQN and FSMVN provide equivalent model accuracy performance on cross-platform normalization data compared to within-platform data. Using cross-platform data should still be approached with caution as subtle performance differences may exist depending on the classification problem, training, and testing distributions.

Subjects

Gene Expression Profiling; Transcriptome; Gene Expression Profiling/methods; Microarray Analysis; Linear Models

Keywords

Cross-platform normalization; FSMVN; FSQN; Feature selection; Mean; Microarray; Molecular classification; Quantile normalization; RNAseq; Variance

Full text: 1 Collections: 01-international Database: MEDLINE Main subject: Gene Expression Profiling / Transcriptome Language: En Journal: BMC Bioinformatics Journal subject: MEDICAL INFORMATICS Year of publication: 2024 Document type: Article Country of affiliation: Canada
