A comparison of RNA-Seq data preprocessing pipelines for transcriptomic predictions across independent studies.

Van, Richard; Alvarez, Daniel; Mize, Travis; Gannavarapu, Sravani; Chintham Reddy, Lohitha; Nasoz, Fatma; Han, Mira V

Affiliation

Van R; School of Life Sciences, University of Nevada Las Vegas, Las Vegas, NV, USA.
Alvarez D; Nevada Institute of Personalized Medicine, Las Vegas, NV, USA.
Mize T; Department of Computer Science, University of Nevada Las Vegas, Las Vegas, NV, USA.
Gannavarapu S; Nevada Institute of Personalized Medicine, Las Vegas, NV, USA.
Chintham Reddy L; Icahn School of Medicine at Mount Sinai, Institute for Genomic Health, New York City, NY, USA.
Nasoz F; Department of Computer Science, University of Nevada Las Vegas, Las Vegas, NV, USA.
Han MV; Nevada Institute of Personalized Medicine, Las Vegas, NV, USA.

BMC Bioinformatics; 25(1): 181, 2024 May 08.

Article in En | MEDLINE | ID: mdl-38720247

ABSTRACT

BACKGROUND:

RNA sequencing combined with machine learning techniques has provided a modern approach to the molecular classification of cancer. Class predictors, reflecting the disease class, can be constructed for known tissue types using the gene expression measurements extracted from cancer patients. One challenge of current cancer predictors is that they often perform suboptimally when integrating molecular datasets generated by different labs. Often, the data vary in quality, are procured differently, and contain unwanted noise that hampers the ability of a predictive model to extract useful information. Data preprocessing methods can be applied in an attempt to reduce these systematic variations and harmonize the datasets before they are used to build a machine learning model for resolving tissue of origin.

RESULTS:

We aimed to investigate the impact of data preprocessing steps (focusing on normalization, batch effect correction, and data scaling) through trial and comparison. Our goal was to improve the cross-study predictions of tissue of origin for common cancers on large-scale RNA-Seq datasets derived from thousands of patients and over a dozen tumor types. The results showed that the choice of data preprocessing operations affected the performance of the associated classifier models constructed for tissue of origin predictions in cancer.
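As a rough illustration of two of the preprocessing steps named above (normalization and data scaling), the following sketch may help. The abstract does not say which specific methods were compared, so the choices here (log2 counts-per-million normalization and per-gene z-score scaling) are assumptions, as are all function names and the toy counts. Batch effect correction is not shown.

```python
import math

def log_cpm(counts):
    """Normalize raw counts to log2(counts-per-million + 1).

    counts: list of samples, each a list of raw gene counts.
    This is an assumed normalization, not necessarily one used in the study.
    """
    normalized = []
    for sample in counts:
        library_size = sum(sample)
        if library_size <= 0:
            raise ValueError("each sample needs a positive total count")
        normalized.append(
            [math.log2(c / library_size * 1e6 + 1.0) for c in sample]
        )
    return normalized

def fit_gene_scaler(train):
    """Compute the per-gene mean and population std on the training set."""
    n_samples = len(train)
    n_genes = len(train[0])
    params = []
    for g in range(n_genes):
        column = [sample[g] for sample in train]
        mean = sum(column) / n_samples
        var = sum((x - mean) ** 2 for x in column) / n_samples
        std = math.sqrt(var) or 1.0  # avoid dividing by zero for constant genes
        params.append((mean, std))
    return params

def apply_gene_scaler(data, params):
    """Z-score each gene using parameters fitted elsewhere (e.g. on training data)."""
    return [
        [(x - mean) / std for x, (mean, std) in zip(sample, params)]
        for sample in data
    ]

# Hypothetical toy data: 2 training samples and 1 test sample, 2 genes each.
train_counts = [[10, 90], [30, 70]]
test_counts = [[20, 80]]

train_norm = log_cpm(train_counts)
scaler = fit_gene_scaler(train_norm)
train_scaled = apply_gene_scaler(train_norm, scaler)
test_scaled = apply_gene_scaler(log_cpm(test_counts), scaler)
```

Fitting the scaling parameters on the training set and reusing them on the test set is one possible design; scaling each dataset separately is another, and the two can give different cross-study results.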

CONCLUSION:

By using TCGA as a training set and applying data preprocessing methods, we demonstrated that batch effect correction improved performance measured by weighted F1-score in resolving tissue of origin against an independent GTEx test dataset. On the other hand, the use of data preprocessing operations worsened classification performance when the independent test dataset was aggregated from separate studies in ICGC and GEO. Therefore, based on our findings with these publicly available large-scale RNA-Seq datasets, the application of data preprocessing techniques to a machine learning pipeline is not always appropriate.
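For readers unfamiliar with the metric mentioned above, the weighted F1-score is the mean of the per-class F1 scores, with each class weighted by its number of true samples. The sketch below computes it in plain Python; the tissue labels and predictions are hypothetical and are not taken from the study.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n_true - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n_true / total) * f1
    return score

# Hypothetical tissue-of-origin labels for six test samples.
y_true = ["lung", "lung", "breast", "breast", "breast", "colon"]
y_pred = ["lung", "breast", "breast", "breast", "breast", "lung"]
score = weighted_f1(y_true, y_pred)
```

Because classes are weighted by support, a tissue type with many test samples influences the score more than a rare one, which matters when the class balance of the test set differs from that of the training set.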

Subjects

Machine Learning; Neoplasms; RNA-Seq; Humans; RNA-Seq/methods; Neoplasms/genetics; Transcriptome/genetics; Sequence Analysis, RNA/methods; Gene Expression Profiling/methods; Computational Biology/methods

Keywords

Batch effect correction; Cancer; Classification; Data scaling; Normalization; RNA-Seq

Full text: 1 | Database: MEDLINE | Main subject: Machine Learning / RNA-Seq / Neoplasms | Limits: Humans | Language: En | Year of publication: 2024 | Document type: Article