ForestSubtype: a cancer subtype identifying approach based on high-dimensional genomic data and a parallel random forest.

Luo, Junwei; Feng, Yading; Wu, Xuyang; Li, Ruimin; Shi, Jiawei; Chang, Wenjing; Wang, Junfeng

Affiliation

Luo J; School of Software, Henan Polytechnic University, Jiaozuo, China.
Feng Y; School of Software, Henan Polytechnic University, Jiaozuo, China.
Wu X; School of Software, Henan Polytechnic University, Jiaozuo, China.
Li R; School of Software, Henan Polytechnic University, Jiaozuo, China.
Shi J; School of Software, Henan Polytechnic University, Jiaozuo, China.
Chang W; School of Software, Henan Polytechnic University, Jiaozuo, China.
Wang J; School of Software, Henan Polytechnic University, Jiaozuo, China. wangjunfeng@hpu.edu.cn.

BMC Bioinformatics ; 24(1): 289, 2023 Jul 19.

Article in English | MEDLINE | ID: mdl-37468832

ABSTRACT

BACKGROUND:

Cancer subtype classification supports personalized cancer treatment. Although some approaches have been developed to classify cancer subtypes from high-dimensional gene expression data, satisfactory classification results remain difficult to obtain. Meanwhile, some cancers have been well studied and divided into subtypes that are adopted by most researchers. This prior knowledge is therefore valuable for identifying new, meaningful subtypes.

RESULTS:

In this paper, we present ForestSubtype, an approach that combines parallel random forests (RF) and an autoencoder to identify cancer subtypes from high-dimensional gene expression data. First, ForestSubtype uses the parallel RF and prior knowledge of cancer subtypes to train a model and extract significant candidate features. Second, ForestSubtype uses a random forest as the base model and runs ten parallel random forests, each of which computes a weight for every feature and ranks the features separately. The intersection of the features with the larger weights output by the ten parallel random forests is then taken as the set of candidate features for the subsequent steps. Third, ForestSubtype uses an autoencoder to condense the selected features into two-dimensional data. Fourth, ForestSubtype applies k-means++ to obtain the new cancer subtype identification results. Breast cancer gene expression data from The Cancer Genome Atlas are used for training and validation, and an independent breast cancer dataset from the Molecular Taxonomy of Breast Cancer International Consortium is used for testing. Additionally, we use two other cancer datasets to validate the generalizability of ForestSubtype. ForestSubtype outperforms the two other methods compared in terms of cluster distribution and both internal and external evaluation metrics. The open-source code is available at https://github.com/lffyd/ForestSubtype .
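The feature-intersection step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state how many top-ranked features each forest contributes, so the cutoff `k`, the function names, and the gene names and weights below are all hypothetical, and only three toy forests are shown instead of ten.

```python
from typing import Dict, List, Sequence, Set


def top_k_features(weights: Dict[str, float], k: int) -> Set[str]:
    """Return the names of the k features with the largest weights."""
    ranked = sorted(weights, key=weights.get, reverse=True)
    return set(ranked[:k])


def intersect_top_features(
    per_forest_weights: Sequence[Dict[str, float]], k: int
) -> List[str]:
    """Keep only the features that every forest ranks in its top k."""
    if not per_forest_weights:
        return []
    selected = top_k_features(per_forest_weights[0], k)
    for weights in per_forest_weights[1:]:
        selected &= top_k_features(weights, k)
    return sorted(selected)


# Toy feature weights from three hypothetical forests (assumed values;
# the abstract describes ten parallel forests).
forests = [
    {"GENE_A": 0.40, "GENE_B": 0.30, "GENE_C": 0.20, "GENE_D": 0.10},
    {"GENE_A": 0.35, "GENE_C": 0.33, "GENE_B": 0.22, "GENE_D": 0.10},
    {"GENE_B": 0.45, "GENE_A": 0.30, "GENE_D": 0.15, "GENE_C": 0.10},
]

# The cutoff k=2 is an assumption chosen for illustration only.
candidates = intersect_top_features(forests, k=2)
print(candidates)
```

In this toy input only GENE_A appears in the top two of every forest, so it is the sole candidate kept. Taking the intersection rather than the union favors features that are consistently important across forests, at the cost of a smaller candidate set.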

CONCLUSIONS:

Our work shows that combining high-dimensional gene expression data with parallel random forests and an autoencoder, guided by a priori knowledge, can identify new subtypes more effectively than existing cancer subtype classification methods.

Subjects

Breast Neoplasms; Random Forest Algorithm; Humans; Female; Genomics; Software

Keywords

Auto Encoder; Cancer subtyping; Gene expression data; Machine learning; Random forest

Full text: 1 Collections: 01-international Database: MEDLINE Main subject: Breast Neoplasms / Random Forest Algorithm Type of study: Clinical_trials Limits: Female / Humans Language: En Year of publication: 2023 Document type: Article
