Category encoding method to select feature genes for the classification of bulk and single-cell RNA-seq data.

Zhou, Yan; Zhang, Li; Xu, Jinfeng; Zhang, Jun; Yan, Xiaodong

Affiliation

Zhou Y; Shenzhen Key Laboratory of Advanced Machine Learning and Applications, Institute of Statistical Sciences, College of Mathematics and Statistics, Shenzhen University, Shenzhen, China.
Zhang L; Shenzhen Key Laboratory of Advanced Machine Learning and Applications, Institute of Statistical Sciences, College of Mathematics and Statistics, Shenzhen University, Shenzhen, China.
Xu J; Department of Mathematics, Hong Kong University, Pokfulam, Hong Kong.
Zhang J; Shenzhen Key Laboratory of Advanced Machine Learning and Applications, Institute of Statistical Sciences, College of Mathematics and Statistics, Shenzhen University, Shenzhen, China.
Yan X; Zhongtai Securities Institute for Financial Studies, Shandong University, Jinan, China.

Stat Med; 40(18): 4077-4089, 2021 Aug 15.

Article in English | MEDLINE | ID: mdl-34028849

ABSTRACT

Bulk and single-cell RNA-seq (scRNA-seq) data are being used as alternatives to traditional technology in biological and medical research, for example for the detection of differentially expressed (DE) genes. Several statistical methods have been developed for the classification of bulk and single-cell RNA-seq data, and feature genes are vitally important for this classification. The majority of genes are not DE and are thus irrelevant for class distinction. Removing these irrelevant genes is therefore necessary to improve classification performance and save computation time, and it also aids the detection of the important feature genes. Widely used schemes in the literature, such as the BSS/WSS (BW) method, assume that data are normally distributed and may not be suitable for bulk and single-cell RNA-seq data. In this article, a category encoding (CAEN) method is proposed to select feature genes for bulk and single-cell RNA-seq data classification. This novel method encodes categories by employing the rank of sequence samples for each gene in each class. Correlation coefficients between gene and class are then computed from the rank of the samples and the new rank of the categories. The genes with the highest correlation coefficients are taken as feature genes, which are the most effective for classifying bulk and single-cell RNA-seq datasets. The sure screening method was also established for rank consistency properties of the proposed CAEN method. Simulation studies show that the classifier using the proposed CAEN method performs better than, or at least as well as, the existing methods in most settings. Existing real datasets were analyzed, with the results demonstrating superior performance of the proposed method over current competitors. The method has been implemented in an R package named "CAEN" to facilitate wide use.
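
The rank-based encoding described in the abstract can be illustrated with a small Python sketch. This is not the authors' implementation or the API of the "CAEN" R package. It is one possible reading of the abstract under stated assumptions: samples are ranked within each gene, each class is re-coded by the order of its mean sample rank, and the gene is scored by the absolute Spearman correlation between the sample ranks and the encoded categories. All function names (average_ranks, pearson, encoded_correlation, select_feature_genes) and the toy counts are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def encoded_correlation(counts, labels):
    """Score one gene (assumed scheme, not taken from the paper)."""
    sample_ranks = average_ranks(counts)
    classes = sorted(set(labels))
    mean_rank = {
        c: mean(r for r, lab in zip(sample_ranks, labels) if lab == c)
        for c in classes
    }
    # Assumption: each class is encoded by the rank of its mean sample rank.
    code = dict(zip(classes, average_ranks([mean_rank[c] for c in classes])))
    encoded = [code[lab] for lab in labels]
    # Spearman correlation = Pearson correlation of the two rank vectors.
    return abs(pearson(sample_ranks, average_ranks(encoded)))


def select_feature_genes(count_matrix, labels, n_genes):
    """Return indices of the n_genes highest-scoring genes and all scores."""
    scores = [encoded_correlation(row, labels) for row in count_matrix]
    top = sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)
    return top[:n_genes], scores


# Toy data (made up): 3 genes x 9 samples, 3 classes of 3 samples each.
labels = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]
count_matrix = [
    [1, 2, 0, 10, 12, 11, 30, 28, 35],  # DE, increasing A < B < C
    [5, 6, 5, 6, 5, 6, 5, 6, 5],        # not DE
    [20, 22, 21, 3, 2, 4, 9, 10, 8],    # DE, class order B < C < A
]
top, scores = select_feature_genes(count_matrix, labels, n_genes=2)
print(top, [round(s, 3) for s in scores])
```

In this toy example the third gene is differentially expressed, but its class means do not follow the alphabetical label order. Re-coding the classes by their mean rank lets a rank correlation still pick it up, which is the motivation for encoding categories rather than using arbitrary class labels.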

Subject(s)

Gene Expression Profiling; Computer Simulation; RNA-Seq; Sequence Analysis, RNA

Keywords

CAEN; classification; feature selection; single-cell RNA-seq


Full text: 1 | Collection: 01-international | Database: MEDLINE | Main subject: Gene Expression Profiling | Language: En | Journal: Stat Med | Year: 2021 | Document type: Article | Affiliation country: China
