PCA via joint graph Laplacian and sparse constraint: Identification of differentially expressed genes and sample clustering on gene expression data.

Feng, Chun-Mei; Xu, Yong; Hou, Mi-Xiao; Dai, Ling-Yun; Shang, Jun-Liang

Affiliation

Feng CM; Bio-Computing Research Center, Harbin Institute of Technology, Shenzhen, 518055, Guangdong, People's Republic of China.
Xu Y; School of Information Science and Engineering, Qufu Normal University, Rizhao, 276826, People's Republic of China.
Hou MX; Bio-Computing Research Center, Harbin Institute of Technology, Shenzhen, 518055, Guangdong, People's Republic of China. yongxu@ymail.com.
Dai LY; Key Laboratory of Network Oriented Intelligent Computation, Shenzhen, 518055, People's Republic of China. yongxu@ymail.com.
Shang JL; Bio-Computing Research Center, Harbin Institute of Technology, Shenzhen, 518055, Guangdong, People's Republic of China.

BMC Bioinformatics ; 20(Suppl 22): 716, 2019 Dec 30.

Article in English | MEDLINE | ID: mdl-31888433

ABSTRACT

BACKGROUND:

In recent years, identification of differentially expressed genes and sample clustering have become hot topics in bioinformatics. Principal Component Analysis (PCA) is widely used to analyze gene expression data. However, it has two limitations. First, it does not explore the geometric structure hidden in the data, e.g., the pair-wise distances between data points, even though this information can facilitate sample clustering. Second, the Principal Components (PCs) determined by PCA are dense, which makes them hard to interpret, whereas only a few genes are related to the cancer. Identifying a handful of differentially expressed genes and finding new cancer biomarkers is of great significance for the early diagnosis and treatment of cancer.

RESULTS:

In this study, a new method, gLSPCA, is proposed that integrates both a graph Laplacian and a sparse constraint into PCA. On the one hand, gLSPCA improves clustering accuracy by exploring the internal geometric structure of the data; on the other hand, it identifies differentially expressed genes by imposing a sparsity constraint on the PCs.
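The abstract does not state the gLSPCA objective or its solver, so the following is only a minimal sketch of the two ingredients it names, under stated assumptions. It builds a k-nearest-neighbor graph Laplacian over samples, solves a graph-regularized PCA of the form min ||X - U V^T||_F^2 + alpha * tr(V^T L V) with V^T V = I (the gLPCA-style closed form via an eigendecomposition), and then soft-thresholds the gene loadings as a simple stand-in for the sparsity constraint. The function names, the parameters `alpha`, `threshold`, and `k`, and the thresholding step are all hypothetical and are not taken from the paper.

```python
import numpy as np


def knn_graph_laplacian(X, k=5):
    """Unnormalized Laplacian L = D - W of a symmetric kNN graph.

    X is genes x samples; the graph nodes are the samples (columns).
    """
    n = X.shape[1]
    # Pairwise squared Euclidean distances between samples.
    sq = np.sum(X ** 2, axis=0)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)
    np.fill_diagonal(d2, np.inf)  # a sample is not its own neighbor
    # Binary adjacency: connect each sample to its k nearest neighbors.
    idx = np.argsort(d2, axis=1)[:, :k]
    W = np.zeros((n, n))
    W[np.repeat(np.arange(n), k), idx.ravel()] = 1.0
    W = np.maximum(W, W.T)  # symmetrize the graph
    return np.diag(W.sum(axis=1)) - W


def graph_sparse_pca_sketch(X, n_components=2, alpha=1.0, threshold=0.1, k=5):
    """Graph-regularized PCA followed by soft-thresholding (assumed stand-in).

    X must be genes x samples with each gene (row) centered.
    Returns sparse gene loadings U (genes x c) and sample embedding V (samples x c).
    """
    L = knn_graph_laplacian(X, k)
    # With U = X V, the objective reduces to minimizing tr(V^T G V).
    G = -(X.T @ X) + alpha * L
    _, eigvecs = np.linalg.eigh(G)  # eigenvalues in ascending order
    V = eigvecs[:, :n_components]   # smallest eigenvalues, orthonormal columns
    U = X @ V
    # Assumption: sparsify by soft-thresholding relative to each column's max.
    t = threshold * np.abs(U).max(axis=0, keepdims=True)
    U_sparse = np.sign(U) * np.maximum(np.abs(U) - t, 0.0)
    return U_sparse, V


# Synthetic data: 200 genes, 40 samples, 10 genes shifted in the first 20 samples.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))
X[:10, :20] += 3.0
X -= X.mean(axis=1, keepdims=True)

U_sparse, V = graph_sparse_pca_sketch(X, n_components=2, alpha=1.0, threshold=0.1)
selected = np.flatnonzero(np.abs(U_sparse).sum(axis=1) > 0)
print(len(selected), "genes retain a nonzero loading")
```

In this sketch the rows of `V` could be passed to a clustering routine such as k-means for sample clustering, and genes with nonzero rows in `U_sparse` would be the candidates for differential expression. A joint optimization of both penalties, as the method's name suggests, would differ from this two-step approximation.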

CONCLUSIONS:

Experiments comparing gLSPCA with existing methods, including Z-SPCA, GPower, PathSPCA, SPCArt, and gLPCA, are performed on real datasets of both pancreatic cancer (PAAD) and head & neck squamous carcinoma (HNSC). The results demonstrate that gLSPCA is effective in identifying differentially expressed genes and in sample clustering. In addition, applying gLSPCA to these datasets provides several new clues for exploring the causative factors of PAAD and HNSC.

Subjects

Algorithms; Databases, Genetic; Gene Expression Profiling; Gene Expression Regulation, Neoplastic; Principal Component Analysis; Cluster Analysis; Gene Expression; Humans; Neoplasms/genetics; Protein Interaction Maps

Keywords

Differentially expressed genes; Gene expression data; Graph Laplacian; Principal component analysis; Sparse constraint

Full text: 1 Database: MEDLINE Main subject: Algorithms / Gene Expression Regulation, Neoplastic / Gene Expression Profiling / Principal Component Analysis / Databases, Genetic Type of study: Diagnostic_studies / Prognostic_studies / Screening_studies Limits: Humans Language: En Year of publication: 2019 Document type: Article