On triangle inequalities of correlation-based distances for gene expression profiles.

Chen, Jiaxing; Ng, Yen Kaow; Lin, Lu; Zhang, Xianglilan; Li, Shuaicheng

Chen J; Department of Computer Science, City University of Hong Kong, Hong Kong, China.
Ng YK; Department of Computer Science, Beijing Normal University - Hong Kong Baptist University United International College, Zhuhai, People's Republic of China.
Lin L; Department of Computer Science, City University of Hong Kong, Hong Kong, China.
Zhang X; Department of Computer Science, City University of Hong Kong, Hong Kong, China.
Li S; State Key Laboratory of Pathogen and Biosecurity, Beijing Institute of Microbiology and Epidemiology, Beijing, 100071, People's Republic of China. zhangxianglilan@gmail.com.

BMC Bioinformatics; 24(1): 40, 2023 Feb 08.

Article in English | MEDLINE | ID: mdl-36755234

ABSTRACT

BACKGROUND:

Distance functions are fundamental for evaluating the differences between gene expression profiles. Such a function outputs a low value if the profiles are strongly correlated, either negatively or positively, and a high value otherwise. One popular distance function is the absolute correlation distance, [Formula see text], where [Formula see text] is a similarity measure, such as Pearson or Spearman correlation. However, the absolute correlation distance fails to fulfill the triangle inequality, which would have guaranteed better performance at vector quantization, allowed fast data localization, and accelerated data clustering.
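The failure of the triangle inequality can be illustrated with a small numerical counterexample. The sketch below assumes the commonly used form of the absolute correlation distance, one minus the absolute value of the Pearson correlation. The formula itself is not shown in this record, so that form is an assumption. The helper names `pearson` and `abs_corr_distance` and the three toy profiles are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def abs_corr_distance(x, y):
    # ASSUMPTION: absolute correlation distance taken as 1 - |rho|.
    return 1.0 - abs(pearson(x, y))

# Toy profiles: x and z are uncorrelated, y lies "halfway" between them.
x = [1.0, -1.0, 0.0]
z = [1.0, 1.0, -2.0]
y = [a / math.sqrt(2) + b / math.sqrt(6) for a, b in zip(x, z)]

d_xz = abs_corr_distance(x, z)  # direct distance from x to z
d_xy = abs_corr_distance(x, y)  # first leg of the detour through y
d_yz = abs_corr_distance(y, z)  # second leg of the detour through y

# The triangle inequality would require d_xz <= d_xy + d_yz.
violates = d_xz > d_xy + d_yz
print(d_xz, d_xy + d_yz, violates)
```

Here the detour through `y` is shorter than the direct distance from `x` to `z`, which is exactly what a metric forbids.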

RESULTS:

In this work, we propose [Formula see text] as an alternative. We prove that [Formula see text] satisfies the triangle inequality when [Formula see text] represents Pearson correlation, Spearman correlation, or Cosine similarity. We show [Formula see text] to be better than [Formula see text], another variant of [Formula see text] that satisfies the triangle inequality, both analytically and experimentally. We empirically compared [Formula see text] with [Formula see text] in gene clustering and sample clustering experiments on real-world biological data. The two distances performed similarly in both gene clustering and sample clustering under hierarchical clustering and PAM (partitioning around medoids) clustering. However, [Formula see text] demonstrated more robust clustering. According to the bootstrap experiment, [Formula see text] generated more robust sample pair partitions more frequently (P-value [Formula see text]). The statistics on the time a class "dissolved" also support the advantage of [Formula see text] in robustness.
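A triangle-inequality claim of this kind can be spot-checked numerically, though a random search can only fail to find a violation and never proves the inequality. The formulas are not shown in this record, so the sketch below assumes a square-root variant, the square root of one minus the absolute Pearson correlation. This form is an assumption made for illustration only, not a statement of the authors' definition. All function names and parameters are hypothetical.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def sqrt_abs_corr_distance(x, y):
    # ASSUMPTION: candidate distance taken as sqrt(1 - |rho|).
    # max(0, ...) guards against tiny negative values from rounding.
    return math.sqrt(max(0.0, 1.0 - abs(pearson(x, y))))

def worst_triangle_gap(distance, trials=2000, dim=6, seed=0):
    """Largest d(x,z) - d(x,y) - d(y,z) over random triples.

    A positive result would be a violation of the triangle inequality.
    """
    rng = random.Random(seed)
    worst = -math.inf
    for _ in range(trials):
        x, y, z = ([rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(3))
        gap = distance(x, z) - distance(x, y) - distance(y, z)
        worst = max(worst, gap)
    return worst

worst_gap = worst_triangle_gap(sqrt_abs_corr_distance)
print(worst_gap)
```

A non-positive `worst_gap` (up to rounding) is consistent with the inequality holding on the sampled triples; the analytical proof referred to in the abstract is what establishes it in general.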