Scellpam: an R package/C++ library to perform parallel partitioning around medoids on scRNAseq data sets.
BMC Bioinformatics; 24(1): 342, 2023 Sep 14.
Article | En | MEDLINE | ID: mdl-37710192
BACKGROUND: Partitioning around medoids (PAM) is one of the most widely used and successful clustering methods in many fields. One of its key advantages is that it requires only a distance or dissimilarity between individuals; moreover, because cluster centers are actual points in the data set, they can be taken as reliable representatives of their classes. However, its wider application is hampered by the large amount of memory needed to store the distance matrix (quadratic in the number of individuals), by the high computational cost of computing that matrix and, less importantly, by the cost of the clustering algorithm itself.

RESULTS: New software has been developed to address these issues. The software, released under the GPL license and usable as either an R package or a C++ library, calculates the distance matrix in parallel for several distances/dissimilarities ([Formula: see text], [Formula: see text], Pearson, cosine and weighted Euclidean) and also implements a parallel fast version of PAM (FASTPAM1), using any data type to reduce memory usage. Moreover, the parallel implementation uses all the cores available in modern computers, which greatly reduces execution time. Besides its general application, the software is especially useful for processing data from single-cell experiments. It has been tested on problems including the clustering of single-cell experiments with up to 289,000 cells and the expression of about 29,000 genes per cell.

CONCLUSIONS: Execution times have been compared with those of other current packages. The software greatly outperforms the available R packages for distance matrix calculation and also improves on the packages that implement PAM itself.
The software is available as an R package at https://CRAN.R-project.org/package=scellpam and as C++ libraries at https://github.com/JdMDE/jmatlib and https://github.com/JdMDE/ppamlib. The package is useful for single-cell RNA-seq studies, but it is also applicable in other contexts where clustering of large data sets is required.
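To make the quadratic memory requirement mentioned in the background concrete, the following Python sketch estimates the storage needed for a distance matrix over the largest data set cited in the abstract (289,000 cells). It is only an illustration: the function name `distance_matrix_bytes`, the choice of 4-byte and 8-byte floating-point values, and the option of storing only the lower triangle (including the diagonal) of a symmetric matrix are assumptions made here, not details taken from the abstract or from the package's API.

```python
def distance_matrix_bytes(n_individuals: int,
                          bytes_per_value: int = 8,
                          symmetric: bool = False) -> int:
    """Bytes needed to store a distance matrix over n_individuals points.

    Hypothetical helper for illustration only.
    symmetric=False stores all n * n values; symmetric=True stores only
    the lower triangle including the diagonal, n * (n + 1) / 2 values.
    """
    if n_individuals < 0 or bytes_per_value <= 0:
        raise ValueError("n_individuals must be >= 0 and bytes_per_value > 0")
    if symmetric:
        n_values = n_individuals * (n_individuals + 1) // 2
    else:
        n_values = n_individuals ** 2
    return n_values * bytes_per_value


if __name__ == "__main__":
    n_cells = 289_000  # largest experiment size cited in the abstract
    # Assumed storage layouts: full matrix of doubles vs. triangle of floats.
    for label, width, sym in [("full, 8-byte values", 8, False),
                              ("lower triangle, 4-byte values", 4, True)]:
        size_gb = distance_matrix_bytes(n_cells, width, sym) / 1e9
        print(f"{label}: {size_gb:.1f} GB")
```

Under these assumptions, a full matrix of 8-byte values for 289,000 cells needs roughly 668 GB, while a lower triangle of 4-byte values needs roughly 167 GB. This is consistent with the abstract's point that the choice of data type matters for memory usage.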
Full text: 1
Database: MEDLINE
Main subject: Software / Single-Cell Gene Expression Analysis
Language: En
Journal: BMC Bioinformatics
Journal subject: MEDICAL INFORMATICS
Year: 2023
Document type: Article