Cross-Clustering: A Partial Clustering Algorithm with Automatic Estimation of the Number of Clusters.

Tellaroli, Paola; Bazzi, Marco; Donato, Michele; Brazzale, Alessandra R; Draghici, Sorin

Affiliation

Tellaroli P; Department of Statistical Sciences, University of Padova, Padova, Italy.
Bazzi M; Department of Statistical Sciences, University of Padova, Padova, Italy.
Donato M; Department of Computer Science, Wayne State University, Detroit, MI, United States of America.
Brazzale AR; Department of Statistical Sciences, University of Padova, Padova, Italy.
Draghici S; Department of Computer Science, Wayne State University, Detroit, MI, United States of America.

PLoS One; 11(3): e0152333, 2016.

Article in En | MEDLINE | ID: mdl-27015427

ABSTRACT

Four of the most common limitations of the many available clustering methods are i) the lack of a proper strategy to deal with outliers; ii) the need for a good a priori estimate of the number of clusters to obtain reasonable results; iii) the lack of a method able to detect when partitioning a specific data set is not appropriate; and iv) the dependence of the result on the initialization. Here we propose Cross-clustering (CC), a partial clustering algorithm that overcomes these four limitations by combining the principles of two well-established hierarchical clustering algorithms: Ward's minimum variance and Complete-linkage. We validated CC by comparing it with a number of existing clustering methods, including Ward's and Complete-linkage. We show, on both simulated and real datasets, that CC performs better than the other methods in identifying the correct number of clusters, identifying outliers, and determining real cluster memberships. We applied CC to samples, in order to identify disease subtypes, and to gene profiles, in order to determine groups of genes with the same behavior. Results obtained on a non-biological dataset show that the method is general enough to be used successfully in such diverse applications. The algorithm has been implemented in the statistical language R and is freely available from the CRAN contributed packages repository.
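The general idea of combining a Ward partition with a Complete-linkage partition and leaving disagreeing points unassigned can be illustrated with a minimal Python sketch. This is not the authors' CC algorithm or their R package: the function name `consensus_partial_clustering`, its parameters, and the majority-overlap rule are assumptions made purely for illustration, and unlike CC the sketch requires both cluster counts as input rather than estimating them. It relies only on SciPy's `linkage` and `fcluster`.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def consensus_partial_clustering(X, k_ward, k_complete):
    """Toy consensus of a Ward and a Complete-linkage partition (hypothetical helper).

    Returns one integer label per row of X. Label 0 marks points left
    unassigned because the two partitions disagree about them.
    """
    X = np.asarray(X, dtype=float)
    # Cut each dendrogram into the requested number of clusters (labels start at 1).
    ward = fcluster(linkage(X, method="ward"), t=k_ward, criterion="maxclust")
    comp = fcluster(linkage(X, method="complete"), t=k_complete, criterion="maxclust")

    labels = np.zeros(len(X), dtype=int)
    for w in np.unique(ward):
        in_w = ward == w
        # Complete-linkage cluster that overlaps this Ward cluster the most.
        ids, counts = np.unique(comp[in_w], return_counts=True)
        best = ids[np.argmax(counts)]
        # Keep only the points on which both partitions agree.
        labels[in_w & (comp == best)] = w
    return labels


if __name__ == "__main__":
    # Two tight groups of five points plus one isolated point (made-up data).
    blob = np.array([[0, 0], [0.5, 0], [0, 0.5], [0.5, 0.5], [0.25, 0.25]])
    X = np.vstack([blob, blob + [10, 0], [[0.25, 6.0]]])
    print(consensus_partial_clustering(X, k_ward=2, k_complete=3))
```

In this made-up example the Ward cut at two clusters absorbs the isolated point into the nearer group, while the finer Complete-linkage cut keeps it apart, so the sketch leaves it with label 0. Asking Complete-linkage for more clusters than Ward is what lets such points surface as disagreements.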

Subject(s)

Algorithms; Data Interpretation, Statistical; Genetic Linkage; Brain Neoplasms/genetics; Brain Neoplasms/metabolism; Breast Neoplasms/genetics; Breast Neoplasms/metabolism; Cluster Analysis; Cohort Studies; Computer Simulation; Databases, Factual; Female; Gene Expression Profiling; Gene Expression Regulation; Gene Expression Regulation, Neoplastic; Humans; Kaplan-Meier Estimate; Models, Genetic; Models, Statistical; Pattern Recognition, Automated/methods; ROC Curve; Software

Full text: 1
Collection: 01-internacional
Database: MEDLINE
Main subject: Algorithms / Data Interpretation, Statistical / Genetic Linkage
Type of study: Etiology_studies / Incidence_studies / Observational_studies / Prognostic_studies / Risk_factors_studies
Limits: Female / Humans
Language: En
Journal: PLoS One
Year: 2016
Document type: Article