Identifying clusters in genomics data by recursive partitioning.
Stat Appl Genet Mol Biol; 12(5): 637-52, 2013 Oct 01.
Article in English | MEDLINE | ID: mdl-23942354
Genomics studies frequently involve clustering of molecular data to identify groups, but common clustering methods such as K-means clustering and hierarchical clustering do not determine the number of clusters. Methods for estimating the number of clusters typically focus on identifying the global structure in the data; however, the discovery of substructures within clusters may also be of great biological interest. We propose a novel method, Partitioning Algorithm based on Recursive Thresholding (PART), that recursively uncovers distinct subgroups within the groups already identified. Outliers are common in high-dimensional genomics data and may mask the presence of substructure within a cluster. A crucial feature of the algorithm is the introduction of tentative splits of clusters to isolate outliers that might otherwise halt the recursion prematurely. The method is demonstrated on simulated data and on a wide range of real gene expression microarray data sets in which the correct clusters were known in advance. On simulated data, when subclusters are present and the variance is large or varies between the clusters, the proposed method performs better than two established global methods. On the real data sets, the overall performance of PART is superior to that of the global methods when used in combination with hierarchical clustering. The method is implemented in the R package clusterGenomics and is freely available from CRAN (The Comprehensive R Archive Network).
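The recursion with tentative splits described above can be illustrated with a minimal Python sketch. This is not the clusterGenomics implementation: the abstract does not specify the stopping rule, so the sketch assumes a simple stand-in (a 2-means split is accepted when it reduces the within-cluster sum of squares below a fixed fraction of the original). The names `part_sketch`, `two_means`, `min_size`, `threshold` and `allow_tentative` are hypothetical and chosen only for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(pts):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(pts) for c in zip(*pts))

def wss(pts):
    """Within-cluster sum of squared distances to the centroid."""
    c = centroid(pts)
    return sum(sq_dist(p, c) for p in pts)

def two_means(pts, iters=25):
    """Deterministic 2-means: start from two far-apart points, then iterate."""
    a = max(pts, key=lambda p: sq_dist(p, centroid(pts)))
    b = max(pts, key=lambda p: sq_dist(p, a))
    for _ in range(iters):
        left = [p for p in pts if sq_dist(p, a) <= sq_dist(p, b)]
        right = [p for p in pts if sq_dist(p, a) > sq_dist(p, b)]
        if not left or not right:
            break
        a, b = centroid(left), centroid(right)
    return left, right

def part_sketch(pts, min_size=3, threshold=0.15, allow_tentative=True):
    """Return (clusters, outliers) found by recursive splitting.

    Assumed stopping rule (not from the paper): a split is significant
    when wss(left) + wss(right) < threshold * wss(pts).
    """
    if len(pts) < min_size:
        return [], list(pts)          # too small to be a cluster: outliers
    left, right = two_means(pts)
    if not left or not right:
        return [list(pts)], []        # cannot be split at all
    total = wss(pts)
    significant = total > 0 and (wss(left) + wss(right)) / total < threshold
    if significant:
        results = [part_sketch(s, min_size, threshold) for s in (left, right)]
    else:
        if not allow_tentative:
            return [list(pts)], []
        # Tentative split: look one level deeper before giving up.
        results = [part_sketch(s, min_size, threshold, allow_tentative=False)
                   for s in (left, right)]
        if not any(len(c) > 1 for c, _ in results):
            return [list(pts)], []    # nothing found: discard the tentative split
    clusters = [c for cs, _ in results for c in cs]
    outliers = [o for _, os_ in results for o in os_]
    return clusters, outliers

# Toy 1-D data: two tight groups plus one point far from both.
data = [(0.0,), (0.1,), (0.2,), (0.3,),
        (10.0,), (10.1,), (10.2,), (10.3,), (25.0,)]
clusters, outliers = part_sketch(data)
print(len(clusters), outliers)
```

On this toy input the first 2-means split only separates the point at 25 from the rest and fails the assumed significance rule, so a run with `allow_tentative=False` stops immediately with a single cluster. With tentative splits enabled, the sketch looks one level deeper, finds the two groups, and reports the isolated point as an outlier.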
Full text: 1
Collections: 01-internacional
Database: MEDLINE
Main subject: Software / Gene Expression Profiling / Neoplasms
Type of study: Risk_factors_studies
Limits: Humans
Language: En
Publication year: 2013
Document type: Article