Identifying clusters in genomics data by recursive partitioning.

Nilsen, Gro; Borgan, Ornulf; Liestøl, Knut; Lingjærde, Ole Christian

Stat Appl Genet Mol Biol; 12(5): 637-52, 2013 Oct 01.

Article in English | MEDLINE | ID: mdl-23942354

ABSTRACT

Genomics studies frequently involve clustering of molecular data to identify groups, but common clustering methods such as K-means clustering and hierarchical clustering do not determine the number of clusters. Methods for estimating the number of clusters typically focus on identifying the global structure in the data; however, the discovery of substructures within clusters may also be of great biological interest. We propose a novel method, Partitioning Algorithm based on Recursive Thresholding (PART), that recursively uncovers distinct subgroups within the groups already identified. Outliers are common in high-dimensional genomics data and may mask the presence of substructure within a cluster. A crucial feature of the algorithm is the introduction of tentative splits of clusters to isolate outliers that might otherwise halt the recursion prematurely. The method is demonstrated on simulated data as well as on a wide range of real gene expression microarray data sets for which the correct clusters were known in advance. On simulated data, when subclusters are present and the variance is large or varies between the clusters, the proposed method performs better than two established global methods. On the real data sets, the overall performance of PART is superior to that of the global methods when used in combination with hierarchical clustering. The method is implemented in the R package clusterGenomics and is freely available from CRAN (The Comprehensive R Archive Network).
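The idea of recursive splitting with tentative splits can be illustrated with a toy sketch. This is not the published PART algorithm and not the clusterGenomics API: it works on one-dimensional data only, and the stopping rule (a variance-reduction threshold `min_gain`), the size limit `min_size`, and all function names are assumptions made for illustration. A split that isolates a group smaller than `min_size` is treated as tentative: the small group is set aside, the search continues in the remainder, and the split is undone if no further structure is found.

```python
from statistics import fmean


def wss(values):
    """Within-cluster sum of squared deviations from the mean."""
    if not values:
        return 0.0
    m = fmean(values)
    return sum((v - m) ** 2 for v in values)


def best_two_split(values):
    """Split sorted 1-D data into two contiguous groups with minimal total WSS."""
    s = sorted(values)
    cut = min(range(1, len(s)), key=lambda i: wss(s[:i]) + wss(s[i:]))
    return s[:cut], s[cut:]


def partition(values, min_size=3, min_gain=0.9):
    """Recursively partition 1-D data; returns a list of clusters (hypothetical rule)."""
    total = wss(values)
    if len(values) < 2 * min_size or total == 0.0:
        return [sorted(values)]
    left, right = best_two_split(values)
    # Assumed stopping rule: fraction of variance removed by the split.
    gain = 1.0 - (wss(left) + wss(right)) / total
    if gain < min_gain:
        return [sorted(values)]
    small, large = sorted((left, right), key=len)
    if len(small) < min_size:
        # Tentative split: set the small group aside and keep searching the rest.
        sub = partition(large, min_size, min_gain)
        if len(sub) == 1:
            return [sorted(values)]  # no substructure found: undo the split
        return sub + [small]  # substructure found: report small group separately
    return partition(left, min_size, min_gain) + partition(right, min_size, min_gain)


data = [0.0, 0.1, 0.2, 0.3, 5.0, 5.1, 5.2, 5.3, 50.0]
print(partition(data))
```

In this toy example the single extreme value 50.0 dominates the first split. Because the recursion continues on the remaining eight values instead of stopping, the two tight groups near 0 and near 5 are still recovered.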

Subjects

Gene Expression Profiling; Neoplasms/genetics; Software; Algorithms; Cluster Analysis; Computer Simulation; Data Interpretation, Statistical; Genomics; Humans; Models, Biological; Models, Statistical; Neoplasms/metabolism; Transcriptome

Full text: 1 | Collections: 01-international | Database: MEDLINE | Main subject: Software / Gene Expression Profiling / Neoplasms | Type of study: Risk_factors_studies | Limit: Humans | Language: En | Publication year: 2013 | Document type: Article
