Search | BVS IEC

Bayesian clustering with uncertain data.

Nicholls, Kath; Kirk, Paul D W; Wallace, Chris.

PLoS Comput Biol ; 20(9): e1012301, 2024 Sep.

Article in English | MEDLINE | ID: mdl-39226325

ABSTRACT

Clustering is widely used in bioinformatics and many other fields, with applications from exploratory analysis to prediction. Many types of data have associated uncertainty or measurement error, but this is rarely used to inform the clustering. We present Dirichlet Process Mixtures with Uncertainty (DPMUnc), an extension of a Bayesian nonparametric clustering algorithm which makes use of the uncertainty associated with data points. We show that DPMUnc out-performs existing methods on simulated data. We cluster immune-mediated diseases (IMD) using GWAS summary statistics, which have uncertainty linked with the sample size of the study. DPMUnc separates autoimmune from autoinflammatory diseases and isolates other subgroups such as adult-onset arthritis. We additionally consider how DPMUnc can be used to cluster gene expression datasets that have been summarised using gene signatures. We first introduce a novel procedure for generating a summary of a gene signature on a dataset different to the one where it was discovered, which incorporates a measure of the variability in expression across signature genes within each individual. We summarise three public gene expression datasets containing patients with a range of IMD, using three relevant gene signatures. We find association between disease and the clusters returned by DPMUnc, with clustering structure replicated across the datasets. The significance of this work is two-fold. Firstly, we demonstrate that when data has associated uncertainty, this uncertainty should be used to inform clustering and we present a method which does this, DPMUnc. Secondly, we present a procedure for using gene signatures in datasets other than where they were originally defined. We show the value of this procedure by summarising gene expression data from patients with immune-mediated diseases using relevant gene signatures, and clustering these patients using DPMUnc.
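The abstract describes summarising a gene signature per individual together with a measure of variability across the signature genes, but it does not give the formula. The sketch below is one plausible reading, not the authors' procedure: it assumes the summary is the mean expression over the signature genes and that the uncertainty is the squared standard error of that mean. The function name `summarise_signature` and the toy data are hypothetical.

```python
from statistics import mean, variance


def summarise_signature(expression, signature_genes):
    """Summarise one gene signature for every individual.

    expression: {individual: {gene: expression value}}
    Returns {individual: (score, score_variance)}.

    Assumption (not taken from the paper): the score is the mean over the
    signature genes and its uncertainty is the squared standard error of
    that mean, i.e. sample variance divided by the number of genes.
    """
    summary = {}
    for individual, profile in expression.items():
        # Use only the signature genes measured for this individual.
        values = [profile[g] for g in signature_genes if g in profile]
        if len(values) < 2:
            raise ValueError(f"need at least two signature genes for {individual}")
        score = mean(values)
        score_variance = variance(values) / len(values)
        summary[individual] = (score, score_variance)
    return summary


# Toy data: both patients have the same mean, but very different spread.
expression = {
    "patient_a": {"G1": 2.0, "G2": 2.0, "G3": 2.0},
    "patient_b": {"G1": 0.0, "G2": 2.0, "G3": 4.0},
}
summary = summarise_signature(expression, ["G1", "G2", "G3"])
```

In this toy case the two patients receive identical scores but different uncertainties, which is the kind of information an uncertainty-aware clustering method such as DPMUnc is designed to use.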

Subjects

Algorithms; Bayes Theorem; Computational Biology; Humans; Cluster Analysis; Uncertainty; Computational Biology/methods; Genome-Wide Association Study/methods; Genome-Wide Association Study/statistics & numerical data; Gene Expression Profiling/statistics & numerical data; Gene Expression Profiling/methods; Databases, Genetic/statistics & numerical data; Computer Simulation

Comparison of sparse biclustering algorithms for gene expression datasets.

Nicholls, Kath; Wallace, Chris.

Brief Bioinform ; 22(6), 2021 Nov 05.

Article in English | MEDLINE | ID: mdl-33951731

ABSTRACT

MOTIVATION: Gene clustering and sample clustering are commonly used to find patterns in gene expression datasets. However, genes may cluster differently in heterogeneous samples (e.g. different tissues or disease states), whilst traditional methods assume that clusters are consistent across samples. Biclustering algorithms aim to solve this issue by performing sample clustering and gene clustering simultaneously. Existing reviews of biclustering algorithms have yet to include a number of more recent algorithms and have based comparisons on simplistic simulated datasets without specific evaluation of biclusters in real datasets, using less robust metrics.
RESULTS: We compared four classes of sparse biclustering algorithms on a range of simulated and real datasets. All algorithms generally struggled on simulated datasets with a large number of genes or implanted biclusters. We found that Bayesian algorithms with strict sparsity constraints had high accuracy on the simulated datasets and did not require any post-processing, but were considerably slower than other algorithm classes. We found that non-negative matrix factorisation algorithms performed poorly, but could be re-purposed for biclustering through a sparsity-inducing post-processing procedure we introduce; one such algorithm was one of the most highly ranked on real datasets. In a multi-tissue knockout mouse RNA-seq dataset, the algorithms rarely returned clusters containing samples from multiple different tissues, whilst such clusters were identified in a human dataset of more closely related cell types (sorted blood cell subsets). This highlights the need for further thought in the design and analysis of multi-tissue studies to avoid differences between tissues dominating the analysis.
AVAILABILITY: Code to run the analysis is available at https://github.com/nichollskc/biclust_comp, including wrappers for each algorithm, implementations of evaluation metrics, and code to simulate datasets and perform pre- and post-processing. The full tables of results are available at https://doi.org/10.5281/zenodo.4581206.
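The abstract mentions a sparsity-inducing post-processing step that turns non-negative matrix factorisation output into biclusters, without giving details. As a rough illustration only, the sketch below assumes a simple fixed threshold on the factor loadings; this is not the procedure from the paper or from the biclust_comp repository, and the function name `factors_to_biclusters` and the `threshold` parameter are hypothetical.

```python
def factors_to_biclusters(sample_factors, gene_factors, threshold=0.1):
    """Turn dense non-negative factors into sparse biclusters.

    sample_factors: list of rows, one per sample, each with K loadings.
    gene_factors:   list of rows, one per gene, each with K loadings.
    Returns one (sample_indices, gene_indices) pair per factor that keeps
    at least one sample and one gene after thresholding.

    Assumption: a loading counts as "in the bicluster" when it exceeds a
    fixed threshold. This is an illustrative rule, not the authors' method.
    """
    n_factors = len(sample_factors[0])
    biclusters = []
    for k in range(n_factors):
        samples = [i for i, row in enumerate(sample_factors) if row[k] > threshold]
        genes = [j for j, row in enumerate(gene_factors) if row[k] > threshold]
        if samples and genes:
            biclusters.append((samples, genes))
    return biclusters


# Toy factors: 3 samples, 4 genes, 2 factors.
sample_factors = [[0.9, 0.0], [0.8, 0.05], [0.0, 0.7]]
gene_factors = [[1.2, 0.0], [0.02, 0.9], [0.0, 0.6], [0.7, 0.01]]
biclusters = factors_to_biclusters(sample_factors, gene_factors)
```

Here the first factor yields a bicluster of samples 0 and 1 with genes 0 and 3, and the second yields sample 2 with genes 1 and 2; the small loadings below the threshold are dropped.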

Subjects

Algorithms; Databases, Genetic; Gene Expression Profiling; Gene Expression Regulation; Oligonucleotide Array Sequence Analysis
