Results 1 - 20 of 38
1.
Methodology (Gott) ; 73(2): 314-339, 2024 Mar 11.
Article in English | MEDLINE | ID: mdl-38577633

ABSTRACT

The identification of sets of co-regulated genes that share a common function is a key question of modern genomics. Bayesian profile regression is a semi-supervised mixture modelling approach that makes use of a response to guide inference toward relevant clusterings. Previous applications of profile regression have considered univariate continuous, categorical, and count outcomes. In this work, we extend Bayesian profile regression to cases where the outcome is longitudinal (or multivariate continuous) and provide PReMiuMlongi, an updated version of PReMiuM, the R package for profile regression. We consider multivariate normal and Gaussian process regression response models and provide proof of principle applications to four simulation studies. The model is applied to budding yeast data to identify groups of genes co-regulated during the Saccharomyces cerevisiae cell cycle. We identify 4 distinct groups of genes associated with specific patterns of gene expression trajectories, along with the bound transcription factors likely involved in their co-regulation process.

2.
Lancet Public Health ; 8(7): e535-e545, 2023 Jul.
Article in English | MEDLINE | ID: mdl-37393092

ABSTRACT

BACKGROUND: To inform targeted public health strategies, it is crucial to understand how coexisting diseases develop over time and their associated impacts on patient outcomes and health-care resources. This study aimed to examine how psychosis, diabetes, and congestive heart failure, in a cluster of physical-mental health multimorbidity, develop and coexist over time, and to assess the associated effects of different temporal sequences of these diseases on life expectancy in Wales. METHODS: In this retrospective cohort study, we used population-scale, individual-level, anonymised, linked, demographic, administrative, and electronic health record data from the Wales Multimorbidity e-Cohort. We included data on all individuals aged 25 years and older who were living in Wales on Jan 1, 2000 (the start of follow-up), with follow-up continuing until Dec 31, 2019, first break in Welsh residency, or death. Multistate models were applied to these data to model trajectories of disease in multimorbidity and their associated effect on all-cause mortality, accounting for competing risks. Life expectancy was calculated as the restricted mean survival time (bound by the maximum follow-up of 20 years) for each of the transitions from the health states to death. Cox regression models were used to estimate baseline hazards for transitions between health states, adjusted for sex, age, and area-level deprivation (Welsh Index of Multiple Deprivation [WIMD] quintile). FINDINGS: Our analyses included data for 1 675 585 individuals (811 393 [48·4%] men and 864 192 [51·6%] women) with a median age of 51·0 years (IQR 37·0-65·0) at cohort entry. The order of disease acquisition in cases of multimorbidity had an important and complex association with patient life expectancy. 
Individuals who developed diabetes, psychosis, and congestive heart failure, in that order (DPC), had reduced life expectancy compared with people who developed the same three conditions in a different order: for a 50-year-old man in the third quintile of the WIMD (on which we based our main analyses to allow comparability), DPC was associated with a loss in life expectancy of 13·23 years (SD 0·80) compared with the general otherwise healthy or otherwise diseased population. Congestive heart failure as a single condition was associated with a mean loss in life expectancy of 12·38 years (0·00), and with a loss of 12·95 years (0·06) when preceded by psychosis and 13·45 years (0·13) when followed by psychosis. Findings were robust in people of older ages, more deprived populations, and women, except that the trajectory of psychosis, congestive heart failure, and diabetes was associated with higher mortality in women than men. Within 5 years of an initial diagnosis of diabetes, the risk of developing psychosis or congestive heart failure, or both, was increased. INTERPRETATION: The order in which individuals develop psychosis, diabetes, and congestive heart failure as combinations of conditions can substantially affect life expectancy. Multistate models offer a flexible framework to assess temporal sequences of diseases and allow identification of periods of increased risk of developing subsequent conditions and death. FUNDING: Health Data Research UK.
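
The restricted mean survival time used in this abstract is the area under a survival curve up to a fixed horizon (20 years in the study). A minimal Python sketch of that calculation follows, assuming a step-function survival curve supplied as (time, survival) pairs. The function name and the numbers are hypothetical and are not taken from the study.

```python
def restricted_mean_survival_time(steps, horizon):
    """Area under a right-continuous step survival curve on [0, horizon].

    steps: list of (time, survival) pairs sorted by time, where survival is
    the value of S(t) from that time until the next step. S(t) is taken to
    be 1 before the first step. All inputs here are illustrative.
    """
    area = 0.0
    prev_time, prev_surv = 0.0, 1.0
    for time, surv in steps:
        if time >= horizon:
            break
        # Add the rectangle between the previous step and this one.
        area += prev_surv * (time - prev_time)
        prev_time, prev_surv = time, surv
    # The final segment runs from the last step up to the horizon.
    area += prev_surv * (horizon - prev_time)
    return area


# Hypothetical curve: survival drops to 0.8 at year 5 and to 0.5 at year 12.
curve = [(5.0, 0.8), (12.0, 0.5)]
rmst = restricted_mean_survival_time(curve, horizon=20.0)
years_lost = 20.0 - rmst  # loss relative to surviving the whole window
```

A loss in life expectancy of the kind reported above would then be a difference between two such areas, one per health state, although the study's own estimates come from fitted multistate models rather than a hand-written curve.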


Subject(s)
Diabetes Mellitus , Heart Failure , Psychotic Disorders , Male , Humans , Female , Adult , Middle Aged , Aged , Semantic Web , Multimorbidity , Retrospective Studies , Wales/epidemiology , Diabetes Mellitus/epidemiology , Heart Failure/epidemiology , Psychotic Disorders/epidemiology , Life Expectancy
3.
BMC Bioinformatics ; 24(1): 161, 2023 Apr 21.
Article in English | MEDLINE | ID: mdl-37085771

ABSTRACT

In this paper we propose PIICM, a probabilistic framework for dose-response prediction in high-throughput drug combination datasets. PIICM utilizes a permutation invariant version of the intrinsic co-regionalization model for multi-output Gaussian process regression, to predict dose-response surfaces in untested drug combination experiments. Coupled with an observation model that incorporates experimental uncertainty, PIICM is able to learn from noisily observed cell-viability measurements in settings where the underlying dose-response experiments are of varying quality, utilize different experimental designs, and the resulting training dataset is sparsely observed. We show that the model can accurately predict dose-response in held out experiments, and the resulting function captures relevant features indicating synergistic interaction between drugs.


Subject(s)
Research Design , Uncertainty , Drug Combinations
4.
Ann Appl Stat ; 16(4)2022 Dec 01.
Article in English | MEDLINE | ID: mdl-36507469

ABSTRACT

Understanding sub-cellular protein localisation is an essential component in the analysis of context specific protein function. Recent advances in quantitative mass-spectrometry (MS) have led to high resolution mapping of thousands of proteins to sub-cellular locations within the cell. Novel modelling considerations to capture the complex nature of these data are thus necessary. We approach analysis of spatial proteomics data in a non-parametric Bayesian framework, using K-component mixtures of Gaussian process regression models. The Gaussian process regression model accounts for correlation structure within a sub-cellular niche, with each mixture component capturing the distinct correlation structure observed within each niche. The availability of marker proteins (i.e. proteins with a priori known labelled locations) motivates a semi-supervised learning approach to inform the Gaussian process hyperparameters. We moreover provide an efficient Hamiltonian-within-Gibbs sampler for our model. Furthermore, we reduce the computational burden associated with inversion of covariance matrices by exploiting the structure in the covariance matrix. A tensor decomposition of our covariance matrices allows extended Trench and Durbin algorithms to be applied to reduce the computational complexity of inversion and hence accelerate computation. We provide detailed case-studies on Drosophila embryos and mouse pluripotent embryonic stem cells to illustrate the benefit of semi-supervised functional Bayesian modelling of the data.

5.
Nat Commun ; 13(1): 5948, 2022 Oct 10.
Article in English | MEDLINE | ID: mdl-36216816

ABSTRACT

The steady-state localisation of proteins provides vital insight into their function. These localisations are context specific with proteins translocating between different subcellular niches upon perturbation of the subcellular environment. Differential localisation, that is a change in the steady-state subcellular location of a protein, provides a step towards mechanistic insight of subcellular protein dynamics. High-accuracy high-throughput mass spectrometry-based methods now exist to map the steady-state localisation and re-localisation of proteins. Here, we describe a principled Bayesian approach, BANDLE, that uses these data to compute the probability that a protein differentially localises upon cellular perturbation. Extensive simulation studies demonstrate that BANDLE reduces the number of both type I and type II errors compared to existing approaches. Application of BANDLE to several datasets recovers well-studied translocations. In an application to cytomegalovirus infection, we obtain insights into the rewiring of the host proteome. Integration of other high-throughput datasets allows us to provide the functional context of these data.


Subject(s)
Proteome , Proteomics , Bayes Theorem , Mass Spectrometry/methods , Proteome/metabolism , Proteomics/methods , Subcellular Fractions/metabolism
6.
BMC Bioinformatics ; 23(1): 290, 2022 Jul 21.
Article in English | MEDLINE | ID: mdl-35864476

ABSTRACT

BACKGROUND: Cluster analysis is an integral part of precision medicine and systems biology, used to define groups of patients or biomolecules. Consensus clustering is an ensemble approach that is widely used in these areas, which combines the output from multiple runs of a non-deterministic clustering algorithm. Here we consider the application of consensus clustering to a broad class of heuristic clustering algorithms that can be derived from Bayesian mixture models (and extensions thereof) by adopting an early stopping criterion when performing sampling-based inference for these models. While the resulting approach is non-Bayesian, it inherits the usual benefits of consensus clustering, particularly in terms of computational scalability and providing assessments of clustering stability/robustness. RESULTS: In simulation studies, we show that our approach can successfully uncover the target clustering structure, while also exploring different plausible clusterings of the data. We show that, when a parallel computation environment is available, our approach offers significant reductions in runtime compared to performing sampling-based Bayesian inference for the underlying model, while retaining many of the practical benefits of the Bayesian approach, such as exploring different numbers of clusters. We propose a heuristic to decide upon ensemble size and the early stopping criterion, and then apply consensus clustering to a clustering algorithm derived from a Bayesian integrative clustering method. We use the resulting approach to perform an integrative analysis of three 'omics datasets for budding yeast and find clusters of co-expressed genes with shared regulatory proteins. We validate these clusters using data external to the analysis. 
CONCLUSIONS: Our approach can be used as a wrapper for essentially any existing sampling-based Bayesian clustering implementation, and enables meaningful clustering analyses to be performed using such implementations, even when computational Bayesian inference is not feasible, e.g. due to poor exploration of the target density (often as a result of increasing numbers of features) or a limited computational budget that does not allow sufficient samples to be drawn from a single chain. This enables researchers to straightforwardly extend the applicability of existing software to much larger datasets, including implementations of sophisticated models such as those that jointly model multiple datasets.
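
The ensemble step of consensus clustering can be illustrated with a consensus matrix: for each pair of items, the fraction of runs in which they receive the same cluster label. The Python sketch below assumes the partitions are already available as lists of labels. In the setting of this abstract they would come from many short, early-stopped sampling chains, but here they are hard-coded, hypothetical values.

```python
def consensus_matrix(partitions):
    """Fraction of runs in which each pair of items shares a cluster label.

    partitions: list of label lists, one per run, all of the same length.
    Label values only matter within a run, so relabelled copies of the
    same partition contribute identically.
    """
    n_runs = len(partitions)
    n_items = len(partitions[0])
    counts = [[0] * n_items for _ in range(n_items)]
    for labels in partitions:
        for i in range(n_items):
            for j in range(n_items):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[count / n_runs for count in row] for row in counts]


# Three hypothetical runs over four items.
runs = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],  # same partition as the first run, labels swapped
    [0, 0, 0, 1],
]
cm = consensus_matrix(runs)
```

A final clustering is typically obtained by applying a standard method, such as hierarchical clustering, to this matrix. Entries far from 0 and 1 flag pairs whose allocation is unstable across runs.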


Subject(s)
Algorithms , Software , Bayes Theorem , Cluster Analysis , Consensus , Humans
7.
Clin Epigenetics ; 14(1): 39, 2022 Mar 12.
Article in English | MEDLINE | ID: mdl-35279219

ABSTRACT

BACKGROUND: This work is aimed at improving the understanding of cardiometabolic syndrome pathophysiology and its relationship with thrombosis by generating a multi-omic disease signature. METHODS/RESULTS: We combined classic plasma biochemistry and plasma biomarkers with the transcriptional and epigenetic characterisation of cell types involved in thrombosis, obtained from two extreme phenotype groups (morbidly obese and lipodystrophy) and lean individuals to identify the molecular mechanisms at play, highlighting patterns of abnormal activation in innate immune phagocytic cells. Our analyses showed that extreme phenotype groups could be distinguished from lean individuals, and from each other, across all data layers. The characterisation of the same obese group, 6 months after bariatric surgery, revealed the loss of the abnormal activation of innate immune cells previously observed. However, rather than reverting to the gene expression landscape of lean individuals, this occurred via the establishment of novel gene expression landscapes. NETosis and its control mechanisms emerge amongst the pathways that show an improvement after surgical intervention. CONCLUSIONS: We showed that the morbidly obese and lipodystrophy groups, despite some differences, shared a common cardiometabolic syndrome signature. We also showed that this could be used to discriminate, amongst the normal population, those individuals with a higher likelihood of presenting with the disease, even when not displaying the classic features.


Subject(s)
Lipodystrophy , Metabolic Syndrome , Obesity, Morbid , DNA Methylation , Epigenesis, Genetic , Humans , Metabolic Syndrome/genetics , Obesity, Morbid/surgery , Phenotype
8.
Bioinformatics ; 38(9): 2529-2535, 2022 Apr 28.
Article in English | MEDLINE | ID: mdl-35191485

ABSTRACT

MOTIVATION: Inferring the parameters of models describing biological systems is an important problem in the reverse engineering of the mechanisms underlying these systems. Much work has focused on parameter inference of stochastic and ordinary differential equation models using Approximate Bayesian Computation (ABC). While there is some recent work on inference in spatial models, this remains an open problem. Simultaneously, advances in topological data analysis (TDA), a field of computational mathematics, have enabled spatial patterns in data to be characterized. RESULTS: Here, we focus on recent work using TDA to study different regimes of parameter space for a well-studied model of angiogenesis. We propose a method for combining TDA with ABC to infer parameters in the Anderson-Chaplain model of angiogenesis. We demonstrate that this topological approach outperforms ABC approaches that use simpler statistics based on spatial features of the data. This is a first step toward a general framework of spatial parameter inference for biological systems, for which there may be a variety of filtrations, vectorizations and summary statistics to be considered. AVAILABILITY AND IMPLEMENTATION: All code used to produce our results is available as a Snakemake workflow from github.com/tt104/tabc_angio.
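
For readers unfamiliar with Approximate Bayesian Computation, the basic rejection scheme is: draw a parameter from the prior, simulate data, and keep the draw if a summary statistic of the simulation is close to that of the observed data. The Python sketch below uses a toy Gaussian model with the sample mean as the summary. This is an assumption made purely for illustration: it is not the Anderson-Chaplain model and involves no topological summaries, and all function names are hypothetical.

```python
import random
import statistics


def abc_rejection(observed_summary, simulate, summarise, sample_prior,
                  tolerance, n_draws, rng):
    """Return prior draws whose simulated summary lies within the tolerance."""
    accepted = []
    for _ in range(n_draws):
        theta = sample_prior(rng)
        simulated = simulate(theta, rng)
        if abs(summarise(simulated) - observed_summary) <= tolerance:
            accepted.append(theta)
    return accepted


rng = random.Random(0)  # fixed seed so the run is reproducible

# Toy "observed" data: 50 draws from a Gaussian with unknown mean 2.0.
observed = [rng.gauss(2.0, 1.0) for _ in range(50)]

posterior_sample = abc_rejection(
    observed_summary=statistics.fmean(observed),
    simulate=lambda theta, r: [r.gauss(theta, 1.0) for _ in range(50)],
    summarise=statistics.fmean,
    sample_prior=lambda r: r.uniform(-5.0, 5.0),  # flat prior on the mean
    tolerance=0.2,
    n_draws=2000,
    rng=rng,
)
```

The accepted draws approximate the posterior of the mean. Swapping `summarise` for a richer statistic is where a topological summary of spatial data would enter in the approach the abstract describes.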


Subject(s)
Algorithms , Bayes Theorem , Computer Simulation
9.
PLoS Genet ; 18(1): e1009975, 2022 Jan.
Article in English | MEDLINE | ID: mdl-35085229

ABSTRACT

Clustering genetic variants based on their associations with different traits can provide insight into their underlying biological mechanisms. Existing clustering approaches typically group variants based on the similarity of their association estimates for various traits. We present a new procedure for clustering variants based on their proportional associations with different traits, which is more reflective of the underlying mechanisms to which they relate. The method is based on a mixture model approach for directional clustering and includes a noise cluster that provides robustness to outliers. The procedure performs well across a range of simulation scenarios. In an applied setting, clustering genetic variants associated with body mass index generates groups reflective of distinct biological pathways. Mendelian randomization analyses support that the clusters vary in their effect on coronary heart disease, including one cluster that represents elevated body mass index with a favourable metabolic profile and reduced coronary heart disease risk. Analysis of the biological pathways underlying this cluster identifies inflammation as potentially explaining differences in the effects of increased body mass index on coronary heart disease.


Subject(s)
Computational Biology/methods , Genetic Variation , Obesity/genetics , Body Mass Index , Cluster Analysis , Genetic Predisposition to Disease , Genome-Wide Association Study , Humans , Mendelian Randomization Analysis , Models, Genetic
10.
Biostatistics ; 24(1): 85-107, 2022 Dec 12.
Article in English | MEDLINE | ID: mdl-34363680

ABSTRACT

Risk prediction models are a crucial tool in healthcare. Risk prediction models with a binary outcome (i.e., binary classification models) are often constructed using methodology which assumes the costs of different classification errors are equal. In many healthcare applications, this assumption is not valid, and the differences between misclassification costs can be quite large. For instance, in a diagnostic setting, the cost of misdiagnosing a person with a life-threatening disease as healthy may be larger than the cost of misdiagnosing a healthy person as a patient. In this article, we present Tailored Bayes (TB), a novel Bayesian inference framework which "tailors" model fitting to optimize predictive performance with respect to unbalanced misclassification costs. We use simulation studies to showcase when TB is expected to outperform standard Bayesian methods in the context of logistic regression. We then apply TB to three real-world applications, a cardiac surgery, a breast cancer prognostication task, and a breast cancer tumor classification task and demonstrate the improvement in predictive performance over standard methods.
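
The effect of unequal misclassification costs can be seen in the textbook decision rule for a calibrated probability: predicting "positive" minimises expected cost once the predicted probability reaches c_FP / (c_FP + c_FN). The Python sketch below illustrates only that standard rule. It is not the Tailored Bayes procedure, which according to the abstract adapts model fitting itself, and the cost values are hypothetical.

```python
def optimal_threshold(cost_false_positive, cost_false_negative):
    """Probability above which predicting 'positive' has lower expected cost."""
    return cost_false_positive / (cost_false_positive + cost_false_negative)


def expected_cost(probability, predict_positive,
                  cost_false_positive, cost_false_negative):
    """Expected cost of one decision, given the probability of disease."""
    if predict_positive:
        # The only possible error is a false positive (patient is healthy).
        return (1.0 - probability) * cost_false_positive
    # The only possible error is a false negative (patient is diseased).
    return probability * cost_false_negative


def classify(probability, cost_false_positive, cost_false_negative):
    """Predict positive when the probability reaches the cost-based threshold."""
    threshold = optimal_threshold(cost_false_positive, cost_false_negative)
    return probability >= threshold


# Hypothetical costs: missing a diseased patient is nine times worse
# than a false alarm, so the threshold falls from 0.5 to 0.1.
threshold_equal = optimal_threshold(1.0, 1.0)
threshold_unequal = optimal_threshold(1.0, 9.0)
```

With equal costs, a patient with predicted risk 0.2 would be classed as healthy. With the 1:9 costs the same patient is flagged, because the expected cost of a miss (0.2 × 9) exceeds that of a false alarm (0.8 × 1).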


Subject(s)
Breast Neoplasms , Models, Statistical , Humans , Female , Bayes Theorem , Logistic Models , Computer Simulation , Breast Neoplasms/diagnosis
11.
Commun Biol ; 4(1): 810, 2021 Jun 29.
Article in English | MEDLINE | ID: mdl-34188175

ABSTRACT

The thermal stability of proteins can be altered when they interact with small molecules, other biomolecules or are subject to post-translational modifications. Thus monitoring the thermal stability of proteins under various cellular perturbations can provide insights into protein function, as well as potentially determine drug targets and off-targets. Thermal proteome profiling is a highly multiplexed mass-spectrometry method for monitoring the melting behaviour of thousands of proteins in a single experiment. In essence, thermal proteome profiling assumes that proteins denature upon heating and hence become insoluble. Thus, by tracking the relative solubility of proteins at sequentially increasing temperatures, one can report on the thermal stability of a protein. Standard thermodynamics predicts a sigmoidal relationship between temperature and relative solubility and this is the basis of current robust statistical procedures. However, current methods do not model deviations from this behaviour and they do not quantify uncertainty in the melting profiles. To overcome these challenges, we propose the application of Bayesian functional data analysis tools which allow complex temperature-solubility behaviours. Our methods have improved sensitivity over the state-of-the-art, identify new drug-protein associations and have less restrictive assumptions than current approaches. Our methods allow for comprehensive analysis of proteins that deviate from the predicted sigmoid behaviour and we uncover potentially biphasic phenomena with a series of published datasets.


Subject(s)
Bayes Theorem , Protein Stability , Proteome , Solubility , Temperature , Thermodynamics
12.
Nat Commun ; 12(1): 2639, 2021 May 11.
Article in English | MEDLINE | ID: mdl-33976128

ABSTRACT

The placenta is the interface between mother and fetus and inadequate function contributes to short and long-term ill-health. The placenta is absent from most large-scale RNA-Seq datasets. We therefore analyze long and small RNAs (~101 and 20 million reads per sample respectively) from 302 human placentas, including 94 cases of preeclampsia (PE) and 56 cases of fetal growth restriction (FGR). The placental transcriptome has the seventh lowest complexity of 50 human tissues: 271 genes account for 50% of all reads. We identify multiple circular RNAs and validate 6 of these by Sanger sequencing across the back-splice junction. Using large-scale mass spectrometry datasets, we find strong evidence of peptides produced by translation of two circular RNAs. We also identify novel piRNAs which are clustered on Chr1 and Chr14. PE and FGR are associated with multiple and overlapping differences in mRNA, lincRNA and circRNA but fewer consistent differences in small RNAs. Of the three protein coding genes differentially expressed in both PE and FGR, one encodes a secreted protein FSTL3 (follistatin-like 3). Elevated serum levels of FSTL3 in pregnant women are predictive of subsequent PE and FGR. To aid visualization of our placenta transcriptome data, we develop a web application ( https://www.obgyn.cam.ac.uk/placentome/ ).
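
A statement such as "271 genes account for 50% of all reads" can be computed by sorting genes by read count and accumulating until the target fraction is reached. The Python sketch below shows that calculation on made-up counts. The abstract does not give the exact definition of transcriptome complexity used in the study, so this is one plausible reading rather than the authors' procedure.

```python
def genes_for_read_fraction(read_counts, fraction=0.5):
    """Smallest number of top-expressed genes covering `fraction` of reads."""
    total = sum(read_counts)
    cumulative = 0
    # Walk through the genes from most to least expressed.
    for n_genes, count in enumerate(sorted(read_counts, reverse=True), start=1):
        cumulative += count
        if cumulative >= fraction * total:
            return n_genes
    return len(read_counts)


# Hypothetical read counts for six genes (1000 reads in total).
counts = [100, 400, 60, 300, 40, 100]
n_half = genes_for_read_fraction(counts)          # genes covering 50% of reads
n_ninety = genes_for_read_fraction(counts, 0.9)   # genes covering 90% of reads
```

A smaller count indicates a transcriptome dominated by a few highly expressed genes, which is the sense in which the abstract ranks the placenta as low in complexity.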


Subject(s)
Fetal Growth Retardation/genetics , Placenta/pathology , Pre-Eclampsia/genetics , RNA/genetics , Transcriptome/genetics , Biopsy , Datasets as Topic , Female , Fetal Growth Retardation/blood , Fetal Growth Retardation/pathology , Follistatin-Related Proteins/blood , Follistatin-Related Proteins/genetics , Gene Expression Regulation, Developmental , Humans , Placenta/metabolism , Pre-Eclampsia/blood , Pre-Eclampsia/pathology , Pregnancy , RNA/metabolism , RNA-Seq
13.
Nat Commun ; 12(1): 764, 2021 Feb 03.
Article in English | MEDLINE | ID: mdl-33536417

ABSTRACT

Genome-wide association studies (GWAS) have identified thousands of genomic regions affecting complex diseases. The next challenge is to elucidate the causal genes and mechanisms involved. One approach is to use statistical colocalization to assess shared genetic aetiology across multiple related traits (e.g. molecular traits, metabolic pathways and complex diseases) to identify causal pathways, prioritize causal variants and evaluate pleiotropy. We propose HyPrColoc (Hypothesis Prioritisation for multi-trait Colocalization), an efficient deterministic Bayesian algorithm using GWAS summary statistics that can detect colocalization across vast numbers of traits simultaneously (e.g. 100 traits can be jointly analysed in around 1 s). We perform a genome-wide multi-trait colocalization analysis of coronary heart disease (CHD) and fourteen related traits, identifying 43 regions in which CHD colocalized with ≥1 trait, including 5 previously unknown CHD loci. Across the 43 loci, we further integrate gene and protein expression quantitative trait loci to identify candidate causal genes.


Subject(s)
Algorithms , Computational Biology/methods , Coronary Disease/genetics , Genetic Predisposition to Disease/genetics , Genome-Wide Association Study/methods , Quantitative Trait Loci/genetics , Coronary Disease/diagnosis , Genomics/methods , Humans , Linkage Disequilibrium , Polymorphism, Single Nucleotide , Reproducibility of Results , Risk Factors
14.
Bioinformatics ; 37(4): 531-541, 2021 May 01.
Article in English | MEDLINE | ID: mdl-32915962

ABSTRACT

MOTIVATION: Mendelian randomization is an epidemiological technique that uses genetic variants as instrumental variables to estimate the causal effect of a risk factor on an outcome. We consider a scenario in which causal estimates based on each variant in turn differ more strongly than expected by chance alone, but the variants can be divided into distinct clusters, such that all variants in the cluster have similar causal estimates. This scenario is likely to occur when there are several distinct causal mechanisms by which a risk factor influences an outcome with different magnitudes of causal effect. We have developed an algorithm MR-Clust that finds such clusters of variants, and so can identify variants that reflect distinct causal mechanisms. Two features of our clustering algorithm are that it accounts for differential uncertainty in the causal estimates, and it includes 'null' and 'junk' clusters, to provide protection against the detection of spurious clusters. RESULTS: Our algorithm correctly detected the number of clusters in a simulation analysis, outperforming methods that either do not account for uncertainty or do not include null and junk clusters. In an applied example considering the effect of blood pressure on coronary artery disease risk, the method detected four clusters of genetic variants. A post hoc hypothesis-generating search suggested that variants in the cluster with a negative effect of blood pressure on coronary artery disease risk were more strongly related to trunk fat percentage and other adiposity measures than variants not in this cluster. AVAILABILITY AND IMPLEMENTATION: MR-Clust can be downloaded from https://github.com/cnfoley/mrclust. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
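
The variant-specific causal estimates referred to in this abstract are commonly obtained as ratio (Wald) estimates: the variant's association with the outcome divided by its association with the risk factor, with a first-order standard error. The Python sketch below computes these quantities and a simple z-statistic for the difference between two variants. It illustrates the inputs such a clustering would work with and the role of their uncertainty. It is not the MR-Clust algorithm, and the function names and numbers are hypothetical.

```python
import math


def ratio_estimate(beta_exposure, beta_outcome, se_outcome):
    """Ratio estimate of the causal effect and its first-order standard error.

    The standard error ignores uncertainty in the variant-exposure
    association, a common first-order approximation.
    """
    if beta_exposure == 0:
        raise ValueError("variant must be associated with the exposure")
    estimate = beta_outcome / beta_exposure
    std_error = se_outcome / abs(beta_exposure)
    return estimate, std_error


def z_difference(first, second):
    """z-statistic for the difference between two (estimate, se) pairs."""
    (est_a, se_a), (est_b, se_b) = first, second
    return (est_a - est_b) / math.sqrt(se_a ** 2 + se_b ** 2)


# Two hypothetical variants: (beta_exposure, beta_outcome, se_outcome).
variant_a = ratio_estimate(0.5, 0.10, 0.02)
variant_b = ratio_estimate(0.4, -0.04, 0.02)
z = z_difference(variant_a, variant_b)
```

A large absolute z suggests the two variants point to different causal effects, which is the kind of heterogeneity that motivates grouping variants into clusters.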


Subject(s)
Mendelian Randomization Analysis , Causality , Cluster Analysis , Computer Simulation , Risk Factors
15.
Nat Comput Sci ; 1: 421-432, 2021 Jun.
Article in English | MEDLINE | ID: mdl-34993494

ABSTRACT

Detecting genetic variants associated with traits (quantitative trait loci, QTL) requires genotyped study individuals. Here we describe BaseQTL, a Bayesian method that exploits allele-specific expression to map molecular QTL from sequencing reads (eQTL for gene expression) even when no genotypes are available. When used with genotypes to map eQTL, BaseQTL has lower error rates and increased power compared with existing QTL mapping methods. Running without genotypes limits how many tests can be performed, but due to the proximity of QTL variants to gene bodies, the 2.8% of variants within a 100 kB window that could be tested contained 26% of eQTL detectable with genotypes. eQTL effect estimates were invariably consistent between analyses performed with and without genotypes. Often, sequencing data may be generated in the absence of genotypes on patients and controls in differential expression studies, and we identified an apparent psoriasis-specific eQTL for GSTP1 in one such dataset, providing new insights into disease-dependent gene regulation.

16.
Genome Med ; 12(1): 106, 2020 Nov 25.
Article in English | MEDLINE | ID: mdl-33239102

ABSTRACT

BACKGROUND: Genome-wide association studies (GWAS) have identified pervasive sharing of genetic architectures across multiple immune-mediated diseases (IMD). By learning the genetic basis of IMD risk from common diseases, this sharing can be exploited to enable analysis of less frequent IMD where, due to limited sample size, traditional GWAS techniques are challenging. METHODS: Exploiting ideas from Bayesian genetic fine-mapping, we developed a disease-focused shrinkage approach to allow us to distill genetic risk components from GWAS summary statistics for a set of related diseases. We applied this technique to 13 larger GWAS of common IMD, deriving a reduced dimension "basis" that summarised the multidimensional components of genetic risk. We used independent datasets including the UK Biobank to assess the performance of the basis and characterise individual axes. Finally, we projected summary GWAS data for smaller IMD studies, with less than 1000 cases, to assess whether the approach was able to provide additional insights into genetic architecture of less common IMD or IMD subtypes, where cohort collection is challenging. RESULTS: We identified 13 IMD genetic risk components. The projection of independent UK Biobank data demonstrated the IMD specificity and accuracy of the basis even for traits with very limited case-size (e.g. vitiligo, 150 cases). Projection of additional IMD-relevant studies allowed us to add biological interpretation to specific components, e.g. related to raised eosinophil counts in blood and serum concentration of the chemokine CXCL10 (IP-10). On application to 22 rare IMD and IMD subtypes, we were able to not only highlight subtype-discriminating axes (e.g. for juvenile idiopathic arthritis) but also suggest eight novel genetic associations. CONCLUSIONS: Requiring only summary-level data, our unsupervised approach allows the genetic architectures across any range of clinically related traits to be characterised in fewer dimensions. 
This facilitates the analysis of studies with modest sample size by matching shared axes of both genetic and biological risk across a wider disease domain, and provides an evidence base for possible therapeutic repurposing opportunities.


Subject(s)
Genetic Engineering , Immune System Diseases/genetics , Bayes Theorem , Genome-Wide Association Study , Humans , Phenotype , Polymorphism, Single Nucleotide , Risk Factors , Sample Size
17.
PLoS Comput Biol ; 16(11): e1008288, 2020 Nov.
Article in English | MEDLINE | ID: mdl-33166281

ABSTRACT

The cell is compartmentalised into complex micro-environments allowing an array of specialised biological processes to be carried out in synchrony. Determining a protein's sub-cellular localisation to one or more of these compartments can therefore be a first step in determining its function. High-throughput and high-accuracy mass spectrometry-based sub-cellular proteomic methods can now shed light on the localisation of thousands of proteins at once. Machine learning algorithms are then typically employed to make protein-organelle assignments. However, these algorithms are limited by insufficient and incomplete annotation. We propose a semi-supervised Bayesian approach to novelty detection, allowing the discovery of additional, previously unannotated sub-cellular niches. Inference in our model is performed in a Bayesian framework, allowing us to quantify uncertainty in the allocation of proteins to new sub-cellular niches, as well as in the number of newly discovered compartments. We apply our approach across 10 mass spectrometry based spatial proteomic datasets, representing a diverse range of experimental protocols. Application of our approach to hyperLOPIT datasets validates its utility by recovering enrichment with chromatin-associated proteins without annotation and uncovers sub-nuclear compartmentalisation which was not identified in the original analysis. Moreover, using sub-cellular proteomics data from Saccharomyces cerevisiae, we uncover a novel group of proteins trafficking from the ER to the early Golgi apparatus. Overall, we demonstrate the potential for novelty detection to yield biologically relevant niches that are missed by current approaches.


Subject(s)
Bayes Theorem , Saccharomyces cerevisiae Proteins/metabolism , Subcellular Fractions/metabolism , Algorithms , Animals , Datasets as Topic , Humans , Machine Learning , Mass Spectrometry , Mice , Proteomics
18.
Bioinformatics ; 36(18): 4789-4796, 2020 Sep 15.
Article in English | MEDLINE | ID: mdl-32592464

ABSTRACT

MOTIVATION: Diverse applications, particularly in tumour subtyping, have demonstrated the importance of integrative clustering techniques for combining information from multiple data sources. Cluster Of Clusters Analysis (COCA) is one such approach that has been widely applied in the context of tumour subtyping. However, the properties of COCA have never been systematically explored, and its robustness to the inclusion of noisy datasets is unclear. RESULTS: We rigorously benchmark COCA, and present Kernel Learning Integrative Clustering (KLIC) as an alternative strategy. KLIC frames the challenge of combining clustering structures as a multiple kernel learning problem, in which different datasets each provide a weighted contribution to the final clustering. This allows the contribution of noisy datasets to be down-weighted relative to more informative datasets. We compare the performances of KLIC and COCA in a variety of situations through simulation studies. We also present the output of KLIC and COCA in real data applications to cancer subtyping and transcriptional module discovery. AVAILABILITY AND IMPLEMENTATION: R packages klic and coca are available on the Comprehensive R Archive Network. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
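
The idea of a weighted contribution from each dataset can be illustrated by a convex combination of per-dataset kernel (similarity) matrices. The Python sketch below combines two hypothetical kernels using fixed, hand-picked weights. In a multiple kernel learning method the weights would be learned jointly with the clustering, which this sketch does not attempt, and it does not use the klic package's API.

```python
def combine_kernels(kernels, weights):
    """Weighted sum of square kernel matrices with non-negative weights summing to 1."""
    if len(kernels) != len(weights):
        raise ValueError("need exactly one weight per kernel")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    n = len(kernels[0])
    combined = [[0.0] * n for _ in range(n)]
    for kernel, weight in zip(kernels, weights):
        for i in range(n):
            for j in range(n):
                combined[i][j] += weight * kernel[i][j]
    return combined


# Hypothetical co-clustering kernels for three items from two datasets.
informative = [[1.0, 1.0, 0.0],
               [1.0, 1.0, 0.0],
               [0.0, 0.0, 1.0]]
noisy = [[1.0, 0.0, 1.0],
         [0.0, 1.0, 0.0],
         [1.0, 0.0, 1.0]]

# Down-weighting the noisy dataset keeps items 0 and 1 strongly linked.
combined = combine_kernels([informative, noisy], [0.75, 0.25])
```

The combined matrix can then be passed to any kernel-based clustering method, so the final grouping mostly reflects the informative dataset.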


Subject(s)
Neoplasms , Algorithms , Cluster Analysis , Consensus , Humans , Information Storage and Retrieval , Neoplasms/genetics
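
The weighted-contribution idea described in the abstract of entry 18 can be illustrated with a small sketch. It is not the API of the klic or coca R packages: the function names, the toy labels, and the hand-picked weights are all assumptions made for illustration (the abstract describes KLIC as learning the weights, whereas here they are simply fixed). Each dataset's clustering is turned into a binary co-clustering kernel, and the kernels are combined as a weighted sum so that a noisy dataset can be down-weighted.

```python
def coclustering_kernel(labels):
    """Binary kernel: entry (i, j) is 1.0 if items i and j share a cluster label."""
    n = len(labels)
    return [[1.0 if labels[i] == labels[j] else 0.0 for j in range(n)]
            for i in range(n)]


def combine_kernels(kernels, weights):
    """Weighted sum of equally sized kernels; weights are normalised to sum to 1."""
    total = sum(weights)
    n = len(kernels[0])
    combined = [[0.0] * n for _ in range(n)]
    for kernel, weight in zip(kernels, weights):
        for i in range(n):
            for j in range(n):
                combined[i][j] += (weight / total) * kernel[i][j]
    return combined


# Hypothetical cluster labels for four items from three datasets.
# The first two datasets agree up to label switching; the third is "noisy".
labels_a = [0, 0, 1, 1]
labels_b = [1, 1, 0, 0]
labels_noisy = [0, 1, 0, 1]
kernels = [coclustering_kernel(l) for l in (labels_a, labels_b, labels_noisy)]

# Equal weights treat every dataset alike; the second call uses
# hand-picked (not learned) weights that down-weight the noisy dataset.
equal = combine_kernels(kernels, [1.0, 1.0, 1.0])
down_weighted = combine_kernels(kernels, [1.0, 1.0, 0.1])
```

With the noisy dataset down-weighted, the combined similarity of items 0 and 1 (which the two informative datasets place together) rises, while that of items 0 and 2 falls. A final clustering would then be obtained by applying a kernel clustering method to the combined matrix, which this sketch leaves out.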
19.
Bioinformatics ; 36(5): 1484-1491, 2020 03 01.
Article in English | MEDLINE | ID: mdl-31608923

ABSTRACT

MOTIVATION: Many methods have been developed to cluster genes on the basis of their changes in mRNA expression over time, using bulk RNA-seq or microarray data. However, single-cell data may present a particular challenge for these algorithms, since the temporal ordering of cells is not directly observed. One way to address this is to first use pseudotime methods to order the cells, and then apply clustering techniques for time course data. However, pseudotime estimates are subject to high levels of uncertainty, and failing to account for this uncertainty is liable to lead to erroneous and/or over-confident gene clusters. RESULTS: The proposed method, GPseudoClust, is a novel approach that jointly infers pseudotemporal ordering and gene clusters, and quantifies the uncertainty in both. GPseudoClust combines a recent method for pseudotime inference with non-parametric Bayesian clustering methods, efficient Markov Chain Monte Carlo sampling and novel subsampling strategies which aid computation. We consider a broad array of simulated and experimental datasets to demonstrate the effectiveness of GPseudoClust in a range of settings. AVAILABILITY AND IMPLEMENTATION: An implementation is available on GitHub: https://github.com/magStra/nonparametricSummaryPSM and https://github.com/magStra/GPseudoClust. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Algorithms , Single-Cell Analysis , Bayes Theorem , Cluster Analysis , Markov Chains
20.
Stat Appl Genet Mol Biol ; 18(6)2019 12 12.
Article in English | MEDLINE | ID: mdl-31829970

ABSTRACT

The Dirichlet Process (DP) mixture model has become a popular choice for model-based clustering, largely because it allows the number of clusters to be inferred. The sequential updating and greedy search (SUGS) algorithm (Wang & Dunson, 2011) was proposed as a fast method for performing approximate Bayesian inference in DP mixture models, by posing clustering as a Bayesian model selection (BMS) problem and avoiding the use of computationally costly Markov chain Monte Carlo methods. Here we consider how this approach may be extended to permit variable selection for clustering, and also demonstrate the benefits of Bayesian model averaging (BMA) in place of BMS. Through an array of simulation examples and well-studied examples from cancer transcriptomics, we show that our method performs competitively with the current state-of-the-art, while also offering computational benefits. We apply our approach to reverse-phase protein array (RPPA) data from The Cancer Genome Atlas (TCGA) in order to perform a pan-cancer proteomic characterisation of 5157 tumour samples. We have implemented our approach, together with the original SUGS algorithm, in an open-source R package named sugsvarsel, which accelerates analysis by performing intensive computations in C++ and provides automated parallel processing. The R package is freely available from: https://github.com/ococrook/sugsvarsel.


Subject(s)
Computational Biology , Models, Statistical , Neoplasms/metabolism , Proteome , Proteomics , Algorithms , Bayes Theorem , Computational Biology/methods , Humans , Proteomics/methods
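
The sequential, greedy allocation step that the abstract of entry 20 attributes to SUGS can be sketched as follows. This is a deliberately simplified illustration, not the sugsvarsel implementation: it assumes one-dimensional data, a Normal likelihood with known variance, a Normal prior on each cluster mean, and a fixed concentration parameter `alpha`. The function name `sugs_1d` and all parameter names are hypothetical, and the variable selection and model averaging discussed in the abstract are omitted. Each observation is visited once and assigned to whichever existing or new cluster maximises the prior weight times the posterior predictive density.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a Normal(mean, var) distribution evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def sugs_1d(data, alpha=1.0, prior_mean=0.0, prior_var=10.0, noise_var=1.0):
    """Greedy one-pass clustering under a simplified DP mixture (assumed model)."""
    labels, counts, sums = [], [], []
    for x in data:
        scores = []
        # Existing clusters: Chinese-restaurant weight (cluster size) times
        # the Normal posterior predictive given the cluster's current members.
        for n_k, s_k in zip(counts, sums):
            post_var = 1.0 / (1.0 / prior_var + n_k / noise_var)
            post_mean = post_var * (prior_mean / prior_var + s_k / noise_var)
            scores.append(n_k * normal_pdf(x, post_mean, noise_var + post_var))
        # New cluster: weight alpha times the prior predictive density.
        scores.append(alpha * normal_pdf(x, prior_mean, noise_var + prior_var))
        # Greedy step: take the single best allocation instead of sampling.
        best = max(range(len(scores)), key=scores.__getitem__)
        if best == len(counts):
            counts.append(0)
            sums.append(0.0)
        counts[best] += 1
        sums[best] += x
        labels.append(best)
    return labels


# Toy data with two well-separated groups (values chosen for illustration).
labels = sugs_1d([-5.1, -4.9, 5.0, 5.2, -5.0, 4.8])
print(labels)
```

Because every observation is allocated exactly once, the cost is a single pass over the data, which is the source of the speed advantage over Markov chain Monte Carlo that the abstract mentions. The result depends on the order in which observations are visited.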