ABSTRACT
Single-cell tissue atlases commonly use RNA abundances as surrogates for protein abundances. Yet, protein abundance also depends on the regulation of protein synthesis and degradation rates. To estimate the contributions of such post-transcriptional regulation, we quantified the proteomes of 5,883 single cells from human testis using three distinct mass spectrometry methods (SCoPE2, pSCoPE, and plexDIA). To distinguish between biological and technical factors contributing to differences between protein and RNA levels, we developed BayesPG, a Bayesian model of transcript and protein abundance that systematically accounts for technical variation and infers biological differences. We use BayesPG to jointly model RNA and protein data collected from 29,709 single cells across different methods and datasets. BayesPG estimated consensus mRNA and protein levels for 3,861 gene products and quantified the relative protein-to-mRNA ratio (rPTR) for each gene across six distinct cell types in samples from human testis. About 28% of the gene products exhibited significant differences at protein and RNA levels and contributed to about 1,500 significant GO groups. We observe that specialized and context-specific functions, such as those related to spermatogenesis, are regulated after transcription. Among hundreds of detected post-translationally modified peptides, many show significant abundance differences across cell types. Furthermore, some phosphorylated peptides covary with kinases in a cell-type-dependent manner, suggesting cell-type-specific regulation. Our results demonstrate the potential of inferring protein regulation from single-cell proteogenomic data and provide a generalizable model, BayesPG, for performing such analyses.
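As a minimal illustration of the relative protein-to-mRNA ratio (rPTR) named in this abstract, the sketch below computes a log2 protein-to-mRNA ratio for one gene in each cell type and centers it on the gene's mean across cell types. This point estimate is an assumption made for illustration only; it is not BayesPG, which additionally models technical variation and uncertainty. The function name `relative_ptr`, the cell-type labels, and the abundance values are all hypothetical.

```python
import math

def relative_ptr(protein, mrna):
    """Return log2 protein-to-mRNA ratios for one gene, centered across cell types.

    protein, mrna: dicts mapping cell type -> positive abundance.
    Centering removes the gene's overall ratio, leaving cell-type differences.
    """
    log_ratio = {ct: math.log2(protein[ct]) - math.log2(mrna[ct]) for ct in protein}
    mean_ratio = sum(log_ratio.values()) / len(log_ratio)
    return {ct: value - mean_ratio for ct, value in log_ratio.items()}

# Hypothetical abundances: mRNA is constant, protein varies by cell type.
protein = {"type_a": 8.0, "type_b": 4.0, "type_c": 2.0}
mrna = {"type_a": 2.0, "type_b": 2.0, "type_c": 2.0}
rptr = relative_ptr(protein, mrna)
```

In this toy case the mRNA level is identical in all three cell types, so every difference in the centered ratio comes from the protein level, which is the kind of pattern that points to regulation after transcription.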
ABSTRACT
Analyzing proteins from single cells by tandem mass spectrometry (MS) has recently become technically feasible. While such analysis has the potential to accurately quantify thousands of proteins across thousands of single cells, the accuracy and reproducibility of the results may be undermined by numerous factors affecting experimental design, sample preparation, data acquisition and data analysis. We expect that broadly accepted community guidelines and standardized metrics will enhance rigor, data quality and alignment between laboratories. Here we propose best practices, quality controls and data-reporting recommendations to assist in the broad adoption of reliable quantitative workflows for single-cell proteomics. Resources and discussion forums are available at https://single-cell.net/guidelines.
Subjects
Benchmarking , Proteomics , Benchmarking/methods , Proteomics/methods , Reproducibility of Results , Proteins/analysis , Tandem Mass Spectrometry/methods , Proteome/analysis
ABSTRACT
In recent years, metabolomics has been used as a powerful tool to better understand the physiology of neurodegenerative diseases and identify potential biomarkers for progression. We used targeted and untargeted aqueous and lipidomic profiles of the metabolome from human cerebrospinal fluid to build multivariate predictive models distinguishing patients with Alzheimer's disease (AD), Parkinson's disease (PD), and healthy age-matched controls. We emphasize several statistical challenges associated with metabolomic studies where the number of measured metabolites far exceeds sample size. We found strong separation in the metabolome between PD and controls, as well as between PD and AD, with weaker separation between AD and controls. Consistent with existing literature, we found alanine, kynurenine, tryptophan, and serine to be associated with PD classification against controls, while alanine, creatine, and long-chain ceramides were associated with AD classification against controls. We conducted a univariate pathway analysis of untargeted and targeted metabolite profiles and found that vitamin E and urea cycle metabolism pathways are associated with PD, while the aspartate/asparagine and C21-steroid hormone biosynthesis pathways are associated with AD. We also found that the amount of metabolite missingness varied by phenotype, highlighting the importance of examining missing data in future metabolomic studies.
ABSTRACT
We develop an envelope model for joint mean and covariance regression in the large p, small n setting. In contrast to existing envelope methods, which improve mean estimates by incorporating estimates of the covariance structure, we focus on identifying covariance heterogeneity by incorporating information about mean-level differences. We use a Monte Carlo EM algorithm to identify a low-dimensional subspace that explains differences in both means and covariances as a function of covariates, and then use MCMC to estimate the posterior uncertainty conditional on the inferred low-dimensional subspace. We demonstrate the utility of our model on a motivating application on the metabolomics of aging. We also provide R code that can be used to develop and test other generalizations of the response envelope model.
Subjects
Algorithms , Monte Carlo Method
ABSTRACT
Quantifying the physiology of aging is essential for improving our understanding of age-related disease and the heterogeneity of healthy aging. Recent studies have shown that, in regression models using "-omic" platforms to predict chronological age, residual variation in predicted age is correlated with health outcomes, and suggest that these "omic clocks" provide measures of biological age. This paper presents predictive models for age using metabolomic profiles of cerebrospinal fluid (CSF) from healthy human subjects and finds that metabolite and lipid data are generally able to predict chronological age within 10 years. We use these models to predict the age of a cohort of subjects with Alzheimer's and Parkinson's disease and find an increase in prediction error, potentially indicating that the relationship between the metabolome and chronological age differs with these diseases. However, evidence is not found to support the hypothesis that our models will consistently overpredict the age of these subjects. In our analysis of control subjects, we find the carnitine shuttle, sucrose, biopterin, vitamin E metabolism, tryptophan, and tyrosine to be the most associated with age. We showcase the potential usefulness of age prediction models in a small data set (n = 85) and discuss techniques for drift correction, missing data imputation, and regularized regression, which can be used to help mitigate the statistical challenges that commonly arise in this setting. To our knowledge, this work presents the first multivariate predictive metabolomic and lipidomic models for age using mass spectrometry analysis of CSF.
Subjects
Aging , Metabolomics , Biomarkers/cerebrospinal fluid , Cohort Studies , Humans , Mass Spectrometry , Metabolome , Metabolomics/methods
ABSTRACT
Saccharomyces yeast grow through mitotic cell division, converting resources into biomass. When cells experience starvation, sporulation is initiated and meiosis produces haploid cells inside a protective ascus. The protected spore state does not acquire resources and is partially protected from desiccation, heat, and caustic chemicals. Because cells cannot both be protected and acquire resources simultaneously, committing to sporulation represents a trade-off between current and future reproduction. Recent work has suggested that passaging through insect guts selects for spore formation, as surviving insect ingestion represents a major way that yeasts are vectored to new food sources. We subjected replicate populations from five yeast strains to passaging through insects, and evolved control populations by pipette passaging. We assayed populations for their propensity to sporulate after resource depletion. We found that ancestral domesticated strains produced fewer spores, and all strains evolved increased spore production in response to passaging through flies, but domesticated strains responded less. Exposure to flies led to a more rapid shift to sporulation that was more extreme in wild-derived strains. Our results indicate that insect passaging selects for spore production and suggest that domestication led to genetic canalization of the response to cues in the environment and initiation of sporulation.
Subjects
Saccharomycetales , Haploidy , Meiosis , Saccharomyces cerevisiae , Fungal Spores
ABSTRACT
Data analyses typically rely upon assumptions about the missingness mechanisms that lead to observed versus missing data, assumptions that are typically unassessable. We explore an approach where the joint distribution of observed data and missing data is specified in a nonstandard way. In this formulation, which traces back to a representation of the joint distribution of the data and missingness mechanism, apparently first proposed by J. W. Tukey, the modeling assumptions about the distributions are either assessable or are designed to allow relatively easy incorporation of substantive knowledge about the problem at hand, thereby offering a possibly realistic portrayal of the data, both observed and missing. We develop Tukey's representation for exponential-family models, propose a computationally tractable approach to inference in this class of models, and offer some general theoretical comments. We then illustrate the utility of this approach with an example in systems biology.
ABSTRACT
Analysis by liquid chromatography and tandem mass spectrometry (LC-MS/MS) can identify and quantify thousands of proteins in microgram-level samples, such as those composed of thousands of cells. This process, however, remains challenging for smaller samples, such as the proteomes of single mammalian cells, because reduced protein levels reduce the number of confidently sequenced peptides. To alleviate this reduction, we developed Data-driven Alignment of Retention Times for IDentification (DART-ID). DART-ID implements principled Bayesian frameworks for global retention time (RT) alignment and for incorporating RT estimates towards improved confidence estimates of peptide-spectrum matches. When applied to bulk or to single-cell samples, DART-ID increased the number of data points by 30-50% at 1% FDR, and thus decreased missing data. Benchmarks indicate excellent quantification of peptides upgraded by DART-ID and support their utility for quantitative analysis, such as identifying cell types and cell-type-specific proteins. The additional data points provided by DART-ID boost the statistical power and double the number of proteins identified as differentially abundant in monocytes and T-cells. DART-ID can be applied to diverse experimental designs and is freely available at http://dart-id.slavovlab.net.
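The idea of using retention-time (RT) evidence to update the confidence of a peptide-spectrum match can be sketched with Bayes' rule. The example below is a simplified illustration, not the DART-ID implementation: it assumes a normal RT density around a predicted RT for correct matches and a uniform density over the run for incorrect matches. The abstract does not specify the densities DART-ID uses, so both choices, along with the function names and parameter values, are assumptions.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def updated_pep(pep, rt_obs, rt_pred, sigma, run_start, run_end):
    """Update a posterior error probability (PEP) with RT evidence via Bayes' rule.

    pep: prior probability that the match is incorrect (from the spectrum alone).
    Assumed likelihoods: normal around rt_pred if correct, uniform over the run if incorrect.
    """
    like_correct = normal_pdf(rt_obs, rt_pred, sigma)
    like_incorrect = 1.0 / (run_end - run_start)
    weighted_incorrect = pep * like_incorrect
    return weighted_incorrect / (weighted_incorrect + (1.0 - pep) * like_correct)

# Hypothetical 60-minute run; times in minutes.
pep_on_time = updated_pep(0.05, rt_obs=30.0, rt_pred=30.0, sigma=0.5, run_start=0.0, run_end=60.0)
pep_off_time = updated_pep(0.05, rt_obs=40.0, rt_pred=30.0, sigma=0.5, run_start=0.0, run_end=60.0)
```

Under these assumptions, a match observed at its predicted RT becomes more confident (lower error probability), while a match observed far from its predicted RT becomes less confident.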
Subjects
Proteome , Single-Cell Analysis , Bayes Theorem , Liquid Chromatography/methods , Tandem Mass Spectrometry/methods
ABSTRACT
Around the world, human populations have experienced large increases in average lifespan over the last 150 years, and while individuals are living longer, they are spending more years of life with multiple chronic morbidities. Researchers have used numerous laboratory animal models to understand the biological and environmental factors that influence aging, morbidity, and longevity. However, the most commonly studied animal species, laboratory mice and rats, do not experience environmental conditions similar to those to which humans are exposed, nor do we often diagnose them with many of the naturally occurring pathologies seen in humans. Recently, the companion dog has been proposed as a powerful model to better understand the genetic and environmental determinants of morbidity and mortality in humans. However, it is not known to what extent the age-related dynamics of morbidity, comorbidity, and mortality are shared between humans and dogs. Here, we present the first large-scale comparison of human and canine patterns of age-specific morbidity and mortality. We find that many chronic conditions that commonly occur in human populations (obesity, arthritis, hypothyroidism, and diabetes), and which are associated with comorbidities, are also associated with similarly high levels of comorbidity in companion dogs. We also find significant similarities in the effect of age on disease risk in humans and dogs, with neoplastic, congenital, and metabolic causes of death showing similar age trajectories between the two species. Overall, our study suggests that the companion dog may be an ideal translational model to study the many complex facets of human morbidity and mortality.
Subjects
Aging , Animals , Animal Disease Models , Dogs , Humans , Longitudinal Studies , Mortality
ABSTRACT
Improving current models and hypotheses of cellular pathways is one of the major challenges of systems biology and functional genomics. There is a need for methods to build on established expert knowledge and reconcile it with results of new high-throughput studies. Moreover, the available sources of data are heterogeneous, and the data need to be integrated in different ways depending on which part of the pathway they are most informative for. In this paper, we introduce a compartment-specific strategy to integrate edge, node and path data for refining a given network hypothesis. To carry out inference, we use a local-move Gibbs sampler for updating the pathway hypothesis from a compendium of heterogeneous data sources, and a new network regression idea for integrating protein attributes. We demonstrate the utility of this approach in a case study of the pheromone response MAPK pathway in the yeast S. cerevisiae.
ABSTRACT
Transcriptional and post-transcriptional regulation shape tissue-type-specific proteomes, but their relative contributions remain contested. Estimates of the factors determining protein levels in human tissues do not distinguish between (i) the factors determining the variability between the abundances of different proteins, i.e., mean-level variability, and (ii) the factors determining the physiological variability of the same protein across different tissue types, i.e., across-tissues variability. We sought to estimate the contribution of transcript levels to these two orthogonal sources of variability, and found that scaled mRNA levels can account for most of the mean-level variability but not necessarily for across-tissues variability. The reliable quantification of the latter estimate is limited by substantial measurement noise. However, protein-to-mRNA ratios exhibit substantial across-tissues variability that is functionally concerted and reproducible across different datasets, suggesting extensive post-transcriptional regulation. These results caution against estimating protein fold-changes from mRNA fold-changes between different cell types, and highlight the contribution of post-transcriptional regulation to shaping tissue-type-specific proteomes.
Subjects
Gene Expression Regulation/genetics , Organ Specificity/genetics , Proteome/genetics , Messenger RNA/genetics , Genetic Transcription/genetics , Protein Databases , Humans , Proteome/analysis , Proteome/metabolism , Messenger RNA/analysis , Messenger RNA/metabolism
ABSTRACT
Heat causes protein misfolding and aggregation and, in eukaryotic cells, triggers aggregation of proteins and RNA into stress granules. We have carried out extensive proteomic studies to quantify heat-triggered aggregation and subsequent disaggregation in budding yeast, identifying >170 endogenous proteins aggregating within minutes of heat shock in multiple subcellular compartments. We demonstrate that these aggregated proteins are not misfolded and destined for degradation. Stable-isotope labeling reveals that even severely aggregated endogenous proteins are disaggregated without degradation during recovery from shock, contrasting with the rapid degradation observed for many exogenous thermolabile proteins. Although aggregation likely inactivates many cellular proteins, in the case of a heterotrimeric aminoacyl-tRNA synthetase complex, the aggregated proteins remain active with unaltered fidelity. We propose that most heat-induced aggregation of mature proteins reflects the operation of an adaptive, autoregulatory process of functionally significant aggregate assembly and disassembly that aids cellular adaptation to thermal stress.
Subjects
Heat-Shock Response , Saccharomyces cerevisiae/cytology , Saccharomyces cerevisiae/physiology , Cycloheximide/pharmacology , Cytoplasmic Granules/metabolism , Protein Aggregates , Protein Biosynthesis/drug effects , Protein Synthesis Inhibitors/pharmacology , Saccharomyces cerevisiae Proteins/chemistry , Saccharomyces cerevisiae Proteins/metabolism
ABSTRACT
We consider the problem of quantifying the degree of coordination between transcription and translation in yeast. Several studies have reported a surprising lack of coordination over the years, in organisms as different as yeast and human, using diverse technologies. However, a close look at this literature suggests that the lack of reported correlation may not reflect the biology of regulation. These reports do not control for between-study biases and structure in the measurement errors, ignore key aspects of how the data connect to the estimand, and systematically underestimate the correlation as a consequence. Here, we design a careful meta-analysis of 27 yeast data sets, supported by a multilevel model, full uncertainty quantification, a suite of sensitivity analyses and novel theory, to produce a more accurate estimate of the correlation between mRNA and protein levels, a proxy for coordination. From a statistical perspective, this problem motivates new theory on the impact of noise, model mis-specifications and non-ignorable missing data on estimates of the correlation between high-dimensional responses. We find that the correlation between mRNA and protein levels is quite high under the studied conditions in yeast, suggesting that post-transcriptional regulation plays a less prominent role than previously thought.
ABSTRACT
Cells respond to their environment by modulating protein levels through mRNA transcription and post-transcriptional control. Modest observed correlations between global steady-state mRNA and protein measurements have been interpreted as evidence that mRNA levels determine roughly 40% of the variation in protein levels, indicating dominant post-transcriptional effects. However, the techniques underlying these conclusions, such as correlation and regression, yield biased results when data are noisy, missing systematically, and collinear (all properties of mRNA and protein measurements), which motivated us to revisit this subject. Noise-robust analyses of 24 studies of budding yeast reveal that mRNA levels explain more than 85% of the variation in steady-state protein levels. Protein levels are not proportional to mRNA levels, but rise much more rapidly. Regulation of translation suffices to explain this nonlinear effect, revealing post-transcriptional amplification of, rather than competition with, transcriptional signals. These results substantially revise widely credited models of protein-level regulation, and introduce multiple noise-aware approaches essential for proper analysis of many biological phenomena.
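One classical way to see why measurement noise biases correlations downward is Spearman's correction for attenuation, which divides an observed correlation by the square root of the product of the two measurement reliabilities. The sketch below is offered only as an illustration of that general principle; it is not claimed to be the noise-robust analysis used in this study, and the reliability values and the function name `disattenuate` are hypothetical.

```python
import math

def disattenuate(r_observed, reliability_x, reliability_y):
    """Spearman's correction: estimate the noise-free correlation.

    reliability_x, reliability_y: fraction of each measurement's variance that is
    true signal (for example, estimated from correlations between replicates).
    """
    if not (0.0 < reliability_x <= 1.0 and 0.0 < reliability_y <= 1.0):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_observed / math.sqrt(reliability_x * reliability_y)

# Hypothetical values: observed mRNA-protein correlation and two reliabilities.
r_corrected = disattenuate(0.6, 0.8, 0.9)
```

With these illustrative numbers the corrected correlation is about 0.71, so the fraction of variance explained rises from 0.36 to 0.50 once noise in both measurements is taken into account.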
Subjects
Fungal Gene Expression Regulation , Post-Transcriptional RNA Processing , Messenger RNA/genetics , Saccharomyces cerevisiae Proteins/metabolism , Saccharomyces cerevisiae/genetics , Genetic Models , Messenger RNA/metabolism , Reproducibility of Results , Saccharomyces cerevisiae/metabolism , Saccharomyces cerevisiae Proteins/genetics , Genetic Transcription
ABSTRACT
Genetic algorithms (GAs) have been used to find efficient solutions to numerous fundamental and applied problems. While GAs are a robust and flexible approach to solve complex problems, there are some situations under which they perform poorly. Here, we introduce a genetic algorithm approach that is able to solve complex tasks plagued by so-called "golf-course"-like fitness landscapes. Our approach, which we denote variable environment genetic algorithms (VEGAs), is able to find highly efficient solutions by inducing environmental changes that require more complex solutions and thus creating an evolutionary drive. Using the density classification task, a paradigmatic computer science problem, as a case study, we show that more complex rules that preserve information about the solution to simpler tasks can adapt to more challenging environments. Interestingly, we find that conservative strategies, which have a bias toward the current state, evolve naturally as a highly efficient solution to the density classification task under noisy conditions.
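For readers unfamiliar with the density classification task, the sketch below shows how a candidate rule can be scored: a one-dimensional ring of binary cells should converge to all ones if ones were initially in the majority, and to all zeros otherwise. The rule used here is a plain local-majority vote, chosen only as a simple baseline; it is not a rule evolved by VEGAs, and the function names and the step limit are assumptions.

```python
def step(cells, radius=1):
    """Apply one synchronous local-majority update on a ring of 0/1 cells."""
    n = len(cells)
    new_cells = []
    for i in range(n):
        # Neighborhood of size 2 * radius + 1, wrapping around the ring.
        window = [cells[(i + k) % n] for k in range(-radius, radius + 1)]
        new_cells.append(1 if 2 * sum(window) > len(window) else 0)
    return new_cells

def classifies_correctly(cells, radius=1, max_steps=100):
    """Return True if the ring reaches the uniform state matching the initial majority."""
    target = 1 if 2 * sum(cells) > len(cells) else 0
    for _ in range(max_steps):
        cells = step(cells, radius)
    return all(c == target for c in cells)

# A lone 0 among ones is corrected; a block of three ones in a 0-majority ring is not.
easy_case = classifies_correctly([1, 1, 1, 1, 1, 0, 1, 1, 1])
hard_case = classifies_correctly([0, 0, 1, 1, 1, 0, 0, 0, 0])
```

The second configuration shows why the task is hard for purely local rules: a contiguous block of the minority state is stable under local-majority voting, so the ring never reaches the correct uniform state.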