ABSTRACT
Sepsis is a systemic response to infection with life-threatening consequences. Our understanding of the molecular and cellular impact of sepsis across organs remains rudimentary. Here, we characterize the pathogenesis of sepsis by measuring dynamic changes in gene expression across organs. To pinpoint molecules controlling organ states in sepsis, we compare the effects of sepsis on organ gene expression to those of 6 singles and 15 pairs of recombinant cytokines. Strikingly, we find that the pairwise effects of tumor necrosis factor plus interleukin (IL)-18, interferon-gamma or IL-1β suffice to mirror the impact of sepsis across tissues. Mechanistically, we map the cellular effects of sepsis and cytokines by computing changes in the abundance of 195 cell types across 9 organs, which we validate by whole-mouse spatial profiling. Our work decodes the cytokine cacophony in sepsis into a pairwise cytokine message capturing the gene, cell and tissue responses of the host to the disease.
Subjects
Cytokines, Sepsis, Mice, Animals, Interleukin-6/genetics, Tumor Necrosis Factor-alpha/metabolism, Interferon-gamma, Sepsis/genetics
ABSTRACT
Predicting phenotypes from genotypes is a fundamental task in quantitative genetics. With technological advances, it is now possible to measure multiple phenotypes in large samples. Multiple phenotypes can share their genetic component; therefore, modeling these phenotypes jointly may improve prediction accuracy by leveraging effects that are shared across phenotypes. However, effects can be shared across phenotypes in a variety of ways, so computationally efficient statistical methods are needed that can accurately and flexibly capture patterns of effect sharing. Here, we describe new Bayesian multivariate, multiple regression methods that, by using flexible priors, are able to model and adapt to different patterns of effect sharing and specificity across phenotypes. Simulation results show that these new methods are fast and improve prediction accuracy compared with existing methods in a wide range of settings where effects are shared. Further, in settings where effects are not shared, our methods still perform competitively with state-of-the-art methods. In real data analyses of expression data in the Genotype Tissue Expression (GTEx) project, our methods improve prediction performance on average for all tissues, with the greatest gains in tissues where effects are strongly shared, and in the tissues with smaller sample sizes. While we use gene expression prediction to illustrate our methods, the methods are generally applicable to any multi-phenotype application, including prediction of polygenic scores and breeding values. Thus, our methods have the potential to provide improvements across fields and organisms.
Subjects
Genetic Models, Single Nucleotide Polymorphism, Bayes Theorem, Genotype, Phenotype, Computer Simulation, Gene Expression
ABSTRACT
SUMMARY: Motivated by theoretical and practical issues that arise when applying Principal component analysis (PCA) to count data, Townes et al. introduced "Poisson GLM-PCA", a variation of PCA adapted to count data, as a tool for dimensionality reduction of single-cell RNA sequencing (scRNA-seq) data. However, fitting GLM-PCA is computationally challenging. Here we study this problem, and show that a simple algorithm, which we call "Alternating Poisson Regression" (APR), produces better quality fits, and in less time, than existing algorithms. APR is also memory-efficient and lends itself to parallel implementation on multi-core processors, both of which are helpful for handling large scRNA-seq datasets. We illustrate the benefits of this approach in three publicly available scRNA-seq datasets. The new algorithms are implemented in an R package, fastglmpca. AVAILABILITY AND IMPLEMENTATION: The fastglmpca R package is released on CRAN for Windows, macOS and Linux, and the source code is available at github.com/stephenslab/fastglmpca under the open source GPL-3 license. Scripts to reproduce the results in this paper are also available in the GitHub repository and on Zenodo.
Subjects
Algorithms, RNA Sequence Analysis, Single-Cell Analysis, Software, Single-Cell Analysis/methods, RNA Sequence Analysis/methods, Principal Component Analysis, Humans
ABSTRACT
In recent work, Wang et al introduced the "Sum of Single Effects" (SuSiE) model, and showed that it provides a simple and efficient approach to fine-mapping genetic variants from individual-level data. Here we present new methods for fitting the SuSiE model to summary data, for example to single-SNP z-scores from an association study and linkage disequilibrium (LD) values estimated from a suitable reference panel. To develop these new methods, we first describe a simple, generic strategy for extending any individual-level data method to deal with summary data. The key idea is to replace the usual regression likelihood with an analogous likelihood based on summary data. We show that existing fine-mapping methods such as FINEMAP and CAVIAR also (implicitly) use this strategy, but in different ways, and so this provides a common framework for understanding different methods for fine-mapping. We investigate other common practical issues in fine-mapping with summary data, including problems caused by inconsistencies between the z-scores and LD estimates, and we develop diagnostics to identify these inconsistencies. We also present a new refinement procedure that improves model fits in some data sets, and hence improves overall reliability of the SuSiE fine-mapping results. Detailed evaluations of fine-mapping methods in a range of simulated data sets show that SuSiE applied to summary data is competitive, in both speed and accuracy, with the best available fine-mapping methods for summary data.
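As a rough illustration of the two summary-data inputs named in the abstract above, the following sketch computes single-SNP z-scores and an LD (correlation) matrix from individual-level data. It is not the SuSiE implementation: the function name `summary_statistics`, the simulated genotypes, and the choice of in-sample correlation as the LD estimate are all assumptions made for the example.

```python
import numpy as np

def summary_statistics(X, y):
    """Return single-SNP z-scores and the in-sample LD (correlation) matrix.

    Hypothetical helper for illustration; X is an n x p genotype matrix
    and y is a length-n phenotype vector.
    """
    n, p = X.shape
    Xc = X - X.mean(axis=0)                        # center each SNP column
    yc = y - y.mean()                              # center the phenotype
    xtx = np.sum(Xc ** 2, axis=0)                  # per-SNP sum of squares
    bhat = Xc.T @ yc / xtx                         # simple-regression slope per SNP
    resid_ss = np.sum(yc ** 2) - bhat ** 2 * xtx   # residual sum of squares per SNP
    se = np.sqrt(resid_ss / (n - 2) / xtx)         # standard error of each slope
    z = bhat / se                                  # z-score (t-statistic) per SNP
    R = np.corrcoef(X, rowvar=False)               # LD as the sample correlation
    return z, R

# Simulated example (assumed data): SNP index 2 has a true effect.
rng = np.random.default_rng(0)
n, p = 500, 5
X = rng.binomial(2, 0.3, size=(n, p)).astype(float)
y = 0.5 * X[:, 2] + rng.normal(size=n)
z, R = summary_statistics(X, y)
```

In practice the LD matrix often comes from a separate reference panel rather than the study sample, which is one source of the inconsistencies between z-scores and LD estimates that the abstract mentions.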
Subjects
Genetic Models, Single Nucleotide Polymorphism, Likelihood Functions, Linkage Disequilibrium, Single Nucleotide Polymorphism/genetics, Reproducibility of Results
ABSTRACT
We introduce a simple new approach to variable selection in linear regression, with a particular focus on quantifying uncertainty in which variables should be selected. The approach is based on a new model - the "Sum of Single Effects" (SuSiE) model - which comes from writing the sparse vector of regression coefficients as a sum of "single-effect" vectors, each with one non-zero element. We also introduce a corresponding new fitting procedure - Iterative Bayesian Stepwise Selection (IBSS) - which is a Bayesian analogue of stepwise selection methods. IBSS shares the computational simplicity and speed of traditional stepwise methods, but instead of selecting a single variable at each step, IBSS computes a distribution on variables that captures uncertainty in which variable to select. We provide a formal justification of this intuitive algorithm by showing that it optimizes a variational approximation to the posterior distribution under the SuSiE model. Further, this approximate posterior distribution naturally yields convenient novel summaries of uncertainty in variable selection, providing a Credible Set of variables for each selection. Our methods are particularly well-suited to settings where variables are highly correlated and detectable effects are sparse, both of which are characteristics of genetic fine-mapping applications. We demonstrate through numerical experiments that our methods outperform existing methods for this task, and illustrate their application to fine-mapping genetic variants influencing alternative splicing in human cell-lines. We also discuss the potential and challenges for applying these methods to generic variable selection problems.
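The idea of fitting one "single effect" at a time to the residuals left by the others can be sketched as follows. This is a simplified illustration only, not the authors' software: it assumes the residual variance and prior effect variance are fixed and known, uses a uniform prior over variables, and omits credible-set construction; the names `single_effect_regression` and `ibss` are hypothetical.

```python
import numpy as np

def single_effect_regression(X, r, resid_var, prior_var):
    """Posterior for a model in which exactly one variable has a nonzero effect."""
    xtx = np.sum(X ** 2, axis=0)
    bhat = X.T @ r / xtx                 # least-squares estimate for each variable
    s2 = resid_var / xtx                 # sampling variance of each estimate
    # Log Bayes factor comparing "variable j is the effect" against "no effect".
    log_bf = (0.5 * np.log(s2 / (s2 + prior_var))
              + 0.5 * bhat ** 2 / s2 * prior_var / (s2 + prior_var))
    w = np.exp(log_bf - log_bf.max())    # subtract the max for numerical stability
    alpha = w / w.sum()                  # probability that each variable is the effect
    post_var = 1.0 / (1.0 / s2 + 1.0 / prior_var)
    post_mean = post_var / s2 * bhat     # posterior mean effect, given selection
    return alpha, post_mean

def ibss(X, y, L=3, resid_var=1.0, prior_var=1.0, n_iter=50):
    """Iteratively refit each single effect to the residuals from the others."""
    X = X - X.mean(axis=0)               # center so no intercept is needed
    y = y - y.mean()
    n, p = X.shape
    alpha = np.full((L, p), 1.0 / p)
    mu = np.zeros((L, p))
    for _ in range(n_iter):
        for l in range(L):
            # Expected coefficients contributed by all effects except effect l.
            b_other = np.sum(alpha * mu, axis=0) - alpha[l] * mu[l]
            r = y - X @ b_other
            alpha[l], mu[l] = single_effect_regression(X, r, resid_var, prior_var)
    pip = 1.0 - np.prod(1.0 - alpha, axis=0)   # posterior inclusion probabilities
    return alpha, mu, pip

# Simulated example (assumed data): variables 3 and 11 have true effects.
rng = np.random.default_rng(1)
n, p = 300, 20
X = rng.normal(size=(n, p))
b = np.zeros(p)
b[3], b[11] = 1.0, -1.0
y = X @ b + rng.normal(size=n)
alpha, mu, pip = ibss(X, y, L=3)
```

Each row of `alpha` is a distribution over variables rather than a single selected variable, which is what distinguishes this procedure from traditional stepwise selection.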
ABSTRACT
Genomic selection has been proposed as the standard method to predict breeding values in animal and plant breeding. Although some crops have benefited from this methodology, studies in Coffea are still emerging. To date, there have been no studies describing how well genomic prediction models work across populations and environments for different complex traits in coffee. Considering that predictive models are based on biological and statistical assumptions, their performance is expected to vary depending on how well these assumptions align with the true genetic architecture of the phenotype. To investigate this, we used data from two recurrent selection populations of Coffea canephora, evaluated in two locations, and single nucleotide polymorphisms identified by Genotyping-by-Sequencing. In particular, we evaluated the performance of 13 statistical approaches to predict three important traits in coffee: production of coffee beans, leaf rust incidence, and yield of green beans. Analyses were performed for predictions within-environment, across locations and across populations to assess the reliability of genomic selection. Overall, differences in the prediction accuracy of the competing models were small, although the Bayesian methods showed a modest improvement over other methods, at the cost of more computation time. As expected, predictive accuracy for within-environment analysis was, on average, higher than for predictions across locations and across populations. Our results support the potential of genomic selection to reshape traditional plant breeding schemes. In practice, we expect to increase the genetic gain per unit of time by reducing the cycle length of recurrent selection in coffee.
Subjects
Coffea/genetics, Environment, Gene-Environment Interaction, Plant Genome, Genome-Wide Association Study, Genomics, Genetic Models, Algorithms, Genomics/methods, Genotype, Statistical Models, Phenotype, Plant Breeding, Genetic Selection
ABSTRACT
The vertebrate cranium is a prime example of the high evolvability of complex traits. While evidence of genes and developmental pathways underlying craniofacial shape determination is accumulating, we are still far from understanding how such variation at the genetic level is translated into craniofacial shape variation. Here we used 3D geometric morphometrics to map genes involved in shape determination in a population of outbred mice (Carworth Farms White, or CFW). We defined shape traits via principal component analysis of 3D skull and mandible measurements. We mapped genetic loci associated with shape traits at ~80,000 candidate single nucleotide polymorphisms in ~700 male mice. We found that craniofacial shape and size are highly heritable, polygenic traits. Despite the polygenic nature of the traits, we identified 17 loci that explain variation in skull shape, and 8 loci associated with variation in mandible shape. Together, the associated variants account for 11.4% of skull and 4.4% of mandible shape variation; however, the total additive genetic variance associated with phenotypic variation was estimated at ~45%. Candidate genes within the associated loci have known roles in craniofacial development; this includes 6 transcription factors and several regulators of bone developmental pathways. One gene, Mn1, has an unusually large effect on shape variation in our study. A knockout of this gene was previously shown to negatively affect the development of membranous bones of the cranial skeleton, and evolutionary analysis shows that the gene arose at the base of the bony vertebrates (Euteleostomi), where the ossified head first appeared. Therefore, Mn1 emerges as a key gene for both skull formation and within-population shape variation. Our study shows that it is possible to identify important developmental genes through genome-wide mapping of high-dimensional shape features in an outbred population.
Subjects
Face/anatomy & histology, Developmental Gene Expression Regulation, Skull/anatomy & histology, Animals, Male, Mice, Mutant Mice, Single Nucleotide Polymorphism
ABSTRACT
Pathway analyses of genome-wide association studies aggregate information over sets of related genes, such as genes in common pathways, to identify gene sets that are enriched for variants associated with disease. We develop a model-based approach to pathway analysis, and apply this approach to data from the Wellcome Trust Case Control Consortium (WTCCC) studies. Our method offers several benefits over existing approaches. First, our method not only interrogates pathways for enrichment of disease associations, but also estimates the level of enrichment, which yields a coherent way to promote variants in enriched pathways, enhancing discovery of genes underlying disease. Second, our approach allows for multiple enriched pathways, a feature that leads to novel findings in two diseases where the major histocompatibility complex (MHC) is a major determinant of disease susceptibility. Third, by modeling disease as the combined effect of multiple markers, our method automatically accounts for linkage disequilibrium among variants. Interrogation of pathways from eight pathway databases yields strong support for enriched pathways, indicating links between Crohn's disease (CD) and cytokine-driven networks that modulate immune responses; between rheumatoid arthritis (RA) and "Measles" pathway genes involved in immune responses triggered by measles infection; and between type 1 diabetes (T1D) and IL2-mediated signaling genes. Prioritizing variants in these enriched pathways yields many additional putative disease associations compared to analyses without enrichment. For CD and RA, 7 of 8 additional non-MHC associations are corroborated by other studies, providing validation for our approach. For T1D, prioritization of IL-2 signaling genes yields strong evidence for 7 additional non-MHC candidate disease loci, as well as suggestive evidence for several more. 
Of the 7 strongest associations, 4 are validated by other studies, and 3 (near IL-2 signaling genes RAF1, MAPK14, and FYN) constitute novel putative T1D loci for further study.
Subjects
Crohn Disease/genetics, Type 1 Diabetes Mellitus/genetics, Genome-Wide Association Study, Interleukin-2/genetics, Rheumatoid Arthritis/genetics, Rheumatoid Arthritis/pathology, Case-Control Studies, Crohn Disease/pathology, Type 1 Diabetes Mellitus/pathology, Type 2 Diabetes Mellitus/genetics, Type 2 Diabetes Mellitus/pathology, Genetic Predisposition to Disease, Humans, Interleukin-2/metabolism, Mitogen-Activated Protein Kinase 14/genetics, Single Nucleotide Polymorphism, Proto-Oncogene Proteins c-fyn/genetics, Signal Transduction
ABSTRACT
Both linear mixed models (LMMs) and sparse regression models are widely used in genetics applications, including, recently, polygenic modeling in genome-wide association studies. These two approaches make very different assumptions, so are expected to perform well in different situations. However, in practice, for a given dataset one typically does not know which assumptions will be more accurate. Motivated by this, we consider a hybrid of the two, which we refer to as a "Bayesian sparse linear mixed model" (BSLMM) that includes both these models as special cases. We address several key computational and statistical issues that arise when applying BSLMM, including appropriate prior specification for the hyper-parameters and a novel Markov chain Monte Carlo algorithm for posterior inference. We apply BSLMM and compare it with other methods for two polygenic modeling applications: estimating the proportion of variance in phenotypes explained (PVE) by available genotypes, and phenotype (or breeding value) prediction. For PVE estimation, we demonstrate that BSLMM combines the advantages of both standard LMMs and sparse regression modeling. For phenotype prediction it considerably outperforms either of the other two methods, as well as several other large-scale regression methods previously suggested for this problem. Software implementing our method is freely available from http://stephenslab.uchicago.edu/software.html.
Subjects
Bayes Theorem, Genome-Wide Association Study, Multifactorial Inheritance/genetics, Algorithms, Computer Simulation, Genotype, Humans, Markov Chains, Genetic Models, Monte Carlo Method, Software
ABSTRACT
Summary: Motivated by theoretical and practical issues that arise when applying Principal Components Analysis (PCA) to count data, Townes et al introduced "Poisson GLM-PCA", a variation of PCA adapted to count data, as a tool for dimensionality reduction of single-cell RNA sequencing (RNA-seq) data. However, fitting GLM-PCA is computationally challenging. Here we study this problem, and show that a simple algorithm, which we call "Alternating Poisson Regression" (APR), produces better quality fits, and in less time, than existing algorithms. APR is also memory-efficient, and lends itself to parallel implementation on multi-core processors, both of which are helpful for handling large single-cell RNA-seq data sets. We illustrate the benefits of this approach in two published single-cell RNA-seq data sets. The new algorithms are implemented in an R package, fastglmpca. Availability and implementation: The fastglmpca R package is released on CRAN for Windows, macOS and Linux, and the source code is available at github.com/stephenslab/fastglmpca under the open source GPL-3 license. Scripts to reproduce the results in this paper are also available in the GitHub repository. Contact: mstephens@uchicago.edu. Supplementary information: Supplementary data are available on BioRxiv online.
ABSTRACT
We introduce mvSuSiE, a multi-trait fine-mapping method for identifying putative causal variants from genetic association data (individual-level or summary data). mvSuSiE learns patterns of shared genetic effects from data, and exploits these patterns to improve power to identify causal SNPs. Comparisons on simulated data show that mvSuSiE is competitive in speed, power and precision with existing multi-trait methods, and uniformly improves on single-trait fine-mapping (SuSiE) in each trait separately. We applied mvSuSiE to jointly fine-map 16 blood cell traits using data from the UK Biobank. By jointly analyzing the traits and modeling heterogeneous effect sharing patterns, we discovered a much larger number of causal SNPs (>3,000) compared with single-trait fine-mapping, and with narrower credible sets. mvSuSiE also more comprehensively characterized the ways in which the genetic variants affect one or more blood cell traits; 68% of causal SNPs showed significant effects in more than one blood cell type.
ABSTRACT
Parts-based representations, such as non-negative matrix factorization and topic modeling, have been used to identify structure from single-cell sequencing data sets, in particular structure that is not as well captured by clustering or other dimensionality reduction methods. However, interpreting the individual parts remains a challenge. To address this challenge, we extend methods for differential expression analysis by allowing cells to have partial membership to multiple groups. We call this grade of membership differential expression (GoM DE). We illustrate the benefits of GoM DE for annotating topics identified in several single-cell RNA-seq and ATAC-seq data sets.
Subjects
Chromatin Immunoprecipitation Sequencing, Single-Cell Analysis, Single-Cell Analysis/methods, Algorithms, Cluster Analysis, RNA Sequence Analysis/methods, Gene Expression Profiling/methods
ABSTRACT
Profiling tumors with single-cell RNA sequencing (scRNA-seq) has the potential to identify recurrent patterns of transcription variation related to cancer progression, and so produce new therapeutically-relevant insights. However, the presence of strong inter-tumor heterogeneity often obscures more subtle patterns that are shared across tumors, some of which may characterize clinically-relevant disease subtypes. Here we introduce a new statistical method to address this problem. We show that this method can help decompose transcriptional heterogeneity into interpretable components - including patient-specific, dataset-specific and shared components relevant to disease subtypes - and that, in the presence of strong inter-tumor heterogeneity, our method can produce more interpretable results than existing widely-used methods. Applied to data from three studies on pancreatic ductal adenocarcinoma (PDAC), our method produces a refined characterization of existing tumor subtypes (e.g. classical vs basal), and identifies a new gene expression program (GEP) that is prognostic of poor survival independent of established prognostic factors such as tumor stage and subtype. The new GEP is enriched for genes involved in a variety of stress responses, and suggests a potentially important role for the integrated stress response in PDAC development and prognosis.
ABSTRACT
Sepsis is a systemic response to infection with life-threatening consequences. Our understanding of the impact of sepsis across organs of the body is rudimentary. Here, using mouse models of sepsis, we generate a dynamic, organism-wide map of the pathogenesis of the disease, revealing the spatiotemporal patterns of the effects of sepsis across tissues. These data revealed two interorgan mechanisms key in sepsis. First, we discover a simplifying principle in the systemic behavior of the cytokine network during sepsis, whereby a hierarchical cytokine circuit arising from the pairwise effects of TNF plus IL-18, IFN-γ, or IL-1β explains half of all the cellular effects of sepsis on 195 cell types across 9 organs. Second, we find that the secreted phospholipase PLA2G5 mediates hemolysis in blood, contributing to organ failure during sepsis. These results provide fundamental insights to help build a unifying mechanistic framework for the pathophysiological effects of sepsis on the body.
ABSTRACT
BACKGROUND: Genome-wide association studies of asthma have revealed robust associations with variation across the human leukocyte antigen (HLA) complex with independent associations in the HLA class I and class II regions for both childhood-onset asthma (COA) and adult-onset asthma (AOA). However, the specific variants and genes contributing to risk are unknown. METHODS: We used Bayesian approaches to perform genetic fine-mapping for COA and AOA (n=9432 and 21,556, respectively; n=318,167 shared controls) in White British individuals from the UK Biobank and to perform expression quantitative trait locus (eQTL) fine-mapping in immune (lymphoblastoid cell lines, n=398; peripheral blood mononuclear cells, n=132) and airway (nasal epithelial cells, n=188) cells from ethnically diverse individuals. We also examined putatively causal protein coding variation from protein crystal structures and conducted replication studies in independent multi-ethnic cohorts from the UK Biobank (COA n=1686; AOA n=3666; controls n=56,063). RESULTS: Genetic fine-mapping revealed both shared and distinct causal variation between COA and AOA in the class I region but only distinct causal variation in the class II region. Both gene expression levels and amino acid variation contributed to risk. Our results from eQTL fine-mapping and amino acid visualization suggested that the HLA-DQA1*03:01 allele and variation associated with expression of the nonclassical HLA-DQA2 and HLA-DQB2 genes accounted entirely for the most significant association with AOA in GWAS. Our studies also suggested a potentially prominent role for HLA-C protein coding variation in the class I region in COA. We replicated putatively causal variant associations in a multi-ethnic cohort. CONCLUSIONS: We highlight roles for both gene expression and protein coding variation in asthma risk and identified putatively causal variation and genes in the HLA region. 
A convergence of genomic, transcriptional, and protein coding evidence implicates the HLA-DQA2 and HLA-DQB2 genes and HLA-DQA1*03:01 allele in AOA.
Subjects
Asthma, Genome-Wide Association Study, Adult, Amino Acids/genetics, Asthma/genetics, Bayes Theorem, Child, Coenzyme A/genetics, Genetic Predisposition to Disease, Humans, Mononuclear Leukocytes, Single Nucleotide Polymorphism
ABSTRACT
Signal denoising, also known as non-parametric regression, is often performed through shrinkage estimation in a transformed (e.g., wavelet) domain; shrinkage in the transformed domain corresponds to smoothing in the original domain. A key question in such applications is how much to shrink, or, equivalently, how much to smooth. Empirical Bayes shrinkage methods provide an attractive solution to this problem; they use the data to estimate a distribution of underlying "effects," hence automatically select an appropriate amount of shrinkage. However, most existing implementations of empirical Bayes shrinkage are less flexible than they could be, both in their assumptions on the underlying distribution of effects and in their ability to handle heteroskedasticity, which limits their signal denoising applications. Here we address this by adopting a particularly flexible, stable and computationally convenient empirical Bayes shrinkage method and applying it to several signal denoising problems. These applications include smoothing of Poisson data and heteroskedastic Gaussian data. We show through empirical comparisons that the results are competitive with other methods, including both simple thresholding rules and purpose-built empirical Bayes procedures. Our methods are implemented in the R package smashr, "SMoothing by Adaptive SHrinkage in R," available at https://www.github.com/stephenslab/smashr.
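To make the phrase "use the data to estimate a distribution of underlying effects" concrete, here is a deliberately simple empirical Bayes sketch. It assumes a zero-mean normal prior with a single variance parameter estimated by maximum marginal likelihood on a grid, and it allows a different noise standard deviation for each observation. This is far less flexible than the method the abstract describes and is not the smashr implementation; the function name `eb_normal_shrink` and the grid are assumptions.

```python
import numpy as np

def eb_normal_shrink(x, s):
    """Shrink noisy observations x (noise sd s) toward zero.

    Assumed model: x_j ~ N(theta_j, s_j^2) with theta_j ~ N(0, tau2).
    tau2 is chosen to maximize the marginal likelihood over a grid.
    """
    x = np.asarray(x, dtype=float)
    s2 = np.asarray(s, dtype=float) ** 2 * np.ones_like(x)
    scale = np.mean(x ** 2)
    grid = scale * np.concatenate(([0.0], np.logspace(-3, 3, 121)))
    # Marginally, x_j ~ N(0, tau2 + s_j^2); evaluate the log-likelihood on the grid.
    loglik = [-0.5 * np.sum(np.log(t + s2) + x ** 2 / (t + s2)) for t in grid]
    tau2 = grid[int(np.argmax(loglik))]
    shrink = tau2 / (tau2 + s2)          # noisier observations are shrunk more
    return shrink * x, tau2

# Simulated example (assumed data) with heteroskedastic noise.
rng = np.random.default_rng(2)
n = 2000
theta = rng.normal(0.0, 2.0, size=n)
s = rng.uniform(0.5, 2.0, size=n)
x = theta + s * rng.normal(size=n)
shrunk, tau2_hat = eb_normal_shrink(x, s)
```

Because the amount of shrinkage is driven by the estimated prior variance, no tuning parameter has to be set by hand, which is the appeal of the empirical Bayes approach noted in the abstract.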
ABSTRACT
Maximum likelihood estimation of mixture proportions has a long history, and continues to play an important role in modern statistics, including in development of nonparametric empirical Bayes methods. Maximum likelihood of mixture proportions has traditionally been solved using the expectation maximization (EM) algorithm, but recent work by Koenker & Mizera shows that modern convex optimization techniques, in particular interior point methods, are substantially faster and more accurate than EM. Here, we develop a new solution based on sequential quadratic programming (SQP). It is substantially faster than the interior point method, and just as accurate. Our approach combines several ideas: first, it solves a reformulation of the original problem; second, it uses an SQP approach to make the best use of the expensive gradient and Hessian computations; third, the SQP iterations are implemented using an active set method to exploit the sparse nature of the quadratic subproblems; fourth, it uses accurate low-rank approximations for more efficient gradient and Hessian computations. We illustrate the benefits of the SQP approach in experiments on synthetic data sets and a large genetic association data set. In large data sets (n ≈ 10^6 observations, m ≈ 10^3 mixture components), our implementation achieves at least 100-fold reduction in runtime compared with a state-of-the-art interior point solver. Our methods are implemented in Julia and in an R package available on CRAN (https://CRAN.R-project.org/package=mixsqp).
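For readers unfamiliar with the underlying optimization problem, the sketch below shows the traditional EM baseline mentioned in the abstract, not the SQP method itself. Given an n x m matrix of component likelihoods, it seeks mixture proportions on the simplex that maximize the log-likelihood. The function names and the simulated normal-mixture data are assumptions for illustration.

```python
import numpy as np

def em_mixture_proportions(lik, n_iter=500):
    """EM updates for mixture proportions w, given lik[i, k] = p(x_i | component k)."""
    n, m = lik.shape
    w = np.full(m, 1.0 / m)                        # start from uniform proportions
    for _ in range(n_iter):
        post = lik * w                             # unnormalized responsibilities
        post /= post.sum(axis=1, keepdims=True)    # E-step: normalize each row
        w = post.mean(axis=0)                      # M-step: average responsibilities
    return w

def log_likelihood(lik, w):
    """Objective being maximized: sum_i log(sum_k w_k * lik[i, k])."""
    return float(np.sum(np.log(lik @ w)))

# Simulated example (assumed data): zero-mean normals with sd 1 (70%) and sd 4 (30%).
rng = np.random.default_rng(3)
n = 1000
x = np.where(rng.random(n) < 0.7, rng.normal(0, 1, n), rng.normal(0, 4, n))
sds = np.array([1.0, 2.0, 4.0])                    # candidate component sds
lik = np.exp(-x[:, None] ** 2 / (2 * sds ** 2)) / (sds * np.sqrt(2 * np.pi))
w_hat = em_mixture_proportions(lik)
```

EM never decreases the log-likelihood but can take many iterations to converge, which is the motivation the abstract gives for faster convex optimization approaches.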
ABSTRACT
Making scientific analyses reproducible, well documented, and easily shareable is crucial to maximizing their impact and ensuring that others can build on them. However, accomplishing these goals is not easy, requiring careful attention to organization, workflow, and familiarity with tools that are not a regular part of every scientist's toolbox. We have developed an R package, workflowr, to help all scientists, regardless of background, overcome these challenges. Workflowr aims to instill a particular "workflow" - a sequence of steps to be repeated and integrated into research practice - that helps make projects more reproducible and accessible. This workflow integrates four key elements: (1) version control (via Git); (2) literate programming (via R Markdown); (3) automatic checks and safeguards that improve code reproducibility; and (4) sharing code and results via a browsable website. These features exploit powerful existing tools, whose mastery would take considerable study. However, the workflowr interface is simple enough that novice users can quickly enjoy its many benefits. By simply following the workflowr "workflow", R users can create projects whose results, figures, and development history are easily accessible on a static website - thereby conveniently shareable with collaborators by sending them a URL - and accompanied by source code and reproducibility safeguards. The workflowr R package is open source and available on CRAN, with full documentation and source code available at https://github.com/jdblischak/workflowr.
Subjects
Information Dissemination, Software, Workflow, Reproducibility of Results
ABSTRACT
We introduce new statistical methods for analyzing genomic data sets that measure many effects in many conditions (for example, gene expression changes under many treatments). These new methods improve on existing methods by allowing for arbitrary correlations in effect sizes among conditions. This flexible approach increases power, improves effect estimates and allows for more quantitative assessments of effect-size heterogeneity compared to simple shared or condition-specific assessments. We illustrate these features through an analysis of locally acting variants associated with gene expression (cis expression quantitative trait loci (eQTLs)) in 44 human tissues. Our analysis identifies more eQTLs than existing approaches, consistent with improved power. We show that although genetic effects on expression are extensively shared among tissues, effect sizes can still vary greatly among tissues. Some shared eQTLs show stronger effects in subsets of biologically related tissues (for example, brain-related tissues), or in only one tissue (for example, testis). Our methods are widely applicable, computationally tractable for many conditions and available online.