ABSTRACT
Late-onset Alzheimer's disease (LOAD) is a common, complex genetic disorder well known for its heterogeneous pathology. The genetic heterogeneity underlying common, complex diseases poses a major challenge for targeted therapies and the identification of novel disease-associated variants. Case-control approaches are often limited to examining a specific outcome in a group of heterogeneous patients with different clinical characteristics. Here, we developed a novel approach to define relevant transcriptomic endophenotypes and stratify decedents based on molecular profiles in three independent human LOAD cohorts. By integrating post-mortem brain gene co-expression data from 2114 human samples with LOAD, we developed a novel quantitative, composite phenotype that can better account for the heterogeneity in genetic architecture underlying the disease. We used iterative weighted gene co-expression network analysis (WGCNA) to reduce data dimensionality and to isolate gene sets that are highly co-expressed within disease subtypes and represent specific molecular pathways. We then performed single variant association testing using whole-genome sequencing data for the novel composite phenotype in order to identify genetic loci that contribute to disease heterogeneity. Distinct LOAD subtypes were identified for all three study cohorts (two in ROSMAP, three in Mayo Clinic, and two in Mount Sinai Brain Bank). Single variant association analysis identified a genome-wide significant variant in TMEM106B (p-value < 5×10^-8, rs1990620G) in the ROSMAP cohort that confers protection from the inflammatory LOAD subtype. Taken together, our novel approach can be used to stratify LOAD into distinct molecular subtypes based on affected disease pathways.
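The co-expression step named above can be illustrated with the soft-threshold adjacency commonly used in WGCNA, where the adjacency between two genes is the absolute Pearson correlation raised to a power. This is a minimal sketch under stated assumptions, not the study's iterative pipeline: the function names, the toy expression values, and the exponent `beta=6` are all illustrative choices that do not come from the abstract.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists.

    Assumes neither list is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def soft_threshold_adjacency(expr, beta=6):
    """Unsigned co-expression adjacency |cor(g, h)| ** beta for every gene pair.

    expr: dict mapping a gene name to its expression values across samples.
    beta: soft-thresholding power (illustrative value, an assumption here).
    """
    genes = list(expr)
    return {
        (g, h): abs(pearson(expr[g], expr[h])) ** beta
        for g in genes
        for h in genes
        if g != h
    }

# Hypothetical toy data: three genes measured in four samples.
toy = {"A": [1, 2, 3, 4], "B": [2, 4, 6, 8], "C": [1, -1, 1, -1]}
adjacency = soft_threshold_adjacency(toy)
```

Raising the correlation to a power suppresses weak correlations while keeping strong ones near 1, which is what lets tightly co-expressed gene sets stand out as modules.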
Subjects
Alzheimer Disease/genetics , Genes, Modifier , Transcriptome , Aged , Aged, 80 and over , Alzheimer Disease/pathology , Cerebral Cortex/metabolism , Cerebral Cortex/pathology , Female , Gene Expression Profiling/methods , Genetic Heterogeneity , Genome-Wide Association Study/methods , Humans , Male , Membrane Proteins/genetics , Nerve Tissue Proteins/genetics , Polymorphism, Single Nucleotide
ABSTRACT
MOTIVATION: Late-onset Alzheimer's disease currently has no known effective treatment options. To better understand the disease, new multi-omic datasets have recently been generated with the goal of identifying its molecular causes. However, most analytic studies using these datasets focus on uni-modal analysis of the data. Here, we propose a data-driven approach that integrates multiple data types and analytic outcomes to aggregate evidence supporting the hypothesis that a gene is a genetic driver of the disease. The main algorithmic contributions of our article are: (i) a general machine learning framework that learns the key characteristics of a few known driver genes from multiple feature sets and identifies other potential driver genes with similar feature representations, and (ii) a flexible ranking scheme with the ability to integrate external validation in the form of Genome Wide Association Study summary statistics. While we currently focus on demonstrating the effectiveness of the approach using different analytic outcomes from RNA-Seq studies, this method is easily generalizable to other data modalities and analysis types. RESULTS: We demonstrate the utility of our machine learning algorithm on two benchmark multiview datasets by significantly outperforming the baseline approaches in predicting missing labels. We then use the algorithm to predict and rank potential drivers of Alzheimer's. We show that our ranked genes show a significant enrichment for single nucleotide polymorphisms associated with Alzheimer's and are enriched in pathways that have been previously associated with the disease. AVAILABILITY AND IMPLEMENTATION: Source code and links to all feature sets are available at https://github.com/Sage-Bionetworks/EvidenceAggregatedDriverRanking.
Subjects
Algorithms , Alzheimer Disease , Genome-Wide Association Study , Alzheimer Disease/genetics , Humans , Machine Learning , Software
ABSTRACT
We present a computational framework, called DISCERN (DIfferential SparsE Regulatory Network), to identify informative topological changes in gene-regulator dependence networks inferred on the basis of mRNA expression datasets within distinct biological states. DISCERN takes two expression datasets as input: an expression dataset of diseased tissues from patients with a disease of interest and another expression dataset from matching normal tissues. DISCERN estimates the extent to which each gene is perturbed, that is, the extent to which it has distinct regulator connectivity in the inferred gene-regulator dependencies between the disease and normal conditions. This approach has distinct advantages over existing methods. First, DISCERN infers conditional dependencies between candidate regulators and genes, where conditional dependence relationships discriminate the evidence for direct interactions from indirect interactions more precisely than pairwise correlation. Second, DISCERN uses a new likelihood-based scoring function to alleviate concerns about accuracy of the specific edges inferred in a particular network. In synthetic data, DISCERN identifies perturbed genes more accurately than existing methods for identifying perturbed genes between distinct states. In expression datasets from patients with acute myeloid leukemia (AML), breast cancer and lung cancer, genes with high DISCERN scores in each cancer are enriched for known tumor drivers, genes associated with the biological processes known to be important in the disease, and genes associated with patient prognosis, in the respective cancer. Finally, we show that DISCERN can uncover potential mechanisms underlying network perturbation by explaining observed epigenomic activity patterns in cancer and normal tissue types more accurately than alternative methods, based on the available epigenomic data from the ENCODE project.
Subjects
Gene Regulatory Networks , Models, Genetic , Neoplasms/genetics , Breast Neoplasms/genetics , Computational Biology , Computer Simulation , Databases, Genetic , Epigenesis, Genetic , Female , Gene Expression Regulation, Neoplastic , Humans , Leukemia, Myeloid, Acute/genetics , Likelihood Functions , Lung Neoplasms/genetics , Prognosis
ABSTRACT
We define a new category of candidate tumor drivers in cancer genome evolution: 'selected expression regulators' (SERs), genes driving dysregulated transcriptional programs in cancer evolution. The SERs are identified from genome-wide tumor expression data with a novel method, namely SPARROW (SPARse selected expRessiOn regulators identified With penalized regression). SPARROW uncovers a previously unknown connection between cancer expression variation and driver events, by using a novel sparse regression technique. Our results indicate that SPARROW is a powerful complementary approach to identify candidate genes containing driver events that are hard to detect from sequence data, due to a large number of passenger mutations and lack of comprehensive sequence information from a sufficiently large number of samples. SERs identified by SPARROW reveal known driver mutations in multiple human cancers, along with known cancer-associated processes and survival-associated genes, better than popular methods for inferring gene expression networks. We demonstrate that when applied to acute myeloid leukemia expression data, SPARROW identifies an apoptotic biomarker (PYCARD) for the investigational drug obatoclax. The PYCARD and obatoclax association is validated in 30 AML patient samples.
Subjects
Brain Neoplasms/genetics , Gene Expression Profiling , Glioblastoma/genetics , Leukemia, Myeloid, Acute/genetics , Gene Regulatory Networks , Humans , Mutation
ABSTRACT
Recently, many statistical methods have been proposed to test for associations between rare genetic variants and complex traits. Most of these methods test for association by aggregating genetic variations within a predefined region, such as a gene. Although there is evidence that "aggregate" tests are more powerful than the single marker test, these tests generally ignore neutral variants and therefore are unable to identify specific variants driving the association with phenotype. We propose a novel aggregate rare-variant test that explicitly models a fraction of variants as neutral, tests associations at the gene level, and infers the rare variants driving the association. Simulations show that in the practical scenario where there are many variants within a given region of the genome, with only a fraction causal, our approach has greater power compared to other popular tests such as the Sequence Kernel Association Test (SKAT), the Weighted Sum Statistic (WSS), and the collapsing method of Morris and Zeggini (MZ). Our algorithm leverages a fast variational Bayes approximate inference methodology to scale to exome-wide analyses, a significant computational advantage over exact inference model selection methodologies. To demonstrate the efficacy of our methodology, we test for associations between von Willebrand Factor (VWF) levels and rare missense variants within the VWF gene, imputed from the National Heart, Lung, and Blood Institute's Exome Sequencing Project into 2,487 African Americans. Our method suggests that a relatively small fraction (~10%) of the imputed rare missense variants within VWF are strongly associated with lower VWF levels in African Americans.
Subjects
Bayes Theorem , Genetic Association Studies/methods , Genetic Variation/genetics , von Willebrand Factor/genetics , Black or African American/genetics , Algorithms , Exome/genetics , Female , Humans , Male , Models, Genetic , Mutation, Missense/genetics , National Heart, Lung, and Blood Institute (U.S.) , Phenotype , Research Design , Sequence Analysis, DNA , Software , United States , von Willebrand Factor/analysis
ABSTRACT
Researchers have successfully applied exome sequencing to discover causal variants in selected individuals with familial, highly penetrant disorders. We demonstrate the utility of exome sequencing followed by imputation for discovering low-frequency variants associated with complex quantitative traits. We performed exome sequencing in a reference panel of 761 African Americans and then imputed newly discovered variants into a larger sample of more than 13,000 African Americans for association testing with the blood cell traits hemoglobin, hematocrit, white blood count, and platelet count. First, we illustrate the feasibility of our approach by demonstrating genome-wide-significant associations for variants that are not covered by conventional genotyping arrays; for example, one such association is that between higher platelet count and a c.117G>T variant of MPL, the thrombopoietin receptor gene, encoding a p.Lys39Asn amino acid substitution (p = 1.5 × 10^-11). Second, we identified an association between missense variants of LCT and higher white blood count (p = 4 × 10^-13). Third, we identified low-frequency coding variants that might account for allelic heterogeneity at several known blood cell-associated loci: MPL c.754T>C (p.Tyr252His) was associated with higher platelet count; CD36 c.975T>G (p.Tyr325*) was associated with lower platelet count; and several missense variants at the α-globin gene locus were associated with lower hemoglobin. By identifying low-frequency missense variants associated with blood cell traits not previously reported by genome-wide association studies, we establish that exome sequencing followed by imputation is a powerful approach to dissecting complex, genetically heterogeneous traits in large population-based studies.
Subjects
Black or African American/genetics , Blood Cells/metabolism , Exome , Quantitative Trait Loci , Quantitative Trait, Heritable , Adult , Aged , Female , Gene Frequency , Genome-Wide Association Study , Hematocrit , Hematologic Diseases/genetics , Hemoglobins/genetics , Humans , Leukocytes/metabolism , Male , Middle Aged , Platelet Count , Polymorphism, Single Nucleotide , United States , Young Adult
ABSTRACT
Several genetic variants associated with platelet count and mean platelet volume (MPV) were recently reported in people of European ancestry. In this meta-analysis of 7 genome-wide association studies (GWAS) enrolling African Americans, our aim was to identify novel genetic variants associated with platelet count and MPV. For all cohorts, GWAS analysis was performed using additive models after adjusting for age, sex, and population stratification. For both platelet phenotypes, meta-analyses were conducted using inverse-variance weighted fixed-effect models. Platelet aggregation assays in whole blood were performed in the participants of the GeneSTAR cohort. Genetic variants in ten independent regions were associated with platelet count (N = 16,388) with p < 5×10^-8, of which 5 have not been associated with platelet count in previous GWAS. The novel genetic variants associated with platelet count were in the following regions (the most significant SNP, closest gene, and p-value): 6p22 (rs12526480, LRRC16A, p = 9.1×10^-9), 7q11 (rs13236689, CD36, p = 2.8×10^-9), 10q21 (rs7896518, JMJD1C, p = 2.3×10^-12), 11q13 (rs477895, BAD, p = 4.9×10^-8), and 20q13 (rs151361, SLMO2, p = 9.4×10^-9). Three of these loci (10q21, 11q13, and 20q13) were replicated in European Americans (N = 14,909) and one (11q13) in Hispanic Americans (N = 3,462). For MPV (N = 4,531), genetic variants in 3 regions were significant at p < 5×10^-8, two of which were also associated with platelet count. Previously reported regions that were also significant in this study were 6p21, 6q23, 7q22, 12q24, and 19p13 for platelet count and 7q22, 17q11, and 19p13 for MPV. The most significant SNP in 1 region was also associated with ADP-induced maximal platelet aggregation in whole blood (12q24). Thus, through a meta-analysis of GWAS enrolling African Americans, we have identified 5 novel regions associated with platelet count, of which 3 were replicated in other ethnic groups. In addition, we found one region associated with platelet aggregation that may play a role in atherothrombosis.
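The inverse-variance weighted fixed-effect combination named in this abstract can be sketched for a single SNP as follows. This is a minimal illustration rather than the study's actual pipeline; the function name and the example inputs are hypothetical.

```python
import math

def fixed_effect_meta(betas, ses):
    """Inverse-variance weighted fixed-effect meta-analysis for one SNP.

    betas: per-cohort effect estimates.
    ses:   per-cohort standard errors (same order as betas).
    Returns (pooled_beta, pooled_se, z, two_sided_p).
    """
    if not betas or len(betas) != len(ses):
        raise ValueError("betas and ses must be non-empty and of equal length")
    # Each cohort is weighted by the inverse of its sampling variance.
    weights = [1.0 / se ** 2 for se in ses]
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    z = pooled_beta / pooled_se
    # Two-sided p-value under a standard normal null: 2 * (1 - Phi(|z|)).
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return pooled_beta, pooled_se, z, p

# Hypothetical effect estimates from two cohorts with equal precision.
pooled_beta, pooled_se, z, p = fixed_effect_meta([0.1, 0.3], [0.1, 0.1])
```

Because the weights are inverse variances, larger cohorts with smaller standard errors dominate the pooled estimate, and the pooled standard error shrinks as cohorts are added.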
Subjects
Black or African American/genetics , Blood Platelets , Genome-Wide Association Study , Platelet Count , Adult , Aged , Blood Platelets/metabolism , Female , Genotype , Humans , Male , Middle Aged , Platelet Aggregation/genetics , Polymorphism, Single Nucleotide
ABSTRACT
Penalized Multiple Regression (PMR) can be used to discover novel disease associations in GWAS datasets. In practice, proposed PMR methods have not been able to identify well-supported associations in GWAS that are undetectable by standard association tests, and thus these methods are not widely applied. Here, we present a combined algorithmic and heuristic framework for PUMA (Penalized Unified Multiple-locus Association) analysis that solves the problems of previously proposed methods, including computational speed, poor performance on genome-scale simulated data, and identification of too many associations for real data to be biologically plausible. The framework includes a new minorize-maximization (MM) algorithm for generalized linear models (GLM) combined with heuristic model selection and testing methods for identification of robust associations. The PUMA framework implements the penalized maximum likelihood penalties previously proposed for GWAS analysis (i.e. Lasso, Adaptive Lasso, NEG, MCP), as well as a penalty that has not been previously applied to GWAS (i.e. LOG). Using simulations that closely mirror real GWAS data, we show that our framework has high performance and reliably increases power to detect weak associations, while existing PMR methods can perform worse than single marker testing in overall performance. To demonstrate the empirical value of PUMA, we analyzed GWAS data for type 1 diabetes, Crohn's disease, and rheumatoid arthritis, three autoimmune diseases from the original Wellcome Trust Case Control Consortium.
Our analysis replicates known associations for these diseases and we discover novel etiologically relevant susceptibility loci that are invisible to standard single marker tests, including six novel associations implicating genes involved in pancreatic function, insulin pathways and immune-cell function in type 1 diabetes; three novel associations implicating genes in pro- and anti-inflammatory pathways in Crohn's disease; and one novel association implicating a gene involved in apoptosis pathways in rheumatoid arthritis. We provide software for applying our PUMA analysis framework.
Subjects
Genome-Wide Association Study , Models, Theoretical , Regression Analysis , Humans
ABSTRACT
Breast cancer is the most common malignancy in women and is responsible for hundreds of thousands of deaths annually. As with most cancers, it is a heterogeneous disease and different breast cancer subtypes are treated differently. Understanding the difference in prognosis for breast cancer based on its molecular and phenotypic features is one avenue for improving treatment by matching the proper treatment with molecular subtypes of the disease. In this work, we employed a competition-based approach to modeling breast cancer prognosis using large datasets containing genomic and clinical information and an online real-time leaderboard program used to speed feedback to the modeling team and to encourage each modeler to work towards achieving a higher ranked submission. We find that machine learning methods combined with molecular features selected based on expert prior knowledge can improve survival predictions compared to current best-in-class methodologies and that ensemble models trained across multiple user submissions systematically outperform individual models within the ensemble. We also find that model scores are highly consistent across multiple independent evaluations. This study serves as the pilot phase of a much larger competition open to the whole research community, with the goal of understanding general strategies for model optimization using clinical and molecular profiling data and providing an objective, transparent system for assessing prognostic models.
Subjects
Breast Neoplasms , Computational Biology/methods , Models, Biological , Models, Statistical , Survival Analysis , Algorithms , Cluster Analysis , Databases, Factual , Female , Gene Expression Profiling , Humans , Prognosis
ABSTRACT
INTRODUCTION: Alzheimer's disease (AD) is the predominant dementia globally, with heterogeneous presentation and penetrance of clinical symptoms, variable presence of mixed pathologies, potential disease subtypes, and numerous associated endophenotypes. Beyond the difficulty of designing treatments that address the core pathological characteristics of the disease, therapeutic development is challenged by the uncertainty of which endophenotypic areas and specific targets implicated by those endophenotypes to prioritize for further translational research. However, publicly funded consortia driving large-scale open science efforts have produced multiple omic analyses that address both disease risk relevance and biological process involvement of genes across the genome. METHODS: Here we report the development of an informatic pipeline that draws from genetic association studies, predicted variant impact, and linkage with dementia associated phenotypes to create a genetic risk score. This is paired with a multi-omic risk score utilizing extensive sets of both transcriptomic and proteomic studies to identify system-level changes in expression associated with AD. These two elements combined constitute our target risk score that ranks AD risk genome-wide. The ranked genes are organized into endophenotypic space through the development of 19 biological domains associated with AD in the described genetics and genomics studies and accompanying literature. The biological domains are constructed from exhaustive Gene Ontology (GO) term compilations, allowing automated assignment of genes into objectively defined disease-associated biology. This rank-and-organize approach, performed genome-wide, allows the characterization of aggregations of AD risk across biological domains. 
RESULTS: The top AD-risk-associated biological domains are Synapse, Immune Response, Lipid Metabolism, Mitochondrial Metabolism, Structural Stabilization, and Proteostasis, with slightly lower levels of risk enrichment present within the other 13 biological domains. DISCUSSION: This provides an objective methodology to localize risk within specific biological endophenotypes and drill down into the most significantly associated sets of GO terms and annotated genes for potential therapeutic targets.
ABSTRACT
MOTIVATION: For many complex traits, including height, the majority of variants identified by genome-wide association studies (GWAS) have small effects, leaving a significant proportion of the heritable variation unexplained. Although many penalized multiple regression methodologies have been proposed to increase the power to detect associations for complex genetic architectures, they generally lack mechanisms for false-positive control and diagnostics for model over-fitting. Our methodology is the first penalized multiple regression approach that explicitly controls Type I error rates and provides model over-fitting diagnostics through a novel normally distributed statistic defined for every marker within the GWAS, based on results from a variational Bayes spike regression algorithm. RESULTS: We compare the performance of our method to the lasso and single marker analysis on simulated data and demonstrate that our approach has superior performance in terms of power and Type I error control. In addition, using the Women's Health Initiative (WHI) SNP Health Association Resource (SHARe) GWAS of African-Americans, we show that our method has power to detect additional novel associations with body height. These findings replicate by reaching a stringent cutoff of marginal association in a larger cohort. AVAILABILITY: An R-package, including an implementation of our variational Bayes spike regression (vBsr) algorithm, is available at http://kooperberg.fhcrc.org/soft.html.
Subjects
Body Height/genetics , Genome-Wide Association Study , Models, Statistical , Black or African American/genetics , Algorithms , Bayes Theorem , Genetic Loci , Humans , Polymorphism, Single Nucleotide , Regression Analysis
ABSTRACT
BACKGROUND: We propose a novel variational Bayes network reconstruction algorithm to extract the most relevant disease factors from high-throughput genomic datasets. Our algorithm is the only scalable method for regularized network recovery that employs Bayesian model averaging and that can internally estimate an appropriate level of sparsity to ensure few false positives enter the model without the need for cross-validation or a model selection criterion. We use our algorithm to characterize the effect of genetic markers and liver gene expression traits on mouse obesity-related phenotypes, including weight, cholesterol, glucose, and free fatty acid levels, in an experiment previously used for discovery and validation of network connections: an F2 intercross between the C57BL/6J and C3H/HeJ mouse strains, where apolipoprotein E is null on the background. RESULTS: We identified eleven genes, Gch1, Zfp69, Dlgap1, Gna14, Yy1, Gabarapl1, Folr2, Fdft1, Cnr2, Slc24a3, and Ccl19, and a quantitative trait locus directly connected to weight, glucose, cholesterol, or free fatty acid levels in our network. None of these genes were identified by other network analyses of this mouse intercross dataset, but all have been previously associated with obesity or related pathologies in independent studies. In addition, through both simulations and data analysis we demonstrate that our algorithm achieves better performance in terms of power and type I error control than other network recovery algorithms that use the lasso and have bounds on type I error control. CONCLUSIONS: Our final network contains 118 previously associated and novel genes affecting weight, cholesterol, glucose, and free fatty acid levels that are excellent obesity risk candidates.
Subjects
Algorithms , Bayes Theorem , Obesity/genetics , Obesity/metabolism , Animals , Apolipoproteins E/genetics , Computer Simulation , Humans , Mice , Mice, Inbred C3H , Mice, Inbred C57BL , Quantitative Trait Loci
ABSTRACT
In this article, the authors propose to simultaneously test for marginal genetic association and gene-environment interaction to discover single nucleotide polymorphisms that may be involved in gene-environment or gene-treatment interaction. The asymptotic independence of the marginal association estimator and various interaction estimators leads to a simple and flexible way of combining the 2 tests, allowing for exploitation of gene-environment independence in estimating gene-environment interaction. The proposed test differs from the 2-df test proposed by Kraft et al. (Hum Hered. 2007;63(2):111-119) in two respects. First, for the genetic association component, it tests for marginal association, which is often the primary objective in inference, rather than the main effect in a model with gene-environment interaction. Second, the gene-environment testing component can easily exploit putative gene-environment independence using either the case-only estimator or the empirical Bayes estimator, depending on whether the goal is gene-treatment interaction in a randomized trial or gene-environment interaction in an observational study. The use of the proposed joint test is illustrated through simulations and a genetic study (1993-2005) from the Women's Health Initiative.
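The abstract does not state how the two tests are combined, so the following is only a sketch of one simple option that asymptotic independence makes possible: summing the two squared Z statistics and referring the total to a chi-square distribution with 2 degrees of freedom. The function name and example values are hypothetical.

```python
import math

def joint_two_df_test(z_marginal, z_interaction):
    """Combine two asymptotically independent Z statistics into one 2-df test.

    z_marginal:    Z statistic for the marginal genetic association.
    z_interaction: Z statistic for the gene-environment interaction
                   (for example from a case-only or empirical Bayes estimator).
    Returns (chi_square_statistic, p_value).
    """
    chi2 = z_marginal ** 2 + z_interaction ** 2
    # The chi-square survival function with 2 degrees of freedom is exp(-x / 2).
    p_value = math.exp(-chi2 / 2.0)
    return chi2, p_value

# Hypothetical statistics for a single SNP.
chi2, p_value = joint_two_df_test(3.0, 4.0)
```

The sum is only chi-square distributed when the two components are independent under the null, which is why the asymptotic independence described in the abstract matters for this construction.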
Subjects
Gene-Environment Interaction , Genetic Predisposition to Disease/epidemiology , Models, Genetic , Genome-Wide Association Study , Humans , Models, Statistical
ABSTRACT
Genetics play an important role in late-onset Alzheimer's Disease (AD) etiology and dozens of genetic variants have been implicated in AD risk through large-scale GWAS meta-analyses. However, the precise mechanistic effects of most of these variants have yet to be determined. Deeply phenotyped cohort data can reveal physiological changes associated with genetic risk for AD across an age spectrum that may provide clues to the biology of the disease. We utilized over 2000 high-quality quantitative measurements obtained from blood of 2831 cognitively normal adult clients of a consumer-based scientific wellness company, each with CLIA-certified whole-genome sequencing data. Measurements included: clinical laboratory blood tests, targeted chip-based proteomics, and metabolomics. We performed a phenome-wide association study utilizing this diverse blood marker data and 25 known AD genetic variants and an AD-specific polygenic risk score (PGRS), adjusting for sex, age, vendor (for clinical labs), and the first four genetic principal components; sex-SNP interactions were also assessed. We observed statistically significant SNP-analyte associations for five genetic variants after correction for multiple testing (for SNPs in or near NYAP1, ABCA7, INPP5D, and APOE), with effects detectable from early adulthood. The ABCA7 SNP and the APOE2 and APOE4 encoding alleles were associated with lipid variability, as seen in previous studies; in addition, six novel proteins were associated with the e2 allele. The most statistically significant finding was between the NYAP1 variant and PILRA and PILRB protein levels, supporting previous functional genomic studies in the identification of a putative causal variant within the PILRA gene. We did not observe associations between the PGRS and any analyte. Sex modified the effects of four genetic variants, with multiple interrelated immune-modulating effects associated with the PICALM variant. 
In post-hoc analysis, sex-stratified GWAS results from an independent AD case-control meta-analysis supported sex-specific disease effects of the PICALM variant, highlighting the importance of sex as a biological variable. Known AD genetic variation influenced lipid metabolism and immune response systems in a population of non-AD individuals, with associations observed from early adulthood onward. Further research is needed to determine whether and how these effects are implicated in early-stage biological pathways to AD. These analyses aim to complement ongoing work on the functional interpretation of AD-associated genetic variants.
Subjects
Alzheimer Disease , ATP-Binding Cassette Transporters/genetics , Adult , Alzheimer Disease/genetics , Apolipoprotein E2/genetics , Female , Genetic Predisposition to Disease , Genome-Wide Association Study , Genomics , Humans , Male , Polymorphism, Single Nucleotide
ABSTRACT
[This corrects the article DOI: 10.3389/fnagi.2021.735524.].
ABSTRACT
Cellular gene expression measurements contain regulatory information that can be used to discover novel network relationships. Here, we present a new algorithm for network reconstruction powered by the adaptive lasso, a theoretically and empirically well-behaved method for selecting the regulatory features of a network. Any algorithms designed for network discovery that make use of directed probabilistic graphs require perturbations, produced by either experiments or naturally occurring genetic variation, to successfully infer unique regulatory relationships from gene expression data. Our approach makes use of appropriately selected cis-expression Quantitative Trait Loci (cis-eQTL), which provide a sufficient set of independent perturbations for maximum network resolution. We compare the performance of our network reconstruction algorithm to four other approaches: the PC-algorithm, QTLnet, the QDG algorithm, and the NEO algorithm, all of which have been used to reconstruct directed networks among phenotypes leveraging QTL. We show that the adaptive lasso can outperform these algorithms for networks of ten genes and ten cis-eQTL, and is competitive with the QDG algorithm for networks with thirty genes and thirty cis-eQTL, with rich topologies and hundreds of samples. Using this novel approach, we identify unique sets of directed relationships in Saccharomyces cerevisiae when analyzing genome-wide gene expression data for an intercross between a wild strain and a lab strain. We recover novel putative network relationships between a tyrosine biosynthesis gene (TYR1), and genes involved in endocytosis (RCY1), the spindle checkpoint (BUB2), sulfonate catabolism (JLP1), and cell-cell communication (PRM7). Our algorithm provides a synthesis of feature selection methods and graphical model theory that has the potential to reveal new directed regulatory relationships from the analysis of population level genetic and gene expression data.
Subjects
Algorithms , Gene Expression Regulation, Fungal/genetics , Gene Regulatory Networks/genetics , Models, Genetic , Systems Biology/methods , Gene Expression Regulation, Fungal/physiology , Gene Regulatory Networks/physiology , Genome, Fungal , Metabolic Networks and Pathways/genetics , Metabolic Networks and Pathways/physiology , Phenotype , Quantitative Trait Loci/genetics , Quantitative Trait Loci/physiology , Saccharomyces cerevisiae/genetics , Saccharomyces cerevisiae/physiology , Signal Transduction/genetics , Signal Transduction/physiology
ABSTRACT
INTRODUCTION: Genome-wide association studies (GWAS) for late onset Alzheimer's disease (AD) may miss genetic variants relevant for delineating disease stages when using clinically defined case/control as a phenotype due to its loose definition and heterogeneity. METHODS: We use a transfer learning technique to train three-dimensional convolutional neural network (CNN) models based on structural magnetic resonance imaging (MRI) from the screening stage in the Alzheimer's Disease Neuroimaging Initiative consortium to derive image features that reflect AD progression. RESULTS: CNN-derived image phenotypes are significantly associated with fasting metabolites related to early lipid metabolic changes as well as insulin resistance and with genetic variants mapped to candidate genes enriched for amyloid beta degradation, tau phosphorylation, calcium ion binding-dependent synaptic loss, APP-regulated inflammation response, and insulin resistance. DISCUSSION: This is the first attempt to show that non-invasive MRI biomarkers are linked to AD progression characteristics, reinforcing their use in early AD diagnosis and monitoring.
ABSTRACT
BACKGROUND: Alzheimer's disease (AD) is an incurable neurodegenerative disease currently affecting 1.75% of the US population, with projected growth to 3.46% by 2050. Identifying common genetic variants driving differences in transcript expression that confer AD risk is necessary to elucidate AD mechanisms and develop therapeutic interventions. We modify the FUSION transcriptome-wide association study (TWAS) pipeline to ingest gene expression values from multiple neocortical regions. METHODS: A combined dataset of 2003 genotypes clustered to 1000 Genomes individuals from Utah with Northern and Western European ancestry (CEU) was used to construct a training set of 790 genotypes paired to 888 RNASeq profiles from temporal cortex (TCX = 248), prefrontal cortex (FP = 50), inferior frontal gyrus (IFG = 41), superior temporal gyrus (STG = 34), parahippocampal cortex (PHG = 34), and dorsolateral prefrontal cortex (DLPFC = 461). Following within-tissue normalization and covariate adjustment, predictive weights to impute expression components based on a gene's surrounding cis-variants were trained. The FUSION pipeline was modified to support input of pre-scaled expression values and to support cross-validation with a repeated-measures design arising from the presence of multiple transcriptome samples from the same individual across different tissues. RESULTS: Cis-variant architecture alone was informative to train weights and impute expression for 6780 (49.67%) autosomal genes, the majority of which significantly correlated with gene expression; FDR < 5%: N = 6775 (99.92%), Bonferroni: N = 6716 (99.06%). Validation of weights in 515 matched genotype-to-RNASeq profiles from the CommonMind Consortium (CMC) was 72.14% in DLPFC profiles. Association of imputed expression components from all 2003 genotype profiles yielded 8 genes significantly associated with AD (FDR < 0.05): APOC1, EED, CD2AP, CEACAM19, CLPTM1, MTCH2, TREM2, and KNOP1.
CONCLUSIONS: We provide evidence of cis-genetic variation conferring AD risk through 8 genes across six distinct genomic loci. Moreover, we provide expression weights for 6780 genes as a valuable resource to the community, which can be abstracted across the neocortex and a wide range of neuronal phenotypes.
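The two multiple-testing thresholds reported in the results (FDR < 5% and Bonferroni) differ in how strictly they treat a list of p-values. The following is a minimal sketch of the standard Bonferroni and Benjamini-Hochberg procedures, assuming the FDR figure refers to a Benjamini-Hochberg style correction, which the abstract does not state. The p-values are made up for illustration and are not taken from the study.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject a hypothesis only if its p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up FDR control: find the largest rank k with p_(k) <= alpha*k/m,
    then reject every hypothesis ranked k or better."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            max_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected

# Hypothetical per-gene p-values (illustration only).
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
n_bonferroni = sum(bonferroni(example_p))
n_fdr = sum(benjamini_hochberg(example_p))
```

On this toy list Bonferroni retains one test and Benjamini-Hochberg retains two, which mirrors the pattern in the abstract where the FDR criterion passes slightly more genes than the Bonferroni criterion.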
Subjects
Alzheimer Disease/genetics , Genetic Predisposition to Disease , Genome-Wide Association Study , Neocortex/metabolism , Quantitative Trait Loci , Transcriptome , Computational Biology/methods , Gene Expression Regulation , Genome-Wide Association Study/methods , Humans , Organ Specificity/genetics
ABSTRACT
Depression is a common condition, but current treatments are only effective in a subset of individuals. To identify new treatment targets, we integrated depression genome-wide association study (GWAS) results (N = 500,199) with human brain proteomes (N = 376) to perform a proteome-wide association study of depression followed by Mendelian randomization. We identified 19 genes that were consistent with being causal in depression, acting via their respective cis-regulated brain protein abundance. We replicated nine of these genes using an independent depression GWAS (N = 307,353) and another human brain proteomic dataset (N = 152). Eleven of the 19 genes also had cis-regulated mRNA levels that were associated with depression, based on integration of the depression GWAS with human brain transcriptomes (N = 888). Meta-analysis of the discovery and replication proteome-wide association study analyses identified 25 brain proteins consistent with being causal in depression, 20 of which were not previously implicated in depression by GWAS. Together, these findings provide promising brain protein targets for further mechanistic and therapeutic studies.
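The Mendelian randomization step mentioned above can be illustrated with the simplest single-instrument estimator, the Wald ratio. This is a sketch of the general idea only and is not necessarily the estimator used in the study. The effect sizes below are hypothetical, and the standard error uses the common first-order approximation that ignores uncertainty in the variant-protein effect.

```python
import math

def wald_ratio(beta_exposure, beta_outcome, se_outcome):
    """Single-variant causal estimate: effect of the variant on the outcome
    divided by its effect on the exposure (here, protein abundance)."""
    estimate = beta_outcome / beta_exposure
    std_error = se_outcome / abs(beta_exposure)  # first-order approximation
    z_score = estimate / std_error
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z_score) / math.sqrt(2.0))
    return estimate, std_error, z_score, p_value

# Hypothetical summary statistics for one cis variant (illustration only):
# the variant shifts protein abundance by 0.5 SD and depression risk by 0.02.
estimate, std_error, z_score, p_value = wald_ratio(
    beta_exposure=0.5, beta_outcome=0.02, se_outcome=0.005
)
```

Read this way, a protein is "consistent with being causal" when the variant's effect on the trait is proportionate to its effect on protein abundance and the resulting ratio differs significantly from zero.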
Subjects
Brain , Depression/genetics , Genetic Predisposition to Disease/genetics , Genome-Wide Association Study/methods , Proteome/genetics , Proteomics/methods , Brain/pathology , Databases, Genetic , Depression/pathology , Humans
ABSTRACT
RNA editing is a feature of RNA maturation resulting in the formation of transcripts whose sequence differs from the genome template. Brain RNA editing may be altered in Alzheimer's disease (AD). Here, we analyzed data from 1,865 brain samples covering 9 brain regions from 1,074 unrelated subjects on a transcriptome-wide scale to identify inter-regional differences in RNA editing. We expand the list of known brain editing events by identifying 58,761 previously unreported events. We note that only a small proportion of these editing events are found at the protein level in our proteome-wide validation effort. We also identified editing events associated with AD dementia, neuropathological measures, and longitudinal cognitive decline in SYT11, MCUR1, SOD2, ORAI2, HSDL2, PFKP, and GPRC5B. Thus, we present an extended reference set of brain RNA editing events, identify a subset that is expressed at the protein level, and extend the narrative of transcriptomic perturbation in AD to RNA editing.