ABSTRACT
Methods of estimating polygenic scores (PGSs) from genome-wide association studies are increasingly used. However, independent method evaluation is lacking, and method comparisons are often limited. Here, we evaluate polygenic scores derived via seven methods in five biobank studies (totaling about 1.2 million participants) across 16 diseases and quantitative traits, building on a reference-standardized framework. We conducted meta-analyses to quantify the effects of method choice, hyperparameter tuning, method ensembling, and the target biobank on PGS performance. We found that no single method consistently outperformed all others. PGS effect sizes were more variable between biobanks than between methods within biobanks when methods were well tuned. Differences between methods were largest for the two investigated autoimmune diseases, seropositive rheumatoid arthritis and type 1 diabetes. For most methods, cross-validation was more reliable for tuning hyperparameters than automatic tuning (without the use of target data). For a given target phenotype, elastic net models combining PGS across methods (ensemble PGS) tuned in the UK Biobank provided consistent, high, and cross-biobank transferable performance, increasing PGS effect sizes (β coefficients) by a median of 5.0% relative to LDpred2 and MegaPRS (the two best-performing single methods when tuned with cross-validation). Our interactively browsable online results and the open-source workflow prspipe provide a rich resource and reference for the analysis of polygenic scoring methods across biobanks.
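The ensemble idea described above can be illustrated with a minimal sketch: an elastic net fitted by coordinate descent on standardized per-method scores, whose weights then define the ensemble PGS. This is not the prspipe implementation; the function names, the penalty parameterization (chosen here to mirror the common `alpha`/`l1_ratio` convention), and the toy numbers are all assumptions made for illustration.

```python
from statistics import mean, pstdev

def standardize(values):
    """Scale one method's PGS to mean 0 and (population) SD 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def soft_threshold(x, t):
    """Soft-thresholding operator used by the L1 part of the penalty."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0

def elastic_net(columns, y, alpha=0.01, l1_ratio=0.5, n_iter=200):
    """Coordinate descent for
    (1/2n)||y - Xw||^2 + alpha*l1_ratio*||w||_1 + 0.5*alpha*(1-l1_ratio)*||w||^2.
    `columns` holds one list of scores per PGS method (hypothetical layout)."""
    n = len(y)
    w = [0.0] * len(columns)
    residual = list(y)  # equals y - Xw while w is all zeros
    for _ in range(n_iter):
        for j, col in enumerate(columns):
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(x * (r + x * w[j]) for x, r in zip(col, residual)) / n
            z = sum(x * x for x in col) / n
            new_w = soft_threshold(rho, alpha * l1_ratio) / (z + alpha * (1 - l1_ratio))
            delta = new_w - w[j]
            residual = [r - x * delta for x, r in zip(col, residual)]
            w[j] = new_w
    return w

def ensemble_score(columns, weights):
    """Ensemble PGS per individual: weighted sum of the per-method scores."""
    return [sum(w * col[i] for w, col in zip(weights, columns))
            for i in range(len(columns[0]))]

# Toy example: two made-up, already standardized "method" scores and a phenotype.
method_a = [1.0, -1.0, 1.0, -1.0]
method_b = [1.0, 1.0, -1.0, -1.0]
phenotype = [2.0, -2.0, 2.0, -2.0]
weights = elastic_net([method_a, method_b], phenotype, alpha=1.0, l1_ratio=0.5)
combined = ensemble_score([method_a, method_b], weights)
```

In practice the penalty strength would be chosen by cross-validation in a tuning cohort, and the fitted weights would then be applied unchanged in other biobanks.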
Subjects
Biological Specimen Banks , Genome-Wide Association Study , Multifactorial Inheritance , Humans , Multifactorial Inheritance/genetics , Phenotype , Type 1 Diabetes Mellitus/genetics , Single Nucleotide Polymorphism , Machine Learning
ABSTRACT
With the development of high-throughput technologies, genomics datasets, including functional genomics data, are growing rapidly in size. This has allowed the training of large Deep Learning (DL) models to predict epigenetic readouts, such as protein binding or histone modifications, from genome sequences. However, large dataset sizes come at the price of data consistency, because they often aggregate results from many studies conducted under varying experimental conditions. While data from large-scale consortia are useful as they allow studying the effects of different biological conditions, they can also contain unwanted biases from confounding experimental factors. Here, we introduce Metadata-guided Feature Disentanglement (MFD), an approach that allows disentangling biologically relevant features from potential technical biases. MFD incorporates target metadata into model training by conditioning the weights of the model output layer on different experimental factors. It then separates the factors into disjoint groups and enforces independence of the corresponding feature subspaces with an adversarially learned penalty. We show that the metadata-driven disentanglement approach allows for better model introspection by connecting latent features to experimental factors, without compromising, and in some cases even improving, performance in downstream tasks such as enhancer prediction or genetic variant discovery. The code will be made available at https://github.com/HealthML/MFD.
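One way to picture the metadata conditioning described above is the sketch below: the latent features are split into disjoint groups, and each target's output weights for a group are looked up from that target's metadata value for the matching factor. This is only a schematic reading of the abstract, not the MFD code; the group names, factor values, weight tables, and function name are hypothetical, and the adversarial independence penalty is omitted.

```python
def predict_logit(features, target_meta, weight_tables, groups):
    """Logit for one prediction target whose output-layer weights are
    selected by its metadata (all structures here are illustrative)."""
    logit = 0.0
    for group, feature_idx in groups.items():
        factor_value = target_meta[group]            # e.g. tissue or lab of the target
        weights = weight_tables[group][factor_value]  # weights tied to that factor value
        logit += sum(w * features[i] for w, i in zip(weights, feature_idx))
    return logit

# Disjoint feature subspaces: indices 0-1 "biological", index 2 "technical".
groups = {"biological": [0, 1], "technical": [2]}
weight_tables = {
    "biological": {"liver": [1.0, -1.0], "heart": [0.5, 0.5]},
    "technical": {"lab_A": [2.0], "lab_B": [0.0]},
}
features = [1.0, 2.0, 3.0]  # made-up latent features for one input sequence

same_tissue_lab_a = predict_logit(
    features, {"biological": "liver", "technical": "lab_A"}, weight_tables, groups)
same_tissue_lab_b = predict_logit(
    features, {"biological": "liver", "technical": "lab_B"}, weight_tables, groups)
```

Because each feature group is tied to one kind of experimental factor, comparing the two calls isolates how much of a prediction is attributable to the technical factor alone.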
Subjects
Genomics , Metadata , Genomics/methods , Deep Learning , Humans
ABSTRACT
Small RNAs (sRNAs) are known to regulate pathogenic plant-microbe interactions. Emerging evidence from the study of these model systems suggests that microRNAs (miRNAs) can be translocated between microbes and plants to facilitate symbiosis. The roles of sRNAs in mutualistic mycorrhizal fungal interactions, however, are largely unknown. In this study, we characterized miRNAs encoded by the ectomycorrhizal fungus Pisolithus microcarpus and investigated their expression during mutualistic interaction with Eucalyptus grandis. Using sRNA sequencing data and in situ miRNA detection, a novel fungal miRNA, Pmic_miR-8, was found to be transported into E. grandis roots after interaction with P. microcarpus. Further characterization experiments demonstrate that inhibition of Pmic_miR-8 negatively impacts the maintenance of mycorrhizal roots in E. grandis, while supplementation of Pmic_miR-8 led to deeper integration of the fungus into plant tissues. Target prediction and experimental testing suggest that Pmic_miR-8 may target host transcripts containing the NB-ARC domain, suggesting a potential role for this miRNA in subverting host signaling to stabilize the symbiotic interaction. Altogether, we provide evidence of previously undescribed cross-kingdom sRNA transfer from ectomycorrhizal fungi to plant roots, shedding light on the involvement of miRNAs during the developmental process of mutualistic symbioses.
Subjects
Basidiomycota/genetics , Gene Silencing , MicroRNAs/metabolism , Mycorrhizae/genetics , Symbiosis/genetics , Base Sequence , Basidiomycota/growth & development , Microbial Colony Count , Gene Expression Profiling , Fungal Gene Expression Regulation , Fungal Genome , MicroRNAs/genetics , Plant Roots/microbiology , Messenger RNA/genetics , Messenger RNA/metabolism
ABSTRACT
MOTIVATION: Existing methods for simulating synthetic genotype and phenotype datasets have limited scalability, constraining their usability for large-scale analyses. Moreover, a systematic approach for evaluating synthetic data quality and a benchmark synthetic dataset for developing and evaluating methods for polygenic risk scores are lacking. RESULTS: We present HAPNEST, a novel approach for efficiently generating diverse individual-level genotypic and phenotypic data. In comparison to alternative methods, HAPNEST shows faster computational speed and a lower degree of relatedness with reference panels, while generating datasets that preserve key statistical properties of real data. These desirable synthetic data properties enabled us to generate 6.8 million common variants and nine phenotypes with varying degrees of heritability and polygenicity across 1 million individuals. We demonstrate how HAPNEST can facilitate biobank-scale analyses through the comparison of seven polygenic risk scoring methods across multiple ancestry groups and different genetic architectures. AVAILABILITY AND IMPLEMENTATION: A synthetic dataset of 1 008 000 individuals and nine traits for 6.8 million common variants is available at https://www.ebi.ac.uk/biostudies/studies/S-BSST936. The HAPNEST software for generating synthetic datasets is available as Docker/Singularity containers and open source Julia and C code at https://github.com/intervene-EU-H2020/synthetic_data.
Subjects
Benchmarking , Data Accuracy , Humans , Genotype , Phenotype , Multifactorial Inheritance
ABSTRACT
Filamentous fungi, such as Neurospora crassa, are very efficient in deconstructing plant biomass by the secretion of an arsenal of plant cell wall-degrading enzymes, by remodeling metabolism to accommodate production of secreted enzymes, and by enabling transport and intracellular utilization of plant biomass components. Although a number of enzymes and transcriptional regulators involved in plant biomass utilization have been identified, how filamentous fungi sense and integrate nutritional information encoded in the plant cell wall into a regulatory hierarchy for optimal utilization of complex carbon sources is not understood. Here, we performed transcriptional profiling of N. crassa on 40 different carbon sources, including plant biomass, to provide data on how fungi sense simple to complex carbohydrates. From these data, we identified regulatory factors in N. crassa and characterized one (PDR-2) associated with pectin utilization and one with pectin/hemicellulose utilization (ARA-1). Using in vitro DNA affinity purification sequencing (DAP-seq), we identified direct targets of transcription factors involved in regulating genes encoding plant cell wall-degrading enzymes. In particular, our data clarified the role of the transcription factor VIB-1 in the regulation of genes encoding plant cell wall-degrading enzymes and nutrient scavenging and revealed a major role of the carbon catabolite repressor CRE-1 in regulating the expression of major facilitator transporter genes. These data contribute to a more complete understanding of cross talk between transcription factors and their target genes, which are involved in regulating nutrient sensing and plant biomass utilization on a global level.
Subjects
Cell Wall/metabolism , Fungal Proteins/metabolism , Neurospora crassa/genetics , Pectins/metabolism , Polysaccharides/metabolism , Transcription Factors/metabolism , Biofuels , Biomass , Catabolite Repression , Cell Wall/chemistry , Fungal Gene Expression Regulation , Metabolic Engineering/methods , Metabolic Networks and Pathways/genetics , Neurospora crassa/metabolism , RNA-Seq
ABSTRACT
Epigenomic mapping of enhancer-associated chromatin modifications facilitates the genome-wide discovery of tissue-specific enhancers in vivo. However, reliance on single chromatin marks leads to high rates of false-positive predictions. More sophisticated, integrative methods have been described, but commonly suffer from limited accessibility to the resulting predictions and reduced biological interpretability. Here we present the Limb-Enhancer Genie (LEG), a collection of highly accurate, genome-wide predictions of enhancers in the developing limb, available through a user-friendly online interface. We predict limb enhancers using a combination of >50 published limb-specific datasets and clusters of evolutionarily conserved transcription factor binding sites, taking advantage of the patterns observed at previously in vivo validated elements. By combining different statistical models, our approach outperforms current state-of-the-art methods and provides interpretable measures of feature importance. Our results indicate that including a previously unappreciated score that quantifies tissue-specific nuclease accessibility significantly improves prediction performance. We demonstrate the utility of our approach through in vivo validation of newly predicted elements. Moreover, we describe general features that can guide the type of datasets to include when predicting tissue-specific enhancers genome-wide, while providing an accessible resource to the general biological community and facilitating the functional interpretation of genetic studies of limb malformations.
Subjects
Genetic Enhancer Elements/genetics , Extremities/growth & development , Genomics/methods , Growth and Development/genetics , Software , Animals , Genome/genetics , Machine Learning , Mice
ABSTRACT
Polygenic scores (PGSs) offer the ability to predict genetic risk for complex diseases across the life course, a key benefit over short-term prediction models. To produce risk estimates relevant to clinical and public health decision-making, it is important to account for varying effects due to age and sex. Here, we develop a novel framework to derive country-, age-, and sex-specific estimates of cumulative incidence stratified by PGS for 18 high-burden diseases. We integrate PGS associations from seven studies in four countries (N = 1,197,129) with disease incidences from the Global Burden of Disease. PGS has a significant sex-specific effect for asthma, hip osteoarthritis, gout, coronary heart disease and type 2 diabetes (T2D), with all but T2D exhibiting a larger effect in men. PGS has a larger effect in younger individuals for 13 diseases, with effects decreasing linearly with age. We show for breast cancer that, relative to individuals in the bottom 20% of polygenic risk, the top 5% attain an absolute risk for screening eligibility 16.3 years earlier. Our framework increases the generalizability of results from biobank studies and the accuracy of absolute risk estimates by appropriately accounting for age- and sex-specific PGS effects. Our results highlight the potential of PGS as a screening tool that may assist in the early prevention of common diseases.
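As a rough illustration of PGS-stratified cumulative incidence, the sketch below scales age-specific baseline incidence rates by a hazard ratio for a PGS stratum and accumulates the hazard over age. It is a simplification, not the framework of the study: it assumes a constant hazard within each one-year age band, ignores competing mortality, and uses invented rates, hazard ratios, and function names.

```python
import math

def cumulative_incidence(rates_by_age, hazard_ratio):
    """Cumulative risk at the end of each one-year age band, given baseline
    incidence rates (per person-year) and a stratum-specific hazard ratio."""
    cumulative_hazard = 0.0
    risks = []
    for rate in rates_by_age:
        cumulative_hazard += rate * hazard_ratio
        risks.append(1.0 - math.exp(-cumulative_hazard))
    return risks

def first_age_reaching(risks, start_age, threshold):
    """Age by which the cumulative risk first reaches `threshold`, or None."""
    for offset, risk in enumerate(risks):
        if risk >= threshold:
            return start_age + offset + 1  # risk accrued by the end of that band
    return None

# Made-up baseline: 1% incidence per year from age 40 to 49.
baseline_rates = [0.01] * 10
average_risk = cumulative_incidence(baseline_rates, hazard_ratio=1.0)
high_pgs_risk = cumulative_incidence(baseline_rates, hazard_ratio=2.0)

age_average = first_age_reaching(average_risk, start_age=40, threshold=0.05)
age_high = first_age_reaching(high_pgs_risk, start_age=40, threshold=0.05)
```

Comparing the two ages shows how a higher-risk stratum reaches a fixed absolute-risk threshold earlier, which is the kind of comparison reported for screening eligibility.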
Subjects
Genetic Predisposition to Disease , Multifactorial Inheritance , Humans , Male , Female , Multifactorial Inheritance/genetics , Incidence , Middle Aged , Adult , Aged , Type 2 Diabetes Mellitus/genetics , Type 2 Diabetes Mellitus/epidemiology , Risk Factors , Risk Assessment/methods , Global Burden of Disease , Sex Factors , Age Factors
ABSTRACT
Understanding the noncoding part of the genome, which encodes gene regulation, is necessary to identify genetic mechanisms of disease and translate findings from genome-wide association studies into actionable results for treatments and personalized care. Here we provide an overview of the computational analysis of noncoding regions, starting from gene-regulatory mechanisms and their representation in data. Deep learning methods, when applied to these data, highlight important regulatory sequence elements and predict the functional effects of genetic variants. These and other algorithms are used to predict damaging sequence variants. Finally, we introduce rare-variant association tests that incorporate functional annotations and predictions in order to increase interpretability and statistical power.
Subjects
DNA , Genome-Wide Association Study , Genome , Algorithms , Gene Expression Regulation
ABSTRACT
Here we present an exome-wide rare genetic variant association study for 30 blood biomarkers in 191,971 individuals in the UK Biobank. We compare gene-based association tests for separate functional variant categories to increase interpretability and identify 193 significant gene-biomarker associations. Genes associated with biomarkers were ~4.5-fold enriched for conferring Mendelian disorders. In addition to performing weighted gene-based variant collapsing tests, we design and apply variant-category-specific kernel-based tests that integrate quantitative functional variant effect predictions for missense variants, splicing and the binding of RNA-binding proteins. For these tests, we present a computationally efficient combination of the likelihood-ratio and score tests that found 36% more associations than the score test alone while also controlling the type I error. Kernel-based tests identified 13% more associations than their gene-based collapsing counterparts and had advantages in the presence of gain-of-function missense variants. We introduce local collapsing by amino acid position for missense variants and use it to interpret associations and identify potential novel gain-of-function variants in PIEZO1. Our results show the benefits of investigating different functional mechanisms when performing rare-variant association tests, and demonstrate pervasive rare-variant contribution to biomarker variability.
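A weighted gene-based collapsing test of the kind mentioned above can be sketched as follows: rare-variant genotypes within a gene are collapsed into one weighted burden score per individual, and a one-degree-of-freedom score test checks its association with a quantitative biomarker. This is a textbook-style simplification without covariates, not the pipeline used in the study; the weights, genotypes, and function names are made up.

```python
import math

def burden_scores(genotypes, weights):
    """Collapse each individual's minor-allele counts into one weighted sum."""
    return [sum(w * g for w, g in zip(weights, row)) for row in genotypes]

def score_test(burden, phenotype):
    """Score test for a linear model with intercept only under the null.
    Returns the chi-square statistic (1 df) and its p-value.
    Assumes the burden and the phenotype both vary across individuals."""
    n = len(phenotype)
    y_mean = sum(phenotype) / n
    b_mean = sum(burden) / n
    resid = [y - y_mean for y in phenotype]
    sigma2 = sum(r * r for r in resid) / n  # residual variance under the null
    u = sum(r * (b - b_mean) for r, b in zip(resid, burden))
    var_u = sigma2 * sum((b - b_mean) ** 2 for b in burden)
    stat = u * u / var_u
    p_value = math.erfc(math.sqrt(stat / 2.0))  # upper tail of chi-square, 1 df
    return stat, p_value

# Toy data: four individuals, two rare variants with hypothetical weights.
genotypes = [[0, 0], [1, 0], [0, 1], [1, 1]]
variant_weights = [1.0, 2.0]  # e.g. derived from a functional effect prediction
biomarker = [0.0, 1.0, 2.0, 3.0]

burden = burden_scores(genotypes, variant_weights)
stat, p_value = score_test(burden, biomarker)
```

Local collapsing by amino acid position would follow the same pattern, with the sum taken only over variants that affect the same residue.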
Subjects
Exome , Missense Mutation , Exome/genetics , Genetic Association Studies , Genetic Markers , Humans , Ion Channels/genetics , Exome Sequencing
ABSTRACT
In recent years, numerous applications have demonstrated the potential of deep learning for an improved understanding of biological processes. However, most deep learning tools developed so far are designed to address a specific question on a fixed dataset and/or with a fixed model architecture. Here we present Janggu, a Python library that facilitates deep learning for genomics applications, aiming to ease data acquisition and model evaluation. Among its key features are special dataset objects, which form a unified and flexible data acquisition and pre-processing framework for genomics data that enables streamlining of future research applications through reusable components. Through a NumPy-like interface, these dataset objects are directly compatible with popular deep learning libraries, including Keras and PyTorch. Janggu offers the possibility to visualize predictions as genomic tracks or to export them to the bigWig format, and it provides utilities for Keras-based models. We illustrate the functionality of Janggu on several deep learning genomics applications. First, we evaluate different model topologies for the task of predicting binding sites for the transcription factor JunD. Second, we demonstrate the framework on published models for predicting chromatin effects. Third, we show that promoter usage measured by CAGE can be predicted using DNase hypersensitivity, histone modifications and DNA sequence features. We improve the performance of these models using a novel feature in Janggu that allows us to include high-order sequence features. We believe that Janggu will help to significantly reduce repetitive programming overhead for deep learning applications in genomics, and will enable computational biologists to rapidly assess biological hypotheses.
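To make the notion of high-order sequence features concrete, the sketch below one-hot encodes a DNA sequence at the level of overlapping k-mers instead of single nucleotides, so that order 2 yields 16 channels per position. It deliberately does not use the Janggu API; the function name and its parameters are assumptions made for this illustration.

```python
from itertools import product

def one_hot_kmers(sequence, order=1, alphabet="ACGT"):
    """One-hot encode overlapping k-mers (k = order) of a DNA sequence.
    Returns len(sequence) - order + 1 rows with len(alphabet)**order columns."""
    kmers = ["".join(p) for p in product(alphabet, repeat=order)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    encoded = []
    for start in range(len(sequence) - order + 1):
        row = [0] * len(kmers)
        kmer = sequence[start:start + order].upper()
        if kmer in index:  # k-mers with ambiguous bases such as N stay all-zero
            row[index[kmer]] = 1
        encoded.append(row)
    return encoded

mono = one_hot_kmers("ACGT", order=1)  # 4 positions x 4 channels
di = one_hot_kmers("ACGT", order=2)    # 3 positions x 16 channels
```

The resulting nested lists can be converted to an array and fed to a convolutional layer in the same way as an ordinary one-hot encoding, only with more input channels.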
Subjects
Deep Learning , Genomics/methods , Animals , Computational Biology , Electronic Data Processing , Humans
ABSTRACT
Tumor initiation is often linked to a loss of cellular identity. Transcriptional programs determining cellular identity are preserved by epigenetically acting chromatin factors. Although such regulators are among the most frequently mutated genes in cancer, it is not well understood how an abnormal epigenetic condition contributes to tumor onset. In this work, we investigated the gene signature of tumors caused by disruption of the Drosophila epigenetic regulator polyhomeotic (ph). In larval tissue, ph mutant cells show a shift towards an embryonic-like signature. Using loss- and gain-of-function experiments, we uncovered the embryonic transcription factor knirps (kni) as a new oncogene. The oncogenic potential of kni lies in its ability to activate JAK/STAT signaling and block differentiation. Conversely, tumor growth in ph mutant cells can be substantially reduced by overexpressing a differentiation factor. This demonstrates that epigenetically derailed tumor conditions can be reversed by targeting key players in the transcriptional network.