RESUMO
Transcriptome-wide association studies (TWAS) aim to uncover genotype-phenotype relationships through a two-stage procedure: predicting gene expression from genotypes using an expression quantitative trait locus (eQTL) data set, then testing the predicted expression for trait associations. Accurate gene expression prediction in stage 1 is crucial, as it directly impacts the power to identify associations in stage 2. Currently, the first stage of such studies is primarily conducted using linear models like elastic net regression, which fail to capture the nonlinear relationships inherent in biological systems. Deep learning methods have the potential to model such nonlinear effects, but have yet to demonstrably outperform linear methods at this task. To address this gap, we propose a new deep learning architecture to predict gene expression from genotypic variation across individuals. Our method utilizes a learnable input scaling layer in conjunction with a convolutional encoder to capture nonlinear effects and higher-order interactions without compromising on interpretability. We further augment this approach to allow for parameter sharing across multiple networks, enabling us to utilize prior information for individual variants in the form of functional annotations. Evaluations on real-world genomic data show that our method consistently outperforms elastic net regression across a large set of heritable genes. Furthermore, our model statistically significantly improved predictive performance by leveraging functional annotations, whereas elastic net regression failed to show equivalent gains when using the same information, suggesting that our method can capture nonlinear functional information beyond the capability of linear models.
RESUMO
Recently, a non-parametric method has been proposed to impute the genetic component of a trait for a large set of genotyped individuals based on a separate genome-wide association study (GWAS) summary dataset of the same trait (from the same population). The imputed trait may contain linear, non-linear and epistatic effects of genetic variants, thus can be used for downstream linear or non-linear association analyses and machine learning tasks. Here, we propose an extension of the method to impute both genetic and environmental components of a trait using both single nucleotide polymorphism (SNP)-trait and omics-trait association summary data. We illustrate an application to a UK Biobank subset of individuals (n ≈ 80K) with both body mass index (BMI) GWAS data and metabolomic data. We divided the whole dataset into two equally sized and non-overlapping training and test datasets; we used the training data to build SNP- and metabolite-BMI association summary data and impute BMI on the test data. We compared the performance of the original and new imputation methods. As by the original method, the imputed BMI values by the new method largely retained SNP-BMI association information; however, the latter retained more information about BMI-environment associations and were more highly correlated with the original observed BMI values.
Assuntos
Estudo de Associação Genômica Ampla , Polimorfismo de Nucleotídeo Único , Humanos , Estudo de Associação Genômica Ampla/métodos , Fenótipo , Genótipo , Polimorfismo de Nucleotídeo Único/genéticaRESUMO
Recently a nonparametric method called LS-imputation has been proposed for large-scale trait imputation based on a GWAS summary dataset and a large set of genotyped individuals. The imputed trait values, along with the genotypes, can be treated as an individual-level dataset for downstream genetic analyses, including those that cannot be done with GWAS summary data. However, since the covariance matrix of the imputed trait values is often too large to calculate, the current method imposes a working assumption that the imputed trait values are identically and independently distributed, which is incorrect in truth. Here we propose a "divide and conquer/combine" strategy to estimate and account for the covariance matrix of the imputed trait values via batches, thus relaxing the incorrect working assumption. Applications of the methods to the UK Biobank data for marginal association analysis showed some improvement by the new method in some cases, but overall the original method performed well, which was explained by nearly constant variances of and mostly weak correlations among imputed trait values.
Assuntos
Estudo de Associação Genômica Ampla , Polimorfismo de Nucleotídeo Único , Humanos , Estudo de Associação Genômica Ampla/métodos , Fenótipo , GenótipoRESUMO
The expansive collection of genetic and phenotypic data within biobanks offers an unprecedented opportunity for biomedical research. However, the frequent occurrence of missing phenotypes presents a significant barrier to fully leveraging this potential. In our target application, on one hand, we have only a small and complete dataset with both genotypes and phenotypes to build a genetic prediction model, commonly called a polygenic (risk) score (PGS or PRS); on the other hand, we have a large dataset of genotypes (e.g. from a biobank) without the phenotype of interest. Our goal is to leverage the large dataset of genotypes (but without the phenotype) and a separate GWAS summary dataset of the phenotype to impute the phenotypes, which are then used as an individual-level dataset, along with the small complete dataset, to build a nonlinear model as PGS. More specifically, we trained some nonlinear models to 7 imputed and observed phenotypes from the UK Biobank data. We then trained an ensemble model to integrate these models for each trait, resulting in higher R2 values in prediction than using only the small complete (observed) dataset. Additionally, for 2 of the 7 traits, we observed that the nonlinear model trained with the imputed traits had higher R2 than using the imputed traits directly as the PGS, while for the remaining 5 traits, no improvement was found. These findings demonstrates the potential of leveraging existing genetic data and accounting for nonlinear genetic relationships to improve prediction accuracy for some traits.
RESUMO
Genome-wide association study (GWAS) summary data have become extremely useful in daily routine data analysis, largely facilitating new methods development and new applications. However, a severe limitation with the current use of GWAS summary data is its exclusive restriction to only linear single nucleotide polymorphism (SNP)-trait association analyses. To further expand the use of GWAS summary data, along with a large sample of individual-level genotypes, we propose a nonparametric method for large-scale imputation of the genetic component of the trait for the given genotypes. The imputed individual-level trait values, along with the individual-level genotypes, make it possible to conduct any analysis as with individual-level GWAS data, including nonlinear SNP-trait associations and predictions. We use the UK Biobank data to highlight the usefulness and effectiveness of the proposed method in three applications that currently cannot be done with only GWAS summary data (for SNP-trait associations): marginal SNP-trait association analysis under non-additive genetic models, detection of SNP-SNP interactions, and genetic prediction of a trait using a nonlinear model of SNPs.