1.
Nature ; 618(7966): 774-781, 2023 Jun.
Article in English | MEDLINE | ID: mdl-37198491

ABSTRACT

Polygenic scores (PGSs) have limited portability across different groupings of individuals (for example, by genetic ancestries and/or social determinants of health), preventing their equitable use¹⁻³. PGS portability has typically been assessed using a single aggregate population-level statistic (for example, R²)⁴, ignoring inter-individual variation within the population. Here, using a large and diverse Los Angeles biobank⁵ (ATLAS, n = 36,778) along with the UK Biobank⁶ (UKBB, n = 487,409), we show that PGS accuracy decreases individual-to-individual along the continuum of genetic ancestries⁷ in all considered populations, even within traditionally labelled 'homogeneous' genetic ancestries. The decreasing trend is well captured by a continuous measure of genetic distance (GD) from the PGS training data: Pearson correlation of -0.95 between GD and PGS accuracy averaged across 84 traits. When applying PGS models trained on individuals labelled as white British in the UKBB to individuals with European ancestries in ATLAS, individuals in the furthest GD decile have 14% lower accuracy relative to the closest decile; notably, the closest GD decile of individuals with Hispanic Latino American ancestries show similar PGS performance to the furthest GD decile of individuals with European ancestries. GD is significantly correlated with PGS estimates themselves for 82 of 84 traits, further emphasizing the importance of incorporating the continuum of genetic ancestries in PGS interpretation. Our results highlight the need to move away from discrete genetic ancestry clusters towards the continuum of genetic ancestries when considering PGSs.


Subject(s)
Multifactorial Inheritance , Racial Groups , Humans , Europe/ethnology , Hispanic or Latino/genetics , Multifactorial Inheritance/genetics , Racial Groups/genetics , United Kingdom , White People/genetics , European People/genetics , Los Angeles , Databases, Genetic
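The abstract above describes ranking individuals by a continuous genetic distance (GD) from the PGS training data and comparing accuracy across GD deciles. The following is a minimal sketch of that bookkeeping, not the authors' code. It assumes GD is the Euclidean distance, in principal-component space, between an individual and the centroid of the training sample; the abstract does not state the definition, and all function names here are hypothetical.

```python
import math
from statistics import mean


def centroid(rows):
    """Column-wise mean of the training individuals' PC coordinates."""
    return [mean(col) for col in zip(*rows)]


def genetic_distance(pcs, training_center):
    """Assumed GD: Euclidean distance in PC space from the training centroid."""
    return math.sqrt(sum((p - c) ** 2 for p, c in zip(pcs, training_center)))


def decile_bins(values):
    """Assign each value to a decile by rank (0 = closest, 9 = furthest)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * 10 // len(values)
    return bins


def pearson(x, y):
    """Pearson correlation, e.g. between per-bin GD and per-bin PGS accuracy."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```

With per-bin mean GD and per-bin accuracy in hand, `pearson` gives the kind of summary correlation the abstract reports; the binning and the distance definition are the parts most likely to differ from the published analysis.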
2.
Am J Hum Genet ; 2024 Oct 22.
Article in English | MEDLINE | ID: mdl-39471805

ABSTRACT

Large biobank samples provide an opportunity to integrate broad phenotyping, familial records, and molecular genetics data to study complex traits and diseases. We introduce Pearson-Aitken Family Genetic Risk Scores (PA-FGRS), a method for estimating disease liability from patterns of diagnoses in extended, age-censored genealogical records. We then apply the method to study a paradigmatic complex disorder, major depressive disorder (MDD), using the iPSYCH2015 case-cohort study of 30,949 MDD cases, 39,655 random population controls, and more than 2 million relatives. We show that combining PA-FGRS liabilities estimated from family records with molecular genotypes of probands improves three lines of inquiry. Incorporating PA-FGRS liabilities improves classification of MDD over and above polygenic scores, identifies robust genetic contributions to clinical heterogeneity in MDD associated with comorbidity, recurrence, and severity and can improve the power of genome-wide association studies. Our method is flexible and easy to use, and our study approaches are generalizable to other datasets and other complex traits and diseases.

3.
Am J Hum Genet ; 110(12): 2042-2055, 2023 Dec 07.
Article in English | MEDLINE | ID: mdl-37944514

ABSTRACT

LDpred2 is a widely used Bayesian method for building polygenic scores (PGSs). LDpred2-auto can infer the two parameters from the LDpred model, the SNP heritability h² and polygenicity p, so that it does not require an additional validation dataset to choose best-performing parameters. The main aim of this paper is to properly validate the use of LDpred2-auto for inferring multiple genetic parameters. Here, we present a new version of LDpred2-auto that adds an optional third parameter α to its model, for modeling negative selection. We then validate the inference of these three parameters (or two, when using the previous model). We also show that LDpred2-auto provides per-variant probabilities of being causal that are well calibrated and can therefore be used for fine-mapping purposes. We also introduce a formula to infer the out-of-sample predictive performance r² of the resulting PGS directly from the Gibbs sampler of LDpred2-auto. Finally, we extend the set of HapMap3 variants recommended to use with LDpred2 with 37% more variants to improve the coverage of this set, and we show that this new set of variants captures 12% more heritability and provides 6% more predictive performance, on average, in UK Biobank analyses.


Subject(s)
Genome-Wide Association Study , Multifactorial Inheritance , Humans , Bayes Theorem , Genome-Wide Association Study/methods , Multifactorial Inheritance/genetics , Polymorphism, Single Nucleotide/genetics
4.
Am J Hum Genet ; 109(1): 12-23, 2022 01 06.
Article in English | MEDLINE | ID: mdl-34995502

ABSTRACT

The low portability of polygenic scores (PGSs) across global populations is a major concern that must be addressed before PGSs can be used for everyone in the clinic. Indeed, prediction accuracy has been shown to decay as a function of the genetic distance between the training and test cohorts. However, such cohorts differ not only in their genetic distance but also in their geographical distance and their data collection and assaying, conflating multiple factors. In this study, we examine the extent to which PGSs are transferable between ancestries by deriving polygenic scores for 245 curated traits from the UK Biobank data and applying them in nine ancestry groups from the same cohort. By restricting both training and testing to the UK Biobank data, we reduce the risk of environmental and genotyping confounding from using different cohorts. We define the nine ancestry groups at a sub-continental level, based on a simple, robust, and effective method that we introduce here. We then apply two different predictive methods to derive polygenic scores for all 245 phenotypes and show a systematic and dramatic reduction in portability of PGSs trained using Northwestern European individuals and applied to nine ancestry groups. These analyses demonstrate that prediction already drops off within European ancestries and reduces globally in proportion to genetic distance. Altogether, our study provides unique and robust insights into the PGS portability problem.


Subject(s)
Genetic Association Studies/methods , Genetic Predisposition to Disease , Genetics, Population/methods , Multifactorial Inheritance , Algorithms , Alleles , Biological Specimen Banks , Genetic Variation , Genome-Wide Association Study , Genotype , Humans , Models, Genetic , Phenotype , Reproducibility of Results , United Kingdom
5.
Am J Hum Genet ; 109(3): 417-432, 2022 03 03.
Article in English | MEDLINE | ID: mdl-35139346

ABSTRACT

Genome-wide association studies (GWASs) have revolutionized human genetics, allowing researchers to identify thousands of disease-related genes and possible drug targets. However, case-control status does not account for the fact that not all controls may have lived through their period of risk for the disorder of interest. This can be quantified by examining the age-of-onset distribution and the age of the controls or the age of onset for cases. The age-of-onset distribution may also depend on information such as sex and birth year. In addition, family history is not routinely included in the assessment of control status. Here, we present LT-FH++, an extension of the liability threshold model conditioned on family history (LT-FH), which jointly accounts for age of onset and sex as well as family history. Using simulations, we show that, when family history and the age-of-onset distribution are available, the proposed approach yields statistically significant power gains over LT-FH and large power gains over genome-wide association study by proxy (GWAX). We applied our method to four psychiatric disorders available in the iPSYCH data and to mortality in the UK Biobank and found 20 genome-wide significant associations with LT-FH++, compared to ten for LT-FH and eight for a standard case-control GWAS. As more genetic data with linked electronic health records become available to researchers, we expect methods that account for additional health information, such as LT-FH++, to become even more beneficial.


Subject(s)
Genetic Predisposition to Disease , Genome-Wide Association Study , Age of Onset , Case-Control Studies , Genome-Wide Association Study/methods , Humans , Medical History Taking
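For readers unfamiliar with the liability threshold model that LT-FH and LT-FH++ build on, here is a minimal sketch of the base model only. It is not the LT-FH++ algorithm and uses no family history. It assumes a standard-normal liability and a single prevalence; the function names are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # liability is assumed to be N(0, 1)


def liability_threshold(prevalence):
    """Threshold T such that P(liability > T) equals the prevalence."""
    return STD_NORMAL.inv_cdf(1.0 - prevalence)


def expected_liability(prevalence, is_case):
    """Mean of a standard normal truncated at the threshold.

    Cases lie above the threshold, controls below it.
    """
    t = liability_threshold(prevalence)
    density = STD_NORMAL.pdf(t)
    if is_case:
        return density / prevalence
    return -density / (1.0 - prevalence)
```

One way to read the abstract's point about controls who have not yet lived through their period of risk: if the prevalence passed in is the cumulative incidence up to a control's current age (an assumption of this sketch), `expected_liability(0.01, False)` is much closer to zero than `expected_liability(0.2, False)`, so a young control carries less information than an older one.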
6.
Nucleic Acids Res ; 51(12): e67, 2023 07 07.
Article in English | MEDLINE | ID: mdl-37224538

ABSTRACT

Polygenic risk scores (PRSs) are expected to play a critical role in precision medicine. Currently, PRS predictors are generally based on linear models using summary statistics, and more recently individual-level data. However, these predictors mainly capture additive relationships and are limited in data modalities they can use. We developed a deep learning framework (EIR) for PRS prediction which includes a model, genome-local-net (GLN), specifically designed for large-scale genomics data. The framework supports multi-task learning, automatic integration of other clinical and biochemical data, and model explainability. When applied to individual-level data from the UK Biobank, the GLN model demonstrated a competitive performance compared to established neural network architectures, particularly for certain traits, showcasing its potential in modeling complex genetic relationships. Furthermore, the GLN model outperformed linear PRS methods for Type 1 Diabetes, likely due to modeling non-additive genetic effects and epistasis. This was supported by our identification of widespread non-additive genetic effects and epistasis in the context of T1D. Finally, we constructed PRS models that integrated genotype, blood, urine, and anthropometric data and found that this improved performance for 93% of the 290 diseases and disorders considered. EIR is available at https://github.com/arnor-sigurdsson/EIR.


Subject(s)
Models, Genetic , Multifactorial Inheritance , Polymorphism, Single Nucleotide , Humans , Genetic Predisposition to Disease , Genome, Human , Genome-Wide Association Study , Genomics/methods , Genotype , Risk Factors
7.
Am J Hum Genet ; 108(6): 1001-1011, 2021 06 03.
Article in English | MEDLINE | ID: mdl-33964208

ABSTRACT

The accuracy of polygenic risk scores (PRSs) to predict complex diseases increases with the training sample size. PRSs are generally derived based on summary statistics from large meta-analyses of multiple genome-wide association studies (GWASs). However, it is now common for researchers to have access to large individual-level data as well, such as the UK Biobank data. To the best of our knowledge, it has not yet been explored how best to combine both types of data (summary statistics and individual-level data) to optimize polygenic prediction. The most widely used approach to combine data is the meta-analysis of GWAS summary statistics (meta-GWAS), but we show that it does not always provide the most accurate PRS. Through simulations and using 12 real case-control and quantitative traits from both iPSYCH and UK Biobank along with external GWAS summary statistics, we compare meta-GWAS with two alternative data-combining approaches, stacked clumping and thresholding (SCT) and meta-PRS. We find that, when large individual-level data are available, the linear combination of PRSs (meta-PRS) is both a simple alternative to meta-GWAS and often more accurate.


Subject(s)
Disease/genetics , Genetic Predisposition to Disease , Genome-Wide Association Study , Models, Statistical , Multifactorial Inheritance , Polymorphism, Single Nucleotide , Case-Control Studies , Humans , Phenotype
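The meta-PRS idea in the abstract above, a linear combination of several PRSs learned from individual-level data, can be sketched as an ordinary least-squares fit. This is an illustration under stated assumptions rather than the authors' procedure: it treats the phenotype as quantitative (a case-control trait would more naturally use logistic regression), and `fit_weights` and `meta_prs` are hypothetical names.

```python
def fit_weights(score_columns, phenotype):
    """OLS weights (intercept first) for phenotype ~ PRS_1 + ... + PRS_m."""
    n = len(phenotype)
    design = [[1.0] + [col[i] for col in score_columns] for i in range(n)]
    k = len(design[0])
    # Normal equations (X'X) w = X'y, stored as an augmented matrix.
    aug = [
        [sum(r[a] * r[b] for r in design) for b in range(k)]
        + [sum(r[a] * y for r, y in zip(design, phenotype))]
        for a in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(k):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                aug[row] = [x - factor * p for x, p in zip(aug[row], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


def meta_prs(score_columns, weights):
    """Combine the input PRSs with the fitted weights."""
    n = len(score_columns[0])
    return [
        weights[0] + sum(w * col[i] for w, col in zip(weights[1:], score_columns))
        for i in range(n)
    ]
```

In practice the weights would be fitted on one subset of the individual-level data and the combined score evaluated on another, so that the reported accuracy is not inflated by overfitting.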
8.
J Urol ; : 101097JU0000000000004187, 2024 Aug 02.
Article in English | MEDLINE | ID: mdl-39093873

ABSTRACT

PURPOSE: Childhood incontinence is stigmatized and underprioritized, and a basic understanding of its pathogenesis is missing. Our goal was to identify risk-conferring genetic variants in daytime urinary incontinence (DUI). MATERIALS AND METHODS: We conducted a genome-wide association study in the Danish iPSYCH2015 cohort. Cases (3024) were identified through DUI diagnosis codes and redeemed prescriptions for DUI medication in patients aged 5 to 20 years. Controls (30,240), selected from the same sample, were matched to cases on sex and psychiatric diagnoses, if any, and down-sampled to a 1:10 case:control ratio. Replication was performed in the Icelandic deCODE cohort (5475 cases/287,773 controls). Single-nucleotide polymorphism heritability was calculated using the genome-based restricted maximum likelihood method. Cross-trait genetic correlation was estimated using linkage disequilibrium score regression. Polygenic risk scores generated with LDpred2-auto and BOLT-LMM were assessed for association. RESULTS: Variants on chromosome 6 (rs12210989, odds ratio [OR] 1.24, 95% CI 1.17-1.32, P = 3.21 × 10⁻¹²) and 20 (rs4809801, OR 1.18, 95% CI 1.11-1.25, P = 3.66 × 10⁻⁸) reached genome-wide significance and implicated the PRDM13 and RIPOR3 genes. Chromosome 6 findings were replicated (P = .024, OR 1.09, 95% CI 1.01-1.16). Liability scale heritability ranged from 10.20% (95% CI 6.40%-14.00%) to 15.30% (95% CI 9.66%-20.94%). DUI and nocturnal enuresis showed positive genetic correlation (rg = 1.28 ± 0.38, P = .0007). DUI was associated with attention-deficit/hyperactivity disorder (OR 1.098, 95% CI 1.046-1.152, P < .0001) and BMI (OR 1.129, 95% CI 1.081-1.178, P < .0001) polygenic risk. CONCLUSIONS: Common genetic variants contribute to the risk of childhood DUI, and genes important in neuronal development and detrusor smooth muscle activity were implicated. These findings may help guide identification of new treatment targets.

9.
Psychol Med ; : 1-10, 2024 Oct 14.
Article in English | MEDLINE | ID: mdl-39397681

ABSTRACT

BACKGROUND: The clinical course of major depressive disorder (MDD) is heterogeneous, and early-onset MDD often has a more severe and complex clinical course. Our goal was to determine whether polygenic scores (PGSs) for psychiatric disorders are associated with treatment trajectories in early-onset MDD treated in secondary care. METHODS: Data were drawn from the iPSYCH2015 sample, which includes all individuals born in Denmark between 1981 and 2008 who were treated in secondary care for depression between 1995 and 2015. We selected unrelated individuals of European ancestry with an MDD diagnosis between ages 10-25 (N = 10577). Seven-year trajectories of hospital contacts for depression were modeled using Latent Class Growth Analysis. Associations between PGS for MDD, bipolar disorder, schizophrenia, ADHD, and anorexia and trajectories of MDD contacts were modeled using multinomial logistic regressions. RESULTS: We identified four trajectory patterns: brief contact (65%), prolonged initial contact (20%), later re-entry (8%), and persistent contact (7%). Relative to the brief contact trajectory, higher PGS for ADHD was associated with increased odds of membership in the prolonged initial contact (odds ratio = 1.06, 95% confidence interval = 1.01-1.11) and persistent contact (1.12, 1.03-1.21) trajectories, while PGS-AN was associated with increased odds of membership in the persistent contact trajectory (1.12, 1.03-1.21). CONCLUSIONS: We found significant associations between polygenic liabilities for psychiatric disorders and treatment trajectories in patients with secondary-treated early-onset MDD. These findings help elucidate the relationship between a patient's genetics and their clinical course; however, the effect sizes are small and therefore unlikely to have predictive value in clinical settings.

10.
Psychol Med ; 54(9): 2073-2086, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38347808

ABSTRACT

BACKGROUND: Although several types of risk factors for anorexia nervosa (AN) have been identified, including birth-related factors, somatic, and psychosocial risk factors, their interplay with genetic susceptibility remains unclear. Genetic and epidemiological interplay in AN risk was examined using data from Danish nationwide registers. AN polygenic risk score (PRS) and risk factor associations, confounding from AN PRS and/or parental psychiatric history on the association between the risk factors and AN risk, and interactions between AN PRS and each level of target risk factor on AN risk were estimated. METHODS: Participants were individuals born in Denmark between 1981 and 2008 including nationwide-representative data from the iPSYCH2015, and Danish AN cases from the Anorexia Nervosa Genetics Initiative and Eating Disorder Genetics Initiative cohorts. A total of 7003 individuals with AN and 45 229 individuals without a registered AN diagnosis were included. We included 22 AN risk factors from Danish registers. RESULTS: Risk factors showing association with PRS for AN included urbanicity, parental ages, genitourinary tract infection, and parental socioeconomic factors. Risk factors showed the expected association with AN risk, and this association was only slightly attenuated when adjusted for parental history of psychiatric disorders and/or for the AN PRS. The interaction analyses revealed a differential effect of AN PRS according to the level of the following risk factors: sex, maternal age, genitourinary tract infection, C-section, parental socioeconomic factors and psychiatric history. CONCLUSIONS: Our findings provide evidence for interactions between AN PRS and certain risk factors, illustrating potential diverse risk pathways to AN diagnosis.


Subject(s)
Anorexia Nervosa , Genetic Predisposition to Disease , Multifactorial Inheritance , Registries , Humans , Anorexia Nervosa/epidemiology , Anorexia Nervosa/genetics , Denmark/epidemiology , Female , Risk Factors , Male , Registries/statistics & numerical data , Adult , Adolescent , Young Adult , Parents/psychology
11.
Psychol Med ; : 1-10, 2024 May 27.
Article in English | MEDLINE | ID: mdl-38801094

ABSTRACT

BACKGROUND: Psychiatric disorders and type 2 diabetes mellitus (T2DM) are heritable, polygenic, and often comorbid conditions, yet knowledge about their potential shared familial risk is lacking. We used family designs and T2DM polygenic risk score (T2DM-PRS) to investigate the genetic associations between psychiatric disorders and T2DM. METHODS: We linked 659 906 individuals born in Denmark 1990-2000 to their parents, grandparents, and aunts/uncles using population-based registers. We compared rates of T2DM in relatives of children with and without a diagnosis of any or one of 11 specific psychiatric disorders, including neuropsychiatric and neurodevelopmental disorders, using Cox regression. In a genotyped sample (iPSYCH2015) of individuals born 1981-2008 (n = 134 403), we used logistic regression to estimate associations between a T2DM-PRS and these psychiatric disorders. RESULTS: Among 5 235 300 relative pairs, relatives of individuals with a psychiatric disorder had an increased risk for T2DM with stronger associations for closer relatives (parents: hazard ratio = 1.38, 95% confidence interval 1.35-1.42; grandparents: 1.14, 1.13-1.15; and aunts/uncles: 1.19, 1.16-1.22). In the genetic sample, one standard deviation increase in T2DM-PRS was associated with an increased risk for any psychiatric disorder (odds ratio = 1.11, 1.08-1.14). Both familial T2DM and T2DM-PRS were significantly associated with seven of 11 psychiatric disorders, most strongly with attention-deficit/hyperactivity disorder and conduct disorder, and inversely with anorexia nervosa. CONCLUSIONS: Our findings of familial co-aggregation and higher T2DM polygenic liability associated with psychiatric disorders point toward shared familial risk. This suggests that part of the comorbidity is explained by shared familial risks. The underlying mechanisms still remain largely unknown and the contributions of genetics and environment need further investigation.

12.
Twin Res Hum Genet ; 27(2): 69-79, 2024 Apr.
Article in English | MEDLINE | ID: mdl-38644690

ABSTRACT

While it is known that vitamin D deficiency is associated with adverse bone outcomes, it remains unclear whether low vitamin D status may increase the risk of a wider range of health outcomes. We had the opportunity to explore the association between common genetic variants associated with both 25 hydroxyvitamin D (25OHD) and the vitamin D binding protein (DBP, encoded by the GC gene) with a comprehensive range of health disorders and laboratory tests in a large academic medical center. We used summary statistics for 25OHD and DBP to generate polygenic scores (PGS) for 66,482 participants with primarily European ancestry and 13,285 participants with primarily African ancestry from the Vanderbilt University Medical Center Biobank (BioVU). We examined the predictive properties of PGS25OHD, and two scores related to DBP concentration with respect to 1322 health-related phenotypes and 315 laboratory-measured phenotypes from electronic health records. In those with European ancestry: (a) the PGS25OHD and PGSDBP scores, and individual SNPs rs4588 and rs7041 were associated with both 25OHD concentration and 1,25 dihydroxyvitamin D concentrations; (b) higher PGS25OHD was associated with decreased concentrations of triglycerides and cholesterol, and reduced risks of vitamin D deficiency, disorders of lipid metabolism, and diabetes. In general, the findings for the African ancestry group were consistent with findings from the European ancestry analyses. Our study confirms the utility of PGS and two key variants within the GC gene (rs4588 and rs7041) to predict the risk of vitamin D deficiency in clinical settings and highlights the shared biology between vitamin D-related genetic pathways and a range of health outcomes.


Subject(s)
Vitamin D-Binding Protein , Vitamin D , Humans , Vitamin D-Binding Protein/genetics , Vitamin D/blood , Vitamin D/genetics , Vitamin D/analogs & derivatives , Female , Male , Middle Aged , Adult , Genome-Wide Association Study , Polymorphism, Single Nucleotide , White People/genetics , Phenotype , Aged , Vitamin D Deficiency/genetics , Vitamin D Deficiency/blood , Vitamin D Deficiency/epidemiology , Multifactorial Inheritance/genetics
13.
PLoS Genet ; 17(8): e1009713, 2021 08.
Article in English | MEDLINE | ID: mdl-34460823

ABSTRACT

Genome-wide association studies (GWASs) have uncovered a wealth of associations between common variants and human phenotypes. Here, we present an integrative analysis of GWAS summary statistics from 36 phenotypes to decipher multitrait genetic architecture and its link with biological mechanisms. Our framework incorporates multitrait association mapping along with an investigation of the breakdown of genetic associations into clusters of variants harboring similar multitrait association profiles. Focusing on two subsets of immunity and metabolism phenotypes, we then demonstrate how genetic variants within clusters can be mapped to biological pathways and disease mechanisms. Finally, for the metabolism set, we investigate the link between gene cluster assignment and the success of drug targets in randomized controlled trials.


Subject(s)
Computational Biology/methods , Polymorphism, Single Nucleotide , Quantitative Trait Loci , Cluster Analysis , Genetic Predisposition to Disease , Genome-Wide Association Study , Humans , Phenotype
14.
Psychol Med ; 53(1): 217-226, 2023 01.
Article in English | MEDLINE | ID: mdl-33949298

ABSTRACT

BACKGROUND: In this study, we examined the relationship between polygenic liability for depression and number of stressful life events (SLEs) as risk factors for early-onset depression treated in inpatient, outpatient or emergency room settings at psychiatric hospitals in Denmark. METHODS: Data were drawn from the iPSYCH2012 case-cohort sample, a population-based sample of individuals born in Denmark between 1981 and 2005. The sample included 18 532 individuals who were diagnosed with depression by a psychiatrist by age 31 years, and a comparison group of 20 184 individuals. Information on SLEs was obtained from nationwide registers and operationalized as a time-varying count variable. Hazard ratios and cumulative incidence rates were estimated using Cox regressions. RESULTS: Risk for depression increased by 35% with each standard deviation increase in polygenic liability (p < 0.0001), and 36% (p < 0.0001) with each additional SLE. There was a small interaction between polygenic liability and SLEs (β = -0.04, p = 0.0009). The probability of being diagnosed with depression in a hospital-based setting between ages 15 and 31 years ranged from 1.5% among males in the lowest quartile of polygenic liability with 0 events by age 15, to 18.8% among females in the highest quartile of polygenic liability with 4+ events by age 15. CONCLUSIONS: These findings suggest that although there is minimal interaction between polygenic liability and SLEs as risk factors for hospital-treated depression, combining information on these two important risk factors could potentially be useful for identifying high-risk individuals.


Subject(s)
Depression , Life Change Events , Male , Female , Humans , Infant , Adult , Cohort Studies , Risk Factors , Proportional Hazards Models , Case-Control Studies
15.
Brain ; 145(2): 555-568, 2022 04 18.
Article in English | MEDLINE | ID: mdl-35022648

ABSTRACT

Febrile seizures represent the most common type of pathological brain activity in young children and are influenced by genetic, environmental and developmental factors. In a minority of cases, febrile seizures precede later development of epilepsy. We conducted a genome-wide association study of febrile seizures in 7635 cases and 83 966 controls identifying and replicating seven new loci, all with P < 5 × 10⁻¹⁰. Variants at two loci were functionally related to altered expression of the fever response genes PTGER3 and IL10, and four other loci harboured genes (BSN, ERC2, GABRG2, HERC1) influencing neuronal excitability by regulating neurotransmitter release and binding, vesicular transport or membrane trafficking at the synapse. Four previously reported loci (SCN1A, SCN2A, ANO3 and 12q21.33) were all confirmed. Collectively, the seven novel and four previously reported loci explained 2.8% of the variance in liability to febrile seizures, and the single nucleotide polymorphism heritability based on all common autosomal single nucleotide polymorphisms was 10.8%. GABRG2, SCN1A and SCN2A are well-established epilepsy genes and, overall, we found positive genetic correlations with epilepsies (rg = 0.39, P = 1.68 × 10⁻⁴). Further, we found that higher polygenic risk scores for febrile seizures were associated with epilepsy and with history of hospital admission for febrile seizures. Finally, we found that polygenic risk of febrile seizures was lower in febrile seizure patients with neuropsychiatric disease compared to febrile seizure patients in a general population sample. In conclusion, this largest genetic investigation of febrile seizures to date implicates central fever response genes as well as genes affecting neuronal excitability, including several known epilepsy genes. Further functional and genetic studies based on these findings will provide important insights into the complex pathophysiological processes of seizures with and without fever.


Subject(s)
Epilepsy , Seizures, Febrile , Anoctamins/genetics , Child , Child, Preschool , Epilepsy/genetics , Fever/complications , Fever/genetics , Genome-Wide Association Study , Humans , NAV1.1 Voltage-Gated Sodium Channel/genetics , Seizures, Febrile/genetics
16.
Am J Hum Genet ; 105(6): 1213-1221, 2019 12 05.
Article in English | MEDLINE | ID: mdl-31761295

ABSTRACT

Polygenic prediction has the potential to contribute to precision medicine. Clumping and thresholding (C+T) is a widely used method to derive polygenic scores. When using C+T, several p value thresholds are tested to maximize predictive ability of the derived polygenic scores. Along with this p value threshold, we propose to tune three other hyper-parameters for C+T. We implement an efficient way to derive thousands of different C+T scores corresponding to a grid over four hyper-parameters. For example, it takes a few hours to derive 123K different C+T scores for 300K individuals and 1M variants using 16 physical cores. We find that optimizing over these four hyper-parameters improves the predictive performance of C+T in both simulations and real data applications as compared to tuning only the p value threshold. A particularly large increase can be noted when predicting depression status, from an AUC of 0.557 (95% CI: [0.544-0.569]) when tuning only the p value threshold to an AUC of 0.592 (95% CI: [0.580-0.604]) when tuning all four hyper-parameters we propose for C+T. We further propose stacked clumping and thresholding (SCT), a polygenic score that results from stacking all derived C+T scores. Instead of choosing one set of hyper-parameters that maximizes prediction in some training set, SCT learns an optimal linear combination of all C+T scores by using an efficient penalized regression. We apply SCT to eight different case-control diseases in the UK Biobank data and find that SCT substantially improves prediction accuracy with an average AUC increase of 0.035 over standard C+T.


Subject(s)
Algorithms , Disease/genetics , Genetic Predisposition to Disease , Genome-Wide Association Study , Multifactorial Inheritance/genetics , Polymorphism, Single Nucleotide , Biological Specimen Banks , Case-Control Studies , Computer Simulation , Humans , Models, Genetic , United Kingdom
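As a rough illustration of what a grid of C+T scores looks like, the sketch below implements a simplified clumping-and-thresholding procedure. It is not the bigsnpr implementation. It assumes a precomputed matrix of squared correlations between variants and ignores clumping windows. The abstract does not name the three additional hyper-parameters, so the grid here varies only two that are assumed for illustration: the clumping r² threshold and the p value threshold. All names are hypothetical.

```python
def clump(p_values, r2, r2_threshold):
    """Greedy clumping.

    Visit variants from most to least significant and keep a variant only
    if its r2 with every already-kept variant is at most r2_threshold.
    """
    kept = []
    for j in sorted(range(len(p_values)), key=lambda i: p_values[i]):
        if all(r2[j][k] <= r2_threshold for k in kept):
            kept.append(j)
    return kept


def ct_score(genotypes, betas, p_values, r2, r2_threshold, p_threshold):
    """Polygenic score from the clumped variants that pass the p value threshold."""
    selected = [
        j for j in clump(p_values, r2, r2_threshold) if p_values[j] <= p_threshold
    ]
    return [sum(betas[j] * row[j] for j in selected) for row in genotypes]


def ct_grid(genotypes, betas, p_values, r2, r2_thresholds, p_thresholds):
    """One C+T score vector per (r2 threshold, p value threshold) pair."""
    return {
        (r, p): ct_score(genotypes, betas, p_values, r2, r, p)
        for r in r2_thresholds
        for p in p_thresholds
    }
```

Standard C+T would pick the single grid cell that predicts best in a training set, whereas the stacking step described in the abstract would instead feed every column of the grid into a penalized regression.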
17.
Bioinformatics ; 36(22-23): 5424-5431, 2021 Apr 01.
Article in English | MEDLINE | ID: mdl-33326037

ABSTRACT

MOTIVATION: Polygenic scores have become a central tool in human genetics research. LDpred is a popular method for deriving polygenic scores based on summary statistics and a matrix of correlation between genetic variants. However, LDpred has limitations that may reduce its predictive performance. RESULTS: Here, we present LDpred2, a new version of LDpred that addresses these issues. We also provide two new options in LDpred2: a 'sparse' option that can learn effects that are exactly 0, and an 'auto' option that directly learns the two LDpred parameters from data. We benchmark predictive performance of LDpred2 against the previous version on simulated and real data, demonstrating substantial improvements in robustness and predictive accuracy compared to LDpred1. We then show that LDpred2 also outperforms other polygenic score methods recently developed, with a mean AUC over the 8 real traits analyzed here of 65.1%, compared to 63.8% for lassosum, 62.9% for PRS-CS and 61.5% for SBayesR. Note that LDpred2 provides more accurate polygenic scores when run genome-wide, instead of per chromosome. AVAILABILITY AND IMPLEMENTATION: LDpred2 is implemented in R package bigsnpr. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

18.
Mol Biol Evol ; 37(7): 2153-2154, 2020 07 01.
Article in English | MEDLINE | ID: mdl-32343802

ABSTRACT

R package pcadapt is a user-friendly R package for performing genome scans for local adaptation. Here, we present version 4 of pcadapt which substantially improves computational efficiency while providing similar results. This improvement is made possible by using a different format for storing genotypes and a different algorithm for computing principal components of the genotype matrix, which is the most computationally demanding step in method pcadapt. These changes are seamlessly integrated into the existing pcadapt package, and users will experience a large reduction in computation time (by a factor of 20-60 in our analyses) as compared with previous versions.


Subject(s)
Adaptation, Biological , Genomics/methods , Software
19.
Bioinformatics ; 36(16): 4449-4457, 2020 08 15.
Article in English | MEDLINE | ID: mdl-32415959

ABSTRACT

MOTIVATION: Principal component analysis (PCA) of genetic data is routinely used to infer ancestry and control for population structure in various genetic analyses. However, conducting PCA analyses can be complicated and has several potential pitfalls. These pitfalls include (i) capturing linkage disequilibrium (LD) structure instead of population structure, (ii) projected PCs that suffer from shrinkage bias, (iii) detecting sample outliers and (iv) uneven population sizes. In this work, we explore these potential issues when using PCA, and present efficient solutions to these. Following applications to the UK Biobank and the 1000 Genomes project datasets, we make recommendations for best practices and provide efficient and user-friendly implementations of the proposed solutions in R packages bigsnpr and bigutilsr. RESULTS: For example, we find that PC19-PC40 in the UK Biobank capture complex LD structure rather than population structure. Using our automatic algorithm for removing long-range LD regions, we recover 16 PCs that capture population structure only. Therefore, we recommend using only 16-18 PCs from the UK Biobank to account for population structure confounding. We also show how to use PCA to restrict analyses to individuals of homogeneous ancestry. Finally, when projecting individual genotypes onto the PCA computed from the 1000 Genomes project data, we find a shrinkage bias that becomes large for PC5 and beyond. We then demonstrate how to obtain unbiased projections efficiently using bigsnpr. Overall, we believe this work would be of interest for anyone using PCA in their analyses of genetic data, as well as for other omics data. AVAILABILITY AND IMPLEMENTATION: R packages bigsnpr and bigutilsr can be installed from either CRAN or GitHub (see https://github.com/privefl/bigsnpr). A tutorial on the steps to perform PCA on 1000G data is available at https://privefl.github.io/bigsnpr/articles/bedpca.html. All code used for this paper is available at https://github.com/privefl/paper4-bedpca/tree/master/code. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Genetics, Population , Software , Algorithms , Humans , Linkage Disequilibrium , Principal Component Analysis