1.
Genome Res ; 33(7): 1032-1041, 2023 07.
Article in English | MEDLINE | ID: mdl-37197991

ABSTRACT

Mendelian randomization (MR) has emerged as a powerful approach to leverage genetic instruments to infer causality between pairs of traits in observational studies. However, the results of such studies are susceptible to biases owing to weak instruments, as well as the confounding effects of population stratification and horizontal pleiotropy. Here, we show that family data can be leveraged to design MR tests that are provably robust to confounding from population stratification, assortative mating, and dynastic effects. We show in simulations that our approach, MR-Twin, is robust to confounding from population stratification and is not affected by weak instrument bias, whereas standard MR methods yield inflated false positive rates. We then conduct an exploratory analysis of MR-Twin and other MR methods applied to 121 trait pairs in the UK Biobank data set. Our results suggest that confounding from population stratification can lead to false positives for existing MR methods, whereas MR-Twin is immune to this type of confounding, and that MR-Twin can help assess whether traditional approaches may be inflated owing to confounding from population stratification.
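
For orientation, the kind of standard MR estimator that the abstract contrasts with MR-Twin can be sketched as follows. This is the widely used inverse-variance-weighted (IVW) estimator computed from per-variant summary statistics, not MR-Twin itself; the function name, argument names, and example numbers are illustrative assumptions and do not come from the article.

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW estimate of the causal effect of exposure on outcome.

    Each list holds one entry per genetic instrument (hypothetical inputs):
    variant-exposure effects, variant-outcome effects, and the standard
    errors of the variant-outcome effects.
    """
    if not (len(beta_exposure) == len(beta_outcome) == len(se_outcome)):
        raise ValueError("all inputs must have the same length")
    # Weight each variant by the inverse variance of its outcome effect.
    weights = [1.0 / se ** 2 for se in se_outcome]
    numerator = sum(w * bx * by
                    for w, bx, by in zip(weights, beta_exposure, beta_outcome))
    denominator = sum(w * bx ** 2 for w, bx in zip(weights, beta_exposure))
    if denominator == 0:
        raise ValueError("instruments carry no information about the exposure")
    estimate = numerator / denominator
    std_error = math.sqrt(1.0 / denominator)
    return estimate, std_error


# Made-up summary statistics for three instruments.
estimate, std_error = ivw_estimate([0.1, 0.2, 0.3],
                                   [0.05, 0.10, 0.15],
                                   [0.01, 0.01, 0.01])
print(estimate, std_error)
```

An estimator of this form has no built-in protection against population stratification, which is the type of confounding the abstract reports as a source of false positives for existing methods.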


Subject(s)
Mendelian Randomization Analysis , Reproduction , Bias , Genome-Wide Association Study , Mendelian Randomization Analysis/methods , Phenotype , Humans
2.
Nat Methods ; 19(4): 429-440, 2022 04.
Article in English | MEDLINE | ID: mdl-35396482

ABSTRACT

Evaluating metagenomic software is key for optimizing metagenome interpretation and is the focus of the Initiative for the Critical Assessment of Metagenome Interpretation (CAMI). The CAMI II challenge engaged the community to assess methods on realistic and complex datasets with long- and short-read sequences, created computationally from around 1,700 new and known genomes, as well as 600 new plasmids and viruses. Here we analyze 5,002 results by 76 program versions. Substantial improvements were seen in assembly, some due to long-read data. Related strains remained challenging for assembly and genome recovery through binning, as was assembly quality for the latter. Profilers markedly matured, with taxon profilers and binners excelling at higher bacterial ranks, but underperforming for viruses and Archaea. Clinical pathogen detection results revealed a need to improve reproducibility. Runtime and memory usage analyses identified efficient programs, including top performers with other metrics. The results identify challenges and guide researchers in selecting methods for analyses.


Subject(s)
Metagenome , Metagenomics , Archaea/genetics , Metagenomics/methods , Reproducibility of Results , Sequence Analysis, DNA , Software
3.
PLoS Genet ; 18(11): e1010447, 2022 11.
Article in English | MEDLINE | ID: mdl-36342933

ABSTRACT

We introduce the pleiotropic association test (PAT) for joint analysis of multiple traits using genome-wide association study (GWAS) summary statistics. The method utilizes the decomposition of phenotypic covariation into genetic and environmental components to create a likelihood ratio test statistic for each genetic variant. Though PAT does not directly interpret which trait(s) drive the association, a per trait interpretation of the omnibus p-value is provided through an extension to the meta-analysis framework, m-values. In simulations, we show PAT controls the false positive rate, increases statistical power, and is robust to model misspecifications of genetic effect. Additionally, simulations comparing PAT to three multi-trait methods, HIPO, MTAG, and ASSET, show PAT identified 15.3% more omnibus associations than the next best method. When these associations were interpreted on a per trait level using m-values, PAT had 37.5% more true per trait interpretations with a 0.92% false positive assignment rate. When analyzing four traits from the UK Biobank, PAT discovered 22,095 novel variants. Through the m-values interpretation framework, the number of per trait associations was almost tripled for two traits and nearly doubled for another trait relative to the original single trait GWAS.


Subject(s)
Genome-Wide Association Study , Polymorphism, Single Nucleotide , Genetic Pleiotropy , Genome-Wide Association Study/methods , Phenotype , Polymorphism, Single Nucleotide/genetics , Meta-Analysis as Topic
4.
Am J Hum Genet ; 108(1): 36-48, 2021 01 07.
Article in English | MEDLINE | ID: mdl-33352115

ABSTRACT

Identifying and interpreting pleiotropic loci is essential to understanding the shared etiology among diseases and complex traits. A common approach to mapping pleiotropic loci is to meta-analyze GWAS summary statistics across multiple traits. However, this strategy does not account for the complex genetic architectures of traits, such as genetic correlations and heritabilities. Furthermore, the interpretation is challenging because phenotypes often have different characteristics and units. We propose PLEIO (Pleiotropic Locus Exploration and Interpretation using Optimal test), a summary-statistic-based framework to map and interpret pleiotropic loci in a joint analysis of multiple diseases and complex traits. Our method maximizes power by systematically accounting for genetic correlations and heritabilities of the traits in the association test. Any set of related phenotypes, binary or quantitative traits with different units, can be combined seamlessly. In addition, our framework offers interpretation and visualization tools to help downstream analyses. Using our method, we combined 18 traits related to cardiovascular disease and identified 13 pleiotropic loci, which showed four different patterns of associations.
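
The "common approach" mentioned in the abstract, meta-analyzing summary statistics for one variant across traits, can be sketched as a standard fixed-effects inverse-variance meta-analysis. This is a minimal illustration of that baseline, not of PLEIO; the function name and example values are assumptions. As the abstract notes, a combination like this ignores genetic correlations, heritabilities, and differing units between traits.

```python
import math


def fixed_effects_meta(betas, std_errors):
    """Combine per-trait effect estimates for a single variant.

    Returns the pooled effect, its standard error, and the z-score.
    Inputs are hypothetical per-trait GWAS effect sizes and standard errors.
    """
    if len(betas) != len(std_errors) or not betas:
        raise ValueError("need one standard error per effect estimate")
    # Inverse-variance weights: more precise estimates count for more.
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled_beta, pooled_se, pooled_beta / pooled_se


# Made-up effects of one variant on two traits.
pooled_beta, pooled_se, z_score = fixed_effects_meta([0.1, 0.3], [0.1, 0.1])
print(pooled_beta, pooled_se, z_score)
```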


Subject(s)
Genetic Pleiotropy/genetics , Genome-Wide Association Study/methods , Cardiovascular Diseases/genetics , Genetic Predisposition to Disease/genetics , Humans , Phenotype , Polymorphism, Single Nucleotide/genetics , Quantitative Trait Loci/genetics
5.
Brief Bioinform ; 23(4)2022 07 18.
Article in English | MEDLINE | ID: mdl-35753701

ABSTRACT

Advances in whole-genome sequencing (WGS) promise to enable accurate and comprehensive structural variant (SV) discovery. Dissecting SVs from WGS data presents a substantial number of challenges, and a plethora of SV detection methods have been developed. Currently, evidence that investigators can use to select appropriate SV detection tools is lacking. In this article, we have evaluated the performance of SV detection tools on mouse and human WGS data using a comprehensive polymerase chain reaction-confirmed gold standard set of SVs and the genome-in-a-bottle variant set, respectively. In contrast to previous benchmarking studies, our gold standard dataset included a complete set of SVs, allowing us to report both precision and sensitivity rates of the SV detection methods. Our study investigates the ability of the methods to detect deletions, thus providing an optimistic estimate of SV detection performance, as SV detection methods that fail to detect deletions are likely to miss more complex SVs. We found that SV detection tools varied widely in their performance, with several methods providing a good balance between sensitivity and precision. Additionally, we have determined the SV callers best suited for low- and ultralow-pass sequencing data as well as for different deletion length categories.
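
As a rough illustration of how precision and sensitivity can be computed once a complete gold standard is available, the sketch below matches called deletions against true deletions. The matching rule (same chromosome, both breakpoints within a fixed tolerance, one-to-one matching) and all names and coordinates are assumptions made for this example; the abstract does not state the criterion the study used.

```python
def evaluate_deletions(called, truth, tolerance=100):
    """Return (precision, sensitivity) for a set of deletion calls.

    `called` and `truth` are lists of (chromosome, start, end) tuples.
    A call counts as a true positive if an unmatched gold-standard deletion
    lies on the same chromosome with both breakpoints within `tolerance`
    base pairs (a hypothetical criterion chosen for illustration).
    """
    unmatched_truth = list(truth)
    true_positives = 0
    for chrom, start, end in called:
        for candidate in unmatched_truth:
            t_chrom, t_start, t_end = candidate
            if (chrom == t_chrom
                    and abs(start - t_start) <= tolerance
                    and abs(end - t_end) <= tolerance):
                # Each gold-standard deletion may be matched only once.
                unmatched_truth.remove(candidate)
                true_positives += 1
                break
    precision = true_positives / len(called) if called else 0.0
    sensitivity = true_positives / len(truth) if truth else 0.0
    return precision, sensitivity


# Made-up example: three true deletions, two calls, one of them correct.
gold = [("chr1", 1000, 2000), ("chr1", 5000, 5600), ("chr2", 300, 900)]
calls = [("chr1", 1010, 1990), ("chr2", 10000, 10500)]
print(evaluate_deletions(calls, gold))
```

Sensitivity is only meaningful when the gold standard is complete, which is why the abstract stresses that point: with an incomplete truth set, missed deletions cannot be counted.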


Subject(s)
Benchmarking , Genome, Human , Animals , High-Throughput Nucleotide Sequencing/methods , Humans , Mice , Whole Genome Sequencing/methods
6.
PLoS Genet ; 17(9): e1009733, 2021 09.
Article in English | MEDLINE | ID: mdl-34543273

ABSTRACT

Increasingly large Genome-Wide Association Studies (GWAS) have yielded numerous variants associated with many complex traits, motivating the development of "fine mapping" methods to identify which of the associated variants are causal. Additionally, GWAS of the same trait for different populations are increasingly available, raising the possibility of refining fine mapping results further by leveraging different linkage disequilibrium (LD) structures across studies. Here, we introduce multiple study causal variants identification in associated regions (MsCAVIAR), a method that extends the popular CAVIAR fine mapping framework to a multiple study setting using a random effects model. MsCAVIAR only requires summary statistics and LD as input, accounts for uncertainty in association statistics using a multivariate normal model, allows for multiple causal variants at a locus, and explicitly models the possibility of different SNP effect sizes in different populations. We demonstrate the efficacy of MsCAVIAR in both a simulation study and a trans-ethnic, trans-biobank fine mapping analysis of High Density Lipoprotein (HDL).


Subject(s)
Genome-Wide Association Study , Causality , Chromosome Mapping/methods , Humans , Linkage Disequilibrium , Lipoproteins, HDL/genetics , Polymorphism, Single Nucleotide
7.
BMC Genomics ; 23(1): 260, 2022 Apr 04.
Article in English | MEDLINE | ID: mdl-35379194

ABSTRACT

BACKGROUND: The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has caused global disruption of human health and activity. Being able to trace the early outbreak of SARS-CoV-2 within a locality can inform public health measures and provide insights to contain or prevent viral transmission. Investigation of the transmission history requires efficient sequencing methods and analytic strategies, which can be generally useful in the study of viral outbreaks. METHODS: The County of Los Angeles (hereafter, LA County) sustained a large outbreak of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). To learn about the transmission history, we carried out surveillance viral genome sequencing to determine 142 viral genomes from unique patients seeking care at the University of California, Los Angeles (UCLA) Health System. 86 of these genomes were from samples collected before April 19, 2020. RESULTS: We found that the early outbreak in LA County, as in other international air travel hubs, was seeded by multiple introductions of strains from Asia and Europe. We identified a USA-specific strain, B.1.43, which was found predominantly in California and Washington State. While samples from LA County carried the ancestral B.1.43 genome, viral genomes from neighboring counties in California and from counties in Washington State carried additional mutations, suggesting a potential origin of B.1.43 in Southern California. We quantified the transmission rate of SARS-CoV-2 over time, and found evidence that the public health measures put in place in LA County to control the virus were effective at preventing transmission, but might have been undermined by the many introductions of SARS-CoV-2 into the region. CONCLUSION: Our work demonstrates that genome sequencing can be a powerful tool for investigating outbreaks and informing the public health response. Our results reinforce the critical need for the USA to have coordinated inter-state responses to the pandemic.


Subject(s)
COVID-19 , COVID-19/epidemiology , Disease Outbreaks , Genomics , Humans , Los Angeles/epidemiology , SARS-CoV-2/genetics
8.
PLoS Biol ; 17(6): e3000333, 2019 06.
Article in English | MEDLINE | ID: mdl-31220077

ABSTRACT

Developing new software tools for analysis of large-scale biological data is a key component of advancing modern biomedical research. Scientific reproduction of published findings requires running computational tools on data generated by such studies, yet little attention is presently allocated to the installability and archival stability of computational software tools. Scientific journals require data and code sharing, but none currently require authors to guarantee the continuing functionality of newly published tools. We have estimated the archival stability of computational biology software tools by performing an empirical analysis of the internet presence for 36,702 omics software resources published from 2005 to 2017. We found that almost 28% of all resources are currently not accessible through uniform resource locators (URLs) published in the paper they first appeared in. Among the 98 software tools selected for our installability test, 51% were deemed "easy to install," and 28% of the tools failed to be installed at all because of problems in the implementation. Moreover, for papers introducing new software, we found that the number of citations significantly increased when authors provided an easy installation process. We propose for incorporation into journal policy several practical solutions for increasing the widespread installability and archival stability of published bioinformatics software.


Subject(s)
Computational Biology/methods , Information Dissemination/methods , Information Storage and Retrieval/methods , Biomedical Research , Databases, Factual , Humans , Internet , Software/trends
9.
Nature ; 538(7626): 523-527, 2016 10 27.
Article in English | MEDLINE | ID: mdl-27760116

ABSTRACT

Three-dimensional physical interactions within chromosomes dynamically regulate gene expression in a tissue-specific manner. However, the 3D organization of chromosomes during human brain development and its role in regulating gene networks dysregulated in neurodevelopmental disorders, such as autism or schizophrenia, are unknown. Here we generate high-resolution 3D maps of chromatin contacts during human corticogenesis, permitting large-scale annotation of previously uncharacterized regulatory relationships relevant to the evolution of human cognition and disease. Our analyses identify hundreds of genes that physically interact with enhancers gained on the human lineage, many of which are under purifying selection and associated with human cognitive function. We integrate chromatin contacts with non-coding variants identified in schizophrenia genome-wide association studies (GWAS), highlighting multiple candidate schizophrenia risk genes and pathways, including transcription factors involved in neurogenesis, and cholinergic signalling molecules, several of which are supported by independent expression quantitative trait loci and gene expression analyses. Genome editing in human neural progenitors suggests that one of these distal schizophrenia GWAS loci regulates FOXG1 expression, supporting its potential role as a schizophrenia risk gene. This work provides a framework for understanding the effect of non-coding regulatory elements on human brain development and the evolution of cognition, and highlights novel mechanisms underlying neuropsychiatric disorders.


Subject(s)
Brain/embryology , Brain/metabolism , Chromatin/chemistry , Chromatin/genetics , Chromosomes, Human/chemistry , Chromosomes, Human/genetics , Gene Expression Regulation, Developmental , Nucleic Acid Conformation , Chromatin/metabolism , Chromosomes, Human/metabolism , Cognition , Enhancer Elements, Genetic/genetics , Epigenesis, Genetic , Forkhead Transcription Factors/genetics , Genetic Predisposition to Disease/genetics , Genome-Wide Association Study , Humans , Nerve Tissue Proteins/genetics , Neural Stem Cells/metabolism , Neurogenesis , Organ Specificity , Polymorphism, Single Nucleotide/genetics , Promoter Regions, Genetic/genetics , Reproducibility of Results , Schizophrenia/genetics , Schizophrenia/pathology
10.
PLoS Genet ; 15(12): e1008481, 2019 12.
Article in English | MEDLINE | ID: mdl-31834882

ABSTRACT

Many disease risk loci identified in genome-wide association studies are present in non-coding regions of the genome. Previous studies have found enrichment of expression quantitative trait loci (eQTLs) in disease risk loci, indicating that identifying causal variants for gene expression is important for elucidating the genetic basis of not only gene expression but also complex traits. However, detecting causal variants is challenging due to complex genetic correlation among variants known as linkage disequilibrium (LD) and the presence of multiple causal variants within a locus. Although several fine-mapping approaches have been developed to overcome these challenges, they may produce large sets of putative causal variants when true causal variants are in high LD with many non-causal variants. In eQTL studies, there is an additional source of information that can be used to improve fine-mapping called allelic imbalance (AIM) that measures imbalance in gene expression on two chromosomes of a diploid organism. In this work, we develop a novel statistical method that leverages both AIM and total expression data to detect causal variants that regulate gene expression. We illustrate through simulations and application to 10 tissues of the Genotype-Tissue Expression (GTEx) dataset that our method identifies the true causal variants with higher specificity than an approach that uses only eQTL information. Across all tissues and genes, our method achieves a median reduction rate of 11% in the number of putative causal variants. We use chromatin state data from the Roadmap Epigenomics Consortium to show that the putative causal variants identified by our method are enriched for active regions of the genome, providing orthogonal support that our method identifies causal variants with increased specificity.


Subject(s)
Allelic Imbalance , Chromatin/genetics , Chromosome Mapping/methods , Quantitative Trait Loci , Genetic Predisposition to Disease , Genome-Wide Association Study , Humans , Linkage Disequilibrium , Multifactorial Inheritance , Polymorphism, Single Nucleotide
11.
PLoS Genet ; 15(12): e1008528, 2019 12.
Article in English | MEDLINE | ID: mdl-31869344

ABSTRACT

Asthma is a chronic inflammatory disease of the airways with contributions from genes, environmental exposures, and their interactions. While genome-wide association studies (GWAS) in humans have identified ~200 susceptibility loci, the genetic factors that modulate risk of asthma through gene-environment (GxE) interactions remain poorly understood. Using the Hybrid Mouse Diversity Panel (HMDP), we sought to identify the genetic determinants of airway hyperreactivity (AHR) in response to diesel exhaust particles (DEP), a model traffic-related air pollutant. As measured by invasive plethysmography, AHR under control and DEP-exposed conditions varied 3-4-fold in over 100 inbred strains from the HMDP. A GWAS with linear mixed models mapped two loci significantly associated with lung resistance under control exposure to chromosomes 2 (p = 3.0 × 10⁻⁶) and 19 (p = 5.6 × 10⁻⁷). The chromosome 19 locus harbors Il33 and is syntenic to asthma association signals observed at the IL33 locus in humans. A GxE GWAS for post-DEP exposure lung resistance identified a significantly associated locus on chromosome 3 (p = 2.5 × 10⁻⁶). Among the genes at this locus is Dapp1, an adaptor molecule expressed in immune-related and mucosal tissues, including the lung. Dapp1-deficient mice exhibited significantly lower AHR than control mice but only after DEP exposure, thus functionally validating Dapp1 as one of the genes underlying the GxE association at this locus. In summary, our results indicate that some of the genetic determinants for asthma-related phenotypes may be shared between mice and humans, as well as the existence of GxE interactions in mice that modulate lung function in response to air pollution exposures relevant to humans.


Subject(s)
Adaptor Proteins, Signal Transducing/genetics , Air Pollutants/toxicity , Asthma/genetics , Bronchial Hyperreactivity/chemically induced , Lipoproteins/genetics , Vehicle Emissions/toxicity , Animals , Asthma/chemically induced , Bronchial Hyperreactivity/genetics , Chromosome Mapping , Disease Models, Animal , Female , Gene-Environment Interaction , Genetic Predisposition to Disease , Genome-Wide Association Study , Humans , Male , Mice , Plethysmography
12.
Am J Physiol Lung Cell Mol Physiol ; 320(1): L41-L62, 2021 01 01.
Article in English | MEDLINE | ID: mdl-33050709

ABSTRACT

In this study, a genetically diverse panel of 43 mouse strains was exposed to ammonia, and genome-wide association mapping was performed employing a single-nucleotide polymorphism (SNP) assembly. Transcriptomic analysis was used to help resolve the genetic determinants of ammonia-induced acute lung injury. The encoded proteins were prioritized based on molecular function, nonsynonymous SNP within a functional domain or SNP within the promoter region that altered expression. This integrative functional approach revealed 14 candidate genes that included Aatf, Avil, Cep162, Hrh4, Lama3, Plcb4, and Ube2cbp, which had significant SNP associations, and Aff1, Bcar3, Cntn4, Kcnq5, Prdm10, Ptcd3, and Snx19, which had suggestive SNP associations. Of these genes, Bcar3, Cep162, Hrh4, Kcnq5, and Lama3 are particularly noteworthy and had pathophysiological roles that could be associated with acute lung injury in several ways.


Subject(s)
Acute Lung Injury/pathology , Ammonia/toxicity , Genetic Markers , Genetic Predisposition to Disease , Genome-Wide Association Study , Polymorphism, Single Nucleotide , Transcriptome , Acute Lung Injury/chemically induced , Acute Lung Injury/genetics , Animals , Female , Gene Expression Regulation , Humans , Mice , Mice, Inbred BALB C , Mice, Inbred CBA
13.
PLoS Genet ; 14(12): e1007309, 2018 12.
Article in English | MEDLINE | ID: mdl-30589851

ABSTRACT

A genome-wide association study (GWAS) seeks to identify genetic variants that contribute to the development and progression of a specific disease. Over the past 10 years, new approaches using mixed models have emerged to mitigate the deleterious effects of population structure and relatedness in association studies. However, developing GWAS techniques to accurately test for association while correcting for population structure is a computational and statistical challenge. Using laboratory mouse strains as an example, our review characterizes the problem of population structure in association studies and describes how it can cause false positive associations. We then motivate mixed models in the context of unmodeled factors.


Subject(s)
Genetics, Population , Genome-Wide Association Study/methods , Models, Genetic , Animals , Bias , Disease/genetics , Female , Genome-Wide Association Study/statistics & numerical data , Humans , Linear Models , Male , Mice , Models, Statistical , Pedigree , Phenotype , Phylogeny , Polymorphism, Single Nucleotide
14.
BMC Biol ; 18(1): 92, 2020 07 28.
Article in English | MEDLINE | ID: mdl-32723395

ABSTRACT

An amendment to this paper has been published and can be accessed via the original article.

15.
BMC Biol ; 18(1): 37, 2020 04 07.
Article in English | MEDLINE | ID: mdl-32264902

ABSTRACT

Metagenomics studies leverage genomic reference databases to generate discoveries in basic science and translational research. However, current microbial studies use disparate reference databases that lack consistent standards of specimen inclusion, data preparation, taxon labelling and accessibility, hindering their quality and comprehensiveness, and calling for the establishment of recommendations for reference genome database assembly. Here, we analyze existing fungal and bacterial databases and discuss guidelines for the development of a master reference database that promises to improve the quality and quantity of omics research.


Subject(s)
Bacteria/genetics , Databases, Genetic/standards , Fungi/genetics , Metagenomics/standards , Metagenomics/instrumentation
16.
Am J Hum Genet ; 100(5): 789-802, 2017 May 04.
Article in English | MEDLINE | ID: mdl-28475861

ABSTRACT

Recent successes in genome-wide association studies (GWASs) make it possible to address important questions about the genetic architecture of complex traits, such as allele frequency and effect size. One lesser-known aspect of complex traits is the extent of allelic heterogeneity (AH) arising from multiple causal variants at a locus. We developed a computational method to infer the probability of AH and applied it to three GWASs and four expression quantitative trait loci (eQTL) datasets. We identified a total of 4,152 loci with strong evidence of AH. The proportion of all loci with identified AH is 4%-23% in eQTLs, 35% in GWASs of high-density lipoprotein (HDL), and 23% in GWASs of schizophrenia. For eQTLs, we observed a strong correlation between sample size and the proportion of loci with AH (R² = 0.85, p = 2.2 × 10⁻¹⁶), indicating that statistical power prevents identification of AH in other loci. Understanding the extent of AH may guide the development of new methods for fine mapping and association mapping of complex traits.


Subject(s)
Alleles , Gene Frequency , Quantitative Trait Loci , Databases, Genetic , Genetic Association Studies , Humans , Linkage Disequilibrium , Models, Molecular , Phenotype
17.
Genet Epidemiol ; 42(1): 49-63, 2018 02.
Article in English | MEDLINE | ID: mdl-29114909

ABSTRACT

BACKGROUND: Epistasis and gene-environment interactions are known to contribute significantly to variation of complex phenotypes in model organisms. However, their identification in human association studies remains challenging for myriad reasons. In the case of epistatic interactions, the large number of potential interacting sets of genes presents computational, multiple hypothesis correction, and other statistical power issues. In the case of gene-environment interactions, the lack of consistently measured environmental covariates in most disease studies precludes searching for interactions and creates difficulties for replicating studies. RESULTS: In this work, we develop a new statistical approach to address these issues that leverages genetic ancestry, defined as the proportion of ancestry derived from each ancestral population (e.g., the fraction of European/African ancestry in African Americans), in admixed populations. We applied our method to gene expression and methylation data from African American and Latino admixed individuals, respectively, identifying nine interactions that were significant at P < 5 × 10⁻⁸. We show that two of the interactions in methylation data replicate, and the remaining six are significantly enriched for low P-values (P < 1.8 × 10⁻⁶). CONCLUSION: We show that genetic ancestry can be a useful proxy for unknown and unmeasured covariates in the search for interaction effects. These results have important implications for our understanding of the genetic architecture of complex traits.


Subject(s)
Black People/genetics , Black or African American/genetics , Epistasis, Genetic/genetics , Gene-Environment Interaction , Hispanic or Latino/genetics , Models, Genetic , White People/genetics , DNA Methylation , Humans , Phenotype
18.
BMC Genomics ; 20(Suppl 5): 423, 2019 Jun 06.
Article in English | MEDLINE | ID: mdl-31167634

ABSTRACT

BACKGROUND: High throughput sequencing has spurred the development of metagenomics, which involves the direct analysis of microbial communities in various environments such as soil, ocean water, and the human body. Many existing methods based on marker genes or k-mers have limited sensitivity or are too computationally demanding for many users. Additionally, most work in metagenomics has focused on bacteria and archaea, neglecting to study other key microbes such as viruses and eukaryotes. RESULTS: Here we present a method, MiCoP (Microbiome Community Profiling), that uses fast-mapping of reads to build a comprehensive reference database of full genomes from viruses and eukaryotes to achieve maximum read usage and enable the analysis of the virome and eukaryome in each sample. We demonstrate that mapping of metagenomic reads is feasible for the smaller viral and eukaryotic reference databases. We show that our method is accurate on simulated and mock community data and identifies many more viral and fungal species than previously reported results on real data from the Human Microbiome Project. CONCLUSIONS: MiCoP is a mapping-based method that proves more effective than existing methods at abundance profiling of viruses and eukaryotes in metagenomic samples. MiCoP can be used to detect the full diversity of these communities. The code, data, and documentation are publicly available on GitHub at: https://github.com/smangul1/MiCoP.


Subject(s)
Computational Biology/methods , Fungi/genetics , Genetic Markers , Metagenomics/methods , Microbiota , Sequence Analysis, DNA/methods , Viruses/genetics , Algorithms , Fungi/classification , Genome, Fungal , Genome, Viral , High-Throughput Nucleotide Sequencing/methods , Humans , Viruses/classification
19.
Am J Hum Genet ; 99(1): 89-103, 2016 Jul 07.
Article in English | MEDLINE | ID: mdl-27292110

ABSTRACT

Genome-wide association studies (GWASs) have been successful in detecting variants correlated with phenotypes of clinical interest. However, the power to detect these variants depends on the number of individuals whose phenotypes are collected, and for phenotypes that are difficult to collect, the sample size might be insufficient to achieve the desired statistical power. The phenotype of interest is often difficult to collect, whereas surrogate phenotypes or related phenotypes are easier to collect and have already been collected in very large samples. This paper demonstrates how we take advantage of these additional related phenotypes to impute the phenotype of interest or target phenotype and then perform association analysis. Our approach leverages the correlation structure between phenotypes to perform the imputation. The correlation structure can be estimated from a smaller complete dataset for which both the target and related phenotypes have been collected. Under some assumptions, the statistical power can be computed analytically given the correlation structure of the phenotypes used in imputation. In addition, our method can impute the summary statistic of the target phenotype as a weighted linear combination of the summary statistics of related phenotypes. Thus, our method is applicable to datasets for which we have access only to summary statistics and not to the raw genotypes. We illustrate our approach by analyzing associated loci to triglycerides (TGs), body mass index (BMI), and systolic blood pressure (SBP) in the Northern Finland Birth Cohort dataset.
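
The idea of imputing the target phenotype's summary statistic as a weighted linear combination of related phenotypes' statistics can be sketched as below. The abstract does not say how the weights are chosen; this example assumes they come from the conditional mean of a multivariate normal distribution with the given phenotype correlations, and that all statistics are z-scores on a common scale. All function names and numbers are hypothetical.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("correlation matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def imputation_weights(corr_related, corr_target_related):
    """Weights w solving R w = c (assumed conditional-mean weighting).

    R is the correlation matrix among related phenotypes and c holds the
    correlations between the target and each related phenotype, both
    estimated from a smaller complete dataset.
    """
    return solve_linear_system(corr_related, corr_target_related)


def impute_statistic(weights, related_stats):
    """Imputed target statistic as a weighted sum of related statistics."""
    return sum(w * z for w, z in zip(weights, related_stats))


# Made-up correlations and z-scores for two related phenotypes.
weights = imputation_weights([[1.0, 0.5], [0.5, 1.0]], [0.6, 0.3])
print(weights, impute_statistic(weights, [2.0, 1.0]))
```

Under this assumption the imputed statistic has variance below one, so a real analysis would need to account for that before testing; the sketch leaves the statistic unscaled.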


Subject(s)
Genome-Wide Association Study/methods , Phenotype , Animals , Blood Pressure/genetics , Body Mass Index , Cohort Studies , Datasets as Topic , Finland , Genotype , Humans , Mice , Models, Genetic , Multifactorial Inheritance , Reproducibility of Results , Research Design , Sample Size , Triglycerides/blood
20.
Am J Hum Genet ; 98(6): 1181-1192, 2016 06 02.
Article in English | MEDLINE | ID: mdl-27259052

ABSTRACT

Estimation of heritability is fundamental in genetic studies. Recently, heritability estimation using linear mixed models (LMMs) has gained popularity because these estimates can be obtained from unrelated individuals collected in genome-wide association studies. Typically, heritability estimation under LMMs uses the restricted maximum likelihood (REML) approach. Existing methods for the construction of confidence intervals and estimators of SEs for REML rely on asymptotic properties. However, these assumptions are often violated because of the bounded parameter space, statistical dependencies, and limited sample size, leading to biased estimates and inflated or deflated confidence intervals. Here, we show that the estimation of confidence intervals by state-of-the-art methods is inaccurate, especially when the true heritability is relatively low or relatively high. We further show that these inaccuracies occur in datasets including thousands of individuals. Such biases are present, for example, in estimates of heritability of gene expression in the Genotype-Tissue Expression project and of lipid profiles in the Ludwigshafen Risk and Cardiovascular Health study. We also show that often the probability that the genetic component is estimated as 0 is high even when the true heritability is bounded away from 0, emphasizing the need for accurate confidence intervals. We propose a computationally efficient method, ALBI (accurate LMM-based heritability bootstrap confidence intervals), for estimating the distribution of the heritability estimator and for constructing accurate confidence intervals. Our method can be used as an add-on to existing methods for estimating heritability and variance components, such as GCTA, FaST-LMM, GEMMA, or EMMAX.


Subject(s)
Cardiovascular Diseases/genetics , Confidence Intervals , Gene-Environment Interaction , Multifactorial Inheritance/genetics , Polymorphism, Single Nucleotide/genetics , Quantitative Trait, Heritable , Computer Simulation , Genome-Wide Association Study , Genotype , Humans , Models, Genetic , Models, Statistical