Results 1 - 20 of 50
1.
Am J Hum Genet; 109(2): 253-269, 2022 02 03.
Article in English | MEDLINE | ID: mdl-35065708

ABSTRACT

Mucus obstruction is a central feature in the cystic fibrosis (CF) airways. A genome-wide association study (GWAS) of lung disease by the CF Gene Modifier Consortium (CFGMC) identified a significant locus containing two mucin genes, MUC20 and MUC4. Expression quantitative trait locus (eQTL) analysis using human nasal epithelia (HNE) from 94 CF-affected Canadians in the CFGMC demonstrated MUC4 eQTLs that mirrored the lung association pattern in the region, suggesting that MUC4 expression may mediate CF lung disease. Complications arose, however, with colocalization testing using existing methods: the locus is complex and the associated SNPs span a 0.2 Mb region with high linkage disequilibrium (LD) and evidence of allelic heterogeneity. We previously developed the Simple Sum (SS), a powerful colocalization test in regions with allelic heterogeneity, but SS assumed eQTLs to be present to achieve type I error control. Here we propose a two-stage SS (SS2) colocalization test that avoids a priori eQTL assumptions, accounts for multiple hypothesis testing and the composite null hypothesis, and enables meta-analysis. We compare SS2 to published approaches through simulation and demonstrate type I error control for all settings with the greatest power in the presence of high LD and allelic heterogeneity. Applying SS2 to the MUC20/MUC4 CF lung disease locus with eQTLs from CF HNE revealed significant colocalization with MUC4 (p = 1.31 × 10⁻⁵) rather than with MUC20. The SS2 is a powerful method to inform the responsible gene(s) at a locus and guide future functional studies. SS2 has been implemented in the application LocusFocus.


Subjects
Amino Acid Transport Systems/genetics, Cystic Fibrosis/genetics, Models, Statistical, Mucin-4/genetics, Mucins/genetics, Quantitative Trait Loci, Alleles, Amino Acid Transport Systems/metabolism, Cystic Fibrosis/metabolism, Cystic Fibrosis/pathology, Gene Expression Profiling, Gene Expression Regulation, Genetic Heterogeneity, Genome, Human, Genome-Wide Association Study, Humans, Linkage Disequilibrium, Lung/metabolism, Lung/pathology, Mucin-4/metabolism, Mucins/metabolism, Nasal Mucosa/metabolism, Nasal Mucosa/pathology, Polymorphism, Single Nucleotide
2.
J Proteome Res; 23(6): 2298-2305, 2024 Jun 07.
Article in English | MEDLINE | ID: mdl-38809146

ABSTRACT

Multiple hypothesis testing is an integral component of data analysis for large-scale technologies such as proteomics, transcriptomics, or metabolomics, for which the false discovery rate (FDR) and positive FDR (pFDR) have been accepted as error estimation and control measures. The pFDR is the expectation of the false discovery proportion (FDP), the ratio of the number of falsely rejected null hypotheses to the total number of rejected hypotheses. In practice, the expectation of the ratio is approximated by the ratio of expectations; however, the conditions for transforming the former into the latter have not been investigated. This work derives exact integral expressions for the expectation (pFDR) and variance of the FDP. The widely used approximation (ratio of expectations) is shown to be a particular case (in the limit of a large sample size) of the integral formula for the pFDR. A recurrence formula is provided to compute the pFDR for a predefined number of null hypotheses. The variance of the FDP is approximated for a practical application in peptide identification using forward and reversed protein sequences. The simulations demonstrate that the integral expression is more accurate than the approximate formula when the number of hypotheses is small. For large sample sizes, the pFDRs obtained by the integral expression and the approximation do not differ substantially. Applications to proteomics data sets are included.
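The gap between the expectation of the ratio and the ratio of expectations is easy to see numerically. Below is a minimal simulation sketch (not the paper's integral formula; the threshold, effect size, and null proportion are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def fdp_study(m=8, pi0=0.75, mu=3.0, z_cut=2.0, reps=20000):
    """Compare E[FDP | R > 0] (a pFDR-like quantity) with the common
    ratio-of-expectations approximation E[V] / E[R]."""
    m0 = int(m * pi0)                    # number of true nulls
    fdps, v_tot, r_tot = [], 0, 0
    for _ in range(reps):
        z = rng.normal(size=m)
        z[m0:] += mu                     # shift the non-null statistics
        reject = z > z_cut
        v, r = int(reject[:m0].sum()), int(reject.sum())
        v_tot += v
        r_tot += r
        if r > 0:
            fdps.append(v / r)           # FDP is defined only when R > 0
    return float(np.mean(fdps)), v_tot / r_tot

expectation_of_ratio, ratio_of_expectations = fdp_study()
print(expectation_of_ratio, ratio_of_expectations)
```

With only eight hypotheses the two quantities differ noticeably; increasing `m` brings them together, in line with the large-sample claim in the abstract.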


Subjects
Proteomics, Proteomics/methods, Algorithms, False Positive Reactions, Peptides/analysis, Peptides/chemistry, Peptides/metabolism, Computer Simulation, Humans
3.
Genet Epidemiol; 47(2): 167-184, 2023 03.
Article in English | MEDLINE | ID: mdl-36465006

ABSTRACT

Mediation hypothesis testing for a large number of mediators is challenging due to the composite structure of the null hypothesis, H0: αβ = 0 (α: effect of the exposure on the mediator after adjusting for confounders; β: effect of the mediator on the outcome after adjusting for exposure and confounders). In this paper, we reviewed three classes of methods for large-scale one-at-a-time mediation hypothesis testing. These methods are commonly used for continuous outcomes and continuous mediators, assuming there is no exposure-mediator interaction, so that the product αβ has a causal interpretation as the indirect effect. The first class of methods ignores the impact of different structures under the composite null hypothesis, namely, (1) α = 0, β ≠ 0; (2) α ≠ 0, β = 0; and (3) α = β = 0. The second class of methods weights the reference distribution under each case of the null to form a mixture reference distribution. The third class constructs a composite test statistic using the three p-values obtained under each case of the null so that the reference distribution of the composite statistic is approximately U(0, 1). In addition to these existing methods, we developed the Sobel-comp method, belonging to the second class, which uses a corrected mixture reference distribution for Sobel's test statistic. We performed extensive simulation studies to compare all six methods belonging to these three classes in terms of the false positive rates (FPRs) under the null hypothesis and the true positive rates under the alternative hypothesis. We found that the second class of methods, which uses a mixture reference distribution, could best maintain the FPRs at the nominal level under the null hypothesis and had the greatest true positive rates under the alternative hypothesis. We applied all methods to study the mediation mechanism of DNA methylation sites in the pathway from adult socioeconomic status to glycated hemoglobin level using data from the Multi-Ethnic Study of Atherosclerosis (MESA). We provide guidelines for choosing the optimal mediation hypothesis testing method in practice, and we developed an R package, medScan, available on CRAN, implementing all six methods.
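The difficulty at the α = β = 0 corner of the composite null can be illustrated with the classical Sobel statistic (this sketch assumes unit standard errors and simulated estimates; it is not the Sobel-comp correction). In this idealized case the statistic is exactly N(0, 1/4) rather than N(0, 1), so comparing it with standard normal quantiles is very conservative:

```python
import numpy as np

rng = np.random.default_rng(1)

def sobel_type1_at_corner(n_sim=100000, crit=1.959963984540054):
    """Empirical size of the two-sided Sobel test at the alpha = beta = 0
    corner of the composite null H0: alpha * beta = 0."""
    a = rng.normal(size=n_sim)            # alpha-hat, unit standard error
    b = rng.normal(size=n_sim)            # beta-hat, unit standard error
    sobel = a * b / np.sqrt(a**2 + b**2)  # Sobel statistic with se_a = se_b = 1
    return float(np.mean(np.abs(sobel) > crit))

rate = sobel_type1_at_corner()
print(rate)   # far below the nominal 0.05
```

This severe under-rejection at one corner of the null is exactly what motivates mixture reference distributions in the second class of methods.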


Subjects
Models, Genetic, Models, Statistical, Adult, Humans, Computer Simulation, Research Design
4.
J Exp Bot; 75(17): 5377-5389, 2024 Sep 11.
Article in English | MEDLINE | ID: mdl-38954539

ABSTRACT

Linear mixed models (LMMs) are a commonly used method for genome-wide association studies (GWAS) that aim to detect associations between genetic markers and phenotypic measurements in a population of individuals while accounting for population structure and cryptic relatedness. In a standard GWAS, hundreds of thousands to millions of statistical tests are performed, requiring control for multiple hypothesis testing. Typically, static corrections that penalize the number of tests performed are used to control for the family-wise error rate, which is the probability of making at least one false positive. However, it has been shown that in practice this threshold is too conservative for normally distributed phenotypes and not stringent enough for non-normally distributed phenotypes. Therefore, permutation-based LMM approaches have recently been proposed to provide a more realistic threshold that takes phenotypic distributions into account. In this work, we discuss the advantages of permutation-based GWAS approaches, including new simulations and results from a re-analysis of all publicly available Arabidopsis phenotypes from the AraPheno database.
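The permutation idea can be sketched without the LMM machinery (this toy version uses simple correlation tests on independent markers and ignores the population structure and relatedness that the reviewed approaches explicitly model):

```python
import numpy as np
from math import erfc

rng = np.random.default_rng(2)

def min_p_threshold(genotypes, phenotype, n_perm=200, fwer=0.05):
    """Permutation-based significance threshold: permute the phenotype,
    record the minimum P-value across markers each time, and take the
    fwer-quantile of those minima as the per-test threshold."""
    def all_pvalues(y):
        # two-sided P-values for marker-phenotype correlation,
        # via the normal approximation z = |r| * sqrt(n)
        n = len(y)
        g = (genotypes - genotypes.mean(0)) / genotypes.std(0)
        yc = (y - y.mean()) / y.std()
        r = g.T @ yc / n
        z = np.abs(r) * np.sqrt(n)
        return np.array([erfc(zi / np.sqrt(2)) for zi in z])

    minima = [all_pvalues(rng.permutation(phenotype)).min()
              for _ in range(n_perm)]
    return float(np.quantile(minima, fwer))

# toy data: 200 individuals, 500 independent markers, null phenotype
G = rng.binomial(2, 0.3, size=(200, 500)).astype(float)
y = rng.normal(size=200)
thr = min_p_threshold(G, y)
print(thr)   # data-adaptive threshold; compare with Bonferroni 0.05/500
```

With skewed or otherwise non-normal phenotypes the permutation threshold moves away from the Bonferroni value, which is the behavior the abstract highlights.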


Subjects
Arabidopsis, Genome-Wide Association Study, Phenotype, Arabidopsis/genetics, Models, Genetic, Linear Models, Computer Simulation
5.
Stat Med; 43(1): 61-88, 2024 01 15.
Article in English | MEDLINE | ID: mdl-37927105

ABSTRACT

Multiple hypothesis testing has been widely applied to problems dealing with high-dimensional data, for example, the selection of important variables or features from a large number of candidates while controlling the error rate. The most prevailing measure of error rate used in multiple hypothesis testing is the false discovery rate (FDR). In recent years, the local false discovery rate (fdr) has drawn much attention, due to its advantage of assessing the confidence of individual hypotheses. However, most methods estimate fdr through P-values or statistics with known null distributions, which are sometimes unavailable or unreliable. Adopting the innovative methodology of competition-based procedures, for example, the knockoff filter, this paper proposes a new approach, named TDfdr, to fdr estimation, which is free of P-values or known null distributions. Extensive simulation studies demonstrate that TDfdr can accurately estimate the fdr with two competition-based procedures. We applied the TDfdr method to two real biomedical tasks. One is to identify significantly differentially expressed proteins related to the COVID-19 disease, and the other is to detect mutations in the genotypes of HIV-1 that are associated with drug resistance. Higher discovery power was observed compared to existing popular methods.
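The competition-based idea that TDfdr builds on can be sketched in a few lines (this is the generic target-decoy FDR estimate, not the TDfdr estimator itself; the score distributions below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

def target_decoy_fdr(target_scores, decoy_scores, threshold):
    """Competition-based FDR estimate at a score threshold: decoys passing
    the cutoff stand in for the unknown number of false targets."""
    n_decoy = int(np.sum(decoy_scores >= threshold))
    n_target = int(np.sum(target_scores >= threshold))
    return min(1.0, n_decoy / max(n_target, 1))

# toy scores: targets are a mixture of true (shifted) and false matches;
# decoys mimic the false-match score distribution
false_like = rng.normal(0.0, 1.0, size=800)
true_like = rng.normal(3.0, 1.0, size=200)
targets = np.concatenate([false_like, true_like])
decoys = rng.normal(0.0, 1.0, size=1000)
est = target_decoy_fdr(targets, decoys, threshold=2.0)
print(est)   # estimated FDR among targets scoring >= 2.0
```

No P-values or parametric null distributions enter the estimate; everything is driven by the empirical competition between target and decoy scores.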


Subjects
Algorithms, Research Design, Humans, Computer Simulation
6.
BMC Bioinformatics; 24(1): 170, 2023 Apr 26.
Article in English | MEDLINE | ID: mdl-37101120

ABSTRACT

BACKGROUND: Genome-wide tests, including genome-wide association studies (GWAS) of germ-line genetic variants, driver tests of cancer somatic mutations, and transcriptome-wide association tests of RNAseq data, carry a high multiple testing burden. This burden can be overcome by enrolling larger cohorts or alleviated by using prior biological knowledge to favor some hypotheses over others. Here we compare these two methods in terms of their abilities to boost the power of hypothesis testing. RESULTS: We provide a quantitative estimate for progress in cohort sizes and present a theoretical analysis of the power of oracular hard priors: priors that select a subset of hypotheses for testing, with an oracular guarantee that all true positives are within the tested subset. This theory demonstrates that for GWAS, strong priors that limit testing to 100-1000 genes provide less power than typical annual 20-40% increases in cohort sizes. Furthermore, non-oracular priors that exclude even a small fraction of true positives from the tested set can perform worse than not using a prior at all. CONCLUSION: Our results provide a theoretical explanation for the continued dominance of simple, unbiased univariate hypothesis tests for GWAS: if a statistical question can be answered by larger cohort sizes, it should be answered by larger cohort sizes rather than by more complicated biased methods involving priors. We suggest that priors are better suited for non-statistical aspects of biology, such as pathway structure and causality, that are not yet easily captured by standard hypothesis tests.
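The trade-off between hard priors and cohort growth can be roughed out with a one-sided z-test under a normal approximation (the effect size, cohort sizes, and 40% growth figure below are illustrative assumptions, not the paper's calibrated estimates):

```python
from math import sqrt
from statistics import NormalDist

nd = NormalDist()

def power_bonferroni(n, m, effect=0.02, alpha=0.05):
    """Approximate power of a one-sided z-test for a per-individual
    effect `effect`, at the Bonferroni-corrected level alpha / m."""
    z_crit = nd.inv_cdf(1 - alpha / m)
    return 1 - nd.cdf(z_crit - effect * sqrt(n))

# oracular hard prior: only 1,000 candidate genes are tested
p_prior = power_bonferroni(n=50_000, m=1_000)
# no prior, but a 40% larger cohort and the full 1,000,000 SNPs
p_cohort = power_bonferroni(n=70_000, m=1_000_000)
print(p_prior, p_cohort)
```

At a fixed cohort size, shrinking the tested set always raises power; the paper's point is that realistic annual cohort growth closes this gap while avoiding the risk of a non-oracular prior excluding true positives.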


Subjects
Genome-Wide Association Study, Polymorphism, Single Nucleotide, Humans, Population Density, Transcriptome
7.
Brief Bioinform; 22(6), 2021 11 05.
Article in English | MEDLINE | ID: mdl-34459489

ABSTRACT

In genome-wide association studies (GWAS), it has become commonplace to test millions of single-nucleotide polymorphisms (SNPs) for phenotypic association. Gene-based testing can improve power to detect weak signals by reducing multiple testing and pooling signal strength. While such tests account for linkage disequilibrium (LD) structure of SNP alleles within each gene, current approaches do not capture LD of SNPs falling in different nearby genes, which can induce correlation of gene-based test statistics. We introduce an algorithm to account for this correlation. When a gene's test statistic is independent of others, it is assessed separately; when test statistics for nearby genes are strongly correlated, their SNPs are agglomerated and tested as a locus. To provide insight into SNPs and genes driving association within loci, we develop an interactive visualization tool to explore localized signal. We demonstrate our approach in the context of weakly powered GWAS for autism spectrum disorder, which is contrasted with more highly powered GWAS for schizophrenia and educational attainment. To increase power for these analyses, especially those for autism, we use adaptive P-value thresholding, guided by high-dimensional metadata modeled with gradient boosted trees, highlighting when and how it can be most useful. Notably, our workflow is based on summary statistics.


Subjects
Algorithms, Computational Biology/methods, Genetic Predisposition to Disease, Genetic Testing/standards, Genome-Wide Association Study/methods, Genome-Wide Association Study/standards, Alleles, Chromosome Mapping, Databases, Genetic, Genetic Testing/methods, Humans, Linkage Disequilibrium, Phenotype, Polymorphism, Single Nucleotide, Quantitative Trait Loci
8.
Stat Med; 42(12): 1931-1945, 2023 05 30.
Article in English | MEDLINE | ID: mdl-36914221

ABSTRACT

The analysis of large-scale datasets, especially in biomedical contexts, frequently involves a principled screening of multiple hypotheses. The celebrated two-group model jointly models the distribution of the test statistics with mixtures of two competing densities, the null and the alternative distributions. We investigate the use of weighted densities and, in particular, non-local densities as working alternative distributions, to enforce separation from the null and thus refine the screening procedure. We show how these weighted alternatives improve various operating characteristics, such as the Bayesian false discovery rate, of the resulting tests for a fixed mixture proportion with respect to a local, unweighted likelihood approach. Parametric and nonparametric model specifications are proposed, along with efficient samplers for posterior inference. By means of a simulation study, we exhibit how our model compares with both well-established and state-of-the-art alternatives in terms of various operating characteristics. Finally, to illustrate the versatility of our method, we conduct three differential expression analyses with publicly-available datasets from genomic studies of heterogeneous nature.


Subjects
Genomics, Humans, Likelihood Functions, Bayes Theorem, Computer Simulation
9.
Proc Natl Acad Sci U S A; 117(26): 15028-15035, 2020 06 30.
Article in English | MEDLINE | ID: mdl-32522875

ABSTRACT

To correct for a large number of hypothesis tests, most researchers rely on simple multiple testing corrections. Yet, new methodologies of selective inference could potentially improve power while retaining statistical guarantees, especially those that enable exploration of test statistics using auxiliary information (covariates) to weight hypothesis tests for association. We explore one such method, adaptive P-value thresholding (AdaPT), in the framework of genome-wide association studies (GWAS) and gene expression/coexpression studies, with particular emphasis on schizophrenia (SCZ). Selected SCZ GWAS association P values play the role of the primary data for AdaPT; single-nucleotide polymorphisms (SNPs) are selected because they are gene expression quantitative trait loci (eQTLs). This natural pairing of SNPs and genes allows us to map the following covariate values to these pairs: GWAS statistics from genetically correlated bipolar disorder, the effect size of SNP genotypes on gene expression, and gene-gene coexpression, captured by subnetwork (module) membership. In all, 24 covariates per SNP/gene pair were included in the AdaPT analysis using flexible gradient boosted trees. We demonstrate a substantial increase in power to detect SCZ associations using gene expression information from the developing human prefrontal cortex. We interpret these results in light of recent theories about the polygenic nature of SCZ. Importantly, our entire process for identifying enrichment and creating features with independent complementary data sources can be implemented in many different high-throughput settings to ultimately improve power.


Subjects
Bipolar Disorder/genetics, Schizophrenia/genetics, Algorithms, Genetic Predisposition to Disease, Genome-Wide Association Study, Genotype, Humans, Multifactorial Inheritance, Polymorphism, Single Nucleotide, Quantitative Trait Loci
10.
Biostatistics; 22(1): 68-81, 2021 01 28.
Article in English | MEDLINE | ID: mdl-31135886

ABSTRACT

The false discovery rate (FDR) measures the proportion of false discoveries among a set of hypothesis tests called significant. This quantity is typically estimated based on p-values or test statistics. In some scenarios, there is additional information available that may be used to more accurately estimate the FDR. We develop a new framework for formulating and estimating FDRs and q-values when an additional piece of information, which we call an "informative variable", is available. For a given test, the informative variable provides information about the prior probability a null hypothesis is true or the power of that particular test. The FDR is then treated as a function of this informative variable. We consider two applications in genomics. Our first application is a genetics of gene expression (eQTL) experiment in yeast where every genetic marker and gene expression trait pair are tested for associations. The informative variable in this case is the distance between each genetic marker and gene. Our second application is to detect differentially expressed genes in an RNA-seq study carried out in mice. The informative variable in this study is the per-gene read depth. The framework we develop is quite general, and it should be useful in a broad range of scientific applications.
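One simple way to let an informative variable modulate FDR quantities is to estimate the null proportion π₀ separately within bins of that variable (a sketch using a Storey-type estimator; the binning scheme and simulated enrichment are illustrative assumptions, not the authors' framework):

```python
import numpy as np
from math import erfc, sqrt

rng = np.random.default_rng(4)

def binned_pi0(pvals, info, n_bins=4, lam=0.5):
    """Storey-type null-proportion estimate pi0 = #{p > lam} / ((1 - lam) * m),
    computed separately within quantile bins of the informative variable."""
    edges = np.quantile(info, np.linspace(0, 1, n_bins + 1))
    estimates = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (info >= lo) & (info <= hi)
        m = int(mask.sum())
        pi0 = np.sum(pvals[mask] > lam) / ((1 - lam) * m)
        estimates.append(min(1.0, float(pi0)))
    return estimates

# toy data: tests with small `info` are enriched for true signals
m = 4000
info = rng.uniform(size=m)
is_null = rng.uniform(size=m) < np.clip(info, 0.2, 0.95)
z = rng.normal(size=m) + np.where(is_null, 0.0, 2.5)
pvals = np.array([0.5 * erfc(zi / sqrt(2)) for zi in z])  # one-sided P-values
pi0_by_bin = binned_pi0(pvals, info)
print(pi0_by_bin)   # rises with `info`: fewer signals where `info` is large
```

Plugging bin-specific π₀ estimates into FDR or q-value calculations is the crudest version of treating the FDR as a function of the informative variable, such as marker-gene distance or read depth.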


Subjects
Genetic Markers, Genomics, Animals, Gene Expression Profiling, Genetic Markers/genetics, Genomics/methods, Mice, Probability, Saccharomyces cerevisiae/genetics
11.
Stat Med; 40(24): 5333-5350, 2021 10 30.
Article in English | MEDLINE | ID: mdl-34636081

ABSTRACT

The problem of testing multiple hypotheses using a group sequential procedure often arises in clinical trials. We review several group sequential Holm (GSHM) type procedures proposed in the literature and clarify the relationships between them. In particular, we show which procedures are equivalent or, if different, which are more powerful, and what their pros and cons are. We propose a step-up group sequential Hochberg (GSHC) procedure as a reverse application of a particular step-down GSHM procedure. We conducted an extensive simulation study to evaluate the familywise error rate (FWER) and power properties of that GSHM procedure and the GSHC procedure and found that the GSHC procedure controls FWER more closely and is more powerful. All procedures are illustrated with a common numerical example, the data for which are chosen to bring out the differences between them. A real case study is also presented to illustrate application of these procedures. R programs for applying the proposed procedures, additional simulation results, and the proof of the FWER control of the GSHC procedure in a special case are provided in Supplementary Material.
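The fixed-sample building blocks of these group sequential procedures are the Holm (step-down) and Hochberg (step-up) tests; the sketch below shows, on one toy p-value set, how the step-up version can reject strictly more:

```python
def holm(pvals, alpha=0.05):
    """Step-down Holm procedure: boolean rejection list, FWER <= alpha."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for k, i in enumerate(order):        # smallest p first
        if pvals[i] <= alpha / (m - k):
            reject[i] = True
        else:
            break                        # stop at the first failure
    return reject

def hochberg(pvals, alpha=0.05):
    """Step-up Hochberg procedure: at least as powerful as Holm
    (valid under nonnegative dependence)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i], reverse=True)
    reject = [False] * m
    crossed = False
    for k, i in enumerate(order):        # largest p first
        if crossed or pvals[i] <= alpha / (k + 1):
            crossed = True
            reject[i] = True
    return reject

ps = [0.010, 0.030, 0.040]
print(holm(ps), hochberg(ps))
```

Here Holm rejects only the smallest p-value, while Hochberg rejects all three, mirroring the power ordering the abstract reports for the group sequential versions.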


Subjects
Clinical Trials as Topic, Computer Simulation
12.
Article in English | MEDLINE | ID: mdl-33156000

ABSTRACT

Combining correlated p-values from multiple hypothesis tests is among the most frequently used approaches for integrating information in genetic and genomic data analysis. However, most existing methods for combining independent p-values from individual component problems into a single unified p-value are unsuitable for the correlation structure among p-values from multiple hypothesis tests. Although some existing p-value combination methods have been modified to overcome these limitations, there is no uniformly most powerful method for combining correlated p-values in genetic data analysis. A p-value combination method that robustly controls the type I error while retaining good power is therefore needed. In this paper, we propose an empirical method based on the gamma distribution (EMGD) for combining dependent p-values from multiple hypothesis tests. The proposed test, EMGD, flexibly accommodates highly correlated p-values into a unified p-value for examining the combined hypothesis of interest. The EMGD retains the robustness of the empirical Brown's method (EBM) for pooling dependent p-values, and it keeps the character of the gamma-distribution-based method, which simultaneously retains the advantages of the z-transform and gamma-transform tests for combining dependent p-values from multiple statistical tests. These two characteristics give the EMGD robust power for combining dependent p-values. The performance of the proposed EMGD is illustrated with simulations and real-data applications, in comparison with existing methods such as Kost and McDermott's method, the EBM, and the harmonic mean p-value method.
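For intuition, the classical ingredients EMGD is compared against can be sketched: Fisher's combined statistic, and a Brown-style moment-matching correction for dependence estimated here from simulated null p-values (the latent-factor correlation structure is an illustrative assumption):

```python
import numpy as np
from math import erfc

rng = np.random.default_rng(5)

def fisher_stat(pvals):
    """Fisher's combined statistic T = -2 * sum(log p); chi^2 with 2m df
    when the p-values are independent."""
    return -2.0 * float(np.sum(np.log(pvals)))

def brown_scale(null_p):
    """Brown-style moment matching from a matrix of simulated null
    p-values (rows = replications): find c, f so that T / c ~ chi^2_f."""
    T = -2.0 * np.log(null_p).sum(axis=1)
    c = T.var() / (2.0 * T.mean())        # c = 1, f = 2m under independence
    f = 2.0 * T.mean() ** 2 / T.var()
    return float(c), float(f)

# correlated null p-values: 5 one-sided tests sharing a latent factor
reps, m, rho = 5000, 5, 0.6
shared = rng.normal(size=(reps, 1))
z = np.sqrt(rho) * shared + np.sqrt(1 - rho) * rng.normal(size=(reps, m))
pnull = 0.5 * np.array([[erfc(v / np.sqrt(2)) for v in row] for row in z])
c, f = brown_scale(pnull)
print(c, f)   # c > 1 and f < 2 * m under positive dependence
```

Referring Fisher's statistic to a rescaled chi-square with fewer degrees of freedom is the basic dependence correction; the gamma-transform family generalizes the chi-square reference that both Fisher's and Brown's methods rely on.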

13.
Mol Cell Proteomics; 18(4): 796-805, 2019 04.
Article in English | MEDLINE | ID: mdl-30647073

ABSTRACT

Within the last several years, top-down proteomics has emerged as a high throughput technique for protein and proteoform identification. This technique has the potential to identify and characterize thousands of proteoforms within a single study, but the absence of accurate false discovery rate (FDR) estimation could hinder the adoption and consistency of top-down proteomics in the future. In automated identification and characterization of proteoforms, FDR calculation strongly depends on the context of the search. The context includes MS data quality, the database being interrogated, the search engine, and the parameters of the search. Particular to top-down proteomics, there are four molecular levels of study: proteoform spectral match (PrSM), protein, isoform, and proteoform. Here, a context-dependent framework for calculating an accurate FDR at each level was designed, implemented, and validated against a manually curated training set with 546 confirmed proteoforms. We examined several search contexts and found that an FDR calculated at the PrSM level under-reported the true FDR at the protein level by an average of 24-fold. We present a new open-source tool, the TDCD_FDR_Calculator, which provides a scalable, context-dependent FDR calculation that can be applied post-search to enhance the quality of results in top-down proteomics from any search engine.


Subjects
Proteomics/methods, Algorithms, Databases, Protein, Humans, Protein Isoforms/metabolism, Reproducibility of Results
14.
Stat Med; 39(26): 3772-3786, 2020 11 20.
Article in English | MEDLINE | ID: mdl-32706424

ABSTRACT

Clinical trials routinely involve multiple hypothesis testing. The closed testing procedure (CTP) is a fundamental principle in testing multiple hypotheses. This article presents an improved CTP in which intersection hypotheses can be tested at a level greater than α such that the control of the familywise error rate at level α remains. Consequently, our method uniformly improves the power of discovering false hypotheses over the original CTP. We illustrate that an improvement by our method exists for many commonly used tests. An empirical study on the effectiveness of a glucose-lowering drug is provided.
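The baseline CTP can be sketched with Bonferroni local tests (with these local tests, closed testing reduces to Holm's procedure; the article's improvement lies in testing the intersection hypotheses at levels above α, which this sketch does not attempt):

```python
from itertools import combinations

def closed_testing(pvals, alpha=0.05):
    """Closed testing principle: reject the elementary hypothesis H_i iff
    every intersection hypothesis containing i is rejected at level alpha.
    Local tests here are Bonferroni tests of each intersection."""
    m = len(pvals)
    idx = range(m)

    def local_reject(subset):
        # Bonferroni test of the intersection hypothesis over `subset`
        return min(pvals[i] for i in subset) <= alpha / len(subset)

    reject = []
    for i in idx:
        subsets = (s for r in range(1, m + 1)
                   for s in combinations(idx, r) if i in s)
        reject.append(all(local_reject(s) for s in subsets))
    return reject

print(closed_testing([0.01, 0.04, 0.30]))
```

Any valid level-α test of each intersection yields FWER control at α for the elementary hypotheses; swapping in more powerful intersection tests is exactly where improvements of the kind described can enter.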


Subjects
Clinical Trials as Topic, Research Design, Humans
15.
Genet Epidemiol; 42(7): 590-607, 2018 Oct.
Article in English | MEDLINE | ID: mdl-30120797

ABSTRACT

Concerns over reproducibility in research have reinvigorated the discourse on P-values as measures of statistical evidence. In a position statement, the American Statistical Association board of directors warns of P-value misuse and refers to the availability of alternatives. Despite the common practice of comparing P-values across different hypothesis tests in genetics, it is well appreciated that P-values must be interpreted alongside the sample size and experimental design used for their computation. Here, we discuss the evidential statistical paradigm (EP), an alternative to the Bayesian and Frequentist paradigms, that has been implemented in human genetics studies. Using applications in Cystic Fibrosis genetic association analyses, and describing recent theoretical developments, we review how to measure statistical evidence using the EP in the presence of covariates, model misspecification, and for composite hypotheses. Novel graphical displays are presented, and software for their computation is highlighted. The implications of multiple hypothesis testing for the EP are delineated in the analyses, demonstrating a view more consistent with scientific reasoning; the EP provides a theoretical justification for the replication that is a requirement in genetic association studies. As genetic studies grow in size and complexity, a fresh look at measures of statistical evidence that are sensible amid the analysis of big data is required.


Subjects
Models, Genetic, Models, Statistical, Antiporters/genetics, Bayes Theorem, Cystic Fibrosis/genetics, Genetic Association Studies, Genetic Predisposition to Disease, Humans, Likelihood Functions, Meconium/metabolism, Probability, Reproducibility of Results, Sample Size, Software, Sulfate Transporters
16.
Biometrics; 75(2): 650-662, 2019 06.
Article in English | MEDLINE | ID: mdl-30430537

ABSTRACT

Identification of differentially expressed genes (DE genes) is commonly conducted in modern biomedical research. However, unwanted variation inevitably arises during the data collection process, which can make the detection results heavily biased. Various methods have been suggested for removing the unwanted variation while keeping the biological variation to ensure a reliable analysis result. Removing unwanted variation (RUV) has recently been proposed for this purpose, which works by virtue of negative control genes. On the other hand, outliers frequently appear in modern high-throughput genetic data, which can heavily affect the performance of RUV and its downstream analysis. In this work, we propose a robust RUV-testing procedure (a robust RUV procedure to remove unwanted variation, followed by a robust testing procedure to identify DE genes) via γ-divergence. The advantages of our method are twofold: (a) it does not involve any modeling for the outlier distribution, which makes it applicable to various situations; (b) it is easy to implement in the sense that its robustness is controlled by a single tuning parameter γ of γ-divergence, and a data-driven criterion is developed to select γ. When applied to real data sets, our method can successfully remove unwanted variation and identify more DE genes than conventional methods.


Subjects
Gene Expression Profiling/methods, Models, Genetic, Data Interpretation, Statistical, High-Throughput Nucleotide Sequencing/statistics & numerical data, Humans
17.
Biom J; 61(2): 424-441, 2019 03.
Article in English | MEDLINE | ID: mdl-30589104

ABSTRACT

Next-generation sequencing (NGS) experiments are often performed in biomedical research nowadays, leading to methodological challenges related to the high-dimensional and complex nature of the recorded data. In this work we review some of the issues that arise in disorder detection from NGS experiments, that is, when the focus is the detection of deletion and duplication disorders for homozygosity and heterozygosity in DNA sequencing. A statistical model to cope with guanine/cytosine bias and phasing and prephasing phenomena at the base level is proposed, and a goodness-of-fit procedure for disorder detection is derived. The method combines the proper evaluation of local p-values (one for each DNA base) with suitable corrections for multiple comparisons and the discrete nature of the p-values. A global test for the detection of disorders in the whole DNA region is also proposed. The performance of the introduced procedures is investigated through simulations. A real data illustration is provided.


Subjects
Biostatistics/methods, High-Throughput Nucleotide Sequencing, Sequence Analysis, DNA, Heterozygote, Homozygote, Models, Statistical, Monte Carlo Method
18.
Genet Epidemiol; 41(5): 455-466, 2017 07.
Article in English | MEDLINE | ID: mdl-28421636

ABSTRACT

Tissue factor pathway inhibitor (TFPI) regulates the formation of intravascular blood clots, which manifest clinically as ischemic heart disease, ischemic stroke, and venous thromboembolism (VTE). TFPI plasma levels are heritable, but the genetics underlying TFPI plasma level variability are poorly understood. Herein we report the first genome-wide association scan (GWAS) of TFPI plasma levels, conducted in 251 individuals from five extended French-Canadian families ascertained on VTE. To improve discovery, we also applied a hypothesis-driven (HD) GWAS approach that prioritized single nucleotide polymorphisms (SNPs) in (1) hemostasis pathway genes, and (2) vascular endothelial cell (EC) regulatory regions, which are among the highest expressers of TFPI. Our GWAS identified 131 SNPs with suggestive evidence of association (P-value < 5 × 10⁻⁸), but no SNPs reached the genome-wide threshold for statistical significance. Hemostasis pathway genes were not enriched for TFPI plasma level associated SNPs (global hypothesis test P-value = 0.147), but EC regulatory regions contained more TFPI plasma level associated SNPs than expected by chance (global hypothesis test P-value = 0.046). We therefore stratified our genome-wide SNPs, prioritizing those in EC regulatory regions via stratified false discovery rate (sFDR) control, and reranked the SNPs by q-value. The minimum q-value was 0.27, and the top-ranked SNPs did not show association evidence in the MARTHA replication sample of 1,033 unrelated VTE cases. Although this study did not result in new loci for TFPI, our work lays out a strategy to utilize epigenomic data in prioritization schemes for future GWAS.


Subjects
Biomarkers/blood, Lipoproteins/blood, Lipoproteins/genetics, Polymorphism, Single Nucleotide/genetics, Regulatory Sequences, Nucleic Acid/genetics, Venous Thromboembolism/blood, Venous Thromboembolism/genetics, Adult, Canada, Cells, Cultured, Endothelium, Vascular/metabolism, Endothelium, Vascular/pathology, Epigenomics, Female, Genome-Wide Association Study/methods, Humans, Male, Venous Thromboembolism/diagnosis
19.
Genet Epidemiol; 40(1): 45-56, 2016 Jan.
Article in English | MEDLINE | ID: mdl-26626037

ABSTRACT

The genetic basis of multiple phenotypes such as gene expression, metabolite levels, or imaging features is often investigated by testing a large collection of hypotheses, probing the existence of association between each of the traits and hundreds of thousands of genotyped variants. Appropriate multiplicity adjustment is crucial to guarantee replicability of findings, and the false discovery rate (FDR) is frequently adopted as a measure of global error. In the interest of interpretability, results are often summarized so that reporting focuses on variants discovered to be associated with some phenotypes. We show that applying FDR-controlling procedures on the entire collection of hypotheses fails to control the rate of false discovery of associated variants as well as the expected value of the average proportion of false discovery of phenotypes influenced by such variants. We propose a simple hierarchical testing procedure that allows control of both these error rates and provides a more reliable basis for the identification of variants with functional effects. We demonstrate the utility of this approach through simulation studies comparing various error rates and measures of power for genetic association studies of multiple traits. Finally, we apply the proposed method to identify genetic variants that impact flowering phenotypes in Arabidopsis thaliana, expanding the set of discoveries.
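A hierarchical procedure in the spirit described can be sketched under assumptions of our own choosing (Simes to combine each variant's phenotype P-values, Benjamini-Hochberg across variants, then BH within selected variants at an adjusted level; this is an illustration, not the authors' exact procedure):

```python
import numpy as np

def simes(pvals):
    """Simes combination: a variant-level P-value from its phenotype P-values."""
    p = np.sort(np.asarray(pvals))
    m = len(p)
    return float(np.min(m * p / np.arange(1, m + 1)))

def bh_reject(pvals, q):
    """Benjamini-Hochberg procedure: boolean rejections at FDR level q."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    passed = p[order] <= q * np.arange(1, m + 1) / m
    k = int(np.max(np.nonzero(passed)[0])) + 1 if passed.any() else 0
    out = np.zeros(m, bool)
    out[order[:k]] = True
    return out

def hierarchical(p_matrix, q=0.1):
    """Stage 1: BH on Simes-combined variant P-values (variant discovery).
    Stage 2: BH within each selected variant at level q * R / n_variants,
    to control the average proportion of falsely implicated phenotypes."""
    variant_p = np.array([simes(row) for row in p_matrix])
    selected = bh_reject(variant_p, q)
    R, n = int(selected.sum()), len(variant_p)
    inner = {int(i): bh_reject(p_matrix[i], q * R / n)
             for i in np.nonzero(selected)[0]}
    return selected, inner

# toy: 6 variants x 4 phenotypes; only variant 0 carries genuine signal
rng = np.random.default_rng(6)
P = rng.uniform(0.2, 1.0, size=(6, 4))
P[0] = [1e-5, 2e-4, 0.5, 0.9]
sel, inner = hierarchical(P)
print(sel, inner)
```

The two-stage structure is the point: variants are discovered first, and phenotype-level claims are only made within discovered variants, at a level shrunk by the selection step.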


Subjects
Genetic Association Studies, Arabidopsis/genetics, Arabidopsis/physiology, Computer Simulation, Flowers/physiology, Genome-Wide Association Study, Models, Genetic, Phenotype, Polymorphism, Single Nucleotide, Reproducibility of Results
20.
Biometrics; 73(2): 562-570, 2017 06.
Article in English | MEDLINE | ID: mdl-27599149

ABSTRACT

Standard false discovery rate (FDR) procedures can provide misleading inference when testing multiple null hypotheses with heterogeneous multinomial data. For example, in the motivating study the goal is to identify species of bacteria near the roots of wheat plants (rhizobacteria) that are moderately or strongly associated with productivity. However, standard procedures discover the most abundant species even when their association is weak and fail to discover many moderate and strong associations when the species are not abundant. This article provides a new FDR-controlling method based on a finite mixture of multinomial distributions and shows that it tends to discover more moderate and strong associations and fewer weak associations when the data are heterogeneous across tests. The new method is applied to the rhizobacteria data and performs favorably over competing methods.


Subjects
Models, Statistical