Bioinformatics ; 2019 Aug 08.
Article in English | MEDLINE | ID: mdl-31393554


MOTIVATION: Genome-wide association study (GWAS) analyses, at sufficient sample sizes and power, have successfully revealed biological insights for several complex traits. RICOPILI, an open-source Perl-based pipeline, was developed to address the challenges of rapidly processing large-scale multi-cohort GWAS, including quality control, imputation and downstream analyses. The pipeline is computationally efficient and portable to a wide range of high-performance computing (HPC) environments. SUMMARY: RICOPILI was created as the Psychiatric Genomics Consortium (PGC) pipeline for GWAS and has been adopted by other users. The pipeline features i) technical and genomic quality control in case-control and trio cohorts, ii) genome-wide phasing and imputation, iii) association analysis, iv) meta-analysis, v) polygenic risk scoring and vi) replication analysis. As a major differentiator from other GWAS pipelines, RICOPILI leverages automated parallelization and cluster job management for rapid production of imputed genome-wide data. A comprehensive meta-analysis of simulated GWAS data has been incorporated to demonstrate each step of the pipeline, including all the associated visualization plots, to ease data interpretation and manuscript preparation. Simulated GWAS datasets are also packaged with the pipeline for user training tutorials and developer work. AVAILABILITY AND IMPLEMENTATION: RICOPILI has a flexible architecture that allows for ongoing development and incorporation of newly available algorithms, and it is adaptable to various HPC environments (QSUB, BSUB, SLURM and others). Specific links for genomic resources are either directly provided in this paper or via tutorials and external links. The central location hosting scripts and tutorials is found at this URL: SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
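
The abstract does not say how RICOPILI parallelizes its work. One common way to spread imputation across cluster jobs is to cut each chromosome into fixed-size windows and submit one job per window. The following is a minimal Python sketch of that idea only; the names `Chunk` and `make_chunks`, the window size and the toy chromosome lengths are all hypothetical and are not taken from the pipeline.

```python
from typing import NamedTuple


class Chunk(NamedTuple):
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def make_chunks(chrom_lengths, chunk_size):
    """Yield fixed-size, non-overlapping windows covering every chromosome."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for chrom, length in chrom_lengths.items():
        for start in range(1, length + 1, chunk_size):
            # The last window of a chromosome is truncated at its end.
            yield Chunk(chrom, start, min(start + chunk_size - 1, length))


# Toy lengths for illustration, not real chromosome sizes.
toy_lengths = {"chr1": 25, "chr2": 10}
chunks = list(make_chunks(toy_lengths, chunk_size=10))
for c in chunks:
    # Each printed region could become the argument of one cluster job.
    print(f"{c.chrom}:{c.start}-{c.end}")
```

Each window is independent of the others, which is what would let a scheduler such as SLURM run them concurrently; how the regions are actually passed to jobs is left open here.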

Nat Genet ; 51(5): 793-803, 2019 05.
Article in English | MEDLINE | ID: mdl-31043756


Bipolar disorder is a highly heritable psychiatric disorder. We performed a genome-wide association study (GWAS) including 20,352 cases and 31,358 controls of European descent, with follow-up analysis of 822 variants with P < 1 × 10⁻⁴ in an additional 9,412 cases and 137,760 controls. Eight of the 19 variants that were genome-wide significant (P < 5 × 10⁻⁸) in the discovery GWAS were not genome-wide significant in the combined analysis, consistent with small effect sizes and limited power but also with genetic heterogeneity. In the combined analysis, 30 loci were genome-wide significant, including 20 newly identified loci. The significant loci contain genes encoding ion channels, neurotransmitter transporters and synaptic components. Pathway analysis revealed nine significantly enriched gene sets, including regulation of insulin secretion and endocannabinoid signaling. Bipolar I disorder is strongly genetically correlated with schizophrenia, driven by psychosis, whereas bipolar II disorder is more strongly correlated with major depressive disorder. These findings address key clinical questions and provide potential biological mechanisms for bipolar disorder.
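
To see how a variant can pass P < 5 × 10⁻⁸ in a discovery sample yet miss it in a combined analysis, the sketch below pools two effect estimates with an inverse-variance-weighted fixed-effects meta-analysis. This is a standard technique assumed here for illustration; the abstract does not state which method the study used, and the effect sizes and standard errors are invented, not taken from the paper.

```python
import math

GENOME_WIDE_P = 5e-8


def fixed_effects_meta(betas, ses):
    """Inverse-variance-weighted pooled effect, its SE, and a two-sided P value."""
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    z = beta / se
    # Two-sided tail probability of a standard normal at |z|.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return beta, se, p


# Hypothetical log-odds ratios and standard errors for one variant.
discovery = (0.06, 0.010)
follow_up = (0.01, 0.012)

_, _, p_discovery = fixed_effects_meta([discovery[0]], [discovery[1]])
beta_comb, se_comb, p_combined = fixed_effects_meta(
    [discovery[0], follow_up[0]], [discovery[1], follow_up[1]]
)

print(f"discovery P = {p_discovery:.2e}, significant: {p_discovery < GENOME_WIDE_P}")
print(f"combined  P = {p_combined:.2e}, significant: {p_combined < GENOME_WIDE_P}")
```

In this toy case the weaker follow-up estimate pulls the pooled effect down faster than the extra sample shrinks the standard error, so the combined P value rises above the threshold.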

Bipolar Disorder/genetics , Genetic Loci , Bipolar Disorder/classification , Case-Control Studies , Major Depressive Disorder/genetics , Female , Genetic Predisposition to Disease , Genome-Wide Association Study , Humans , Male , Single Nucleotide Polymorphism , Psychotic Disorders/genetics , Schizophrenia/genetics , Systems Biology
Nat Genet ; 50(4): 559-571, 2018 Apr.
Article in English | MEDLINE | ID: mdl-29632382


We aggregated coding variant data for 81,412 type 2 diabetes cases and 370,832 controls of diverse ancestry, identifying 40 coding variant association signals (P < 2.2 × 10⁻⁷); of these, 16 map outside known risk-associated loci. We make two important observations. First, only five of these signals are driven by low-frequency variants: even for these, effect sizes are modest (odds ratio ≤1.29). Second, when we used large-scale genome-wide association data to fine-map the associated variants in their regional context, accounting for the global enrichment of complex trait associations in coding sequence, compelling evidence for coding variant causality was obtained for only 16 signals. At 13 others, the associated coding variants clearly represent 'false leads' with potential to generate erroneous mechanistic inference. Coding variant associations offer a direct route to biological insight for complex diseases and identification of validated therapeutic targets; however, appropriate mechanistic inference requires careful specification of their causal contribution to disease predisposition.

J Clin Endocrinol Metab ; 100(1): E173-81, 2015 Jan.
Article in English | MEDLINE | ID: mdl-25361180


CONTEXT: T4-binding globulin (TBG), a protein secreted by the liver, is the main thyroid hormone (TH) transporter in human serum. TBG deficiency is characterized by reduced serum TH levels, but normal free TH and TSH and absent clinical manifestations. The inherited form of TBG deficiency is usually due to a mutation in the TBG gene located on the X-chromosome. OBJECTIVE: Among the 75 families with X-chromosome-linked TBG deficiency identified in our laboratory, no mutations in the TBG gene were found in four families. The aim of the study was to identify the mechanism of TBG deficiency in these four families using biochemical and genetic studies. DESIGN: Observational cohort, prospective. SETTING: University research center. PATIENTS: Four families with inherited TBG deficiency and no mutations in the TBG gene. INTERVENTION: Clinical evaluation, thyroid function tests, and targeted resequencing of 1 Mb of the X-chromosome. RESULTS: Next-generation sequencing identified a novel G to A variant 20 kb downstream of the TBG gene in all four families. In silico analysis predicted that the variant resides within a liver-specific enhancer. In vitro studies confirmed the enhancer activity of a 2.2-kb fragment of genomic DNA containing the novel variant and showed that the mutation reduces the activity of this enhancer. The affected subjects share a haplotype of 8 Mb surrounding the mutation, and the most recent common ancestor among the four families was estimated to be 19.5 generations ago (95% confidence intervals, 10.4-37). CONCLUSIONS: To our knowledge, the present study is the first report of an inherited endocrine disorder caused by a mutation in an enhancer region.

Genetic Enhancer Elements , Liver/metabolism , Mutation , Thyroxine-Binding Globulin/genetics , Adolescent , Adult , Child , Female , Haplotypes , Humans , Male , Middle Aged , Pedigree , Thyroid Diseases/genetics , Thyroid Diseases/metabolism , Thyroxine-Binding Globulin/metabolism , Young Adult
Bioinformatics ; 31(2): 187-93, 2015 Jan 15.
Article in English | MEDLINE | ID: mdl-25270638


MOTIVATION: The development of cost-effective next-generation sequencing methods has spurred the development of high-throughput bioinformatics tools for detection of sequence variation. With many disparate variant-calling algorithms available, investigators must ask, 'Which method is best for my data?' Machine learning research has shown that so-called ensemble methods that combine the output of multiple models can dramatically improve classifier performance. Here we describe a novel variant-calling approach based on an ensemble of variant-calling algorithms, which we term the Consensus Genotyper for Exome Sequencing (CGES). CGES uses a two-stage voting scheme among four algorithm implementations. While our ensemble method can accept variants generated by any variant-calling algorithm, we used GATK2.8, SAMtools, FreeBayes and Atlas-SNP2 in building CGES because of their performance, widespread adoption and diverse but complementary algorithms. RESULTS: We apply CGES to 132 samples sequenced at the HudsonAlpha Institute for Biotechnology (HAIB, Huntsville, AL) using the Nimblegen Exome Capture and Illumina sequencing technology. Our sample set consisted of 40 complete trios, two families of four, one parent-child duo and two unrelated individuals. CGES yielded the fewest total variant calls (N(CGES) = 139,897), the highest Ts/Tv ratio (3.02), the lowest Mendelian error rate across all genotypes (0.028%), the highest rediscovery rate from the Exome Variant Server (EVS; 89.3%) and 1000 Genomes (1KG; 84.1%) and the highest positive predictive value (PPV; 96.1%) for a random sample of previously validated de novo variants.
We describe these and other quality control (QC) metrics from consensus data and explain how the CGES pipeline can be used to generate call sets of varying quality stringency, including consensus calls present across all four algorithms, calls that are consistent across any three out of four algorithms, calls that are consistent across any two out of four algorithms or a more liberal set of all calls made by any algorithm. AVAILABILITY AND IMPLEMENTATION: To enable accessible, efficient and reproducible analysis, we implement CGES both as a stand-alone command line tool available for download in GitHub and as a set of Galaxy tools and workflows configured to execute on parallel computers. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
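
The call sets of varying stringency described above (all four algorithms, any three, any two, or any one) can be pictured as a simple k-of-n vote over variant sites. The Python sketch below illustrates only that idea; it is not the two-stage voting scheme of CGES, whose details the abstract does not give. The function name `consensus_calls`, the caller labels and the toy variants are hypothetical.

```python
from collections import Counter


def consensus_calls(call_sets, min_callers):
    """Return the variants reported by at least `min_callers` of the callers."""
    counts = Counter()
    for calls in call_sets.values():
        # set() guards against a caller listing the same site twice.
        counts.update(set(calls))
    return {variant for variant, n in counts.items() if n >= min_callers}


# Toy variants as (chromosome, position, ref, alt); not real data.
a = ("1", 100, "A", "G")
b = ("1", 200, "C", "T")
c = ("2", 50, "G", "A")
d = ("2", 75, "T", "C")
e = ("3", 10, "A", "T")

call_sets = {
    "gatk": {a, b, c},
    "samtools": {a, b, d},
    "freebayes": {a, c, d, e},
    "atlas": {a, b, c},
}

# From the strictest set (all four agree) to the most liberal (any caller).
sizes = {k: len(consensus_calls(call_sets, k)) for k in (4, 3, 2, 1)}
print(sizes)
```

Raising `min_callers` trades sensitivity for specificity: the stricter sets are nested inside the more liberal ones.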

Algorithms , Autistic Disorder/genetics , Exome/genetics , High-Throughput Nucleotide Sequencing/methods , Single Nucleotide Polymorphism/genetics , Software , Consensus Sequence , Statistical Data Interpretation , Genetic Testing , Genotype , Humans
Genet Epidemiol ; 38(5): 402-15, 2014 Jul.
Article in English | MEDLINE | ID: mdl-24799323


High-confidence prediction of complex traits such as disease risk or drug response is an ultimate goal of personalized medicine. Although genome-wide association studies have discovered thousands of well-replicated polymorphisms associated with a broad spectrum of complex traits, the combined predictive power of these associations for any given trait is generally too low to be of clinical relevance. We propose a novel systems approach to complex trait prediction, which leverages and integrates similarity in genetic, transcriptomic, or other omics-level data. We translate the omic similarity into phenotypic similarity using a method called Kriging, commonly used in geostatistics and machine learning. Our method called OmicKriging emphasizes the use of a wide variety of systems-level data, such as those increasingly made available by comprehensive surveys of the genome, transcriptome, and epigenome, for complex trait prediction. Furthermore, our OmicKriging framework allows easy integration of prior information on the function of subsets of omics-level data from heterogeneous sources without the sometimes heavy computational burden of Bayesian approaches. Using seven disease datasets from the Wellcome Trust Case Control Consortium (WTCCC), we show that OmicKriging allows simple integration of sparse and highly polygenic components yielding comparable performance at a fraction of the computing time of a recently published Bayesian sparse linear mixed model method. Using a cellular growth phenotype, we show that integrating mRNA and microRNA expression data substantially increases performance over either dataset alone. Using clinical statin response, we show improved prediction over existing methods. We provide an R package to implement OmicKriging (

Computational Biology/methods , Genetic Predisposition to Disease/genetics , Multifactorial Inheritance/genetics , Bayes Theorem , Case-Control Studies , Cell Growth Processes/genetics , LDL Cholesterol/blood , Humans , MicroRNAs/genetics , Genetic Models , Phenotype , Messenger RNA/genetics , Simvastatin/pharmacology , Software , Systems Biology/methods , Time Factors
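
The OmicKriging abstract above describes translating omic similarity into phenotypic similarity by Kriging. A minimal pure-Python sketch of a kriging-style predictor follows, assuming a weighted sum of similarity matrices and a small diagonal "nugget" for numerical stability. The function names, weights and toy matrices are hypothetical, and this is not the interface of the authors' R package.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def combine(matrices, weights):
    """Weighted sum of equally sized similarity matrices."""
    n = len(matrices[0])
    return [[sum(w * mat[i][j] for mat, w in zip(matrices, weights))
             for j in range(n)] for i in range(n)]


def krige(sim, y_train, train_idx, test_idx, nugget=1e-6):
    """Predict test phenotypes from training phenotypes and a similarity matrix."""
    mean = sum(y_train) / len(y_train)
    k_tt = [[sim[i][j] + (nugget if i == j else 0.0) for j in train_idx]
            for i in train_idx]
    alpha = solve(k_tt, [y - mean for y in y_train])
    return [mean + sum(sim[t][j] * a for j, a in zip(train_idx, alpha))
            for t in test_idx]


# Toy similarity matrices for three samples; sample 2 mirrors sample 0.
genetic = [[1.0, 0.2, 1.0], [0.2, 1.0, 0.2], [1.0, 0.2, 1.0]]
expression = [[1.0, 0.4, 1.0], [0.4, 1.0, 0.4], [1.0, 0.4, 1.0]]
sim = combine([genetic, expression], weights=[0.5, 0.5])

pred = krige(sim, y_train=[2.0, 0.0], train_idx=[0, 1], test_idx=[2])
print(pred)
```

Because the held-out sample is as similar to training sample 0 as that sample is to itself, its prediction lands essentially on that sample's phenotype; less similar samples would be pulled toward the training mean.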