ABSTRACT
We improve the efficiency of population genetic file formats and GWAS computation by leveraging the distribution of samples in population-level genetic data. We identify conditional exchangeability in these data and recommend finite state entropy algorithms as an arithmetic code naturally suited to compressing population genetic data. We show between [Formula: see text] and [Formula: see text] speed and size improvements in computation and decompression tasks over modern dictionary compression methods often used for population genetic data, such as Zstd and Zlib. We provide open-source prototype software for multi-phenotype GWAS with finite state entropy compression, demonstrating significant space savings and speed comparable to the state of the art.
Subjects
Data Compression , Algorithms , Entropy , Genetics, Population , Genome-Wide Association Study , Software
ABSTRACT
UK Biobank is a major prospective epidemiological study, including multimodal brain imaging, genetics and ongoing health outcomes. Previously, we published genome-wide associations of 3,144 brain imaging-derived phenotypes, with a discovery sample of 8,428 individuals. Here we present a new open resource of genome-wide association study summary statistics, using the 2020 data release, almost tripling the discovery sample size. We now include the X chromosome and new classes of imaging-derived phenotypes (subcortical volumes and tissue contrast). Previously, we found 148 replicated clusters of associations between genetic variants and imaging phenotypes; in this study, we found 692, including 12 on the X chromosome. We describe some of the newly found associations, focusing on the X chromosome and autosomal associations involving the new classes of imaging-derived phenotypes. Our novel associations implicate, for example, pathways involved in the rare X-linked STAR (syndactyly, telecanthus and anogenital and renal malformations) syndrome, Alzheimer's disease and mitochondrial disorders.