ABSTRACT
Structural variants are implicated in numerous diseases and make up the majority of varying nucleotides among human genomes. Here we describe an integrated set of eight structural variant classes comprising both balanced and unbalanced variants, which we constructed using short-read DNA sequencing data and statistically phased onto haplotype blocks in 26 human populations. Analysing this set, we identify numerous gene-intersecting structural variants exhibiting population stratification and describe naturally occurring homozygous gene knockouts that suggest the dispensability of a variety of human genes. We demonstrate that structural variants are enriched on haplotypes identified by genome-wide association studies and exhibit enrichment for expression quantitative trait loci. Additionally, we uncover appreciable levels of structural variant complexity at different scales, including genic loci subject to clusters of repeated rearrangement and complex structural variants with multiple breakpoints likely to have formed through individual mutational events. Our catalogue will enhance future studies into structural variant demography, functional impact and disease association.
Subject(s)
Genetic Variation/genetics , Genome, Human/genetics , Physical Chromosome Mapping , Amino Acid Sequence , Genetic Predisposition to Disease , Genetics, Medical , Genetics, Population , Genome-Wide Association Study , Genomics , Genotype , Haplotypes/genetics , Homozygote , Humans , Molecular Sequence Data , Mutation Rate , Polymorphism, Single Nucleotide/genetics , Quantitative Trait Loci/genetics , Sequence Analysis, DNA , Sequence Deletion/genetics
ABSTRACT
3% of the population develops saccular intracranial aneurysms (sIAs), a complex trait, with a sporadic and a familial form. Subarachnoid hemorrhage from sIA (sIA-SAH) is a devastating form of stroke. Certain rare genetic variants are enriched in the Finns, a population isolate with a small founder population and bottleneck events. As the sIA-SAH incidence in Finland is >2× increased, such variants may associate with sIA in the Finnish population. We tested 9.4 million variants for association in 760 Finnish sIA patients (enriched for familial sIA), and in 2,513 matched controls with case-control status and with the number of sIAs. The most promising loci (p<5E-6) were replicated in 858 Finnish sIA patients and 4,048 controls. The frequencies and effect sizes of the replicated variants were compared to a continental European population using 717 Dutch cases and 3,004 controls. We discovered four new high-risk loci with low frequency lead variants. Three were associated with the case-control status: 2q23.3 (MAF 2.1%, OR 1.89, p 1.42×10-9); 5q31.3 (MAF 2.7%, OR 1.66, p 3.17×10-8); 6q24.2 (MAF 2.6%, OR 1.87, p 1.87×10-11) and one with the number of sIAs: 7p22.1 (MAF 3.3%, RR 1.59, p 6.08×10-9). Two of the associations (5q31.3, 6q24.2) replicated in the Dutch sample. The 7p22.1 locus was strongly differentiated; the lead variant was more frequent in Finland (4.6%) than in the Netherlands (0.3%). Additionally, we replicated a previously inconclusive locus on 2q33.1 in all samples tested (OR 1.27, p 1.87×10-12). The five loci explain 2.1% of the sIA heritability in Finland, and may relate to, but not explain, the increased incidence of sIA-SAH in Finland. This study illustrates the utility of population isolates, familial enrichment, dense genotype imputation and alternate phenotyping in the search for variants associated with complex diseases.
Subject(s)
Genome-Wide Association Study , Intracranial Aneurysm/genetics , Stroke/genetics , Subarachnoid Hemorrhage/genetics , Chromosomes, Human, Pair 2/genetics , Europe , Finland , Gene Frequency , Genetic Predisposition to Disease , Genetic Variation , Genetics, Population , Humans , Intracranial Aneurysm/pathology , Risk Factors , Stroke/pathology , Subarachnoid Hemorrhage/pathology
ABSTRACT
MOTIVATION: Given the current costs of next-generation sequencing, large studies carry out low-coverage sequencing followed by application of methods that leverage linkage disequilibrium to infer genotypes. We propose a novel method that assumes study samples are sequenced at low coverage and genotyped on a genome-wide microarray, as in the 1000 Genomes Project (1KGP). We assume polymorphic sites have been detected from the sequencing data and that genotype likelihoods are available at these sites. We also assume that the microarray genotypes have been phased to construct a haplotype scaffold. We then phase each polymorphic site using an MCMC algorithm that iteratively updates the unobserved alleles based on the genotype likelihoods at that site and local haplotype information. We use a multivariate normal model to capture both allele frequency and linkage disequilibrium information around each site. When sequencing data are available from trios, Mendelian transmission constraints are easily accommodated into the updates. The method is highly parallelizable, as it analyses one position at a time. RESULTS: We illustrate the performance of the method compared with other methods using data from Phase 1 of the 1KGP in terms of genotype accuracy, phasing accuracy and downstream imputation performance. We show that the haplotype panel we infer in African samples, which was based on a trio-phased scaffold, increases downstream imputation accuracy for rare variants (R2 increases by >0.05 for minor allele frequency <1%), and this will translate into a boost in power to detect associations. These results highlight the value of incorporating microarray genotypes when calling variants from next-generation sequence data. AVAILABILITY: The method (called MVNcall) is implemented in a C++ program and is available from http://www.stats.ox.ac.uk/~marchini/#software.
Subject(s)
Genotyping Techniques , Haplotypes , High-Throughput Nucleotide Sequencing , Polymorphism, Single Nucleotide , Algorithms , Alleles , Chromosomes, Human, Pair 20 , Gene Frequency , Genotype , Humans , Linkage Disequilibrium
ABSTRACT
To elucidate the genetic architecture of amyotrophic lateral sclerosis (ALS) and find associated loci, we assembled a custom imputation reference panel from whole-genome-sequenced patients with ALS and matched controls (n = 1,861). Through imputation and mixed-model association analysis in 12,577 cases and 23,475 controls, combined with 2,579 cases and 2,767 controls in an independent replication cohort, we fine-mapped a new risk locus on chromosome 21 and identified C21orf2 as a gene associated with ALS risk. In addition, we identified MOBP and SCFD1 as new associated risk loci. We established evidence of ALS being a complex genetic trait with a polygenic architecture. Furthermore, we estimated the SNP-based heritability at 8.5%, with a distinct and important role for low-frequency variants (frequency 1-10%). This study motivates the interrogation of larger samples with full genome coverage to identify rare causal variants that underpin ALS risk.
Subject(s)
Amyotrophic Lateral Sclerosis/genetics , Genetic Predisposition to Disease , Munc18 Proteins/genetics , Mutation/genetics , Myelin Proteins/genetics , Proteins/genetics , Amyotrophic Lateral Sclerosis/epidemiology , Case-Control Studies , Cohort Studies , Cytoskeletal Proteins , Genome-Wide Association Study , Humans , Netherlands/epidemiology
ABSTRACT
Mutations create variation in the population, fuel evolution and cause genetic diseases. Current knowledge about de novo mutations is incomplete and mostly indirect. Here we analyze 11,020 de novo mutations from the whole genomes of 250 families. We show that de novo mutations in the offspring of older fathers are not only more numerous but also occur more frequently in early-replicating, genic regions. Functional regions exhibit higher mutation rates due to CpG dinucleotides and show signatures of transcription-coupled repair, whereas mutation clusters with a unique signature point to a new mutational mechanism. Mutation and recombination rates independently associate with nucleotide diversity, and regional variation in human-chimpanzee divergence is only partly explained by heterogeneity in mutation rate. Finally, we provide a genome-wide mutation rate map for medical and population genetics applications. Our results provide new insights and refine long-standing hypotheses about human mutagenesis.
Subject(s)
Germ-Line Mutation , Animals , Evolution, Molecular , Female , Genome, Human , Humans , Male , Models, Genetic , Mutation Rate , Pan troglodytes/genetics , Paternal Age
ABSTRACT
Alopecia areata (AA) is a prevalent autoimmune disease with 10 known susceptibility loci. Here we perform the first meta-analysis of research on AA by combining data from two genome-wide association studies (GWAS), and replication with supplemented ImmunoChip data for a total of 3,253 cases and 7,543 controls. The strongest region of association is the major histocompatibility complex, where we fine-map four independent effects, all implicating human leukocyte antigen-DR as a key aetiologic driver. Outside the major histocompatibility complex, we identify two novel loci that exceed the threshold of statistical significance, containing ACOXL/BCL2L11(BIM) (2q13); GARP (LRRC32) (11q13.5), as well as a third nominally significant region SH2B3(LNK)/ATXN2 (12q24.12). Candidate susceptibility gene expression analysis in these regions demonstrates expression in relevant immune cells and the hair follicle. We integrate our results with data from seven other autoimmune diseases and provide insight into the alignment of AA within these disorders. Our findings uncover new molecular pathways disrupted in AA, including autophagy/apoptosis, transforming growth factor beta/Tregs and JAK kinase signalling, and support the causal role of aberrant immune processes in AA.
Subject(s)
Alopecia Areata/genetics , Apoptosis Regulatory Proteins/genetics , Ataxin-2/genetics , Genetic Predisposition to Disease , HLA Antigens/genetics , Membrane Proteins/genetics , Polymorphism, Single Nucleotide , Proteins/genetics , Proto-Oncogene Proteins/genetics , Adaptor Proteins, Signal Transducing , Alleles , Animals , Bcl-2-Like Protein 11 , Case-Control Studies , Female , Genome-Wide Association Study , Humans , Intracellular Signaling Peptides and Proteins , Male , Mice , Microscopy, Fluorescence , Oligonucleotide Array Sequence Analysis , Phenotype , Principal Component Analysis , Protein Conformation , Skin/metabolism
ABSTRACT
Variants associated with blood lipid levels may be population-specific. To identify low-frequency variants associated with this phenotype, population-specific reference panels may be used. Here we impute nine large Dutch biobanks (~35,000 samples) with the population-specific reference panel created by the Genome of The Netherlands Project and perform association testing with blood lipid levels. We report the discovery of five novel associations at four loci (P value <6.61 × 10(-4)), including a rare missense variant in ABCA6 (rs77542162, p.Cys1359Arg, frequency 0.034), which is predicted to be deleterious. The frequency of this ABCA6 variant is 3.65-fold increased in the Dutch and its effect (βLDL-C=0.135, βTC=0.140) is estimated to be very similar to those observed for single variants in well-known lipid genes, such as LDLR.
Subject(s)
ATP-Binding Cassette Transporters/genetics , Cholesterol/blood , Mutation, Missense/genetics , Gene Frequency , Genetic Association Studies , Humans , Netherlands
ABSTRACT
Since the completion of the Human Genome Project, the field of human genetics has been in great flux, largely due to technological advances in studying DNA sequence variation. Although community-wide adoption of statistical standards was key to the success of genome-wide association studies, similar standards have not yet been globally applied to the processing and interpretation of sequencing data. It has proven particularly challenging to pinpoint disease variants unequivocally in sequencing studies of polygenic traits. Here, we comment on a number of factors that may contribute to irreproducible claims of association in the scientific literature and discuss possible steps that we can take towards cultural change.
ABSTRACT
Although genome-wide association studies (GWAS) have identified many common variants associated with complex traits, low-frequency and rare variants have not been interrogated in a comprehensive manner. Imputation from dense reference panels, such as the 1000 Genomes Project (1000G), enables testing of ungenotyped variants for association. Here we present the results of imputation using a large, new population-specific panel: the Genome of The Netherlands (GoNL). We benchmarked the performance of the 1000G and GoNL reference sets by comparing imputation genotypes with 'true' genotypes typed on ImmunoChip in three European populations (Dutch, British, and Italian). GoNL showed significant improvement in the imputation quality for rare variants (MAF 0.05-0.5%) compared with 1000G. In Dutch samples, the mean observed Pearson correlation, r(2), increased from 0.61 to 0.71. We also saw improved imputation accuracy for other European populations (in the British samples, r(2) improved from 0.58 to 0.65, and in the Italians from 0.43 to 0.47). A combined reference set comprising 1000G and GoNL improved the imputation of rare variants even further. The Italian samples benefitted the most from this combined reference (the mean r(2) increased from 0.47 to 0.50). We conclude that the creation of a large population-specific reference is advantageous for imputing rare variants and that a combined reference panel across multiple populations yields the best imputation results.
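The r(2) values quoted above can be illustrated with a minimal sketch, assuming the metric is the squared Pearson correlation between imputed allele dosages and array-typed genotypes at a single variant. The function name `imputation_r2` and the toy values are hypothetical and are not taken from the study.

```python
def imputation_r2(imputed_dosages, true_genotypes):
    """Squared Pearson correlation between imputed dosages (0.0-2.0)
    and typed genotypes (0, 1 or 2) at one variant.

    Hypothetical helper for illustration; not the study's own pipeline.
    """
    n = len(imputed_dosages)
    if n != len(true_genotypes) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 samples")

    mean_x = sum(imputed_dosages) / n
    mean_y = sum(true_genotypes) / n

    # Sums of centred products; the 1/n factors cancel in the ratio below.
    cov = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(imputed_dosages, true_genotypes))
    var_x = sum((x - mean_x) ** 2 for x in imputed_dosages)
    var_y = sum((y - mean_y) ** 2 for y in true_genotypes)

    # A monomorphic variant has zero variance, so r2 is undefined.
    if var_x == 0 or var_y == 0:
        return float("nan")
    return (cov * cov) / (var_x * var_y)


# Toy example: six samples at one rare variant (made-up numbers).
dosages = [0.05, 0.10, 0.90, 0.00, 0.20, 1.10]
typed = [0, 0, 1, 0, 0, 1]
print(round(imputation_r2(dosages, typed), 3))
```

In a benchmark of the kind described, such a per-variant value would be averaged over all variants in a frequency bin (for example MAF 0.05-0.5%) to give the mean r(2) per reference panel.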
Subject(s)
Gene Frequency , Genome, Human , Genome-Wide Association Study , Polymorphism, Single Nucleotide , White People/genetics , Case-Control Studies , Cluster Analysis , Denmark , Genotype , Genotyping Techniques , Humans , Italy , Netherlands , Phenotype , Principal Component Analysis , United Kingdom
ABSTRACT
Genomic rearrangements are a common cause of human congenital abnormalities. However, their origin and consequences are poorly understood. We performed molecular analysis of two patients with congenital disease who carried de novo genomic rearrangements. We found that the rearrangements in both patients hit genes that are recurrently rearranged in cancer (ETV1, FOXP1, and microRNA cluster C19MC) and drive formation of fusion genes similar to those described in cancer. Subsequent analysis of a large set of 552 de novo germline genomic rearrangements underlying congenital disorders revealed enrichment for genes rearranged in cancer and overlap with somatic cancer breakpoints. Breakpoints of common (inherited) germline structural variations also overlap with cancer breakpoints but are depleted for cancer genes. We propose that the same genomic positions are prone to genomic rearrangements in germline and soma but that the timing and context of breakage determine whether developmental defects or cancer are promoted.
Subject(s)
Chromosome Aberrations , Chromosomes, Human/genetics , Congenital Abnormalities/genetics , Gene Rearrangement , Genome, Human , Germ-Line Mutation , Animals , Chromosome Breakpoints , DNA-Binding Proteins/genetics , Forkhead Transcription Factors/genetics , HEK293 Cells , Humans , MicroRNAs/genetics , Repressor Proteins/genetics , Transcription Factors/genetics , Zebrafish
ABSTRACT
Within the Netherlands a national network of biobanks has been established (Biobanking and Biomolecular Research Infrastructure-Netherlands (BBMRI-NL)) as a national node of the European BBMRI. One of the aims of BBMRI-NL is to enrich biobanks with different types of molecular and phenotype data. Here, we describe the Genome of the Netherlands (GoNL), one of the projects within BBMRI-NL. GoNL is a whole-genome-sequencing project in a representative sample consisting of 250 trio-families from all provinces in the Netherlands, which aims to characterize DNA sequence variation in the Dutch population. The parent-offspring trios include adult individuals ranging in age from 19 to 87 years (mean=53 years; SD=16 years) from birth cohorts 1910-1994. Sequencing was done on blood-derived DNA from uncultured cells and accomplished coverage was 14-15x. The family-based design represents a unique resource to assess the frequency of regional variants, accurately reconstruct haplotypes by family-based phasing, characterize short indels and complex structural variants, and establish the rate of de novo mutational events. GoNL will also serve as a reference panel for imputation in the available genome-wide association studies in Dutch and other cohorts to refine association signals and uncover population-specific variants. GoNL will create a catalog of human genetic variation in this sample that is uniquely characterized with respect to micro-geographic location and a wide range of phenotypes. The resource will be made available to the research and medical community to guide the interpretation of sequencing projects. The present paper summarizes the global characteristics of the project.