Results 1 - 20 of 103
1.
bioRxiv ; 2024 May 03.
Article in English | MEDLINE | ID: mdl-38645134

ABSTRACT

Missense variants can have a range of functional impacts depending on factors such as the specific amino acid substitution and location within the gene. To interpret their deleteriousness, studies have sought to identify regions within genes that are specifically intolerant of missense variation [1-12]. Here, we leverage the patterns of rare missense variation in 125,748 individuals in the Genome Aggregation Database (gnomAD) [13] against a null mutational model to identify transcripts that display regional differences in missense constraint. Missense-depleted regions are enriched for ClinVar [14] pathogenic variants, de novo missense variants from individuals with neurodevelopmental disorders (NDDs) [15,16], and complex trait heritability. Following ClinGen calibration recommendations for the ACMG/AMP guidelines, we establish that regions with less than 20% of their expected missense variation achieve moderate support for pathogenicity. We create a missense deleteriousness metric (MPC) that incorporates regional constraint and outperforms other deleteriousness scores at stratifying case and control de novo missense variation, with a strong enrichment in NDDs. These results provide additional tools to aid in missense variant interpretation.
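The 20% threshold mentioned in this abstract can be illustrated with a minimal sketch. It assumes that observed and expected missense counts per region are already available; the `Region` type, the region names, and the counts are hypothetical and are not taken from the paper. The authors' actual method fits a null mutational model and tests for regional differences, which this sketch does not attempt.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str        # hypothetical region label
    observed: int    # observed rare missense variants in the region
    expected: float  # expected count under a null mutational model


def obs_exp_ratio(region: Region) -> float:
    """Return the observed/expected missense ratio for one region."""
    if region.expected <= 0:
        raise ValueError("expected count must be positive")
    return region.observed / region.expected


def strongly_depleted(regions, threshold=0.2):
    """Names of regions with less than `threshold` of expected variation."""
    return [r.name for r in regions if obs_exp_ratio(r) < threshold]


# Illustrative numbers only (assumptions, not data from the study).
regions = [
    Region("region_A", observed=3, expected=25.0),   # ratio 0.12
    Region("region_B", observed=18, expected=20.0),  # ratio 0.90
    Region("region_C", observed=5, expected=25.0),   # ratio 0.20, not below
]
print(strongly_depleted(regions))
```

The comparison is strict, so a region at exactly 20% of its expected variation is not flagged, matching the "less than 20%" wording of the abstract.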

2.
Nat Commun ; 15(1): 1881, 2024 Feb 29.
Article in English | MEDLINE | ID: mdl-38424437

ABSTRACT

Germline pathogenic variants associated with increased childhood mortality must be subject to natural selection. Here, we analyze publicly available germline genetic metadata from 4,574 children with cancer [11 studies; 1,083 whole exome sequences (WES), 1,950 whole genome sequences (WGS), and 1,541 gene panel] and 141,456 adults [125,748 WES and 15,708 WGS]. We find that pediatric cancer predisposition syndrome (pCPS) genes [n = 85] are highly constrained, harboring only a quarter of the loss-of-function variants that would be expected. This strong indication of selective pressure on pCPS genes is found across multiple lines of germline genomics data from both pediatric and adult cohorts. For six genes [ELP1, GPR161, VHL and SDHA/B/C], a clear lack of mutational constraint calls the pediatric penetrance and/or severity of associated cancers into question. Conversely, out of 23 known pCPS genes associated with biallelic risk, two [9%, DIS3L2 and MSH2] show significant constraint, indicating that they may monoallelically increase childhood cancer risk. In summary, we show that population genetic data provide empirical evidence that heritable childhood cancer leads to natural selection powerful enough to have significantly impacted the present-day gene pool.


Subject(s)
Neoplasms , Adult , Humans , Child , Neoplasms/genetics , Genetic Predisposition to Disease , Gene Pool , Mutation , Germ-Line Mutation
5.
bioRxiv ; 2024 Jan 10.
Article in English | MEDLINE | ID: mdl-38260295

ABSTRACT

The Variant Call Format (VCF) is widely used in genome sequencing but scales poorly. For instance, we estimate a 150,000 genome VCF would occupy 900 TiB, making it both costly and complicated to produce and analyze. The issue stems from VCF's requirement to densely represent both reference-genotypes and allele-indexed arrays. These requirements lead to unnecessary data duplication and, ultimately, very large files. To address these challenges, we introduce the Scalable Variant Call Representation (SVCR). This representation reduces file sizes by ensuring they scale linearly with samples. SVCR achieves this by adopting reference blocks from the Genomic Variant Call Format (GVCF) and employing local allele indices. SVCR is also lossless and mergeable, allowing for N+1 and N+K incremental joint-calling. We present two implementations of SVCR: SVCR-VCF, which encodes SVCR in VCF format, and VDS, which uses Hail's native format. Our experiments confirm the linear scalability of SVCR-VCF and VDS, in contrast to the super-linear growth seen with standard VCF files. We also discuss the VDS Combiner, a scalable, open-source tool for producing a VDS from GVCFs and unique features of VDS which enable rapid data analysis. SVCR, and VDS in particular, ensure the scientific community can generate, analyze, and disseminate genetics datasets with millions of samples.
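The idea of local allele indices can be sketched as follows. This is an illustrative assumption about how such an encoding might look, not the SVCR specification: the function names are hypothetical, and only the genotype and allelic-depth fields are handled. Each sample stores the subset of site alleles it actually carries (here called `la`), with its genotype and depths indexed against that short list instead of the full, possibly very long, list of alleles at the site.

```python
def to_local(gt, ad):
    """Re-index a call against only the alleles the sample carries.

    gt: tuple of global allele indices, e.g. (0, 3)
    ad: read depth per global allele, indexed like the site's allele list
    Returns (la, lgt, lad): local allele list, local genotype, local depths.
    """
    # The reference allele (index 0) is always kept as local allele 0.
    la = [0] + sorted({a for a in gt if a != 0})
    remap = {global_idx: local_idx for local_idx, global_idx in enumerate(la)}
    lgt = tuple(remap[a] for a in gt)
    lad = [ad[g] for g in la]
    return la, lgt, lad


def to_global(la, lgt, lad, n_alleles):
    """Expand a local call back to global indexing (missing depths become 0)."""
    gt = tuple(la[i] for i in lgt)
    ad = [0] * n_alleles
    for local_idx, global_idx in enumerate(la):
        ad[global_idx] = lad[local_idx]
    return gt, ad


# A site with 5 alleles; the sample is heterozygous for allele 3.
gt, ad = (0, 3), [12, 0, 0, 9, 0]
la, lgt, lad = to_local(gt, ad)
print(la, lgt, lad)
print(to_global(la, lgt, lad, n_alleles=5))
```

In this sketch the round trip is exact only because the sample has zero depth at alleles it does not carry; the per-sample storage depends on the number of alleles the sample carries, not on the number of alleles at the site.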

6.
bioRxiv ; 2024 Feb 28.
Article in English | MEDLINE | ID: mdl-36747613

ABSTRACT

Underrepresented populations are often excluded from genomic studies due in part to a lack of resources supporting their analyses. The 1000 Genomes Project (1kGP) and Human Genome Diversity Project (HGDP), which have recently been sequenced to high coverage, are valuable genomic resources because of the global diversity they capture and their open data sharing policies. Here, we harmonized a high quality set of 4,094 whole genomes from HGDP and 1kGP with data from the Genome Aggregation Database (gnomAD) and identified over 153 million high-quality SNVs, indels, and SVs. We performed a detailed ancestry analysis of this cohort, characterizing population structure and patterns of admixture across populations, analyzing site frequency spectra, and measuring variant counts at global and subcontinental levels. We also demonstrate substantial added value from this dataset compared to the prior versions of the component resources, typically combined via liftover and variant intersection; for example, we catalog millions of new genetic variants, mostly rare, compared to previous releases. In addition to unrestricted individual-level public release, we provide detailed tutorials for conducting many of the most common quality control steps and analyses with these data in a scalable cloud-computing environment and publicly release this new phased joint callset for use as a haplotype resource in phasing and imputation pipelines. This jointly called reference panel will serve as a key resource to support research of diverse ancestry populations.

7.
Nat Genet ; 56(1): 152-161, 2024 Jan.
Article in English | MEDLINE | ID: mdl-38057443

ABSTRACT

Recessive diseases arise when both copies of a gene are impacted by a damaging genetic variant. When a patient carries two potentially causal variants in a gene, accurate diagnosis requires determining that these variants occur on different copies of the chromosome (that is, are in trans) rather than on the same copy (that is, in cis). However, current approaches for determining phase, beyond parental testing, are limited in clinical settings. Here we developed a strategy for inferring phase for rare variant pairs within genes, leveraging genotypes observed in the Genome Aggregation Database (v2, n = 125,748 exomes). Our approach estimates phase with 96% accuracy, both in trio data and in patients with Mendelian conditions and presumed causal compound heterozygous variants. We provide a public resource of phasing estimates for coding variants and counts per gene of rare variants in trans that can aid interpretation of rare co-occurring variants in the context of recessive disease.
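One classical way to infer phase from population genotype counts is to estimate haplotype frequencies with an expectation-maximization (EM) procedure and then compare the two haplotype configurations consistent with a double heterozygote. The sketch below follows that textbook approach under stated assumptions; it is not claimed to reproduce the authors' exact procedure, and the function names and example counts are hypothetical.

```python
def estimate_haplotype_freqs(n, iterations=100):
    """EM estimate of two-variant haplotype frequencies.

    n[i][j] = number of individuals with i alt alleles at variant 1
    and j alt alleles at variant 2 (a 3x3 table of genotype counts).
    Haplotypes: AB (ref, ref), Ab (ref, alt), aB (alt, ref), ab (alt, alt).
    """
    total_haps = 2 * sum(sum(row) for row in n)
    if total_haps == 0:
        raise ValueError("no individuals in genotype table")
    # Haplotype counts from every genotype class except double heterozygotes,
    # whose phase is ambiguous.
    base = {
        "AB": 2 * n[0][0] + n[0][1] + n[1][0],
        "Ab": 2 * n[0][2] + n[0][1] + n[1][2],
        "aB": 2 * n[2][0] + n[1][0] + n[2][1],
        "ab": 2 * n[2][2] + n[1][2] + n[2][1],
    }
    p = {h: 0.25 for h in base}
    for _ in range(iterations):
        # E-step: split double heterozygotes between cis (AB/ab) and trans (Ab/aB).
        cis, trans = p["AB"] * p["ab"], p["Ab"] * p["aB"]
        w_cis = cis / (cis + trans) if cis + trans > 0 else 0.5
        counts = dict(base)
        counts["AB"] += n[1][1] * w_cis
        counts["ab"] += n[1][1] * w_cis
        counts["Ab"] += n[1][1] * (1 - w_cis)
        counts["aB"] += n[1][1] * (1 - w_cis)
        # M-step: renormalize to frequencies.
        p = {h: c / total_haps for h, c in counts.items()}
    return p


def prob_trans(p):
    """Probability that a double heterozygote carries the variants in trans."""
    cis, trans = p["AB"] * p["ab"], p["Ab"] * p["aB"]
    if cis + trans == 0:
        return float("nan")
    return trans / (cis + trans)


# Hypothetical counts: the two variants never occur in the same individual.
never_together = [[1000, 10, 0], [10, 0, 0], [0, 0, 0]]
# Hypothetical counts: the two variants always occur together.
always_together = [[1000, 0, 0], [0, 10, 0], [0, 0, 0]]
print(prob_trans(estimate_haplotype_freqs(never_together)))
print(prob_trans(estimate_haplotype_freqs(always_together)))
```

Variants that are each seen alone but never together yield a high trans probability, while variants that only ever co-occur yield a high cis probability.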


Subject(s)
Exome , High-Throughput Nucleotide Sequencing , Humans , Exome/genetics , Exome Sequencing , Genotype
8.
Nature ; 625(7993): 92-100, 2024 Jan.
Article in English | MEDLINE | ID: mdl-38057664

ABSTRACT

The depletion of disruptive variation caused by purifying natural selection (constraint) has been widely used to investigate protein-coding genes underlying human disorders [1-4], but attempts to assess constraint for non-protein-coding regions have proved more difficult. Here we aggregate, process and release a dataset of 76,156 human genomes from the Genome Aggregation Database (gnomAD), the largest public open-access human genome allele frequency reference dataset, and use it to build a genomic constraint map for the whole genome (genomic non-coding constraint of haploinsufficient variation (Gnocchi)). We present a refined mutational model that incorporates local sequence context and regional genomic features to detect depletions of variation. As expected, the average constraint for protein-coding sequences is stronger than that for non-coding regions. Within the non-coding genome, constrained regions are enriched for known regulatory elements and variants that are implicated in complex human diseases and traits, facilitating the triangulation of biological annotation, disease association and natural selection to non-coding DNA analysis. More constrained regulatory elements tend to regulate more constrained protein-coding genes, which in turn suggests that non-coding constraint can aid the identification of constrained genes that are as yet unrecognized by current gene constraint metrics. We demonstrate that this genome-wide constraint map improves the identification and interpretation of functional human genetic variation.


Subject(s)
Genome, Human , Genomics , Models, Genetic , Mutation , Humans , Access to Information , Databases, Genetic , Datasets as Topic , Gene Frequency , Genome, Human/genetics , Mutation/genetics , Selection, Genetic
9.
Am J Hum Genet ; 110(12): 2068-2076, 2023 Dec 07.
Article in English | MEDLINE | ID: mdl-38000370

ABSTRACT

DNA sample contamination is a major issue in clinical and research applications of whole-genome and -exome sequencing. Even modest levels of contamination can substantially affect the overall quality of variant calls and lead to widespread genotyping errors. Currently, popular tools for estimating the contamination level use short-read data (BAM/CRAM files), which are expensive to store and manipulate and often not retained or shared widely. We propose a metric to estimate DNA sample contamination from variant-level whole-genome and -exome sequence data called CHARR, contamination from homozygous alternate reference reads, which leverages the infiltration of reference reads within homozygous alternate variant calls. CHARR uses a small proportion of variant-level genotype information and thus can be computed from single-sample gVCFs or callsets in VCF or BCF formats, as well as efficiently stored variant calls in Hail VariantDataset format. Our results demonstrate that CHARR accurately recapitulates results from existing tools with substantially reduced costs, improving the accuracy and efficiency of downstream analyses of ultra-large whole-genome and exome sequencing datasets.
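The intuition behind CHARR can be sketched in a few lines. In an uncontaminated sample, a homozygous alternate call should contain essentially no reference reads, so the fraction of reference reads at such calls, scaled by how common the reference allele is in the population, tracks contamination. The formula, the `charr` function signature, and the allele-frequency bounds below are assumptions made for illustration, not the published definition or any library's API.

```python
def charr(hom_alt_calls, min_ref_af=0.05, max_ref_af=0.95):
    """Illustrative contamination estimate from homozygous alternate calls.

    hom_alt_calls: iterable of (ref_ad, alt_ad, ref_af) tuples, where
      ref_ad / alt_ad are reference / alternate read depths at the call and
      ref_af is the population reference allele frequency at the site.
    The frequency bounds are illustrative defaults (an assumption).
    """
    ratios = []
    for ref_ad, alt_ad, ref_af in hom_alt_calls:
        depth = ref_ad + alt_ad
        # Skip uninformative calls: no reads, or reference allele frequency
        # outside the chosen range.
        if depth == 0 or not (min_ref_af <= ref_af <= max_ref_af):
            continue
        # Reference read fraction, normalized by reference allele frequency.
        ratios.append(ref_ad / (depth * ref_af))
    if not ratios:
        raise ValueError("no informative homozygous alternate calls")
    return sum(ratios) / len(ratios)


# Hypothetical calls: (ref reads, alt reads, reference allele frequency).
calls = [(1, 99, 0.5), (0, 50, 0.5), (2, 98, 0.5)]
print(charr(calls))
```

Only a handful of per-genotype fields are needed, which is consistent with the abstract's point that the metric can be computed from variant-level data without access to the underlying read files.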


Subject(s)
DNA , Trout , Humans , Animals , Sequence Analysis, DNA/methods , Genotype , Homozygote , High-Throughput Nucleotide Sequencing/methods , Software
10.
medRxiv ; 2023 Sep 28.
Article in English | MEDLINE | ID: mdl-37808854

ABSTRACT

The complete blood count is an important screening tool for healthy adults and is the most commonly ordered test at periodic physical exams. However, results are usually interpreted relative to one-size-fits-all reference intervals, undermining the goal of precision medicine to tailor medical care to the needs of individual patients based on their unique characteristics. Here we show that standard complete blood count indices in healthy adults have robust homeostatic setpoints that are patient-specific and stable, with the typical healthy adult's set of 9 blood count setpoints distinguishable from 98% of others, and with these differences persisting for decades. These setpoints reflect a deep physiologic phenotype, enabling improved detection of both acquired and genetic determinants of hematologic regulation, including discovery of multiple novel loci via GWAS analyses. Patient-specific reference intervals derived from setpoints enable more accurate personalized risk assessment, and the setpoints themselves are significantly correlated with mortality risk, providing new opportunities to enhance patient-specific screening and early intervention. This study shows complete blood count setpoints are sufficiently stable and patient-specific to help realize the promise of precision medicine for healthy adults.

11.
Am J Hum Genet ; 110(9): 1496-1508, 2023 Sep 07.
Article in English | MEDLINE | ID: mdl-37633279

ABSTRACT

Predicted loss of function (pLoF) variants are often highly deleterious and play an important role in disease biology, but many pLoF variants may not result in loss of function (LoF). Here we present a framework that advances interpretation of pLoF variants in research and clinical settings by considering three categories of LoF evasion: (1) predicted rescue by secondary sequence properties, (2) uncertain biological relevance, and (3) potential technical artifacts. We also provide recommendations on adjustments to ACMG/AMP guidelines' PVS1 criterion. Applying this framework to all high-confidence pLoF variants in 22 genes associated with autosomal-recessive disease from the Genome Aggregation Database (gnomAD v.2.1.1) revealed predicted LoF evasion or potential artifacts in 27.3% (304/1,113) of variants. The major reasons were location in the last exon, in a homopolymer repeat, in a low proportion expressed across transcripts (pext) scored region, or the presence of cryptic in-frame splice rescues. Variants predicted to evade LoF or to be potential artifacts were enriched for ClinVar benign variants. PVS1 was downgraded in 99.4% (162/163) of pLoF variants predicted as likely not LoF/not LoF, with 17.2% (28/163) downgraded as a result of our framework, adding to previous guidelines. Variant pathogenicity was affected (mostly from likely pathogenic to VUS) in 20 (71.4%) of these 28 variants. This framework guides assessment of pLoF variants beyond standard annotation pipelines and substantially reduces false positive rates, which is key to ensure accurate LoF variant prediction in both a research and clinical setting.


Subject(s)
Inheritance Patterns , Humans , Exons , Uncertainty
12.
Nature ; 620(7975): 839-848, 2023 Aug.
Article in English | MEDLINE | ID: mdl-37587338

ABSTRACT

Mitochondrial DNA (mtDNA) is a maternally inherited, high-copy-number genome required for oxidative phosphorylation [1]. Heteroplasmy refers to the presence of a mixture of mtDNA alleles in an individual and has been associated with disease and ageing. Mechanisms underlying common variation in human heteroplasmy, and the influence of the nuclear genome on this variation, remain insufficiently explored. Here we quantify mtDNA copy number (mtCN) and heteroplasmy using blood-derived whole-genome sequences from 274,832 individuals and perform genome-wide association studies to identify associated nuclear loci. Following blood cell composition correction, we find that mtCN declines linearly with age and is associated with variants at 92 nuclear loci. We observe that nearly everyone harbours heteroplasmic mtDNA variants obeying two principles: (1) heteroplasmic single nucleotide variants tend to arise somatically and accumulate sharply after the age of 70 years, whereas (2) heteroplasmic indels are maternally inherited as mixtures with relative levels associated with 42 nuclear loci involved in mtDNA replication, maintenance and novel pathways. These loci may act by conferring a replicative advantage to certain mtDNA alleles. As an illustrative example, we identify a length variant carried by more than 50% of humans at position chrM:302 within a G-quadruplex previously proposed to mediate mtDNA transcription/replication switching [2,3]. We find that this variant exerts cis-acting genetic control over mtDNA abundance and is itself associated in-trans with nuclear loci encoding machinery for this regulatory switch. Our study suggests that common variation in the nuclear genome can shape variation in mtCN and heteroplasmy dynamics across the human population.


Subject(s)
Cell Nucleus , DNA Copy Number Variations , DNA, Mitochondrial , Heteroplasmy , Mitochondria , Aged , Humans , DNA Copy Number Variations/genetics , DNA, Mitochondrial/genetics , Genome-Wide Association Study , Heteroplasmy/genetics , Mitochondria/genetics , Cell Nucleus/genetics , Alleles , Polymorphism, Single Nucleotide , INDEL Mutation , G-Quadruplexes
13.
Nat Genet ; 55(9): 1589-1597, 2023 Sep.
Article in English | MEDLINE | ID: mdl-37604963

ABSTRACT

Copy number variants (CNVs) are major contributors to genetic diversity and disease. While standardized methods, such as the genome analysis toolkit (GATK), exist for detecting short variants, technical challenges have confounded uniform large-scale CNV analyses from whole-exome sequencing (WES) data. Given the profound impact of rare and de novo coding CNVs on genome organization and human disease, we developed GATK-gCNV, a flexible algorithm to discover rare CNVs from sequencing read-depth information, complete with open-source distribution via GATK. We benchmarked GATK-gCNV in 7,962 exomes from individuals in quartet families with matched genome sequencing and microarray data, finding up to 95% recall of rare coding CNVs at a resolution of more than two exons. We used GATK-gCNV to generate a reference catalog of rare coding CNVs in WES data from 197,306 individuals in the UK Biobank, and observed strong correlations between per-gene CNV rates and measures of mutational constraint, as well as rare CNV associations with multiple traits. In summary, GATK-gCNV is a tunable approach for sensitive and specific CNV discovery in WES data, with broad applications.


Subject(s)
DNA Copy Number Variations , Exome , Humans , Exome/genetics , Exome Sequencing , DNA Copy Number Variations/genetics , Chromosome Mapping , Exons
14.
bioRxiv ; 2023 Jun 28.
Article in English | MEDLINE | ID: mdl-37425834

ABSTRACT

DNA sample contamination is a major issue in clinical and research applications of whole genome and exome sequencing. Even modest levels of contamination can substantially affect the overall quality of variant calls and lead to widespread genotyping errors. Currently, popular tools for estimating the contamination level use short-read data (BAM/CRAM files), which are expensive to store and manipulate and often not retained or shared widely. We propose a new metric to estimate DNA sample contamination from variant-level whole genome and exome sequence data, CHARR, Contamination from Homozygous Alternate Reference Reads, which leverages the infiltration of reference reads within homozygous alternate variant calls. CHARR uses a small proportion of variant-level genotype information and thus can be computed from single-sample gVCFs or callsets in VCF or BCF formats, as well as efficiently stored variant calls in Hail VDS format. Our results demonstrate that CHARR accurately recapitulates results from existing tools with substantially reduced costs, improving the accuracy and efficiency of downstream analyses of ultra-large whole genome and exome sequencing datasets.

15.
Genome Res ; 33(6): 999-1005, 2023 Jun.
Article in English | MEDLINE | ID: mdl-37253541

ABSTRACT

Large-scale high-throughput sequencing data sets have been transformative for informing clinical variant interpretation and for use as reference panels for statistical and population genetic efforts. Although such resources are often treated as ground truth, we find that in widely used reference data sets such as the Genome Aggregation Database (gnomAD), some variants pass gold-standard filters, yet are systematically different in their genotype calls across genotype discovery approaches. The inclusion of such discordant sites in study designs involving multiple genotype discovery strategies could bias results and lead to false-positive hits in association studies owing to technological artifacts rather than a true relationship to the phenotype. Here, we describe this phenomenon of discordant genotype calls across genotype discovery approaches, characterize the error mode of wrong calls, provide a list of discordant sites identified in gnomAD that should be treated with caution in analyses, and present a metric and machine learning classifier trained on gnomAD data to identify likely discordant variants in other data sets. We find that different genotype discovery approaches have different sets of variants at which this problem occurs, but there are characteristic variant features that can be used to predict discordant behavior. Discordant sites are largely shared across ancestry groups, although different populations are powered for the discovery of different variants. We find that the most common error mode is that of a variant being heterozygous for one approach and homozygous for the other, with heterozygous in the genomes and homozygous reference in the exomes making up the majority of miscalls.


Subject(s)
Exome , Genetics, Population , Genotype , Heterozygote , Phenotype , Polymorphism, Single Nucleotide
16.
bioRxiv ; 2023 Aug 21.
Article in English | MEDLINE | ID: mdl-36993580

ABSTRACT

Recessive diseases arise when both the maternal and the paternal copies of a gene are impacted by a damaging genetic variant in the affected individual. When a patient carries two different potentially causal variants in a gene for a given disorder, accurate diagnosis requires determining that these two variants occur on different copies of the chromosome (i.e., are in trans) rather than on the same copy (i.e., in cis). However, current approaches for determining phase, beyond parental testing, are limited in clinical settings. We developed a strategy for inferring phase for rare variant pairs within genes, leveraging genotypes observed in exome sequencing data from the Genome Aggregation Database (gnomAD v2, n=125,748). When applied to trio data where phase can be determined by transmission, our approach estimates phase with 95.7% accuracy and remains accurate even for very rare variants (allele frequency < 1×10⁻⁴). We also correctly phase 95.9% of variant pairs in a set of 293 patients with Mendelian conditions carrying presumed causal compound heterozygous variants. We provide a public resource of phasing estimates from gnomAD, including phasing estimates for coding variants across the genome and counts per gene of rare variants in trans, that can aid interpretation of rare co-occurring variants in the context of recessive disease.

17.
medRxiv ; 2023 Mar 09.
Article in English | MEDLINE | ID: mdl-36945502

ABSTRACT

Predicted loss of function (pLoF) variants are highly deleterious and play an important role in disease biology, but many of these variants may not actually result in loss of function. Here we present a framework that advances interpretation of pLoF variants in research and clinical settings by considering three categories of LoF evasion: (1) predicted rescue by secondary sequence properties, (2) uncertain biological relevance, and (3) potential technical artifacts. We also provide recommendations on adjustments to the ACMG/AMP guidelines' PVS1 criterion. Applying this framework to all high-confidence pLoF variants in 22 autosomal recessive disease genes from the Genome Aggregation Database (gnomAD, v2.1.1) revealed predicted LoF evasion or potential artifacts in 27.3% (304/1,113) of variants. The major reasons were location in the last exon, in a homopolymer repeat, in low per-base expression (pext) score regions, or the presence of cryptic splice rescues. Variants predicted to be potential artifacts or to evade LoF were enriched for ClinVar benign variants. PVS1 was downgraded in 99.4% (162/163) of LoF-evading variants assessed, with 17.2% (28/163) downgraded as a result of our framework, adding to previous guidelines. Variant pathogenicity was affected (mostly from likely pathogenic to VUS) in 20 (71.4%) of these 28 variants. This framework guides assessment of pLoF variants beyond standard annotation pipelines and substantially reduces false positive rates, which is key to ensuring accurate LoF variant prediction in both research and clinical settings.

18.
Nature ; 614(7948): 492-499, 2023 Feb.
Article in English | MEDLINE | ID: mdl-36755099

ABSTRACT

Both common and rare genetic variants influence complex traits and common diseases. Genome-wide association studies have identified thousands of common-variant associations, and more recently, large-scale exome sequencing studies have identified rare-variant associations in hundreds of genes [1-3]. However, rare-variant genetic architecture is not well characterized, and the relationship between common-variant and rare-variant architecture is unclear [4]. Here we quantify the heritability explained by the gene-wise burden of rare coding variants across 22 common traits and diseases in 394,783 UK Biobank exomes [5]. Rare coding variants (allele frequency < 1 × 10⁻³) explain 1.3% (s.e. = 0.03%) of phenotypic variance on average, much less than common variants, and most burden heritability is explained by ultrarare loss-of-function variants (allele frequency < 1 × 10⁻⁵). Common and rare variants implicate the same cell types, with similar enrichments, and they have pleiotropic effects on the same pairs of traits, with similar genetic correlations. They partially colocalize at individual genes and loci, but not to the same extent: burden heritability is strongly concentrated in significant genes, while common-variant heritability is more polygenic, and burden heritability is also more strongly concentrated in constrained genes. Finally, we find that burden heritability for schizophrenia and bipolar disorder [6,7] is approximately 2%. Our results indicate that rare coding variants will implicate a tractable number of large-effect genes, that common and rare associations are mechanistically convergent, and that rare coding variants will contribute only modestly to missing heritability and population risk stratification.


Subject(s)
Exome , Gene Frequency , Genetic Variation , Multifactorial Inheritance , Humans , Exome/genetics , Genetic Variation/genetics , Genome-Wide Association Study , Multifactorial Inheritance/genetics , Risk Factors , United Kingdom , Genetic Loci/genetics , Schizophrenia/genetics , Bipolar Disorder/genetics
19.
BioData Min ; 16(1): 3, 2023 Feb 02.
Article in English | MEDLINE | ID: mdl-36732776

ABSTRACT

BACKGROUND: Loss-of-Function (LoF) variants in human genes are important due to their impact on clinical phenotypes and frequent occurrence in the genomes of healthy individuals. The association of LoF variants with complex diseases and traits may lead to the discovery and validation of novel therapeutic targets. Current approaches predict high-confidence LoF variants without identifying the specific genes or the number of copies they affect. Moreover, there is a lack of methods for detecting knockout genes caused by compound heterozygous (CH) LoF variants. RESULTS: We have developed the Loss-of-Function ToolKit (LoFTK), which allows efficient and automated prediction of LoF variants from genotyped, imputed and sequenced genomes. LoFTK enables the identification of genes that are inactive in one or two copies and provides summary statistics for downstream analyses. LoFTK can identify CH LoF variants, which result in LoF genes with two copies lost. Using data from parents and offspring we show that 96% of CH LoF genes predicted by LoFTK in the offspring have the respective alleles donated by each parent. CONCLUSIONS: LoFTK is a command-line based tool that provides a reliable computational workflow for predicting LoF variants from genotyped and sequenced genomes, identifying genes that are inactive in 1 or 2 copies. LoFTK is an open software and is freely available to non-commercial users at https://github.com/CirculatoryHealth/LoFTK .

20.
Nat Commun ; 14(1): 157, 2023 Jan 18.
Article in English | MEDLINE | ID: mdl-36653343

ABSTRACT

Otosclerosis is one of the most common causes of conductive hearing loss, affecting 0.3% of the population. It typically presents in adulthood and half of the patients have a positive family history. The pathophysiology of otosclerosis is poorly understood. A previous genome-wide association study (GWAS) identified a single association locus in an intronic region of RELN. Here, we report a meta-analysis of GWAS studies of otosclerosis in three population-based biobanks comprising 3504 cases and 861,198 controls. We identify 23 novel risk loci (p < 5 × 10⁻⁸) and report an association in RELN and three previously reported candidate gene or linkage regions (TGFB1, MEPE, and OTSC7). We demonstrate developmental stage-dependent immunostaining patterns of MEPE and RUNX2 in mouse otic capsules. In most association loci, the nearest protein-coding genes are implicated in bone remodelling, mineralization or severe skeletal disorders. We highlight multiple genes involved in transforming growth factor beta signalling for follow-up studies.


Subject(s)
Genome-Wide Association Study , Otosclerosis , Animals , Mice , Otosclerosis/genetics , Biological Specimen Banks , Polymorphism, Single Nucleotide , Genetic Predisposition to Disease/genetics