ABSTRACT
CHASERR encodes a human long noncoding RNA (lncRNA) adjacent to CHD2, a coding gene in which de novo loss-of-function variants cause developmental and epileptic encephalopathy. Here, we report our findings in three unrelated children with a syndromic, early-onset neurodevelopmental disorder, each of whom had a de novo deletion in the CHASERR locus. The children had severe encephalopathy, shared facial dysmorphisms, cortical atrophy, and cerebral hypomyelination; this phenotype is distinct from the phenotypes of patients with CHD2 haploinsufficiency. We found that the CHASERR deletion results in increased CHD2 protein abundance in patient-derived cell lines and increased expression of the CHD2 transcript in cis. These findings indicate that CHD2 has bidirectional dosage sensitivity in human disease, and we recommend that other lncRNA-encoding genes be evaluated, particularly those upstream of genes associated with mendelian disorders. (Funded by the National Human Genome Research Institute and others.)
Subjects
Neurodevelopmental Disorders, Long Noncoding RNA, Preschool Child, Female, Humans, Infant, Male, Brain/pathology, Brain/diagnostic imaging, Brain/metabolism, DNA-Binding Proteins/analysis, DNA-Binding Proteins/genetics, DNA-Binding Proteins/metabolism, Gene Deletion, Haploinsufficiency, Neurodevelopmental Disorders/diagnosis, Neurodevelopmental Disorders/genetics, Neurodevelopmental Disorders/pathology, Phenotype, Long Noncoding RNA/genetics, Sequence Deletion
ABSTRACT
Incomplete penetrance, or absence of disease phenotype in an individual with a disease-associated variant, is a major challenge in variant interpretation. Studying individuals with apparent incomplete penetrance can shed light on underlying drivers of altered phenotype penetrance. Here, we investigate clinically relevant variants from ClinVar in 807,162 individuals from the Genome Aggregation Database (gnomAD), demonstrating improved representation in gnomAD version 4. We then conduct a comprehensive case-by-case assessment of 734 predicted loss-of-function (pLoF) variants in 77 genes associated with severe, early-onset, highly penetrant haploinsufficient disease. We identified explanations for the presumed lack of disease manifestation in 701 of the variants (95%). Individuals with an unexplained lack of disease manifestation in this set of disorders are rare, underscoring the need for, and the power of, the deep case-by-case assessment presented here to minimize false assignments of disease risk, particularly in unaffected individuals with higher rates of secondary properties that result in rescue.
ABSTRACT
Underrepresented populations are often excluded from genomic studies owing in part to a lack of resources supporting their analyses. The 1000 Genomes Project (1kGP) and Human Genome Diversity Project (HGDP), which have recently been sequenced to high coverage, are valuable genomic resources because of the global diversity they capture and their open data sharing policies. Here, we harmonized a high-quality set of 4094 whole genomes from 80 populations in the HGDP and 1kGP with data from the Genome Aggregation Database (gnomAD) and identified over 153 million high-quality SNVs, indels, and SVs. We performed a detailed ancestry analysis of this cohort, characterizing population structure and patterns of admixture across populations, analyzing site frequency spectra, and measuring variant counts at global and subcontinental levels. We also show substantial added value from this data set compared with the prior versions of the component resources, typically combined via liftOver and variant intersection; for example, we catalog millions of new genetic variants, mostly rare, compared with previous releases. In addition to unrestricted individual-level public release, we provide detailed tutorials for conducting many of the most common quality-control steps and analyses with these data in a scalable cloud-computing environment and publicly release this new phased joint callset for use as a haplotype resource in phasing and imputation pipelines. This jointly called reference panel will serve as a key resource to support research of diverse ancestry populations.
Subjects
Genetic Databases, Human Genome, Humans, Human Genome Project, High-Throughput Nucleotide Sequencing/methods, Genetic Variation, Genomics/methods
ABSTRACT
Missense variants can have a range of functional impacts depending on factors such as the specific amino acid substitution and location within the gene. To interpret their deleteriousness, studies have sought to identify regions within genes that are specifically intolerant of missense variation [1-12]. Here, we leverage the patterns of rare missense variation in 125,748 individuals in the Genome Aggregation Database (gnomAD) [13] against a null mutational model to identify transcripts that display regional differences in missense constraint. Missense-depleted regions are enriched for ClinVar [14] pathogenic variants, de novo missense variants from individuals with neurodevelopmental disorders (NDDs) [15,16], and complex trait heritability. Following ClinGen calibration recommendations for the ACMG/AMP guidelines, we establish that regions with less than 20% of their expected missense variation achieve moderate support for pathogenicity. We create a missense deleteriousness metric (MPC) that incorporates regional constraint and outperforms other deleteriousness scores at stratifying case and control de novo missense variation, with a strong enrichment in NDDs. These results provide additional tools to aid in missense variant interpretation.
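The 20% threshold mentioned in this abstract can be illustrated with a minimal sketch. It assumes that the expected missense count for each region is already available from some mutational model, which is not reproduced here. The `Region` type, the function names, and the example counts are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    observed: int    # rare missense variants actually seen in the region
    expected: float  # count predicted by a null mutational model (assumed given)


def oe_ratio(region: Region) -> float:
    """Observed/expected missense ratio for one region."""
    if region.expected <= 0:
        raise ValueError("expected count must be positive")
    return region.observed / region.expected


def is_missense_depleted(region: Region, threshold: float = 0.2) -> bool:
    """True when the region carries less than `threshold` of its expected variation."""
    return oe_ratio(region) < threshold


# Hypothetical regions, for illustration only.
depleted = Region("region_1", observed=3, expected=30.0)
tolerant = Region("region_2", observed=12, expected=30.0)
print(is_missense_depleted(depleted), is_missense_depleted(tolerant))
```

The threshold is left as a parameter because the abstract ties the 20% cutoff to one specific evidence level; other cutoffs would need their own calibration.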
ABSTRACT
Genes encoding long non-coding RNAs (lncRNAs) comprise a large fraction of the human genome, yet haploinsufficiency of a lncRNA has not been shown to cause a Mendelian disease. CHASERR is a highly conserved human lncRNA adjacent to CHD2, a coding gene in which de novo loss-of-function variants cause developmental and epileptic encephalopathy. Here we report three unrelated individuals each harboring an ultra-rare heterozygous de novo deletion in the CHASERR locus. We report similarities in severe developmental delay, facial dysmorphisms, and cerebral dysmyelination in these individuals, distinguishing them from the phenotypic spectrum of CHD2 haploinsufficiency. We demonstrate reduced CHASERR mRNA expression and corresponding increased CHD2 mRNA and protein in whole blood and patient-derived cell lines, specifically increased expression of the CHD2 allele in cis with the CHASERR deletion, as predicted from a prior mouse model of Chaserr haploinsufficiency. We show for the first time that de novo structural variants facilitated by Alu-mediated non-allelic homologous recombination led to deletion of a non-coding element (the lncRNA CHASERR) to cause a rare syndromic neurodevelopmental disorder. We also demonstrate that CHD2 has bidirectional dosage sensitivity in human disease. This work highlights the need to carefully evaluate other lncRNAs, particularly those upstream of genes associated with Mendelian disorders.
ABSTRACT
Recessive diseases arise when both copies of a gene are impacted by a damaging genetic variant. When a patient carries two potentially causal variants in a gene, accurate diagnosis requires determining that these variants occur on different copies of the chromosome (that is, are in trans) rather than on the same copy (that is, in cis). However, current approaches for determining phase, beyond parental testing, are limited in clinical settings. Here we developed a strategy for inferring phase for rare variant pairs within genes, leveraging genotypes observed in the Genome Aggregation Database (v2, n = 125,748 exomes). Our approach estimates phase with 96% accuracy, both in trio data and in patients with Mendelian conditions and presumed causal compound heterozygous variants. We provide a public resource of phasing estimates for coding variants and counts per gene of rare variants in trans that can aid interpretation of rare co-occurring variants in the context of recessive disease.
Subjects
Exome, High-Throughput Nucleotide Sequencing, Humans, Exome/genetics, Exome Sequencing, Genotype
ABSTRACT
The depletion of disruptive variation caused by purifying natural selection (constraint) has been widely used to investigate protein-coding genes underlying human disorders [1-4], but attempts to assess constraint for non-protein-coding regions have proved more difficult. Here we aggregate, process and release a dataset of 76,156 human genomes from the Genome Aggregation Database (gnomAD), the largest public open-access human genome allele frequency reference dataset, and use it to build a genomic constraint map for the whole genome (genomic non-coding constraint of haploinsufficient variation (Gnocchi)). We present a refined mutational model that incorporates local sequence context and regional genomic features to detect depletions of variation. As expected, the average constraint for protein-coding sequences is stronger than that for non-coding regions. Within the non-coding genome, constrained regions are enriched for known regulatory elements and variants that are implicated in complex human diseases and traits, facilitating the triangulation of biological annotation, disease association and natural selection to non-coding DNA analysis. More constrained regulatory elements tend to regulate more constrained protein-coding genes, which in turn suggests that non-coding constraint can aid the identification of constrained genes that are as yet unrecognized by current gene constraint metrics. We demonstrate that this genome-wide constraint map improves the identification and interpretation of functional human genetic variation.
Subjects
Human Genome, Genomics, Genetic Models, Mutation, Humans, Access to Information, Genetic Databases, Datasets as Topic, Gene Frequency, Human Genome/genetics, Mutation/genetics, Genetic Selection
ABSTRACT
Underrepresented populations are often excluded from genomic studies due in part to a lack of resources supporting their analyses. The 1000 Genomes Project (1kGP) and Human Genome Diversity Project (HGDP), which have recently been sequenced to high coverage, are valuable genomic resources because of the global diversity they capture and their open data sharing policies. Here, we harmonized a high quality set of 4,094 whole genomes from HGDP and 1kGP with data from the Genome Aggregation Database (gnomAD) and identified over 153 million high-quality SNVs, indels, and SVs. We performed a detailed ancestry analysis of this cohort, characterizing population structure and patterns of admixture across populations, analyzing site frequency spectra, and measuring variant counts at global and subcontinental levels. We also demonstrate substantial added value from this dataset compared to the prior versions of the component resources, typically combined via liftover and variant intersection; for example, we catalog millions of new genetic variants, mostly rare, compared to previous releases. In addition to unrestricted individual-level public release, we provide detailed tutorials for conducting many of the most common quality control steps and analyses with these data in a scalable cloud-computing environment and publicly release this new phased joint callset for use as a haplotype resource in phasing and imputation pipelines. This jointly called reference panel will serve as a key resource to support research of diverse ancestry populations.
ABSTRACT
DNA sample contamination is a major issue in clinical and research applications of whole-genome and -exome sequencing. Even modest levels of contamination can substantially affect the overall quality of variant calls and lead to widespread genotyping errors. Currently, popular tools for estimating the contamination level use short-read data (BAM/CRAM files), which are expensive to store and manipulate and often not retained or shared widely. We propose a metric to estimate DNA sample contamination from variant-level whole-genome and -exome sequence data called CHARR, contamination from homozygous alternate reference reads, which leverages the infiltration of reference reads within homozygous alternate variant calls. CHARR uses a small proportion of variant-level genotype information and thus can be computed from single-sample gVCFs or callsets in VCF or BCF formats, as well as efficiently stored variant calls in Hail VariantDataset format. Our results demonstrate that CHARR accurately recapitulates results from existing tools with substantially reduced costs, improving the accuracy and efficiency of downstream analyses of ultra-large whole-genome and exome sequencing datasets.
Subjects
DNA, Trout, Humans, Animals, DNA Sequence Analysis/methods, Genotype, Homozygote, High-Throughput Nucleotide Sequencing/methods, Software
ABSTRACT
DNA sample contamination is a major issue in clinical and research applications of whole genome and exome sequencing. Even modest levels of contamination can substantially affect the overall quality of variant calls and lead to widespread genotyping errors. Currently, popular tools for estimating the contamination level use short-read data (BAM/CRAM files), which are expensive to store and manipulate and often not retained or shared widely. We propose a new metric to estimate DNA sample contamination from variant-level whole genome and exome sequence data, CHARR, Contamination from Homozygous Alternate Reference Reads, which leverages the infiltration of reference reads within homozygous alternate variant calls. CHARR uses a small proportion of variant-level genotype information and thus can be computed from single-sample gVCFs or callsets in VCF or BCF formats, as well as efficiently stored variant calls in Hail VDS format. Our results demonstrate that CHARR accurately recapitulates results from existing tools with substantially reduced costs, improving the accuracy and efficiency of downstream analyses of ultra-large whole genome and exome sequencing datasets.
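The idea behind the metric can be sketched roughly as follows: in an uncontaminated sample, a homozygous alternate call should carry almost no reference reads, so the average share of reference reads at such calls acts as a contamination signal. The function below is a simplified illustration under that assumption, not the published CHARR definition, which may apply additional normalization and site filtering that are not reproduced here. The function name and the input layout are hypothetical.

```python
from typing import Iterable, Optional, Tuple


def mean_reference_read_fraction(
    hom_alt_calls: Iterable[Tuple[int, int]],
) -> Optional[float]:
    """Average fraction of reference reads across homozygous alternate calls.

    Each item is (ref_depth, alt_depth) for one call already genotyped as
    homozygous alternate. Calls with zero total depth are skipped.
    Returns None when no usable call remains.
    """
    fractions = []
    for ref_depth, alt_depth in hom_alt_calls:
        total = ref_depth + alt_depth
        if total == 0:
            continue  # no reads, nothing to measure
        fractions.append(ref_depth / total)
    if not fractions:
        return None
    return sum(fractions) / len(fractions)


# Hypothetical allele depths, for illustration only.
calls = [(0, 30), (3, 27), (0, 0)]
print(mean_reference_read_fraction(calls))
```

Only per-call allele depths are needed, which is why a summary of this kind can be computed from variant-level files rather than from read-level BAM/CRAM data.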
ABSTRACT
Recessive diseases arise when both the maternal and the paternal copies of a gene are impacted by a damaging genetic variant in the affected individual. When a patient carries two different potentially causal variants in a gene for a given disorder, accurate diagnosis requires determining that these two variants occur on different copies of the chromosome (i.e., are in trans) rather than on the same copy (i.e., are in cis). However, current approaches for determining phase, beyond parental testing, are limited in clinical settings. We developed a strategy for inferring phase for rare variant pairs within genes, leveraging genotypes observed in exome sequencing data from the Genome Aggregation Database (gnomAD v2, n = 125,748). When applied to trio data where phase can be determined by transmission, our approach estimates phase with 95.7% accuracy and remains accurate even for very rare variants (allele frequency < 1×10⁻⁴). We also correctly phase 95.9% of variant pairs in a set of 293 patients with Mendelian conditions carrying presumed causal compound heterozygous variants. We provide a public resource of phasing estimates from gnomAD, including phasing estimates for coding variants across the genome and counts per gene of rare variants in trans, that can aid interpretation of rare co-occurring variants in the context of recessive disease.
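The abstract does not spell out how phase is inferred from population genotypes. One common way to approach the problem, shown here purely as an assumption, is to estimate the four haplotype frequencies for a variant pair with an expectation-maximization (EM) step over the ambiguous double heterozygotes, and then compare the trans and cis configurations. The function name, the input layout, and the example counts are hypothetical.

```python
from typing import List, Optional


def estimate_p_trans(counts: List[List[int]], iterations: int = 100) -> Optional[float]:
    """Estimate the probability that two variants in a double heterozygote are in trans.

    counts[i][j] is the number of individuals carrying i alternate alleles at
    variant A and j alternate alleles at variant B (i, j in 0..2).
    Haplotypes: AB = ref/ref, Ab = ref/alt, aB = alt/ref, ab = alt/alt.
    """
    n = counts
    total_haplotypes = 2 * sum(sum(row) for row in n)
    if total_haplotypes == 0:
        return None

    # Haplotype counts from every genotype except the ambiguous double het n[1][1].
    base = {
        "AB": 2 * n[0][0] + n[0][1] + n[1][0],
        "Ab": 2 * n[0][2] + n[0][1] + n[1][2],
        "aB": 2 * n[2][0] + n[1][0] + n[2][1],
        "ab": 2 * n[2][2] + n[1][2] + n[2][1],
    }
    double_hets = n[1][1]

    freq = {h: 0.25 for h in base}  # uniform starting point
    for _ in range(iterations):
        cis = freq["AB"] * freq["ab"]
        trans = freq["Ab"] * freq["aB"]
        # E-step: share of double hets assigned to the cis configuration.
        w_cis = cis / (cis + trans) if (cis + trans) > 0 else 0.5
        # M-step: re-estimate haplotype frequencies from the completed counts.
        est = {
            "AB": base["AB"] + double_hets * w_cis,
            "ab": base["ab"] + double_hets * w_cis,
            "Ab": base["Ab"] + double_hets * (1 - w_cis),
            "aB": base["aB"] + double_hets * (1 - w_cis),
        }
        freq = {h: c / total_haplotypes for h, c in est.items()}

    cis = freq["AB"] * freq["ab"]
    trans = freq["Ab"] * freq["aB"]
    if cis + trans == 0:
        return None  # not enough information to say anything
    return trans / (cis + trans)


# Hypothetical counts: the two variants are seen in different people only.
never_together = [[1000, 10, 0], [10, 0, 0], [0, 0, 0]]
print(estimate_p_trans(never_together))
```

When the two variants are only ever seen in different individuals, the alt/alt haplotype gets no support and the estimate leans toward trans; when they are only ever seen together, it leans toward cis.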
ABSTRACT
Genome-wide association studies have successfully discovered thousands of common variants associated with human diseases and traits, but the landscape of rare variations in human disease has not been explored at scale. Exome-sequencing studies of population biobanks provide an opportunity to systematically evaluate the impact of rare coding variations across a wide range of phenotypes to discover genes and allelic series relevant to human health and disease. Here, we present results from systematic association analyses of 4,529 phenotypes using single-variant and gene tests of 394,841 individuals in the UK Biobank with exome-sequence data. We find that the discovery of genetic associations is tightly linked to frequency and is correlated with metrics of deleteriousness and natural selection. We highlight biological findings elucidated by these data and release the dataset as a public resource alongside the Genebass browser for rapidly exploring rare-variant association results.
ABSTRACT
Reference population databases are an essential tool in variant and gene interpretation. Their use guides the identification of pathogenic variants amidst the sea of benign variation present in every human genome, and supports the discovery of new disease-gene relationships. The Genome Aggregation Database (gnomAD) is currently the largest and most widely used publicly available collection of population variation from harmonized sequencing data. The data are available through the online gnomAD browser (https://gnomad.broadinstitute.org/), which enables rapid and intuitive variant analysis. This review provides guidance on the content of the gnomAD browser and its use for variant and gene interpretation. We introduce key features including allele frequency, per-base expression levels, constraint scores, and variant co-occurrence, alongside guidance on how to use these in analysis, with a focus on the interpretation of candidate variants and novel genes in rare disease.
Subjects
Rare Diseases, Software, Genetic Databases, Gene Frequency, Humans, Rare Diseases/genetics
ABSTRACT
Hundreds of thousands of genetic variants have been reported to cause severe monogenic diseases, but the probability that a variant carrier develops the disease (termed penetrance) is unknown for virtually all of them. Additionally, the clinical utility of common polygenic variation remains uncertain. Using exome sequencing from 77,184 adult individuals (38,618 multi-ancestral individuals from a type 2 diabetes case-control study and 38,566 participants from the UK Biobank, for whom genotype array data were also available), we apply clinical standard-of-care gene variant curation for eight monogenic metabolic conditions. Rare variants causing monogenic diabetes and dyslipidemias display effect sizes significantly larger than the top 1% of the corresponding polygenic scores. Nevertheless, penetrance estimates for monogenic variant carriers average 60% or lower for most conditions. We assess epidemiologic and genetic factors contributing to risk prediction in monogenic variant carriers, demonstrating that inclusion of polygenic variation significantly improves biomarker estimation for two monogenic dyslipidemias.
Subjects
Type 2 Diabetes Mellitus/genetics, Dyslipidemias/genetics, Genetic Predisposition to Disease/genetics, Adult, Population Biological Variation, Biomarkers/metabolism, Type 2 Diabetes Mellitus/metabolism, Dyslipidemias/metabolism, Exome/genetics, Genotype, Humans, Multifactorial Inheritance, Penetrance, Risk Assessment
ABSTRACT
JAG2 encodes the Notch ligand Jagged2. The conserved Notch signaling pathway contributes to the development and homeostasis of multiple tissues, including skeletal muscle. We studied an international cohort of 23 individuals with genetically unsolved muscular dystrophy from 13 unrelated families. Whole-exome sequencing identified rare homozygous or compound heterozygous JAG2 variants in all 13 families. The identified bi-allelic variants include 10 missense variants that disrupt highly conserved amino acids, a nonsense variant, two frameshift variants, an in-frame deletion, and a microdeletion encompassing JAG2. Onset of muscle weakness occurred from infancy to young adulthood. Serum creatine kinase (CK) levels were normal or mildly elevated. Muscle histology was primarily dystrophic. MRI of the lower extremities revealed a distinct, slightly asymmetric pattern of muscle involvement with cores of preserved and affected muscles in quadriceps and tibialis anterior, in some cases resembling patterns seen in POGLUT1-associated muscular dystrophy. Transcriptome analysis of muscle tissue from two participants suggested misregulation of genes involved in myogenesis, including PAX7. In complementary studies, Jag2 downregulation in murine myoblasts led to downregulation of multiple components of the Notch pathway, including Megf10. Investigations in Drosophila suggested an interaction between Serrate and Drpr, the fly orthologs of JAG1/JAG2 and MEGF10, respectively. In silico analysis predicted that many Jagged2 missense variants are associated with structural changes and protein misfolding. In summary, we describe a muscular dystrophy associated with pathogenic variants in JAG2 and present evidence suggesting a disease mechanism related to Notch pathway dysfunction.
Subjects
Jagged-2 Protein/genetics, Muscular Dystrophies/genetics, Adolescent, Adult, Amino Acid Sequence, Animals, Cell Line, Child, Preschool Child, Drosophila Proteins/genetics, Drosophila melanogaster/genetics, Female, Glucosyltransferases/genetics, Haplotypes/genetics, Humans, Jagged-1 Protein/genetics, Jagged-2 Protein/chemistry, Jagged-2 Protein/deficiency, Jagged-2 Protein/metabolism, Male, Membrane Proteins/genetics, Mice, Middle Aged, Molecular Models, Muscles/metabolism, Muscles/pathology, Muscular Dystrophies/pathology, Myoblasts/metabolism, Myoblasts/pathology, Pedigree, Phenotype, Notch Receptors/metabolism, Signal Transduction, Exome Sequencing, Young Adult
ABSTRACT
Advances in genome sequencing have resulted in the identification of the causes for numerous rare diseases. However, many cases remain unsolved with standard molecular analyses. We describe a family presenting with a phenotype resembling inherited thrombocytopenia 2 (THC2). THC2 is generally caused by single nucleotide variants that prevent silencing of ANKRD26 expression during hematopoietic differentiation. Short-read whole-exome and genome sequencing approaches were unable to identify a causal variant in this family. Using long-read whole-genome sequencing, a large complex structural variant involving a paired-duplication inversion was identified. Through functional studies, we show that this structural variant results in a pathogenic gain-of-function WAC-ANKRD26 fusion transcript. Our findings illustrate how complex structural variants that may be missed by conventional genome sequencing approaches can cause human disease.
Subjects
Signal Transducing Adaptor Proteins/genetics, Intercellular Signaling Peptides and Proteins/genetics, Single Nucleotide Polymorphism/genetics, Thrombocytopenia/genetics, Adolescent, Adult, Aged, Cell Line, Tumor Cell Line, Child, Chromosome Breakage, Chromosome Disorders/genetics, Exome/genetics, Female, HEK293 Cells, HeLa Cells, Humans, Male, Middle Aged, Mutation/genetics, Pedigree, Thrombocytopenia/congenital
ABSTRACT
Human genetic variants predicted to cause loss-of-function of protein-coding genes (pLoF variants) provide natural in vivo models of human gene inactivation and can be valuable indicators of gene function and the potential toxicity of therapeutic inhibitors targeting these genes [1,2]. Gain-of-kinase-function variants in LRRK2 are known to significantly increase the risk of Parkinson's disease [3,4], suggesting that inhibition of LRRK2 kinase activity is a promising therapeutic strategy. While preclinical studies in model organisms have raised some on-target toxicity concerns [5-8], the biological consequences of LRRK2 inhibition have not been well characterized in humans. Here, we systematically analyze pLoF variants in LRRK2 observed across 141,456 individuals sequenced in the Genome Aggregation Database (gnomAD) [9], 49,960 exome-sequenced individuals from the UK Biobank and over 4 million participants in the 23andMe genotyped dataset. After stringent variant curation, we identify 1,455 individuals with high-confidence pLoF variants in LRRK2. Experimental validation of three variants, combined with previous work [10], confirmed reduced protein levels in 82.5% of our cohort. We show that heterozygous pLoF variants in LRRK2 reduce LRRK2 protein levels but that these are not strongly associated with any specific phenotype or disease state. Our results demonstrate the value of large-scale genomic databases and phenotyping of human loss-of-function carriers for target validation in drug discovery.