Results 1 - 20 of 22
1.
Nature ; 625(7993): 92-100, 2024 Jan.
Article in English | MEDLINE | ID: mdl-38057664

ABSTRACT

The depletion of disruptive variation caused by purifying natural selection (constraint) has been widely used to investigate protein-coding genes underlying human disorders [1-4], but attempts to assess constraint for non-protein-coding regions have proved more difficult. Here we aggregate, process and release a dataset of 76,156 human genomes from the Genome Aggregation Database (gnomAD), the largest public open-access human genome allele frequency reference dataset, and use it to build a genomic constraint map for the whole genome (genomic non-coding constraint of haploinsufficient variation (Gnocchi)). We present a refined mutational model that incorporates local sequence context and regional genomic features to detect depletions of variation. As expected, the average constraint for protein-coding sequences is stronger than that for non-coding regions. Within the non-coding genome, constrained regions are enriched for known regulatory elements and variants that are implicated in complex human diseases and traits, facilitating the triangulation of biological annotation, disease association and natural selection to non-coding DNA analysis. More constrained regulatory elements tend to regulate more constrained protein-coding genes, which in turn suggests that non-coding constraint can aid the identification of constrained genes that are as yet unrecognized by current gene constraint metrics. We demonstrate that this genome-wide constraint map improves the identification and interpretation of functional human genetic variation.


Subject(s)
Genome, Human; Genomics; Models, Genetic; Mutation; Humans; Access to Information; Databases, Genetic; Datasets as Topic; Gene Frequency; Genome, Human/genetics; Mutation/genetics; Selection, Genetic
2.
Nature ; 604(7906): 509-516, 2022 Apr.
Article in English | MEDLINE | ID: mdl-35396579

ABSTRACT

Rare coding variation has historically provided the most direct connections between gene function and disease pathogenesis. By meta-analysing the whole exomes of 24,248 schizophrenia cases and 97,322 controls, we implicate ultra-rare coding variants (URVs) in 10 genes as conferring substantial risk for schizophrenia (odds ratios of 3-50, P < 2.14 × 10^-6) and 32 genes at a false discovery rate of <5%. These genes have the greatest expression in central nervous system neurons and have diverse molecular functions that include the formation, structure and function of the synapse. The associations of the NMDA (N-methyl-D-aspartate) receptor subunit GRIN2A and AMPA (α-amino-3-hydroxy-5-methyl-4-isoxazole propionic acid) receptor subunit GRIA3 provide support for dysfunction of the glutamatergic system as a mechanistic hypothesis in the pathogenesis of schizophrenia. We observe an overlap of rare variant risk among schizophrenia, autism spectrum disorders [1], epilepsy and severe neurodevelopmental disorders [2], although different mutation types are implicated in some shared genes. Most genes described here, however, are not implicated in neurodevelopment. We demonstrate that genes prioritized from common variant analyses of schizophrenia are enriched in rare variant risk [3], suggesting that common and rare genetic risk factors converge at least partially on the same underlying pathogenic biological processes. Even after excluding significantly associated genes, schizophrenia cases still carry a substantial excess of URVs, which indicates that more risk genes await discovery using this approach.


Subject(s)
Mutation; Neurodevelopmental Disorders; Schizophrenia; Case-Control Studies; Exome; Genetic Predisposition to Disease/genetics; Humans; Neurodevelopmental Disorders/genetics; Receptors, N-Methyl-D-Aspartate/genetics; Schizophrenia/genetics
3.
Am J Hum Genet ; 110(12): 2068-2076, 2023 Dec 07.
Article in English | MEDLINE | ID: mdl-38000370

ABSTRACT

DNA sample contamination is a major issue in clinical and research applications of whole-genome and -exome sequencing. Even modest levels of contamination can substantially affect the overall quality of variant calls and lead to widespread genotyping errors. Currently, popular tools for estimating the contamination level use short-read data (BAM/CRAM files), which are expensive to store and manipulate and often not retained or shared widely. We propose a metric to estimate DNA sample contamination from variant-level whole-genome and -exome sequence data called CHARR, contamination from homozygous alternate reference reads, which leverages the infiltration of reference reads within homozygous alternate variant calls. CHARR uses a small proportion of variant-level genotype information and thus can be computed from single-sample gVCFs or callsets in VCF or BCF formats, as well as efficiently stored variant calls in Hail VariantDataset format. Our results demonstrate that CHARR accurately recapitulates results from existing tools with substantially reduced costs, improving the accuracy and efficiency of downstream analyses of ultra-large whole-genome and exome sequencing datasets.
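
The idea of reading contamination off the reference reads that leak into homozygous-alternate calls can be illustrated with a short sketch. This is a simplified illustration, not the authors' estimator: the published CHARR metric may apply further normalization (for example by population allele frequency) that is not reproduced here, and the `HomAltCall` record, the function name and the `min_depth` filter are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class HomAltCall:
    """One homozygous-alternate genotype call (hypothetical record type)."""
    ref_reads: int  # reads supporting the reference allele
    alt_reads: int  # reads supporting the alternate allele


def ref_read_fraction(calls: Iterable[HomAltCall], min_depth: int = 10) -> Optional[float]:
    """Mean fraction of reference reads across homozygous-alternate calls.

    In an uncontaminated sample this should be close to zero; reference
    reads at these sites hint at DNA from another individual.
    Returns None when no call passes the (assumed) depth filter.
    """
    fractions = []
    for call in calls:
        depth = call.ref_reads + call.alt_reads
        if depth < min_depth:
            continue  # skip low-coverage sites, where the fraction is noisy
        fractions.append(call.ref_reads / depth)
    if not fractions:
        return None
    return sum(fractions) / len(fractions)
```

Because only per-genotype allele depths are needed, a calculation of this shape can run on variant-level files without going back to the read data.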


Subject(s)
DNA; Trout; Humans; Animals; Sequence Analysis, DNA/methods; Genotype; Homozygote; High-Throughput Nucleotide Sequencing/methods; Software
4.
Nature ; 581(7809): 452-458, 2020 May.
Article in English | MEDLINE | ID: mdl-32461655

ABSTRACT

The acceleration of DNA sequencing in samples from patients and population studies has resulted in extensive catalogues of human genetic variation, but the interpretation of rare genetic variants remains problematic. A notable example of this challenge is the existence of disruptive variants in dosage-sensitive disease genes, even in apparently healthy individuals. Here, by manual curation of putative loss-of-function (pLoF) variants in haploinsufficient disease genes in the Genome Aggregation Database (gnomAD) [1], we show that one explanation for this paradox involves alternative splicing of mRNA, which allows exons of a gene to be expressed at varying levels across different cell types. Currently, no existing annotation tool systematically incorporates information about exon expression into the interpretation of variants. We develop a transcript-level annotation metric known as the 'proportion expressed across transcripts', which quantifies isoform expression for variants. We calculate this metric using 11,706 tissue samples from the Genotype Tissue Expression (GTEx) project [2] and show that it can differentiate between weakly and highly evolutionarily conserved exons, a proxy for functional importance. We demonstrate that expression-based annotation selectively filters 22.8% of falsely annotated pLoF variants found in haploinsufficient disease genes in gnomAD, while removing less than 4% of high-confidence pathogenic variants in the same genes. Finally, we apply our expression filter to the analysis of de novo variants in patients with autism spectrum disorder and intellectual disability or developmental disorders to show that pLoF variants in weakly expressed regions have similar effect sizes to those of synonymous variants, whereas pLoF variants in highly expressed exons are most strongly enriched among cases. Our annotation is fast, flexible and generalizable, making it possible for any variant file to be annotated with any isoform expression dataset, and will be valuable for the genetic diagnosis of rare diseases, the analysis of rare variant burden in complex disorders, and the curation and prioritization of variants in recall-by-genotype studies.
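
One plausible reading of a 'proportion expressed across transcripts' value for a single variant in a single tissue is sketched below: the expression of the transcripts in which the variant carries the annotation, divided by the expression of all transcripts of the gene. The abstract does not spell out the formula, so this is an assumption; the function name, the transcript identifiers and the zero-expression convention are hypothetical.

```python
from typing import Dict, Set


def proportion_expressed(transcript_tpm: Dict[str, float], affected: Set[str]) -> float:
    """Share of a gene's expression carried by transcripts the variant affects.

    transcript_tpm maps each transcript of the gene to its expression in one
    tissue; affected holds the transcripts on which the variant has the
    annotation of interest (for example pLoF).
    """
    total = sum(transcript_tpm.values())
    if total == 0:
        return 0.0  # assumed convention: gene not expressed in this tissue
    affected_sum = sum(tpm for tx, tpm in transcript_tpm.items() if tx in affected)
    return affected_sum / total
```

A variant that is pLoF only on a weakly expressed isoform would score low under this sketch, which matches the abstract's point that such variants behave more like synonymous ones.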


Subject(s)
Disease/genetics; Haploinsufficiency/genetics; Loss of Function Mutation/genetics; Molecular Sequence Annotation; Transcription, Genetic; Transcriptome/genetics; Autism Spectrum Disorder/genetics; Datasets as Topic; Developmental Disabilities/genetics; Exons/genetics; Female; Genotype; Humans; Intellectual Disability/genetics; Male; Molecular Sequence Annotation/standards; Poisson Distribution; RNA, Messenger/analysis; RNA, Messenger/genetics; Rare Diseases/diagnosis; Rare Diseases/genetics; Reproducibility of Results; Exome Sequencing
5.
Nature ; 581(7809): 434-443, 2020 May.
Article in English | MEDLINE | ID: mdl-32461654

ABSTRACT

Genetic variants that inactivate protein-coding genes are a powerful source of information about the phenotypic consequences of gene disruption: genes that are crucial for the function of an organism will be depleted of such variants in natural populations, whereas non-essential genes will tolerate their accumulation. However, predicted loss-of-function variants are enriched for annotation errors, and tend to be found at extremely low frequencies, so their analysis requires careful variant annotation and very large sample sizes [1]. Here we describe the aggregation of 125,748 exomes and 15,708 genomes from human sequencing studies into the Genome Aggregation Database (gnomAD). We identify 443,769 high-confidence predicted loss-of-function variants in this cohort after filtering for artefacts caused by sequencing and annotation errors. Using an improved model of human mutation rates, we classify human protein-coding genes along a spectrum that represents tolerance to inactivation, validate this classification using data from model organisms and engineered human cells, and show that it can be used to improve the power of gene discovery for both common and rare diseases.


Subject(s)
Exome/genetics; Genes, Essential/genetics; Genetic Variation/genetics; Genome, Human/genetics; Adult; Brain/metabolism; Cardiovascular Diseases/genetics; Cohort Studies; Databases, Genetic; Female; Genetic Predisposition to Disease/genetics; Genome-Wide Association Study; Humans; Loss of Function Mutation/genetics; Male; Mutation Rate; Proprotein Convertase 9/genetics; RNA, Messenger/genetics; Reproducibility of Results; Exome Sequencing; Whole Genome Sequencing
7.
Am J Hum Genet ; 108(12): 2354-2367, 2021 Dec 02.
Article in English | MEDLINE | ID: mdl-34822764

ABSTRACT

Whole-genome sequencing studies applied to large populations or biobanks with extensive phenotyping raise new analytic challenges. The need to consider many variants at a locus or group of genes simultaneously and the potential to study many correlated phenotypes with shared genetic architecture provide opportunities for discovery not addressed by the traditional one variant, one phenotype association study. Here, we introduce a Bayesian model comparison approach called MRP (multiple rare variants and phenotypes) for rare-variant association studies that considers correlation, scale, and direction of genetic effects across a group of genetic variants, phenotypes, and studies, requiring only summary statistic data. We apply our method to exome sequencing data (n = 184,698) across 2,019 traits from the UK Biobank, aggregating signals in genes. MRP demonstrates an ability to recover signals such as associations between PCSK9 and LDL cholesterol levels. We additionally find MRP effective in conducting meta-analyses in exome data. Non-biomarker findings include associations between MC1R and red hair color and skin color, IL17RA and monocyte count, and IQGAP2 and mean platelet volume. Finally, we apply MRP in a multi-phenotype setting; after clustering the 35 biomarker phenotypes based on genetic correlation estimates, we find that joint analysis of these phenotypes results in substantial power gains for gene-trait associations, such as in TNFRSF13B in one of the clusters containing diabetes- and lipid-related traits. Overall, we show that the MRP model comparison approach improves upon useful features from widely used meta-analysis approaches for rare-variant association analyses and prioritizes protective modifiers of disease risk.


Subject(s)
Genetic Variation; Genome-Wide Association Study; Models, Genetic; Bayes Theorem; Female; Humans; Male; Phenotype
11.
Stroke ; 49(7): 1618-1625, 2018 Jul.
Article in English | MEDLINE | ID: mdl-29915124

ABSTRACT

BACKGROUND AND PURPOSE: Hematoma volume is an important determinant of clinical outcome in spontaneous intracerebral hemorrhage (ICH). We performed a genome-wide association study (GWAS) of hematoma volume with the aim of identifying novel biological pathways involved in the pathophysiology of primary brain injury in ICH. METHODS: We conducted a 2-stage (discovery and replication) case-only genome-wide association study in patients with ICH of European ancestry. We utilized the admission head computed tomography to calculate hematoma volume via semiautomated computer-assisted technique. After quality control and imputation, 7 million genetic variants were available for association testing with ICH volume, which was performed separately in lobar and nonlobar ICH cases using linear regression. Signals with P<5×10^-8 were pursued in replication and tested for association with admission Glasgow coma scale and 3-month post-ICH dichotomized (0-2 versus 3-6) modified Rankin Scale using ordinal and logistic regression, respectively. RESULTS: The discovery phase included 394 ICH cases (228 lobar and 166 nonlobar) and identified 2 susceptibility loci: a genomic region on 22q13 encompassing PARVB (top single-nucleotide polymorphism rs9614326: β, 1.84; SE, 0.32; P=4.4×10^-8) for lobar ICH volume and an intergenic region overlying numerous copy number variants on 17p12 (top single-nucleotide polymorphism rs11655160: β, 0.95; SE, 0.17; P=4.3×10^-8) for nonlobar ICH volume. The replication included 240 ICH cases (71 lobar and 169 nonlobar) and corroborated the association for 17p12 (P=0.04; meta-analysis P=2.5×10^-9; heterogeneity, P=0.16) but not for 22q13 (P=0.49). In multivariable analysis, rs11655160 was also associated with lower admission Glasgow coma scale (odds ratio, 0.17; P=0.004) and increased risk of poor 3-month modified Rankin Scale (odds ratio, 1.94; P=0.045).
CONCLUSIONS: We identified 17p12 as a novel susceptibility risk locus for hematoma volume, clinical severity, and functional outcome in nonlobar ICH. Replication in other ethnicities and follow-up translational studies are needed to elucidate the mechanism mediating the observed association.


Subject(s)
Cerebral Hemorrhage/genetics; Chromosomes, Human, Pair 17; Hematoma/genetics; Aged; Aged, 80 and over; Female; Genome-Wide Association Study; Genotype; Humans; Male; Middle Aged; Polymorphism, Single Nucleotide
12.
bioRxiv ; 2024 Jan 10.
Article in English | MEDLINE | ID: mdl-38260295

ABSTRACT

The Variant Call Format (VCF) is widely used in genome sequencing but scales poorly. For instance, we estimate a 150,000 genome VCF would occupy 900 TiB, making it both costly and complicated to produce and analyze. The issue stems from VCF's requirement to densely represent both reference-genotypes and allele-indexed arrays. These requirements lead to unnecessary data duplication and, ultimately, very large files. To address these challenges, we introduce the Scalable Variant Call Representation (SVCR). This representation reduces file sizes by ensuring they scale linearly with samples. SVCR achieves this by adopting reference blocks from the Genomic Variant Call Format (GVCF) and employing local allele indices. SVCR is also lossless and mergeable, allowing for N+1 and N+K incremental joint-calling. We present two implementations of SVCR: SVCR-VCF, which encodes SVCR in VCF format, and VDS, which uses Hail's native format. Our experiments confirm the linear scalability of SVCR-VCF and VDS, in contrast to the super-linear growth seen with standard VCF files. We also discuss the VDS Combiner, a scalable, open-source tool for producing a VDS from GVCFs and unique features of VDS which enable rapid data analysis. SVCR, and VDS in particular, ensure the scientific community can generate, analyze, and disseminate genetics datasets with millions of samples.
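
The notion of local allele indices can be sketched as follows: rather than indexing every sample's genotype and per-allele arrays against the full site-wide allele list, each sample keeps a short list of the alleles it actually carries and indexes into that. The function names and the tuple layout below are hypothetical and are not the field names or encoding used by SVCR-VCF or VDS.

```python
from typing import List, Tuple


def to_local_alleles(genotype: Tuple[int, int]) -> Tuple[List[int], Tuple[int, int]]:
    """Re-encode a diploid genotype against a per-sample allele list.

    genotype holds indices into the site-wide allele list (0 = reference).
    Returns (local_alleles, local_genotype): local_alleles lists the
    site-wide indices this sample carries, with the reference always first,
    and local_genotype indexes into that short list.
    """
    local_alleles = sorted({0, *genotype})
    position = {allele: i for i, allele in enumerate(local_alleles)}
    local_genotype = (position[genotype[0]], position[genotype[1]])
    return local_alleles, local_genotype


def to_global_genotype(local_alleles: List[int], local_genotype: Tuple[int, int]) -> Tuple[int, int]:
    """Invert to_local_alleles, showing that the re-encoding loses nothing."""
    return (local_alleles[local_genotype[0]], local_alleles[local_genotype[1]])
```

Under this sketch a sample at a site with hundreds of alternate alleles stores per-allele arrays sized by at most three local alleles, so adding samples (and with them new rare alleles) does not widen every existing sample's record.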

13.
bioRxiv ; 2024 May 03.
Article in English | MEDLINE | ID: mdl-38645134

ABSTRACT

Missense variants can have a range of functional impacts depending on factors such as the specific amino acid substitution and location within the gene. To interpret their deleteriousness, studies have sought to identify regions within genes that are specifically intolerant of missense variation [1-12]. Here, we leverage the patterns of rare missense variation in 125,748 individuals in the Genome Aggregation Database (gnomAD) [13] against a null mutational model to identify transcripts that display regional differences in missense constraint. Missense-depleted regions are enriched for ClinVar [14] pathogenic variants, de novo missense variants from individuals with neurodevelopmental disorders (NDDs) [15,16], and complex trait heritability. Following ClinGen calibration recommendations for the ACMG/AMP guidelines, we establish that regions with less than 20% of their expected missense variation achieve moderate support for pathogenicity. We create a missense deleteriousness metric (MPC) that incorporates regional constraint and outperforms other deleteriousness scores at stratifying case and control de novo missense variation, with a strong enrichment in NDDs. These results provide additional tools to aid in missense variant interpretation.
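
The "less than 20% of expected missense variation" criterion amounts to an observed/expected ratio check, sketched below. The expected count would come from the null mutational model, which is not reproduced here; the function names and the example counts used to exercise them are hypothetical, and only the 0.2 threshold is taken from the abstract.

```python
def missense_obs_exp(observed: int, expected: float) -> float:
    """Ratio of observed to expected missense variants in a region."""
    if expected <= 0:
        raise ValueError("expected must be positive")
    return observed / expected


def is_strongly_depleted(observed: int, expected: float, threshold: float = 0.2) -> bool:
    """True when the region carries less than `threshold` of its expected variation."""
    return missense_obs_exp(observed, expected) < threshold
```

For instance, a region with 3 observed variants against 20 expected has a ratio of 0.15 and would fall under the threshold, whereas 5 observed against 20 expected (0.25) would not.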

14.
bioRxiv ; 2023 Jun 28.
Article in English | MEDLINE | ID: mdl-37425834

ABSTRACT

DNA sample contamination is a major issue in clinical and research applications of whole genome and exome sequencing. Even modest levels of contamination can substantially affect the overall quality of variant calls and lead to widespread genotyping errors. Currently, popular tools for estimating the contamination level use short-read data (BAM/CRAM files), which are expensive to store and manipulate and often not retained or shared widely. We propose a new metric to estimate DNA sample contamination from variant-level whole genome and exome sequence data, CHARR, Contamination from Homozygous Alternate Reference Reads, which leverages the infiltration of reference reads within homozygous alternate variant calls. CHARR uses a small proportion of variant-level genotype information and thus can be computed from single-sample gVCFs or callsets in VCF or BCF formats, as well as efficiently stored variant calls in Hail VDS format. Our results demonstrate that CHARR accurately recapitulates results from existing tools with substantially reduced costs, improving the accuracy and efficiency of downstream analyses of ultra-large whole genome and exome sequencing datasets.

15.
Cell Genom ; 2(9): 100168, 2022 Sep 14.
Article in English | MEDLINE | ID: mdl-36778668

ABSTRACT

Genome-wide association studies have successfully discovered thousands of common variants associated with human diseases and traits, but the landscape of rare variations in human disease has not been explored at scale. Exome-sequencing studies of population biobanks provide an opportunity to systematically evaluate the impact of rare coding variations across a wide range of phenotypes to discover genes and allelic series relevant to human health and disease. Here, we present results from systematic association analyses of 4,529 phenotypes using single-variant and gene tests of 394,841 individuals in the UK Biobank with exome-sequence data. We find that the discovery of genetic associations is tightly linked to frequency and is correlated with metrics of deleteriousness and natural selection. We highlight biological findings elucidated by these data and release the dataset as a public resource alongside the Genebass browser for rapidly exploring rare-variant association results.

17.
Nat Genet ; 51(1): 63-75, 2019 Jan.
Article in English | MEDLINE | ID: mdl-30478444

ABSTRACT

Attention deficit/hyperactivity disorder (ADHD) is a highly heritable childhood behavioral disorder affecting 5% of children and 2.5% of adults. Common genetic variants contribute substantially to ADHD susceptibility, but no variants have been robustly associated with ADHD. We report a genome-wide association meta-analysis of 20,183 individuals diagnosed with ADHD and 35,191 controls that identifies variants surpassing genome-wide significance in 12 independent loci, finding important new information about the underlying biology of ADHD. Associations are enriched in evolutionarily constrained genomic regions and loss-of-function intolerant genes and around brain-expressed regulatory marks. Analyses of three replication studies: a cohort of individuals diagnosed with ADHD, a self-reported ADHD sample and a meta-analysis of quantitative measures of ADHD symptoms in the population, support these findings while highlighting study-specific differences on genetic overlap with educational attainment. Strong concordance with GWAS of quantitative population measures of ADHD symptoms supports that clinical diagnosis of ADHD is an extreme expression of continuous heritable traits.


Subject(s)
Attention Deficit Disorder with Hyperactivity/genetics; Genetic Loci/genetics; Genetic Predisposition to Disease/genetics; Polymorphism, Single Nucleotide/genetics; Adolescent; Brain/physiology; Child; Child, Preschool; Cohort Studies; Female; Gene Expression Regulation/genetics; Genome-Wide Association Study/methods; Humans; Male; Risk
18.
Nat Commun ; 9(1): 3391, 2018 Aug 23.
Article in English | MEDLINE | ID: mdl-30140000

ABSTRACT

Large-scale deep-coverage whole-genome sequencing (WGS) is now feasible and offers potential advantages for locus discovery. We perform WGS in 16,324 participants from four ancestries at mean depth >29X and analyze genotypes with four quantitative traits: plasma total cholesterol, low-density lipoprotein cholesterol (LDL-C), high-density lipoprotein cholesterol, and triglycerides. Common variant association yields known loci except for a few variants previously poorly imputed. Rare coding variant association yields known Mendelian dyslipidemia genes, but rare non-coding variant association detects no signals. A high 2M-SNP LDL-C polygenic score (top 5th percentile) confers similar effect size to a monogenic mutation (~30 mg/dl higher for each); however, among those with severe hypercholesterolemia, 23% have a high polygenic score and only 2% carry a monogenic mutation. At these sample sizes and for these phenotypes, the incremental value of WGS for discovery is limited, but WGS permits simultaneous assessment of monogenic and polygenic models to severe hypercholesterolemia.


Subject(s)
Genome, Human; High-Throughput Nucleotide Sequencing/methods; Lipids/blood; Base Sequence; Cholesterol, LDL/genetics; Gene Frequency/genetics; Genome-Wide Association Study; Humans; Models, Genetic; Mutation/genetics
20.
Nat Commun ; 9(1): 2606, 2018 Jul 04.
Article in English | MEDLINE | ID: mdl-29973585

ABSTRACT

Lipoprotein(a), Lp(a), is a modified low-density lipoprotein particle that contains apolipoprotein(a), encoded by LPA, and is a highly heritable, causal risk factor for cardiovascular diseases that varies in concentrations across ancestries. Here, we use deep-coverage whole genome sequencing in 8392 individuals of European and African ancestry to discover and interpret both single-nucleotide variants and copy number (CN) variation associated with Lp(a). We observe that genetic determinants between Europeans and Africans have several unique determinants. The common variant rs12740374 associated with Lp(a) cholesterol is an eQTL for SORT1 and independent of LDL cholesterol. Observed associations of aggregates of rare non-coding variants are largely explained by LPA structural variation, namely the LPA kringle IV 2 (KIV2)-CN. Finally, we find that LPA risk genotypes confer greater relative risk for incident atherosclerotic cardiovascular diseases compared to directly measured Lp(a), and are significantly associated with measures of subclinical atherosclerosis in African Americans.


Subject(s)
Cardiovascular Diseases/genetics; DNA Copy Number Variations; Genome, Human; Lipoprotein(a)/genetics; Polymorphism, Single Nucleotide; Adaptor Proteins, Vesicular Transport/blood; Adaptor Proteins, Vesicular Transport/genetics; Black People; Cardiovascular Diseases/blood; Cardiovascular Diseases/diagnosis; Cardiovascular Diseases/ethnology; Cholesterol, LDL/blood; Gene Expression; Genome-Wide Association Study; Humans; Lipoprotein(a)/blood; Quantitative Trait Loci; Risk Factors; White People; Whole Genome Sequencing