ABSTRACT
Although base editors are widely used to install targeted point mutations, the factors that determine base editing outcomes are not well understood. We characterized sequence-activity relationships of 11 cytosine and adenine base editors (CBEs and ABEs) on 38,538 genomically integrated targets in mammalian cells and used the resulting outcomes to train BE-Hive, a machine learning model that accurately predicts base editing genotypic outcomes (R ≈ 0.9) and efficiency (R ≈ 0.7). We corrected 3,388 disease-associated SNVs with ≥90% precision, including 675 alleles with bystander nucleotides that BE-Hive correctly predicted would not be edited. We discovered determinants of previously unpredictable C-to-G or C-to-A editing and used these discoveries to correct coding sequences of 174 pathogenic transversion SNVs with ≥90% precision. Finally, we used insights from BE-Hive to engineer novel CBE variants that modulate editing outcomes. These discoveries illuminate base editing, enable editing at previously intractable targets, and provide new base editors with improved editing capabilities.
Subject(s)
Gene Editing/methods; Machine Learning; Animals; Gene Library; Humans; Mice; Mouse Embryonic Stem Cells/cytology; Mouse Embryonic Stem Cells/metabolism; Point Mutation; RNA, Guide, Kinetoplastida/metabolism
ABSTRACT
While pathogenic variants can significantly increase disease risk, it is still challenging to estimate the clinical impact of rare missense variants more generally. Even in genes such as BRCA2 or PALB2, large cohort studies find no significant association between breast cancer and rare missense variants collectively. Here, we introduce REGatta, a method to estimate clinical risk from variants in smaller segments of individual genes. We first define these regions by using the density of pathogenic diagnostic reports and then calculate the relative risk in each region by using over 200,000 exome sequences in the UK Biobank. We apply this method in 13 genes with established roles across several monogenic disorders. In genes with no significant difference at the gene level, this approach significantly separates disease risk for individuals with rare missense variants at higher or lower risk (BRCA2 regional model OR = 1.46 [1.12, 1.79], p = 0.0036 vs. BRCA2 gene model OR = 0.96 [0.85, 1.07] p = 0.4171). We find high concordance between these regional risk estimates and high-throughput functional assays of variant impact. We compare our method with existing methods and the use of protein domains (Pfam) as regions and find REGatta better identifies individuals at elevated or reduced risk. These regions provide useful priors and are potentially useful for improving risk assessment for genes associated with monogenic diseases.
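The regional and gene-level comparisons above are reported as odds ratios with intervals. As a minimal sketch of how such a figure can be obtained from a 2x2 table of carriers and non-carriers, the function below computes an odds ratio with a log-scale Wald interval. The function name, the counts, and the choice of a Wald interval are assumptions for illustration; the abstract does not state how the authors derived their intervals.

```python
from math import exp, log, sqrt


def odds_ratio_wald(a, b, c, d, z=1.96):
    """Odds ratio and Wald interval from a 2x2 table.

    a: carriers with disease      b: carriers without disease
    c: non-carriers with disease  d: non-carriers without disease
    z: normal quantile (1.96 gives an approximate 95% interval)
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive")
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the usual large-sample approximation.
    se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = exp(log(odds_ratio) - z * se_log_or)
    upper = exp(log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts, not taken from the study.
estimate, lower, upper = odds_ratio_wald(30, 970, 20, 980)
print(f"OR = {estimate:.2f} [{lower:.2f}, {upper:.2f}]")
```

An interval that excludes 1 corresponds to a nominally significant difference in risk between carriers and non-carriers for that region.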
Subject(s)
Breast Neoplasms; Genetic Predisposition to Disease; Humans; Female; BRCA2 Protein/genetics; Mutation, Missense; Sequence Analysis, DNA; Breast Neoplasms/genetics; Breast Neoplasms/pathology; Cohort Studies
ABSTRACT
In this Article, a data processing error affected Fig. 3e and Extended Data Table 2; these errors have been corrected online.
ABSTRACT
Following Cas9 cleavage, DNA repair without a donor template is generally considered stochastic, heterogeneous and impractical beyond gene disruption. Here, we show that template-free Cas9 editing is predictable and capable of precise repair to a predicted genotype, enabling correction of disease-associated mutations in humans. We constructed a library of 2,000 Cas9 guide RNAs paired with DNA target sites and trained inDelphi, a machine learning model that predicts genotypes and frequencies of 1- to 60-base-pair deletions and 1-base-pair insertions with high accuracy (r = 0.87) in five human and mouse cell lines. inDelphi predicts that 5-11% of Cas9 guide RNAs targeting the human genome are 'precise-50', yielding a single genotype comprising greater than or equal to 50% of all major editing products. We experimentally confirmed precise-50 insertions and deletions in 195 human disease-relevant alleles, including correction in primary patient-derived fibroblasts of pathogenic alleles to wild-type genotype for Hermansky-Pudlak syndrome and Menkes disease. This study establishes an approach for precise, template-free genome editing.
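The 'precise-50' criterion described above (a single genotype making up at least 50% of major editing products) can be expressed as a short check. This is an illustrative sketch only: the function name, the input format (a mapping from genotype label to read count), and the example counts are assumptions, and it does not reproduce how inDelphi itself computes its predictions.

```python
def is_precise_50(genotype_counts, threshold=0.5):
    """Return True if one genotype accounts for >= threshold of all editing products.

    genotype_counts: dict mapping a genotype label to its observed or predicted count.
    """
    total = sum(genotype_counts.values())
    if total <= 0:
        raise ValueError("no editing products supplied")
    top_fraction = max(genotype_counts.values()) / total
    return top_fraction >= threshold


# Hypothetical outcome distribution for one guide RNA.
outcomes = {"1-bp insertion": 620, "3-bp deletion": 200, "7-bp deletion": 180}
print(is_precise_50(outcomes))
```

Passing counts rather than pre-normalized frequencies keeps the comparison at the threshold exact for the common case of integer read counts.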
Subject(s)
CRISPR-Cas Systems/genetics; Gene Editing/methods; Gene Editing/standards; Hermanski-Pudlak Syndrome/genetics; Machine Learning; Menkes Kinky Hair Syndrome/genetics; Templates, Genetic; Alleles; Base Sequence; CRISPR-Associated Protein 9/metabolism; DNA Repair/genetics; Fibroblasts/metabolism; Fibroblasts/pathology; HCT116 Cells; HEK293 Cells; Hermanski-Pudlak Syndrome/pathology; Humans; K562 Cells; Menkes Kinky Hair Syndrome/pathology; Reproducibility of Results; Substrate Specificity
ABSTRACT
PURPOSE: This study aimed to explore whether evidence of pathogenicity from prior variant classifications in ClinVar could be used to inform variant interpretation using the American College of Medical Genetics and Genomics/Association for Molecular Pathology clinical guidelines. METHODS: We identified distinct single-nucleotide variants (SNVs) that are either similar in location or in functional consequence to pathogenic variants in ClinVar and analyzed evidence in support of pathogenicity using 3 interpretation criteria. RESULTS: Thousands of variants, including many in clinically actionable disease genes (American College of Medical Genetics and Genomics secondary findings v3.0), have evidence of pathogenicity from existing variant classifications, accounting for 2.5% of nonsynonymous SNVs within ClinVar. Notably, there are many variants with uncertain or conflicting classifications that cause the same amino acid substitution as other pathogenic variants (PS1, N = 323), variants that are predicted to cause different amino acid substitutions in the same codon as pathogenic variants (PM5, N = 7692), and loss-of-function variants that are present in genes in which many loss-of-function variants are classified as pathogenic (PVS1, N = 3635). Most of these variants have similar computational predictions of pathogenicity and splicing effect as their associated pathogenic variants. CONCLUSION: Broadly, for >1.4 million SNVs exome wide, information from previously classified variants could be used to provide evidence of pathogenicity. We have developed a pipeline to identify variants meeting these criteria that may inform interpretation efforts.
Subject(s)
Genetic Testing; Genomics; Humans; Exome; RNA Splicing; Pathology, Molecular; Genetic Variation/genetics
ABSTRACT
Restoring gene function by the induced skipping of deleterious exons has been shown to be effective for treating genetic disorders. However, many of the clinically successful therapies for exon skipping are transient oligonucleotide-based treatments that require frequent dosing. CRISPR-Cas9 based genome editing that causes exon skipping is a promising therapeutic modality that may offer permanent alleviation of genetic disease. We show that machine learning can select Cas9 guide RNAs that disrupt splice acceptors and cause the skipping of targeted exons. We experimentally measured the exon skipping frequencies of a diverse genome-integrated library of 791 splice sequences targeted by 1,063 guide RNAs in mouse embryonic stem cells. We found that our method, SkipGuide, is able to identify effective guide RNAs with a precision of 0.68 (50% threshold predicted exon skipping frequency) and 0.93 (70% threshold predicted exon skipping frequency). We anticipate that SkipGuide will be useful for selecting guide RNA candidates for evaluation of CRISPR-Cas9-mediated exon skipping therapy.
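The precision figures above are reported at thresholds on the predicted exon skipping frequency. A minimal sketch of that kind of calculation follows. It assumes, purely for illustration, that a guide counts as truly effective when its measured skipping frequency meets the same threshold; the abstract does not state the exact definition, and the function name and data are hypothetical.

```python
def precision_at_threshold(predicted, measured, threshold):
    """Fraction of guides predicted at or above threshold whose measured value also meets it.

    predicted, measured: equal-length sequences of skipping frequencies in [0, 1].
    Returns None when no guide is predicted at or above the threshold.
    """
    if len(predicted) != len(measured):
        raise ValueError("predicted and measured must have the same length")
    selected = [m for p, m in zip(predicted, measured) if p >= threshold]
    if not selected:
        return None
    true_positives = sum(1 for m in selected if m >= threshold)
    return true_positives / len(selected)


# Hypothetical predictions and measurements for five guide RNAs.
predicted = [0.90, 0.80, 0.60, 0.55, 0.30]
measured = [0.85, 0.75, 0.70, 0.20, 0.90]
print(precision_at_threshold(predicted, measured, 0.5))
print(precision_at_threshold(predicted, measured, 0.7))
```

Raising the threshold selects fewer guides, which is consistent with the higher precision reported at the stricter cutoff.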
Subject(s)
CRISPR-Cas Systems/genetics; Gene Editing/methods; Genetic Therapy/methods; Machine Learning; RNA, Guide, Kinetoplastida/genetics; Animals; Cells, Cultured; Embryonic Stem Cells; Exons; Gene Library; Humans; Mice
ABSTRACT
Patterns of amino acid conservation have served as a tool for understanding protein evolution. The same principles have also found broad application in human genomics, driven by the need to interpret the pathogenic potential of variants in patients. Here we performed a systematic comparative genomics analysis of human disease-causing missense variants. We found that an appreciable fraction of disease-causing alleles are fixed in the genomes of other species, suggesting a role for genomic context. We developed a model of genetic interactions that predicts most of these to be simple pairwise compensations. Functional testing of this model on two known human disease genes revealed discrete cis amino acid residues that, although benign on their own, could rescue the human mutations in vivo. This approach was also applied to ab initio gene discovery to support the identification of a de novo disease driver in BTG2 that is subject to protective cis-modification in more than 50 species. Finally, on the basis of our data and models, we developed a computational tool to predict candidate residues subject to compensation. Taken together, our data highlight the importance of cis-genomic context as a contributor to protein evolution; they provide an insight into the complexity of allele effect on phenotype; and they are likely to assist methods for predicting allele pathogenicity.
Subject(s)
Disease/genetics; Genomics; Mutation, Missense/genetics; Suppression, Genetic/genetics; Adaptor Proteins, Signal Transducing/genetics; Alleles; Animals; Evolution, Molecular; Genome, Human/genetics; Humans; Immediate-Early Proteins/genetics; Microcephaly/genetics; Microtubule-Associated Proteins; Phenotype; Proteins/genetics; Sequence Alignment; Tumor Suppressor Proteins/genetics
ABSTRACT
PURPOSE: Over 150,000 variants have been reported to cause Mendelian disease in the medical literature. It is still difficult to leverage this knowledge base in clinical practice, as many reports lack strong statistical evidence or may include false associations. Clinical laboratories assess whether these variants (along with newly observed variants that are adjacent to these published ones) underlie clinical disorders. METHODS: We investigated whether citation data, including journal impact factor and the number of cited variants (NCV) in each gene with published disease associations, can be used to improve variant assessment. RESULTS: Surprisingly, we found that impact factor is not predictive of pathogenicity, but the NCV score for each gene can provide statistical support for prediction of pathogenicity. When this gene-level citation metric is combined with variant-level evolutionary conservation and structural features, classification accuracy reaches 89.5%. Further, variants identified in clinical exome sequencing cases have higher NCVs than do simulated rare variants from the Exome Aggregation Consortium database within the same set of genes and functional consequences (P < 2.22 × 10⁻¹⁶). CONCLUSION: Aggregate citation data can complement existing variant-based predictive algorithms, and can boost their performance without the need to access and review large numbers of papers. The NCV is a slow-growing metric of scientific knowledge about each gene's association with disease.
Subject(s)
Computational Biology/methods; Genome-Wide Association Study/methods; Algorithms; Databases, Genetic; Forecasting; Genetic Variation; Humans; Journal Impact Factor
ABSTRACT
Population bottlenecks followed by re-expansions have been common throughout the history of many populations. The response of alleles under selection to such demographic perturbations has been a subject of great interest in population genetics. On the basis of theoretical analysis and computer simulations, we suggest that this response qualitatively depends on dominance. The number of dominant or additive deleterious alleles per haploid genome is expected to be slightly increased following the bottleneck and re-expansion. In contrast, the number of completely or partially recessive alleles should be sharply reduced. Changes of population size expose differences between recessive and additive selection, potentially providing insight into the prevalence of dominance in natural populations. Specifically, we use a simple statistic, [Formula: see text], where x_i represents the derived allele frequency, to compare the number of mutations in different populations, and detail its functional dependence on the strength of selection and the intensity of the population bottleneck. We also provide empirical evidence showing that gene sets associated with autosomal recessive disease in humans may have a BR indicative of recessive selection. Together, these theoretical predictions and empirical observations show that complex demographic history may facilitate rather than impede inference of parameters of natural selection.
Subject(s)
Gene Frequency/genetics; Genes, Dominant/genetics; Genetics, Population/statistics & numerical data; Population Dynamics/statistics & numerical data; Animals; Biological Evolution; Computer Simulation; Humans; Models, Genetic; Models, Statistical; Selection, Genetic
ABSTRACT
Clinical sequencing is expanding, but causal variants are still not identified in the majority of cases. These unsolved cases can aid in gene discovery when individuals with similar phenotypes are identified in systems such as the Matchmaker Exchange. We describe risks for gene discovery in this growing set of unsolved cases. In a set of rare disease cases with the same phenotype, it is not difficult to find two individuals with the same phenotype that carry variants in the same gene. We quantify the risk of false-positive association in a cohort of individuals with the same phenotype, using the prior probability of observing a variant in each gene from over 60,000 individuals (Exome Aggregation Consortium). Based on the number of individuals with a genic variant, cohort size, specific gene, and mode of inheritance, we calculate a P value that the match represents a true association. A match in two of 10 patients in MECP2 is statistically significant (P = 0.0014), whereas a match in TTN would not reach significance, as expected (P > 0.999). Finally, we analyze the probability of matching in clinical exome cases to estimate the number of cases needed to identify genes related to different disorders. We offer Rare Disease Match, an online tool to mitigate the uncertainty of false-positive associations.
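The intuition behind the MECP2 versus TTN contrast above can be illustrated with a binomial tail probability: the chance that at least k of n unrelated individuals carry a qualifying variant in a gene purely by chance. This is a simplified sketch, not the Rare Disease Match calculation, which also accounts for the specific gene and mode of inheritance. The function name and the per-individual carrier probabilities are hypothetical.

```python
from math import comb


def prob_at_least_k_carriers(n, k, q):
    """P(X >= k) for X ~ Binomial(n, q).

    n: cohort size
    k: number of individuals observed with a variant in the gene
    q: assumed prior probability that one individual carries such a variant
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be a probability between 0 and 1")
    if not 0 <= k <= n:
        raise ValueError("k must be between 0 and n")
    return sum(comb(n, i) * q**i * (1 - q) ** (n - i) for i in range(k, n + 1))


# Hypothetical carrier probabilities: a rarely varying gene versus a highly variable one.
print(prob_at_least_k_carriers(10, 2, 0.001))
print(prob_at_least_k_carriers(10, 2, 0.6))
```

With a small carrier probability, two matches in ten patients are very unlikely by chance; with a large one, such a match is close to expected, which is why a match in a large, variable gene carries little evidential weight.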
Subject(s)
Computational Biology/methods; Genetic Association Studies/methods; Rare Diseases/genetics; Algorithms; Databases, Genetic; Exome; False Positive Reactions; Genetic Variation; Humans; Phenotype; Web Browser
ABSTRACT
There is an emerging consensus that when investigators obtain genomic data from research participants, they may incur an ethical responsibility to inform at-risk individuals about clinically significant variants discovered during the course of their research. With whole-exome sequencing becoming commonplace and the falling costs of full-genome sequencing, there will be an increasingly large number of variants identified in research participants that may be of sufficient clinical relevance to share. An explicit approach to triaging and communicating these results has yet to be developed, and even the magnitude of the task is uncertain. To develop an estimate of the number of variants that might qualify for disclosure, we apply recently published recommendations for the return of results to a defined and representative set of variants and then extrapolate these estimates to genome scale. We find that the total number of variants meeting the threshold for recommended disclosure ranges from 3955-12,579 (3.79%-12.06%, 95% CI) in the most conservative estimate to 6998-17,189 (6.69%-16.48%, 95% CI) in an estimate including variants with variable disease expressivity. Additionally, if the growth rate from the previous 4 yr continues, we estimate that the total number of disease-associated variants will grow 37% over the next 4 yr.
Subject(s)
Disclosure/ethics; Genetic Privacy/ethics; Disclosure/legislation & jurisprudence; Ethics, Research; Genetic Counseling/ethics; Genetic Counseling/legislation & jurisprudence; Genetic Privacy/legislation & jurisprudence; Genetic Research/ethics; Genetic Research/legislation & jurisprudence; Genetic Variation; Genome-Wide Association Study; Humans; United States
ABSTRACT
Deep mutational scanning enables high-throughput functional assessment of genetic variants. While phenotypic measurements from screening assays generally align with clinical outcomes, experimental noise may affect the accuracy of individual variant estimates. We developed the FUSE (functional substitution estimation) pipeline, which leverages measurements collectively within screening assays to improve the estimation of variant impacts. Drawing data from 115 published functional assays, FUSE assesses the mean functional effect per amino acid position and makes estimates for individual allelic variants. It enhances the correlation of variant functional effects from different assay platforms and increases the classification accuracy of missense variants in ClinVar across 29 genes (area under the receiver operating characteristic [ROC] curve [AUC] from 0.83 to 0.90). In UK Biobank patients with rare missense variants in BRCA1, LDLR, or TP53, FUSE improves the classification accuracy of associated phenotypes. FUSE can also impute variant effects for substitutions not experimentally screened. This approach improves accuracy and broadens the utility of data from functional screening.
Subject(s)
BRCA1 Protein; Humans; BRCA1 Protein/genetics; Tumor Suppressor Protein p53/genetics; Receptors, LDL/genetics; Mutation, Missense; Phenotype; Genetic Variation/genetics
ABSTRACT
CRISPR base editing screens enable analysis of disease-associated variants at scale; however, variable efficiency and precision confound the assessment of variant-induced phenotypes. Here, we provide an integrated experimental and computational pipeline that improves estimation of variant effects in base editing screens. We use a reporter construct to measure guide RNA (gRNA) editing outcomes alongside their phenotypic consequences and introduce base editor screen analysis with activity normalization (BEAN), a Bayesian network that uses per-guide editing outcomes provided by the reporter and target site chromatin accessibility to estimate variant impacts. BEAN outperforms existing tools in variant effect quantification. We use BEAN to pinpoint common regulatory variants that alter low-density lipoprotein (LDL) uptake, implicating previously unreported genes. Additionally, through saturation base editing of LDLR, we accurately quantify missense variant pathogenicity that is consistent with measurements in UK Biobank patients and identify underlying structural mechanisms. This work provides a widely applicable approach to improve the power of base editing screens for disease-associated variant characterization.
Subject(s)
CRISPR-Cas Systems; Gene Editing; Genotype; Phenotype; RNA, Guide, CRISPR-Cas Systems; Humans; Gene Editing/methods; RNA, Guide, CRISPR-Cas Systems/genetics; Bayes Theorem; Receptors, LDL/genetics; HEK293 Cells
ABSTRACT
It is now affordable to order clinically interpreted whole-genome sequence reports from clinical laboratories. One major component of these reports is derived from the knowledge base of previously identified pathogenic variants, including research articles, locus-specific, and other databases. While over 150,000 such pathogenic variants have been identified, many of these were originally discovered in small cohort studies of affected individuals, so their applicability to asymptomatic populations is unclear. We analyzed the prevalence of a large set of pathogenic variants from the medical and scientific literature in a large set of asymptomatic individuals (N = 1,092) and found 8.5% of these pathogenic variants in at least one individual. In the average individual in the 1000 Genomes Project, previously identified pathogenic variants occur on average 294 times (σ = 25.5) in homozygous form and 942 times (σ = 68.2) in heterozygous form. We also find that many of these pathogenic variants are frequently occurring: there are 3,744 variants with minor allele frequency (MAF) ≥ 0.01 (4.6%) and 2,837 variants with MAF ≥ 0.05 (3.5%). This indicates that many of these variants may be erroneous findings or have lower penetrance than previously expected.
Subject(s)
Gene Frequency; Genetic Variation; Sequence Analysis, DNA; Databases, Genetic; Genome, Human; Genotype; Heterozygote; Homozygote; Humans; Incidental Findings; Penetrance
ABSTRACT
While pathogenic variants significantly increase disease risk in many genes, it is still challenging to estimate the clinical impact of rare missense variants more generally. Even in genes such as BRCA2 or PALB2, large cohort studies find no significant association between breast cancer and rare germline missense variants collectively. Here we introduce REGatta, a method to improve the estimation of clinical risk in gene segments. We define gene regions using the density of pathogenic diagnostic reports, and then calculate the relative risk in each of these regions using 109,581 exome sequences from women in the UK Biobank. We apply this method in seven established breast cancer genes, and identify regions in each gene with statistically significant differences in breast cancer incidence for rare missense carriers. Even in genes with no significant difference at the gene level, this approach significantly separates rare missense variant carriers at higher or lower risk (BRCA2 regional model OR = 1.46 [1.12, 1.79], p = 0.0036 vs. BRCA2 gene model OR = 0.96 [0.85, 1.07], p = 0.4171). We find high concordance between these regional risk estimates and high-throughput functional assays of variant impact. We compare with existing methods and the use of protein domains (Pfam) as regions, and find REGatta better identifies individuals at elevated or reduced risk. These regions provide useful priors that can potentially be used to improve risk assessment and clinical management.
ABSTRACT
Deep mutational scanning assays enable the functional assessment of variants in high throughput. Phenotypic measurements from these assays are broadly concordant with clinical outcomes but are prone to noise at the individual variant level. We develop a framework to exploit related measurements within and across experimental assays to jointly estimate variant impact. Drawing from a large corpus of deep mutational scanning data, we collectively estimate the mean functional effect per amino acid residue position within each gene, normalize observed functional effects by substitution type, and make estimates for individual allelic variants with a pipeline called FUSE (Functional Substitution Estimation). FUSE improves the correlation of functional screening datasets covering the same variants, better separates estimated functional impacts for known pathogenic and benign variants (ClinVar BRCA1, p = 2.24 × 10⁻⁵¹), and increases the number of variants for which predictions can be made (2,741 to 10,347) by inferring additional variant effects for substitutions not experimentally screened. For UK Biobank patients who carry a rare variant in TP53, FUSE significantly improves the separation of patients who develop cancer syndromes from those without cancer (p = 1.77 × 10⁻⁶). These approaches promise to improve estimates of variant impact and broaden the utility of screening data generated from functional assays.
ABSTRACT
Despite the increasing use of genomic sequencing in clinical practice, the interpretation of rare genetic variants remains challenging even in well-studied disease genes, resulting in many patients with Variants of Uncertain Significance (VUSs). Computational Variant Effect Predictors (VEPs) provide valuable evidence in variant assessment, but they are prone to misclassifying benign variants, contributing to false positives. Here, we develop Deciphering Mutations in Actionable Genes (DeMAG), a supervised classifier for missense variants trained using extensive diagnostic data available in 59 actionable disease genes (American College of Medical Genetics and Genomics Secondary Findings v2.0, ACMG SF v2.0). DeMAG improves performance over existing VEPs by reaching balanced specificity (82%) and sensitivity (94%) on clinical data, and includes a novel epistatic feature, the 'partners score', which leverages evolutionary and structural partnerships of residues. The 'partners score' provides a general framework for modeling epistatic interactions, integrating both clinical and functional information. We provide our tool and predictions for all missense variants in 316 clinically actionable disease genes (demag.org) to facilitate the interpretation of variants and improve clinical decision-making.
Subject(s)
Genomics; Mutation, Missense; Humans; United States; Genomics/methods; Genetic Variation; Genetic Testing/methods
ABSTRACT
CRISPR base editing screens are powerful tools for studying disease-associated variants at scale. However, the efficiency and precision of base editing perturbations vary, confounding the assessment of variant-induced phenotypic effects. Here, we provide an integrated pipeline that improves the estimation of variant impact in base editing screens. We perform high-throughput ABE8e-SpRY base editing screens with an integrated reporter construct to measure the editing efficiency and outcomes of each gRNA alongside their phenotypic consequences. We introduce BEAN, a Bayesian network that accounts for per-guide editing outcomes and target site chromatin accessibility to estimate variant impacts. We show this pipeline attains superior performance compared to existing tools in variant classification and effect size quantification. We use BEAN to pinpoint common variants that alter LDL uptake, implicating novel genes. Additionally, through saturation base editing of LDLR, we enable accurate quantitative prediction of the effects of missense variants on LDL-C levels, which aligns with measurements in UK Biobank individuals, and identify structural mechanisms underlying variant pathogenicity. This work provides a widely applicable approach to improve the power of base editor screens for disease-associated variant characterization.
ABSTRACT
Genetic variation contributes greatly to LDL cholesterol (LDL-C) levels and coronary artery disease risk. By combining analysis of rare coding variants from the UK Biobank and genome-scale CRISPR-Cas9 knockout and activation screening, we have substantially improved the identification of genes whose disruption alters serum LDL-C levels. We identify 21 genes in which rare coding variants significantly alter LDL-C levels at least partially through altered LDL-C uptake. We use co-essentiality-based gene module analysis to show that dysfunction of the RAB10 vesicle transport pathway leads to hypercholesterolemia in humans and mice by impairing surface LDL receptor levels. Further, we demonstrate that loss of function of OTX2 leads to robust reduction in serum LDL-C levels in mice and humans by increasing cellular LDL-C uptake. Altogether, we present an integrated approach that improves our understanding of genetic regulators of LDL-C levels and provides a roadmap for further efforts to dissect complex human disease genetics.
ABSTRACT
Genetic variation contributes greatly to LDL cholesterol (LDL-C) levels and coronary artery disease risk. By combining analysis of rare coding variants from the UK Biobank and genome-scale CRISPR-Cas9 knockout and activation screening, we substantially improve the identification of genes whose disruption alters serum LDL-C levels. We identify 21 genes in which rare coding variants significantly alter LDL-C levels at least partially through altered LDL-C uptake. We use co-essentiality-based gene module analysis to show that dysfunction of the RAB10 vesicle transport pathway leads to hypercholesterolemia in humans and mice by impairing surface LDL receptor levels. Further, we demonstrate that loss of function of OTX2 leads to robust reduction in serum LDL-C levels in mice and humans by increasing cellular LDL-C uptake. Altogether, we present an integrated approach that improves our understanding of the genetic regulators of LDL-C levels and provides a roadmap for further efforts to dissect complex human disease genetics.