ABSTRACT
To maintain genome integrity, cells must accurately duplicate their genome and repair DNA lesions when they occur. To uncover genes that suppress DNA damage in human cells, we undertook flow-cytometry-based CRISPR-Cas9 screens that monitored DNA damage. We identified 160 genes whose mutation caused spontaneous DNA damage, a list enriched in essential genes, highlighting the importance of genomic integrity for cellular fitness. We also identified 227 genes whose mutation caused DNA damage in replication-perturbed cells. Among the genes characterized, we discovered that deoxyribose-phosphate aldolase DERA suppresses DNA damage caused by cytarabine (Ara-C) and that GNB1L, a gene implicated in 22q11.2 syndrome, promotes biogenesis of ATR and related phosphatidylinositol 3-kinase-related kinases (PIKKs). These results implicate defective PIKK biogenesis as a cause of some phenotypes associated with 22q11.2 syndrome. The phenotypic mapping of genes that suppress DNA damage therefore provides a rich resource to probe the cellular pathways that influence genome maintenance.
Subject(s)
CRISPR-Cas Systems , DNA Damage , Humans , Mutation , DNA Repair , Phenotype
ABSTRACT
Transcription factor (TF) DNA sequence preferences direct their regulatory activity, but are currently known for only ~1% of eukaryotic TFs. Broadly sampling DNA-binding domain (DBD) types from multiple eukaryotic clades, we determined DNA sequence preferences for >1,000 TFs encompassing 54 different DBD classes from 131 diverse eukaryotes. We find that closely related DBDs almost always have very similar DNA sequence preferences, enabling inference of motifs for ~34% of the ~170,000 known or predicted eukaryotic TFs. Sequences matching both measured and inferred motifs are enriched in chromatin immunoprecipitation sequencing (ChIP-seq) peaks and upstream of transcription start sites in diverse eukaryotic lineages. SNPs defining expression quantitative trait loci in Arabidopsis promoters are also enriched for predicted TF binding sites. Importantly, our motif "library" can be used to identify specific TFs whose binding may be altered by human disease risk alleles. These data present a powerful resource for mapping transcriptional networks across eukaryotes.
Subject(s)
Arabidopsis/genetics , Nucleotide Motifs , Sequence Analysis, DNA , Transcription Factors/metabolism , Arabidopsis/metabolism , Chromatin Immunoprecipitation , Humans , Polymorphism, Single Nucleotide , Promoter Regions, Genetic , Protein Binding , Quantitative Trait Loci
ABSTRACT
Global insights into cellular organization and genome function require comprehensive understanding of the interactome networks that mediate genotype-phenotype relationships. Here we present a human 'all-by-all' reference interactome map of human binary protein interactions, or 'HuRI'. With approximately 53,000 protein-protein interactions, HuRI has approximately four times as many such interactions as there are high-quality curated interactions from small-scale studies. The integration of HuRI with genome, transcriptome and proteome data enables cellular function to be studied within most physiological or pathological cellular contexts. We demonstrate the utility of HuRI in identifying the specific subcellular roles of protein-protein interactions. Inferred tissue-specific networks reveal general principles for the formation of cellular context-specific functions and elucidate potential molecular mechanisms that might underlie tissue-specific phenotypes of Mendelian diseases. HuRI is a systematic proteome-wide reference that links genomic variation to phenotypic outcomes.
Subject(s)
Proteome/metabolism , Extracellular Space/metabolism , Humans , Organ Specificity , Protein Interaction Mapping
ABSTRACT
MOTIVATION: Long-read sequencing technologies, an attractive solution for many applications, often suffer from higher error rates. Alignment of multiple reads can improve base-calling accuracy, but some applications, e.g. sequencing mutagenized libraries where multiple distinct clones differ by one or few variants, require the use of barcodes or unique molecular identifiers. Unfortunately, sequencing errors can interfere with correct barcode identification, and a given barcode sequence may be linked to multiple independent clones within a given library. RESULTS: Here we focus on the target application of sequencing mutagenized libraries in the context of multiplexed assays of variant effects (MAVEs). MAVEs are increasingly used to create comprehensive genotype-phenotype maps that can aid clinical variant interpretation. Many MAVE methods use long-read sequencing of barcoded mutant libraries for accurate association of barcode with genotype. Existing long-read sequencing pipelines do not account for inaccurate sequencing or nonunique barcodes. Here, we describe Pacybara, which handles these issues by clustering long reads based on the similarities of (error-prone) barcodes while also detecting barcodes that have been associated with multiple genotypes. Pacybara also detects recombinant (chimeric) clones and reduces false positive indel calls. In three example applications, we show that Pacybara identifies and correctly resolves these issues. AVAILABILITY AND IMPLEMENTATION: Pacybara, freely available at https://github.com/rothlab/pacybara, is implemented using R, Python, and bash for Linux. It runs on GNU/Linux HPC clusters via Slurm, PBS, or GridEngine schedulers. A single-machine simplex version is also available.
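The barcode-clustering idea described above can be illustrated with a minimal sketch. This is not Pacybara's actual algorithm or API: the function names, the greedy abundance-ordered seeding, the Hamming-distance threshold, and the toy reads are all assumptions chosen for illustration. It groups error-prone barcodes around their most abundant neighbor and then flags clusters linked to more than one genotype.

```python
from collections import defaultdict


def hamming(a, b):
    """Count mismatching positions between two equal-length barcodes."""
    if len(a) != len(b):
        raise ValueError("barcodes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def cluster_barcodes(reads, max_dist=1):
    """Greedily cluster (barcode, genotype) reads by barcode similarity.

    Barcodes are visited from most to least abundant; each one joins the
    first existing seed within max_dist mismatches, or becomes a new seed.
    Returns {seed_barcode: [genotype, ...]}.
    """
    counts = defaultdict(int)
    for barcode, _ in reads:
        counts[barcode] += 1

    seeds = []
    assignment = {}
    for bc in sorted(counts, key=lambda b: (-counts[b], b)):
        for seed in seeds:
            if hamming(bc, seed) <= max_dist:
                assignment[bc] = seed
                break
        else:  # no seed was close enough, so this barcode starts a cluster
            seeds.append(bc)
            assignment[bc] = bc

    clusters = defaultdict(list)
    for barcode, genotype in reads:
        clusters[assignment[barcode]].append(genotype)
    return dict(clusters)


def flag_nonunique(clusters):
    """Return seeds whose cluster contains more than one distinct genotype."""
    return {seed for seed, gts in clusters.items() if len(set(gts)) > 1}


# Hypothetical toy data: (barcode, genotype call) per long read.
reads = [
    ("AAAA", "wt"), ("AAAA", "wt"), ("AAAT", "wt"),  # AAAT: likely read error
    ("CCCC", "A10G"), ("CCCC", "T5C"), ("CCCC", "A10G"), ("CCCC", "T5C"),
]
clusters = cluster_barcodes(reads)
nonunique = flag_nonunique(clusters)
print(clusters)
print(nonunique)
```

A real pipeline would also have to tolerate errors in the genotype calls themselves (for example by requiring a minimum number of supporting reads per genotype) and indels in the barcode, which Hamming distance does not handle.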
Subject(s)
High-Throughput Nucleotide Sequencing , Software , Sequence Analysis, DNA/methods , High-Throughput Nucleotide Sequencing/methods , Gene Library , Genotype , Cluster Analysis
ABSTRACT
SUMMARY: The promise of personalized genomic medicine depends on our ability to assess the functional impact of rare sequence variation. Multiplexed assays can experimentally measure the functional impact of missense variants on a massive scale. However, even after such assays, many missense variants remain poorly measured. Here we describe a software pipeline and application to impute missing information in experimentally determined variant effect maps. AVAILABILITY AND IMPLEMENTATION: http://impute.varianteffect.org; source code: https://github.com/joewuca/imputation. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Software , Genome , Genomics , Mutation, Missense
ABSTRACT
Condition-dependent genetic interactions can reveal functional relationships between genes that are not evident under standard culture conditions. State-of-the-art yeast genetic interaction mapping, which relies on robotic manipulation of arrays of double-mutant strains, does not scale readily to multi-condition studies. Here, we describe barcode fusion genetics to map genetic interactions (BFG-GI), by which double-mutant strains generated via en masse "party" mating can also be monitored en masse for growth to detect genetic interactions. By using site-specific recombination to fuse two DNA barcodes, each representing a specific gene deletion, BFG-GI enables multiplexed quantitative tracking of double mutants via next-generation sequencing. We applied BFG-GI to a matrix of DNA repair genes under nine different conditions, including methyl methanesulfonate (MMS), 4-nitroquinoline 1-oxide (4NQO), bleomycin, zeocin, and three other DNA-damaging environments. BFG-GI recapitulated known genetic interactions and yielded new condition-dependent genetic interactions. We validated and further explored a subnetwork of condition-dependent genetic interactions involving MAG1, SLX4, and genes encoding the Shu complex, and inferred that loss of the Shu complex leads to an increase in the activation of the checkpoint protein kinase Rad53.
Subject(s)
Chromosome Mapping , DNA Barcoding, Taxonomic , DNA Damage , Saccharomyces cerevisiae Proteins/genetics , Saccharomyces cerevisiae/genetics , DNA Repair , Epistasis, Genetic , Gene Deletion , Genetic Loci , High-Throughput Nucleotide Sequencing , Methyl Methanesulfonate , Models, Theoretical , Promoter Regions, Genetic , Reproducibility of Results
ABSTRACT
Although we now routinely sequence human genomes, we can confidently identify only a fraction of the sequence variants that have a functional impact. Here, we developed a deep mutational scanning framework that produces exhaustive maps for human missense variants by combining random codon mutagenesis and multiplexed functional variation assays with computational imputation and refinement. We applied this framework to four proteins corresponding to six human genes: UBE2I (encoding SUMO E2 conjugase), SUMO1 (small ubiquitin-like modifier), TPK1 (thiamin pyrophosphokinase), and CALM1/2/3 (three genes encoding the protein calmodulin). The resulting maps recapitulate known protein features and confidently identify pathogenic variation. Assays potentially amenable to deep mutational scanning are already available for 57% of human disease genes, suggesting that DMS could ultimately map functional variation for all human disease genes.
Subject(s)
DNA Mutational Analysis/methods , Mutation, Missense/genetics , Calmodulin/genetics , Disease/genetics , Humans , Machine Learning , Phenotype , Phylogeny , Reproducibility of Results , SUMO-1 Protein/genetics , Ubiquitin-Conjugating Enzymes/genetics , Ubiquitin-Conjugating Enzymes/metabolism
ABSTRACT
The exponential growth of genomic variants uncovered by next-generation sequencing necessitates efficient and accurate computational analyses to predict their functional effects. A number of computational methods have been developed for the task, but few unbiased comparisons of their performance are available. To fill the gap, the Critical Assessment of Genome Interpretation (CAGI) comprehensively assesses phenotypic predictions on newly collected experimental datasets. Here, we present the results of the SUMO conjugase challenge, in which participants predicted the functional effects of missense mutations in the human SUMO-conjugating enzyme UBE2I. The predictors performed similarly to one another, and all were far from perfect. Evolutionary information from sequence alignments dominates the success: deleterious mutations at conserved positions and benign mutations at variable positions are accurately predicted. Prediction accuracy for other mutations remains unsatisfactory, and this fast-growing field of research has yet to learn how to use spatial structure information to improve predictions significantly.
Subject(s)
Computational Biology/methods , Mutation, Missense , Ubiquitin-Conjugating Enzymes/genetics , Ubiquitin-Conjugating Enzymes/metabolism , Databases, Genetic , Evolution, Molecular , High-Throughput Nucleotide Sequencing , Humans , Models, Molecular , Protein Binding , Selection, Genetic , Sequence Alignment , Ubiquitin-Conjugating Enzymes/chemistry
ABSTRACT
High-throughput binary protein interaction mapping is continuing to extend our understanding of cellular function and disease mechanisms. However, we remain one or two orders of magnitude away from a complete interaction map for humans and other major model organisms. Completion will require screening at substantially larger scales with many complementary assays, requiring further efficiency gains in proteome-scale interaction mapping. Here, we report Barcode Fusion Genetics-Yeast Two-Hybrid (BFG-Y2H), by which a full matrix of protein pairs can be screened in a single multiplexed strain pool. BFG-Y2H uses Cre recombination to fuse DNA barcodes from distinct plasmids, generating chimeric protein-pair barcodes that can be quantified via next-generation sequencing. We applied BFG-Y2H to four different matrices ranging in scale from ~25,000 to 2.5 million protein pairs. The results show that BFG-Y2H increases the efficiency of protein matrix screening, with quality that is on par with state-of-the-art Y2H methods.
Subject(s)
Centrosome/metabolism , Protein Interaction Mapping/methods , Proteome/metabolism , Saccharomyces cerevisiae/genetics , Chromosomes, Human/metabolism , Gene Library , High-Throughput Nucleotide Sequencing , Humans , Protein Binding , Two-Hybrid System Techniques
ABSTRACT
Long DNA palindromes are implicated in chromosomal rearrangement, but their roles in the underlying molecular events remain a matter of conjecture. One notion is that palindromes induce DNA breaks after assuming a cruciform structure, the four-way DNA junction providing a target for cleavage by Holliday junction (HJ)-specific enzymes. Though the "cruciform resolution" proposal is compelling, few of its components are established. Here we address fundamental properties and genetic dependencies of palindromic DNA metabolism in eukaryotes. Plasmid-borne palindromes introduced into S. cerevisiae are site-specifically broken in vivo, and the breaks exhibit unique hallmarks of an HJ resolvase mechanism. In vivo resolution requires Mus81, for which the bacterial HJ resolvase RusA will substitute. These results provide confirmation of cruciform extrusion and resolution in the context of eukaryotic chromatin. We also observed that, unchecked by a nuclease function provided by Mre11, episomal palindromes launch a self-perpetuating, breakage-fusion-bridge-independent copy number increase termed "escape."
Subject(s)
DNA Breaks, Double-Stranded , DNA, Cruciform/metabolism , DNA-Binding Proteins/metabolism , Endonucleases/metabolism , Saccharomyces cerevisiae Proteins/metabolism , Saccharomyces cerevisiae/metabolism , AT Rich Sequence , Base Sequence , DNA Replication , Dimerization , Escherichia coli Proteins/metabolism , Gene Amplification , Gene Rearrangement , Holliday Junction Resolvases/metabolism , Humans , Molecular Sequence Data , Plasmids/genetics
ABSTRACT
H-NS and Lsr2 are nucleoid-associated proteins from Gram-negative bacteria and Mycobacteria, respectively, that play an important role in the silencing of horizontally acquired foreign DNA that is more AT-rich than the resident genome. Although Lsr2 and H-NS are dissimilar in sequence and structure, they serve apparently similar functions and can functionally complement one another. The mechanism by which these xenogeneic silencers selectively target AT-rich DNA has been enigmatic. We performed high-resolution protein binding microarray analysis to simultaneously assess the binding preference of H-NS and Lsr2 for all possible 8-base sequences. Concurrently, we performed a detailed structure-function relationship analysis of their C-terminal DNA binding domains by NMR. Unexpectedly, we found that H-NS and Lsr2 use a common DNA binding mechanism in which a short loop containing a "Q/RGR" motif selectively interacts with the DNA minor groove, with the highest affinity for AT-rich sequences that lack A-tracts. Mutations of the Q/RGR motif abolished DNA binding activity. Netropsin, a DNA minor groove-binding molecule, effectively outcompeted H-NS and Lsr2 for binding to AT-rich sequences. These results provide a unified molecular mechanism to explain findings related to xenogeneic silencing proteins, including their lack of apparent sequence specificity but preference for AT-rich sequences. Our findings also suggest that structural information contained within the DNA minor groove is deciphered by xenogeneic silencing proteins to distinguish genetic material that is self from nonself.
Subject(s)
AT Rich Sequence , Bacterial Proteins/metabolism , DNA-Binding Proteins/metabolism , DNA/metabolism , Nucleic Acid Conformation , Amino Acid Sequence , Bacterial Proteins/chemistry , Base Sequence , DNA/chemistry , DNA-Binding Proteins/chemistry , Models, Molecular , Molecular Sequence Data , Nuclear Magnetic Resonance, Biomolecular , Sequence Homology, Amino Acid
ABSTRACT
BACKGROUND: Computational variant effect predictors offer a scalable and increasingly reliable means of interpreting human genetic variation, but concerns of circularity and bias have limited previous methods for evaluating and comparing predictors. Population-level cohorts of genotyped and phenotyped participants that have not been used in predictor training can facilitate an unbiased benchmarking of available methods. Using a curated set of human gene-trait associations with a reported rare-variant burden association, we evaluate the correlations of 24 computational variant effect predictors with associated human traits in the UK Biobank and All of Us cohorts. RESULTS: AlphaMissense outperformed all other predictors in inferring human traits based on rare missense variants in UK Biobank and All of Us participants. The overall rankings of computational variant effect predictors in these two cohorts showed a significant positive correlation. CONCLUSION: We describe a method to assess computational variant effect predictors that sidesteps the limitations of previous evaluations. This approach is generalizable to future predictors and could continue to inform predictor choice for personal and clinical genetics.
Subject(s)
Benchmarking , Genetic Variation , Humans , Phenotype , Computational Biology/methods , Genotype
ABSTRACT
C2H2 zinc fingers (C2H2-ZFs) are the most prevalent type of vertebrate DNA-binding domain, and typically appear in tandem arrays (ZFAs), with sequential C2H2-ZFs each contacting three (or more) sequential bases. C2H2-ZFs can be assembled in a modular fashion, providing one explanation for their remarkable evolutionary success. Given a set of modules with defined three-base specificities, modular assembly also presents a way to construct artificial proteins with specific DNA-binding preferences. However, a recent survey of a large number of three-finger ZFAs engineered by modular assembly reported high failure rates (~70%), casting doubt on the generality of modular assembly. Here, we used protein-binding microarrays to analyze 28 ZFAs that failed in the aforementioned study. Most (17) preferred specific sequences, which in all but one case resembled the intended target sequence. Like natural ZFAs, the engineered ZFAs typically yielded degenerate motifs, binding dozens to hundreds of related individual sequences. Thus, the failure of these proteins in previous assays is not due to lack of sequence-specific DNA-binding activity. Our findings underscore the relevance of individual C2H2-ZF sequence specificities within tandem arrays, and support the general ability of modular assembly to produce ZFAs with sequence-specific DNA-binding activity.
Subject(s)
DNA-Binding Proteins/chemistry , Zinc Fingers , Base Sequence , Protein Array Analysis/methods , Protein Binding , Protein Engineering
ABSTRACT
The impact of millions of individual genetic variants on molecular phenotypes in coding sequences remains unknown. Multiplexed assays of variant effect (MAVEs) are scalable methods to annotate relevant variants, but existing software lacks standardization, requires cumbersome configuration, and does not scale to large targets. We present satmut_utils as a flexible solution for simulation and variant quantification. We then benchmark MAVE software using simulated and real MAVE data. We finally determine mRNA abundance for thousands of cystathionine beta-synthase variants using two experimental methods. The satmut_utils package enables high-performance analysis of MAVEs and reveals the capability of variants to alter mRNA abundance.
Subject(s)
High-Throughput Nucleotide Sequencing , Software , Computer Simulation , Phenotype , Exons , High-Throughput Nucleotide Sequencing/methods
ABSTRACT
BACKGROUND: Glucokinase (GCK) regulates insulin secretion to maintain appropriate blood glucose levels. Sequence variants can alter GCK activity to cause hyperinsulinemic hypoglycemia or hyperglycemia associated with GCK-maturity-onset diabetes of the young (GCK-MODY), collectively affecting up to 10 million people worldwide. Patients with GCK-MODY are frequently misdiagnosed and treated unnecessarily. Genetic testing can prevent this but is hampered by the challenge of interpreting novel missense variants. RESULTS: Here, we exploit a multiplexed yeast complementation assay to measure both hyper- and hypoactive GCK variation, capturing 97% of all possible missense and nonsense variants. Activity scores correlate with in vitro catalytic efficiency, with fasting glucose levels in carriers of GCK variants, and with evolutionary conservation. Hypoactive variants are concentrated at buried positions, near the active site, and at a region of known importance for GCK conformational dynamics. Some hyperactive variants shift the conformational equilibrium towards the active state through a relative destabilization of the inactive conformation. CONCLUSION: Our comprehensive assessment of GCK variant activity promises to facilitate variant interpretation and diagnosis, expand our mechanistic understanding of hyperactive variants, and inform development of therapeutics targeting GCK.
Subject(s)
Diabetes Mellitus, Type 2 , Glucokinase , Humans , Glucokinase/genetics , Glucokinase/chemistry , Diabetes Mellitus, Type 2/genetics , Diabetes Mellitus, Type 2/diagnosis , Mutation, Missense , Genetic Testing , Mutation
ABSTRACT
Long read sequencing technologies, an attractive solution for many applications, often suffer from higher error rates. Alignment of multiple reads can improve base-calling accuracy, but some applications, e.g. sequencing mutagenized libraries where multiple distinct clones differ by one or few variants, require the use of barcodes or unique molecular identifiers. Unfortunately, sequencing errors can interfere with correct barcode identification, and a given barcode sequence may be linked to multiple independent clones within a given library. Here we focus on the target application of sequencing mutagenized libraries in the context of multiplexed assays of variant effects (MAVEs). MAVEs are increasingly used to create comprehensive genotype-phenotype maps that can aid clinical variant interpretation. Many MAVE methods use long-read sequencing of barcoded mutant libraries for accurate association of barcode with genotype. Existing long-read sequencing pipelines do not account for inaccurate sequencing or non-unique barcodes. Here, we describe Pacybara, which handles these issues by clustering long reads based on the similarities of (error-prone) barcodes while also detecting barcodes that have been associated with multiple genotypes. Pacybara also detects recombinant (chimeric) clones and reduces false positive indel calls. In three example applications, we show that Pacybara identifies and correctly resolves these issues.
ABSTRACT
Generating reference maps of interactome networks illuminates genetic studies by providing a protein-centric approach to finding new components of existing pathways, complexes, and processes. We apply state-of-the-art methods to identify binary protein-protein interactions (PPIs) for Drosophila melanogaster. Four all-by-all yeast two-hybrid (Y2H) screens of >10,000 Drosophila proteins result in the 'FlyBi' dataset of 8723 PPIs among 2939 proteins. Testing subsets of data from FlyBi and previous PPI studies using an orthogonal assay allows for normalization of data quality; subsequent integration of FlyBi and previous data results in an expanded binary Drosophila reference interaction network, DroRI, comprising 17,232 interactions among 6511 proteins. We use FlyBi data to generate an autophagy network, then validate it in vivo using autophagy-related assays. The deformed wings (dwg) gene encodes a protein that is both a regulator and a target of autophagy. Altogether, these resources provide a foundation for building new hypotheses regarding protein networks and function.
Subject(s)
Drosophila Proteins , Protein Interaction Maps , Animals , Protein Interaction Maps/genetics , Drosophila melanogaster/genetics , Drosophila melanogaster/metabolism , Drosophila/genetics , Saccharomyces cerevisiae/metabolism , Drosophila Proteins/genetics , Drosophila Proteins/metabolism , Protein Interaction Mapping/methods , Two-Hybrid System Techniques
ABSTRACT
Understanding the mechanisms of coronavirus disease 2019 (COVID-19) severity to efficiently design therapies for emerging virus variants remains an urgent challenge of the ongoing pandemic. Infection and immune reactions are mediated by direct contacts between viral molecules and the host proteome, and the vast majority of these virus-host contacts (the 'contactome') have not been identified. Here, we present a systematic contactome map of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) with the human host encompassing more than 200 binary virus-host and intraviral protein-protein interactions. We find that host proteins genetically associated with comorbidities of severe illness and long COVID are enriched in SARS-CoV-2 targeted network communities. Evaluating contactome-derived hypotheses, we demonstrate that viral NSP14 activates nuclear factor κB (NF-κB)-dependent transcription, even in the presence of cytokine signaling. Moreover, for several tested host proteins, genetic knock-down substantially reduces viral replication. Additionally, we show for USP25 that this effect is phenocopied by the small-molecule inhibitor AZ1. Our results connect viral proteins to human genetic architecture for COVID-19 severity and offer potential therapeutic targets.
Subject(s)
COVID-19 , SARS-CoV-2 , Humans , SARS-CoV-2/genetics , COVID-19/genetics , Proteome/genetics , Post-Acute COVID-19 Syndrome , Virus Replication/genetics , Ubiquitin Thiolesterase/pharmacology
ABSTRACT
BACKGROUND: For the majority of rare clinical missense variants, pathogenicity status cannot currently be classified. Classical homocystinuria, characterized by elevated homocysteine in plasma and urine, is caused by variants in the cystathionine beta-synthase (CBS) gene, most of which are rare. With early detection, existing therapies are highly effective. METHODS: Damaging CBS variants can be detected based on their failure to restore growth in yeast cells lacking the yeast ortholog CYS4. This assay has only been applied reactively, after first observing a variant in patients. Using saturation codon-mutagenesis, en masse growth selection, and sequencing, we generated a comprehensive, proactive map of CBS missense variant function. RESULTS: Our CBS variant effect map far exceeds the performance of computational predictors of disease variants. Map scores correlated strongly with both disease severity (Spearman's ϱ = 0.9) and human clinical response to vitamin B6 (ϱ = 0.93). CONCLUSIONS: We demonstrate that highly multiplexed cell-based assays can yield proactive maps of variant function and patient response to therapy, even for rare variants not previously seen in the clinic.
Subject(s)
Cystathionine beta-Synthase/genetics , Genetic Complementation Test/methods , Genetic Testing/methods , Homocystinuria/genetics , Mutation, Missense , Cystathionine beta-Synthase/metabolism , Genotype , Humans , Phenotype , Saccharomyces cerevisiae , Saccharomyces cerevisiae Proteins/genetics
ABSTRACT
Many traits are complex, depending non-additively on variant combinations. Even in model systems, such as the yeast S. cerevisiae, carrying out the high-order variant-combination testing needed to dissect complex traits remains a daunting challenge. Here, we describe "X-gene" genetic analysis (XGA), a strategy for engineering and profiling highly combinatorial gene perturbations. We demonstrate XGA on yeast ABC transporters by engineering 5,353 strains, each deleted for a random subset of 16 transporters, and profiling each strain's resistance to 16 compounds. XGA yielded 85,648 genotype-to-resistance observations, revealing high-order genetic interactions for 13 of the 16 transporters studied. Neural networks yielded intuitive functional models and guided exploration of fluconazole resistance, which was influenced non-additively by five genes. Together, our results showed that highly combinatorial genetic perturbation can functionally dissect complex traits, supporting pursuit of analogous strategies in human cells and other model systems.
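The scale of the XGA design can be checked with a short arithmetic sketch. The numbers come from the abstract above; the function and variable names are hypothetical, and the coverage figure assumes every strain carries a distinct deletion genotype, so it is an upper bound.

```python
N_GENES = 16        # transporters that may be deleted in each strain
N_STRAINS = 5353    # engineered strains
N_COMPOUNDS = 16    # compounds each strain was profiled against


def genotype_space(n_genes):
    """Number of possible deletion subsets of n_genes (including none)."""
    return 2 ** n_genes


def n_observations(n_strains, n_compounds):
    """One resistance measurement per strain-compound pair."""
    return n_strains * n_compounds


observations = n_observations(N_STRAINS, N_COMPOUNDS)
space = genotype_space(N_GENES)
# Upper bound on the sampled fraction, assuming all strains are distinct.
max_coverage = N_STRAINS / space

print(observations)             # strain-by-compound observations
print(space)                    # possible 16-gene deletion genotypes
print(round(max_coverage, 3))   # at most this fraction of genotypes sampled
```

The product of strains and compounds reproduces the 85,648 observations reported, while the 5,353 strains can cover only a small fraction of the 65,536 possible deletion combinations, which is why random sampling of genotypes is paired with modeling.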