ABSTRACT
R2 non-long terminal repeat retrotransposons insert site-specifically into ribosomal RNA genes (rDNA) in a broad range of multicellular eukaryotes. R2-encoded proteins can be leveraged to mediate transgene insertion at 28S rDNA loci in cultured human cells. This strategy, precise RNA-mediated insertion of transgenes (PRINT), relies on the codelivery of an mRNA encoding R2 protein and an RNA template encoding a transgene cassette of choice. Here, we demonstrate that the PRINT RNA template 5' module, which as a complementary DNA 3' end will generate the transgene 5' junction with rDNA, influences the efficiency and mechanism of gene insertion. Iterative design and testing identified optimal 5' modules consisting of a hepatitis delta virus-like ribozyme fold with high thermodynamic stability, suggesting that RNA template degradation from its 5' end may limit transgene insertion efficiency. We also demonstrate that transgene 5' junction formation can be either precise, formed by annealing the 3' end of first-strand complementary DNA with the upstream target site, or imprecise, by end-joining, but this difference in junction formation mechanism is not a major determinant of insertion efficiency. Sequence characterization of imprecise end-joining events indicates surprisingly minimal reliance on microhomology. Our findings expand the current understanding of the role of R2 retrotransposon transcript sequence and structure, and especially the 5' ribozyme fold, for retrotransposon mobility and RNA-templated gene synthesis in cells.
Subjects
Retroelements, Transgenes, Retroelements/genetics, Humans, Catalytic RNA/genetics, Catalytic RNA/metabolism, Catalytic RNA/chemistry, Nucleic Acid Conformation, Base Sequence, Genetic Templates
ABSTRACT
R2 non-long terminal repeat (non-LTR) retrotransposons are among the most extensively distributed mobile genetic elements in multicellular eukaryotes and show promise for applications in transgene supplementation of the human genome. They insert new gene copies into a conserved site in 28S ribosomal DNA with exquisite specificity. R2 clades are defined by the number of zinc fingers (ZFs) at the N terminus of the retrotransposon-encoded protein, postulated to additively confer DNA site specificity. Here, we illuminate general principles of DNA recognition by R2 N-terminal domains across and between clades, with extensive, specific recognition requiring only one or two compact domains. DNA-binding and protection assays demonstrate broadly shared as well as clade-specific DNA interactions. Gene insertion assays in cells identify the N-terminal domains sufficient for target-site insertion and reveal roles in second-strand cleavage or synthesis for clade-specific ZFs. Our results have implications for understanding evolutionary diversification of non-LTR retrotransposon insertion mechanisms and the design of retrotransposon-based gene therapies.
Subjects
Retroelements, Retroelements/genetics, Humans, DNA/metabolism, DNA/genetics, Zinc Fingers, Protein Domains, Protein Binding
ABSTRACT
Current approaches for inserting autonomous transgenes into the genome, such as CRISPR-Cas9 or virus-based strategies, have limitations including low efficiency and high risk of untargeted genome mutagenesis. Here, we describe precise RNA-mediated insertion of transgenes (PRINT), an approach for site-specifically primed reverse transcription that directs transgene synthesis directly into the genome at a multicopy safe-harbor locus. PRINT uses delivery of two in vitro transcribed RNAs: messenger RNA encoding avian R2 retroelement-protein and template RNA encoding a transgene of length validated up to 4 kb. The R2 protein coordinately recognizes the target site, nicks one strand at a precise location and primes complementary DNA synthesis for stable transgene insertion. With a cultured human primary cell line, over 50% of cells can gain several 2 kb transgenes, of which more than 50% are full-length. PRINT advantages include no extragenomic DNA, limiting risk of deleterious mutagenesis and innate immune responses, and the relatively low cost, rapid production and scalability of RNA-only delivery.
ABSTRACT
Short tandem repeats (STRs) are enriched in eukaryotic cis-regulatory elements and alter gene expression, yet how they regulate transcription remains unknown. We found that STRs modulate transcription factor (TF)-DNA affinities and apparent on-rates by about 70-fold by directly binding TF DNA-binding domains, with energetic impacts exceeding many consensus motif mutations. STRs maximize the number of weakly preferred microstates near target sites, thereby increasing TF density, with impacts well predicted by statistical mechanics. Confirming that STRs also affect TF binding in cells, neural networks trained only on in vivo occupancies predicted effects identical to those observed in vitro. Approximately 90% of TFs preferentially bound STRs that need not resemble known motifs, providing a cis-regulatory mechanism to target TFs to genomic sites.
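The microstate argument above can be illustrated with a standard equilibrium sum of Boltzmann weights. The sketch below is a generic textbook-style calculation, not the authors' model: the temperature, the example energies, and the function names are all hypothetical and chosen only for illustration.

```python
import math

R_KCAL_PER_MOL_K = 0.0019872  # gas constant in kcal/(mol*K)
TEMPERATURE_K = 298.15        # assumed temperature (25 C), not from the abstract
KT = R_KCAL_PER_MOL_K * TEMPERATURE_K


def boltzmann_weight_sum(ddg_kcal_per_mol):
    """Sum of Boltzmann weights over binding microstates.

    Each entry is the binding free energy of one microstate relative to
    the consensus site (0.0 = consensus, positive = weaker binding).
    """
    return sum(math.exp(-ddg / KT) for ddg in ddg_kcal_per_mol)


def effective_ddg(ddg_kcal_per_mol):
    """Apparent free energy of the whole ensemble relative to consensus alone."""
    return -KT * math.log(boltzmann_weight_sum(ddg_kcal_per_mol))


# Hypothetical example: one consensus site, with and without ten weak
# flanking microstates that each bind 2 kcal/mol less favorably.
consensus_only = [0.0]
with_weak_flank = [0.0] + [2.0] * 10
print(effective_ddg(consensus_only), effective_ddg(with_weak_flank))
```

In this toy picture, every additional weakly preferred microstate adds a positive term to the sum, so the apparent affinity of the region can only increase as such states accumulate.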
Subjects
Gene Expression Regulation, Microsatellite Repeats, Transcription Factors, Eukaryotic Cells, Transcription Factors/chemistry, Transcription Factors/genetics, Protein Binding, Humans, Animals, Saccharomyces cerevisiae, Protein Domains, Protein Conformation
ABSTRACT
Transcription factors (TFs) are proteins that bind DNA in a sequence-specific manner to regulate gene transcription. Despite their unique intrinsic sequence preferences, in vivo genomic occupancy profiles of TFs differ across cellular contexts. Hence, deciphering the sequence determinants of TF binding, both intrinsic and context-specific, is essential to understand gene regulation and the impact of regulatory, non-coding genetic variation. Biophysical models trained on in vitro TF binding assays can estimate intrinsic affinity landscapes and predict occupancy based on TF concentration and affinity. However, these models cannot adequately explain context-specific, in vivo binding profiles. Conversely, deep learning models, trained on in vivo TF binding assays, effectively predict and explain genomic occupancy profiles as a function of complex regulatory sequence syntax, albeit without a clear biophysical interpretation. To reconcile these complementary models of in vitro and in vivo TF binding, we developed Affinity Distillation (AD), a method that extracts thermodynamic affinities de novo from deep learning models of TF chromatin immunoprecipitation (ChIP) experiments by marginalizing away the influence of genomic sequence context. Applied to neural networks modeling diverse classes of yeast and mammalian TFs, AD predicts energetic impacts of sequence variation within and surrounding motifs on TF binding as measured by diverse in vitro assays with superior dynamic range and accuracy compared to motif-based methods. Furthermore, AD can accurately discern affinities of TF paralogs. Our results highlight thermodynamic affinity as a key determinant of in vivo binding, suggest that deep learning models of in vivo binding implicitly learn high-resolution affinity landscapes, and show that these affinities can be successfully distilled using AD.
This new biophysical interpretation of deep learning models enables high-throughput in silico experiments to explore the influence of sequence context and variation on both intrinsic affinity and in vivo occupancy.
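One way to picture "marginalizing away" sequence context is to place the same query sequence into many background sequences and average the change in a model's prediction. The sketch below is only an illustration under that assumption; it is not the published Affinity Distillation procedure, and `predict`, `insert_at_center`, and `marginal_score` are hypothetical names (the toy `predict` stands in for a trained neural network).

```python
from statistics import mean
from typing import Callable, Sequence


def insert_at_center(background: str, query: str) -> str:
    """Overwrite the central bases of `background` with `query`."""
    start = (len(background) - len(query)) // 2
    return background[:start] + query + background[start + len(query):]


def marginal_score(
    predict: Callable[[str], float],
    query: str,
    backgrounds: Sequence[str],
) -> float:
    """Mean change in predicted signal caused by the query, averaged over backgrounds."""
    with_query = mean(predict(insert_at_center(bg, query)) for bg in backgrounds)
    without_query = mean(predict(bg) for bg in backgrounds)
    return with_query - without_query


def toy_predict(seq: str) -> float:
    """Toy stand-in for a model (assumption): counts occurrences of a fixed motif."""
    return float(seq.count("GATA"))


backgrounds = ["AAAAAAAAAA", "CCCCCCCCCC", "ACACACACAC"]
print(marginal_score(toy_predict, "GATA", backgrounds))
```

Averaging over many backgrounds means that no single flanking sequence dominates the score, which is the intuition behind removing context dependence.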
ABSTRACT
SUMMARY: Single-cell Hi-C (scHi-C) allows the study of cell-to-cell variability in chromatin structure and dynamics. However, the high level of noise inherent in current scHi-C protocols necessitates careful assessment of data quality before biological conclusions can be drawn. Here, we present GiniQC, which quantifies unevenness in the distribution of inter-chromosomal reads in the scHi-C contact matrix to measure the level of noise. Our examples show the utility of GiniQC in assessing the quality of scHi-C data as a complement to existing quality control measures. We also demonstrate how GiniQC can help inform the impact of various data processing steps on data quality. AVAILABILITY AND IMPLEMENTATION: Source code and documentation are freely available at https://github.com/4dn-dcic/GiniQC. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
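As a rough illustration of measuring unevenness in inter-chromosomal contacts, the sketch below computes a plain Gini coefficient over the inter-chromosomal entries of a small contact matrix. It is not the GiniQC implementation, and any normalization the tool applies is not reproduced here; the function names and the toy matrix are hypothetical.

```python
from typing import List, Sequence


def gini(values: Sequence[float]) -> float:
    """Gini coefficient of non-negative values (0 = perfectly even)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the ascending-sorted values, ranks starting at 1.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


def interchromosomal_counts(
    matrix: Sequence[Sequence[float]],
    chrom_of_bin: Sequence[str],
) -> List[float]:
    """Upper-triangle entries whose two bins lie on different chromosomes."""
    n = len(chrom_of_bin)
    return [
        matrix[i][j]
        for i in range(n)
        for j in range(i + 1, n)
        if chrom_of_bin[i] != chrom_of_bin[j]
    ]


# Hypothetical 3-bin symmetric contact matrix: two bins on chr1, one on chr2.
contacts = [[5, 2, 1], [2, 5, 3], [1, 3, 5]]
bins = ["chr1", "chr1", "chr2"]
print(gini(interchromosomal_counts(contacts, bins)))
```

With this definition, a matrix whose inter-chromosomal reads are spread uniformly gives a value near 0, and one whose reads concentrate in a few bin pairs gives a value closer to 1.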
Subjects
Chromosomes, Software
ABSTRACT
The ability to rewrite large stretches of genomic DNA enables the creation of new organisms with customized functions. However, few methods currently exist for accumulating such widespread genomic changes in a single organism. In this study, we demonstrate a rapid approach for rewriting bacterial genomes with modified synthetic DNA. We recode 200 kb of the Salmonella typhimurium LT2 genome through a process we term SIRCAS (stepwise integration of rolling circle amplified segments), towards constructing an attenuated and genetically isolated bacterial chassis. The SIRCAS process involves direct iterative recombineering of 10-25 kb synthetic DNA constructs which are assembled in yeast and amplified by rolling circle amplification. Using SIRCAS, we create a Salmonella with 1557 synonymous leucine codon replacements across 176 genes, the largest number of cumulative recoding changes in a single bacterial strain to date. We demonstrate reproducibility over sixteen two-day cycles of integration and parallelization for hierarchical construction of a synthetic genome by conjugation. The resulting recoded strain grows at a similar rate to the wild-type strain and does not exhibit any major growth defects. This work is the first instance of synthetic bacterial recoding beyond the Escherichia coli genome, and reveals that Salmonella is remarkably amenable to genome-scale modification.