ABSTRACT
Alternative polyadenylation (APA) is a major driver of transcriptome diversity in human cells. Here, we use deep learning to predict APA from DNA sequence alone. We trained our model (APARENT, APA REgression NeT) on isoform expression data from over 3 million APA reporters. APARENT's predictions are highly accurate when tasked with inferring APA in synthetic and human 3'UTRs. Visualizing features learned across all network layers reveals that APARENT recognizes sequence motifs known to recruit APA regulators, discovers previously unknown sequence determinants of 3' end processing, and integrates these features into a comprehensive, interpretable, cis-regulatory code. We apply APARENT to forward engineer functional polyadenylation signals with precisely defined cleavage position and isoform usage and validate predictions experimentally. Finally, we use APARENT to quantify the impact of genetic variants on APA. Our approach detects pathogenic variants in a wide range of disease contexts, expanding our understanding of the genetic origins of disease.
Subject(s)
Deep Learning; Models, Genetic; Polyadenylation/genetics; 3' Untranslated Regions/genetics; Base Sequence/genetics; Databases, Genetic; Gene Expression/genetics; HEK293 Cells; Humans; Mutagenesis/genetics; RNA Cleavage/genetics; RNA, Messenger/genetics; RNA-Seq; Synthetic Biology; Transcriptome
ABSTRACT
Massively parallel reporter assays (MPRAs) are powerful tools for quantifying the impacts of sequence variation on gene expression. Reading out molecular phenotypes with sequencing enables interrogating the impact of sequence variation beyond genome scale. Machine learning models integrate and codify information learned from MPRAs and enable generalization by predicting sequences outside the training data set. Models can provide a quantitative understanding of cis-regulatory codes controlling gene expression, enable variant stratification, and guide the design of synthetic regulatory elements for applications from synthetic biology to mRNA and gene therapy. This review focuses on cis-regulatory MPRAs, particularly those that interrogate cotranscriptional and post-transcriptional processes: alternative splicing, cleavage and polyadenylation, translation, and mRNA decay.
Subject(s)
Machine Learning; Humans; Genes, Reporter/genetics; Animals; High-Throughput Nucleotide Sequencing/methods; Gene Expression Regulation/genetics
ABSTRACT
Most human transcripts are alternatively spliced, and many disease-causing mutations affect RNA splicing. Toward better modeling the sequence determinants of alternative splicing, we measured the splicing patterns of over two million (M) synthetic mini-genes, which include degenerate subsequences totaling over 100 M bases of variation. The massive size of these training data allowed us to improve upon current models of splicing, as well as to gain new mechanistic insights. Our results show that the vast majority of hexamer sequence motifs measurably influence splice site selection when positioned within alternative exons, with multiple motifs acting additively rather than cooperatively. Intriguingly, motifs that enhance (suppress) exon inclusion in alternative 5' splicing also enhance (suppress) exon inclusion in alternative 3' or cassette exon splicing, suggesting a universal mechanism for alternative exon recognition. Finally, our empirically trained models are highly predictive of the effects of naturally occurring variants on alternative splicing in vivo.
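The additive behavior of hexamer motifs described above can be illustrated with a minimal sketch. It assumes a simple model in which each hexamer contributes a fixed amount to a log-odds score that is then mapped to an inclusion probability; the function names and the effect sizes in `toy_weights` are hypothetical and are not taken from the study.

```python
import math


def hexamers(seq):
    """Yield every overlapping 6-mer in seq."""
    for i in range(len(seq) - 5):
        yield seq[i:i + 6]


def additive_logit(seq, weights, bias=0.0):
    """Sum per-hexamer effects; hexamers without a weight contribute 0."""
    return bias + sum(weights.get(h, 0.0) for h in hexamers(seq))


def inclusion_probability(seq, weights, bias=0.0):
    """Map the additive score to a probability with a logistic function."""
    return 1.0 / (1.0 + math.exp(-additive_logit(seq, weights, bias)))


# Hypothetical effect sizes, for illustration only.
toy_weights = {"GAAGAA": 0.8, "TAGGGT": -0.6}
toy_exon = "CCGAAGAACCTAGGGTCC"

print(additive_logit(toy_exon, toy_weights))
print(inclusion_probability(toy_exon, toy_weights))
```

Under this assumption, adding or removing one motif shifts the score by that motif's weight regardless of which other motifs are present, which is what "additively rather than cooperatively" means in practice.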
Subject(s)
Alternative Splicing; Genome, Human; Models, Genetic; Polymorphism, Single Nucleotide; Base Sequence; Humans; Molecular Sequence Data; Nucleotide Motifs; RNA Splice Sites
ABSTRACT
Protein-protein interactions (PPIs) regulate many cellular processes and engineered PPIs have cell and gene therapy applications. Here, we introduce massively parallel PPI measurement by sequencing (MP3-seq), an easy-to-use and highly scalable yeast two-hybrid approach for measuring PPIs. In MP3-seq, DNA barcodes are associated with specific protein pairs and barcode enrichment can be read by sequencing to provide a direct measure of interaction strength. We show that MP3-seq is highly quantitative and scales to over 100,000 interactions. We apply MP3-seq to characterize interactions between families of rationally designed heterodimers and to investigate elements conferring specificity to coiled-coil interactions. Lastly, we predict coiled heterodimer structures using AlphaFold-Multimer (AF-M) and train linear models on physics-based energy terms to predict MP3-seq values. We find that AF-M-based models could be valuable for prescreening interactions but experimentally measuring interactions remains necessary to rank their strengths quantitatively.
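As a rough illustration of reading interaction strength from barcode enrichment, the sketch below computes a log2 ratio of each barcode's frequency after selection to its frequency before selection. This is a generic enrichment calculation, not the MP3-seq analysis pipeline; the function name, the pseudocount, and the toy counts are assumptions made for the example.

```python
import math


def log2_enrichment(pre_counts, post_counts, pseudocount=1.0):
    """Return log2(post-selection frequency / pre-selection frequency) per barcode.

    A pseudocount is added to every barcode so that zero counts stay finite.
    """
    barcodes = set(pre_counts) | set(post_counts)
    pre_total = sum(pre_counts.get(b, 0) + pseudocount for b in barcodes)
    post_total = sum(post_counts.get(b, 0) + pseudocount for b in barcodes)
    scores = {}
    for b in barcodes:
        pre_freq = (pre_counts.get(b, 0) + pseudocount) / pre_total
        post_freq = (post_counts.get(b, 0) + pseudocount) / post_total
        scores[b] = math.log2(post_freq / pre_freq)
    return scores


# Hypothetical read counts for two barcoded protein pairs.
toy_pre = {"A:B": 99, "A:C": 99}
toy_post = {"A:B": 299, "A:C": 99}
toy_scores = log2_enrichment(toy_pre, toy_post)
for pair in sorted(toy_scores):
    print(pair, round(toy_scores[pair], 3))
```

Working with frequencies rather than raw counts keeps the score comparable across sequencing runs of different depth.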
Subject(s)
High-Throughput Nucleotide Sequencing; High-Throughput Nucleotide Sequencing/methods; Protein Interaction Mapping/methods; Two-Hybrid System Techniques; Protein Binding; Proteins/metabolism; Proteins/chemistry; Proteins/genetics; Humans
ABSTRACT
MOTIVATION: Single-cell RNA sequencing (scRNA-seq) is widely used for analyzing gene expression in multi-cellular systems and provides unprecedented access to cellular heterogeneity. scRNA-seq experiments aim to identify and quantify all cell types present in a sample. Measured single-cell transcriptomes are grouped by similarity and the resulting clusters are mapped to cell types based on cluster-specific gene expression patterns. While the process of generating clusters has become largely automated, annotation remains a laborious ad hoc effort that requires expert biological knowledge. RESULTS: Here, we introduce CellMeSH, a new automated approach to identifying cell types for clusters based on prior literature. CellMeSH combines a database of gene-cell-type associations with a probabilistic method for database querying. The database is constructed by automatically linking gene and cell-type information from millions of publications using existing indexed literature resources. Compared to manually constructed databases, CellMeSH is more comprehensive and is easily updated with new data. The probabilistic query method enables reliable information retrieval even though the gene-cell-type associations extracted from the literature are noisy. CellMeSH is also able to optionally utilize prior knowledge about tissues or cells for further annotation improvement. CellMeSH achieves top-one and top-three accuracies on a number of mouse and human datasets that are consistently better than existing approaches. AVAILABILITY AND IMPLEMENTATION: Web server at https://uncurl.cs.washington.edu/db_query and API at https://github.com/shunfumao/cellmesh. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
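To make the idea of probabilistic querying over noisy gene-cell-type associations concrete, here is a minimal sketch that ranks cell types by a smoothed log-likelihood of the query genes. It is a naive-Bayes-style stand-in and should not be read as CellMeSH's actual scoring method; `rank_cell_types`, the smoothing constant `alpha`, and the counts in `toy_db` are all hypothetical.

```python
import math


def rank_cell_types(query_genes, db, alpha=1.0):
    """Rank cell types by the smoothed log-likelihood of the query genes.

    db maps cell type -> {gene: number of supporting publications}.
    alpha is an additive smoothing constant that keeps unseen genes finite.
    """
    vocab = {g for counts in db.values() for g in counts} | set(query_genes)
    ranked = []
    for cell_type, counts in db.items():
        total = sum(counts.values()) + alpha * len(vocab)
        score = sum(
            math.log((counts.get(g, 0) + alpha) / total) for g in query_genes
        )
        ranked.append((score, cell_type))
    ranked.sort(reverse=True)
    return [cell_type for _, cell_type in ranked]


# Hypothetical literature co-occurrence counts.
toy_db = {
    "T cell": {"CD3E": 40, "CD2": 25, "MS4A1": 1},
    "B cell": {"MS4A1": 35, "CD79A": 30, "CD3E": 2},
}
print(rank_cell_types(["CD3E", "CD2"], toy_db))
```

Smoothing is what lets a query tolerate noise: a single spurious or missing association lowers a cell type's score without eliminating it.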
Subject(s)
Algorithms; Software; Humans; Gene Expression Profiling/methods; Sequence Analysis, RNA/methods; Single-Cell Analysis/methods
ABSTRACT
Over just the last 2 years, mRNA therapeutics and vaccines have undergone a rapid transition from an intriguing concept to real-world impact. However, whereas some aspects of mRNA therapeutics, such as the use of chemical modifications to increase stability and reduce immunogenicity, have been extensively optimized for over two decades, other aspects, particularly the selection and design of the noncoding leader and trailer sequences which control translation efficiency and stability, have received comparatively less attention. In practice, such 5' and 3' untranslated regions (UTRs) are often borrowed from highly expressed human genes with few or no modifications, as is the case for the Pfizer/BioNTech Covid vaccine. Focusing on the 5'UTR, we here argue that model-driven design is a promising alternative that provides unprecedented control over 5'UTR function. We review recent work that combines synthetic biology with machine learning to build quantitative models that relate ribosome loading, and thus translation efficiency, to the 5'UTR sequence. We first introduce an experimental approach that uses polysome profiling and high-throughput sequencing to quantify ribosome loading for hundreds of thousands of 5'UTRs in parallel. We apply this approach to measure ribosome loading in synthetic RNA libraries with a random sequence inserted into the 5'UTR. We then review Optimus 5-Prime, a convolutional neural network model trained on the experimental data. We highlight that very accurate models of biological regulation can be learned from synthetic data sets with degenerate 5'UTRs. We validate model predictions not only on held-out data sets from our random library but also on a large library of over 30,000 human 5'UTR fragments and using translation reporter data collected independently by other groups. Both the experiment and model are compatible with commonly used chemically modified nucleosides, in particular, pseudouridine (Ψ) and 1-methyl-pseudouridine (m1Ψ).
We find that, in general, 5'UTRs have very similar impacts when combined with different protein-coding sequences and even in the context of different chemical modifications. We demonstrate that Optimus 5-Prime can be combined with design algorithms to generate de novo sequences with precisely defined translation efficiencies. We emphasize recent developments in design algorithms that rely on activation maximization and generative modeling to improve both the fitness and diversity of designed sequences. Compared with prior approaches such as genetic algorithms, we show that these approaches are not only faster but also less likely to get stuck in local sequence optima. Finally, we discuss how the approach reviewed here can be generalized to other gene regions and applications.
Subject(s)
COVID-19; Protein Biosynthesis; COVID-19 Vaccines; Humans; Machine Learning; RNA, Messenger/genetics; RNA, Messenger/metabolism; SARS-CoV-2
ABSTRACT
Cerebellar malformations are diverse congenital anomalies frequently associated with developmental disability. Although genetic and prenatal non-genetic causes have been described, no systematic analysis has been performed. Here, we present a large exome sequencing study of Dandy-Walker malformation (DWM) and cerebellar hypoplasia (CBLH). We performed exome sequencing in 282 individuals from 100 families with DWM or CBLH, and we established a molecular diagnosis in 36 of 100 families, with a significantly higher yield for CBLH (51%) than for DWM (16%). The 41 variants impact 27 neurodevelopmental-disorder-associated genes, thus demonstrating that CBLH and DWM are often features of monogenic neurodevelopmental disorders. Though only seven monogenic causes (19%) were identified in more than one individual, neuroimaging review of 131 additional individuals confirmed cerebellar abnormalities in 23 of 27 genetic disorders (85%). Prenatal risk factors were frequently found among individuals without a genetic diagnosis (30 of 64 individuals [47%]). Single-cell RNA sequencing of prenatal human cerebellar tissue revealed gene enrichment in neuronal and vascular cell types; this suggests that defective vasculogenesis may disrupt cerebellar development. Further, de novo gain-of-function variants in PDGFRB, a tyrosine kinase receptor essential for vascular progenitor signaling, were associated with CBLH, and this discovery links genetic and non-genetic etiologies. Our results suggest that genetic defects impact specific cerebellar cell types and implicate abnormal vascular development as a mechanism for cerebellar malformations. We also confirmed a major contribution of non-genetic prenatal factors in individuals with cerebellar abnormalities, substantially influencing diagnostic evaluation and counseling regarding recurrence risk and prognosis.
Subject(s)
Cerebellum/abnormalities; Cerebellum/diagnostic imaging; Cohort Studies; Female; Humans; Male; Pregnancy
ABSTRACT
BACKGROUND: Optimization of DNA and protein sequences based on Machine Learning models is becoming a powerful tool for molecular design. Activation maximization offers a simple design strategy for differentiable models: one-hot coded sequences are first approximated by a continuous representation, which is then iteratively optimized with respect to the predictor oracle by gradient ascent. While elegant, the current version of the method suffers from vanishing gradients and may cause predictor pathologies leading to poor convergence. RESULTS: Here, we introduce Fast SeqProp, an improved activation maximization method that combines straight-through approximation with normalization across the parameters of the input sequence distribution. Fast SeqProp overcomes bottlenecks in earlier methods arising from input parameters becoming skewed during optimization. Compared to prior methods, Fast SeqProp results in up to 100-fold faster convergence while also finding improved fitness optima for many applications. We demonstrate Fast SeqProp's capabilities by designing DNA and protein sequences for six deep learning predictors, including a protein structure predictor. CONCLUSIONS: Fast SeqProp offers a reliable and efficient method for general-purpose sequence optimization through a differentiable fitness predictor. As demonstrated on a variety of deep learning models, the method is widely applicable, and can incorporate various regularization techniques to maintain confidence in the sequence designs. As a design tool, Fast SeqProp may aid in the development of novel molecules, drug therapies and vaccines.
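The combination of straight-through approximation and logit normalization described above can be sketched in plain Python. This is a simplified illustration under several assumptions, not the published Fast SeqProp implementation: the predictor is a toy linear scoring matrix (`toy_weights`, hypothetical), the discrete sequence is taken by argmax rather than sampled, the normalization has no learned scale or offset, and plain gradient ascent is used as the optimizer.

```python
import math
import random

ALPHABET = "ACGT"


def softmax(row):
    """Numerically stable softmax over one position's logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def score(seq, weights):
    """Toy differentiable predictor: a sum of per-position letter weights."""
    return sum(weights[i][ALPHABET.index(c)] for i, c in enumerate(seq))


def design(weights, steps=200, lr=0.1, seed=0, eps=1e-6):
    """Return the best (sequence, score) seen during optimization."""
    rng = random.Random(seed)
    n_pos, n_let = len(weights), len(ALPHABET)
    logits = [[rng.gauss(0.0, 1.0) for _ in range(n_let)] for _ in range(n_pos)]
    best_seq, best_score = None, -math.inf
    for _ in range(steps + 1):
        # Normalize all logits jointly to zero mean and unit variance.
        flat = [v for row in logits for v in row]
        mean = sum(flat) / len(flat)
        var = sum((v - mean) ** 2 for v in flat) / len(flat)
        sigma = math.sqrt(var + eps)
        norm = [[(v - mean) / sigma for v in row] for row in logits]
        probs = [softmax(row) for row in norm]
        # Forward pass: a discrete one-hot sequence (argmax per position).
        seq = "".join(
            ALPHABET[max(range(n_let), key=p.__getitem__)] for p in probs
        )
        current = score(seq, weights)
        if current > best_score:
            best_seq, best_score = seq, current
        # Straight-through step: reuse d(score)/d(one-hot) as d(score)/d(probs).
        # For the linear toy predictor that gradient is the weight matrix itself.
        g_norm = []
        for p, w in zip(probs, weights):
            avg = sum(pj * wj for pj, wj in zip(p, w))
            g_norm.append([pj * (wj - avg) for pj, wj in zip(p, w)])
        # Backpropagate through the normalization layer.
        flat_g = [v for row in g_norm for v in row]
        flat_n = [v for row in norm for v in row]
        mean_g = sum(flat_g) / len(flat_g)
        mean_gn = sum(g * n for g, n in zip(flat_g, flat_n)) / len(flat_g)
        for i in range(n_pos):
            for j in range(n_let):
                grad = (g_norm[i][j] - mean_g - norm[i][j] * mean_gn) / sigma
                logits[i][j] += lr * grad  # gradient ascent on the logits
    return best_seq, best_score


# Hypothetical per-position weights for a 4-letter, 5-position design task.
toy_weights = [
    [0.1, 0.9, 0.0, 0.2],
    [0.8, 0.1, 0.3, 0.0],
    [0.0, 0.2, 0.7, 0.1],
    [0.3, 0.0, 0.1, 0.6],
    [0.5, 0.4, 0.0, 0.1],
]
print(design(toy_weights))
```

Because the logits are renormalized on every step, their overall scale cannot drift, which is the intuition behind normalization keeping the softmax from saturating and the gradients from vanishing.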
Subject(s)
Algorithms; Machine Learning; Amino Acid Sequence
ABSTRACT
Cancer-associated mutations of the core splicing factor 3B1 (SF3B1) result in selection of novel 3' splice sites (3'SS), but precise molecular mechanisms of oncogenesis remain unclear. SF3B1 stabilizes the interaction between U2 snRNP and the branch point (BP) on the pre-mRNA. It has hence been speculated that a change in BP selection is the basis for novel 3'SS selection. Direct quantitative determination of BP utilization is, however, technically challenging. To define BP utilization by SF3B1-mutant spliceosomes, we used an overexpression approach in human cells as well as a complementary strategy using isogenic murine embryonic stem cells with monoallelic K700E mutations constructed via CRISPR/Cas9-based genome editing and a dual vector homology-directed repair methodology. A synthetic minigene library with degenerate regions in 3' intronic regions (3.4 million individual minigenes) was used to compare BP usage of SF3B1(K700E) and SF3B1(WT). Using this model, we show that SF3B1(K700E) spliceosomes utilize non-canonical sequence variants (at position -1 relative to BP adenosine) more frequently than wild-type spliceosomes. These predictions were confirmed using minigene splicing assays. Our results suggest a model of BP utilization by mutant SF3B1 wherein it is able to utilize non-consensus alternative BP sequences by stabilizing weaker U2-BP interactions.
Subject(s)
RNA Splicing Factors/metabolism; Animals; Base Pairing; Cells, Cultured; Embryonic Stem Cells/metabolism; Gene Library; HEK293 Cells; Humans; Mice; Mutation; Nucleotide Motifs; Phosphoproteins/genetics; RNA Splice Sites; RNA Splicing Factors/genetics; RNA, Messenger/metabolism
ABSTRACT
Classical genetic approaches for interpreting variants, such as case-control or co-segregation studies, require finding many individuals with each variant. Because the overwhelming majority of variants are present in only a few living humans, this strategy has clear limits. Fully realizing the clinical potential of genetics requires that we accurately infer pathogenicity even for rare or private variation. Many computational approaches to predicting variant effects have been developed, but they can identify only a small fraction of pathogenic variants with the high confidence that is required in the clinic. Experimentally measuring a variant's functional consequences can provide clearer guidance, but individual assays performed only after the discovery of the variant are both time and resource intensive. Here, we discuss how multiplex assays of variant effect (MAVEs) can be used to measure the functional consequences of all possible variants in disease-relevant loci for a variety of molecular and cellular phenotypes. The resulting large-scale functional data can be combined with machine learning and clinical knowledge for the development of "lookup tables" of accurate pathogenicity predictions. A coordinated effort to produce, analyze, and disseminate large-scale functional data generated by multiplex assays could be essential to addressing the variant-interpretation crisis.
Subject(s)
Computational Biology/methods; Disease/genetics; Genetic Variation; Genome, Human; Humans
ABSTRACT
Our ability to predict protein expression from DNA sequence alone remains poor, reflecting our limited understanding of cis-regulatory grammar and hampering the design of engineered genes for synthetic biology applications. Here, we generate a model that predicts the protein expression of the 5' untranslated region (UTR) of mRNAs in the yeast Saccharomyces cerevisiae. We constructed a library of half a million 50-nucleotide-long random 5' UTRs and assayed their activity in a massively parallel growth selection experiment. The resulting data allow us to quantify the impact on protein expression of Kozak sequence composition, upstream open reading frames (uORFs), and secondary structure. We trained a convolutional neural network (CNN) on the random library and showed that it performs well at predicting the protein expression of both a held-out set of the random 5' UTRs as well as native S. cerevisiae 5' UTRs. The model additionally was used to computationally evolve highly active 5' UTRs. We confirmed experimentally that the great majority of the evolved sequences led to higher protein expression rates than the starting sequences, demonstrating the predictive power of this model.
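One of the sequence features mentioned above, upstream start codons that can open uORFs, is easy to extract from a 5' UTR. The sketch below is only a feature-extraction illustration, not the study's model; it assumes the main coding sequence begins immediately after the UTR, and the function name and example sequence are hypothetical.

```python
def upstream_start_codons(utr, start_codon="ATG"):
    """Return (position, frame) for every start codon found in a 5' UTR.

    frame is the distance from the codon to the end of the UTR modulo 3,
    so frame 0 means the upstream start codon is in frame with a main ORF
    that begins immediately after the UTR (an assumption of this sketch).
    """
    hits = []
    for i in range(len(utr) - 2):
        if utr[i:i + 3] == start_codon:
            hits.append((i, (len(utr) - i) % 3))
    return hits


# Hypothetical 10-nucleotide UTR with two upstream ATGs.
toy_utr = "CATGAAATGC"
print(upstream_start_codons(toy_utr))  # -> [(1, 0), (6, 1)]
```

Separating in-frame from out-of-frame upstream start codons matters because the two can have different consequences for translation of the main ORF.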
Subject(s)
Models, Genetic; Saccharomyces cerevisiae/genetics; 5' Untranslated Regions; Alternative Splicing; Computer Simulation; Gene Library; Machine Learning; Neural Networks, Computer; RNA, Fungal; RNA, Messenger
ABSTRACT
Biology offers compelling proof that macroscopic "living materials" can emerge from reactions between diffusing biomolecules. Here, we show that molecular self-organization could be a similarly powerful approach for engineering functional synthetic materials. We introduce a programmable DNA embedded hydrogel that produces tunable patterns at the centimeter length scale. We generate these patterns by implementing chemical reaction networks through synthetic DNA complexes, embedding the complexes in the hydrogel, and triggering with locally applied input DNA strands. We first demonstrate ring pattern formation around a circular input cavity and show that the ring width and intensity can be predictably tuned. Then, we create patterns of increasing complexity, including concentric rings and non-isotropic patterns. Finally, we show "destructive" and "constructive" interference patterns, by combining several ring-forming modules in the gel and triggering them from multiple sources. We further show that computer simulations based on the reaction-diffusion model can predict and inform the programming of target patterns.
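For readers unfamiliar with reaction-diffusion simulation, the following minimal sketch integrates a single species that diffuses away from a fixed source and decays, using explicit finite differences in one dimension. It is a generic textbook scheme, not the DNA reaction network model used in the study; the function name and all parameter values are assumptions, with space and time steps set to 1.

```python
def simulate_gradient(n_cells=50, steps=500, d=0.2, k=0.01, source=1.0):
    """Integrate du/dt = D * d2u/dx2 - k * u with explicit finite differences.

    The left boundary is clamped to `source` (a locally applied input) and
    the right boundary is no-flux. The scheme is stable for d <= 0.5.
    """
    u = [0.0] * n_cells
    u[0] = source
    for _ in range(steps):
        new = u[:]
        for i in range(1, n_cells):
            right = u[i + 1] if i + 1 < n_cells else u[i]  # no-flux boundary
            laplacian = u[i - 1] - 2.0 * u[i] + right
            new[i] = u[i] + d * laplacian - k * u[i]
        u = new
    return u


profile = simulate_gradient()
print([round(v, 3) for v in profile[:10]])
```

Patterns such as rings arise when a downstream reaction responds only to a band of concentrations along a gradient like this one, so the diffusion and decay rates set where the band falls.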
Subject(s)
Computer Simulation; DNA/chemistry; Hydrogels/chemistry; Models, Chemical
ABSTRACT
Motivation: Single cell RNA-seq (scRNA-seq) data contains a wealth of information which has to be inferred computationally from the observed sequencing reads. As the ability to sequence more cells improves rapidly, existing computational tools suffer from three problems. (i) The decreased reads-per-cell implies a highly sparse sample of the true cellular transcriptome. (ii) Many tools simply cannot handle the size of the resulting datasets. (iii) Prior biological knowledge such as bulk RNA-seq information of certain cell types or qualitative marker information is not taken into account. Here we present UNCURL, a preprocessing framework based on non-negative matrix factorization for scRNA-seq data, that is able to handle varying sampling distributions, scales to very large cell numbers and can incorporate prior knowledge. Results: We find that preprocessing using UNCURL consistently improves performance of commonly used scRNA-seq tools for clustering, visualization and lineage estimation, both in the absence and presence of prior knowledge. Finally we demonstrate that UNCURL is extremely scalable and parallelizable, and runs faster than other methods on a scRNA-seq dataset containing 1.3 million cells. Availability and implementation: Source code is available at https://github.com/yjzhang/uncurl_python. Supplementary information: Supplementary data are available at Bioinformatics online.
Subject(s)
Gene Expression Profiling/methods; Sequence Analysis, RNA/methods; Single-Cell Analysis/methods; Software; Algorithms; Cluster Analysis
ABSTRACT
The invention of the Kalman filter is a crowning achievement of filtering theory, one that has revolutionized technology in countless ways. By dealing effectively with noise, the Kalman filter has enabled various applications in positioning, navigation, control, and telecommunications. In the emerging field of synthetic biology, noise and context dependency are among the key challenges facing the successful implementation of reliable, complex, and scalable synthetic circuits. Although substantial further advancement in the field may very well rely on effectively addressing these issues, a principled protocol to deal with noise, as provided by the Kalman filter, remains completely missing. Here we develop an optimal filtering theory that is suitable for noisy biochemical networks. We show how the resulting filters can be implemented at the molecular level and provide various simulations related to estimation, system identification, and noise cancellation problems. We demonstrate our approach in vitro using DNA strand displacement cascades as well as in vivo using flow cytometry measurements of a light-inducible circuit in Escherichia coli.
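As background for the classical filter referred to above, here is a minimal scalar Kalman filter for a slowly drifting quantity observed with noise. It shows only the standard predict/update recursion and does not reproduce the filtering theory for biochemical networks developed in the paper; the random-walk state model and the default noise variances `q` and `r` are assumptions of this sketch.

```python
def kalman_1d(measurements, q=1e-3, r=0.1, x0=0.0, p0=1.0):
    """Scalar Kalman filter for a random-walk state observed with noise.

    State model:       x_k = x_{k-1} + w,  Var(w) = q
    Measurement model: z_k = x_k + v,      Var(v) = r
    Returns the filtered estimate after each measurement.
    """
    x, p = x0, p0
    estimates = []
    for z in measurements:
        p = p + q              # predict: uncertainty grows by the process noise
        gain = p / (p + r)     # Kalman gain: how much to trust the measurement
        x = x + gain * (z - x) # update the estimate toward the measurement
        p = (1.0 - gain) * p   # updated uncertainty shrinks
        estimates.append(x)
    return estimates


# Hypothetical noiseless readings of a constant signal, for illustration.
toy_estimates = kalman_1d([1.0] * 200)
print(round(toy_estimates[0], 3), round(toy_estimates[-1], 3))
```

The gain starts high while the initial uncertainty dominates and settles to a smaller value set by the ratio of `q` to `r`, which is how the filter trades responsiveness against noise rejection.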
Subject(s)
Computers, Molecular; Models, Biological; Models, Chemical; Models, Statistical; Signal Processing, Computer-Assisted; Signal-To-Noise Ratio
ABSTRACT
Even a single-nucleotide difference between the sequences of two otherwise identical biological nucleic acids can have dramatic functional consequences. Here, we use model-guided reaction pathway engineering to quantitatively improve the performance of selective hybridization probes in recognizing single nucleotide variants (SNVs). Specifically, we build a detection system that combines discrimination by competition with DNA strand displacement-based catalytic amplification. We show, both mathematically and experimentally, that the single nucleotide selectivity of such a system in binding to single-stranded DNA and RNA is quadratically better than discrimination due to competitive hybridization alone. As an additional benefit, the integrated circuit inherits the property of amplification and provides at least 10-fold better sensitivity than standard hybridization probes. Moreover, we demonstrate how the detection mechanism can be tuned such that the detection reaction is agnostic to the position of the SNV within the target sequence. In contrast, prior strand displacement-based probes designed for kinetic discrimination are highly sensitive to position effects. We apply our system to reliably discriminate between different members of the let-7 microRNA family that differ in only a single base position. Our results demonstrate the power of systematic reaction network design to quantitatively improve biotechnology.
Subject(s)
DNA Probes/chemistry; DNA Probes/genetics; DNA/chemistry; DNA/genetics; MicroRNAs/chemistry; MicroRNAs/genetics; Nucleic Acid Hybridization/methods; Humans; Polymorphism, Single Nucleotide
ABSTRACT
mRNA therapeutics are revolutionizing the pharmaceutical industry, but methods to optimize the primary sequence for increased expression are still lacking. Here, we design 5'UTRs for efficient mRNA translation using deep learning. We perform polysome profiling of fully or partially randomized 5'UTR libraries in three cell types and find that UTR performance is highly correlated across cell types. We train models on our datasets and use them to guide the design of high-performing 5'UTRs using gradient descent and generative neural networks. We experimentally test designed 5'UTRs with mRNA encoding megaTAL™ gene editing enzymes for two different gene targets and in two different cell lines. We find that the designed 5'UTRs support strong gene editing activity. Editing efficiency is correlated between cell types and gene targets, although the best performing UTR was specific to one cargo and cell type. Our results highlight the potential of model-based sequence design for mRNA therapeutics.
Subject(s)
5' Untranslated Regions; Deep Learning; Gene Editing; RNA, Messenger; RNA, Messenger/genetics; RNA, Messenger/metabolism; 5' Untranslated Regions/genetics; Humans; Gene Editing/methods; Polyribosomes/metabolism; Cell Line; HEK293 Cells; Protein Biosynthesis
ABSTRACT
Transcriptional heterogeneity in isogenic bacterial populations can play various roles in bacterial evolution, but its detection remains technically challenging. Here, we use microbial split-pool ligation transcriptomics to study the relationship between bacterial subpopulation formation and plasmid-host interactions at the single-cell level. We find that single-cell transcript abundances are influenced by bacterial growth state and plasmid carriage. Moreover, plasmid carriage constrains the formation of bacterial subpopulations. Plasmid genes, including those with core functions such as replication and maintenance, exhibit transcriptional heterogeneity associated with cell activity. Notably, we identify a cell subpopulation that does not transcribe conjugal plasmid transfer genes, which may help reduce plasmid burden on a subset of cells. Our study advances the understanding of plasmid-mediated subpopulation dynamics and provides insights into the plasmid-bacteria interplay.
Subject(s)
Plasmids; Single-Cell Analysis; Plasmids/genetics; Single-Cell Analysis/methods; Escherichia coli/genetics; Sequence Analysis, RNA/methods; Conjugation, Genetic; Bacteria/genetics; Gene Expression Regulation, Bacterial; Genetic Heterogeneity
ABSTRACT
An important and largely unsolved problem in synthetic biology is how to target gene expression to specific cell types. Here, we apply iterative deep learning to design synthetic enhancers with strong differential activity between two human cell lines. We initially train models on published datasets of enhancer activity and chromatin accessibility and use them to guide the design of synthetic enhancers that maximize predicted specificity. We experimentally validate these sequences, use the measurements to re-optimize the predictor, and design a second generation of enhancers with improved specificity. Our design methods embed relevant transcription factor binding site (TFBS) motifs with higher frequencies than comparable endogenous enhancers while using a more selective motif vocabulary, and we show that enhancer activity is correlated with transcription factor expression at the single cell level. Finally, we characterize causal features of top enhancers via perturbation experiments and show enhancers as short as 50bp can maintain specificity.
ABSTRACT
The interplay between transcription factors and chromatin accessibility regulates cell type diversification during vertebrate embryogenesis. To systematically decipher the gene regulatory logic guiding this process, we generated a single-cell multi-omics atlas of RNA expression and chromatin accessibility during early zebrafish embryogenesis. We developed a deep learning model to predict chromatin accessibility based on DNA sequence and found that a small number of transcription factors underlie cell-type-specific chromatin landscapes. While Nanog is well-established in promoting pluripotency, we discovered a new function in priming the enhancer accessibility of mesendodermal genes. In addition to the classical stepwise mode of differentiation, we describe instant differentiation, where pluripotent cells skip intermediate fate transitions and terminally differentiate. Reconstruction of gene regulatory interactions reveals that this process is driven by a shallow network in which maternally deposited regulators activate a small set of transcription factors that co-regulate hundreds of differentiation genes. Notably, misexpression of these transcription factors in pluripotent cells is sufficient to ectopically activate their targets. This study provides a rich resource for analyzing embryonic gene regulation and reveals the regulatory logic of instant differentiation.