Search | VHL Regional Portal

1.

A nuclear RNA degradation code for eukaryotic transcriptome surveillance.

Soles, Lindsey V; Liu, Liang; Zou, Xudong; Yoon, Yoseop; Li, Shuangyu; Tian, Lusong; Valdez, Marielle Cárdenas; Yu, Angela; Yin, Hong; Li, Wei; Ding, Fangyuan; Seelig, Georg; Li, Lei; Shi, Yongsheng.

bioRxiv ; 2024 Jul 24.

Article in English | MEDLINE | ID: mdl-39211185

ABSTRACT

The RNA exosome plays critical roles in eukaryotic RNA degradation, but it remains unclear how the exosome specifically recognizes its targets. The PAXT connection is an adaptor that recruits the exosome to polyadenylated RNAs in the nucleus, especially transcripts polyadenylated at intronic poly(A) sites. Here we show that PAXT-mediated RNA degradation is induced by the combination of a 5' splice site and a poly(A) junction, but not by either sequence alone. These sequences are bound by U1 snRNP and cleavage/polyadenylation factors, which in turn cooperatively recruit PAXT. As the 5' splice site-poly(A) junction combination is typically not found on correctly processed full-length RNAs, we propose that it functions as a "nuclear RNA degradation code" (NRDC). Importantly, disease-associated single nucleotide polymorphisms that create novel 5' splice sites in 3' untranslated regions can induce aberrant mRNA degradation via the NRDC mechanism. Together our study identified the first NRDC, revealed its recognition mechanism, and characterized its role in human diseases.

2.

Massively parallel measurement of protein-protein interactions by sequencing using MP3-seq.

Baryshev, Alexandr; La Fleur, Alyssa; Groves, Benjamin; Michel, Cirstyn; Baker, David; Ljubetic, Ajasja; Seelig, Georg.

Nat Chem Biol ; 2024 Aug 27.

Article in English | MEDLINE | ID: mdl-39192093

ABSTRACT

Protein-protein interactions (PPIs) regulate many cellular processes and engineered PPIs have cell and gene therapy applications. Here, we introduce massively parallel PPI measurement by sequencing (MP3-seq), an easy-to-use and highly scalable yeast two-hybrid approach for measuring PPIs. In MP3-seq, DNA barcodes are associated with specific protein pairs and barcode enrichment can be read by sequencing to provide a direct measure of interaction strength. We show that MP3-seq is highly quantitative and scales to over 100,000 interactions. We apply MP3-seq to characterize interactions between families of rationally designed heterodimers and to investigate elements conferring specificity to coiled-coil interactions. Lastly, we predict coiled heterodimer structures using AlphaFold-Multimer (AF-M) and train linear models on physics-based energy terms to predict MP3-seq values. We find that AF-M-based models could be valuable for prescreening interactions but experimentally measuring interactions remains necessary to rank their strengths quantitatively.
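As a rough illustration of how barcode enrichment read out by sequencing could be turned into an interaction score, the sketch below computes a log2 ratio of post-selection to pre-selection barcode frequencies. The function name, the pseudocount, and the log-ratio definition are assumptions made for illustration; the abstract does not state the scoring formula used by MP3-seq.

```python
import math


def log2_enrichment(pre_counts, post_counts, pseudocount=1.0):
    """Hypothetical per-barcode enrichment score (not the published MP3-seq pipeline).

    pre_counts / post_counts map a barcode (one protein pair) to its read count
    before and after selection. A pseudocount avoids division by zero.
    """
    barcodes = sorted(set(pre_counts) | set(post_counts))
    pre_total = sum(pre_counts.get(b, 0) + pseudocount for b in barcodes)
    post_total = sum(post_counts.get(b, 0) + pseudocount for b in barcodes)
    scores = {}
    for b in barcodes:
        pre_freq = (pre_counts.get(b, 0) + pseudocount) / pre_total
        post_freq = (post_counts.get(b, 0) + pseudocount) / post_total
        # Positive score: the barcode gained share during selection.
        scores[b] = math.log2(post_freq / pre_freq)
    return scores
```

Under this assumed definition, a pair whose barcode gains read share during selection scores above zero and a pair that loses share scores below zero.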

3.

Single-cell RNA sequencing reveals plasmid constrains bacterial population heterogeneity and identifies a non-conjugating subpopulation.

Cyriaque, Valentine; Ibarra-Chávez, Rodrigo; Kuchina, Anna; Seelig, Georg; Nesme, Joseph; Madsen, Jonas Stenløkke.

Nat Commun ; 15(1): 5853, 2024 Jul 12.

Article in English | MEDLINE | ID: mdl-38997267

ABSTRACT

Transcriptional heterogeneity in isogenic bacterial populations can play various roles in bacterial evolution, but its detection remains technically challenging. Here, we use microbial split-pool ligation transcriptomics to study the relationship between bacterial subpopulation formation and plasmid-host interactions at the single-cell level. We find that single-cell transcript abundances are influenced by bacterial growth state and plasmid carriage. Moreover, plasmid carriage constrains the formation of bacterial subpopulations. Plasmid genes, including those with core functions such as replication and maintenance, exhibit transcriptional heterogeneity associated with cell activity. Notably, we identify a cell subpopulation that does not transcribe conjugal plasmid transfer genes, which may help reduce plasmid burden on a subset of cells. Our study advances the understanding of plasmid-mediated subpopulation dynamics and provides insights into the plasmid-bacteria interplay.

Subject(s)

Plasmids , Single-Cell Analysis , Plasmids/genetics , Single-Cell Analysis/methods , Escherichia coli/genetics , Sequence Analysis, RNA/methods , Conjugation, Genetic , Bacteria/genetics , Gene Expression Regulation, Bacterial , Genetic Heterogeneity

4.

High-throughput single-cell transcriptomics of bacteria using combinatorial barcoding.

Gaisser, Karl D; Skloss, Sophie N; Brettner, Leandra M; Paleologu, Luana; Roco, Charles M; Rosenberg, Alexander B; Hirano, Matthew; DePaolo, R William; Seelig, Georg; Kuchina, Anna.

Nat Protoc ; 2024 Jun 17.

Article in English | MEDLINE | ID: mdl-38886529

ABSTRACT

Microbial split-pool ligation transcriptomics (microSPLiT) is a high-throughput single-cell RNA sequencing method for bacteria. With four combinatorial barcoding rounds, microSPLiT can profile transcriptional states in hundreds of thousands of Gram-negative and Gram-positive bacteria in a single experiment without specialized equipment. As bacterial samples are fixed and permeabilized before barcoding, they can be collected and stored ahead of time. During the first barcoding round, the fixed and permeabilized bacteria are distributed into a 96-well plate, where their transcripts are reverse transcribed into cDNA and labeled with the first well-specific barcode inside the cells. The cells are mixed and redistributed two more times into new 96-well plates, where the second and third barcodes are appended to the cDNA via in-cell ligation reactions. Finally, the cells are mixed and divided into aliquot sub-libraries, which can be stored until future use or prepared for sequencing with the addition of a fourth barcode. It takes 4 days to generate sequencing-ready libraries, including 1 day for collection and overnight fixation of samples. The standard plate setup enables single-cell transcriptional profiling of up to 1 million bacterial cells and up to 96 samples in a single barcoding experiment, with the possibility of expansion by adding barcoding rounds. The protocol requires experience in basic molecular biology techniques, handling of bacterial samples and preparation of DNA libraries for next-generation sequencing. It can be performed by experienced undergraduate or graduate students. Data analysis requires access to computing resources, familiarity with Unix command line and basic experience with Python or R.
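The barcode arithmetic behind the split-pool scheme can be sketched as follows. Three rounds in 96-well plates give 96^3 = 884,736 barcode combinations, and the fourth (sub-library) barcode multiplies that number. The number of sub-libraries and the uniform-random-assignment model used for the collision estimate are assumptions for illustration, not values taken from the protocol.

```python
def barcode_combinations(wells_per_round=96, plate_rounds=3, sublibraries=1):
    """Number of distinct barcode combinations available.

    `sublibraries` is a hypothetical parameter standing in for the fourth barcode.
    """
    return wells_per_round ** plate_rounds * sublibraries


def collision_fraction(n_cells, n_combinations):
    """Expected fraction of cells that share their barcode combination with at
    least one other cell, assuming each cell gets a combination uniformly at random."""
    if n_cells < 2:
        return 0.0
    return 1.0 - (1.0 - 1.0 / n_combinations) ** (n_cells - 1)


if __name__ == "__main__":
    combos = barcode_combinations()
    print(combos)  # 884736
    print(collision_fraction(10_000, combos))
```

Under this simple model, keeping the number of cells per sub-library well below the number of combinations keeps the expected collision rate low, which is why adding barcoding rounds or sub-libraries expands capacity.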

5.

Optimizing 5'UTRs for mRNA-delivered gene editing using deep learning.

Castillo-Hair, Sebastian; Fedak, Stephen; Wang, Ban; Linder, Johannes; Havens, Kyle; Certo, Michael; Seelig, Georg.

Nat Commun ; 15(1): 5284, 2024 Jun 20.

Article in English | MEDLINE | ID: mdl-38902240

ABSTRACT

mRNA therapeutics are revolutionizing the pharmaceutical industry, but methods to optimize the primary sequence for increased expression are still lacking. Here, we design 5'UTRs for efficient mRNA translation using deep learning. We perform polysome profiling of fully or partially randomized 5'UTR libraries in three cell types and find that UTR performance is highly correlated across cell types. We train models on our datasets and use them to guide the design of high-performing 5'UTRs using gradient descent and generative neural networks. We experimentally test designed 5'UTRs with mRNA encoding megaTAL™ gene editing enzymes for two different gene targets and in two different cell lines. We find that the designed 5'UTRs support strong gene editing activity. Editing efficiency is correlated between cell types and gene targets, although the best performing UTR was specific to one cargo and cell type. Our results highlight the potential of model-based sequence design for mRNA therapeutics.

Subject(s)

5' Untranslated Regions , Deep Learning , Gene Editing , RNA, Messenger , RNA, Messenger/genetics , RNA, Messenger/metabolism , 5' Untranslated Regions/genetics , Humans , Gene Editing/methods , Polyribosomes/metabolism , Cell Line , HEK293 Cells , Protein Biosynthesis

6.

Iterative deep learning-design of human enhancers exploits condensed sequence grammar to achieve cell type-specificity.

Yin, Christopher; Hair, Sebastian Castillo; Byeon, Gun Woo; Bromley, Peter; Meuleman, Wouter; Seelig, Georg.

bioRxiv ; 2024 Jun 14.

Article in English | MEDLINE | ID: mdl-38915713

ABSTRACT

An important and largely unsolved problem in synthetic biology is how to target gene expression to specific cell types. Here, we apply iterative deep learning to design synthetic enhancers with strong differential activity between two human cell lines. We initially train models on published datasets of enhancer activity and chromatin accessibility and use them to guide the design of synthetic enhancers that maximize predicted specificity. We experimentally validate these sequences, use the measurements to re-optimize the predictor, and design a second generation of enhancers with improved specificity. Our design methods embed relevant transcription factor binding site (TFBS) motifs with higher frequencies than comparable endogenous enhancers while using a more selective motif vocabulary, and we show that enhancer activity is correlated with transcription factor expression at the single cell level. Finally, we characterize causal features of top enhancers via perturbation experiments and show enhancers as short as 50bp can maintain specificity.

7.

The anticancer compound JTE-607 reveals hidden sequence specificity of the mRNA 3' processing machinery.

Liu, Liang; Yu, Angela M; Wang, Xiuye; Soles, Lindsey V; Teng, Xueyi; Chen, Yiling; Yoon, Yoseop; Sarkan, Kristianna S K; Valdez, Marielle Cárdenas; Linder, Johannes; England, Whitney; Spitale, Robert; Yu, Zhaoxia; Marazzi, Ivan; Qiao, Feng; Li, Wei; Seelig, Georg; Shi, Yongsheng.

Nat Struct Mol Biol ; 30(12): 1947-1957, 2023 Dec.

Article in English | MEDLINE | ID: mdl-38087090

ABSTRACT

JTE-607 is an anticancer and anti-inflammatory compound and its active form, compound 2, directly binds to and inhibits CPSF73, the endonuclease for the cleavage step in pre-messenger RNA (pre-mRNA) 3' processing. Surprisingly, compound 2-mediated inhibition of pre-mRNA cleavage is sequence specific and the drug sensitivity is predominantly determined by sequences flanking the cleavage site (CS). Using massively parallel in vitro assays, we identified key sequence features that determine drug sensitivity. We trained a machine learning model that can predict poly(A) site (PAS) relative sensitivity to compound 2 and provide the molecular basis for understanding the impact of JTE-607 on PAS selection and transcription termination genome wide. We propose that CPSF73 and associated factors bind to the CS region in a sequence-dependent manner and the interaction affinity determines compound 2 sensitivity. These results have not only elucidated the mechanism of action of JTE-607, but also unveiled an evolutionarily conserved sequence specificity of the mRNA 3' processing machinery.

Subject(s)

RNA Precursors , RNA Processing, Post-Transcriptional , Cell Line , RNA Precursors/genetics , RNA Precursors/metabolism , RNA, Messenger/genetics , RNA, Messenger/metabolism

8.

The regulatory landscape of 5' UTRs in translational control during zebrafish embryogenesis.

Reimão-Pinto, Madalena M; Castillo-Hair, Sebastian M; Seelig, Georg; Schier, Alex F.

bioRxiv ; 2023 Nov 23.

Article in English | MEDLINE | ID: mdl-38045294

ABSTRACT

The 5' UTRs of mRNAs are critical for translation regulation, but their in vivo regulatory features are poorly characterized. Here, we report the regulatory landscape of 5' UTRs during early zebrafish embryogenesis using a massively parallel reporter assay of 18,154 sequences coupled to polysome profiling. We found that the 5' UTR is sufficient to confer temporal dynamics to translation initiation, and identified 86 motifs enriched in 5' UTRs with distinct ribosome recruitment capabilities. A quantitative deep learning model, DaniO5P, revealed a combined role for 5' UTR length, translation initiation site context, upstream AUGs and sequence motifs on in vivo ribosome recruitment. DaniO5P predicts the activities of 5' UTR isoforms and indicates that modulating 5' UTR length and motif grammar contributes to translation initiation dynamics. This study provides a first quantitative model of 5' UTR-based translation regulation in early vertebrate development and lays the foundation for identifying the underlying molecular effectors.

9.

DNA storage in thermoresponsive microcapsules for repeated random multiplexed data access.

Bögels, Bas W A; Nguyen, Bichlien H; Ward, David; Gascoigne, Levena; Schrijver, David P; Makri Pistikou, Anna-Maria; Joesaar, Alex; Yang, Shuo; Voets, Ilja K; Mulder, Willem J M; Phillips, Andrew; Mann, Stephen; Seelig, Georg; Strauss, Karin; Chen, Yuan-Jyue; de Greef, Tom F A.

Nat Nanotechnol ; 18(8): 912-921, 2023 08.

Article in English | MEDLINE | ID: mdl-37142708

ABSTRACT

DNA has emerged as an attractive medium for archival data storage due to its durability and high information density. Scalable parallel random access to information is a desirable property of any storage system. For DNA-based storage systems, however, this still needs to be robustly established. Here we report on a thermoconfined polymerase chain reaction, which enables multiplexed, repeated random access to compartmentalized DNA files. The strategy is based on localizing biotin-functionalized oligonucleotides inside thermoresponsive, semipermeable microcapsules. At low temperatures, microcapsules are permeable to enzymes, primers and amplified products, whereas at high temperatures, membrane collapse prevents molecular crosstalk during amplification. Our data show that the platform outperforms non-compartmentalized DNA storage in repeated random access and reduces amplification bias tenfold during multiplex polymerase chain reaction. Using fluorescent sorting, we also demonstrate sample pooling and data retrieval by microcapsule barcoding. Therefore, the thermoresponsive microcapsule technology offers a scalable, sequence-agnostic approach for repeated random access to archival DNA files.

Subject(s)

DNA , Information Storage and Retrieval , Capsules , DNA/genetics , Oligonucleotides , High-Throughput Nucleotide Sequencing

10.

The anti-cancer compound JTE-607 reveals hidden sequence specificity of the mRNA 3' processing machinery.

Liu, Liang; Yu, Angela M; Wang, Xiuye; Soles, Lindsey V; Chen, Yiling; Yoon, Yoseop; Sarkan, Kristianna S K; Valdez, Marielle Cárdenas; Linder, Johannes; Marazzi, Ivan; Yu, Zhaoxia; Qiao, Feng; Li, Wei; Seelig, Georg; Shi, Yongsheng.

bioRxiv ; 2023 Apr 11.

Article in English | MEDLINE | ID: mdl-37090613

ABSTRACT

JTE-607 is a small molecule compound with anti-inflammation and anti-cancer activities. Upon entering the cell, it is hydrolyzed to Compound 2, which directly binds to and inhibits CPSF73, the endonuclease for the cleavage step in pre-mRNA 3' processing. Although CPSF73 is universally required for mRNA 3' end formation, we have unexpectedly found that Compound 2-mediated inhibition of pre-mRNA 3' processing is sequence-specific and that the sequences flanking the cleavage site (CS) are a major determinant for drug sensitivity. By using massively parallel in vitro assays, we have measured the Compound 2 sensitivities of over 260,000 sequence variants and identified key sequence features that determine drug sensitivity. A machine learning model trained on these data can predict the impact of JTE-607 on poly(A) site (PAS) selection and transcription termination genome-wide. We propose a biochemical model in which CPSF73 and other mRNA 3' processing factors bind to RNA of the CS region in a sequence-specific manner and the affinity of such interaction determines the Compound 2 sensitivity of a PAS. As the Compound 2-resistant CS sequences, characterized by U/A-rich motifs, are prevalent in PASs from yeast to human, the CS region sequence may have more fundamental functions beyond determining drug resistance. Together, our study not only characterized the mechanism of action of a compound with clinical implications, but also revealed a previously unknown and evolutionarily conserved sequence-specificity of the mRNA 3' processing machinery.

11.

Massively parallel protein-protein interaction measurement by sequencing (MP3-seq) enables rapid screening of protein heterodimers.

Baryshev, Alexander; La Fleur, Alyssa; Groves, Benjamin; Michel, Cirstyn; Baker, David; Ljubetic, Ajasja; Seelig, Georg.

bioRxiv ; 2023 Aug 16.

Article in English | MEDLINE | ID: mdl-36798377

ABSTRACT

Protein-protein interactions (PPIs) regulate many cellular processes, and engineered PPIs have cell and gene therapy applications. Here we introduce massively parallel protein-protein interaction measurement by sequencing (MP3-seq), an easy-to-use and highly scalable yeast-two-hybrid approach for measuring PPIs. In MP3-seq, DNA barcodes are associated with specific protein pairs, and barcode enrichment can be read by sequencing to provide a direct measure of interaction strength. We show that MP3-seq is highly quantitative and scales to over 100,000 interactions. We apply MP3-seq to characterize interactions between families of rationally designed heterodimers and to investigate elements conferring specificity to coiled-coil interactions. Finally, we predict coiled heterodimer structures using AlphaFold-Multimer (AF-M) and train linear models on physics simulation energy terms to predict MP3-seq values. We find that AF-M and AF-M complex prediction-based models could be valuable for pre-screening interactions, but that measuring interactions experimentally remains necessary to rank their strengths quantitatively.

12.

Deciphering the impact of genetic variation on human polyadenylation using APARENT2.

Linder, Johannes; Koplik, Samantha E; Kundaje, Anshul; Seelig, Georg.

Genome Biol ; 23(1): 232, 2022 11 05.

Article in English | MEDLINE | ID: mdl-36335397

ABSTRACT

BACKGROUND: 3'-end processing by cleavage and polyadenylation is an important and finely tuned regulatory process during mRNA maturation. Numerous genetic variants are known to cause or contribute to human disorders by disrupting the cis-regulatory code of polyadenylation signals. Yet, due to the complexity of this code, variant interpretation remains challenging. RESULTS: We introduce a residual neural network model, APARENT2, that can infer 3'-cleavage and polyadenylation from DNA sequence more accurately than any previous model. This model generalizes to the case of alternative polyadenylation (APA) for a variable number of polyadenylation signals. We demonstrate APARENT2's performance on several variant datasets, including functional reporter data and human 3' aQTLs from GTEx. We apply neural network interpretation methods to gain insights into disrupted or protective higher-order features of polyadenylation. We fine-tune APARENT2 on human tissue-resolved transcriptomic data to elucidate tissue-specific variant effects. By combining APARENT2 with models of mRNA stability, we extend aQTL effect size predictions to the entire 3' untranslated region. Finally, we perform in silico saturation mutagenesis of all human polyadenylation signals and compare the predicted effects of [Formula: see text] million variants against gnomAD. While loss-of-function variants were generally selected against, we also find specific clinical conditions linked to gain-of-function mutations. For example, we detect an association between gain-of-function mutations in the 3'-end and autism spectrum disorder. To experimentally validate APARENT2's predictions, we assayed clinically relevant variants in multiple cell lines, including microglia-derived cells. 
CONCLUSIONS: A sequence-to-function model based on deep residual learning enables accurate functional interpretation of genetic variants in polyadenylation signals and, when coupled with large human variation databases, elucidates the link between functional 3'-end mutations and human health.

Subject(s)

Autism Spectrum Disorder , Polyadenylation , Humans , Autism Spectrum Disorder/genetics , RNA Stability/genetics , Transcriptome , Genetic Variation , 3' Untranslated Regions

13.

Interpreting Neural Networks for Biological Sequences by Learning Stochastic Masks.

Linder, Johannes; La Fleur, Alyssa; Chen, Zibo; Ljubetic, Ajasja; Baker, David; Kannan, Sreeram; Seelig, Georg.

Nat Mach Intell ; 4(1): 41-54, 2022 Jan.

Article in English | MEDLINE | ID: mdl-35966405

ABSTRACT

Sequence-based neural networks can learn to make accurate predictions from large biological datasets, but model interpretation remains challenging. Many existing feature attribution methods are optimized for continuous rather than discrete input patterns and assess individual feature importance in isolation, making them ill-suited for interpreting non-linear interactions in molecular sequences. Building on work in computer vision and natural language processing, we developed an approach based on deep learning - Scrambler networks - wherein the most salient sequence positions are identified with learned input masks. Scramblers learn to predict Position-Specific Scoring Matrices (PSSMs) where unimportant nucleotides or residues are scrambled by raising their entropy. We apply Scramblers to interpret the effects of genetic variants, uncover non-linear interactions between cis-regulatory elements, explain binding specificity for protein-protein interactions, and identify structural determinants of de novo designed proteins. We show that Scramblers enable efficient attribution across large datasets and result in high-quality explanations, often outperforming state-of-the-art methods.

14.

A nanopore interface for higher bandwidth DNA computing.

Zhang, Karen; Chen, Yuan-Jyue; Wilde, Delaney; Doroschak, Kathryn; Strauss, Karin; Ceze, Luis; Seelig, Georg; Nivala, Jeff.

Nat Commun ; 13(1): 4904, 2022 08 20.

Article in English | MEDLINE | ID: mdl-35987925

ABSTRACT

DNA has emerged as a powerful substrate for programming information processing machines at the nanoscale. Among the DNA computing primitives used today, DNA strand displacement (DSD) is arguably the most popular, with DSD-based circuit applications ranging from disease diagnostics to molecular artificial neural networks. The outputs of DSD circuits are generally read using fluorescence spectroscopy. However, due to the spectral overlap of typical small-molecule fluorescent reporters, the number of unique outputs that can be detected in parallel is limited, requiring complex optical setups or spatial isolation of reactions to make output bandwidths scalable. Here, we present a multiplexable sequencing-free readout method that enables real-time, kinetic measurement of DSD circuit activity through highly parallel, direct detection of barcoded output strands using nanopore sensor array technology (Oxford Nanopore Technologies' MinION device). These results increase DSD output bandwidth by an order of magnitude over what is currently feasible with fluorescence spectroscopy.

Subject(s)

Nanopores , DNA , High-Throughput Nucleotide Sequencing/methods , Recombination, Genetic , Sequence Analysis, DNA/methods

15.

Modular, robust, and extendible multicellular circuit design in yeast.

Carignano, Alberto; Chen, Dai Hua; Mallory, Cannon; Wright, R Clay; Seelig, Georg; Klavins, Eric.

Elife ; 11, 2022 03 21.

Article in English | MEDLINE | ID: mdl-35312478

ABSTRACT

Division of labor between cells is ubiquitous in biology but the use of multicellular consortia for engineering applications is only beginning to be explored. A significant advantage of multicellular circuits is their potential to be modular with respect to composition but this claim has not yet been extensively tested using experiments and quantitative modeling. Here, we construct a library of 24 yeast strains capable of sending, receiving or responding to three molecular signals, characterize them experimentally and build quantitative models of their input-output relationships. We then compose these strains into two- and three-strain cascades as well as a four-strain bistable switch and show that experimentally measured consortia dynamics can be predicted from the models of the constituent parts. To further explore the achievable range of behaviors, we perform a fully automated computational search over all two-, three-, and four-strain consortia to identify combinations that realize target behaviors including logic gates, band-pass filters, and time pulses. Strain combinations that are predicted to map onto a target behavior are further computationally optimized and then experimentally tested. Experiments closely track computational predictions. The high reliability of these model descriptions further strengthens the feasibility and highlights the potential for distributed computing in synthetic biology.

Subject(s)

Saccharomyces cerevisiae , Synthetic Biology , Gene Library , Logic , Reproducibility of Results , Saccharomyces cerevisiae/genetics , Synthetic Biology/methods

16.

Machine Learning for Designing Next-Generation mRNA Therapeutics.

Castillo-Hair, Sebastian M; Seelig, Georg.

Acc Chem Res ; 55(1): 24-34, 2022 01 04.

Article in English | MEDLINE | ID: mdl-34905691

ABSTRACT

Over just the last 2 years, mRNA therapeutics and vaccines have undergone a rapid transition from an intriguing concept to real-world impact. However, whereas some aspects of mRNA therapeutics, such as the use of chemical modifications to increase stability and reduce immunogenicity, have been extensively optimized for over two decades, other aspects, particularly the selection and design of the noncoding leader and trailer sequences which control translation efficiency and stability, have received comparably less attention. In practice, such 5' and 3' untranslated regions (UTRs) are often borrowed from highly expressed human genes with few or no modifications, as in the case for the Pfizer/BioNTech Covid vaccine. Focusing on the 5'UTR, we here argue that model-driven design is a promising alternative that provides unprecedented control over 5'UTR function. We review recent work that combines synthetic biology with machine learning to build quantitative models that relate ribosome loading, and thus translation efficiency, to the 5'UTR sequence. We first introduce an experimental approach that uses polysome profiling and high-throughput sequencing to quantify ribosome loading for hundreds of thousands of 5'UTRs in parallel. We apply this approach to measure ribosome loading in synthetic RNA libraries with a random sequence inserted into the 5'UTR. We then review Optimus 5-Prime, a convolutional neural network model trained on the experimental data. We highlight that very accurate models of biological regulation can be learned from synthetic data sets with degenerate 5'UTRs. We validate model predictions not only on held-out data sets from our random library but also on a large library of over 30,000 human 5'UTR fragments and using translation reporter data collected independently by other groups. Both the experiment and model are compatible with commonly used chemically modified nucleosides, in particular, pseudouridine (Ψ) and 1-methyl-pseudouridine (m1Ψ). We find that, in general, 5'UTRs have very similar impacts when combined with different protein-coding sequences and even in the context of different chemical modifications. We demonstrate that Optimus 5-Prime can be combined with design algorithms to generate de novo sequences with precisely defined translation efficiencies. We emphasize recent developments in design algorithms that rely on activation maximization and generative modeling to improve both the fitness and diversity of designed sequences. Compared with prior approaches such as genetic algorithms, we show that these approaches are not only faster but also less likely to get stuck in local sequence optima. Finally, we discuss how the approach reviewed here can be generalized to other gene regions and applications.

Subject(s)

COVID-19 , Protein Biosynthesis , COVID-19 Vaccines , Humans , Machine Learning , RNA, Messenger/genetics , RNA, Messenger/metabolism , SARS-CoV-2

17.

CellMeSH: probabilistic cell-type identification using indexed literature.

Mao, Shunfu; Zhang, Yue; Seelig, Georg; Kannan, Sreeram.

Bioinformatics ; 38(5): 1393-1402, 2022 02 07.

Article in English | MEDLINE | ID: mdl-34893819

ABSTRACT

MOTIVATION: Single-cell RNA sequencing (scRNA-seq) is widely used for analyzing gene expression in multi-cellular systems and provides unprecedented access to cellular heterogeneity. scRNA-seq experiments aim to identify and quantify all cell types present in a sample. Measured single-cell transcriptomes are grouped by similarity and the resulting clusters are mapped to cell types based on cluster-specific gene expression patterns. While the process of generating clusters has become largely automated, annotation remains a laborious ad hoc effort that requires expert biological knowledge. RESULTS: Here, we introduce CellMeSH, a new automated approach to identifying cell types for clusters based on prior literature. CellMeSH combines a database of gene-cell-type associations with a probabilistic method for database querying. The database is constructed by automatically linking gene and cell-type information from millions of publications using existing indexed literature resources. Compared to manually constructed databases, CellMeSH is more comprehensive and is easily updated with new data. The probabilistic query method enables reliable information retrieval even though the gene-cell-type associations extracted from the literature are noisy. CellMeSH is also able to optionally utilize prior knowledge about tissues or cells for further annotation improvement. CellMeSH achieves top-one and top-three accuracies on a number of mouse and human datasets that are consistently better than existing approaches. AVAILABILITY AND IMPLEMENTATION: Web server at https://uncurl.cs.washington.edu/db_query and API at https://github.com/shunfumao/cellmesh. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Algorithms , Software , Humans , Gene Expression Profiling/methods , Sequence Analysis, RNA/methods , Single-Cell Analysis/methods

18.

Fast activation maximization for molecular sequence design.

Linder, Johannes; Seelig, Georg.

BMC Bioinformatics ; 22(1): 510, 2021 Oct 20.

Article in English | MEDLINE | ID: mdl-34670493

ABSTRACT

BACKGROUND: Optimization of DNA and protein sequences based on Machine Learning models is becoming a powerful tool for molecular design. Activation maximization offers a simple design strategy for differentiable models: one-hot coded sequences are first approximated by a continuous representation, which is then iteratively optimized with respect to the predictor oracle by gradient ascent. While elegant, the current version of the method suffers from vanishing gradients and may cause predictor pathologies leading to poor convergence. RESULTS: Here, we introduce Fast SeqProp, an improved activation maximization method that combines straight-through approximation with normalization across the parameters of the input sequence distribution. Fast SeqProp overcomes bottlenecks in earlier methods arising from input parameters becoming skewed during optimization. Compared to prior methods, Fast SeqProp results in up to 100-fold faster convergence while also finding improved fitness optima for many applications. We demonstrate Fast SeqProp's capabilities by designing DNA and protein sequences for six deep learning predictors, including a protein structure predictor. CONCLUSIONS: Fast SeqProp offers a reliable and efficient method for general-purpose sequence optimization through a differentiable fitness predictor. As demonstrated on a variety of deep learning models, the method is widely applicable, and can incorporate various regularization techniques to maintain confidence in the sequence designs. As a design tool, Fast SeqProp may aid in the development of novel molecules, drug therapies and vaccines.
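The general idea of activation maximization with a straight-through gradient and logit normalization can be sketched on a toy problem. The example below is a simplified illustration, not Fast SeqProp itself: the predictor is a hypothetical linear scoring matrix, the learnable scale and offset are omitted, and the logits are simply re-normalized after every update instead of differentiating through the normalization. All function names are made up for this sketch.

```python
import math

ALPHABET = "ACGT"


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def normalize(logits):
    """Rescale all logits to zero mean and unit variance (simplified normalization)."""
    flat = [v for row in logits for v in row]
    mean = sum(flat) / len(flat)
    var = sum((v - mean) ** 2 for v in flat) / len(flat)
    std = math.sqrt(var) or 1.0  # guard against a constant logit matrix
    return [[(v - mean) / std for v in row] for row in logits]


def design(weights, steps=50, lr=1.0):
    """Gradient ascent on sequence logits for a toy predictor score(x) = sum(weights * x).

    weights: one row of four values (A, C, G, T) per sequence position.
    """
    logits = [[0.0] * 4 for _ in weights]
    for _ in range(steps):
        updated = []
        for row, w in zip(logits, weights):
            p = softmax(row)
            # Straight-through idea: the predictor sees a discrete one-hot sequence,
            # but the gradient is passed back as if the input were the softmax.
            # For this linear toy predictor d(score)/d(one-hot) is simply w,
            # so d(score)/d(logit_i) = p_i * (w_i - sum_j p_j * w_j).
            dot = sum(pi * wi for pi, wi in zip(p, w))
            grad = [pi * (wi - dot) for pi, wi in zip(p, w)]
            updated.append([v + lr * g for v, g in zip(row, grad)])
        logits = normalize(updated)
    # Read out the designed sequence as the argmax nucleotide at each position.
    return "".join(ALPHABET[row.index(max(row))] for row in logits)
```

With a real differentiable predictor, the per-position gradient `w` would come from backpropagation through the network rather than from a fixed matrix.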

Subject(s)

Algorithms, Machine Learning, Amino Acid Sequence

19.

A comprehensive analysis of gene expression changes in a high replicate and open-source dataset of differentiating hiPSC-derived cardiomyocytes.

Grancharova, Tanya; Gerbin, Kaytlyn A; Rosenberg, Alexander B; Roco, Charles M; Arakaki, Joy E; DeLizo, Colette M; Dinh, Stephanie Q; Donovan-Maiye, Rory M; Hirano, Matthew; Nelson, Angelique M; Tang, Joyce; Theriot, Julie A; Yan, Calysta; Menon, Vilas; Palecek, Sean P; Seelig, Georg; Gunawardane, Ruwanthi N.

Sci Rep ; 11(1): 15845, 2021 Aug 04.

Article in English | MEDLINE | ID: mdl-34349150

ABSTRACT

We performed a comprehensive analysis of the transcriptional changes occurring during human induced pluripotent stem cell (hiPSC) differentiation to cardiomyocytes. Using single cell RNA-seq, we sequenced > 20,000 single cells from 55 independent samples representing two differentiation protocols and multiple hiPSC lines. Samples included experimental replicates ranging from undifferentiated hiPSCs to mixed populations of cells at D90 post-differentiation. Differentiated cell populations clustered by time point, with differential expression analysis revealing markers of cardiomyocyte differentiation and maturation changing from D12 to D90. We next performed a complementary cluster-independent sparse regression analysis to identify and rank genes that best assigned cells to differentiation time points. The two highest ranked genes between D12 and D24 (MYH7 and MYH6) resulted in an accuracy of 0.84, and the three highest ranked genes between D24 and D90 (A2M, H19, IGF2) resulted in an accuracy of 0.94, revealing that low dimensional gene features can identify differentiation or maturation stages in differentiating cardiomyocytes. Expression levels of select genes were validated using RNA FISH. Finally, we interrogated differences in cardiac gene expression resulting from two differentiation protocols, experimental replicates, and three hiPSC lines in the WTC-11 background to identify sources of variation across these experimental variables.

Subject(s)

Biomarkers/metabolism, Cell Differentiation, Gene Expression Regulation, Induced Pluripotent Stem Cells/metabolism, Myocytes, Cardiac/cytology, Myocytes, Cardiac/metabolism, Transcriptome, Humans, Induced Pluripotent Stem Cells/cytology, RNA-Seq

20.

Molecular-level similarity search brings computing to DNA data storage.

Bee, Callista; Chen, Yuan-Jyue; Queen, Melissa; Ward, David; Liu, Xiaomeng; Organick, Lee; Seelig, Georg; Strauss, Karin; Ceze, Luis.

Nat Commun ; 12(1): 4764, 2021 Aug 06.

Article in English | MEDLINE | ID: mdl-34362913

ABSTRACT

As global demand for digital storage capacity grows, storage technologies based on synthetic DNA have emerged as a dense and durable alternative to traditional media. Existing approaches leverage robust error correcting codes and precise molecular mechanisms to reliably retrieve specific files from large databases. Typically, files are retrieved using a pre-specified key, analogous to a filename. However, these approaches lack the ability to perform more complex computations over the stored data, such as similarity search: e.g., finding images that look similar to an image of interest without prior knowledge of their file names. Here we demonstrate a technique for executing similarity search over a DNA-based database of 1.6 million images. Queries are implemented as hybridization probes, and a key step in our approach was to learn an image-to-sequence encoding ensuring that queries preferentially bind to targets representing visually similar images. Experimental results show that our molecular implementation performs comparably to state-of-the-art in silico algorithms for similarity search.

Subject(s)

Computational Biology/methods, DNA/chemistry, Databases, Genetic, Information Storage and Retrieval, Algorithms, Base Sequence, Computer Simulation, DNA/genetics, DNA Probes, Databases, Factual, Neural Networks, Computer
