Results 1 - 20 of 44
1.
Nucleic Acids Res ; 2024 May 30.
Article in English | MEDLINE | ID: mdl-38813823

ABSTRACT

The CRISPR/Cas9 system is a highly accurate gene-editing technique, but it can also lead to unintended off-target sites (OTS). Consequently, many high-throughput assays have been developed to measure OTS in a genome-wide manner, and their data was used to train machine-learning models to predict OTS. However, these models are inaccurate when considering OTS with bulges due to limited data compared to OTS without bulges. Recently, CHANGE-seq, a new in vitro technique to detect OTS, was used to produce a dataset of unprecedented scale and quality. In addition, the same study produced in cellula GUIDE-seq experiments, but none of these GUIDE-seq experiments included bulges. Here, we generated the most comprehensive GUIDE-seq dataset with bulges, and trained and evaluated state-of-the-art machine-learning models that consider OTS with bulges. We first reprocessed the publicly available experimental raw data of the CHANGE-seq study to generate 20 new GUIDE-seq experiments, and hundreds of OTS with bulges among the original and new GUIDE-seq experiments. We then trained multiple machine-learning models, and demonstrated their state-of-the-art performance both in vitro and in cellula over all OTS and when focusing on OTS with bulges. Last, we visualized the key features learned by our models on OTS with bulges in a unique representation.

2.
Nat Commun ; 15(1): 2394, 2024 Mar 16.
Article in English | MEDLINE | ID: mdl-38493141

ABSTRACT

We demonstrate a transcriptional regulatory design algorithm that can boost expression in yeast and mammalian cell lines. The system consists of a simplified transcriptional architecture composed of a minimal core promoter and a synthetic upstream regulatory region (sURS) composed of up to three motifs selected from a list of 41 motifs conserved in the eukaryotic lineage. The sURS system was first characterized using an oligo-library containing 189,990 variants. We validate the resultant expression model using a set of 43 unseen sURS designs. The validation sURS experiments indicate that a generic set of grammar rules for boosting and attenuation may exist in yeast cells. Finally, we demonstrate that this generic set of grammar rules functions similarly in mammalian CHO-K1 and HeLa cells. Consequently, our work provides a design algorithm for boosting the expression of promoters used for expressing industrially relevant proteins in yeast and mammalian cell lines.


Subjects
Eukaryotic Cells; Saccharomyces cerevisiae; Animals; Humans; Saccharomyces cerevisiae/genetics; HeLa Cells; Promoter Regions, Genetic/genetics; Gene Expression; Mammals/genetics
3.
iScience ; 27(1): 108557, 2024 Jan 19.
Article in English | MEDLINE | ID: mdl-38169993

ABSTRACT

CRISPR/Cas9 technology is revolutionizing the field of gene editing. While this technology enables the targeting of any gene, it may also target unplanned loci, termed off-target sites (OTS), which are a few mismatches, insertions, and deletions from the target. While existing methods for finding OTS up to a given mismatch threshold are efficient, other methods considering insertions and deletions are limited by long runtimes, incomplete OTS lists, and partial support of versatile thresholds. Here, we developed SWOffinder, an efficient method based on Smith-Waterman alignment to find all OTS up to some edit distance. We implemented an original trace-back approach to find OTS under versatile criteria, such as separate limits on the number of insertions, deletions, and mismatches. Compared to state-of-the-art methods, only SWOffinder finds all OTS in the genome in just a few minutes. SWOffinder enables accurate and efficient genomic search of OTS, which will lead to safer gene editing.
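
The idea of "separate limits on the number of insertions, deletions, and mismatches" in the abstract above can be illustrated with a minimal Python sketch. This is not SWOffinder's Smith-Waterman trace-back; it is a brute-force, memoized recursion written only for illustration. The function names `within_limits` and `find_sites`, and the convention that an insertion is an extra base in the site while a deletion is a guide base missing from the site, are assumptions made here.

```python
from functools import lru_cache


def within_limits(guide, site, max_mm, max_ins, max_del):
    """Return True if `site` aligns end-to-end to `guide` using at most
    max_mm mismatches, max_ins insertions (extra bases in the site) and
    max_del deletions (guide bases missing from the site)."""

    @lru_cache(maxsize=None)
    def go(i, j, mm, ins, dele):
        # Both sequences fully consumed: a valid alignment was found.
        if i == len(guide) and j == len(site):
            return True
        # Match or mismatch: consume one base from each sequence.
        if i < len(guide) and j < len(site):
            cost = 0 if guide[i] == site[j] else 1
            if mm + cost <= max_mm and go(i + 1, j + 1, mm + cost, ins, dele):
                return True
        # Insertion: the site carries an extra base.
        if j < len(site) and ins < max_ins and go(i, j + 1, mm, ins + 1, dele):
            return True
        # Deletion: a guide base has no counterpart in the site.
        if i < len(guide) and dele < max_del and go(i + 1, j, mm, ins, dele + 1):
            return True
        return False

    return go(0, 0, 0, 0, 0)


def find_sites(guide, genome, max_mm, max_ins, max_del):
    """Naively scan every window whose length is compatible with the limits."""
    hits = []
    n = len(guide)
    for start in range(len(genome)):
        for length in range(n - max_del, n + max_ins + 1):
            end = start + length
            if length <= 0 or end > len(genome):
                continue
            if within_limits(guide, genome[start:end], max_mm, max_ins, max_del):
                hits.append((start, end))
    return hits
```

A scan like this is far slower than a dedicated aligner on a real genome; it is only meant to make the three independent thresholds concrete.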

4.
Genome Res ; 33(7): 1154-1161, 2023 07.
Article in English | MEDLINE | ID: mdl-37558282

ABSTRACT

Minimizers are ubiquitously used in data structures and algorithms for efficient searching, mapping, and indexing of high-throughput DNA sequencing data. Minimizer schemes select a minimum k-mer in every L-long subsequence of the target sequence, where minimality is with respect to a predefined k-mer order. Commonly used minimizer orders select more k-mers than necessary and therefore provide limited improvement in runtime and memory usage of downstream analysis tasks. The recently introduced universal k-mer hitting sets produce minimizer orders with fewer selected k-mers. Generating compact universal k-mer hitting sets is currently infeasible for k > 13, and thus, they cannot help in the many applications that require minimizer orders for larger k. Here, we close the gap of efficient minimizer orders for large values of k by introducing decycling-set-based minimizer orders: new minimizer orders based on minimum decycling sets. We show that in practice these new minimizer orders select a number of k-mers comparable to that of minimizer orders based on universal k-mer hitting sets and can also scale to a larger k. Furthermore, we developed a method that computes the minimizers in a sequence on the fly without keeping the k-mers of a decycling set in memory. This enables the use of these minimizer orders for any value of k. We expect the new orders to improve the runtime and memory usage of algorithms and data structures in high-throughput DNA sequencing analysis.


Subjects
Algorithms; Software; Sequence Analysis, DNA/methods; High-Throughput Nucleotide Sequencing/methods
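
The minimizer scheme defined in the abstract of entry 4 (select a minimum k-mer in every L-long window under a predefined order) can be sketched in a few lines of Python. This sketch uses plain lexicographic order by default and does not implement the decycling-set-based orders of the paper. The function name `minimizers` and the `order` parameter are assumptions for illustration.

```python
def minimizers(seq, k, L, order=None):
    """Return the (position, k-mer) pairs selected as the minimum k-mer
    of at least one L-long window of `seq`.

    `order` maps a k-mer to a sortable key; lexicographic order is used
    when it is omitted. Ties are broken by the leftmost position.
    """
    if order is None:
        order = lambda kmer: kmer
    kmers_per_window = L - k + 1
    selected = set()
    for start in range(len(seq) - L + 1):
        candidates = [
            (order(seq[p:p + k]), p)
            for p in range(start, start + kmers_per_window)
        ]
        _, pos = min(candidates)
        selected.add(pos)
    return [(p, seq[p:p + k]) for p in sorted(selected)]
```

Passing a different `order` function changes which k-mers are selected, and the number of selected positions is the quantity that a better order aims to reduce.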
5.
Brief Bioinform ; 24(4), 2023 07 20.
Article in English | MEDLINE | ID: mdl-37438149

ABSTRACT

Nucleic-acid G-quadruplexes (G4s) play vital roles in many cellular processes. Due to their importance, researchers have developed experimental assays to measure nucleic-acid G4s in high throughput. The generated high-throughput datasets gave rise to unique opportunities to develop machine-learning-based methods, and in particular deep neural networks, to predict G4s in any given nucleic-acid sequence and any species. In this paper, we review the success stories of deep-neural-network applications for G4 prediction. We first cover the experimental technologies that generated the most comprehensive nucleic-acid G4 high-throughput datasets in recent years. We then review classic rule-based methods for G4 prediction. We proceed by reviewing the major machine-learning and deep-neural-network applications to nucleic-acid G4 datasets and report a novel comparison between them. Next, we present the interpretability techniques used on the trained neural networks to learn key molecular principles underlying nucleic-acid G4 folding. As a new result, we calculate the overlap between measured DNA and RNA G4s and compare the performance of DNA- and RNA-G4 predictors on RNA- and DNA-G4 datasets, respectively, to demonstrate the potential of transfer learning from DNA G4s to RNA G4s. Last, we conclude with open questions in the field of nucleic-acid G4 prediction and computational modeling.


Subjects
G-Quadruplexes; Nucleic Acids; DNA/genetics; RNA/genetics; Neural Networks, Computer
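
As a concrete instance of the "classic rule-based methods" mentioned in the abstract of entry 5, the sketch below searches for the widely used canonical motif of four runs of at least three guanines separated by loops of 1 to 7 nucleotides. Whether this exact rule is among those covered by the review is an assumption, and the function name `find_g4_motifs` is hypothetical.

```python
import re

# Four G-runs (>= 3 Gs each) separated by three loops of 1-7 nucleotides.
# U is allowed in loops so that RNA sequences can be scanned as well.
G4_PATTERN = re.compile(
    r"G{3,}[ACGTU]{1,7}G{3,}[ACGTU]{1,7}G{3,}[ACGTU]{1,7}G{3,}"
)


def find_g4_motifs(seq):
    """Return (start, matched_sequence) for non-overlapping motif hits."""
    seq = seq.upper()
    return [(m.start(), m.group(0)) for m in G4_PATTERN.finditer(seq)]
```

A pattern like this yields a yes/no call per locus, whereas the neural-network predictors discussed in the review output a continuous score.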
6.
PLoS Comput Biol ; 19(3): e1010948, 2023 03.
Article in English | MEDLINE | ID: mdl-36897885

ABSTRACT

G-quadruplexes are non-B-DNA structures that form in the genome facilitated by Hoogsteen bonds between guanines in single or multiple strands of DNA. The functions of G-quadruplexes are linked to various molecular and disease phenotypes, and thus researchers are interested in measuring G-quadruplex formation genome-wide. Experimentally measuring G-quadruplexes is a long and laborious process. Computational prediction of G-quadruplex propensity from a given DNA sequence is thus a long-standing challenge. Unfortunately, despite the availability of high-throughput datasets measuring G-quadruplex propensity in the form of mismatch scores, extant methods to predict G-quadruplex formation either rely on small datasets or are based on domain-knowledge rules. We developed G4mismatch, a novel algorithm to accurately and efficiently predict G-quadruplex propensity for any genomic sequence. G4mismatch is based on a convolutional neural network trained on almost 400 million human genomic loci measured in a single G4-seq experiment. When tested on sequences from a held-out chromosome, G4mismatch, the first method to predict mismatch scores genome-wide, achieved a Pearson correlation of over 0.8. When benchmarked on independent datasets derived from various animal species, G4mismatch trained on human data predicted G-quadruplex propensity genome-wide with high accuracy (Pearson correlations greater than 0.7). Moreover, when tested in detecting G-quadruplexes genome-wide using the predicted mismatch scores, G4mismatch achieved superior performance compared to extant methods. Last, we demonstrate the ability to deduce the mechanism behind G-quadruplex formation by unique visualization of the principles learned by the model.


Subjects
G-Quadruplexes; Animals; Humans; DNA/genetics; DNA/chemistry; Genome, Human; Genomics; Neural Networks, Computer
7.
Front Cell Dev Biol ; 11: 1034604, 2023.
Article in English | MEDLINE | ID: mdl-36891511

ABSTRACT

During neurogenesis, the generation and differentiation of neuronal progenitors into inhibitory gamma-aminobutyric acid-containing interneurons is dependent on the combinatorial activity of transcription factors (TFs) and their corresponding regulatory elements (REs). However, the roles of neuronal TFs and their target REs in inhibitory interneuron progenitors are not fully elucidated. Here, we developed a deep-learning-based framework to identify enriched TF motifs in gene REs (eMotif-RE), such as poised/repressed enhancers and putative silencers. Using epigenetic datasets (e.g., ATAC-seq and H3K27ac/me3 ChIP-seq) from cultured interneuron-like progenitors, we distinguished between active enhancer sequences (open chromatin with H3K27ac) and non-active enhancer sequences (open chromatin without H3K27ac). Using our eMotif-RE framework, we discovered enriched motifs of TFs such as ASCL1, SOX4, and SOX11 in the active enhancer set suggesting a cooperativity function for ASCL1 and SOX4/11 in active enhancers of neuronal progenitors. In addition, we found enriched ZEB1 and CTCF motifs in the non-active set. Using an in vivo enhancer assay, we showed that most of the tested putative REs from the non-active enhancer set have no enhancer activity. Two of the eight REs (25%) showed function as poised enhancers in the neuronal system. Moreover, mutated REs for ZEB1 and CTCF motifs increased their in vivo activity as enhancers indicating a repressive effect of ZEB1 and CTCF on these REs that likely function as repressed enhancers or silencers. Overall, our work integrates a novel framework based on deep learning together with a functional assay that elucidated novel functions of TFs and their corresponding REs. Our approach can be applied to better understand gene regulation not only in inhibitory interneuron differentiation but in other tissue and cell types.

8.
Nucleic Acids Res ; 50(20): 11426-11441, 2022 11 11.
Article in English | MEDLINE | ID: mdl-36350614

ABSTRACT

RNA G-quadruplexes (rG4s) are RNA secondary structures, which are formed by guanine-rich sequences and have important cellular functions. Existing computational tools for rG4 prediction rely on specific sequence features and/or were trained on small datasets, without considering rG4 stability information, and are therefore sub-optimal. Here, we developed rG4detector, a convolutional neural network to identify potential rG4s in transcriptomics data. rG4detector outperforms existing methods in both predicting rG4 stability and in detecting rG4-forming sequences. To demonstrate the biological relevance of rG4detector, we employed it to study RNAs that are bound by the RNA-binding protein G3BP1. G3BP1 is central to the induction of stress granules (SGs), which are cytoplasmic biomolecular condensates that form in response to a variety of cellular stresses. Unexpectedly, rG4detector revealed a dynamic enrichment of rG4s bound by G3BP1 in response to cellular stress. In addition, we experimentally characterized G3BP1 cross-talk with rG4s, demonstrating that G3BP1 is a bona fide rG4-binding protein and that endogenous rG4s are enriched within SGs. Furthermore, we found that reduced rG4 availability impairs SG formation. Hence, we conclude that rG4s play a direct role in SG biology via their interactions with RNA-binding proteins and that rG4detector is a novel, useful tool for rG4 transcriptomics data analyses.


Subjects
G-Quadruplexes; RNA-Binding Proteins; Stress Granules; DNA Helicases/genetics; DNA Helicases/metabolism; Poly-ADP-Ribose Binding Proteins/genetics; Poly-ADP-Ribose Binding Proteins/metabolism; RNA/chemistry; RNA Helicases/genetics; RNA Helicases/metabolism; RNA Recognition Motif Proteins/genetics; RNA Recognition Motif Proteins/metabolism; RNA-Binding Proteins/metabolism
9.
Bioinformatics ; 38(Suppl_2): ii62-ii67, 2022 09 16.
Article in English | MEDLINE | ID: mdl-36124796

ABSTRACT

MOTIVATION: Cys2His2 zinc-finger (C2H2-ZF) proteins are the largest class of human transcription factors and hence play central roles in gene regulation and cell function. C2H2-ZF proteins are characterized by a DNA-binding domain containing multiple ZFs. A subset of the ZFs bind diverse DNA triplets. Despite their central roles, little is known about which of their ZFs are binding and how the DNA-binding preferences are encoded in the amino acid sequence of each ZF. RESULTS: We present DeepZF, a deep-learning-based pipeline for predicting binding ZFs and their DNA-binding preferences given only the amino acid sequence of a C2H2-ZF protein. To the best of our knowledge, we compiled the first in vivo dataset of binding and non-binding ZFs for training the first ZF-binding classifier. Our classifier, which is based on a novel protein transformer, achieved an average AUROC of 0.71. Moreover, we took advantage of both in vivo and in vitro datasets to learn the recognition code of ZF-DNA binding through transfer learning. Our newly developed model, which is the first to utilize deep learning for the task, achieved an average Pearson correlation greater than 0.94 over each of the three DNA binding positions. Together, DeepZF outperformed extant methods in the task of C2H2-ZF protein DNA-binding preferences prediction: it achieved an average Pearson correlation of 0.42 in motif similarity compared with an average correlation smaller than 0.1 achieved by extant methods. By applying established interpretability techniques, we show that DeepZF inferred biologically relevant binding principles, such as the effect of amino acid residue positions on ZF DNA-binding potential. AVAILABILITY AND IMPLEMENTATION: DeepZF code, model, and results are available via github.com/OrensteinLab/DeepZF. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
DNA; Zinc Fingers; Amino Acids; DNA/metabolism; Humans; Machine Learning; Transcription Factors; Zinc
10.
Bioinformatics ; 38(Suppl 1): i161-i168, 2022 06 24.
Article in English | MEDLINE | ID: mdl-35758815

ABSTRACT

MOTIVATION: CRISPR/Cas9 technology has been revolutionizing the field of gene editing in recent years. Guide RNAs (gRNAs) enable Cas9 proteins to target specific genomic loci for editing. However, editing efficiency varies between gRNAs. Thus, computational methods were developed to predict editing efficiency for any gRNA of interest. High-throughput datasets of Cas9 editing efficiencies were produced to train machine-learning models to predict editing efficiency. However, these high-throughput datasets have low correlation with functional and endogenous editing. Another difficulty arises from the fact that functional and endogenous editing efficiency is more difficult to measure, and as a result, functional and endogenous datasets are too small to train accurate machine-learning models on. RESULTS: We developed DeepCRISTL, a deep-learning model to predict the on-target efficiency given a gRNA sequence. DeepCRISTL takes advantage of high-throughput datasets to learn general patterns of gRNA on-target editing efficiency, and then uses transfer learning (TL) to fine-tune the model and fit it to the functional and endogenous prediction task. We pre-trained the DeepCRISTL model on more than 150 000 gRNAs, produced through the DeepHF study as a high-throughput dataset of three Cas9 enzymes. We improved the DeepHF model by multi-task and ensemble techniques and achieved state-of-the-art results over each of the three enzymes: up to 0.89 in Spearman correlation between predicted and measured on-target efficiencies. To fine-tune model weights to predict on-target efficiency of functional or endogenous datasets, we tested several TL approaches, with gradual learning being the overall best performer, both when pre-trained on DeepHF and when pre-trained on CRISPROn, another high-throughput dataset. DeepCRISTL outperformed state-of-the-art methods on all functional and endogenous datasets. Using saliency maps, we identified and compared the important features learned by the model in each dataset. We believe DeepCRISTL will improve prediction performance in many other CRISPR/Cas9 editing contexts by leveraging TL to utilize both high-throughput datasets, and smaller and more biologically relevant datasets, such as functional and endogenous datasets. AVAILABILITY AND IMPLEMENTATION: DeepCRISTL is available via github.com/OrensteinLab/DeepCRISTL. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
CRISPR-Cas Systems; RNA, Guide, Kinetoplastida; Gene Editing/methods; Genome; Machine Learning; RNA, Guide, Kinetoplastida/genetics
11.
BMC Bioinformatics ; 23(1): 253, 2022 Jun 24.
Article in English | MEDLINE | ID: mdl-35751023

ABSTRACT

BACKGROUND: The human body is inhabited by a diverse community of commensal non-pathogenic bacteria, many of which are essential for our health. By contrast, pathogenic bacteria have the ability to invade their hosts and cause a disease. Characterizing the differences between pathogenic and commensal non-pathogenic bacteria is important for the detection of emerging pathogens and for the development of new treatments. Previous methods for classification of bacteria as pathogenic or non-pathogenic used either raw genomic reads or protein families as features. Using protein families instead of reads provided a better interpretability of the resulting model. However, the accuracy of protein-families-based classifiers can still be improved. RESULTS: We developed a wide scope pathogenicity classifier (WSPC), a new protein-content-based machine-learning classification model. We trained WSPC on a newly curated dataset of 641 bacterial genomes, where each genome belongs to a different species. A comparative analysis we conducted shows that WSPC outperforms existing models on two benchmark test sets. We observed that the most discriminative protein-family features in WSPC are widely spread among bacterial species. These features correspond to proteins that are involved in the ability of bacteria to survive and replicate during an infection, rather than proteins that are directly involved in damaging or invading the host.


Subjects
Genome, Bacterial; Genomics; Bacteria/genetics; Genomics/methods; Humans; Machine Learning; Phylogeny; Virulence/genetics
12.
Brief Bioinform ; 23(5), 2022 09 20.
Article in English | MEDLINE | ID: mdl-35595297

ABSTRACT

The CRISPR/Cas9 system is widely used in a broad range of gene-editing applications. While this editing technique is quite accurate in the target region, there may be many unplanned off-target sites (OTSs). Consequently, a plethora of computational methods have been developed to predict off-target cleavage sites given a guide RNA and a reference genome. However, these methods are based on small-scale datasets (only tens to hundreds of OTSs) produced by experimental techniques to detect OTSs with a low signal-to-noise ratio. Recently, CHANGE-seq, a new in vitro experimental technique to detect OTSs, was used to produce a dataset of unprecedented scale and quality (>200 000 OTSs over 110 guide RNAs). In addition, the same study included in cellula GUIDE-seq experiments for 58 of the guide RNAs. Here, we fill the gap in previous computational methods by utilizing these data to systematically evaluate data processing and formulation of the CRISPR OTS prediction problem. Our evaluations show that data transformation as a pre-processing phase is critical prior to model training. Moreover, we demonstrate the improvement gained by adding potential inactive OTSs to the training datasets. Furthermore, our results point to the importance of adding the number of mismatches between guide RNAs and their OTSs as a feature. Finally, we present predictive off-target in cellula models based on both in vitro and in cellula data and compare them to state-of-the-art methods in predicting true OTSs. Our conclusions will be instrumental in any future development of an off-target predictor based on high-throughput datasets.


Subjects
CRISPR-Cas Systems; RNA, Guide, Kinetoplastida; Gene Editing/methods; RNA, Guide, Kinetoplastida/genetics; Research Design
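
Two of the pre-processing ideas named in the abstract of entry 12, a transformation of the measured signal and the number of mismatches as an explicit feature, can be sketched as follows. The abstract does not state which transformation was used; the `log1p` transform here is an assumed example, and the function names and the dictionary layout are hypothetical.

```python
import math


def mismatch_count(guide, site):
    """Number of positions at which an equal-length guide and site differ."""
    if len(guide) != len(site):
        raise ValueError("guide and site must have the same length")
    return sum(1 for g, s in zip(guide.upper(), site.upper()) if g != s)


def log_transform(read_count):
    """Assumed example transform: log(1 + reads) compresses large counts."""
    return math.log1p(read_count)


def featurize(guide, site, read_count):
    """Bundle the mismatch feature with a transformed regression label."""
    return {
        "mismatches": mismatch_count(guide, site),
        "label": log_transform(read_count),
    }
```

Sites with bulges have unequal lengths and would need an alignment step first, which this sketch deliberately rejects.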
13.
Bioinformatics ; 38(4): 1087-1101, 2022 01 27.
Article in English | MEDLINE | ID: mdl-34849591

ABSTRACT

MOTIVATION: Messenger RNA (mRNA) degradation plays critical roles in post-transcriptional gene regulation. A major component of mRNA degradation is determined by 3'-UTR elements. Hence, researchers are interested in studying mRNA dynamics as a function of 3'-UTR elements. A recent study measured the mRNA degradation dynamics of tens of thousands of 3'-UTR sequences using a massively parallel reporter assay. However, the computational approach used to model mRNA degradation was based on a simplifying assumption of a linear degradation rate. Consequently, the underlying mechanism of 3'-UTR elements is still not fully understood. RESULTS: Here, we developed deep neural networks to predict mRNA degradation dynamics and interpreted the networks to identify regulatory elements in the 3'-UTR and their positional effect. Given an input of a 110 nt-long 3'-UTR sequence and an initial mRNA level, the model predicts mRNA levels of eight consecutive time points. Our deep neural networks significantly improved prediction performance of mRNA degradation dynamics compared with extant methods for the task. Moreover, we demonstrated that models predicting the dynamics of two identical 3'-UTR sequences, differing by their poly(A) tail, performed better than single-task models. On the interpretability front, by using Integrated Gradients, our convolutional neural network (CNN) models identified known and novel cis-regulatory sequence elements of mRNA degradation. By applying a novel systematic evaluation of model interpretability, we demonstrated that the recurrent neural network models are inferior to the CNN models in terms of interpretability and that random initialization ensemble improves both prediction and interpretability performance. Moreover, using a mutagenesis analysis, we newly discovered the positional effect of various 3'-UTR elements. AVAILABILITY AND IMPLEMENTATION: All the code developed through this study is available at github.com/OrensteinLab/DeepUTR/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Computer Simulation; RNA Stability; RNA, Messenger; 3' Untranslated Regions; Neural Networks, Computer; RNA, Messenger/chemistry; Deep Learning
14.
IEEE/ACM Trans Comput Biol Bioinform ; 19(4): 1946-1955, 2022.
Article in English | MEDLINE | ID: mdl-33872156

ABSTRACT

G-quadruplexes (G4s) are nucleic acid secondary structures that form within guanine-rich DNA or RNA sequences. G4 formation can affect chromatin architecture and gene regulation, and has been associated with genomic instability, genetic diseases, and cancer progression. The experimental data produced by the G4-seq experiment provides unprecedented details on G4 formation in the genome. Still, running the experimental protocol on a whole genome is an expensive and time-consuming process. Thus, it is highly desirable to have a computational method to predict G4 formation in new DNA sequences or whole genomes. Here, we present G4detector, a new method based on a convolutional neural network to predict G4s from DNA sequences. On top of the sequence information, we improved prediction accuracy by the addition of RNA secondary structure information. To train and test G4detector, we compiled novel high-throughput benchmarks over multiple species genomes measured by the G4-seq protocol. We show that G4detector outperforms extant methods for the same task on all benchmark datasets, can detect G4s genome-wide with high accuracy, and is able to extrapolate human-trained measurements to various non-human species. The code and benchmarks are publicly available on github.com/OrensteinLab/G4detector.


Subjects
G-Quadruplexes; DNA/chemistry; DNA/genetics; Genome; Neural Networks, Computer; RNA/chemistry
15.
Brief Bioinform ; 22(6), 2021 11 05.
Article in English | MEDLINE | ID: mdl-34017982

ABSTRACT

Understanding post-transcriptional gene regulation is a key challenge in today's biology. The new technologies of RNAcompete and RNA Bind-n-Seq enable the measurement of the binding intensities of one RNA-binding protein (RBP) to numerous synthetic RNA sequences in a single experiment. Recently, Van Nostrand et al. reported the results of RNA Bind-n-Seq experiments measuring binding of 78 human RBPs. Because 31 of these RBPs were also covered by RNAcompete technology, a large-scale comparison between implementations of these two in vitro technologies is now possible. Here, we assessed the similarities and differences between binding models, represented as a list of k-mer scores, inferred from RNAcompete and RNA Bind-n-Seq, and also measured how well these models predict in vivo binding. Our results show that RNA Bind-n-Seq- and RNAcompete-derived models agree (Pearson correlation > 0.5) for most RBPs (23 out of 31). RNA Bind-n-Seq-derived k-mer scores predict RNAcompete binding measurements quite well (average Pearson correlation 0.26), and both technologies produce k-mer scores that achieve comparable results in predicting in vivo binding (average AUC 0.7). When inspecting RNA structural preferences inferred from the data of RNA Bind-n-Seq and RNAcompete, we observed high concordance in binding preferences. Through our study, we developed a new k-mer score for RNA Bind-n-Seq and extended it to include RNA structural preferences.


Subjects
Computational Biology; Databases, Genetic; Gene Expression Regulation; RNA-Binding Proteins; RNA; Binding Sites; RNA/genetics; RNA/metabolism; RNA-Binding Proteins/genetics; RNA-Binding Proteins/metabolism
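
The comparison described in the abstract of entry 15, where two binding models are each a list of k-mer scores and agreement is measured by Pearson correlation, can be sketched as below. The dictionary representation (k-mer to score), the restriction to shared k-mers, and the function names are assumptions for illustration; the paper's actual scoring procedure is not reproduced here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


def compare_kmer_models(scores_a, scores_b):
    """Correlate two k-mer score dictionaries over the k-mers they share."""
    shared = sorted(set(scores_a) & set(scores_b))
    return pearson(
        [scores_a[kmer] for kmer in shared],
        [scores_b[kmer] for kmer in shared],
    )
```

Under the threshold quoted in the abstract, two models would be said to agree when this value exceeds 0.5.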
16.
Nat Commun ; 12(1): 1576, 2021 03 11.
Article in English | MEDLINE | ID: mdl-33707432

ABSTRACT

We apply an oligo-library and machine-learning approach to characterize the sequence and structural determinants of binding of the phage coat proteins (CPs) of bacteriophages MS2 (MCP), PP7 (PCP), and Qβ (QCP) to RNA. Using the oligo library, we generate thousands of candidate binding sites for each CP, and screen for binding using a high-throughput dose-response Sort-seq assay (iSort-seq). We then apply a neural network to expand this space of binding sites, which allowed us to identify the critical structural and sequence features for binding of each CP. To verify our model and experimental findings, we design several non-repetitive binding site cassettes and validate their functionality in mammalian cells. We find that the binding of each CP to RNA is characterized by a unique space of sequence and structural determinants, thus providing a more complete description of CP-RNA interaction as compared with previous low-throughput findings. Finally, based on the binding spaces we demonstrate a computational tool for the successful design and rapid synthesis of functional non-repetitive binding-site cassettes.


Subjects
Allolevivirus/genetics; Capsid Proteins/metabolism; Escherichia coli/virology; Levivirus/genetics; RNA/metabolism; Attachment Sites, Microbiological/genetics; Binding Sites/genetics; Cell Line, Tumor; Escherichia coli/genetics; Gene Library; Humans; Machine Learning; Plasmids/genetics
17.
Methods Mol Biol ; 2243: 95-105, 2021.
Article in English | MEDLINE | ID: mdl-33606254

ABSTRACT

High-throughput sequencing machines can read millions of DNA molecules in parallel in a short time and at a relatively low cost. As a consequence, researchers have access to databases with millions of genomic samples. Searching and analyzing these large amounts of data require efficient algorithms. Universal hitting sets are sets of words that must be present in any long enough string. Using small universal hitting sets, it is possible to increase the efficiency of many high-throughput sequencing data analyses. But, generating minimum-size universal hitting sets is a hard problem. In this chapter, we cover our algorithmic developments to produce compact universal hitting sets and some of their potential applications.


Subjects
High-Throughput Nucleotide Sequencing/methods; Sequence Analysis, DNA/methods; Algorithms; DNA/genetics; Humans; Software
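
The definition in the abstract of entry 17 (a set of words such that at least one of them is present in any long enough string) can be checked by brute force for very small parameters. The sketch below enumerates every string of length L over the alphabet, so it is exponential in L and only suitable as an illustration; it is not one of the algorithms covered in the chapter, and the function name and parameters are assumptions.

```python
from itertools import product


def is_universal_hitting_set(kmers, k, L, alphabet="ACGT"):
    """Return True if every string of length L over `alphabet` contains
    at least one k-mer from `kmers` (brute force, tiny inputs only)."""
    kmers = set(kmers)
    for chars in product(alphabet, repeat=L):
        s = "".join(chars)
        # Slide over all k-mers of s and look for a hit.
        if not any(s[i:i + k] in kmers for i in range(L - k + 1)):
            return False
    return True
```

For example, over the two-letter alphabet "AC" with k = 2, the set {AA, CC} misses the string ACA, whereas adding AC makes the set universal for L = 3.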
18.
Bioinformatics ; 36(Suppl_2): i634-i642, 2020 12 30.
Article in English | MEDLINE | ID: mdl-33381817

ABSTRACT

MOTIVATION: Transcription factor (TF) DNA-binding is a central mechanism in gene regulation. Biologists would like to know where and when these factors bind DNA. Hence, they require accurate DNA-binding models to enable binding prediction to any DNA sequence. Recent technological advancements measure the binding of a single TF to thousands of DNA sequences. One of the prevailing techniques, high-throughput SELEX, measures protein-DNA binding by high-throughput sequencing over several cycles of enrichment. Unfortunately, current computational methods to infer the binding preferences from high-throughput SELEX data do not exploit the richness of these data, and are under-using the most advanced computational technique, deep neural networks. RESULTS: To better characterize the binding preferences of TFs from these experimental data, we developed DeepSELEX, a new algorithm to infer intrinsic DNA-binding preferences using deep neural networks. DeepSELEX takes advantage of the richness of high-throughput sequencing data and learns the DNA-binding preferences by observing the changes in DNA sequences through the experimental cycles. DeepSELEX outperforms extant methods for the task of DNA-binding inference from high-throughput SELEX data in binding prediction in vitro and is on par with the state of the art in in vivo binding prediction. Analysis of model parameters reveals it learns biologically relevant features that shed light on TFs' binding mechanism. AVAILABILITY AND IMPLEMENTATION: DeepSELEX is available through github.com/OrensteinLab/DeepSELEX/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
DNA; High-Throughput Nucleotide Sequencing; Binding Sites; DNA/genetics; DNA/metabolism; Protein Binding; Sequence Analysis, DNA
19.
Cancers (Basel) ; 12(6), 2020 Jun 14.
Article in English | MEDLINE | ID: mdl-32545894

ABSTRACT

Transcription factors encoded by Homeobox (HOX) genes play numerous key functions during early embryonic development and differentiation. Multiple reports have shown that mis-regulation of HOX gene expression plays key roles in the development of cancers. Their expression levels in cancers tend to differ based on tissue and tumor type. Here, we performed a comprehensive analysis comparing HOX gene expression in different cancer types, obtained from The Cancer Genome Atlas (TCGA), with matched healthy tissues, obtained from Genotype-Tissue Expression (GTEx). We identified and quantified differential expression patterns that confirmed previously identified expression changes and highlighted new differential expression signatures. We discovered differential expression patterns that are in line with patient survival data. This comprehensive and quantitative analysis provides a global picture of HOX genes' differential expression patterns in different cancer types.

20.
ACS Chem Biol ; 15(4): 925-935, 2020 04 17.
Article in English | MEDLINE | ID: mdl-32216326

ABSTRACT

Single-stranded DNA (ssDNA) containing four guanine repeats can form G-quadruplex (G4) structures. While cellular proteins and small molecules can bind G4s, it has been difficult to broadly assess their DNA-binding specificity. Here, we use custom DNA microarrays to examine the binding specificities of proteins, small molecules, and antibodies across ∼15,000 potential G4 structures. Molecules used include fluorescently labeled pyridostatin (Cy5-PDS, a small molecule), BG4 (Cy5-BG4, a G4-specific antibody), and eight proteins (GST-tagged nucleolin, IGF2, CNBP, FANCJ, PIF1, BLM, DHX36, and WRN). Cy5-PDS and Cy5-BG4 selectively bind sequences known to form G4s, confirming their formation on the microarrays. Cy5-PDS binding decreased when G4 formation was inhibited using lithium or when ssDNA features on the microarray were made double-stranded. Similar conditions inhibited the binding of all other molecules except for CNBP and PIF1. We report that proteins have different G4-binding preferences suggesting unique cellular functions. Finally, competition experiments are used to assess the binding specificity of an unlabeled small molecule, revealing the structural features in the G4 required to achieve selectivity. These data demonstrate that the microarray platform can be used to assess the binding preferences of molecules to G4s on a broad scale, helping to understand the properties that govern molecular recognition.


Subjects
DNA, Single-Stranded/metabolism; DNA-Binding Proteins/metabolism; G-Quadruplexes; DNA, Single-Stranded/genetics; Humans; Oligonucleotide Array Sequence Analysis; Polymorphism, Single Nucleotide; Protein Binding