1.
Genome Res ; 33(7): 1154-1161, 2023 07.
Article in English | MEDLINE | ID: mdl-37558282

ABSTRACT

Minimizers are ubiquitously used in data structures and algorithms for efficient searching, mapping, and indexing of high-throughput DNA sequencing data. Minimizer schemes select a minimum k-mer in every L-long subsequence of the target sequence, where minimality is with respect to a predefined k-mer order. Commonly used minimizer orders select more k-mers than necessary and therefore provide limited improvement in runtime and memory usage of downstream analysis tasks. The recently introduced universal k-mer hitting sets produce minimizer orders with fewer selected k-mers. Generating compact universal k-mer hitting sets is currently infeasible for k > 13, and thus, they cannot help in the many applications that require minimizer orders for larger k. Here, we close the gap of efficient minimizer orders for large values of k by introducing decycling-set-based minimizer orders: new minimizer orders based on minimum decycling sets. We show that in practice these new minimizer orders select a number of k-mers comparable to that of minimizer orders based on universal k-mer hitting sets and can also scale to a larger k. Furthermore, we developed a method that computes the minimizers in a sequence on the fly without keeping the k-mers of a decycling set in memory. This enables the use of these minimizer orders for any value of k. We expect the new orders to improve the runtime and memory usage of algorithms and data structures in high-throughput DNA sequencing analysis.
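
To make the window definition in this abstract concrete, the following minimal Python sketch selects, for every L-long window, the smallest k-mer under a given order and reports the selected start positions. The function name `minimizer_positions`, the default lexicographic order, and the leftmost tie-breaking rule are assumptions made for illustration; the decycling-set-based orders introduced in the paper are not implemented here.

```python
def minimizer_positions(seq, k, L, order=None):
    """Return the sorted start positions of k-mers selected as minimizers.

    seq   -- the target sequence (string)
    k     -- k-mer length
    L     -- window length in characters (each window holds L - k + 1 k-mers)
    order -- function mapping a k-mer to a comparable key; defaults to
             plain lexicographic order (an illustrative assumption)
    """
    if not 0 < k <= L:
        raise ValueError("require 0 < k <= L")
    if order is None:
        order = lambda kmer: kmer

    selected = set()
    for start in range(len(seq) - L + 1):
        # All k-mers fully contained in the window seq[start:start + L].
        candidates = [
            (order(seq[i:i + k]), i)
            for i in range(start, start + L - k + 1)
        ]
        # min() on (key, position) pairs breaks ties by leftmost position.
        selected.add(min(candidates)[1])
    return sorted(selected)


if __name__ == "__main__":
    print(minimizer_positions("ACGTACGT", k=2, L=4))
```

On `ACGTACGT` with k = 2 and L = 4, the lexicographic order selects the k-mers starting at positions 0, 1 and 4. A different order can be passed through the `order` argument, which is where a more compact scheme would plug in.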


Subjects
Algorithms , Software , Sequence Analysis, DNA/methods , High-Throughput Nucleotide Sequencing/methods
2.
Nucleic Acids Res ; 52(12): 6777-6790, 2024 Jul 08.
Article in English | MEDLINE | ID: mdl-38813823

ABSTRACT

The CRISPR/Cas9 system is a highly accurate gene-editing technique, but it can also lead to unintended off-target sites (OTS). Consequently, many high-throughput assays have been developed to measure OTS in a genome-wide manner, and their data was used to train machine-learning models to predict OTS. However, these models are inaccurate when considering OTS with bulges due to limited data compared to OTS without bulges. Recently, CHANGE-seq, a new in vitro technique to detect OTS, was used to produce a dataset of unprecedented scale and quality. In addition, the same study produced in cellula GUIDE-seq experiments, but none of these GUIDE-seq experiments included bulges. Here, we generated the most comprehensive GUIDE-seq dataset with bulges, and trained and evaluated state-of-the-art machine-learning models that consider OTS with bulges. We first reprocessed the publicly available experimental raw data of the CHANGE-seq study to generate 20 new GUIDE-seq experiments, and hundreds of OTS with bulges among the original and new GUIDE-seq experiments. We then trained multiple machine-learning models, and demonstrated their state-of-the-art performance both in vitro and in cellula over all OTS and when focusing on OTS with bulges. Last, we visualized the key features learned by our models on OTS with bulges in a unique representation.


Subjects
CRISPR-Cas Systems , Gene Editing , Machine Learning , Gene Editing/methods , Humans , RNA, Guide, CRISPR-Cas Systems/genetics
3.
Brief Bioinform ; 24(4)2023 07 20.
Article in English | MEDLINE | ID: mdl-37438149

ABSTRACT

Nucleic-acid G-quadruplexes (G4s) play vital roles in many cellular processes. Due to their importance, researchers have developed experimental assays to measure nucleic-acid G4s in high throughput. The generated high-throughput datasets gave rise to unique opportunities to develop machine-learning-based methods, and in particular deep neural networks, to predict G4s in any given nucleic-acid sequence and any species. In this paper, we review the success stories of deep-neural-network applications for G4 prediction. We first cover the experimental technologies that generated the most comprehensive nucleic-acid G4 high-throughput datasets in recent years. We then review classic rule-based methods for G4 prediction. We proceed by reviewing the major machine-learning and deep-neural-network applications to nucleic-acid G4 datasets and report a novel comparison between them. Next, we present the interpretability techniques used on the trained neural networks to learn key molecular principles underlying nucleic-acid G4 folding. As a new result, we calculate the overlap between measured DNA and RNA G4s and compare the performance of DNA- and RNA-G4 predictors on RNA- and DNA-G4 datasets, respectively, to demonstrate the potential of transfer learning from DNA G4s to RNA G4s. Last, we conclude with open questions in the field of nucleic-acid G4 prediction and computational modeling.
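
The abstract does not say which rule-based methods are reviewed. As an illustration of the rule-based idea only, the sketch below scans a sequence for the widely used canonical motif of four runs of at least three guanines separated by loops of 1 to 7 nucleotides. The pattern, the loop bounds, and the name `find_g4_candidates` are assumptions here, not details taken from the paper.

```python
import re

# Canonical putative-G4 pattern: G3+ N1-7 G3+ N1-7 G3+ N1-7 G3+.
# The loop alphabet accepts both T (DNA) and U (RNA).
G4_MOTIF = re.compile(
    r"G{3,}[ACGTU]{1,7}G{3,}[ACGTU]{1,7}G{3,}[ACGTU]{1,7}G{3,}"
)


def find_g4_candidates(seq):
    """Return (start, end) spans of non-overlapping motif matches in seq."""
    return [(m.start(), m.end()) for m in G4_MOTIF.finditer(seq.upper())]


if __name__ == "__main__":
    print(find_g4_candidates("TTGGGAGGGAGGGAGGGTT"))
```

A pattern match of this kind only flags candidates on one strand and says nothing about stability, which is one reason learned models are of interest for this task.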


Subjects
G-Quadruplexes , Nucleic Acids , DNA/genetics , RNA/genetics , Neural Networks, Computer
4.
Bioinformatics ; 2024 Jul 29.
Article in English | MEDLINE | ID: mdl-39073893

ABSTRACT

MOTIVATION: CRISPR/Cas9 technology has been revolutionizing the field of gene editing. Guide RNAs (gRNAs) enable Cas9 proteins to target specific genomic loci for editing. However, editing efficiency varies between gRNAs and so computational methods were developed to predict editing efficiency for any gRNA of interest. High-throughput datasets of Cas9 editing efficiencies were produced to train machine-learning models to predict editing efficiency. However, these high-throughput datasets have a low correlation with functional and endogenous datasets, which are too small to train accurate machine-learning models on. RESULTS: We developed DeepCRISTL, a deep-learning model to predict the editing efficiency in a specific cellular context. DeepCRISTL takes advantage of high-throughput datasets to learn general patterns of gRNA editing efficiency, and then fine-tunes the model on functional or endogenous data to fit a specific cellular context. We tested two state-of-the-art models trained on high-throughput datasets for editing efficiency prediction, our newly improved DeepHF and CRISPRon, combined with various transfer-learning approaches. The combination of CRISPRon and fine-tuning all model weights was the overall best performer. DeepCRISTL outperformed state-of-the-art methods in predicting editing efficiency in a specific cellular context on functional and endogenous datasets. Using saliency maps, we identified and compared the important features learned by DeepCRISTL across cellular contexts. We believe DeepCRISTL will improve prediction performance in many other CRISPR/Cas9 editing contexts by leveraging transfer learning to utilize both high-throughput datasets and smaller and more biologically relevant datasets. AVAILABILITY: DeepCRISTL is available via github.com/OrensteinLab/DeepCRISTL. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

5.
Brief Bioinform ; 23(5)2022 09 20.
Article in English | MEDLINE | ID: mdl-35595297

ABSTRACT

The CRISPR/Cas9 system is widely used in a broad range of gene-editing applications. While this editing technique is quite accurate in the target region, there may be many unplanned off-target sites (OTSs). Consequently, a plethora of computational methods have been developed to predict off-target cleavage sites given a guide RNA and a reference genome. However, these methods are based on small-scale datasets (only tens to hundreds of OTSs) produced by experimental techniques to detect OTSs with a low signal-to-noise ratio. Recently, CHANGE-seq, a new in vitro experimental technique to detect OTSs, was used to produce a dataset of unprecedented scale and quality (>200 000 OTSs over 110 guide RNAs). In addition, the same study included in cellula GUIDE-seq experiments for 58 of the guide RNAs. Here, we fill the gap in previous computational methods by utilizing these data to systematically evaluate data processing and formulation of the CRISPR OTS prediction problem. Our evaluations show that data transformation as a pre-processing phase is critical prior to model training. Moreover, we demonstrate the improvement gained by adding potential inactive OTSs to the training datasets. Furthermore, our results point to the importance of adding the number of mismatches between guide RNAs and their OTSs as a feature. Finally, we present predictive off-target in cellula models based on both in vitro and in cellula data and compare them to state-of-the-art methods in predicting true OTSs. Our conclusions will be instrumental in any future development of an off-target predictor based on high-throughput datasets.
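
The mismatch-count feature mentioned in this abstract can be sketched as a simple Hamming distance between an aligned guide RNA and off-target site. This is a minimal illustration assuming equal-length, gap-free (no bulge) sequences compared position by position; the function name `mismatch_count` and the example sequences are hypothetical and not taken from the paper.

```python
def mismatch_count(guide, off_target):
    """Count positions at which two aligned, equal-length sequences differ.

    Assumes the sequences are already aligned without bulges; comparison
    is case-insensitive and treats every character literally.
    """
    if len(guide) != len(off_target):
        raise ValueError("sequences must have equal length (no bulges)")
    return sum(g != o for g, o in zip(guide.upper(), off_target.upper()))


if __name__ == "__main__":
    guide = "GAGTCCGAGCAGAAGAAGAA"   # hypothetical 20-nt guide
    site = "GAGTTCGAGCAGAAGAAGCA"    # hypothetical site with two changes
    print(mismatch_count(guide, site))
```

The resulting integer can be appended to whatever sequence encoding a model already uses, which is the sense in which it serves as an extra feature.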


Subjects
CRISPR-Cas Systems , RNA, Guide, Kinetoplastida , Gene Editing/methods , RNA, Guide, Kinetoplastida/genetics , Research Design
6.
PLoS Comput Biol ; 19(3): e1010948, 2023 03.
Article in English | MEDLINE | ID: mdl-36897885

ABSTRACT

G-quadruplexes are non-B-DNA structures that form in the genome facilitated by Hoogsteen bonds between guanines in single or multiple strands of DNA. The functions of G-quadruplexes are linked to various molecular and disease phenotypes, and thus researchers are interested in measuring G-quadruplex formation genome-wide. Experimentally measuring G-quadruplexes is a long and laborious process. Computational prediction of G-quadruplex propensity from a given DNA sequence is thus a long-standing challenge. Unfortunately, despite the availability of high-throughput datasets measuring G-quadruplex propensity in the form of mismatch scores, extant methods to predict G-quadruplex formation either rely on small datasets or are based on domain-knowledge rules. We developed G4mismatch, a novel algorithm to accurately and efficiently predict G-quadruplex propensity for any genomic sequence. G4mismatch is based on a convolutional neural network trained on almost 400 million human genomic loci measured in a single G4-seq experiment. When tested on sequences from a held-out chromosome, G4mismatch, the first method to predict mismatch scores genome-wide, achieved a Pearson correlation of over 0.8. When benchmarked on independent datasets derived from various animal species, G4mismatch trained on human data predicted G-quadruplex propensity genome-wide with high accuracy (Pearson correlations greater than 0.7). Moreover, when tested in detecting G-quadruplexes genome-wide using the predicted mismatch scores, G4mismatch achieved superior performance compared to extant methods. Last, we demonstrate the ability to deduce the mechanism behind G-quadruplex formation by unique visualization of the principles learned by the model.


Subjects
G-Quadruplexes , Animals , Humans , DNA/genetics , DNA/chemistry , Genome, Human , Genomics , Neural Networks, Computer
7.
Nucleic Acids Res ; 50(20): 11426-11441, 2022 11 11.
Article in English | MEDLINE | ID: mdl-36350614

ABSTRACT

RNA G-quadruplexes (rG4s) are RNA secondary structures, which are formed by guanine-rich sequences and have important cellular functions. Existing computational tools for rG4 prediction rely on specific sequence features and/or were trained on small datasets, without considering rG4 stability information, and are therefore sub-optimal. Here, we developed rG4detector, a convolutional neural network to identify potential rG4s in transcriptomics data. rG4detector outperforms existing methods in both predicting rG4 stability and in detecting rG4-forming sequences. To demonstrate the biological relevance of rG4detector, we employed it to study RNAs that are bound by the RNA-binding protein G3BP1. G3BP1 is central to the induction of stress granules (SGs), which are cytoplasmic biomolecular condensates that form in response to a variety of cellular stresses. Unexpectedly, rG4detector revealed a dynamic enrichment of rG4s bound by G3BP1 in response to cellular stress. In addition, we experimentally characterized G3BP1 cross-talk with rG4s, demonstrating that G3BP1 is a bona fide rG4-binding protein and that endogenous rG4s are enriched within SGs. Furthermore, we found that reduced rG4 availability impairs SG formation. Hence, we conclude that rG4s play a direct role in SG biology via their interactions with RNA-binding proteins and that rG4detector is a novel and useful tool for rG4 transcriptomics data analyses.


Subjects
G-Quadruplexes , RNA-Binding Proteins , Stress Granules , DNA Helicases/genetics , DNA Helicases/metabolism , Poly-ADP-Ribose Binding Proteins/genetics , Poly-ADP-Ribose Binding Proteins/metabolism , RNA/chemistry , RNA Helicases/genetics , RNA Helicases/metabolism , RNA Recognition Motif Proteins/genetics , RNA Recognition Motif Proteins/metabolism , RNA-Binding Proteins/metabolism
8.
Brief Bioinform ; 22(6)2021 11 05.
Article in English | MEDLINE | ID: mdl-34017982

ABSTRACT

Understanding post-transcriptional gene regulation is a key challenge in today's biology. The new technologies of RNAcompete and RNA Bind-n-Seq enable the measurement of the binding intensities of one RNA-binding protein (RBP) to numerous synthetic RNA sequences in a single experiment. Recently, Van Nostrand et al. reported the results of RNA Bind-n-Seq experiments measuring binding of 78 human RBPs. Because 31 of these RBPs were also covered by RNAcompete technology, a large-scale comparison between implementations of these two in vitro technologies is now possible. Here, we assessed the similarities and differences between binding models, represented as a list of k-mer scores, inferred from RNAcompete and RNA Bind-n-Seq, and also measured how well these models predict in vivo binding. Our results show that RNA Bind-n-Seq- and RNAcompete-derived models agree (Pearson correlation > 0.5) for most RBPs (23 out of 31). RNA Bind-n-Seq-derived k-mer scores predict RNAcompete binding measurements quite well (average Pearson correlation 0.26), and both technologies produce k-mer scores that achieve comparable results in predicting in vivo binding (average AUC 0.7). When inspecting RNA structural preferences inferred from the data of RNA Bind-n-Seq and RNAcompete, we observed high concordance in binding preferences. Through our study, we developed a new k-mer score for RNA Bind-n-Seq and extended it to include RNA structural preferences.
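
The agreement measure used in this abstract, a Pearson correlation between two lists of k-mer scores, can be sketched as follows. The dictionary representation, the restriction to k-mers present in both models, and the name `pearson_over_shared_kmers` are assumptions for illustration; the study's actual scoring pipeline is not reproduced here.

```python
import math


def pearson_over_shared_kmers(scores_a, scores_b):
    """Pearson correlation of two k-mer score dicts over their shared k-mers."""
    shared = sorted(set(scores_a) & set(scores_b))
    if len(shared) < 2:
        raise ValueError("need at least two shared k-mers")

    xs = [scores_a[kmer] for kmer in shared]
    ys = [scores_b[kmer] for kmer in shared]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)

    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical scores for the same protein from two assays.
    assay_1 = {"AAAA": 1.0, "AAAC": 2.0, "AAAG": 3.0}
    assay_2 = {"AAAA": 2.0, "AAAC": 4.0, "AAAG": 6.0, "TTTT": 9.0}
    print(pearson_over_shared_kmers(assay_1, assay_2))
```

Under this sketch, two models would count as agreeing in the abstract's sense when the returned value exceeds 0.5.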


Subjects
Computational Biology , Databases, Genetic , Gene Expression Regulation , RNA-Binding Proteins , RNA , Binding Sites , RNA/genetics , RNA/metabolism , RNA-Binding Proteins/genetics , RNA-Binding Proteins/metabolism
9.
Bioinformatics ; 38(Suppl 1): i161-i168, 2022 06 24.
Article in English | MEDLINE | ID: mdl-35758815

ABSTRACT

MOTIVATION: CRISPR/Cas9 technology has been revolutionizing the field of gene editing in recent years. Guide RNAs (gRNAs) enable Cas9 proteins to target specific genomic loci for editing. However, editing efficiency varies between gRNAs. Thus, computational methods were developed to predict editing efficiency for any gRNA of interest. High-throughput datasets of Cas9 editing efficiencies were produced to train machine-learning models to predict editing efficiency. However, these high-throughput datasets have low correlation with functional and endogenous editing. Another difficulty arises from the fact that functional and endogenous editing efficiency is more difficult to measure, and as a result, functional and endogenous datasets are too small to train accurate machine-learning models on. RESULTS: We developed DeepCRISTL, a deep-learning model to predict the on-target efficiency given a gRNA sequence. DeepCRISTL takes advantage of high-throughput datasets to learn general patterns of gRNA on-target editing efficiency, and then uses transfer learning (TL) to fine-tune the model and fit it to the functional and endogenous prediction task. We pre-trained the DeepCRISTL model on more than 150 000 gRNAs, produced through the DeepHF study as a high-throughput dataset of three Cas9 enzymes. We improved the DeepHF model by multi-task and ensemble techniques and achieved state-of-the-art results over each of the three enzymes: up to 0.89 in Spearman correlation between predicted and measured on-target efficiencies. To fine-tune model weights to predict on-target efficiency of functional or endogenous datasets, we tested several TL approaches, with gradual learning being the overall best performer, both when pre-trained on DeepHF and when pre-trained on CRISPROn, another high-throughput dataset. DeepCRISTL outperformed state-of-the-art methods on all functional and endogenous datasets. 
Using saliency maps, we identified and compared the important features learned by the model in each dataset. We believe DeepCRISTL will improve prediction performance in many other CRISPR/Cas9 editing contexts by leveraging TL to utilize both high-throughput datasets, and smaller and more biologically relevant datasets, such as functional and endogenous datasets. AVAILABILITY AND IMPLEMENTATION: DeepCRISTL is available via github.com/OrensteinLab/DeepCRISTL. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
CRISPR-Cas Systems , RNA, Guide, Kinetoplastida , Gene Editing/methods , Genome , Machine Learning , RNA, Guide, Kinetoplastida/genetics
10.
Bioinformatics ; 38(4): 1087-1101, 2022 01 27.
Article in English | MEDLINE | ID: mdl-34849591

ABSTRACT

MOTIVATION: Messenger RNA (mRNA) degradation plays critical roles in post-transcriptional gene regulation. A major component of mRNA degradation is determined by 3'-UTR elements. Hence, researchers are interested in studying mRNA dynamics as a function of 3'-UTR elements. A recent study measured the mRNA degradation dynamics of tens of thousands of 3'-UTR sequences using a massively parallel reporter assay. However, the computational approach used to model mRNA degradation was based on a simplifying assumption of a linear degradation rate. Consequently, the underlying mechanism of 3'-UTR elements is still not fully understood. RESULTS: Here, we developed deep neural networks to predict mRNA degradation dynamics and interpreted the networks to identify regulatory elements in the 3'-UTR and their positional effect. Given an input of a 110 nt-long 3'-UTR sequence and an initial mRNA level, the model predicts mRNA levels of eight consecutive time points. Our deep neural networks significantly improved prediction performance of mRNA degradation dynamics compared with extant methods for the task. Moreover, we demonstrated that models predicting the dynamics of two identical 3'-UTR sequences, differing by their poly(A) tail, performed better than single-task models. On the interpretability front, by using Integrated Gradients, our convolutional neural network (CNN) models identified known and novel cis-regulatory sequence elements of mRNA degradation. By applying a novel systematic evaluation of model interpretability, we demonstrated that the recurrent neural network models are inferior to the CNN models in terms of interpretability and that random initialization ensemble improves both prediction and interpretability performance. Moreover, using a mutagenesis analysis, we newly discovered the positional effect of various 3'-UTR elements. AVAILABILITY AND IMPLEMENTATION: All the code developed through this study is available at github.com/OrensteinLab/DeepUTR/.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Computer Simulation , RNA Stability , RNA, Messenger , 3' Untranslated Regions , Neural Networks, Computer , RNA, Messenger/chemistry , Deep Learning
11.
Bioinformatics ; 38(Suppl_2): ii62-ii67, 2022 09 16.
Article in English | MEDLINE | ID: mdl-36124796

ABSTRACT

MOTIVATION: Cys2His2 zinc-finger (C2H2-ZF) proteins are the largest class of human transcription factors and hence play central roles in gene regulation and cell function. C2H2-ZF proteins are characterized by a DNA-binding domain containing multiple ZFs. A subset of the ZFs bind diverse DNA triplets. Despite their central roles, little is known about which of their ZFs are binding and how the DNA-binding preferences are encoded in the amino acid sequence of each ZF. RESULTS: We present DeepZF, a deep-learning-based pipeline for predicting binding ZFs and their DNA-binding preferences given only the amino acid sequence of a C2H2-ZF protein. To the best of our knowledge, we compiled the first in vivo dataset of binding and non-binding ZFs for training the first ZF-binding classifier. Our classifier, which is based on a novel protein transformer, achieved an average AUROC of 0.71. Moreover, we took advantage of both in vivo and in vitro datasets to learn the recognition code of ZF-DNA binding through transfer learning. Our newly developed model, which is the first to utilize deep learning for the task, achieved an average Pearson correlation greater than 0.94 over each of the three DNA binding positions. Together, DeepZF outperformed extant methods in the task of C2H2-ZF protein DNA-binding preferences prediction: it achieved an average Pearson correlation of 0.42 in motif similarity compared with an average correlation smaller than 0.1 achieved by extant methods. By applying established interpretability techniques, we show that DeepZF inferred biologically relevant binding principles, such as the effect of amino acid residue positions on ZF DNA-binding potential. AVAILABILITY AND IMPLEMENTATION: DeepZF code, model, and results are available via github.com/OrensteinLab/DeepZF. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
DNA , Zinc Fingers , Amino Acids , DNA/metabolism , Humans , Machine Learning , Transcription Factors , Zinc
12.
BMC Bioinformatics ; 23(1): 253, 2022 Jun 24.
Article in English | MEDLINE | ID: mdl-35751023

ABSTRACT

BACKGROUND: The human body is inhabited by a diverse community of commensal non-pathogenic bacteria, many of which are essential for our health. By contrast, pathogenic bacteria have the ability to invade their hosts and cause a disease. Characterizing the differences between pathogenic and commensal non-pathogenic bacteria is important for the detection of emerging pathogens and for the development of new treatments. Previous methods for classification of bacteria as pathogenic or non-pathogenic used either raw genomic reads or protein families as features. Using protein families instead of reads provided a better interpretability of the resulting model. However, the accuracy of protein-families-based classifiers can still be improved. RESULTS: We developed a wide scope pathogenicity classifier (WSPC), a new protein-content-based machine-learning classification model. We trained WSPC on a newly curated dataset of 641 bacterial genomes, where each genome belongs to a different species. A comparative analysis we conducted shows that WSPC outperforms existing models on two benchmark test sets. We observed that the most discriminative protein-family features in WSPC are widely spread among bacterial species. These features correspond to proteins that are involved in the ability of bacteria to survive and replicate during an infection, rather than proteins that are directly involved in damaging or invading the host.


Subjects
Genome, Bacterial , Genomics , Bacteria/genetics , Genomics/methods , Humans , Machine Learning , Phylogeny , Virulence/genetics
13.
Genes Dev ; 28(19): 2163-74, 2014 Oct 01.
Article in English | MEDLINE | ID: mdl-25223897

ABSTRACT

Transcription of protein-coding genes is highly dependent on the RNA polymerase II core promoter. Core promoters, generally defined as the regions that direct transcription initiation, consist of functional core promoter motifs (such as the TATA-box, initiator [Inr], and downstream core promoter element [DPE]) that confer specific properties to the core promoter. The known basal transcription factors that support TATA-dependent transcription are insufficient for in vitro transcription of DPE-dependent promoters. In search of a transcription factor that supports DPE-dependent transcription, we used a biochemical complementation approach and identified the Drosophila TBP (TATA-box-binding protein)-related factor 2 (TRF2) as an enriched factor in the fractions that support DPE-dependent transcription. We demonstrate that the short TRF2 isoform preferentially activates DPE-dependent promoters. DNA microarray analysis reveals the enrichment of DPE promoters among short TRF2 up-regulated genes. Using primer extension analysis and reporter assays, we show the importance of the DPE in transcriptional regulation of TRF2 target genes. It was previously shown that, unlike TBP, TRF2 fails to bind DNA containing TATA-boxes. Using microfluidic affinity analysis, we discovered that short TRF2-bound DNA oligos are enriched for Inr and DPE motifs. Taken together, our findings highlight the role of short TRF2 as a preferential core promoter regulator.


Subjects
Drosophila Proteins/metabolism , Drosophila melanogaster/genetics , Drosophila melanogaster/metabolism , Gene Expression Regulation , Telomeric Repeat Binding Protein 2/metabolism , Amino Acid Motifs , Animals , Cell Line , Cells, Cultured , Drosophila Proteins/genetics , Protein Binding , TATA Box , Telomeric Repeat Binding Protein 2/genetics
14.
Bioinformatics ; 36(Suppl_2): i634-i642, 2020 12 30.
Article in English | MEDLINE | ID: mdl-33381817

ABSTRACT

MOTIVATION: Transcription factor (TF) DNA-binding is a central mechanism in gene regulation. Biologists would like to know where and when these factors bind DNA. Hence, they require accurate DNA-binding models to enable binding prediction to any DNA sequence. Recent technological advancements measure the binding of a single TF to thousands of DNA sequences. One of the prevailing techniques, high-throughput SELEX, measures protein-DNA binding by high-throughput sequencing over several cycles of enrichment. Unfortunately, current computational methods to infer the binding preferences from high-throughput SELEX data do not exploit the richness of these data, and are under-using the most advanced computational technique, deep neural networks. RESULTS: To better characterize the binding preferences of TFs from these experimental data, we developed DeepSELEX, a new algorithm to infer intrinsic DNA-binding preferences using deep neural networks. DeepSELEX takes advantage of the richness of high-throughput sequencing data and learns the DNA-binding preferences by observing the changes in DNA sequences through the experimental cycles. DeepSELEX outperforms extant methods for the task of DNA-binding inference from high-throughput SELEX data in binding prediction in vitro and is on par with the state of the art in in vivo binding prediction. Analysis of model parameters reveals it learns biologically relevant features that shed light on TFs' binding mechanism. AVAILABILITY AND IMPLEMENTATION: DeepSELEX is available through github.com/OrensteinLab/DeepSELEX/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
DNA , High-Throughput Nucleotide Sequencing , Binding Sites , DNA/genetics , DNA/metabolism , Protein Binding , Sequence Analysis, DNA
15.
Bioinformatics ; 36(11): 3357-3364, 2020 06 01.
Article in English | MEDLINE | ID: mdl-32176271

ABSTRACT

MOTIVATION: High-throughput protein screening is a critical technique for dissecting and designing protein function. Libraries for these assays can be created through a number of means, including targeted or random mutagenesis of a template protein sequence or direct DNA synthesis. However, mutagenic library construction methods often yield vastly more nonfunctional than functional variants and, despite advances in large-scale DNA synthesis, individual synthesis of each desired DNA template is often prohibitively expensive. Consequently, many protein-screening libraries rely on the use of degenerate codons (DCs), mixtures of DNA bases incorporated at specific positions during DNA synthesis, to generate highly diverse protein-variant pools from only a few low-cost synthesis reactions. However, selecting DCs for sets of sequences that covary at multiple positions dramatically increases the difficulty of designing a DC library and leads to the creation of many undesired variants that can quickly outstrip screening capacity. RESULTS: We introduce a novel algorithm for total DC library optimization, degenerate codon design (DeCoDe), based on integer linear programming. DeCoDe significantly outperforms state-of-the-art DC optimization algorithms and scales well to more than a hundred proteins sharing complex patterns of covariation (e.g. the lab-derived avGFP lineage). Moreover, DeCoDe is, to our knowledge, the first DC design algorithm with the capability to encode mixed-length protein libraries. We anticipate DeCoDe to be broadly useful for a variety of library generation problems, ranging from protein engineering attempts that leverage mutual information to the reconstruction of ancestral protein states. AVAILABILITY AND IMPLEMENTATION: github.com/OrensteinLab/DeCoDe. CONTACT: yaronore@bgu.ac.il. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
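
To illustrate what a degenerate codon encodes, the sketch below expands a three-letter codon written with IUPAC ambiguity codes into its concrete codons and the amino acids they translate to under the standard genetic code. It covers only this expansion step; the integer-linear-programming optimization that DeCoDe performs is not shown, and the function names are hypothetical.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered by
# first, second, then third base over "TCAG"; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_degenerate_codon(dc):
    """List every concrete codon covered by a 3-letter degenerate codon."""
    dc = dc.upper()
    if len(dc) != 3 or any(base not in IUPAC for base in dc):
        raise ValueError("expected three IUPAC nucleotide codes")
    return ["".join(c) for c in product(*(IUPAC[base] for base in dc))]


def encoded_amino_acids(dc):
    """Set of amino acids (and '*' for stop) a degenerate codon can yield."""
    return {CODON_TABLE[codon] for codon in expand_degenerate_codon(dc)}


if __name__ == "__main__":
    print(len(expand_degenerate_codon("NNK")), sorted(encoded_amino_acids("NNK")))
```

The familiar `NNK` codon, for instance, expands to 32 codons that cover all 20 amino acids plus one stop codon, which shows how a single degenerate position can introduce variants a library designer did not ask for.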


Subjects
Protein Engineering , Proteins , Algorithms , Amino Acid Sequence , Codon/genetics , Gene Library
16.
Proc Natl Acad Sci U S A ; 115(16): E3702-E3711, 2018 04 17.
Article in English | MEDLINE | ID: mdl-29588420

ABSTRACT

Transcription factors (TFs) are primary regulators of gene expression in cells, where they bind specific genomic target sites to control transcription. Quantitative measurements of TF-DNA binding energies can improve the accuracy of predictions of TF occupancy and downstream gene expression in vivo and shed light on how transcriptional networks are rewired throughout evolution. Here, we present a sequencing-based TF binding assay and analysis pipeline (BET-seq, for Binding Energy Topography by sequencing) capable of providing quantitative estimates of binding energies for more than one million DNA sequences in parallel at high energetic resolution. Using this platform, we measured the binding energies associated with all possible combinations of 10 nucleotides flanking the known consensus DNA target interacting with two model yeast TFs, Pho4 and Cbf1. A large fraction of these flanking mutations change overall binding energies by an amount equal to or greater than consensus site mutations, suggesting that current definitions of TF binding sites may be too restrictive. By systematically comparing estimates of binding energies output by deep neural networks (NNs) and biophysical models trained on these data, we establish that dinucleotide (DN) specificities are sufficient to explain essentially all variance in observed binding behavior, with Cbf1 binding exhibiting significantly more nonadditivity than Pho4. NN-derived binding energies agree with orthogonal biochemical measurements and reveal that dynamically occupied sites in vivo are both energetically and mutationally distant from the highest affinity sites.


Subjects
DNA/metabolism , High-Throughput Nucleotide Sequencing/methods , Transcription Factors/metabolism , Base Sequence , Basic Helix-Loop-Helix Leucine Zipper Transcription Factors/metabolism , Binding Sites , Computer Simulation , DNA-Binding Proteins/metabolism , E-Box Elements , Gene Library , Microfluidic Analytical Techniques , Monte Carlo Method , Protein Binding , Saccharomyces cerevisiae Proteins/metabolism , Sequence Analysis, DNA , Thermodynamics , Transcription, Genetic
17.
Bioinformatics ; 34(17): i638-i646, 2018 09 01.
Article in English | MEDLINE | ID: mdl-30423078

ABSTRACT

Motivation: The complexes formed by binding of proteins to RNAs play key roles in many biological processes, such as splicing, gene expression regulation, translation and viral replication. Understanding protein-RNA binding may thus provide important insights into the functionality and dynamics of many cellular processes. This has sparked substantial interest in exploring protein-RNA binding experimentally and predicting it computationally. The key computational challenge is to efficiently and accurately infer protein-RNA binding models that will enable prediction of novel protein-RNA interactions with additional transcripts of interest. Results: We developed DLPRB (Deep Learning for Protein-RNA Binding), a new deep neural network (DNN) approach for learning intrinsic protein-RNA binding preferences and predicting novel interactions. We present two different network architectures: a convolutional neural network (CNN) and a recurrent neural network (RNN). The novelty of our network hinges upon two key aspects: (i) the joint analysis of both RNA sequence and structure, which is represented as a probability vector of different RNA structural contexts; (ii) novel features in the architecture of the networks, such as the application of RNNs to RNA-binding prediction, and the combination of hundreds of variable-length filters in the CNN. Our results in inferring accurate RNA-binding models from high-throughput in vitro data exhibit substantial improvements compared to all previous approaches for protein-RNA binding prediction (both DNN and non-DNN based). A more modest, yet statistically significant, improvement is achieved for in vivo binding prediction. When incorporating experimentally measured RNA structure rather than predicted structure, the improvement on in vivo data increases. By visualizing the binding specificities, we can gain biological insights into the mechanism of protein-RNA binding.
Availability and implementation: The source code is publicly available at https://github.com/ilanbb/dlprb. Supplementary information: Supplementary data are available at Bioinformatics online.
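
The two ideas named in this abstract, a joint sequence-plus-structure input and convolutional filters of several lengths, can be illustrated with a minimal pure-Python sketch. This is not the DLPRB code: the number of structural contexts (`N_CONTEXTS = 5`), the function names, and the hand-written filter are assumptions chosen only for illustration.

```python
NUCLEOTIDES = "ACGU"
N_CONTEXTS = 5  # hypothetical number of structural contexts per position


def one_hot(nucleotide):
    """Four-element indicator vector for one RNA nucleotide."""
    return [1.0 if nucleotide == c else 0.0 for c in NUCLEOTIDES]


def joint_input(seq, struct_probs):
    """Build one feature row per position.

    Each row holds 4 one-hot sequence channels followed by the
    structural-context probabilities supplied for that position.
    """
    if len(struct_probs) != len(seq):
        raise ValueError("need one probability vector per nucleotide")
    return [one_hot(nt) + list(p) for nt, p in zip(seq, struct_probs)]


def conv_max_pool(x, filters):
    """Apply each filter at every offset, then ReLU and global max pooling.

    A filter is a list of rows (its width), each row holding one weight
    per input channel. Filters may have different widths.
    """
    features = []
    for f in filters:
        width = len(f)
        best = 0.0  # starting at 0 is equivalent to ReLU before the max
        for start in range(len(x) - width + 1):
            score = sum(
                x[start + j][c] * f[j][c]
                for j in range(width)
                for c in range(len(f[j]))
            )
            best = max(best, score)
        features.append(best)
    return features


# Uniform structure probabilities and a width-2 filter that responds to "CG".
uniform = [[1.0 / N_CONTEXTS] * N_CONTEXTS for _ in "ACGU"]
x = joint_input("ACGU", uniform)
cg_filter = [
    [0.0, 1.0, 0.0, 0.0] + [0.0] * N_CONTEXTS,
    [0.0, 0.0, 1.0, 0.0] + [0.0] * N_CONTEXTS,
]
pooled = conv_max_pool(x, [cg_filter])
```

In a trained network the filter weights would be learned and there would be many filters of varying width. The hand-written filter here only shows how a single pooled feature is computed.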


Subjects
Deep Learning , Computer Neural Networks , RNA-Binding Proteins/metabolism , RNA/metabolism , Software
18.
BMC Genomics ; 19(1): 154, 2018 02 20.
Article in English | MEDLINE | ID: mdl-29463232

ABSTRACT

BACKGROUND: RNA-binding proteins (RBPs) play vital roles in many processes in the cell. Different RBPs bind RNA with different sequence and structure specificities. While sequence specificities for a large set of 205 RBPs have been reported through the RNAcompete compendium, structure specificities are known for only a small fraction. The main limitation lies in the design of the RNAcompete technology, which tests RBP binding against unstructured RNA probes, making it difficult to infer structural preferences from these data. We recently developed RCK, an algorithm to infer sequence and structural binding models from RNAcompete data. The set of binding models enables, for the first time, a large-scale assessment of RNA structure in the RBPome. RESULTS: We re-validate and uncover the role of RNA structure in the RBPome through novel analysis of the largest-scale dataset to date. First, we show that RNA structure exists in presumably unstructured RNA probes and that its variability is correlated with RNA-binding. Second, we examine the structural binding preferences of RBPs and discover an overall preference to bind RNA loops. Third, we significantly improve protein-binding prediction using RNA structure, both in vitro and in vivo. Lastly, we demonstrate that RNA structural binding preferences can be inferred for new proteins solely from their amino acid content. CONCLUSIONS: By counter-intuitively demonstrating through our analysis that we can predict both the RNA structure of and RBP binding to these putatively unstructured RNAs, we transform a compendium of RNA-binding proteins into a valuable resource for structure-based binding models. We uncover the important role RNA structure plays in protein-RNA interaction for hundreds of RNA-binding proteins.


Subjects
Nucleic Acid Conformation , RNA-Binding Proteins/chemistry , RNA/chemistry , Amino Acid Motifs , Binding Sites , Theoretical Models , Nucleotide Motifs , Protein Binding , RNA/genetics , RNA-Binding Proteins/genetics , RNA-Binding Proteins/metabolism , Reproducibility of Results , Structure-Activity Relationship
19.
Bioinformatics ; 33(14): i110-i117, 2017 Jul 15.
Article in English | MEDLINE | ID: mdl-28881970

ABSTRACT

MOTIVATION: The minimizers scheme is a method for selecting k-mers from sequences. It is used in many bioinformatics software tools to bin comparable sequences or to sample a sequence in a deterministic fashion at approximately regular intervals, in order to reduce memory consumption and processing time. Although very useful, the minimizers selection procedure has undesirable behaviors (e.g. too many k-mers are selected when processing certain sequences). Some of these problems were already known to the authors of the minimizers technique, and the natural lexicographic ordering of k-mers used by minimizers was recognized as their origin. Many software tools using minimizers employ ad hoc variations of the lexicographic order to alleviate those issues. RESULTS: We provide an in-depth analysis of the effect of k-mer ordering on the performance of the minimizers technique. By using small universal hitting sets (a recently defined concept), we show how to significantly improve the performance of minimizers and avoid some of its worse behaviors. Based on these results, we encourage bioinformatics software developers to use an ordering based on a universal hitting set or, if not possible, a randomized ordering, rather than the lexicographic order. This analysis also settles negatively a conjecture (by Schleimer et al.) on the expected density of minimizers in a random sequence. AVAILABILITY AND IMPLEMENTATION: The software used for this analysis is available on GitHub: https://github.com/gmarcais/minimizers.git. CONTACT: gmarcais@cs.cmu.edu or carlk@cs.cmu.edu.
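
The selection procedure described in this abstract can be sketched in a few lines of Python. This is a minimal illustration, not the code from the linked repository: the function names, the leftmost tie-breaking rule, and the use of SHA-256 as a stand-in for a randomized ordering are all assumptions.

```python
import hashlib


def lexicographic_order(kmer):
    """Natural string order on k-mers."""
    return kmer


def hashed_order(kmer):
    """Pseudo-random order: rank k-mers by a hash (illustrative choice)."""
    return hashlib.sha256(kmer.encode()).hexdigest()


def minimizers(seq, k, w, order=lexicographic_order):
    """Return sorted (position, k-mer) pairs selected as minimizers.

    Every window of w consecutive k-mers contributes its smallest k-mer
    under `order`. Ties are broken by the leftmost position (assumption).
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(kmers) - w + 1):
        best = min(range(start, start + w), key=lambda i: (order(kmers[i]), i))
        selected.add(best)
    return [(i, kmers[i]) for i in sorted(selected)]


def density(seq, k, w, order=lexicographic_order):
    """Fraction of k-mer positions in seq that are selected."""
    return len(minimizers(seq, k, w, order)) / (len(seq) - k + 1)
```

Swapping `order` is the only change needed to compare orderings on the same sequence, for example `density(seq, 3, 3)` against `density(seq, 3, 3, hashed_order)`. An ordering derived from a universal hitting set would plug in the same way, by ranking k-mers in the set ahead of all others.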


Subjects
Human Genome , Genomics/methods , DNA Sequence Analysis/methods , Software , Algorithms , Humans
20.
Mol Syst Biol ; 13(2): 910, 2017 02 06.
Article in English | MEDLINE | ID: mdl-28167566

ABSTRACT

Transcription factors (TFs) achieve DNA-binding specificity through contacts with functional groups of bases (base readout) and readout of structural properties of the double helix (shape readout). Currently, it remains unclear whether DNA shape readout is utilized by only a few selected TF families, or whether this mechanism is used extensively by most TF families. We resequenced data from previously published HT-SELEX experiments, the most extensive mammalian TF-DNA binding data available to date. Using these data, we demonstrated the contributions of DNA shape readout across diverse TF families and its importance in core motif-flanking regions. Statistical machine-learning models combined with feature-selection techniques helped to reveal the nucleotide position-dependent DNA shape readout in TF-binding sites and the TF family-specific position dependence. Based on these results, we proposed novel DNA shape logos to visualize the DNA shape preferences of TFs. Overall, this work suggests a way of obtaining mechanistic insights into TF-DNA binding without relying on experimentally solved all-atom structures.


Subjects
DNA/chemistry , DNA Sequence Analysis/methods , Transcription Factors/metabolism , Animals , Binding Sites , DNA/genetics , DNA/metabolism , Genetic Databases , Humans , Machine Learning , Mammals/genetics , Mice , Nucleic Acid Conformation , Transcription Factors/genetics