Results 1 - 20 of 28
1.
Genome Res ; 30(7): 1040-1046, 2020 07.
Article in English | MEDLINE | ID: mdl-32660981

ABSTRACT

Transcription is tightly regulated by cis-regulatory DNA elements where transcription factors (TFs) can bind. Thus, identification of TF binding sites (TFBSs) is key to understanding gene expression and whole regulatory networks within a cell. The standard approaches used for TFBS prediction, such as position weight matrices (PWMs) and chromatin immunoprecipitation followed by sequencing (ChIP-seq), are widely used but have their drawbacks, including high false-positive rates and limited antibody availability, respectively. Several computational footprinting algorithms have been developed to detect TFBSs by investigating chromatin accessibility patterns; however, these also have limitations. We have developed a footprinting method to predict TF footprints in active chromatin elements (TRACE) to improve the prediction of TFBS footprints. TRACE incorporates DNase-seq data and PWMs within a multivariate hidden Markov model (HMM) to detect footprint-like regions with matching motifs. TRACE is an unsupervised method that accurately annotates binding sites for specific TFs automatically with no requirement for pregenerated candidate binding sites or ChIP-seq training data. Compared with published footprinting algorithms, TRACE has the best overall performance with the distinct advantage of targeting multiple motifs in a single model.


Subjects
Chromatin/metabolism , DNA Footprinting/methods , Sequence Analysis, DNA , Transcription Factors/metabolism , Binding Sites , Cell Line , Deoxyribonucleases , Humans , K562 Cells , Markov Chains , Nucleotide Motifs
2.
Bioinformatics ; 36(9): 2690-2696, 2020 05 01.
Article in English | MEDLINE | ID: mdl-31999322

ABSTRACT

MOTIVATION: Position-specific probability matrices (PPMs, also called position-specific weight matrices) have been the dominating model for transcription factor (TF)-binding motifs in DNA. There is, however, increasing recent evidence of better performance of higher order models such as Markov models of order one, also called adjacent dinucleotide matrices (ADMs). ADMs can model dependencies between adjacent nucleotides, unlike PPMs. A modeling technique and software tool that would estimate such models simultaneously both for monomers and their dimers have been missing. RESULTS: We present an ADM-based mixture model for monomeric and dimeric TF-binding motifs and an expectation maximization algorithm MODER2 for learning such models from training data and seeds. The model is a mixture that includes monomers and dimers, built from the monomers, with a description of the dimeric structure (spacing, orientation). The technique is modular, meaning that the co-operative effect of dimerization is made explicit by evaluating the difference between expected and observed models. The model is validated using HT-SELEX and generated datasets, and by comparing to some earlier PPM and ADM techniques. The ADM models explain data slightly better than PPM models for 314 tested TFs (or their DNA-binding domains) from four families (bHLH, bZIP, ETS and Homeodomain), the ADM mixture models by MODER2 being the best on average. AVAILABILITY AND IMPLEMENTATION: Software implementation is available from https://github.com/jttoivon/moder2. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
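
As a rough illustration of the distinction drawn in this abstract, the sketch below scores a sequence under a PPM (independent positions) and under a position-specific first-order Markov model of the kind an ADM represents. The matrices, names, and numbers are hypothetical toy values and are not taken from MODER2.

```python
import math

# Toy PPM for a motif of length 3; every number here is invented for illustration.
PPM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]


def ppm_log_prob(seq, ppm):
    """Log-probability of seq when every position is independent."""
    return sum(math.log(ppm[i][base]) for i, base in enumerate(seq))


def adm_log_prob(seq, initial, transitions):
    """Log-probability under a position-specific first-order Markov model.

    initial: distribution of the first base.
    transitions[i][prev][cur]: P(base at position i+1 = cur | base at position i = prev).
    """
    logp = math.log(initial[seq[0]])
    for i in range(1, len(seq)):
        logp += math.log(transitions[i - 1][seq[i - 1]][seq[i]])
    return logp


# A first-order model whose transitions ignore the previous base reduces to the PPM.
INDEPENDENT = [{prev: PPM[i + 1] for prev in "ACGT"} for i in range(len(PPM) - 1)]

# A dependent variant: an "A" at position 1 makes "C" at position 2 near-certain.
DEPENDENT = [dict(t) for t in INDEPENDENT]
DEPENDENT[0]["A"] = {"A": 0.02, "C": 0.94, "G": 0.02, "T": 0.02}
```

With `INDEPENDENT` the two scoring functions agree; with `DEPENDENT` the first-order model assigns "ACG" a higher probability than the PPM can, which is the kind of adjacent-nucleotide dependency a PPM cannot express.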


Subjects
Software , Transcription Factors , Algorithms , Binding Sites , Nucleotide Motifs , Position-Specific Scoring Matrices , Protein Binding , Transcription Factors/genetics
3.
BMC Genomics ; 21(1): 86, 2020 Jan 28.
Article in English | MEDLINE | ID: mdl-31992191

ABSTRACT

BACKGROUND: Branch points (BPs) map within short motifs upstream of acceptor splice sites (3'ss) and are essential for splicing of pre-mature mRNA. Several BP-dedicated bioinformatics tools, including HSF, SVM-BPfinder, BPP, Branchpointer, LaBranchoR and RNABPS were developed during the last decade. Here, we evaluated their capability to detect the position of BPs, and also to predict the impact on splicing of variants occurring upstream of 3'ss. RESULTS: We used a large set of constitutive and alternative human 3'ss collected from Ensembl (n = 264,787 3'ss) and from in-house RNAseq experiments (n = 51,986 3'ss). We also gathered an unprecedented collection of functional splicing data for 120 variants (62 unpublished) occurring in BP areas of disease-causing genes. Branchpointer showed the best performance to detect the relevant BPs upstream of constitutive and alternative 3'ss (99.48 and 65.84% accuracies, respectively). For variants occurring in a BP area, BPP emerged as having the best performance to predict effects on mRNA splicing, with an accuracy of 89.17%. CONCLUSIONS: Our investigations revealed that Branchpointer was optimal to detect BPs upstream of 3'ss, and that BPP was most relevant to predict splicing alteration due to variants in the BP area.


Subjects
Introns , RNA Precursors , RNA Splice Sites , RNA Splicing , Alternative Splicing , Computational Biology/methods , Humans , Nucleotide Motifs , Position-Specific Scoring Matrices , RNA Processing, Post-Transcriptional , ROC Curve , Reproducibility of Results
4.
J Genet ; 98, 2019 Sep.
Article in English | MEDLINE | ID: mdl-31544779

ABSTRACT

To detect the genetic variation and relationships among different Salvia ecotypes/species, the gene targeted CAAT box derived polymorphism (CBDP) markers were employed in terms of their efficiency. In this study, 25 CBDP primers amplified a total of 323 different polymorphic fragments that discriminate all 26 Salvia ecotypes/species and produced an informative and differentiated dendrogram and population structure. The CBDP markers were found to be effective in Salvia genetic diversity estimation with regard to the average polymorphism (100%), polymorphism information content (PIC = 0.89), marker index (MI = 4.5) and the effective multiplex ratio (EMR = 5.01), which were higher than other reported markers on Salvia. The extent of heterozygosity (0.034≤H≤0.223) and Shannon index (0.042≤I≤0.278) indicated a high level of genetic variation among Salvia species. The species containing the highest basic chromosome number (X = 12) revealed the highest values for the number of different (Na) and effective (Ne) alleles, Shannon index (I), and heterozygosity (H). Additionally, the tetraploid species showed high values of Na, Ne, I and H compared to the diploid species. Mean of gene differentiation (Gst) among Salvia species was 0.792, and the estimation of gene flow (Nm) was 0.13, indicating high genetic differentiation. Remarkably, similar results were obtained from the principal co-ordinate analysis (PCoA) as compared with the cluster analysis, in which all different Salvia species formed individual groups. In conclusion, because the CBDP markers are derived from the gene containing regions of the genome, the high genetic diversity among studied Salvia species would be more useful for crop improvement programmes, such as hybridization between species and QTL mapping. The potential of CBDPs for analysing the phylogeny and genetic diversity of Salvia species is another key result with practical implications.


Subjects
Polymorphism, Genetic , Salvia/genetics , Alleles , Cluster Analysis , DNA, Plant/chemistry , Gene Flow , Genetic Markers , Genetics, Population , Nucleotide Motifs , Phylogeny
5.
Nat Commun ; 10(1): 3552, 2019 08 07.
Article in English | MEDLINE | ID: mdl-31391532

ABSTRACT

CRISPR-Cas9 is widely used in genomic editing, but the kinetics of target search and its relation to the cellular concentration of Cas9 have remained elusive. Effective target search requires constant screening of the protospacer adjacent motif (PAM) and a 30 ms upper limit for screening was recently found. To further quantify the rapid switching between DNA-bound and freely-diffusing states of dCas9, we developed an open-microscopy framework, the miCube, and introduce Monte-Carlo diffusion distribution analysis (MC-DDA). Our analysis reveals that dCas9 is screening PAMs 40% of the time in Gram-positive Lactococcus lactis, averaging 17 ± 4 ms per binding event. Using heterogeneous dCas9 expression, we determine the number of cellular target-containing plasmids and derive the copy number dependent Cas9 cleavage. Furthermore, we show that dCas9 is not irreversibly bound to target sites but can still interfere with plasmid replication. Taken together, our quantitative data facilitate further optimization of the CRISPR-Cas toolbox.


Subjects
CRISPR-Associated Protein 9/metabolism , Gene Editing , Microscopy/methods , Plasmids/genetics , Single Molecule Imaging/methods , CRISPR-Associated Protein 9/genetics , Gene Dosage , Lactococcus lactis/genetics , Lactococcus lactis/metabolism , Luminescent Proteins/genetics , Luminescent Proteins/metabolism , Microscopy/instrumentation , Models, Genetic , Monte Carlo Method , Nucleotide Motifs/genetics , Recombinant Fusion Proteins/genetics , Recombinant Fusion Proteins/metabolism , Single Molecule Imaging/instrumentation , Time Factors , Red Fluorescent Protein
6.
Nucleic Acids Res ; 47(4): 1628-1636, 2019 02 28.
Article in English | MEDLINE | ID: mdl-30590725

ABSTRACT

Bound by transcription factors, DNA motifs (i.e. transcription factor binding sites) are prevalent and important for gene regulation in different tissues at different developmental stages of eukaryotes. Although considerable efforts have been made on elucidating monomeric DNA motif patterns, our knowledge of heterodimeric DNA motifs is still far from complete. Therefore, we propose to develop a computational approach to synthesize a heterodimeric DNA motif from two monomeric DNA motifs. The approach is sequentially divided into two components (Phases A and B). In Phase A, we propose to develop the inference models on how two DNA monomeric motifs can be oriented and overlapped with each other at nucleotide level. In Phase B, given the two monomeric DNA motifs oriented, we further propose to develop DNA-binding family-specific input-output hidden Markov models (IOHMMs) to synthesize a heterodimeric DNA motif. To validate the approach, we execute and cross-validate it with the experimentally verified 618 heterodimeric DNA motifs across 49 DNA-binding family combinations. We observe that our approach can even "rescue" the existing heterodimeric DNA motif pattern (i.e. HOXB2_EOMES) previously published in Nature. Lastly, we apply the proposed approach to infer previously uncharacterized heterodimeric motifs. Their motif instances are supported by DNase accessibility, gene ontology, protein-protein interactions, in vivo ChIP-seq peaks, and even structural data from PDB. A public web-server is built for open accessibility and scientific impact. Its address is listed as follows: http://motif.cs.cityu.edu.hk/custom/MotifKirin.


Subjects
Computational Biology , Genomics/methods , Nucleotide Motifs/genetics , Transcription Factors/genetics , Algorithms , Binding Sites/genetics , DNA Replication/genetics , Gene Expression Regulation, Developmental/genetics , Humans , Markov Chains , Regulatory Elements, Transcriptional/genetics , Sequence Analysis, DNA/methods , Software , Transcription Factors/chemistry
7.
Nucleic Acids Res ; 46(W1): W215-W220, 2018 07 02.
Article in English | MEDLINE | ID: mdl-29846656

ABSTRACT

The BaMM web server offers four tools: (i) de-novo discovery of enriched motifs in a set of nucleotide sequences, (ii) scanning a set of nucleotide sequences with motifs to find motif occurrences, (iii) searching with an input motif for similar motifs in our BaMM database with motifs for >1000 transcription factors, trained from the GTRD ChIP-seq database and (iv) browsing and keyword searching the motif database. In contrast to most other servers, we represent sequence motifs not by position weight matrices (PWMs) but by Bayesian Markov Models (BaMMs) of order 4, which we showed previously to perform substantially better in ROC analyses than PWMs or first order models. To address the inadequacy of P- and E-values as measures of motif quality, we introduce the AvRec score, the average recall over the TP-to-FP ratio between 1 and 100. The BaMM server is freely accessible without registration at https://bammmotif.mpibpc.mpg.de.


Subjects
Nucleotide Motifs , Regulatory Sequences, Nucleic Acid , Software , Animals , Bayes Theorem , Databases, Nucleic Acid , Humans , Internet , Markov Chains , Mice , Rats , Sequence Analysis , Transcription Factors/metabolism
8.
Genome Biol ; 18(1): 240, 2017 12 28.
Article in English | MEDLINE | ID: mdl-29284540

ABSTRACT

The iCLIP and eCLIP techniques facilitate the detection of protein-RNA interaction sites at high resolution, based on diagnostic events at crosslink sites. However, previous methods do not explicitly model the specifics of iCLIP and eCLIP truncation patterns and possible biases. We developed PureCLIP ( https://github.com/skrakau/PureCLIP ), a hidden Markov model based approach, which simultaneously performs peak-calling and individual crosslink site detection. It explicitly incorporates a non-specific background signal and, for the first time, non-specific sequence biases. On both simulated and real data, PureCLIP is more accurate in calling crosslink sites than other state-of-the-art methods and has a higher agreement across replicates.


Subjects
Binding Sites , Computational Biology/methods , RNA-Binding Proteins/metabolism , RNA/genetics , RNA/metabolism , Algorithms , High-Throughput Nucleotide Sequencing , Markov Chains , Nucleotide Motifs , Reproducibility of Results , Sequence Analysis, RNA , Software
9.
Nucleic Acids Res ; 44(13): 6055-69, 2016 07 27.
Article in English | MEDLINE | ID: mdl-27288444

ABSTRACT

Position weight matrices (PWMs) are the standard model for DNA and RNA regulatory motifs. In PWMs nucleotide probabilities are independent of nucleotides at other positions. Models that account for dependencies need many parameters and are prone to overfitting. We have developed a Bayesian approach for motif discovery using Markov models in which conditional probabilities of order k - 1 act as priors for those of order k. This Bayesian Markov model (BaMM) training automatically adapts model complexity to the amount of available data. We also derive an EM algorithm for de-novo discovery of enriched motifs. For transcription factor binding, BaMMs achieve significantly (P = 1/16) higher cross-validated partial AUC than PWMs in 97% of 446 ChIP-seq ENCODE datasets and improve performance by 36% on average. BaMMs also learn complex multipartite motifs, improving predictions of transcription start sites, polyadenylation sites, bacterial pause sites, and RNA binding sites by 26-101%. BaMMs never performed worse than PWMs. These robust improvements argue in favour of generally replacing PWMs by BaMMs.
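
The idea that order k - 1 probabilities act as priors for order k can be sketched with pseudocounts, as below. This is a simplified, homogeneous (not position-specific) illustration under assumed settings: the pseudocount strength `alpha`, the add-one smoothing at order 0, and all function names are hypothetical and do not reproduce the BaMM software.

```python
from collections import Counter

ALPHABET = "ACGT"


def count_kmers(seqs, max_order):
    """Count every k-mer of length 1 .. max_order + 1 in the training sequences."""
    counts = Counter()
    for s in seqs:
        for k in range(1, max_order + 2):
            for i in range(len(s) - k + 1):
                counts[s[i:i + k]] += 1
    return counts


def interpolated_prob(base, context, counts, alpha=2.0):
    """P(base | context), with the next-lower-order estimate acting as the prior.

    The order of the model equals len(context). When a context was rarely
    observed, the estimate falls back toward the lower-order probability.
    """
    if context == "":
        total = sum(counts[b] for b in ALPHABET)
        # Order 0: add-one smoothing toward a uniform distribution (an assumption).
        return (counts[base] + 1.0) / (total + len(ALPHABET))
    # Drop the most distant base of the context to get the lower-order estimate.
    lower = interpolated_prob(base, context[1:], counts, alpha)
    n_context = sum(counts[context + b] for b in ALPHABET)
    return (counts[context + base] + alpha * lower) / (n_context + alpha)


if __name__ == "__main__":
    counts = count_kmers(["ACGTACGT", "ACGAACGT"], max_order=2)
    for b in ALPHABET:
        print(b, round(interpolated_prob(b, "AC", counts), 3))
```

Because the pseudocounts are weighted by the lower-order probabilities, the estimates for any context still sum to one, and a context never seen in training simply inherits the lower-order distribution.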


Subjects
DNA-Binding Proteins/genetics , DNA/genetics , Nucleotide Motifs/genetics , Regulatory Sequences, Nucleic Acid/genetics , Algorithms , Bayes Theorem , Binding Sites , Computational Biology , Markov Chains , Position-Specific Scoring Matrices , Software
10.
BMC Bioinformatics ; 17(Suppl 19): 502, 2016 Dec 22.
Article in English | MEDLINE | ID: mdl-28155646

ABSTRACT

BACKGROUND: Topic models are statistical algorithms which try to discover the structure of a set of documents according to the abstract topics contained in them. Here we try to apply this approach to the discovery of the structure of the transcription factor binding sites (TFBS) contained in a set of biological sequences, which is a fundamental problem in molecular biology research for the understanding of transcriptional regulation. Here we present two methods that make use of topic models for motif finding. First, we developed an algorithm in which a set of biological sequences is treated as text documents, and the k-mers contained in them as words, to then build a correlated topic model (CTM) and iteratively reduce its perplexity. We also used the perplexity measurement of CTMs to improve our previous algorithm based on a genetic algorithm and several statistical coefficients. RESULTS: The algorithms were tested with 56 data sets from four different species and compared to 14 other methods by the use of several coefficients both at nucleotide and site level. The results of our first approach showed a performance comparable to the other methods studied, especially at site level and in sensitivity scores, in which it scored better than any of the 14 existing tools. In the case of our previous algorithm, the new approach with the addition of the perplexity measurement clearly outperformed all of the other methods in sensitivity, both at nucleotide and site level, and in overall performance at site level. CONCLUSIONS: The statistics obtained show that the performance of a motif finding method based on the use of a CTM is satisfying enough to conclude that the application of topic models is a valid method for developing motif finding algorithms. Moreover, the addition of topic models to a previously developed method dramatically increased its performance, suggesting that this combined algorithm can be a useful tool to successfully predict motifs in different kinds of sets of DNA sequences.


Subjects
Algorithms , Computational Biology/methods , Models, Theoretical , Nucleotide Motifs/genetics , Regulatory Sequences, Nucleic Acid/genetics , Sequence Analysis, DNA/methods , Transcription Factors/metabolism , Binding Sites , Humans , Monte Carlo Method , Protein Binding
11.
Nucleic Acids Res ; 43(15): 7504-20, 2015 Sep 03.
Article in English | MEDLINE | ID: mdl-26130723

ABSTRACT

Predicting RNA 3D structure from sequence is a major challenge in biophysics. An important sub-goal is accurately identifying recurrent 3D motifs from RNA internal and hairpin loop sequences extracted from secondary structure (2D) diagrams. We have developed and validated new probabilistic models for 3D motif sequences based on hybrid Stochastic Context-Free Grammars and Markov Random Fields (SCFG/MRF). The SCFG/MRF models are constructed using atomic-resolution RNA 3D structures. To parameterize each model, we use all instances of each motif found in the RNA 3D Motif Atlas and annotations of pairwise nucleotide interactions generated by the FR3D software. Isostericity relations between non-Watson-Crick basepairs are used in scoring sequence variants. SCFG techniques model nested pairs and insertions, while MRF ideas handle crossing interactions and base triples. We use test sets of randomly-generated sequences to set acceptance and rejection thresholds for each motif group and thus control the false positive rate. Validation was carried out by comparing results for four motif groups to RMDetect. The software developed for sequence scoring (JAR3D) is structured to automatically incorporate new motifs as they accumulate in the RNA 3D Motif Atlas when new structures are solved and is available free for download.


Subjects
Models, Statistical , RNA/chemistry , Sequence Analysis, RNA/methods , Base Sequence , Genetic Variation , Markov Chains , Nucleotide Motifs , Sequence Alignment , Software
12.
Methods Enzymol ; 553: 115-35, 2015.
Article in English | MEDLINE | ID: mdl-25726463

ABSTRACT

The modular organization of RNA structure has been exploited in various computational and theoretical approaches to identify RNA tertiary (3D) motifs and assemble RNA structures. Riboswitches exemplify this modularity in terms of both structural and functional adaptability of RNA components. Here, we extend our computational approach based on tree graph sampling to the prediction of riboswitch topologies by defining additional edges to mimic pseudoknots. Starting from a secondary (2D) structure, we construct an initial graph deduced from predicted junction topologies by our data-mining algorithm RNAJAG trained on known RNAs; we sample these graphs in 3D space guided by knowledge-based statistical potentials derived from bending and torsion measures of internal loops as well as radii of gyration for known RNAs. We present graph sampling results for 10 representative riboswitches, 6 of them with pseudoknots, and compare our predictions to solved structures based on global and local RMSD measures. Our results indicate that the helical arrangements in riboswitches can be approximated using our combination of modified 3D tree graph representations for pseudoknots, junction prediction, graph moves, and scoring functions. Future challenges in the field of riboswitch prediction and design are also discussed.


Subjects
Computational Biology/methods , Nucleic Acid Conformation , Riboswitch , Algorithms , Data Mining , Models, Molecular , Monte Carlo Method , Nucleotide Motifs
13.
Bioinformatics ; 31(10): 1561-8, 2015 May 15.
Article in English | MEDLINE | ID: mdl-25583120

ABSTRACT

MOTIVATION: The motif discovery problem consists of finding recurring patterns of short strings in a set of nucleotide sequences. This classical problem is receiving renewed attention as most early motif discovery methods lack the ability to handle large data of recent genome-wide ChIP studies. New ChIP-tailored methods focus on reducing computation time and pay little regard to the accuracy of motif detection. Unlike such methods, our method focuses on increasing the detection accuracy while maintaining the computation efficiency at an acceptable level. The major advantage of our method is that it can mine diverse multiple motifs undetectable by current methods. RESULTS: The repulsive parallel Markov chain Monte Carlo (RPMCMC) algorithm that we propose is a parallel version of the widely used Gibbs motif sampler. RPMCMC is run on parallel interacting motif samplers. A repulsive force is generated when different motifs produced by different samplers near each other. Thus, different samplers explore different motifs. In this way, we can detect much more diverse motifs than conventional methods can. Through application to 228 transcription factor ChIP-seq datasets of the ENCODE project, we show that the RPMCMC algorithm can find many reliable cofactor interacting motifs that existing methods are unable to discover.


Subjects
Algorithms , Nucleotide Motifs/genetics , Regulatory Elements, Transcriptional , Sequence Analysis, DNA/methods , Transcription Factors/metabolism , Chromatin Immunoprecipitation , Humans , Markov Chains , Monte Carlo Method , Promoter Regions, Genetic
14.
IEEE J Biomed Health Inform ; 19(2): 677-86, 2015 Mar.
Article in English | MEDLINE | ID: mdl-24833606

ABSTRACT

Finding conserved locations or motifs in genomic sequences is of paramount importance. Expectation maximization (EM)-based algorithms are widely employed to solve motif finding problems. The present study proposes a novel initialization technique and model-shifting scheme for Monte-Carlo-based EM methods for motif finding. Two popular EM-based motif finding algorithms are compared to the proposed method, which offers improved motif prediction accuracy on a synthetic dataset and a true biological dataset.


Subjects
DNA/analysis , DNA/chemistry , Nucleotide Motifs/genetics , Sequence Analysis, DNA/methods , Algorithms , Databases, Factual , Humans , Monte Carlo Method , Sequence Alignment
15.
BMC Genomics ; 15 Suppl 9: S15, 2014.
Article in English | MEDLINE | ID: mdl-25521044

ABSTRACT

BACKGROUND: G-quadruplexes are four-stranded structures formed in guanine-rich nucleotide sequences. Several functional roles of DNA G-quadruplexes have so far been investigated, where their putative functional roles during DNA replication and transcription have been suggested. A necessary condition for G-quadruplex formation is the presence of four regions of tandem guanines called G-runs and three nucleotide subsequences called loops that connect G-runs. A simple computational way to detect potential G-quadruplex regions in a given genomic sequence is pattern matching with regular expression. Although many putative G-quadruplex motifs can be found in most genomes by the regular expression-based approach, the majority of these sequences are unlikely to form G-quadruplexes because they are unstable as compared with canonical double helix structures. RESULTS: Here we present elaborate computational models for representing DNA G-quadruplex motifs using hidden Markov models (HMMs). Use of HMMs enables us to evaluate G-quadruplex motifs quantitatively by a probabilistic measure. In addition, the parameters of HMMs can be trained by using experimentally verified data. Computational experiments in discriminating between positive and negative G-quadruplex sequences as well as reducing putative G-quadruplexes in the human genome were carried out, indicating that HMM-based models can discern bona fide G-quadruplex structures well and one of them has the possibility of reducing false positive G-quadruplexes predicted by existing regular expression-based methods. Furthermore, our results show that one of our models can be specialized to detect G-quadruplex sequences whose functional roles are expected to be involved in DNA transcription. CONCLUSIONS: The HMM-based method along with the conventional pattern matching approach can contribute to reducing costly and laborious wet-lab experiments to perform functional analysis on a given set of potential G-quadruplexes of interest. The C++ and Perl programs are available at http://tcs.cira.kyoto-u.ac.jp/~ykato/program/g4hmm/.
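
The regular-expression baseline mentioned in this abstract (four G-runs joined by three loops) can be sketched as follows. The run length of at least three guanines and the loop length of 1 to 7 nucleotides are a commonly used convention that is assumed here; the abstract does not state the exact pattern, and the function name is hypothetical.

```python
import re

# Assumed convention: G-runs of >= 3 guanines, loops of 1-7 nucleotides.
# One leading G-run, then three (loop + G-run) units, gives four runs and three loops.
G4_PATTERN = re.compile(r"G{3,}(?:[ACGT]{1,7}G{3,}){3}")


def find_putative_g4(seq):
    """Return (start, end, matched_text) for non-overlapping candidate motifs.

    Only the given strand is scanned; a C-rich pattern would be needed to
    find candidates on the reverse strand.
    """
    return [(m.start(), m.end(), m.group()) for m in G4_PATTERN.finditer(seq.upper())]


if __name__ == "__main__":
    print(find_putative_g4("ttGGGAGGGTGGGAGGGtt"))
```

Such a pattern only reports whether a sequence fits the template; it gives no score, which is the limitation the probabilistic HMM approach in the abstract is meant to address.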


Subjects
G-Quadruplexes , Genomics/methods , Markov Chains , Nucleotide Motifs , Databases, Genetic , Humans
16.
Nucleic Acids Res ; 42(21): 12995-3011, 2014 Dec 01.
Article in English | MEDLINE | ID: mdl-25389269

ABSTRACT

We present a discriminative learning method for pattern discovery of binding sites in nucleic acid sequences based on hidden Markov models. Sets of positive and negative example sequences are mined for sequence motifs whose occurrence frequency varies between the sets. The method offers several objective functions, but we concentrate on mutual information of condition and motif occurrence. We perform a systematic comparison of our method and numerous published motif-finding tools. Our method achieves the highest motif discovery performance, while being faster than most published methods. We present case studies of data from various technologies, including ChIP-Seq, RIP-Chip and PAR-CLIP, of embryonic stem cell transcription factors and of RNA-binding proteins, demonstrating practicality and utility of the method. For the alternative splicing factor RBM10, our analysis finds motifs known to be splicing-relevant. The motif discovery method is implemented in the free software package Discrover. It is applicable to genome- and transcriptome-scale data, makes use of available repeat experiments and aside from binary contrasts also more complex data configurations can be utilized.
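
The objective named in this abstract, mutual information of condition and motif occurrence, can be illustrated on a 2 x 2 table of sequence counts. This is a minimal sketch assuming one binary label per sequence (positive or negative set) and one binary outcome (motif present or absent); the function name and arguments are hypothetical and this is not the Discrover implementation.

```python
import math


def mutual_information(pos_with, pos_total, neg_with, neg_total):
    """Mutual information in bits between condition and motif occurrence.

    pos_with / neg_with: number of positive / negative sequences containing the motif.
    pos_total / neg_total: sizes of the positive and negative sets.
    """
    table = [
        [pos_with, pos_total - pos_with],
        [neg_with, neg_total - neg_with],
    ]
    n = pos_total + neg_total
    row = [sum(r) for r in table]
    col = [table[0][j] + table[1][j] for j in range(2)]
    mi = 0.0
    for i in range(2):
        for j in range(2):
            if table[i][j] > 0:
                p_joint = table[i][j] / n
                # p_joint / (p_row * p_col) simplifies to count * n / (row * col).
                mi += p_joint * math.log2(table[i][j] * n / (row[i] * col[j]))
    return mi
```

A motif found in every positive and no negative sequence of two equally sized sets yields 1 bit, while a motif equally frequent in both sets yields 0 bits, so maximizing this quantity favors motifs whose occurrence frequency differs between the sets.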


Subjects
DNA-Binding Proteins/metabolism , RNA-Binding Proteins/metabolism , Sequence Analysis, DNA/methods , Sequence Analysis, RNA/methods , Animals , Binding Sites , Chromatin Immunoprecipitation , Embryonic Stem Cells/metabolism , Humans , Markov Chains , Mice , Nucleotide Motifs , Transcription Factors/metabolism
17.
PLoS One ; 9(1): e85629, 2014.
Article in English | MEDLINE | ID: mdl-24465627

ABSTRACT

The binding affinity of DNA-binding proteins such as transcription factors is mainly determined by the base composition of the corresponding binding site on the DNA strand. Most proteins do not bind only a single sequence, but rather a set of sequences, which may be modeled by a sequence motif. Algorithms for de novo motif discovery differ in their promoter models, learning approaches, and other aspects, but typically use the statistically simple position weight matrix model for the motif, which assumes statistical independence among all nucleotides. However, there is no clear justification for that assumption, leading to an ongoing debate about the importance of modeling dependencies between nucleotides within binding sites. In the past, modeling statistical dependencies within binding sites has been hampered by the problem of limited data. With the rise of high-throughput technologies such as ChIP-seq, this situation has now changed, making it possible to make use of statistical dependencies effectively. In this work, we investigate the presence of statistical dependencies in binding sites of the human enhancer-blocking insulator protein CTCF by using the recently developed model class of inhomogeneous parsimonious Markov models, which is capable of modeling complex dependencies while avoiding overfitting. These findings lead to a more detailed characterization of the CTCF binding motif, which is only poorly represented by independent nucleotide frequencies at several positions, predominantly at the 3' end.


Subjects
Algorithms , DNA-Binding Proteins/genetics , Models, Genetic , Nucleotide Motifs/genetics , Repressor Proteins/genetics , Base Sequence , Binding Sites/genetics , CCCTC-Binding Factor , Cell Line , Cells, Cultured , DNA-Binding Proteins/metabolism , HeLa Cells , Hep G2 Cells , Humans , K562 Cells , MCF-7 Cells , Markov Chains , Protein Binding , Repressor Proteins/metabolism
18.
J Math Biol ; 69(1): 147-82, 2014 Jul.
Article in English | MEDLINE | ID: mdl-23739838

ABSTRACT

Sojourn-times provide a versatile framework to assess the statistical significance of motifs in genome-wide searches even under non-Markovian background models. However, the large state spaces encountered in genomic sequence analyses make the exact calculation of sojourn-time distributions computationally intractable in long sequences. Here, we use coupling and analytic combinatoric techniques to approximate these distributions in the general setting of Polish state spaces, which encompass discrete state spaces. Our approximations are accompanied with explicit, easy to compute, error bounds for total variation distance. Broadly speaking, if T_n is the random number of times a Markov chain visits a certain subset T of states in its first n transitions, then we can usually approximate the distribution of T_n for n of order (1 − α)^(−m), where m is the largest integer for which the exact distribution of T_m is accessible and 0 ≤ α ≤ 1 is an ergodicity coefficient associated with the probability transition kernel of the chain. This gives access to approximations of sojourn-times in the intermediate regime where n is perhaps too large for exact calculations, but too small to rely on Normal approximations or stationarity assumptions underlying Poisson and compound Poisson approximations. As proof of concept, we approximate the distribution of the number of matches with a motif in promoter regions of C.
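
For a small finite chain, the exact distribution of the sojourn time described in this abstract can be computed by dynamic programming over (state, visit count) pairs, as in the sketch below. It assumes that visits are counted at X_1, ..., X_n and that the initial state X_0 is not counted; the function and parameter names are hypothetical. The cost grows with the square of both n and the number of states, which hints at why exact calculation becomes impractical for long sequences and large state spaces.

```python
def sojourn_distribution(P, init, target, n):
    """Exact distribution of the number of visits to `target` in n transitions.

    P: transition matrix as a list of rows, P[s][t] = P(next = t | current = s).
    init: initial distribution over states (for X_0, which is not counted).
    target: set of state indices forming the subset of interest.
    Returns a list d where d[c] = P(exactly c of X_1..X_n lie in target).
    """
    m = len(P)
    # dist[s][c] = P(current state is s and c visits have been made so far).
    dist = [[0.0] * (n + 1) for _ in range(m)]
    for s in range(m):
        dist[s][0] = init[s]
    for _ in range(n):
        new = [[0.0] * (n + 1) for _ in range(m)]
        for s in range(m):
            for c in range(n + 1):
                p = dist[s][c]
                if p == 0.0:
                    continue
                for t in range(m):
                    if P[s][t] == 0.0:
                        continue
                    # Entering a target state adds one visit.
                    c_next = c + 1 if t in target else c
                    new[t][c_next] += p * P[s][t]
        dist = new
    return [sum(dist[s][c] for s in range(m)) for c in range(n + 1)]


if __name__ == "__main__":
    P = [[0.5, 0.5], [0.5, 0.5]]
    print(sojourn_distribution(P, [1.0, 0.0], {1}, 2))
```

In the two-state example every transition enters the target state with probability 0.5 regardless of the current state, so the visit count follows a binomial distribution, which gives a simple sanity check.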


Subjects
Base Sequence/genetics , Markov Chains , Models, Statistical , Nucleotide Motifs/genetics , Animals , Caenorhabditis elegans/genetics , Promoter Regions, Genetic
19.
BMC Bioinformatics ; 14 Suppl 9: S4, 2013.
Article in English | MEDLINE | ID: mdl-23902564

ABSTRACT

BACKGROUND: Motif discovery is the problem of finding recurring patterns in biological data. Patterns can be sequential, mainly when discovered in DNA sequences. They can also be structural (e.g. when discovering RNA motifs). Finding common structural patterns helps to gain a better understanding of the mechanism of action (e.g. post-transcriptional regulation). Unlike DNA motifs, which are sequentially conserved, RNA motifs exhibit conservation in structure, which may be common even if the sequences are different. Over the past few years, hundreds of algorithms have been developed to solve the sequential motif discovery problem, while less work has been done for the structural case. METHODS: In this paper, we survey, classify, and compare different algorithms that solve the structural motif discovery problem, where the underlying sequences may be different. We highlight their strengths and weaknesses. We start by proposing a benchmark dataset and a measurement tool that can be used to evaluate different motif discovery approaches. Then, we proceed by proposing our experimental setup. Finally, results are obtained using the proposed benchmark to compare available tools. To the best of our knowledge, this is the first attempt to compare tools solely designed for structural motif discovery. RESULTS: Results show that the accuracy of discovered motifs is relatively low. The results also suggest a complementary behavior among tools where some tools perform well on simple structures, while other tools are better for complex structures. CONCLUSIONS: We have classified and evaluated the performance of available structural motif discovery tools. In addition, we have proposed a benchmark dataset with tools that can be used to evaluate newly developed tools.


Subjects
Algorithms , Computational Biology/methods , Nucleotide Motifs , Sequence Analysis, RNA/methods , Conserved Sequence , Models, Statistical
20.
Nucleic Acids Res ; 41(16): e153, 2013 Sep.
Article in English | MEDLINE | ID: mdl-23814189

ABSTRACT

Protein-binding microarray (PBM) is a high-throughput platform that can measure the DNA-binding preference of a protein in a comprehensive and unbiased manner. A typical PBM experiment can measure binding signal intensities of a protein to all the possible DNA k-mers (k = 8-10); such comprehensive binding affinity data usually need to be reduced and represented as motif models before they can be further analyzed and applied. Since proteins can often bind to DNA in multiple modes, one of the major challenges is to decompose the comprehensive affinity data into multimodal motif representations. Here, we describe a new algorithm that uses Hidden Markov Models (HMMs) and can derive precise and multimodal motifs using belief propagations. We describe an HMM-based approach using belief propagations (kmerHMM), which accepts and preprocesses PBM probe raw data into median-binding intensities of individual k-mers. The k-mers are ranked and aligned for training an HMM as the underlying motif representation. Multiple motifs are then extracted from the HMM using belief propagations. Comparisons of kmerHMM with other leading methods on several data sets demonstrated its effectiveness and uniqueness. In particular, it achieved the best performance on more than half of the data sets. In addition, the multiple binding modes derived by kmerHMM are biologically meaningful and will be useful in interpreting other genome-wide data such as those generated from ChIP-seq. The executables and source codes are available at the authors' websites: e.g. http://www.cs.toronto.edu/~wkc/kmerHMM.


Subjects
DNA-Binding Proteins/metabolism , DNA/chemistry , Protein Array Analysis , Sequence Analysis, DNA/methods , Transcription Factors/metabolism , Algorithms , Animals , Binding Sites , DNA/metabolism , Markov Chains , Mice , Nucleotide Motifs