Results 1 - 20 of 38
1.
Nucleic Acids Res ; 51(18): e95, 2023 Oct 13.
Article in English | MEDLINE | ID: mdl-37650641

ABSTRACT

Several studies have suggested that transcription factor (TF) binding to DNA may be impaired or enhanced by DNA methylation. We present MeDeMo, a toolbox for TF motif analysis that combines information about DNA methylation with models capturing intra-motif dependencies. In a large-scale study using ChIP-seq data for 335 TFs, we identify novel TFs that show a binding behaviour associated with DNA methylation. Overall, we find that the presence of CpG methylation decreases the likelihood of binding for the majority of methylation-associated TFs. For a considerable subset of TFs, we show that intra-motif dependencies are pivotal for accurately modelling the impact of DNA methylation on TF binding. We illustrate that the novel methylation-aware TF binding models allow the prediction of differential ChIP-seq peaks and improve the genome-wide analysis of TF binding. Our work indicates that simplistic models that neglect the effect of DNA methylation on DNA binding may lead to systematic underperformance for methylation-associated TFs.

2.
Nucleic Acids Res ; 50(4): 2387-2400, 2022 02 28.
Article in English | MEDLINE | ID: mdl-35150566

ABSTRACT

Transcription activator-like effectors (TALEs) are bacterial proteins with a programmable DNA-binding domain, which turned them into exceptional tools for biotechnology. TALEs contain a central array of consecutive 34 amino acid long repeats to bind DNA in a simple one-repeat-to-one-nucleotide manner. However, a few naturally occurring aberrant repeat variants break this strict binding mechanism, allowing for the recognition of an additional sequence with a -1 nucleotide frameshift. The limits and implications of this extended TALE binding mode are largely unexplored. Here, we analyse the complete diversity of natural and artificially engineered aberrant repeats for their impact on the DNA binding of TALEs. Surprisingly, TALEs with several aberrant repeats can loop out multiple repeats simultaneously without losing DNA-binding capacity. We also characterized members of the only natural TALE class harbouring two aberrant repeats and confirmed that their target is the major virulence factor OsSWEET13 from rice. In an aberrant TALE repeat, the position and nature of the amino acid sequence strongly influence its function. We explored the tolerance of TALE repeats towards alterations further and demonstrate that inserts as large as GFP can be tolerated without disrupting DNA binding. This illustrates the extraordinary DNA-binding capacity of TALEs and opens new uses in biotechnology.


Subjects
DNA , Transcription Activator-Like Effectors , DNA/chemistry , Nucleotides , Transcription Activator-Like Effectors/chemistry , Transcriptional Activation , Virulence/genetics
3.
BMC Genomics ; 24(1): 151, 2023 Mar 27.
Article in English | MEDLINE | ID: mdl-36973643

ABSTRACT

BACKGROUND: Most plant-pathogenic Xanthomonas bacteria harbor transcription activator-like effector (TALE) genes, which function as transcriptional activators of host plant genes and support infection. The entire repertoire of up to 29 TALE genes of a Xanthomonas strain is also referred to as TALome. The DNA-binding domain of TALEs is comprised of highly conserved repeats and TALE genes often occur in gene clusters, which precludes the assembly of TALE-carrying Xanthomonas genomes based on standard sequencing approaches. RESULTS: Here, we report the successful assembly of the 5 Mbp genomes of five Xanthomonas strains from Oxford Nanopore Technologies (ONT) sequencing data. For one of these strains, Xanthomonas oryzae pv. oryzae (Xoo) PXO35, we illustrate why Illumina short reads and longer PacBio reads are insufficient to fully resolve the genome. While ONT reads are perfectly suited to yield highly contiguous genomes, they suffer from a specific error profile within homopolymers. To still yield complete and correct TALomes from ONT assemblies, we present a computational correction pipeline specifically tailored to TALE genes, which yields at least comparable accuracy as Illumina-based polishing. We further systematically assess the ONT-based pipeline for its multiplexing capacity and find that, combined with computational correction, the complete TALome of Xoo PXO35 could have been reconstructed from less than 20,000 ONT reads. CONCLUSIONS: Our results indicate that multiplexed ONT sequencing combined with a computational correction of TALE genes constitutes a highly capable tool for characterizing the TALomes of huge collections of Xanthomonas strains in the future.


Subjects
Nanopore Sequencing , Xanthomonas , Transcription Activator-Like Effectors/genetics , Xanthomonas/genetics , Genome
4.
BMC Genomics ; 22(1): 914, 2021 Dec 29.
Article in English | MEDLINE | ID: mdl-34965853

ABSTRACT

BACKGROUND: The yield of many crop plants can be substantially reduced by plant-pathogenic Xanthomonas bacteria. The infection strategy of many Xanthomonas strains is based on transcription activator-like effectors (TALEs), which are secreted into the host cells and act as transcriptional activators of plant genes that are beneficial for the bacteria. The modular DNA binding domain of TALEs contains tandem repeats, each comprising two hyper-variable amino acids. These repeat-variable diresidues (RVDs) bind to their target box and determine the specificity of a TALE. All available tools for the prediction of TALE targets within the host plant suffer from many false positives. In this paper we propose a strategy to improve prediction accuracy by considering the epigenetic state of the host plant genome in the region of the target box. RESULTS: To this end, we extend our previously published tool PrediTALE by considering two epigenetic features: (i) chromatin accessibility of potentially bound regions and (ii) DNA methylation of cytosines within target boxes. Here, we determine the epigenetic features from publicly available DNase-seq, ATAC-seq, and WGBS data in rice. We benchmark the utility of both epigenetic features separately and in combination, deriving ground-truth from RNA-seq data of infection studies in rice. We find an improvement for each individual epigenetic feature, but especially for the combination of both. Having established an advantage in TALE target prediction when considering epigenetic features, we use these data for promoterome and genome-wide scans by our new tool EpiTALE, leading to several novel putative virulence targets. CONCLUSIONS: Our results suggest that it would be worthwhile to collect condition-specific chromatin accessibility data and methylation information when studying putative virulence targets of Xanthomonas TALEs.


Subjects
Plant Diseases , Xanthomonas , Bacterial Proteins/genetics , Epigenesis, Genetic , Plant Diseases/genetics , Transcription Activator-Like Effectors/genetics , Xanthomonas/genetics , Xanthomonas/metabolism
5.
Bioinformatics ; 35(22): 4812-4814, 2019 11 01.
Article in English | MEDLINE | ID: mdl-31225867

ABSTRACT

SUMMARY: Statistical dependencies are present in a variety of sequence data, but are not discernible from traditional sequence logos. Here, we present the R package DepLogo for visualizing inter-position dependencies in aligned sequence data as dependency logos. Dependency logos make dependency structures, which correspond to regular co-occurrences of symbols at dependent positions, visually perceptible. To this end, sequences are partitioned based on their symbols at highly dependent positions as measured by mutual information, and each partition obtains its own visual representation. We illustrate the utility of the DepLogo package in several use cases generating dependency logos from DNA, RNA and protein sequences. AVAILABILITY AND IMPLEMENTATION: The DepLogo R package is available from CRAN and its source code is available at https://github.com/Jstacs/DepLogo. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Software , DNA , Position-Specific Scoring Matrices , Sequence Analysis, DNA
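
The DepLogo abstract above describes partitioning aligned sequences by their symbols at highly dependent positions, with dependence measured by mutual information. The following is a minimal Python sketch of that idea, not the DepLogo implementation (which is an R package); the function names and the toy alignment are illustrative assumptions.

```python
from collections import Counter
from math import log2


def mutual_information(seqs, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(seqs)
    pair_counts = Counter((s[i], s[j]) for s in seqs)
    i_counts = Counter(s[i] for s in seqs)
    j_counts = Counter(s[j] for s in seqs)
    mi = 0.0
    for (a, b), c in pair_counts.items():
        p_ab = c / n
        # p_ab / (p_a * p_b), written with raw counts
        mi += p_ab * log2(p_ab * n * n / (i_counts[a] * j_counts[b]))
    return mi


def partition_by_position(seqs, pos):
    """Group sequences by the symbol observed at one (hypothetically dependent) position."""
    groups = {}
    for s in seqs:
        groups.setdefault(s[pos], []).append(s)
    return groups


# Toy alignment (hypothetical): columns 0 and 1 co-vary, column 3 is independent of column 0.
seqs = ["ACGT", "ACGA", "TGGT", "TGGA"]
print(mutual_information(seqs, 0, 1))
print(mutual_information(seqs, 0, 3))
print(partition_by_position(seqs, 0))
```

In this toy alignment the co-varying column pair carries one bit of mutual information while the independent pair carries none, so a partition on column 0 separates the two co-occurrence patterns.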
6.
PLoS Comput Biol ; 15(7): e1007206, 2019 07.
Article in English | MEDLINE | ID: mdl-31295249

ABSTRACT

Plant-pathogenic Xanthomonas bacteria secrete transcription activator-like effectors (TALEs) into host cells, where they act as transcriptional activators on plant target genes to support bacterial virulence. TALEs have a unique modular DNA-binding domain composed of tandem repeats. Two amino acids within each tandem repeat, termed repeat-variable diresidues, bind to contiguous nucleotides on the DNA sequence and determine target specificity. In this paper, we propose a novel approach for TALE target prediction to identify potential virulence targets. Our approach accounts for recent findings concerning TALE targeting, including frame-shift binding by repeats of aberrant lengths, and the flexible strand orientation of target boxes relative to the transcription start of the downstream target gene. The computational model can account for dependencies between adjacent RVD positions. Model parameters are learned from the wealth of quantitative data that have been generated over the last years. We benchmark the novel approach, termed PrediTALE, using RNA-seq data after Xanthomonas infection in rice, and find an overall improvement of prediction performance compared with previous approaches. Using PrediTALE, we are able to predict several novel putative virulence targets. However, we also observe that no target genes are predicted by any prediction tool for several TALEs, which we term orphan TALEs for this reason. We postulate that one explanation for orphan TALEs are incomplete gene annotations and, hence, propose to replace promoterome-wide by genome-wide scans for target boxes. We demonstrate that known targets from promoterome-wide scans may be recovered by genome-wide scans, whereas the latter, combined with RNA-seq data, are able to detect putative targets independent of existing gene annotations.


Subjects
Models, Biological , Oryza/microbiology , Plant Diseases/microbiology , Transcription Activator-Like Effectors/physiology , Xanthomonas/pathogenicity , Computational Biology , Genes, Plant , Genome, Plant , Host Microbial Interactions/genetics , Host Microbial Interactions/physiology , Oryza/genetics , Plant Diseases/genetics , Tandem Repeat Sequences , Transcription Activator-Like Effectors/genetics , Transcription Initiation Site , Virulence/genetics , Virulence/physiology , Xanthomonas/genetics , Xanthomonas/physiology
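
The abstract above states that the repeat-variable diresidues of consecutive repeats bind contiguous nucleotides and determine target specificity. As a rough illustration only, the sketch below scans a sequence for windows compatible with a list of RVDs. It is not PrediTALE: the preference table contains only the commonly cited simplified RVD-to-base associations (an assumption for illustration), it ignores frame-shift binding, strand orientation and dependencies between adjacent RVDs, and all names and sequences are hypothetical.

```python
# Simplified, commonly cited RVD preferences (assumption; real specificities are quantitative).
RVD_PREFERENCE = {"HD": "C", "NI": "A", "NG": "T", "NN": "GA"}


def scan(rvds, sequence, max_mismatches=1):
    """Return (start, window, mismatches) for every window with few enough mismatches."""
    hits = []
    k = len(rvds)
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        # An unknown RVD is treated as accepting any base.
        mismatches = sum(
            base not in RVD_PREFERENCE.get(rvd, "ACGT")
            for rvd, base in zip(rvds, window)
        )
        if mismatches <= max_mismatches:
            hits.append((start, window, mismatches))
    return hits


# Hypothetical four-repeat effector scanned against a hypothetical promoter fragment.
rvds = ["HD", "NI", "NG", "NN"]
promoter = "GGCATGTT"
print(scan(rvds, promoter, max_mismatches=0))
```

A learned model would replace the 0/1 mismatch count with position-specific binding scores; the windowed scan itself stays the same.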
7.
BMC Bioinformatics ; 19(1): 189, 2018 05 30.
Article in English | MEDLINE | ID: mdl-29843602

ABSTRACT

BACKGROUND: Genome annotation is of key importance in many research questions. The identification of protein-coding genes is often based on transcriptome sequencing data, ab-initio or homology-based prediction. Recently, it was demonstrated that intron position conservation improves homology-based gene prediction, and that experimental data improves ab-initio gene prediction. RESULTS: Here, we present an extension of the gene prediction program GeMoMa that utilizes amino acid sequence conservation, intron position conservation and optionally RNA-seq data for homology-based gene prediction. We show on published benchmark data for plants, animals and fungi that GeMoMa performs better than the gene prediction programs BRAKER1, MAKER2, and CodingQuarry, and purely RNA-seq-based pipelines for transcript identification. In addition, we demonstrate that using multiple reference organisms may help to further improve the performance of GeMoMa. Finally, we apply GeMoMa to four nematode species and to the recently published barley reference genome indicating that current annotations of protein-coding genes may be refined using GeMoMa predictions. CONCLUSIONS: GeMoMa might be of great utility for annotating newly sequenced genomes but also for finding homologs of a specific gene or gene family. GeMoMa has been published under GNU GPL3 and is freely available at http://www.jstacs.de/index.php/GeMoMa .


Subjects
Gene Expression Profiling , Genes, Fungal , Genes, Plant , Sequence Analysis, RNA , Sequence Homology, Amino Acid , Software , Animals , Genomics , Hordeum/genetics , Introns , Molecular Sequence Annotation , Nematoda/genetics
8.
Bioinformatics ; 33(4): 580-582, 2017 02 15.
Article in English | MEDLINE | ID: mdl-28035026

ABSTRACT

Summary: Recent studies have shown that the traditional position weight matrix model is often insufficient for modeling transcription factor binding sites, as intra-motif dependencies play a significant role for an accurate description of binding motifs. Here, we present the Java application InMoDe, a collection of tools for learning, leveraging and visualizing such dependencies of putative higher order. The distinguishing feature of InMoDe is a robust model selection from a class of parsimonious models, taking into account dependencies only if justified by the data while choosing for simplicity otherwise. Availability and Implementation: InMoDe is implemented in Java and is available as command line application, as application with a graphical user-interface, and as an integration into Galaxy on the project website at http://www.jstacs.de/index.php/InMoDe . Contact: ralf.eggeling@cs.helsinki.fi.


Subjects
Computational Biology/methods , DNA/metabolism , Promoter Regions, Genetic , Software , Transcription Factors/metabolism , Animals , Binding Sites/genetics , Chromatin Immunoprecipitation , Humans , Machine Learning , Sequence Analysis, DNA/methods
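
The InMoDe abstract above contrasts dependency models with the traditional position weight matrix (PWM). For orientation, here is a minimal Python sketch of a PWM as log-odds scores against a uniform background. It is a generic textbook construction and not InMoDe code; the pseudocount, the uniform background and the example sites are assumptions.

```python
from math import log2


def build_pwm(sites, pseudocount=1.0, alphabet="ACGT"):
    """Estimate per-position log-odds scores from aligned binding sites."""
    background = 1.0 / len(alphabet)  # uniform background (assumption)
    pwm = []
    for pos in range(len(sites[0])):
        counts = {a: pseudocount for a in alphabet}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({a: log2((counts[a] / total) / background) for a in alphabet})
    return pwm


def score(pwm, seq):
    """Sum of independent per-position scores; this additivity is the PWM assumption."""
    return sum(column[base] for column, base in zip(pwm, seq))


# Hypothetical aligned sites.
sites = ["TATA", "TATA", "TACA"]
pwm = build_pwm(sites)
print(score(pwm, "TATA"), score(pwm, "GGGG"))
```

Because the score is a sum over positions, a PWM cannot express that one base at a position is favoured only in combination with a particular base elsewhere, which is the limitation that intra-motif dependency models address.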
9.
Nucleic Acids Res ; 44(9): e89, 2016 05 19.
Article in English | MEDLINE | ID: mdl-26893356

ABSTRACT

Annotation of protein-coding genes is very important in bioinformatics and biology and has a decisive influence on many downstream analyses. Homology-based gene prediction programs allow for transferring knowledge about protein-coding genes from an annotated organism to an organism of interest. Here, we present a homology-based gene prediction program called GeMoMa. GeMoMa utilizes the conservation of intron positions within genes to predict related genes in other organisms. We assess the performance of GeMoMa and compare it with state-of-the-art competitors on plant and animal genomes using an extended best reciprocal hit approach. We find that GeMoMa often makes more precise predictions than its competitors, yielding a substantially increased number of correct transcripts. Subsequently, we exemplarily validate GeMoMa predictions using Sanger sequencing. Finally, we use RNA-seq data to compare the predictions of homology-based gene prediction programs, and find again that GeMoMa performs well. Hence, we conclude that exploiting intron position conservation improves homology-based gene prediction, and we make GeMoMa freely available as a command-line tool and Galaxy integration.


Subjects
Computational Biology/methods , Models, Genetic , Molecular Sequence Annotation/methods , RNA, Messenger/genetics , Sequence Analysis, RNA/methods , Algorithms , Animals , Arabidopsis/genetics , Base Sequence , Carica/genetics , Chickens/genetics , High-Throughput Nucleotide Sequencing , Humans , Introns/genetics , Mice , Oryza/genetics , Polymerase Chain Reaction , Sequence Homology, Nucleic Acid , Solanum tuberosum/genetics , Nicotiana/genetics
10.
J Exp Bot ; 68(3): 539-552, 2017 01 01.
Article in English | MEDLINE | ID: mdl-28007950

ABSTRACT

Auxin is an essential regulator of plant growth and development, and auxin signaling components are conserved among land plants. Yet, a remarkable degree of natural variation in physiological and transcriptional auxin responses has been described among Arabidopsis thaliana accessions. As intraspecies comparisons offer only limited genetic variation, we here inspect the variation of auxin responses between A. thaliana and A. lyrata. This approach allowed the identification of conserved auxin response genes including novel genes with potential relevance for auxin biology. Furthermore, promoter divergences were analyzed for putative sources of variation. De novo motif discovery identified novel and variants of known elements with potential relevance for auxin responses, emphasizing the complex, and yet elusive, code of element combinations accounting for the diversity in transcriptional auxin responses. Furthermore, network analysis revealed correlations of interspecies differences in the expression of AUX/IAA gene clusters and classic auxin-related genes. We conclude that variation in general transcriptional and physiological auxin responses may originate substantially from functional or transcriptional variations in the TIR1/AFB, AUX/IAA, and ARF signaling network. In that respect, AUX/IAA gene expression divergence potentially reflects differences in the manner in which different species transduce identical auxin signals into gene expression responses.


Subjects
Arabidopsis Proteins/genetics , Arabidopsis/genetics , Gene Expression Regulation, Plant , Indoleacetic Acids/metabolism , Plant Growth Regulators/metabolism , Arabidopsis/metabolism , Arabidopsis Proteins/metabolism , Gene Expression Profiling , Signal Transduction
11.
Nucleic Acids Res ; 43(18): e119, 2015 Oct 15.
Article in English | MEDLINE | ID: mdl-26116565

ABSTRACT

Binding of transcription factors to DNA is one of the keystones of gene regulation. The existence of statistical dependencies between binding site positions is widely accepted, while their relevance for computational predictions has been debated. Building probabilistic models of binding sites that may capture dependencies is still challenging, since the most successful motif discovery approaches require numerical optimization techniques, which are not suited for selecting dependency structures. To overcome this issue, we propose sparse local inhomogeneous mixture (Slim) models that combine putative dependency structures in a weighted manner allowing for numerical optimization of dependency structure and model parameters simultaneously. We find that Slim models yield a substantially better prediction performance than previous models on genomic context protein binding microarray data sets and on ChIP-seq data sets. To elucidate the reasons for the improved performance, we develop dependency logos, which allow for visual inspection of dependency structures within binding sites. We find that the dependency structures discovered by Slim models are highly diverse and highly transcription factor-specific, which emphasizes the need for flexible dependency models. The observed dependency structures range from broad heterogeneities to sparse dependencies between neighboring and non-neighboring binding site positions.


Subjects
Models, Statistical , Regulatory Elements, Transcriptional , Sequence Analysis, DNA/methods , Transcription Factors/metabolism , Binding Sites , DNA/chemistry , DNA/metabolism , Humans , Nucleotide Motifs
12.
Bioinformatics ; 31(15): 2595-7, 2015 Aug 01.
Article in English | MEDLINE | ID: mdl-25810428

ABSTRACT

Precision-recall (PR) and receiver operating characteristic (ROC) curves are valuable measures of classifier performance. Here, we present the R-package PRROC, which allows for computing and visualizing both PR and ROC curves. In contrast to available R-packages, PRROC allows for computing PR and ROC curves and areas under these curves for soft-labeled data using a continuous interpolation between the points of PR curves. In addition, PRROC provides a generic plot function for generating publication-quality graphics of PR and ROC curves.


Subjects
Computer Graphics , Data Interpretation, Statistical , Mathematical Computing , ROC Curve , Software , Area Under Curve , Humans , Pattern Recognition, Automated , User-Computer Interface
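
The PRROC abstract above concerns precision-recall and ROC curves. As a small orientation aid, the Python sketch below computes the area under the ROC curve and the raw PR curve points for hard-labeled data. It does not reproduce PRROC, which is an R package and additionally handles soft-labeled data and continuous interpolation between PR points; the function names and example data are assumptions.

```python
def roc_auc(scores, labels):
    """AUC-ROC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def pr_points(scores, labels):
    """(recall, precision) pairs obtained by lowering the threshold one distinct score at a time."""
    ranked = sorted(zip(scores, labels), reverse=True)
    total_pos = sum(labels)
    tp = fp = 0
    points = []
    for idx, (s, l) in enumerate(ranked):
        tp += l
        fp += 1 - l
        # Emit a point only once all items sharing this score have been counted.
        if idx + 1 == len(ranked) or ranked[idx + 1][0] != s:
            points.append((tp / total_pos, tp / (tp + fp)))
    return points


# Hypothetical classifier scores with 0/1 labels.
scores = [0.9, 0.8, 0.7, 0.6]
labels = [1, 0, 1, 0]
print(roc_auc(scores, labels))
print(pr_points(scores, labels))
```

The sketch returns only the discrete PR points; how to connect them is exactly where linear interpolation is known to be misleading for PR curves, which is the gap the continuous interpolation mentioned in the abstract is meant to close.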
13.
BMC Bioinformatics ; 16: 387, 2015 Nov 17.
Article in English | MEDLINE | ID: mdl-26577052

ABSTRACT

BACKGROUND: For three decades, sequence logos have been the de facto standard for the visualization of sequence motifs in biology and bioinformatics. Reasons for this success story are their simplicity and clarity. The number of inferred and published motifs grows with the number of data sets and motif extraction algorithms. Hence, it becomes more and more important to perceive differences between motifs. However, motif differences are hard to detect from individual sequence logos in the case of multiple motifs for one transcription factor, highly similar binding motifs of different transcription factors, or multiple motifs for one protein domain. RESULTS: Here, we present DiffLogo, a freely available, extensible, and user-friendly R package for visualizing motif differences. DiffLogo is capable of showing differences between DNA motifs as well as protein motifs in a pair-wise manner, resulting in publication-ready figures. In the case of more than two motifs, DiffLogo is capable of visualizing pair-wise differences in a tabular form. Here, the motifs are ordered by similarity, and the difference logos are colored for clarity. We demonstrate the benefit of DiffLogo on CTCF motifs from different human cell lines, on E-box motifs of three basic helix-loop-helix transcription factors as examples for the comparison of DNA motifs, and on F-box domains from three different families as an example for the comparison of protein motifs. CONCLUSIONS: DiffLogo provides an intuitive visualization of motif differences. It enables the illustration and investigation of differences between highly similar motifs such as binding patterns of transcription factors for different cell types, treatments, and algorithmic approaches.


Subjects
Algorithms , Amino Acid Motifs/genetics , Basic Helix-Loop-Helix Transcription Factors/genetics , Computer Graphics , Nucleotide Motifs/genetics , Sequence Analysis, DNA/methods , Software , CCCTC-Binding Factor , Computational Biology/methods , Humans , Protein Structure, Tertiary , Repressor Proteins/genetics , Tumor Cells, Cultured
14.
Nucleic Acids Res ; 41(21): e197, 2013 Nov.
Article in English | MEDLINE | ID: mdl-24057214

ABSTRACT

De novo motif discovery has been an important challenge of bioinformatics for the past two decades. Since the emergence of high-throughput techniques like ChIP-seq, ChIP-exo and protein-binding microarrays (PBMs), the focus of de novo motif discovery has shifted to runtime and accuracy on large data sets. For this purpose, specialized algorithms have been designed for discovering motifs in ChIP-seq or PBM data. However, none of the existing approaches work perfectly for all three high-throughput techniques. In this article, we propose Dimont, a general approach for fast and accurate de novo motif discovery from high-throughput data. We demonstrate that Dimont yields a higher number of correct motifs from ChIP-seq data than any of the specialized approaches and achieves a higher accuracy for predicting PBM intensities from probe sequence than any of the approaches specifically designed for that purpose. Dimont also reports the expected motifs for several ChIP-exo data sets. Investigating differences between in vitro and in vivo binding, we find that for most transcription factors, the motifs discovered by Dimont are in good accordance between techniques, but we also find notable exceptions. We also observe that modeling intra-motif dependencies may increase accuracy, which indicates that more complex motif models are a worthwhile field of research.


Subjects
Chromatin Immunoprecipitation/methods , DNA/chemistry , High-Throughput Nucleotide Sequencing/methods , Protein Array Analysis/methods , Sequence Analysis, DNA/methods , Humans , Nucleotide Motifs , Software
15.
Bioinformatics ; 29(22): 2931-2, 2013 Nov 15.
Article in English | MEDLINE | ID: mdl-23995255

ABSTRACT

SUMMARY: Transcription activator-like effector nucleases (TALENs) have become an accepted tool for targeted mutagenesis, but undesired off-targets remain an important issue. We present TALENoffer, a novel tool for the genome-wide prediction of TALEN off-targets. We show that TALENoffer successfully predicts known off-targets of engineered TALENs and yields a competitive runtime, scanning complete mammalian genomes within a few minutes. AVAILABILITY: TALENoffer is available as a command line program from http://www.jstacs.de/index.php/TALENoffer and as a Galaxy server at http://galaxy.informatik.uni-halle.de. CONTACT: grau@informatik.uni-halle.de


Subjects
Endodeoxyribonucleases/metabolism , Software , Animals , DNA-Binding Proteins/metabolism , Genome , Models, Statistical , Mutagenesis , Protein Engineering
16.
PLoS Comput Biol ; 9(3): e1002962, 2013.
Article in English | MEDLINE | ID: mdl-23526890

ABSTRACT

Transcription activator-like (TAL) effectors are injected into host plant cells by Xanthomonas bacteria to function as transcriptional activators for the benefit of the pathogen. The DNA binding domain of TAL effectors is composed of conserved amino acid repeat structures containing repeat-variable diresidues (RVDs) that determine DNA binding specificity. In this paper, we present TALgetter, a new approach for predicting TAL effector target sites based on a statistical model. In contrast to previous approaches, the parameters of TALgetter are estimated from training data computationally. We demonstrate that TALgetter successfully predicts known TAL effector target sites and often yields a greater number of predictions that are consistent with up-regulation in gene expression microarrays than an existing approach, Target Finder of the TALE-NT suite. We study the binding specificities estimated by TALgetter and confirm that different RVDs differ in their importance for transcriptional activation. In subsequent studies, the predictions of TALgetter indicate a previously unreported positional preference of TAL effector target sites relative to the transcription start site. In addition, several TAL effectors are predicted to bind to the TATA-box, which might constitute one general mode of transcriptional activation by TAL effectors. Scrutinizing the predicted target sites of TALgetter, we propose several novel TAL effector virulence targets in rice and sweet orange. TAL-mediated induction of the candidates is supported by gene expression microarrays. Validity of these targets is also supported by functional analogy to known TAL effector targets, by an over-representation of TAL effector targets with similar function, or by a biological function related to pathogen infection. Hence, these predicted TAL effector virulence targets are promising candidates for studying the virulence function of TAL effectors. 
TALgetter is implemented as part of the open-source Java library Jstacs, and is freely available as a web-application and a command line program.


Subjects
Bacterial Proteins/chemistry , DNA-Binding Proteins/chemistry , Gene Expression Regulation, Plant , Transcription Factors/chemistry , Amino Acid Sequence , Bacterial Proteins/genetics , Bacterial Proteins/metabolism , Computational Biology/methods , DNA-Binding Proteins/genetics , DNA-Binding Proteins/metabolism , Gene Expression Profiling , Oligonucleotide Array Sequence Analysis , Plant Diseases/genetics , Plant Diseases/microbiology , Protein Binding , Reproducibility of Results , Transcription Factors/genetics , Transcription Factors/metabolism , Xanthomonas/genetics , Xanthomonas/pathogenicity
17.
PLoS Comput Biol ; 7(2): e1001070, 2011 Feb 10.
Article in English | MEDLINE | ID: mdl-21347314

ABSTRACT

Transcription factors are a main component of gene regulation as they activate or repress gene expression by binding to specific binding sites in promoters. The de-novo discovery of transcription factor binding sites in target regions obtained by wet-lab experiments is a challenging problem in computational biology, which has not been fully solved yet. Here, we present a de-novo motif discovery tool called Dispom for finding differentially abundant transcription factor binding sites that models existing positional preferences of binding sites and adjusts the length of the motif in the learning process. Evaluating Dispom, we find that its prediction performance is superior to existing tools for de-novo motif discovery for 18 benchmark data sets with planted binding sites, and for a metazoan compendium based on experimental data from micro-array, ChIP-chip, ChIP-DSL, and DamID as well as Gene Ontology data. Finally, we apply Dispom to find binding sites differentially abundant in promoters of auxin-responsive genes extracted from Arabidopsis thaliana microarray data, and we find a motif that can be interpreted as a refined auxin responsive element predominately positioned in the 250-bp region upstream of the transcription start site. Using an independent data set of auxin-responsive genes, we find in genome-wide predictions that the refined motif is more specific for auxin-responsive genes than the canonical auxin-responsive element. In general, Dispom can be used to find differentially abundant motifs in sequences of any origin. However, the positional distribution learned by Dispom is especially beneficial if all sequences are aligned to some anchor point like the transcription start site in case of promoter sequences. We demonstrate that the combination of searching for differentially abundant motifs and inferring a position distribution from the data is beneficial for de-novo motif discovery. 
Hence, we make the tool freely available as a component of the open-source Java framework Jstacs and as a stand-alone application at http://www.jstacs.de/index.php/Dispom.


Subjects
Transcription Factors/metabolism , Animals , Arabidopsis/drug effects , Arabidopsis/genetics , Arabidopsis/metabolism , Arabidopsis Proteins/genetics , Arabidopsis Proteins/metabolism , Binding Sites/genetics , Computational Biology , DNA, Plant/genetics , DNA, Plant/metabolism , Databases, Genetic , Genes, Plant/drug effects , Humans , Indoleacetic Acids/pharmacology , Models, Genetic , Models, Statistical , Promoter Regions, Genetic
18.
Microbiol Spectr ; 10(2): e0012122, 2022 04 27.
Article in English | MEDLINE | ID: mdl-35311568

ABSTRACT

The genome of the metal-resistant, hydrogen-oxidizing bacterium Cupriavidus metallidurans contains a large number of horizontally acquired plasmids and genomic islands that were integrated into its chromosome or chromid. For the C. metallidurans CH34 wild-type strain growing under nonchallenging conditions, 5,763 transcriptional starting sequences (TSSs) were determined. Using a custom-built motif discovery software based on hidden Markov models, patterns upstream of the TSSs were identified. The pattern TTGACA, -35.6 ± 1.6 bp upstream of the TSSs, in combination with a TATAAT sequence 15.8 ± 1.4 bp upstream occurred frequently, especially upstream of the TSSs for 48 housekeeping genes, and these were assigned to promoters used by RNA polymerase containing the main housekeeping sigma factor RpoD. From patterns upstream of the housekeeping genes, a score for RpoD-dependent promoters in C. metallidurans was derived and applied to all 5,763 TSSs. Among these, 2,572 TSSs could be associated with RpoD with high probability, 373 with low probability, and 2,818 with no probability. In a detailed analysis of horizontally acquired genes involved in metal resistance and not involved in this process, the TSSs responsible for the expression of these genes under nonchallenging conditions were assigned to RpoD- or non-RpoD-dependent promoters. RpoD-dependent promoters occurred frequently in horizontally acquired metal resistance and other determinants, which should allow their initial expression in a new host. However, other sigma factors and sense/antisense effects also contribute, perhaps to mold the assimilated gene into the regulatory network of the cell in subsequent adaptation steps. IMPORTANCE: In their natural environment, bacteria are constantly acquiring genes by horizontal gene transfer. To be of any benefit, these genes should be expressed. 
We show here that the main housekeeping sigma factor RpoD plays an important role in the expression of horizontally acquired genes in the metal-resistant hydrogen-oxidizing bacterium C. metallidurans. By conservation of the RpoD recognition consensus sequence, a newly arriving gene has a high probability to be expressed in the new host cell. In addition to integrons and genes travelling together with that for their sigma factor, conservation of the RpoD consensus sequence may be an important contributor to the overall evolutionary success of horizontal gene transfer in bacteria. Using C. metallidurans as an example, this publication sheds some light on the fate and function of horizontally acquired genes in bacteria.
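
The hexamer search described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' software (which the abstract says is based on hidden Markov models) and it does not reproduce their RpoD score; it only looks for the best approximate match to the two consensus hexamers named above, TTGACA and TATAAT, in the sequence upstream of a TSS. The function names, the `max_mismatches` threshold, and the convention of reporting the hexamer start relative to the TSS are all assumptions made for illustration.

```python
# Illustrative sketch only: a plain mismatch scan, not the HMM-based method
# or the RpoD score from the abstract. All names here are hypothetical.
CONSENSUS = {"-35": "TTGACA", "-10": "TATAAT"}


def mismatches(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def scan_upstream(upstream, max_mismatches=1):
    """Find the best match to each consensus hexamer upstream of a TSS.

    `upstream` is the sequence immediately 5' of the TSS, so its last base
    sits at position -1. Reported positions are hexamer start coordinates
    relative to the TSS (an assumed convention).
    """
    upstream = upstream.upper()
    n = len(upstream)
    hits = {}
    for name, motif in CONSENSUS.items():
        k = len(motif)
        candidates = [
            (mismatches(upstream[i:i + k], motif), i - n)
            for i in range(n - k + 1)
        ]
        if not candidates:
            continue  # upstream region shorter than the hexamer
        best_mm, best_pos = min(candidates)
        if best_mm <= max_mismatches:
            hits[name] = {"position": best_pos, "mismatches": best_mm}
    return hits


# Toy upstream region (invented, not real C. metallidurans sequence).
example = "CCGG" + "TTGACA" + "C" * 17 + "TATAAT" + "GCGCGCG"
print(scan_upstream(example))
```

A position-specific score, as the abstract describes, would additionally weight each hit by how well its distance from the TSS and the spacing between the two hexamers fit the observed distributions.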


Subjects
Cupriavidus , Sigma Factor , Bacterial Proteins/genetics , Bacterial Proteins/metabolism , Cupriavidus/genetics , Cupriavidus/metabolism , Hydrogen/metabolism , Metals/metabolism , Sigma Factor/metabolism
19.
BMC Bioinformatics ; 11: 149, 2010 Mar 22.
Article in English | MEDLINE | ID: mdl-20307305

ABSTRACT

BACKGROUND: One of the challenges of bioinformatics remains the recognition of short signal sequences in genomic DNA such as donor or acceptor splice sites, splicing enhancers or silencers, translation initiation sites, transcription start sites, transcription factor binding sites, nucleosome binding sites, miRNA binding sites, or insulator binding sites. During the last decade, a wealth of algorithms for the recognition of such DNA sequences has been developed and compared with the goal of improving their performance and deepening our understanding of the underlying cellular processes. Most of these algorithms are based on statistical models belonging to the family of Markov random fields such as position weight matrix models, weight array matrix models, Markov models of higher order, or moral Bayesian networks. While in many comparative studies different learning principles or different statistical models have been compared, the influence of choosing different prior distributions for the model parameters when using different learning principles has been overlooked, possibly leading to questionable conclusions. RESULTS: With the goal of allowing direct comparisons of different learning principles for models from the family of Markov random fields based on the same a-priori information, we derive a generalization of the commonly used product-Dirichlet prior. We find that the derived prior behaves like a Gaussian prior close to the maximum and like a Laplace prior in the far tails. In two case studies, we illustrate the utility of the derived prior for a direct comparison of different learning principles with different models for the recognition of binding sites of the transcription factor Sp1 and human donor splice sites. CONCLUSIONS: We find that comparisons of different learning principles using the same a-priori information can lead to conclusions different from those of previous studies in which the effect resulting from different priors has been neglected. The derived prior is implemented in the open-source library Jstacs to enable an easy application to comparative studies of different learning principles in the field of sequence analysis.


Subjects
DNA Sequence Analysis/methods , Bayes Theorem , Binding Sites , DNA/chemistry , Markov Chains , Automated Pattern Recognition/methods , Sequence Alignment/methods
20.
BMC Bioinformatics ; 11: 98, 2010 Feb 22.
Article in English | MEDLINE | ID: mdl-20175896

ABSTRACT

BACKGROUND: The recognition of functional binding sites in genomic DNA remains one of the fundamental challenges of genome research. During the last decades, a plethora of different and well-adapted models has been developed, but only little attention has been paid to the development of different and similarly well-adapted learning principles. Only recently was it noticed that discriminative learning principles can be superior to generative ones in diverse bioinformatics applications, too. RESULTS: Here, we propose a generalization of generative and discriminative learning principles containing the maximum likelihood, maximum a posteriori, maximum conditional likelihood, maximum supervised posterior, generative-discriminative trade-off, and penalized generative-discriminative trade-off learning principles as special cases, and we illustrate its efficacy for the recognition of vertebrate transcription factor binding sites. CONCLUSIONS: We find that the proposed learning principle helps to improve the recognition of transcription factor binding sites, enabling better computational approaches for extracting as much information as possible from valuable wet-lab data. We make all implementations available in the open-source library Jstacs so that this learning principle can be easily applied to other classification problems in the field of genome and epigenome analysis.
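
One way to see how the six learning principles named above can be special cases of a single objective is a weighted sum of three terms: the log conditional likelihood, the log likelihood, and a log prior. The abstract does not give a formula, so the notation below (parameters λ, sequences x, class labels c, prior Q with hyperparameters α, weights β) is an assumption made for illustration and may differ from the paper's.

```latex
% Illustrative notation only; symbols and weights are assumptions, not quoted from the paper.
\hat{\lambda} = \operatorname*{argmax}_{\lambda}
  \Big[ \beta_0 \log P(c \mid x, \lambda)
      + \beta_1 \log P(c, x \mid \lambda)
      + \beta_2 \log Q(\lambda \mid \alpha) \Big],
\qquad \beta_0, \beta_1, \beta_2 \ge 0, \quad \beta_0 + \beta_1 + \beta_2 = 1 .

% Special cases under this parameterization:
%   maximum likelihood:               (\beta_0, \beta_1, \beta_2) = (0, 1, 0)
%   maximum conditional likelihood:   (1, 0, 0)
%   maximum a posteriori:             (0, 1/2, 1/2)
%   maximum supervised posterior:     (1/2, 0, 1/2)
%   generative-discriminative trade-off:            \beta_2 = 0
%   penalized generative-discriminative trade-off:  all weights free
```

Under this reading, equal weights on a likelihood term and the prior give the same maximizer as the unweighted sum, which is why the maximum a posteriori and maximum supervised posterior cases use weights of 1/2.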


Subjects
Information Storage and Retrieval/methods , Algorithms , DNA/chemistry , DNA/metabolism , Discriminant Analysis , Genome , Genomics , Likelihood Functions , Automated Pattern Recognition