Results 1 - 7 of 7
1.
Nucleic Acids Res ; 45(8): e65, 2017 05 05.
Article in English | MEDLINE | ID: mdl-28082394

ABSTRACT

Our current knowledge of eukaryotic promoters indicates that they have a complex architecture, often composed of numerous functional motifs. Most known promoters include multiple, and in some cases mutually exclusive, transcription start sites (TSSs). Moreover, TSS selection depends on cell or tissue type, developmental stage and environmental conditions. Such complex promoter structures make their computational identification notoriously difficult. Here, we present TSSPlant, a novel tool that predicts both TATA and TATA-less promoters in sequences of a wide spectrum of plant genomes. The tool was developed by using large promoter collections from ppdb and PlantProm DB. It uses eighteen significant compositional and signal features of plant promoter sequences selected in this study, which feed an artificial neural network-based model trained by the backpropagation algorithm. TSSPlant achieves significantly higher accuracy than the next best promoter prediction program for both TATA promoters (MCC≃0.84 and F1-score≃0.91 versus MCC≃0.51 and F1-score≃0.71) and TATA-less promoters (MCC≃0.80 and F1-score≃0.89 versus MCC≃0.29 and F1-score≃0.50). TSSPlant is available to download as a standalone program at http://www.cbrc.kaust.edu.sa/download/.


Subject(s)
Genome, Plant , Neural Networks, Computer , Plant Proteins/genetics , Promoter Regions, Genetic , RNA Polymerase II/genetics , Transcription Initiation Site , Arabidopsis/genetics , Arabidopsis/metabolism , Gene Expression , Oryza/genetics , Oryza/metabolism , Plant Proteins/metabolism , RNA Polymerase II/metabolism , Sequence Analysis, DNA , Software
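The abstract above compares predictors by MCC (Matthews correlation coefficient) and F1-score. As a reading aid, the following minimal sketch shows how these two measures are conventionally computed from confusion-matrix counts. The function names and the example counts are hypothetical and are not taken from the TSSPlant paper or its software.

```python
import math


def mcc(tp: int, fp: int, tn: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Conventional fallback when any marginal total is zero.
        return 0.0
    return (tp * tn - fp * fn) / denom


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1-score: harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    # Hypothetical counts for illustration only (not data from the paper).
    tp, fp, tn, fn = 45, 5, 45, 5
    print(f"MCC = {mcc(tp, fp, tn, fn):.2f}")
    print(f"F1  = {f1_score(tp, fp, fn):.2f}")
```

Unlike F1, MCC also uses the true-negative count, which is one reason the two are often reported together for promoter classifiers.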
2.
Bioinformatics ; 31(21): 3544-5, 2015 Nov 01.
Article in English | MEDLINE | ID: mdl-26142184

ABSTRACT

UNLABELLED: Gene transcription is mostly conducted through interactions of various transcription factors and their binding sites on DNA (regulatory elements, REs). Today, we are still far from understanding the real regulatory content of promoter regions. Computer methods for identification of REs remain a widely used tool for studying and understanding transcriptional regulation mechanisms. The Nsite, NsiteH and NsiteM programs perform searches for statistically significant (non-random) motifs of known human, animal and plant one-box and composite REs in a single genomic sequence, in a pair of aligned homologous sequences and in a set of functionally related sequences, respectively. AVAILABILITY AND IMPLEMENTATION: Pre-compiled executables built under commonly used operating systems are available for download by visiting http://www.molquest.kaust.edu.sa and http://www.softberry.com. CONTACT: solovictor@gmail.com SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Promoter Regions, Genetic , Software , Animals , Binding Sites , Genomics , Humans , Nucleotide Motifs , Plants/genetics , Regulatory Sequences, Nucleic Acid , Sequence Analysis, DNA , Transcription Factors/metabolism
3.
BMC Genomics ; 11: 646, 2010 Nov 19.
Article in English | MEDLINE | ID: mdl-21092114

ABSTRACT

BACKGROUND: mRNA polyadenylation is an essential step of pre-mRNA processing in eukaryotes. Accurate prediction of the pre-mRNA 3'-end cleavage/polyadenylation sites is important for defining the gene boundaries and understanding gene expression mechanisms. RESULTS: 28761 human mapped poly(A) sites have been classified into three classes containing different known forms of polyadenylation signal (PAS) or none of them (PAS-strong, PAS-weak and PAS-less, respectively), and a new computer program, POLYAR, was developed for the prediction of poly(A) sites of each class. In comparison with polya_svm (to date the most accurate computer program for prediction of poly(A) sites), POLYAR had a significantly higher prediction sensitivity (80.8% versus 65.7%) and specificity (66.4% versus 51.7%) when searching for PAS-strong poly(A) sites in human sequences. However, when a similar search was conducted for PAS-weak and PAS-less poly(A) sites, both programs had a very low prediction accuracy, which indicates that our knowledge about factors involved in the determination of the poly(A) sites is not sufficient to identify such polyadenylation regions. CONCLUSIONS: We present a new classification of polyadenylation sites into three classes and a novel computer program, POLYAR, for prediction of poly(A) sites/regions of each class. In tests, POLYAR shows high accuracy in predicting PAS-strong poly(A) sites; its efficiency in searching for PAS-weak and PAS-less poly(A) sites is not very high but is comparable to that of other available programs. These findings suggest that additional characteristics of such poly(A) sites remain to be elucidated. The POLYAR program, with a stand-alone version for downloading, is available at http://cub.comsats.edu.pk/polyapredict.htm.


Subject(s)
Computational Biology/methods , Poly A/genetics , Software , 5' Untranslated Regions/genetics , Base Sequence , Humans , Introns/genetics , Polyadenylation/genetics
4.
Nucleic Acids Res ; 31(1): 114-7, 2003 Jan 01.
Article in English | MEDLINE | ID: mdl-12519961

ABSTRACT

PlantProm DB, a plant promoter database, is an annotated, non-redundant collection of proximal promoter sequences for RNA polymerase II with experimentally determined transcription start site(s), TSS, from various plant species. The first release (2002.01) of PlantProm DB contains 305 entries including 71, 220 and 14 promoters from monocot, dicot and other plants, respectively. It provides DNA sequence of the promoter regions (-200 : +51) with TSS on the fixed position +201, taxonomic/promoter type classification of promoters and Nucleotide Frequency Matrices (NFM) for promoter elements: TATA-box, CCAAT-box and TSS-motif (Inr). Analysis of TSS-motifs revealed that their composition is different in dicots and monocots, as well as for TATA and TATA-less promoters. The database serves as a learning set in developing plant promoter prediction programs. One such program (TSSP) based on discriminant analysis has been created by Softberry Inc. and the application of a support vector machine approach for promoter identification is under development. PlantProm DB is available at http://mendel.cs.rhul.ac.uk/ and http://www.softberry.com/.


Subject(s)
Databases, Nucleic Acid , Genes, Plant , Promoter Regions, Genetic , RNA Polymerase II/genetics , Response Elements , Sequence Analysis, DNA
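The abstract above mentions Nucleotide Frequency Matrices (NFM) for promoter elements such as the TATA-box. The sketch below illustrates the general idea of a per-position frequency matrix built from aligned, equal-length motif instances. It is an assumption-laden illustration: the function name and the toy sequences are hypothetical, and it does not reproduce how PlantProm DB actually derives its matrices.

```python
from collections import Counter


def nucleotide_frequency_matrix(seqs, alphabet="ACGT"):
    """Return {base: [frequency at each aligned position]}.

    Assumes the sequences are already aligned and of equal length.
    Characters outside `alphabet` (e.g. N) are ignored in the totals.
    """
    if not seqs:
        raise ValueError("at least one sequence is required")
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")

    matrix = {base: [] for base in alphabet}
    for column in zip(*seqs):  # iterate over aligned positions
        counts = Counter(ch.upper() for ch in column)
        total = sum(counts[base] for base in alphabet)
        for base in alphabet:
            matrix[base].append(counts[base] / total if total else 0.0)
    return matrix


if __name__ == "__main__":
    # Toy TATA-like instances, invented for illustration only.
    motifs = ["TATAAA", "TATATA", "TATAAA", "CATAAA"]
    nfm = nucleotide_frequency_matrix(motifs)
    for base, freqs in nfm.items():
        print(base, " ".join(f"{f:.2f}" for f in freqs))
```

Each column of the resulting matrix sums to 1, so a candidate site can be scored by combining the frequencies of its bases position by position.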
5.
Methods Mol Biol ; 674: 57-83, 2010.
Article in English | MEDLINE | ID: mdl-20827586

ABSTRACT

Promoter sequences are the main regulatory elements of gene expression. Their recognition by computer algorithms is fundamental for understanding gene expression patterns, cell specificity and development. This chapter describes the advanced approaches to identify promoters in animal, plant and bacterial sequences. Also, we discuss an approach to identify statistically significant regulatory motifs in genomic sequences.


Subject(s)
Computational Biology/methods , Gene Expression Regulation/genetics , Promoter Regions, Genetic/genetics , Algorithms , Animals , Bacteria/genetics , Base Sequence , DNA/genetics , DNA/metabolism , Humans , Mice , Molecular Sequence Data , Plants/genetics , Rats , Sequence Homology, Nucleic Acid , Software , Transcription Factors/metabolism
6.
Bioinformatics ; 19(15): 1964-71, 2003 Oct 12.
Article in English | MEDLINE | ID: mdl-14555630

ABSTRACT

UNLABELLED: In this paper we propose a new method for recognition of prokaryotic promoter regions with startpoints of transcription. The method is based on the Sequence Alignment Kernel, a function reflecting the quantitative measure of match between two sequences. This kernel function is further used in a Dual SVM, which performs the recognition. Several recognition methods have been trained and tested on a positive data set, consisting of 669 sigma70-promoter regions with known transcription startpoints of Escherichia coli, and two negative data sets of 709 examples each, taken from coding and non-coding regions of the same genome. The results show that our method performs well and achieves a 16.5% average error rate on positive & coding negative data and an 18.6% average error rate on positive & non-coding negative data. AVAILABILITY: The demo version of our method is accessible from our website http://mendel.cs.rhul.ac.uk/.


Subject(s)
Algorithms , Artificial Intelligence , Escherichia coli/genetics , Gene Expression Profiling/methods , Pattern Recognition, Automated , Promoter Regions, Genetic/genetics , Sequence Alignment/methods , Sequence Analysis, DNA/methods , Reproducibility of Results , Sensitivity and Specificity
7.
Plant Mol Biol ; 52(5): 923-34, 2003 Jul.
Article in English | MEDLINE | ID: mdl-14558655

ABSTRACT

Pairwise comparison of whole plastid and draft nuclear genomic sequences of Arabidopsis thaliana and Oryza sativa L. ssp. indica shows that rice nuclear genomic sequences contain homologs of plastid DNA covering about 94 kb (83%) of the plastid genome and including one or more full-length intact (without mutations resulting in premature stop codons) homologues of 26 known protein-coding (KPC) plastid genes. By contrast, only about 20 kb (16%) of chloroplast DNA, including a single intact plastid-derived KPC gene, is present in the nucleus of A. thaliana. Sixteen rice plastid genes have at least one nuclear copy without any mutation or with only synonymous substitutions. Nuclear copies of the other ten plastid genes contain both synonymous and non-synonymous substitutions. Multiple ESTs for 25 out of 26 KPC genes were also found, as well as putative promoters for some of them. The study of substitution patterns shows that some of the nuclear homologues of plastid genes may be functional and/or are under the pressure of positive natural selection. A similar comparative analysis performed on rice chromosome 1 revealed 27 contigs containing plastid-derived sequences, totalling about 84 kb and covering two thirds of chloroplast DNA, with intact nuclear copies of 26 different KPC genes. One of these contigs, AP003280, includes almost 57 kb (45%) of the chloroplast genome with intact copies of 22 KPC genes. At the same time, we observed that the relative locations of homologues in plastid DNA and the nuclear genome are significantly different.


Subject(s)
Arabidopsis/genetics , Cell Nucleus/genetics , Genome, Plant , Oryza/genetics , Plastids/genetics , Chromosomes, Plant/genetics , DNA, Chloroplast/genetics , Gene Dosage , Genes, Plant/genetics , Nuclear Proteins/genetics , Plant Proteins/genetics