Results 1 - 20 of 24
1.
Prog Biophys Mol Biol ; 193: 46-54, 2024 Sep 10.
Article in English | MEDLINE | ID: mdl-39260792

ABSTRACT

DNA is the macromolecule responsible for storing the genetic information of a cell and it has intrinsic properties such as deformability, stability and curvature. DNA Curvature plays an important role in gene transcription and, consequently, in the subsequent production of proteins, a fundamental process of cells. With recent advances in bioinformatics and theoretical biology, it became possible to analyze and understand the involvement of DNA Curvature as a discriminatory characteristic of gene-promoting regions. These regions act as sites where RNAp (ribonucleic acid-polymerase) binds to initiate transcription. This review aims to describe the formation of Curvature, as well as highlight its importance in predicting promoters. Furthermore, this article provides the potential of DNA Curvature as a distinguishing feature for promoter prediction tools, as well as outlining the calculation procedures that have been described by other researchers. This work may support further studies directed towards the enhancement of promoter prediction software.

2.
Genes (Basel) ; 14(7)2023 07 13.
Article in English | MEDLINE | ID: mdl-37510345

ABSTRACT

Promoters are non-coding DNA regions around the transcription start site and are responsible for regulating the gene transcription process. Due to their key role in gene function and transcriptional activity, the accurate prediction of promoter sequences and their core elements is a crucial research area in bioinformatics. At present, models based on machine learning and deep learning have been developed for promoter prediction. However, these models can neither mine the deeper biological information of promoter sequences nor consider the complex relationships among promoter sequences. In this work, we propose a novel prediction model called PromGER to predict eukaryotic promoter sequences. For a promoter sequence, firstly, PromGER utilizes four types of feature-encoding methods to extract local information within promoter sequences. Secondly, according to the potential relationships among promoter sequences, the whole set of promoter sequences is constructed as a graph. Furthermore, three different scales of graph-embedding methods are applied to obtain the global feature information in the graph more comprehensively. Finally, combining local features with global features of sequences, PromGER analyzes and predicts promoter sequences through a tree-based ensemble-learning framework. Compared with seven existing methods, PromGER improved the average specificity by 13%, accuracy by 10%, Matthews correlation coefficient by 16%, precision by 4%, F1 score by 6%, and AUC by 9%. Specifically, this study interpreted PromGER using the t-distributed stochastic neighbor embedding (t-SNE) method and SHapley Additive exPlanations (SHAP) value analysis, which demonstrates the interpretability of the model.


Subject(s)
Eukaryota , Eukaryotic Cells , Promoter Regions, Genetic , Computational Biology/methods , Machine Learning
3.
Front Genet ; 13: 1067562, 2022.
Article in English | MEDLINE | ID: mdl-36523764

ABSTRACT

Since the introduction of the first transformer model with a unique self-attention mechanism, natural language processing (NLP) models have attained state-of-the-art (SOTA) performance on various tasks. As DNA is the blueprint of life, it can be viewed as an unusual language, with its characteristic lexicon and grammar. Therefore, NLP models may provide insights into the meaning of the sequential structure of DNA. In the current study, we employed and compared the performance of popular SOTA NLP models (i.e., XLNET, BERT, and a variant DNABERT trained on the human genome) to predict and analyze the promoters in freshwater cyanobacterium Synechocystis sp. PCC 6803 and the fastest growing cyanobacterium Synechococcus elongatus sp. UTEX 2973. These freshwater cyanobacteria are promising hosts for phototrophically producing value-added compounds from CO2. Through a custom pipeline, promoters and non-promoters from Synechococcus elongatus sp. UTEX 2973 were used to train the model. The trained model achieved an AUROC score of 0.97 and F1 score of 0.92. During cross-validation with promoters from Synechocystis sp. PCC 6803, the model achieved an AUROC score of 0.96 and F1 score of 0.91. To increase accessibility, we developed an integrated platform (TSSNote-CyaPromBERT) to facilitate large dataset extraction, model training, and promoter prediction from public dRNA-seq datasets. Furthermore, various visualization tools have been incorporated to address the "black box" issue of deep learning and feature analysis. The learning transfer ability of large language models may help identify and analyze promoter regions for newly isolated strains with similar lineages.

4.
Biology (Basel) ; 11(8)2022 Jul 26.
Article in English | MEDLINE | ID: mdl-35892972

ABSTRACT

In this study, we used a mathematical method for the multiple alignment of highly divergent sequences (MAHDS) to create a database of potential promoter sequences (PPSs) in the Capsicum annuum genome. To search for PPSs, 20 statistically significant classes of sequences located in the range from -499 to +100 nucleotides near the annotated genes were calculated. For each class, a position-weight matrix (PWM) was computed and then used to identify PPSs in the C. annuum genome. In total, 825,136 PPSs were detected, with a false positive rate of 0.13%. The PPSs obtained with the MAHDS method were tested using TSSFinder, which detects transcription start sites. The databank of the found PPSs provides their coordinates in chromosomes, the alignment of each PPS with the PWM, and the level of statistical significance as a normal distribution argument, and can be used in genetic engineering and biotechnology.
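
The PWM step described in this abstract can be illustrated with a minimal sketch. It is not the MAHDS implementation: the uniform background of 0.25, the pseudocount, and the function names `build_pwm`, `score`, and `best_hit` are assumptions made for illustration, and the toy alignment is invented.

```python
import math

def build_pwm(aligned, pseudocount=1.0):
    """Log-odds PWM from equal-length aligned sequences (assumed uniform background)."""
    pwm = []
    for pos in range(len(aligned[0])):
        # Start every base at the pseudocount so unseen bases keep a finite score.
        counts = {base: pseudocount for base in "ACGT"}
        for seq in aligned:
            counts[seq[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2((counts[b] / total) / 0.25) for b in "ACGT"})
    return pwm

def score(pwm, window):
    """Sum of per-position log-odds for a window as wide as the PWM."""
    return sum(column[base] for column, base in zip(pwm, window))

def best_hit(pwm, sequence):
    """Return (score, start) of the best-scoring window in an A/C/G/T sequence."""
    width = len(pwm)
    return max(
        (score(pwm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )

# Toy alignment (hypothetical data, not taken from the C. annuum study).
pwm = build_pwm(["TATAAT", "TATAAT", "TACAAT", "TATGAT"])
hit_score, hit_start = best_hit(pwm, "GGCGTATAATGCC")
```

In a genome-wide search, a score threshold calibrated on shuffled or random sequence would typically be used to control the false positive rate; that calibration is outside this sketch.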

5.
Comput Biol Med ; 147: 105627, 2022 08.
Article in English | MEDLINE | ID: mdl-35671653

ABSTRACT

Locating the promoter region in DNA sequences is of paramount importance in bioinformatics. This problem has been widely studied in the literature, but it has not yet been fully resolved. Some researchers have shown remarkable results using convolutional networks that allowed the automatic extraction of features from a DNA chain. However, a single architecture schema that could learn the promoter prediction task competitively for several organisms has not yet been achieved. Thus, researchers must seek new architectures by hand-designing or by Neural Architecture Search for each new evaluated organism dataset. This work proposes a versatile architecture based on a capsule network that can accurately identify promoter sequences in raw DNA data from five different organisms, eukaryotic and prokaryotic. Our architecture, the CapsProm, could help create models with minimum effort to learn the promoter identification task between different datasets. Furthermore, the CapsProm showed competitive results, overcoming the baseline method in five out of seven tested datasets (F1-score). The models and source code are made available at https://github.com/lauromoraes/CapsNet-promoter.


Subject(s)
Computational Biology , Neural Networks, Computer , Computational Biology/methods , DNA , Promoter Regions, Genetic/genetics , Software
6.
Int J Mol Sci ; 23(7)2022 Mar 28.
Article in English | MEDLINE | ID: mdl-35409058

ABSTRACT

Nucleic acids are the basic units of deoxyribonucleic acid (DNA) sequencing. Every organism demonstrates different DNA sequences with specific nucleotides. It reveals the genetic information carried by a particular DNA segment. Nucleic acid sequencing expresses the evolutionary changes among organisms and revolutionizes disease diagnosis in animals. This paper proposes a generative adversarial networks (GAN) model to create synthetic nucleic acid sequences of the cat genome tuned to exhibit specific desired properties. We obtained the raw sequence data from Illumina next generation sequencing. Various data preprocessing steps were performed using Cutadapt and DADA2 tools. The processed data were fed to the GAN model that was designed following the architecture of Wasserstein GAN with gradient penalty (WGAN-GP). We introduced a predictor and an evaluator in our proposed GAN model to tune the synthetic sequences to acquire certain realistic properties. The predictor was built for extracting samples with a promoter sequence, and the evaluator was built for filtering samples that scored high for motif-matching. The filtered samples were then passed to the discriminator. We evaluated our model based on multiple metrics and demonstrated outputs for latent interpolation, latent complementation, and motif-matching. Evaluation results showed our proposed GAN model achieved 93.7% correlation with the original data and produced significant outcomes as compared to existing models for sequence generation.


Subject(s)
Adenosine Deaminase , Image Processing, Computer-Assisted , DNA , Image Processing, Computer-Assisted/methods , Intercellular Signaling Peptides and Proteins
7.
Bioprocess Biosyst Eng ; 45(5): 955-967, 2022 May.
Article in English | MEDLINE | ID: mdl-35279747

ABSTRACT

Promoters contribute to research in the context of many diseases, such as coronary heart disease, diabetes and tumors, and one fundamental task is to identify promoters. Deep learning is widely used in the study of promoter sequence recognition. Although deep models have fast and accurate recognition capabilities, they are also limited by their reliance on large amounts of high-quality data. Therefore, we performed transfer learning on a typical deep network based on residual ideas, called a deep residual network (ResNet), to solve the problem of a deep network's high dependence on large amounts of data in the process of promoter prediction. We used binary one-hot encoding to represent the promoter and took advantage of ResNet to extract feature representations from organisms with a large amount of promoter data. Then, we transferred the learned structural parameters to target organisms with insufficient promoter data to improve the generalization performance of ResNet in target organisms. We evaluated the promoter datasets of four organisms (Bacillus subtilis, Escherichia coli, Saccharomyces cerevisiae and Drosophila melanogaster). The experimental results showed that the AUCs of ResNet's promoter prediction after deep transfer were 0.8537 and 0.8633, which increased by 0.1513 and 0.1376 in prokaryotes and eukaryotes, respectively.


Subject(s)
Drosophila melanogaster , Eukaryota , Animals , Bacillus subtilis/genetics , Escherichia coli/genetics , Machine Learning , Promoter Regions, Genetic
8.
Int J Mol Sci ; 23(3)2022 Feb 03.
Article in English | MEDLINE | ID: mdl-35163661

ABSTRACT

The identification of promoters is an essential step in the genome annotation process, providing a framework for gene regulatory networks and their role in transcription regulation. Despite considerable advances in the high-throughput determination of transcription start sites (TSSs) and transcription factor binding sites (TFBSs), experimental methods are still time-consuming and expensive. Instead, several computational approaches have been developed to provide fast and reliable means for predicting the location of TSSs and regulatory motifs on a genome-wide scale. Numerous studies have been carried out on the regulatory elements of mammalian genomes, but plant promoters, especially in gymnosperms, have been left out of the limelight and, therefore, have been poorly investigated. The aim of this study was to enhance and expand the existing genome annotations using computational approaches for genome-wide prediction of TSSs in the four conifer species: loblolly pine, white spruce, Norway spruce, and Siberian larch. Our pipeline will be useful for TSS predictions in other genomes, especially for draft assemblies, where reliable TSS predictions are not usually available. We also explored some of the features of the nucleotide composition of the predicted promoters and compared the GC properties of conifer genes with model monocot and dicot plants. Here, we demonstrate that even incomplete genome assemblies and partial annotations can be a reliable starting point for TSS annotation. The results of the TSS prediction in four conifer species have been deposited in the Persephone genome browser, which allows smooth visualization and is optimized for large data sets. This work provides the initial basis for future experimental validation and the study of the regulatory regions to understand gene regulation in gymnosperms.


Subject(s)
Genome, Plant , Tracheophyta/genetics , Transcription Initiation Site , Base Composition/genetics , Binding Sites , DNA, Plant/genetics , Exons/genetics , Molecular Sequence Annotation , Nucleotide Motifs/genetics , Nucleotides/metabolism , Open Reading Frames/genetics , Promoter Regions, Genetic , Transcription Factors/metabolism
9.
Genome Biol ; 22(1): 318, 2021 11 17.
Article in English | MEDLINE | ID: mdl-34789306

ABSTRACT

Promoters are genomic regions where the transcription machinery binds to initiate the transcription of specific genes. Computational tools for identifying bacterial promoters have been around for decades. However, most of these tools were designed to recognize promoters in one or few bacterial species. Here, we present Promotech, a machine-learning-based method for promoter recognition in a wide range of bacterial species. We compare Promotech's performance with the performance of five other promoter prediction methods. Promotech outperforms these other programs in terms of area under the precision-recall curve (AUPRC) or precision at the same level of recall. Promotech is available at https://github.com/BioinformaticsLabAtMUN/PromoTech .


Subject(s)
Bacteria/genetics , Computational Biology/methods , Promoter Regions, Genetic , Genomics , Machine Learning , Software
10.
ACS Synth Biol ; 10(6): 1394-1405, 2021 06 18.
Article in English | MEDLINE | ID: mdl-33988977

ABSTRACT

Engineering microorganisms into biological factories that convert renewable feedstocks into valuable materials is a major goal of synthetic biology; however, for many nonmodel organisms, we do not yet have the genetic tools, such as suites of strong promoters, necessary to effectively engineer them. In this work, we developed a computational framework that can leverage standard RNA-seq data sets to identify sets of constitutive, strongly expressed genes and predict strong promoter signals within their upstream regions. The framework was applied to a diverse collection of RNA-seq data measured for the methanotroph Methylotuvimicrobium buryatense 5GB1 and identified 25 genes that were constitutively, strongly expressed across 12 experimental conditions. For each gene, the framework predicted short (27-30 nucleotide) sequences as candidate promoters and derived -35 and -10 consensus promoter motifs (TTGACA and TATAAT, respectively) for strong expression in M. buryatense. This consensus closely matches the canonical E. coli sigma-70 motif and was found to be enriched in promoter regions of the genome. A subset of promoter predictions was experimentally validated in a XylE reporter assay, including the consensus promoter, which showed high expression. The pmoC, pqqA, and ssrA promoter predictions were additionally screened in an experiment that scrambled the -35 and -10 signal sequences, confirming that transcription initiation was disrupted when these specific regions of the predicted sequence were altered. These results indicate that the computational framework can make biologically meaningful promoter predictions and identify key pieces of regulatory systems that can serve as foundational tools for engineering diverse microorganisms for biomolecule production.
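
As a rough illustration of searching an upstream region for the -35 and -10 consensus motifs named in this abstract, the sketch below scans for both hexamers separated by a variable spacer. It is not the authors' framework: the spacer range of 15 to 19 nucleotides, the mismatch budget, and the function names are assumptions.

```python
def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def scan_promoters(seq, box35="TTGACA", box10="TATAAT",
                   spacers=range(15, 20), max_mismatch=2):
    """Return (start, spacer, total_mismatches) for each candidate -35/-10 pair.

    The spacer range and mismatch budget are illustrative assumptions.
    """
    hits = []
    w35, w10 = len(box35), len(box10)
    for i in range(len(seq) - w35 + 1):
        m35 = mismatches(seq[i:i + w35], box35)
        if m35 > max_mismatch:
            continue
        for spacer in spacers:
            j = i + w35 + spacer
            if j + w10 > len(seq):
                break  # longer spacers would run past the end of the sequence
            total = m35 + mismatches(seq[j:j + w10], box10)
            if total <= max_mismatch:
                hits.append((i, spacer, total))
    return hits

# Synthetic upstream region (hypothetical): consensus boxes with a 17-nt spacer.
upstream = "GGG" + "TTGACA" + "C" * 17 + "TATAAT" + "GGG"
candidates = scan_promoters(upstream)
```

A consensus scan of this kind only proposes candidates; the study validated its predictions experimentally with a reporter assay.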


Subject(s)
Metabolic Engineering/methods , Methylococcaceae/genetics , Methylococcaceae/metabolism , Promoter Regions, Genetic/genetics , RNA-Seq/methods , Base Sequence , Computational Biology/methods , DNA-Directed RNA Polymerases/genetics , Escherichia coli/genetics , Genome, Bacterial , RNA, Bacterial/genetics , Sigma Factor/genetics , Transcription Initiation Site , Transcription Initiation, Genetic , Transcriptome/genetics
11.
PeerJ Comput Sci ; 7: e365, 2021.
Article in English | MEDLINE | ID: mdl-33817015

ABSTRACT

Gene promoters are the key DNA regulatory elements positioned around the transcription start sites and are responsible for regulating gene transcription process. Various alignment-based, signal-based and content-based approaches are reported for the prediction of promoters. However, since all promoter sequences do not show explicit features, the prediction performance of these techniques is poor. Therefore, many machine learning and deep learning models have been proposed for promoter prediction. In this work, we studied methods for vector encoding and promoter classification using genome sequences of three distinct higher eukaryotes viz. yeast (Saccharomyces cerevisiae), A. thaliana (plant) and human (Homo sapiens). We compared one-hot vector encoding method with frequency-based tokenization (FBT) for data pre-processing on 1-D Convolutional Neural Network (CNN) model. We found that FBT gives a shorter input dimension reducing the training time without affecting the sensitivity and specificity of classification. We employed the deep learning techniques, mainly CNN and recurrent neural network with Long Short Term Memory (LSTM) and random forest (RF) classifier for promoter classification at k-mer sizes of 2, 4 and 8. We found CNN to be superior in classification of promoters from non-promoter sequences (binary classification) as well as species-specific classification of promoter sequences (multiclass classification). In summary, the contribution of this work lies in the use of synthetic shuffled negative dataset and frequency-based tokenization for pre-processing. This study provides a comprehensive and generic framework for classification tasks in genomic applications and can be extended to various classification problems.
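
The contrast between one-hot encoding and frequency-based tokenization can be sketched as follows. This is an assumed reading of FBT (non-overlapping k-mers, integer ids assigned by descending frequency in the training set, id 0 reserved for unseen k-mers), not the authors' code, and all function names are hypothetical.

```python
from collections import Counter

ONE_HOT = {"A": [1, 0, 0, 0], "C": [0, 1, 0, 0],
           "G": [0, 0, 1, 0], "T": [0, 0, 0, 1]}

def one_hot(seq):
    """One 4-element vector per nucleotide, so the input length equals len(seq)."""
    return [ONE_HOT[base] for base in seq]

def kmers(seq, k):
    """Non-overlapping k-mers; a trailing fragment shorter than k is dropped."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]

def build_vocab(train_seqs, k):
    """Map each k-mer to an id ranked by frequency (1 = most frequent)."""
    counts = Counter(km for s in train_seqs for km in kmers(s, k))
    ranked = sorted(counts, key=lambda km: (-counts[km], km))
    return {km: idx + 1 for idx, km in enumerate(ranked)}

def encode(seq, k, vocab):
    """Integer token ids; 0 marks a k-mer not seen during training."""
    return [vocab.get(km, 0) for km in kmers(seq, k)]

# Toy training set (hypothetical sequences).
vocab = build_vocab(["ACGTACGT", "ACGTTTTT"], k=4)
tokens = encode("ACGTTTTTGGGG", 4, vocab)  # 3 tokens instead of 12 one-hot rows
```

With k = 4 the tokenized input is a quarter of the one-hot length, which is consistent with the shorter input dimension the abstract reports.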

12.
Biomolecules ; 10(10)2020 09 28.
Article in English | MEDLINE | ID: mdl-32998424

ABSTRACT

CXCL8 (interleukin-8, IL-8) is a CXC family chemokine that recruits specific target cells and mediates inflammation and wound healing. This study reports the identification and characterization of two cxcl8 homologs from rock bream, Oplegnathus fasciatus. Investigation of molecular signature, homology, phylogeny, and gene structure suggested that they belonged to lineages 1 (L1) and 3 (L3), and designated Ofcxcl8-L1 and Ofcxcl8-L3. While Ofcxcl8-L1 and Ofcxcl8-L3 revealed quadripartite and tripartite organization, in place of the mammalian ELR (Glu-Leu-Arg) motif, their peptides harbored EMH (Glu-Met-His) and NSH (Asn-Ser-His) motifs, respectively. Transcripts of Ofcxcl8s were constitutively detected by Quantitative Real-Time PCR (qPCR) in 11 tissues examined, however, at different levels. Ofcxcl8-L1 transcript robustly responded to treatments with stimulants, such as flagellin, concanavalin A, lipopolysaccharide, and poly(I:C), and pathogens, including Edwardsiella tarda, Streptococcus iniae, and rock bream iridovirus, when compared with Ofcxcl8-L3 mRNA. The differences in the putative promoter features may partly explain the differential transcriptional modulation of Ofcxcl8s. Purified recombinant OfCXCL8 (rOfCXCL8) proteins were used in in vitro chemotaxis and proliferation assays. Despite the lack of ELR motif, both rOfCXCL8s exhibited leukocyte chemotactic and proliferative functions, where the potency of rOfCXCL8-L1 was robust and significant compared to that of rOfCXCL8-L3. The results, taken together, are indicative of the crucial importance of Ofcxcl8s in inflammatory responses and immunoregulatory roles in rock bream immunity.


Subject(s)
Genomics , Interleukin-8/metabolism , Perciformes/metabolism , Amino Acid Motifs , Amino Acid Sequence , Animals , Edwardsiella tarda/physiology , Fish Proteins/classification , Fish Proteins/genetics , Fish Proteins/metabolism , Interleukin-8/classification , Interleukin-8/genetics , Iridovirus/physiology , Lipopolysaccharides/pharmacology , Perciformes/genetics , Perciformes/microbiology , Phylogeny , Poly I-C/pharmacology , Promoter Regions, Genetic , Protein Domains , Protein Isoforms/classification , Protein Isoforms/genetics , Protein Isoforms/metabolism , RNA, Messenger/metabolism , Recombinant Proteins/biosynthesis , Recombinant Proteins/chemistry , Recombinant Proteins/isolation & purification , Sequence Alignment , Transcription, Genetic/drug effects
13.
mSystems ; 5(4)2020 08 25.
Article in English | MEDLINE | ID: mdl-32843538

ABSTRACT

The promoter region is a key element required for the production of RNA in bacteria. While new high-throughput technology allows massively parallel mapping of promoter elements, we still mainly rely on bioinformatics tools to predict such elements in bacterial genomes. Additionally, despite many different prediction tools having become popular to identify bacterial promoters, no systematic comparison of such tools has been performed. Here, we performed a systematic comparison between several widely used promoter prediction tools (BPROM, bTSSfinder, BacPP, CNNProm, IBBP, Virtual Footprint, iPro70-FMWin, 70ProPred, iPromoter-2L, and MULTiPly) using well-defined sequence data sets and standardized metrics to determine how well those tools performed relative to each other. For this, we used data sets of experimentally validated promoters from Escherichia coli and a control data set composed of randomly generated sequences with similar nucleotide distributions. We compared the performance of the tools using metrics such as specificity, sensitivity, accuracy, and Matthews correlation coefficient (MCC). We show that the widely used BPROM presented the worst performance among the compared tools, while four tools (CNNProm, iPro70-FMWin, 70ProPred, and iPromoter-2L) offered high predictive power. Of these tools, iPro70-FMWin exhibited the best results for most of the metrics used. We present here some potentials and limitations of available tools, and we hope that future work can build upon our effort to systematically characterize this useful class of bioinformatics tools.
IMPORTANCE The correct mapping of promoter elements is a crucial step in microbial genomics. Also, when combining new DNA elements into synthetic sequences, predicting the potential generation of new promoter sequences is critical. Over the last years, many bioinformatics tools have been created to allow users to predict promoter elements in a sequence or genome of interest. Here, we assess the predictive power of some of the main prediction tools available using well-defined promoter data sets. Using Escherichia coli as a model organism, we demonstrated that while some tools are biased toward AT-rich sequences, others are very efficient in identifying real promoters with low false-negative rates. We hope the potentials and limitations presented here will help the microbiology community to choose promoter prediction tools among many available alternatives.
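
The four metrics named in this abstract follow directly from confusion-matrix counts. The sketch below uses the standard definitions; the function name `promoter_metrics` and the example counts are illustrative assumptions, not values from the study.

```python
import math

def promoter_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, accuracy and MCC from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)          # true promoters recovered
    specificity = tn / (tn + fp)          # control sequences correctly rejected
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when any marginal is zero; 0.0 is a common convention.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity,
            "accuracy": accuracy, "mcc": mcc}

# Hypothetical counts: 100 real promoters and 100 random control sequences.
result = promoter_metrics(tp=90, fp=20, tn=80, fn=10)
```

MCC is useful alongside accuracy because it stays informative when a tool is biased toward calling one class, for example toward AT-rich sequences.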

14.
DNA Cell Biol ; 36(12): 1081-1092, 2017 Dec.
Article in English | MEDLINE | ID: mdl-29039971

ABSTRACT

Phytoplasmas are obligate intracellular parasitic bacteria that infect both plants and insects. We previously identified the sigma factor RpoD-dependent consensus promoter sequence of phytoplasma. However, the genome-wide landscape of RNA transcripts, including non-coding RNAs (ncRNAs) and RpoD-independent promoter elements, was still unknown. In this study, we performed an improved RNA sequencing analysis for genome-wide identification of the transcription start sites (TSSs) and the consensus promoter sequences. We constructed cDNA libraries using a random adenine/thymine hexamer primer, in addition to a conventional random hexamer primer, for efficient sequencing of 5'-termini of AT-rich phytoplasma RNAs. We identified 231 TSSs, which were classified into four categories: mRNA TSSs, internal sense TSSs, antisense TSSs (asTSSs), and orphan TSSs (oTSSs). The presence of asTSSs and oTSSs indicated the genome-wide transcription of ncRNAs, which might act as regulatory ncRNAs in phytoplasmas. This is the first description of genome-wide phytoplasma ncRNAs. Using a de novo motif discovery program, we identified two consensus motif sequences located upstream of the TSSs. While one was almost identical to the RpoD-dependent consensus promoter sequence, the other was an unidentified novel motif, which might be recognized by another transcription initiation factor. These findings are valuable for understanding the regulatory mechanism of phytoplasma gene expression.


Subject(s)
Phytoplasma/genetics , Animals , Base Sequence , Conserved Sequence , Gene Library , Genome, Bacterial , Insecta/microbiology , Phytoplasma/pathogenicity , Plants/microbiology , Promoter Regions, Genetic , RNA, Bacterial/genetics , RNA, Untranslated/genetics , Sequence Analysis, RNA , Transcription Initiation Site
15.
DNA Res ; 24(3): 271-278, 2017 Jun 01.
Article in English | MEDLINE | ID: mdl-28158431

ABSTRACT

In our previous study, a methodology was established to predict transcriptional regulatory elements in promoter sequences using transcriptome data based on a frequency comparison of octamers. Some transcription factors, including the NAC family, cannot be covered by this method because their binding sequences have non-specific spacers in the middle of the two binding sites. In order to remove this blind spot in promoter prediction, we have extended our analysis by including bipartite octamers that are composed of '4 bases-a spacer with a flexible length-4 bases'. 8,044 pre-selected bipartite octamers, which had an overrepresentation of specific spacer lengths in promoter sequences and sequences related to core elements removed, were subjected to frequency comparison analysis. Prediction of ER stress-responsive elements in the BiP/BiPL promoter and an ANAC017 target sequence resulted in precise detection of true positives, judged by functional analyses of a reported article and our own in vitro protein-DNA binding assays. These results demonstrate that incorporation of bipartite octamers with continuous ones improves promoter prediction significantly.


Subject(s)
Arabidopsis Proteins/genetics , Arabidopsis/genetics , Gene Expression Regulation, Plant , Genomics , Promoter Regions, Genetic , Transcription Factors/metabolism , Arabidopsis/metabolism , Transcriptome
16.
World J Microbiol Biotechnol ; 33(2): 23, 2017 Feb.
Article in English | MEDLINE | ID: mdl-28044271

ABSTRACT

Production of useful chemicals by industrial microorganisms has been attracting more and more attention. Microorganisms screened from their natural environment usually suffer from low productivity, low stress resistance, and accumulation of by-products. In order to overcome these disadvantages, rational engineering of microorganisms to achieve specific industrial goals has become routine. Rapid development of metabolic engineering and synthetic biology strategies provide novel methods to improve the performance of industrial microorganisms. Rational regulation of gene expression by specific promoters is essential to engineer industrial microorganisms for high-efficiency production of target chemicals. Identification, modification, and application of suitable promoters could provide powerful switches at the transcriptional level for fine-tuning of a single gene or a group of genes, which are essential for the reconstruction of pathways. In this review, the characteristics of promoters from eukaryotic, prokaryotic, and archaea microorganisms are briefly introduced. Identification of promoters based on both traditional biochemical and systems biology routes are summarized. Besides rational modification, de novo design of promoters to achieve gradient, dynamic, and logic gate regulation are also introduced. Furthermore, flexible application of static and dynamic promoters for the rational engineering of industrial microorganisms is highlighted. From the perspective of powerful promoters in industrial microorganisms, this review will provide an extensive description of how to regulate gene expression in industrial microorganisms to achieve more useful goals.


Subject(s)
Gene Expression , Industrial Microbiology/methods , Promoter Regions, Genetic , Archaea/genetics , Bacteria/genetics , Eukaryota/genetics , Gene Expression Regulation , Metabolic Engineering/methods , Synthetic Biology/methods
17.
DNA Res ; 24(1): 25-35, 2017 Feb 01.
Article in English | MEDLINE | ID: mdl-27803028

ABSTRACT

Next-generation sequencing studies have revealed that a variety of transcripts are present in the prokaryotic transcriptome and a significant fraction of them are functional, being involved in various regulatory activities apart from coding for proteins. Identification of promoters associated with different transcripts is necessary for characterization of the transcriptome. Promoter regions have been shown to have unique structural features as compared with their flanking region, in organisms covering all domains of life. Here we report an in silico analysis of DNA sequence dependent structural properties like stability, bendability and curvature in the promoter region of six different prokaryotic transcriptomes. Using these structural features, we predicted promoters associated with different categories of transcripts (mRNA, internal, antisense and non-coding), which constitute the transcriptome. Promoter annotation using structural features is fairly accurate and reliable with about 50% of the primary promoters being characterized by all three structural properties while at least one property identifies 95%. We also studied the relative differences of these structural features in terms of gene expression and found that the features, viz. lower stability, lesser bendability and higher curvature are more prominent in the promoter regions which are associated with high gene expression as compared with low expression genes. Hence, promoters, which are associated with higher gene expression, get annotated well using DNA structural features as compared with those, which are linked to lower gene expression.


Subject(s)
DNA/chemistry , Gene Expression , Promoter Regions, Genetic , Transcriptome , Bacteria/genetics
18.
BMC Genomics ; 17: 302, 2016 Apr 23.
Article in English | MEDLINE | ID: mdl-27107716

ABSTRACT

BACKGROUND: Differential RNA-sequencing (dRNA-seq) is indispensable for determination of primary transcriptomes. However, using dRNA-seq data to map transcriptional start sites (TSSs) and promoters genome-wide is a bioinformatics challenge. We performed dRNA-seq of Bradyrhizobium japonicum USDA 110, the nitrogen-fixing symbiont of soybean, and developed algorithms to map TSSs and promoters. RESULTS: A specialized machine learning procedure for TSS recognition allowed us to map 15,923 TSSs: 14,360 in free-living bacteria, 4329 in symbiosis with soybean and 2766 in both conditions. Further, we provide proteomic evidence for 4090 proteins, among them 107 proteins corresponding to new genes and 178 proteins with N-termini different from the existing annotation (72 and 109 of them with TSS support, respectively). Guided by proteomics evidence, previously identified TSSs and TSSs experimentally validated here, we assign a score threshold to flag 14 % of the mapped TSSs as a class of lower confidence. However, this class of lower confidence contains valid TSSs of low-abundant transcripts. Moreover, we developed a de novo algorithm to identify promoter motifs upstream of mapped TSSs, which is publicly available, and found motifs mainly used in symbiosis (similar to RpoN-dependent promoters) or under both conditions (similar to RpoD-dependent promoters). Mapped TSSs and putative promoters, proteomic evidence and updated gene annotation were combined into an annotation file. CONCLUSIONS: The genome-wide TSS and promoter maps along with the extended genome annotation of B. japonicum represent a valuable resource for future systems biology studies and for detailed analyses of individual non-coding transcripts and ORFs. Our data will also provide new insights into bacterial gene regulation during the agriculturally important symbiosis between rhizobia and legumes.


Subject(s)
Bradyrhizobium/genetics , Chromosome Mapping/methods , Promoter Regions, Genetic , Transcription Initiation Site , Algorithms , Computational Biology , Machine Learning , Proteome , RNA, Bacterial/genetics , Sequence Analysis, RNA , Glycine max/microbiology , Symbiosis
19.
FEBS Open Bio ; 5: 813-23, 2015.
Article in English | MEDLINE | ID: mdl-26566476

ABSTRACT

A 14 kDa protein homologous to the γ-d-glutamyl-l-diamino acid endopeptidase members of the NlpC/P60 Superfamily has been described in Dermatophagoides pteronyssinus and Dermatophagoides farinae but it is not clear whether other species produce homologues. Bioinformatics revealed homologous genes in other Sarcoptiformes mite species (Psoroptes ovis and Blomia tropicalis) but not in Tetranychus urticae and Metaseiulus occidentalis. The degrees of identity (similarity) between the D. pteronyssinus mature protein and those from D. farinae, P. ovis and B. tropicalis were 82% (96%), 77% (93%) and 61% (82%), respectively. Phylogenetic studies showed the mite proteins were monophyletic and shared a common ancestor with both actinomycetes and ascomycetes. The gene encoding the D. pteronyssinus protein was polymorphic and intronless in contrast to that reported for D. farinae. Homology studies suggest that the mite, ascomycete and actinomycete proteins are involved in the catalysis of stem peptide attached to peptidoglycan. The finding of a gene encoding a P60 family member in the D. pteronyssinus genome together with the presence of a bacterial promoter suggests an evolutionary link to one or more prokaryotic endosymbionts.

20.
Int J Bioinform Res Appl ; 11(4): 347-65, 2015.
Article in English | MEDLINE | ID: mdl-26561319

ABSTRACT

In this work we describe a bacterial open reading frame with two different directions of nucleotide usage bias in its two parts. The level of GC-content in third codon positions (3GC) is equal to 40.17 ± 0.22% along most of the length of the Corynebacterium diphtheriae spaC gene. However, in the 3'-end of the same gene (from codon #1600 to codon #1873) the 3GC level is equal to 64.61 ± 0.91%. Using original methodology ('VVTAK Sliding window' and 'VVTAK VarInvar') we confirmed that there is an ongoing mutational AT-pressure along most of the length of the spaC gene (up to codon #1599), and an ongoing mutational G-pressure in the 3'-end of spaC. Intragenic promoters predicted by three different methods may be the cause of the differences in preferable types of nucleotide mutations in the parts of spaC because of their autonomous transcription.
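
The 3GC measurement described in this abstract can be approximated with a simple sliding window over third codon positions. This is a generic sketch, not the 'VVTAK Sliding window' tool; the window and step sizes, the function name `gc3_windows`, and the toy sequence are assumptions.

```python
def gc3_windows(cds, window_codons=100, step_codons=10):
    """Percent G+C at third codon positions in sliding windows along a CDS.

    Returns (first_codon_number, gc3_percent) pairs; codon numbers are 1-based.
    Assumes the sequence starts in frame at the first base of codon 1.
    """
    third = cds.upper()[2::3]  # every third base, starting at codon position 3
    results = []
    for start in range(0, len(third) - window_codons + 1, step_codons):
        chunk = third[start:start + window_codons]
        gc = sum(base in "GC" for base in chunk)
        results.append((start + 1, 100.0 * gc / window_codons))
    return results

# Toy gene (hypothetical): AT-ending codons followed by G-ending codons.
toy_cds = "AAT" * 50 + "AAG" * 50
profile = gc3_windows(toy_cds, window_codons=50, step_codons=50)
```

A sharp change in the profile, such as the one reported near codon #1600 of spaC, marks where the nucleotide usage bias switches direction.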


Subject(s)
Bacterial Proteins/genetics , Codon/genetics , Corynebacterium diphtheriae/genetics , Genes, Bacterial/genetics , Membrane Proteins/genetics , Base Composition/genetics , Genomics , Mutation , Open Reading Frames