ABSTRACT
Overactive Janus kinases (JAKs) are known to drive leukemia, making them well-suited targets for treatment. We sought to identify new JAK-activating mutations and instead found a JAK1-inactivating pseudokinase mutation, V666G. In contrast to other pseudokinase mutations that canonically lead to an active kinase, the JAK1 V666G mutation led to under-activation seen by reduced phosphorylation. To understand the functional role of JAK1 V666G in modifying kinase activity we investigated its influence on other JAK kinases and within the Interleukin-2 pathway. JAK1 V666G not only inhibited its own activity, but its presence could inhibit other JAK kinases. These findings provide new insights into the potential of JAK1 pseudokinase to modulate its own activity, as well as of other JAK kinases. Thus, the features of the JAK1 V666 region in modifying JAK kinases can be exploited to allosterically inhibit overactive JAKs.
Subject(s)
Interleukin-2 , Leukemia , Humans , Phosphorylation , Interleukin-2/genetics , Interleukin-2/metabolism , Janus Kinase 1/genetics , Janus Kinase 1/metabolism , Signal Transduction , Janus Kinases/metabolism , Janus Kinase 3/genetics , Janus Kinase 3/metabolism
ABSTRACT
Mucin-type O-glycosylation is one of the most common posttranslational modifications of proteins. The abnormal expression of various polypeptide GalNAc-transferases (GalNAc-Ts), which initiate and define sites of O-glycosylation, is linked to many cancers and other diseases. Current O-glycosylation prediction programs utilize O-glycoproteomics data obtained without regard to the transferase isoform(s) responsible for the glycosylation. With 20 different GalNAc-Ts in humans, the ability to predict and interpret O-glycosylation sites in terms of specific GalNAc-T isoforms is invaluable. To fill this gap, ISOGlyP (Isoform-Specific O-Glycosylation Prediction) has been developed. Using position-specific enhancement values generated based on GalNAc-T isoform-specific amino acid preferences, ISOGlyP predicts the propensity that a site would be glycosylated by a specific transferase. ISOGlyP gave an overall prediction accuracy of 70% against in vivo data, which is comparable to that of the NetOGlyc4.0 predictor. Additionally, ISOGlyP can identify the known effects of long- and short-range prior glycosylation and can generate potential peptide sequences selectively glycosylated by specific isoforms. ISOGlyP is freely available for use at ISOGlyP.utep.edu. The code is also available on GitHub (https://github.com/jonmohl/ISOGlyP).
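As a rough illustration of how position-specific enhancement values could be turned into a site propensity, the sketch below multiplies the values of the residues flanking each Ser/Thr. The multiplication rule, the table `TOY_EV`, its numbers, and the function names are assumptions made for this example; they are not taken from ISOGlyP or from measured GalNAc-T preferences.

```python
from math import prod

# Hypothetical enhancement values, indexed as TOY_EV[offset][residue].
# Values above 1 are assumed to favor glycosylation; the numbers are invented.
TOY_EV = {
    -1: {"P": 1.8, "A": 1.1},
    +1: {"P": 1.5, "G": 0.7},
    +3: {"P": 2.0},
}

def site_propensity(seq, site, ev_table, default=1.0):
    """Multiply the enhancement values of the residues flanking seq[site]."""
    factors = []
    for offset, prefs in ev_table.items():
        pos = site + offset
        if 0 <= pos < len(seq):
            # Residues missing from the table contribute a neutral factor.
            factors.append(prefs.get(seq[pos], default))
    return prod(factors)

def scan_sites(seq, ev_table):
    """Return {index: propensity} for every Ser/Thr in seq."""
    return {i: site_propensity(seq, i, ev_table)
            for i, aa in enumerate(seq) if aa in "ST"}
```

Calling `scan_sites("APTPAP", TOY_EV)` scores the single Thr at index 2 from its three flanking prolines. A per-isoform predictor would need one such table per GalNAc-T.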
Subject(s)
Mucin-1/metabolism , Glycosylation , Humans , Mucin-1/chemistry , Protein Isoforms
ABSTRACT
Mucin-type O-glycosylation is one of the most common post-translational modifications of proteins. This glycosylation is initiated in the Golgi by the addition of the sugar N-acetylgalactosamine (GalNAc) onto protein Ser and Thr residues by a family of polypeptide GalNAc transferases. In humans there are 20 isoforms that are differentially expressed across tissues that serve multiple important biological roles. Using random peptide substrates, isoform specific amino acid preferences have been obtained in the form of enhancement values (EV). These EVs alone have previously been used to predict O-glycosylation sites via the web based ISOGlyP (Isoform Specific O-Glycosylation Prediction) tool. Here we explore additional protein features to determine whether these can complement the random peptide derived enhancement values and increase the predictive power of ISOGlyP. The inclusion of additional protein substrate features (such as secondary structure and surface accessibility) was found to increase sensitivity with minimal loss of specificity, when tested with three different published in vivo O-glycoproteomics data sets, thus increasing the overall accuracy of the ISOGlyP predictions.
ABSTRACT
BACKGROUND: Ribonucleic acid (RNA) molecules play important roles in many biological processes including gene expression and regulation. Their secondary structures are crucial for the RNA functionality, and the prediction of the secondary structures is widely studied. Our previous research shows that cutting long sequences into shorter chunks, predicting secondary structures of the chunks independently using thermodynamic methods, and reconstructing the entire secondary structure from the predicted chunk structures can yield better accuracy than predicting the secondary structure using the RNA sequence as a whole. The chunking, prediction, and reconstruction processes can use different methods and parameters, some of which produce more accurate predictions than others. In this paper, we study the prediction accuracy and efficiency of three different chunking methods using seven popular secondary structure prediction programs that apply to two datasets of RNA with known secondary structures, which include both pseudoknotted and non-pseudoknotted sequences, as well as a family of viral genome RNAs whose structures have not been predicted before. Our modularized MapReduce framework based on Hadoop allows us to study the problem in a parallel and robust environment. RESULTS: On average, the maximum accuracy retention values are larger than one for our chunking methods and the seven prediction programs over 50 non-pseudoknotted sequences, meaning that the secondary structure predicted using chunking is more similar to the real structure than the secondary structure predicted by using the whole sequence. We observe similar results for the 23 pseudoknotted sequences, except for the NUPACK program using the centered chunking method. The performance analysis for 14 long RNA sequences from the Nodaviridae virus family outlines how the coarse-grained mapping of chunking and predictions in the MapReduce framework exhibits shorter turnaround times for short RNA sequences. 
However, as the lengths of the RNA sequences increase, the fine-grained mapping can surpass the coarse-grained mapping in performance. CONCLUSIONS: By using our MapReduce framework together with statistical analysis on the accuracy retention results, we observe how the inversion-based chunking methods can outperform predictions using the whole sequence. Our chunk-based approach also enables us to predict secondary structures for very long RNA sequences, which is not feasible with traditional methods alone.
Subject(s)
Computational Biology/methods , Nucleic Acid Conformation , RNA/chemistry , Algorithms , Base Pairing , Base Sequence , Models, Molecular , Nodaviridae/genetics , RNA, Viral/chemistry , Reproducibility of Results , Sequence Inversion
ABSTRACT
G protein-coupled receptors (GPCRs) are the largest class of cell-surface receptor proteins with important functions in signal transduction and often serve as therapeutic drug targets. With the rapidly growing public data on three-dimensional (3D) structures of GPCRs and GPCR-ligand interactions, computational prediction of GPCR ligand binding becomes a compelling alternative to high-throughput screening and other experimental approaches during the early phases of ligand discovery. In this work, we set out to computationally uncover and understand the binding of a single ligand to GPCRs from several different families. Three-dimensional structural comparisons of the GPCRs that bind to the same ligand revealed local 3D structural similarities, and these regions often overlap with the locations of binding pockets. These pockets were found to be similar (based on backbone geometry and side-chain orientation using APoc), and their similarity correlates positively with the electrostatic properties of the pockets. Moreover, the more similar the pockets, the more likely a ligand binding to the pockets will interact with similar residues, have similar conformations, and produce similar binding affinities across the pockets. These findings can be exploited to improve protein function inference, drug repurposing and drug toxicity prediction, and accelerate the development of new drugs.
Subject(s)
Receptors, G-Protein-Coupled , Binding Sites , Humans , Ligands , Models, Molecular , Molecular Conformation , Protein Binding , Protein Conformation , Receptors, G-Protein-Coupled/metabolism
ABSTRACT
HLA-A*0201-restricted virus-specific CD8(+) CTL do not appear to control HIV effectively in vivo. To enhance the immunogenicity of a highly conserved subdominant epitope, TV9 (TLNAWVKVV, p24 Gag(19-27)), mimotopes were designed by screening a large combinatorial nonapeptide library with TV9-specific CTL primed in vitro from healthy donors. A mimic peptide with a low binding affinity to HLA-A*0201, TV9p6 (KINAWIKVV), was studied further. Parallel cultures of in vitro-primed CTL showed that TV9p6 consistently activated cross-reactive and equally functional CTL as measured by cytotoxicity, cytokine production and suppression of HIV replication in vitro. Comparison of TCRB gene usage between CTL primed from the same donors with TV9 or TV9p6 revealed a degree of clonal overlap in some cases and an example of a conserved TCRB sequence encoded distinctly at the nucleotide level between individuals (a "public" TCR); however, in the main, distinct clonotypes were recruited by each peptide antigen. These findings indicate that mimotopes can mobilize functional cross-reactive clonotypes that are less readily recruited from the naïve T-cell pool by the corresponding WT epitope. Mimotope-induced repertoire diversification could potentially override subdominance under certain circumstances and enhance vaccine-induced responses to conserved but poorly immunogenic determinants within the HIV proteome.
Subject(s)
AIDS Vaccines , CD8-Positive T-Lymphocytes/metabolism , DNA/analysis , HIV-1/immunology , Receptors, Antigen, T-Cell, alpha-beta/genetics , CD8-Positive T-Lymphocytes/immunology , CD8-Positive T-Lymphocytes/pathology , Cell Proliferation , Clone Cells , Conserved Sequence/genetics , Epitope Mapping , HIV Core Protein p24/chemistry , HIV Core Protein p24/metabolism , HLA-A Antigens/metabolism , HLA-A2 Antigen , Humans , Peptide Fragments/chemistry , Peptide Fragments/metabolism , Peptide Library , Protein Binding
ABSTRACT
Pseudoknots are recognized as an important type of RNA secondary structure responsible for many biological functions. PseudoBase, a widely used database of pseudoknot secondary structures developed at Leiden University, contains over 250 records of pseudoknots obtained in the past 25 years through crystallography, NMR, mutational experiments and sequence comparisons. To promptly address the growing analysis requests of researchers on RNA structures and bring together information from multiple sources across the Internet on a single platform, we designed and implemented PseudoBase++, an extension of PseudoBase for easy searching, formatting and visualization of pseudoknots. PseudoBase++ (http://pseudobaseplusplus.utep.edu) maps the PseudoBase dataset into a searchable relational database including additional functionalities such as pseudoknot type. PseudoBase++ links each pseudoknot in PseudoBase to the GenBank record of the corresponding nucleotide sequence and allows scientists to automatically visualize RNA secondary structures with PseudoViewer. It also includes the capabilities of fine-grained reference searching and collecting new pseudoknot information.
Subject(s)
Databases, Nucleic Acid , RNA/chemistry , Base Pairing , Computer Graphics , Systems Integration
ABSTRACT
Colorectal cancer (CRC) is the third most common cancer that contributes to cancer-related morbidity. However, the differential expression of genes in different phases of CRC is largely unknown. Moreover, very little is known about the role of stress-survival pathways in CRC. We sought to discover the hub genes and identify their roles in several key pathways, including oxidative stress and apoptosis, in the different stages of CRC. To identify the hub genes that may be involved in the different stages of CRC, gene expression datasets were obtained from the Gene Expression Omnibus (GEO) database. The differentially expressed genes (DEGs) common among the different datasets for each group were obtained using the robust rank aggregation method. Then, gene enrichment analysis was carried out with the Gene Ontology and Kyoto Encyclopedia of Genes and Genomes databases. Finally, the protein-protein interaction networks were constructed using the Cytoscape software. We identified 40 hub genes and performed enrichment analysis for each group. We also used the Oncomine database to identify the DEGs related to stress-survival and apoptosis pathways involved in different stages of CRC. In conclusion, the hub genes were found to be enriched in several key pathways, including the cell cycle and p53 signaling pathway. Some of the hub genes were also reported in the stress-survival and apoptosis pathways. The hub DEGs revealed by our study may be used as biomarkers and may explain CRC development and progression mechanisms.
Subject(s)
Colorectal Neoplasms , Computational Biology , Biomarkers, Tumor , Colorectal Neoplasms/genetics , Gene Expression Profiling , Gene Expression Regulation, Neoplastic , Gene Ontology , Humans
ABSTRACT
G protein-coupled receptors (GPCRs) constitute the largest group of membrane receptor proteins in eukaryotes. Due to their significant roles in various physiological processes such as vision, smell and inflammation, GPCRs are the targets of many prescription drugs. However, the functional and sequence diversity of GPCRs has kept their prediction and classification based on amino acid sequence data a challenging bioinformatics problem. There are existing computational approaches, mainly using machine learning and statistical methods, to predict and classify GPCRs based on amino acid sequence and sequence-derived features. In this paper, we describe a searchable MySQL database, named GPCR-PEnDB (GPCR Prediction Ensemble Database), of confirmed GPCRs and non-GPCRs. It was constructed with the goal of allowing users to conveniently access useful information on GPCRs in a wide range of organisms and to compile reliable training and testing datasets for different combinations of computational tools. This database currently contains 3129 confirmed GPCR and 3575 non-GPCR sequences collected from the UniProtKB/Swiss-Prot protein database, encompassing over 1200 species. The non-GPCR entries include transmembrane proteins for evaluating various prediction programs' abilities to distinguish GPCRs from other transmembrane proteins. Each protein is linked to information about its source organism, classification, sequence lengths and composition, and other derived sequence features. We present examples of using this database, along with its graphical user interface, to query for GPCRs with specific sequence properties and to compare the accuracies of five tools for GPCR prediction. This initial version of GPCR-PEnDB will provide a framework for future extensions to include additional sequence and feature data to facilitate the design and assessment of software tools and experimental studies to help understand the functional roles of GPCRs. Database URL: gpcr.utep.edu/database.
Subject(s)
Algorithms , Sequence Analysis, Protein , Amino Acid Sequence , Databases, Protein , Receptors, G-Protein-Coupled/genetics
ABSTRACT
As ribonucleic acid (RNA) molecules play important roles in many biological processes including gene expression and regulation, their secondary structures have been the focus of many recent studies. Despite the computing power of supercomputers, computationally predicting secondary structures with thermodynamic methods is still not feasible when the RNA molecules have long nucleotide sequences and include complex motifs such as pseudoknots. This paper presents RNAVLab (RNA Virtual Laboratory), a virtual laboratory for studying RNA secondary structures including pseudoknots that allows scientists to address this challenge. Two important case studies show the versatility and functionalities of RNAVLab. The first study quantifies its capability to rebuild longer secondary structures from motifs found in systematically sampled nucleotide segments. The extensive sampling and predictions are made feasible in a short turnaround time because of the grid technology used. The second study shows how RNAVLab allows scientists to study the viral RNA genome replication mechanisms used by members of the virus family Nodaviridae.
ABSTRACT
BACKGROUND: Replication origins are considered important sites for understanding the molecular mechanisms involved in DNA replication. Many computational methods have been developed for predicting their locations in archaeal, bacterial and eukaryotic genomes. However, a prediction method designed for a particular kind of genome might not work well for another. In this paper, we propose the AT excursion method, which is a score-based approach, to quantify local AT abundance in genomic sequences and use the identified high scoring segments for predicting replication origins. This method has the advantages of requiring no preset window size and having rigorous criteria to evaluate the statistical significance of high scoring segments. RESULTS: We have evaluated the AT excursion method by checking its predictions against known replication origins in herpesviruses and comparing its performance with an existing base weighted score method (BWS1). Out of 43 known origins, 39 are predicted by either one or the other method and 26 origins are predicted by both. The excursion method identifies six origins not predicted by BWS1, showing that the AT excursion method is a valuable complement to BWS1. We have also applied the AT excursion method to two other families of double-stranded DNA viruses, the poxviruses and iridoviruses, of which very few replication origins are documented in the public domain. The prediction results are made available as supplementary materials at 1. Preliminary investigation shows that the proposed method works well on some larger genomes too. CONCLUSION: The AT excursion method will be a useful computational tool for identifying replication origins in a variety of genomic sequences.
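The score-based idea can be sketched as a cumulative score that rises on A/T, falls on G/C, and is reset whenever it would drop below zero; each excursion then yields a candidate AT-rich segment running from its start to its peak. The scores `+1` and `-2`, the function name, and the return format are assumptions for illustration, and the statistical significance test mentioned in the abstract is not reproduced here.

```python
def at_excursions(seq, at_score=1, gc_score=-2):
    """Return a list of (peak_value, start, end) tuples, end exclusive.

    Each tuple describes one excursion of the cumulative score above zero;
    seq[start:end] is the segment from the excursion start to its peak.
    The default scores are illustrative and chosen so that G/C is penalized.
    """
    excursions = []
    total, start, peak, peak_end = 0, None, 0, None
    for i, base in enumerate(seq.upper()):
        s = at_score if base in "AT" else gc_score
        if total == 0 and s > 0:
            start = i                      # a new excursion begins here
        total = max(total + s, 0)          # never let the score go negative
        if total > peak:
            peak, peak_end = total, i + 1  # remember where the peak occurs
        if total == 0 and start is not None:
            excursions.append((peak, start, peak_end))
            start, peak, peak_end = None, 0, None
    if start is not None:                  # excursion still open at the end
        excursions.append((peak, start, peak_end))
    return excursions
```

Because segments are delimited by the score itself, no window size has to be chosen in advance, which is the property the abstract highlights.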
Subject(s)
AT Rich Sequence/genetics , Algorithms , Chromosome Mapping/methods , Genome, Viral/genetics , Herpesviridae/genetics , Replication Origin/genetics , Sequence Analysis, DNA/methods , Base Sequence , Molecular Sequence Data
ABSTRACT
Many empirical studies show that there are unusual clusters of palindromes, closely spaced direct and inverted repeats around the replication origins of herpesviruses. In this paper, we introduce two new scoring schemes to quantify the spatial abundance of palindromes in a genomic sequence. Based on these scoring schemes, a computational method to predict the locations of replication origins is developed. When our predictions are compared with 39 known or annotated replication origins in 19 herpesviruses, close to 80% of the replication origins are located within 2% of the genome length. A list of predicted locations of replication origins in all the known herpesviruses with complete genome sequences is reported.
Subject(s)
Computational Biology/methods , Genome, Viral , Genomics/methods , Herpesviridae/genetics , Replication Origin , Algorithms , DNA, Viral/chemistry , Data Interpretation, Statistical , Reproducibility of Results
ABSTRACT
The cattle tick of Australia, Rhipicephalus australis, is a vector for microbial parasites that cause serious bovine diseases. The Haller's organ, located in the tick's forelegs, is crucial for host detection and mating. To facilitate the development of new technologies for better control of this agricultural pest, we aimed to sequence and annotate the transcriptome of the R. australis forelegs and associated tissues, including the Haller's organ. As G protein-coupled receptors (GPCRs) are an important family of eukaryotic proteins studied as pharmaceutical targets in humans, we prioritized the identification and classification of the GPCRs expressed in the foreleg tissues. The two forelegs from adult R. australis were excised, and RNA was extracted and pyrosequenced with 454 technology. Reads were assembled into unigenes and annotated by sequence similarity. Python scripts were written to find open reading frames (ORFs) in each unigene. These ORFs were analyzed by different GPCR prediction approaches based on sequence alignments, support vector machines, hidden Markov models, and principal component analysis. GPCRs consistently predicted by multiple methods were further studied by phylogenetic analysis and 3D homology modeling. From 4,782 assembled unigenes, 40,907 possible ORFs were predicted. Using Blastp, Pfam, GPCRpred, TMHMM, and PCA-GPCR, a basic set of 46 GPCR candidates was compiled and a phylogenetic tree was constructed. With further screening of tertiary structures predicted by RaptorX, 6 likely GPCRs emerged and the strongest candidate was classified by PCA-GPCR to be a GABAB receptor.
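The ORF-finding step can be illustrated with a small six-frame scan. This is a minimal sketch, not the authors' scripts: it assumes an ORF runs from the first ATG in a frame to the next in-frame stop codon, and the names `find_orfs`, `orfs_in_strand`, and `min_codons` are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def orfs_in_strand(seq, min_codons=2):
    """Yield ATG...stop ORFs (stop codon included) in the three frames of seq."""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and start is None:
                start = i                  # first start codon of this ORF
            elif codon in STOPS and start is not None:
                # Count the coding codons before the stop codon.
                if (i - start) // 3 >= min_codons:
                    yield seq[start:i + 3]
                start = None

def find_orfs(seq, min_codons=2):
    """Return ORFs from all six reading frames (forward strand first)."""
    seq = seq.upper()
    return (list(orfs_in_strand(seq, min_codons))
            + list(orfs_in_strand(reverse_complement(seq), min_codons)))
```

Other ORF definitions (for example stop-to-stop, or allowing ORFs that run off the end of a partial transcript) would give different counts, so the threshold and boundaries here should be read as one possible choice.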
Subject(s)
Cattle/parasitology , Receptors, G-Protein-Coupled/genetics , Rhipicephalus/genetics , Transcriptome , Animals , Female , Genomics/methods , Male , Open Reading Frames , Phylogeny , RNA/genetics , RNA/isolation & purification
ABSTRACT
A total of 750 images of individual ultra-high molecular weight polyethylene (UHMWPE) particles isolated from periprosthetic failed hip, knee, and shoulder arthroplasties were extracted from archival scanning electron micrographs. Particle size and morphology were subsequently analyzed using computerized image analysis software utilizing five descriptors found in ASTM F1877-98, a standard for quantitative description of wear debris. An online survey application was developed to display particle images, and allowed ten respondents to classify particle morphologies according to commonly used terminology as fibers, flakes, or granules. Particles were categorized based on a simple majority of responses. All descriptors were evaluated using a one-way ANOVA and Tukey-Kramer test for all-pairs comparison among each class of particles. A logistic regression model using half of the particles included in the survey was then used to develop a mathematical scheme to predict whether a given particle should be classified as a fiber, flake, or granule based on its quantitative measurements. The validity of the model was then assessed using the other half of the survey particles and compared with human responses. Comparison of the quantitative measurements of isolated particles showed that the morphologies of each particle type classified by respondents were statistically different from one another (p < 0.05). The average agreement between mathematical prediction and human respondents was 83.5% (standard error 0.16%). These data suggest that computerized descriptors can be feasibly correlated with subjective terminology, thus providing a basis for a common vocabulary for particle description which can be translated into quantitative dimensions.
Subject(s)
Joint Prosthesis , Polyethylenes/chemistry , Particle Size , Polyethylenes/classification
ABSTRACT
The cattle tick, Rhipicephalus (Boophilus) microplus, is a pest which causes multiple health complications in cattle. The G protein-coupled receptor (GPCR) super-family presents a candidate target for developing novel tick control methods. However, GPCRs share limited sequence similarity among orthologous family members, and there is no reference genome available for R. microplus. This limits the effectiveness of alignment-dependent methods such as BLAST and Pfam for identifying GPCRs from R. microplus. However, GPCRs share a common structure consisting of seven transmembrane helices. We present an analysis of the R. microplus synganglion transcriptome using a combination of structurally-based and alignment-free methods which supplement the identification of GPCRs by sequence similarity. TMHMM predicts the number of transmembrane helices in a protein sequence. GPCRpred is a support vector machine-based method developed to predict and classify GPCRs using the dipeptide composition of a query amino acid sequence. These two bioinformatic tools were applied to our transcriptome assembly of the cattle tick synganglion. Together, BLAST and Pfam identified 85 unique contigs as encoding partial or full length candidate cattle tick GPCRs. Collectively, TMHMM and GPCRpred identified 27 additional GPCR candidates that BLAST and Pfam missed. This demonstrates that the addition of structurally-based and alignment-free bioinformatic approaches to transcriptome annotation and analysis produces a greater collection of prospective GPCRs than an analysis based solely upon methodologies dependent upon sequence alignment and similarity.
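Dipeptide composition, the alignment-free feature mentioned above, can be computed directly from a protein sequence. The sketch below returns a 400-element vector of fractions; the function name and the choice of fractions rather than percentages are assumptions for this example and do not reflect GPCRpred's internal conventions.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs of the 20 standard amino acids, in a fixed order.
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]

def dipeptide_composition(protein):
    """Return a 400-element list with the fraction of each overlapping dipeptide.

    Pairs containing non-standard residues (e.g. X) are skipped.
    """
    protein = protein.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(len(protein) - 1):
        pair = protein[i:i + 2]
        if pair in counts:
            counts[pair] += 1
    total = sum(counts.values())
    return [counts[d] / total if total else 0.0 for d in DIPEPTIDES]
```

A fixed-length vector like this can be fed to a classifier such as a support vector machine regardless of sequence length, which is why it needs no alignment.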
Subject(s)
Arthropod Proteins/genetics , Computational Biology/methods , Ganglion/genetics , Receptors, G-Protein-Coupled/genetics , Rhipicephalus/genetics , Transcriptome , Animals , Arthropod Proteins/chemistry , Prospective Studies , Protein Conformation , Receptors, G-Protein-Coupled/chemistry , Sequence Analysis, DNA , Sequence Homology
ABSTRACT
Palindromes are symmetrical words of DNA in the sense that they read exactly the same as their reverse complementary sequences. Representing the occurrences of palindromes in a DNA molecule as points on the unit interval, the scan statistics can be used to identify regions of unusually high concentration of palindromes. These regions have been associated with the replication origins on a few herpesviruses in previous studies. However, the use of scan statistics requires the assumption that the points representing the palindromes are independently and uniformly distributed on the unit interval. In this paper, we provide a mathematical basis for this assumption by showing that in randomly generated DNA sequences, the occurrences of palindromes can be approximated by a Poisson process. An easily computable upper bound on the Wasserstein distance between the palindrome process and the Poisson process is obtained. This bound is then used as a guide to choose an optimal palindrome length in the analysis of a collection of 16 herpesvirus genomes. Regions harboring significant palindrome clusters are identified and compared to known locations of replication origins. This analysis brings out a few interesting extensions of the scan statistics that can help formulate an algorithm for more accurate prediction of replication origins.
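The definition of a DNA palindrome and its representation as points on the unit interval can be sketched as follows. The function names and the decision to scale each start index by the sequence length are assumptions for illustration; the abstract does not specify how positions are mapped.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def is_palindrome(word):
    """True if word equals its own reverse complement (e.g. GAATTC)."""
    if not word or set(word) - set("ACGT"):
        return False                       # reject ambiguous bases such as N
    return word == word.translate(COMPLEMENT)[::-1]

def palindrome_starts(seq, length):
    """Return the start indices of all palindromes of the given length."""
    seq = seq.upper()
    return [i for i in range(len(seq) - length + 1)
            if is_palindrome(seq[i:i + length])]

def palindrome_points(seq, length):
    """Map palindrome start indices to points in the unit interval [0, 1)."""
    n = len(seq)
    return [i / n for i in palindrome_starts(seq, length)]
```

For the sequence `GGAATTCC`, the length-6 palindrome `GAATTC` starts at index 1 and maps to the point 0.125. A scan statistic would then look for windows of the unit interval containing unusually many such points.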
Subject(s)
Genome, Viral , Herpesviridae/genetics , Repetitive Sequences, Nucleic Acid , Data Interpretation, Statistical , Humans , Sequence Analysis, DNA
ABSTRACT
Secondary structures of ribonucleic acid (RNA) molecules play important roles in many biological processes including gene expression and regulation. Experimental observations and computing limitations suggest that we can approach the secondary structure prediction problem for long RNA sequences by segmenting them into shorter chunks, predicting the secondary structures of each chunk individually using existing prediction programs, and then assembling the results to give the structure of the original sequence. The selection of cutting points is a crucial component of the segmenting step. Noting that stem-loops and pseudoknots always contain an inversion, i.e., a stretch of nucleotides followed closely by its inverse complementary sequence, we developed two cutting methods for segmenting long RNA sequences based on inversion excursions: the centered and the optimized methods. Each step of searching for inversions, chunking, and predictions can be performed in parallel. In this paper we use a MapReduce framework, i.e., Hadoop, to extensively explore meaningful inversion stem lengths and gap sizes for the segmentation and identify correlations between chunking methods and prediction accuracy. We show that for a set of long RNA sequences in the RFAM database, whose secondary structures are known to contain pseudoknots, our approach predicts secondary structures more accurately than methods that do not segment the sequence, when the latter predictions are possible computationally. We also show that, as sequences exceed certain lengths, some programs cannot computationally predict pseudoknots while our chunking methods can. Overall, our predicted structures still retain the accuracy level of the original prediction programs when compared with known experimental secondary structures.
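An inversion as defined above, a stretch of nucleotides followed within a short gap by its inverse complement, can be located with a direct scan over stem length and gap size. This is a simplified sketch under stated assumptions: only Watson-Crick pairs are matched (G-U wobble pairs are ignored), the excursion scoring used for the centered and optimized cutting methods is not reproduced, and `find_inversions` is a hypothetical name.

```python
RNA_COMPLEMENT = str.maketrans("ACGU", "UGCA")

def find_inversions(seq, stem, max_gap):
    """Return (left_start, gap) pairs for every inversion found in seq.

    An inversion is seq[left_start:left_start + stem] followed, after `gap`
    unpaired nucleotides (0 <= gap <= max_gap), by its reverse complement.
    """
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - 2 * stem + 1):
        left = seq[i:i + stem]
        target = left.translate(RNA_COMPLEMENT)[::-1]
        for gap in range(max_gap + 1):
            j = i + stem + gap             # start of the candidate right arm
            if j + stem > len(seq):
                break
            if seq[j:j + stem] == target:
                hits.append((i, gap))
    return hits
```

Each stem length and gap size can be searched independently, which is what makes this step easy to distribute across map tasks.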
ABSTRACT
BACKGROUND: Logic minimization is the application of algebraic axioms to a binary dataset with the purpose of reducing the number of digital variables and/or rules needed to express it. Although logic minimization techniques have been applied to bioinformatics datasets before, they have not been used in classification and rule discovery problems. In this paper, we propose a method based on logic minimization to extract predictive rules for two bioinformatics problems involving the identification of functional sites in molecular sequences: transcription factor binding sites (TFBS) in DNA and O-glycosylation sites in proteins. TFBS are important in various developmental processes and glycosylation is a posttranslational modification critical to protein functions. METHODS: In the present study, we first transformed the original biological dataset into a suitable binary form. Logic minimization was then applied to generate sets of simple rules to describe the transformed dataset. These rules were used to predict TFBS and O-glycosylation sites. The TFBS dataset was obtained from the TRANSFAC database, while the glycosylation dataset was compiled using information from OGLYCBASE and the Swiss-Prot Database. We performed the same predictions using two standard classification techniques, Artificial Neural Networks (ANN) and Support Vector Machines (SVM), and used their sensitivities and positive predictive values as benchmarks for the performance of our proposed algorithm. SVM were also used to reduce the number of variables included in the logic minimization approach. RESULTS: For both TFBS and O-glycosylation sites, the prediction performance of the proposed logic minimization method was generally comparable and, in some cases, superior to the standard ANN and SVM classification methods with the advantage of providing intelligible rules to describe the datasets. In TFBS prediction, logic minimization produced a very small set of simple rules.
In glycosylation site prediction, the rules produced were also interpretable and the most popular rules generated appeared to correlate well with recently reported hydrophilic/hydrophobic enhancement values of amino acids around possible O-glycosylation sites. Experiments with Self-Organizing Neural Networks corroborate the practical worth of the logic minimization method for these case studies. CONCLUSIONS: The proposed logic minimization algorithm provides sets of rules that can be used to predict TFBS and O-glycosylation sites with sensitivity and positive predictive value comparable to those from ANN and SVM. Moreover, the logic minimization method has the additional capability of generating interpretable rules that allow biological scientists to correlate the predictions with other experimental results and to form new hypotheses for further investigation. Additional experiments with alternative rule-extraction techniques demonstrate that the logic minimization method is able to produce accurate rules from datasets with large numbers of variables and limited numbers of positive examples.
ABSTRACT
In this paper, we present a dynamic programming algorithm that runs in polynomial time and allows us to achieve the optimal, non-overlapping segmentation of a long RNA sequence into segments (chunks). The secondary structure of each chunk is predicted independently, then combined with the structures predicted for the other chunks, to generate a complete secondary structure prediction that is thus a combination of local energy minima. The proposed approach not only is more efficient and accurate than other traditionally used methods that are based on global energy minimizations, but it also allows scientists to overcome computing and storage constraints when trying to predict the secondary structure of long RNA sequences.
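The shape of such a polynomial-time dynamic program can be sketched as follows. The abstract does not state the objective being optimized, so the `score(start, end)` callback, the chunk length bounds, and the function name are all hypothetical placeholders; the sketch only shows how an optimal non-overlapping segmentation is built from optimal prefixes.

```python
def optimal_segmentation(n, score, min_len, max_len):
    """Split positions 0..n into consecutive chunks maximizing the total score.

    score(start, end) is a user-supplied (hypothetical) quality measure for
    the chunk covering [start, end). Returns (best_total, [(start, end), ...]),
    or None if no segmentation satisfies the length bounds.
    Runs in O(n * max_len) calls to score.
    """
    neg_inf = float("-inf")
    best = [neg_inf] * (n + 1)   # best[e]: best total score for prefix [0, e)
    back = [None] * (n + 1)      # back[e]: start of the last chunk in that optimum
    best[0] = 0.0
    for end in range(1, n + 1):
        for length in range(min_len, min(max_len, end) + 1):
            start = end - length
            if best[start] == neg_inf:
                continue         # prefix [0, start) cannot be segmented
            candidate = best[start] + score(start, end)
            if candidate > best[end]:
                best[end], back[end] = candidate, start
    if best[n] == neg_inf:
        return None
    chunks, end = [], n
    while end > 0:               # walk the back-pointers to recover the chunks
        chunks.append((back[end], end))
        end = back[end]
    return best[n], chunks[::-1]
```

With a toy score such as `lambda s, e: -((e - s) - 3) ** 2`, a sequence of length 9 is cut into three chunks of length 3. In practice the score would reflect structural information about each candidate chunk.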
ABSTRACT
Nodamura virus (NoV; family Nodaviridae) contains a bipartite positive-strand RNA genome that replicates via negative-strand intermediates. The specific structural and sequence determinants for initiation of nodavirus RNA replication have not yet been identified. For the related nodavirus Flock House virus (FHV) undefined sequences within the 3'-terminal 50 nucleotides (nt) of FHV RNA2 are essential for its replication. We previously showed that a conserved stem-loop structure (3'SL) is predicted to form near the 3' end of the RNA2 segments of seven nodaviruses, including NoV. We hypothesized that the 3'SL structure from NoV RNA2 is an essential cis-acting element for RNA replication. To determine whether the structure can actually form within RNA2, we analyzed the secondary structure of NoV RNA2 in vitro transcripts using nuclease mapping. The resulting nuclease maps were 86% consistent with the predicted 3'SL structure, suggesting that it can form in solution. We used a well-defined reverse genetic system for launch of NoV replication in yeast cells to test the function of the 3'SL in the viral life cycle. Deletion of the nucleotides that comprise the 3'SL from a NoV2-GFP chimeric replicon resulted in a severe defect in RNA2 replication. A minimal replicon containing the 5'-terminal 17 nt and the 3'-terminal 54 nt of RNA2 (including the predicted 3'SL) retained the ability to replicate in yeast, suggesting that this region is able to direct replication of a heterologous mRNA. These data suggest that the 3'SL plays an essential role in replication of NoV RNA2. The conservation of the predicted 3'SL suggests that this common motif may play a role in RNA replication for the other members of the Nodaviridae.