ABSTRACT
MOTIVATION: Machine-learning-based prediction of compound-protein interactions (CPIs) is important for drug design, screening and repurposing. Despite numerous recent publications with increasing methodological sophistication claiming consistent improvements in predictive accuracy, we have observed a number of fundamental issues in experiment design that produce overoptimistic estimates of model performance. RESULTS: We systematically analyze the impact of several factors affecting generalization performance of CPI predictors that are overlooked in existing work: (i) similarity between training and test examples in cross-validation; (ii) synthesizing negative examples in the absence of experimentally verified negative examples; and (iii) alignment of evaluation protocol and performance metrics with real-world use of CPI predictors in screening large compound libraries. Using both state-of-the-art approaches by other researchers as well as a simple kernel-based baseline, we have found that effective assessment of generalization performance of CPI predictors requires careful control over similarity between training and test examples. We show that, under stringent performance assessment protocols, a simple kernel-based approach can exceed the predictive performance of existing state-of-the-art methods. We also show that random pairing for generating synthetic negative examples for training and performance evaluation results in models with better generalization in comparison to more sophisticated strategies used in existing studies. Our analyses indicate that using proposed experiment design strategies can offer significant improvements for CPI prediction leading to effective target compound screening for drug repurposing and discovery of putative chemical ligands of SARS-CoV-2-Spike and Human-ACE2 proteins. AVAILABILITY AND IMPLEMENTATION: Code and supplementary material available at https://github.com/adibayaseen/HKRCPI.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
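The random-pairing strategy for synthetic negatives mentioned in the abstract above can be illustrated with a minimal sketch. This is not the authors' code: the function name `sample_random_negatives`, the toy compound and protein identifiers, and the choice to sample only from entities that appear in the positive set are all assumptions made for illustration.

```python
import random

def sample_random_negatives(positives, n_negatives, seed=0):
    """Draw compound-protein pairs uniformly at random, skipping known positives.

    Hypothetical helper: entities are taken from the positive pairs themselves.
    """
    rng = random.Random(seed)
    positive_set = set(positives)
    compounds = sorted({c for c, _ in positives})
    proteins = sorted({p for _, p in positives})
    # Number of pairs that are not known positives; guards against an endless loop.
    max_possible = len(compounds) * len(proteins) - len(positive_set)
    if n_negatives > max_possible:
        raise ValueError("not enough unlabeled pairs to sample from")
    negatives = set()
    while len(negatives) < n_negatives:
        pair = (rng.choice(compounds), rng.choice(proteins))
        if pair not in positive_set:
            negatives.add(pair)
    return sorted(negatives)

# Toy data (made up): three known interacting pairs.
positives = [("aspirin", "P1"), ("ibuprofen", "P2"), ("caffeine", "P3")]
negatives = sample_random_negatives(positives, n_negatives=3)
```

Pairs drawn this way are unlabeled rather than experimentally verified non-binders, so some may be true interactions; the abstract's claim concerns how this simple strategy compares with more elaborate ones, not that the sampled pairs are confirmed negatives.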
Subject(s)
Angiotensin-Converting Enzyme 2; Machine Learning; Humans; Ligands; SARS-CoV-2
ABSTRACT
Deep learning has demonstrated its predictive power in modeling complex biological phenomena such as gene expression. The value of these models hinges not only on their accuracy, but also on the ability to extract biologically relevant information from the trained models. While there has been much recent work on developing feature attribution methods that discover the most important features for a given sequence, inferring cooperativity between regulatory elements, which is the hallmark of phenomena such as gene expression, remains an open problem. We present SATORI, a Self-ATtentiOn based model to detect Regulatory element Interactions. Our approach combines convolutional layers with a self-attention mechanism that helps us capture a global view of the landscape of interactions between regulatory elements in a sequence. A comprehensive evaluation demonstrates the ability of SATORI to identify numerous statistically significant TF-TF interactions, many of which have been previously reported. Our method is able to detect higher numbers of experimentally verified TF-TF interactions than existing methods, and has the advantage of not requiring a computationally expensive post-processing step. Finally, SATORI can be used for detection of any type of feature interaction in models that use a similar attention mechanism, and is not limited to the detection of TF-TF interactions.
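To make the self-attention idea in the abstract above concrete, here is a minimal sketch of single-head scaled dot-product self-attention over a short list of position vectors. It is a generic textbook form, not SATORI's architecture: the learned query, key and value projections are omitted (identity projections are assumed), and the toy `features` stand in for convolutional-layer outputs.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(features):
    """Scaled dot-product self-attention with identity projections (assumption).

    features: list of equal-length vectors, one per sequence position.
    Returns (outputs, weights); weights[i][j] is how much position i attends to j.
    """
    d = len(features[0])
    scores = [
        [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in features]
        for q in features
    ]
    weights = [softmax(row) for row in scores]
    outputs = [
        [sum(w * v[t] for w, v in zip(row, features)) for t in range(d)]
        for row in weights
    ]
    return outputs, weights

# Toy features (made up): positions 0 and 1 are similar, position 2 differs.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = self_attention(features)
```

The `weights` matrix is the object of interest for interaction detection in an approach of this kind: a large entry links two positions, which can then be related to the regulatory elements detected at those positions.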
Subject(s)
Deep Learning; Genomics/methods; Regulatory Elements, Transcriptional; Transcription Factors/metabolism; Arabidopsis/genetics; Cell Line; Chromatin Immunoprecipitation Sequencing; Humans; Nucleotide Motifs; Promoter Regions, Genetic
ABSTRACT
BACKGROUND: Despite recent progress in basecalling of Oxford Nanopore DNA sequencing data, its wide adoption is still being hampered by its relatively low accuracy compared to short-read technologies. Furthermore, very little of the recent research has focused on basecalling of RNA data, which has different characteristics than its DNA counterpart. RESULTS: We fill this gap by benchmarking a fully convolutional deep learning basecalling architecture with improved performance compared to Oxford Nanopore's RNA basecallers. AVAILABILITY: The source code for our basecaller is available at https://github.com/biodlab/RODAN.
Subject(s)
Nanopore Sequencing; Nanopores; DNA; High-Throughput Nucleotide Sequencing; RNA; Sequence Analysis, DNA; Sequence Analysis, RNA
ABSTRACT
MOTIVATION: Deep learning architectures have recently demonstrated their power in predicting DNA- and RNA-binding specificity. Existing methods fall into three classes: Some are based on convolutional neural networks (CNNs), others use recurrent neural networks (RNNs) and others rely on hybrid architectures combining CNNs and RNNs. However, based on existing studies the relative merit of the various architectures remains unclear. RESULTS: In this study we present a systematic exploration of deep learning architectures for predicting DNA- and RNA-binding specificity. For this purpose, we present deepRAM, an end-to-end deep learning tool that provides an implementation of a wide selection of architectures; its fully automatic model selection procedure allows us to perform a fair and unbiased comparison of deep learning architectures. We find that deeper more complex architectures provide a clear advantage with sufficient training data, and that hybrid CNN/RNN architectures outperform other methods in terms of accuracy. Our work provides guidelines that can assist the practitioner in choosing an appropriate network architecture, and provides insight on the difference between the models learned by convolutional and recurrent networks. In particular, we find that although recurrent networks improve model accuracy, this comes at the expense of a loss in the interpretability of the features learned by the model. AVAILABILITY AND IMPLEMENTATION: The source code for deepRAM is available at https://github.com/MedChaabane/deepRAM. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Deep Learning; Neural Networks, Computer; Base Sequence; DNA; RNA; Sensitivity and Specificity
ABSTRACT
Next-generation sequencing (NGS) technologies - Illumina RNA-seq, Pacific Biosciences isoform sequencing (PacBio Iso-seq), and Oxford Nanopore direct RNA sequencing (DRS) - have revealed the complexity of plant transcriptomes and their regulation at the co-/post-transcriptional level. Global analysis of mature mRNAs, transcripts from nuclear run-on assays, and nascent chromatin-bound mRNAs using short as well as full-length and single-molecule DRS reads have uncovered potential roles of different forms of RNA polymerase II during the transcription process, and the extent of co-transcriptional pre-mRNA splicing and polyadenylation. These tools have also allowed mapping of transcriptome-wide start sites in cap-containing RNAs, poly(A) site choice, poly(A) tail length, and RNA base modifications. The emerging theme from recent studies is that reprogramming of gene expression in response to developmental cues and stresses at the co-/post-transcriptional level likely plays a crucial role in eliciting appropriate responses for optimal growth and plant survival under adverse conditions. Although the mechanisms by which developmental cues and different stresses regulate co-/post-transcriptional splicing are largely unknown, a few recent studies indicate that the external cues target spliceosomal and splicing regulatory proteins to modulate alternative splicing. In this review, we provide an overview of recent discoveries on the dynamics and complexities of plant transcriptomes, mechanistic insights into splicing regulation, and discuss critical gaps in co-/post-transcriptional research that need to be addressed using diverse genomic and biochemical approaches.
Subject(s)
Plant Proteins/metabolism; Transcriptome; Alternative Splicing; Arabidopsis/genetics; Base Sequence; Chromatin/chemistry; Chromatin/metabolism; Gene Expression Profiling; Genes, Plant; Green Fluorescent Proteins/metabolism; High-Throughput Nucleotide Sequencing; Protein Isoforms; RNA Processing, Post-Transcriptional; RNA Splicing; RNA-Seq; Sequence Analysis, RNA
ABSTRACT
Drought is a major limiting factor of crop yields. In response to drought, plants reprogram their gene expression, which ultimately regulates a multitude of biochemical and physiological processes. The timing of this reprogramming and the nature of the drought-regulated genes in different genotypes are thought to confer differential tolerance to drought stress. Sorghum is a highly drought-tolerant crop and has been increasingly used as a model cereal to identify genes that confer tolerance. Also, there is considerable natural variation in resistance to drought in different sorghum genotypes. Here, we evaluated drought resistance in four genotypes to polyethylene glycol (PEG)-induced drought stress at the seedling stage and performed transcriptome analysis in seedlings of sorghum genotypes that are either drought-resistant or drought-sensitive to identify drought-regulated changes in gene expression that are unique to drought-resistant genotypes of sorghum. Our analysis revealed that about 180 genes are differentially regulated in response to drought stress only in drought-resistant genotypes and most of these (over 70%) are up-regulated in response to drought. Among these, about 70 genes are novel with no known function and the remaining are transcription factors, signaling and stress-related proteins implicated in drought tolerance in other crops. This study revealed a set of drought-regulated genes, including many genes encoding uncharacterized proteins that are associated with drought tolerance at the seedling stage.
Subject(s)
Gene Expression Profiling; Gene Expression Regulation, Plant/drug effects; Genotype; Polyethylene Glycols/pharmacology; Sorghum/metabolism; Transcription, Genetic/drug effects; Transcriptome/drug effects; Dehydration/genetics; Dehydration/metabolism; Sorghum/genetics
ABSTRACT
Uniqprimer, a software pipeline developed in Python, was deployed as a user-friendly internet tool in Rice Galaxy for comparative genome analyses to design primer sets for PCR assays capable of detecting target bacterial taxa. The pipeline was trialed with Dickeya dianthicola, a destructive broad-host-range bacterial pathogen found in most potato-growing regions. Dickeya is a highly variable genus, and some primers available to detect this genus and species exhibit common diagnostic failures. Upon uploading a selection of target and nontarget genomes, six primer sets were rapidly identified with Uniqprimer, of which two were specific and sensitive when tested with D. dianthicola. The remaining four amplified a minority of the nontarget strains tested. The two promising candidate primer sets were trialed with DNA isolated from 116 field samples from across the United States that were previously submitted for testing. D. dianthicola was detected in 41 samples, demonstrating the applicability of our detection primers and suggesting widespread occurrence of D. dianthicola in North America.
Subject(s)
Agriculture; Bacteriological Techniques; DNA Primers; Enterobacteriaceae; Solanum tuberosum; Agriculture/methods; Bacteriological Techniques/methods; DNA Primers/genetics; Enterobacteriaceae/genetics; North America; Plant Diseases/microbiology; Solanum tuberosum/microbiology
ABSTRACT
BACKGROUND: Determining protein-protein interactions and their binding affinity is important in understanding cellular biological processes, discovery and design of novel therapeutics, protein engineering, and mutagenesis studies. Due to the time and effort required in wet lab experiments, computational prediction of binding affinity from sequence or structure is an important area of research. Structure-based methods, though more accurate than sequence-based techniques, are limited in their applicability due to limited availability of protein structure data. RESULTS: In this study, we propose a novel machine learning method for predicting binding affinity that uses protein 3D structure as privileged information at training time while expecting only protein sequence information during testing. Using the method, which is based on the framework of learning using privileged information (LUPI), we have achieved improved performance over corresponding sequence-based binding affinity prediction methods that do not have access to privileged information during training. Our experiments show that with the proposed framework, which uses structure only during training, it is possible to achieve classification performance comparable to that obtained using structure-based features. Evaluation on an independent test set shows improved performance over the PPA-Pred2 method as well. CONCLUSIONS: The proposed method outperforms several baseline learners and a state-of-the-art binding affinity predictor not only in cross-validation, but also on an additional validation dataset, demonstrating the utility of the LUPI framework for problems that would benefit from classification using structure-based features. The implementation of LUPI developed for this work is expected to be useful in other areas of bioinformatics as well.
Subject(s)
Algorithms; Computational Biology/methods; Machine Learning; Proteins/metabolism; Amino Acid Sequence; Ligands; Protein Binding; Proteins/chemistry; ROC Curve; Reproducibility of Results; Support Vector Machine
ABSTRACT
BACKGROUND: Intron retention (IR) is the most prevalent form of alternative splicing in plants. IR, like other forms of alternative splicing, has an important role in increasing gene product diversity and regulating transcript functionality. Splicing is known to occur co-transcriptionally and is influenced by the speed of transcription which in turn, is affected by chromatin structure. It follows that chromatin structure may have an important role in the regulation of splicing, and there is preliminary evidence in metazoans to suggest that this is indeed the case; however, nothing is known about the role of chromatin structure in regulating IR in plants. DNase I-seq is a useful experimental tool for genome-wide interrogation of chromatin accessibility, providing information on regions of chromatin with very high likelihood of cleavage by the enzyme DNase I, known as DNase I Hypersensitive Sites (DHSs). While it is well-established that promoter regions are highly accessible and are over-represented with DHSs, not much is known about DHSs in the bodies of genes, and their relationship to splicing in general, and IR in particular. RESULTS: In this study we use publicly available DNase I-seq data in arabidopsis and rice to investigate the relationship between IR and chromatin structure. We find that IR events are highly enriched in DHSs in both species. This implies that chromatin is more open in retained introns, which is consistent with a kinetic model of the process whereby higher speeds of transcription in those regions give less time for the spliceosomal machinery to recognize and splice out those introns co-transcriptionally. The more open chromatin in IR can also be the result of regulation mediated by DNA-binding proteins. To test this, we performed an exhaustive search for footprints left by DNA-binding proteins that are associated with IR. 
We identified several hundred short sequence elements that exhibit footprints in their DNase I-seq coverage, the telltale sign for binding events of a regulatory protein, protecting its binding site from cleavage by DNase I. A highly significant fraction of those sequence elements are conserved between arabidopsis and rice, a strong indication of their functional importance. CONCLUSIONS: In this study we have established an association between IR and chromatin accessibility, and presented a mechanistic hypothesis that explains the observed association from the perspective of the co-transcriptional nature of splicing. Furthermore, we identified conserved sequence elements for DNA-binding proteins that affect splicing.
Subject(s)
Arabidopsis/genetics; Chromatin/chemistry; Introns; Oryza/genetics; Alternative Splicing; Chromatin/metabolism; DNA-Binding Proteins/metabolism; Deoxyribonuclease I; Protein Footprinting
ABSTRACT
Plant SR45 and its metazoan ortholog RNPS1 are serine/arginine-rich (SR)-like RNA binding proteins that function in splicing/postsplicing events and regulate diverse processes in eukaryotes. Interactions of SR45 with both RNAs and proteins are crucial for regulating RNA processing. However, in vivo RNA targets of SR45 are currently unclear. Using RNA immunoprecipitation followed by high-throughput sequencing, we identified over 4000 Arabidopsis thaliana RNAs that directly or indirectly associate with SR45, designated as SR45-associated RNAs (SARs). Comprehensive analyses of these SARs revealed several roles for SR45. First, SR45 associates with and regulates the expression of 30% of abscisic acid (ABA) signaling genes at the postsplicing level. Second, although most SARs are derived from intron-containing genes, surprisingly, 340 SARs are derived from intronless genes. Expression analysis of the SARs suggests that SR45 differentially regulates intronless and intron-containing SARs. Finally, we identified four overrepresented RNA motifs in SARs that likely mediate SR45's recognition of its targets. Therefore, SR45 plays an unexpected role in mRNA processing of intronless genes, and numerous ABA signaling genes are targeted for regulation at the posttranscriptional level. The diverse molecular functions of SR45 uncovered in this study are likely applicable to other species in view of its conservation across eukaryotes.
Subject(s)
Abscisic Acid/metabolism; Arabidopsis Proteins/metabolism; Arabidopsis/genetics; Plant Growth Regulators/metabolism; RNA-Binding Proteins/metabolism; Signal Transduction; Transcriptome; Arabidopsis/metabolism; Arabidopsis Proteins/genetics; Arginine/metabolism; Introns/genetics; Nucleotide Motifs; RNA Splicing; RNA, Messenger/genetics; RNA, Messenger/metabolism; RNA, Plant/genetics; RNA, Plant/metabolism; RNA-Binding Proteins/genetics; Sequence Analysis, RNA; Serine/metabolism
ABSTRACT
Many prion-forming proteins contain glutamine/asparagine (Q/N) rich domains, and there are conflicting opinions as to the role of primary sequence in their conversion to the prion form: is this phenomenon driven primarily by amino acid composition, or, as a recent computational analysis suggested, is it dependent on the presence of short sequence elements with high amyloid-forming potential? The argument for the importance of short sequence elements hinged on the relatively high accuracy obtained using a method that utilizes a collection of length-six sequence elements with known amyloid-forming potential. We weigh in on this question and demonstrate that when those sequence elements are permuted, even higher accuracy is obtained; we also propose a novel multiple-instance machine learning method that uses sequence composition alone, and achieves better accuracy than all existing prion prediction approaches. While we expect there to be elements of primary sequence that affect the process, our experiments suggest that sequence composition alone is sufficient for predicting protein sequences that are likely to form prions. A web-server for the proposed method is available at http://faculty.pieas.edu.pk/fayyaz/prank.html, and the code for reproducing our experiments is available at http://doi.org/10.5281/zenodo.167136.
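The distinction the abstract above draws between composition and primary sequence can be shown with a small sketch: amino acid composition is unchanged when residues are permuted, so any predictor built on composition alone cannot depend on residue order. The `composition` helper and the toy Q/N-rich sequence are made up for illustration; this covers only the feature computation and is not the multiple-instance method the authors propose.

```python
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def composition(sequence):
    """Return the fraction of each standard amino acid in `sequence`."""
    if not sequence:
        raise ValueError("sequence must be non-empty")
    counts = Counter(sequence)
    n = len(sequence)
    return [counts[aa] / n for aa in AMINO_ACIDS]

# Toy Q/N-rich sequence (made up, not a real prion domain).
seq = "MSDSNQGNNQQNYQQYSQNGNQQQGNNRYQG"

# Shuffle the residues with a fixed seed; order changes, composition does not.
residues = list(seq)
random.Random(0).shuffle(residues)
shuffled = "".join(residues)

original_features = composition(seq)
shuffled_features = composition(shuffled)
```

Here `original_features` and `shuffled_features` are identical vectors even though the two strings differ in residue order, which is why a composition-only model is insensitive to the arrangement of short sequence elements.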
Subject(s)
Amino Acid Sequence; Asparagine/chemistry; Computational Biology/methods; Glutamine/chemistry; Machine Learning; Prions/chemistry; Amyloid/chemistry; Humans; Prions/metabolism; Yeasts
ABSTRACT
Automated annotation of protein function is challenging. As the number of sequenced genomes rapidly grows, the overwhelming majority of protein products can only be annotated computationally. If computational predictions are to be relied upon, it is crucial that the accuracy of these methods be high. Here we report the results from the first large-scale community-based critical assessment of protein function annotation (CAFA) experiment. Fifty-four methods representing the state of the art for protein function prediction were evaluated on a target set of 866 proteins from 11 organisms. Two findings stand out: (i) today's best protein function prediction algorithms substantially outperform widely used first-generation methods, with large gains on all types of targets; and (ii) although the top methods perform well enough to guide experiments, there is considerable need for improvement of currently available tools.
Subject(s)
Computational Biology/methods; Molecular Biology/methods; Molecular Sequence Annotation; Proteins/physiology; Algorithms; Animals; Databases, Protein; Exoribonucleases/classification; Exoribonucleases/genetics; Exoribonucleases/physiology; Forecasting; Humans; Proteins/chemistry; Proteins/classification; Proteins/genetics; Species Specificity
ABSTRACT
Chimeric RNAs that comprise two or more different transcripts have been identified in many cancers and among the Expressed Sequence Tags (ESTs) isolated from different organisms; they might represent functional proteins and produce different disease phenotypes. The ChiTaRS database of Chimeric Transcripts and RNA-Sequencing data (http://chitars.bioinfo.cnio.es/) collects more than 16 000 chimeric RNAs from humans, mice and fruit flies, 233 chimeras confirmed by RNA-seq reads and ~2000 cancer breakpoints. The database indicates the expression and tissue specificity of these chimeras, as confirmed by RNA-seq data, and it includes mass spectrometry results for some human entries at their junctions. Moreover, the database has advanced features to analyze junction consistency and to rank chimeras based on the evidence of repeated junction sites. Finally, 'Junction Search' screens through the RNA-seq reads found at the chimeras' junction sites to identify putative junctions in novel sequences entered by users. Thus, ChiTaRS is an extensive catalog of human, mouse and fruit fly chimeras that will extend our understanding of the evolution of chimeric transcripts in eukaryotes and can be advantageous in the analysis of human cancer breakpoints.
Subject(s)
Databases, Genetic; Mutant Chimeric Proteins/genetics; RNA/chemistry; Animals; Chromosome Breakpoints; Computer Graphics; Drosophila/genetics; Gene Fusion; Humans; Internet; Mice; Mutant Chimeric Proteins/metabolism; Neoplasms/genetics; RNA/metabolism; Sequence Analysis, RNA
ABSTRACT
Prions are important disease agents and epigenetic regulatory elements. Prion formation involves the structural conversion of proteins from a soluble form into an insoluble amyloid form. In many cases, this structural conversion is driven by a glutamine/asparagine (Q/N)-rich prion-forming domain. However, our understanding of the sequence requirements for prion formation and propagation by Q/N-rich domains has been insufficient for accurate prion propensity prediction or prion domain design. By focusing exclusively on amino acid composition, we have developed a prion aggregation prediction algorithm (PAPA), specifically designed to predict prion propensity of Q/N-rich proteins. Here, we show not only that this algorithm is far more effective than traditional amyloid prediction algorithms at predicting prion propensity of Q/N-rich proteins, but remarkably, also that PAPA is capable of rationally designing protein domains that function as prions in vivo.
Subject(s)
Prions/chemistry; Algorithms; Amino Acid Sequence; Molecular Sequence Data; Sequence Homology, Amino Acid; Solubility
ABSTRACT
We present a novel partner-specific protein-protein interaction site prediction method called PAIRpred. Unlike most existing machine learning binding site prediction methods, PAIRpred uses information from both proteins in a protein complex to predict pairs of interacting residues from the two proteins. PAIRpred captures sequence and structure information about residue pairs through pairwise kernels that are used for training a support vector machine classifier. As a result, PAIRpred presents a more detailed model of protein binding, and offers state of the art accuracy in predicting binding sites at the protein level as well as inter-protein residue contacts at the complex level. We demonstrate PAIRpred's performance on Docking Benchmark 4.0 and recent CAPRI targets. We present a detailed performance analysis outlining the contribution of different sequence and structure features, together with a comparison to a variety of existing interface prediction techniques. We have also studied the impact of binding-associated conformational change on prediction accuracy and found PAIRpred to be more robust to such structural changes than existing schemes. As an illustration of the potential applications of PAIRpred, we provide a case study in which PAIRpred is used to analyze the nature and specificity of the interface in the interaction of human ISG15 protein with NS1 protein from influenza A virus. Python code for PAIRpred is available at http://combi.cs.colostate.edu/supplements/pairpred/.
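A pairwise kernel, as mentioned in the abstract above, compares two residue pairs rather than two single residues. The abstract does not say which pairwise kernel PAIRpred uses, so the sketch below assumes one common form, the symmetric tensor-product pairwise kernel K((a,b),(c,d)) = k(a,c)k(b,d) + k(a,d)k(b,c), with a Gaussian base kernel; the feature vectors and the `gamma` value are made up.

```python
import math

def rbf(x, y, gamma=0.5):
    """Gaussian (RBF) base kernel between two residue feature vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def pairwise_kernel(pair1, pair2, base=rbf):
    """Symmetric tensor-product pairwise kernel (an assumed form).

    Each pair is (residue_features_1, residue_features_2). The second term
    makes the value independent of the order of residues within a pair.
    """
    (a, b), (c, d) = pair1, pair2
    return base(a, c) * base(b, d) + base(a, d) * base(b, c)

# Toy residue feature vectors (made up).
r1, r2 = [0.2, 1.0], [0.8, 0.1]
r3, r4 = [0.3, 0.9], [0.7, 0.0]

k_forward = pairwise_kernel((r1, r2), (r3, r4))
k_swapped = pairwise_kernel((r2, r1), (r3, r4))  # residues swapped within the first pair
```

`k_forward` and `k_swapped` agree, so a support vector machine trained on such a kernel scores a residue pair the same way regardless of the order in which its two residues are listed.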
Subject(s)
Binding Sites; Protein Binding; Proteins/chemistry; Proteins/metabolism; Sequence Analysis, Protein/methods; Software; Computational Biology; Humans; Models, Molecular; Protein Conformation; Support Vector Machine
ABSTRACT
Disruptions in spatiotemporal gene expression can result in atypical brain function. Specifically, autism spectrum disorder (ASD) is characterized by abnormalities in pre-mRNA splicing. Abnormal splicing patterns have been identified in the brains of individuals with ASD, and mutations in splicing factors have been found to contribute to neurodevelopmental delays associated with ASD. Here we review studies that shed light on the importance of splicing observed in ASD and that explored the intricate relationship between splicing factors and ASD, revealing how disruptions in pre-mRNA splicing may underlie ASD pathogenesis. We provide an overview of the research regarding all splicing factors associated with ASD and place a special emphasis on five specific splicing factors (HNRNPH2, NOVA2, WBP4, SRRM2, and RBFOX1) known to impact the splicing of ASD-related genes. In the discussion of the molecular mechanisms influenced by these splicing factors, we lay the groundwork for a deeper understanding of ASD's complex etiology. Finally, we discuss the potential benefit of unraveling the connection between splicing and ASD for the development of more precise diagnostic tools and targeted therapeutic interventions. This article is categorized under: RNA in Disease and Development > RNA in Disease; RNA Evolution and Genomics > RNA and Ribonucleoprotein Evolution; RNA Evolution and Genomics > Computational Analyses of RNA; RNA-Based Catalysis > RNA Catalysis in Splicing and Translation.
Subject(s)
Autism Spectrum Disorder; Autistic Disorder; Humans; Autism Spectrum Disorder/genetics; Autism Spectrum Disorder/metabolism; Autistic Disorder/genetics; RNA Precursors/genetics; RNA Precursors/metabolism; RNA Splicing/genetics; RNA Splicing Factors/metabolism; Neuro-Oncological Ventral Antigen
ABSTRACT
Microbial breakdown of organic matter is one of the most important processes on Earth, yet the controls of decomposition are poorly understood. Here we track 36 terrestrial human cadavers in three locations and show that a phylogenetically distinct, interdomain microbial network assembles during decomposition despite selection effects of location, climate and season. We generated a metagenome-assembled genome library from cadaver-associated soils and integrated it with metabolomics data to identify links between taxonomy and function. This universal network of microbial decomposers is characterized by cross-feeding to metabolize labile decomposition products. The key bacterial and fungal decomposers are rare across non-decomposition environments and appear unique to the breakdown of terrestrial decaying flesh, including humans, swine, mice and cattle, with insects as likely important vectors for dispersal. The observed lockstep of microbial interactions further underlies a robust microbial forensic tool with the potential to aid predictions of the time since death.
Subject(s)
Microbial Consortia; Soil Microbiology; Mice; Humans; Animals; Swine; Cattle; Cadaver; Metagenome; Bacteria
ABSTRACT
Combining heterogeneous sources of data is essential for accurate prediction of protein function. The task is complicated by the fact that while sequence-based features can be readily compared across species, most other data are species-specific. In this paper, we present a multi-view extension to GOstruct, a structured-output framework for function annotation of proteins. The extended framework can learn from disparate data sources, with each data source provided to the framework in the form of a kernel. Our empirical results demonstrate that the multi-view framework is able to utilize all available information, yielding better performance than sequence-based models trained across species and models trained from collections of data within a given species. This version of GOstruct participated in the recent Critical Assessment of Functional Annotations (CAFA) challenge; since then we have significantly improved the natural language processing component of the method, which now provides performance that is on par with that provided by sequence information. The GOstruct framework is available for download at http://strut.sourceforge.net.
Subject(s)
Molecular Sequence Annotation; Proteins/physiology; Algorithms; Animals; Computational Biology/methods; Gene Expression; Mice; Protein Interaction Mapping; Proteins/genetics; Proteins/metabolism; Software; Vocabulary, Controlled
ABSTRACT
In Arabidopsis, pre-mRNAs of serine/arginine-rich (SR) proteins undergo extensive alternative splicing (AS). However, little is known about the cis-elements and trans-acting proteins involved in regulating AS. Using a splicing reporter (GFP-intron-GFP), consisting of the GFP coding sequence interrupted by an alternatively spliced intron of SCL33, we investigated whether cis-elements within this intron are sufficient for AS, and which SR proteins are necessary for regulated AS. Expression of the splicing reporter in protoplasts faithfully produced all splice variants from the intron, suggesting that cis-elements required for AS reside within the intron. To determine which SR proteins are responsible for AS, the splicing pattern of the GFP-intron-GFP reporter was investigated in protoplasts of three single and three double mutants of SR genes. These analyses revealed that SCL33 and a closely related paralog, SCL30a, are functionally redundant in generating specific splice variants from this intron. Furthermore, SCL33 protein bound to a conserved sequence in this intron, indicating auto-regulation of AS. Mutations in four GAAG repeats within the conserved region impaired generation of the same splice variants that are affected in the scl33 scl30a double mutant. In conclusion, we have identified the first intronic cis-element involved in AS of a plant SR gene, and elucidated a mechanism for auto-regulation of AS of this intron.
Subject(s)
Alternative Splicing; Arabidopsis Proteins/genetics; Arabidopsis/genetics; RNA Precursors/genetics; Arabidopsis/cytology; Arabidopsis/metabolism; Arginine; Base Sequence; Conserved Sequence; DNA Mutational Analysis; Genes, Reporter; Homeostasis; Introns/genetics; Molecular Sequence Data; Mutation; Protoplasts; RNA, Plant/genetics; Recombinant Proteins; Regulatory Sequences, Nucleic Acid/genetics; Sequence Alignment; Serine
ABSTRACT
MOTIVATION: Calmodulin (CaM) is a ubiquitously conserved protein that acts as a calcium sensor, and interacts with a large number of proteins. Detection of CaM binding proteins and their interaction sites experimentally requires a significant effort, so accurate methods for their prediction are important. RESULTS: We present a novel algorithm (MI-1 SVM) for binding site prediction and evaluate its performance on a set of CaM-binding proteins extracted from the Calmodulin Target Database. Our approach directly models the problem of binding site prediction as a large-margin classification problem, and is able to take into account uncertainty in binding site location. We show that the proposed algorithm performs better than the standard SVM formulation, and illustrate its ability to recover known CaM binding motifs. A highly accurate cascaded classification approach using the proposed binding site prediction method to predict CaM binding proteins in Arabidopsis thaliana is also presented. AVAILABILITY: Matlab code for training MI-1 SVM and the cascaded classification approach is available on request. CONTACT: fayyazafsar@gmail.com or asa@cs.colostate.edu.