ABSTRACT
The DynaSig-ML ('Dynamical Signatures-Machine Learning') Python package allows the efficient, user-friendly exploration of 3D dynamics-function relationships in biomolecules, using datasets of experimental measures from large numbers of sequence variants. It does so by predicting 3D structural dynamics for every variant using the Elastic Network Contact Model (ENCoM), a sequence-sensitive coarse-grained normal mode analysis model. Dynamical Signatures represent the fluctuation at every position in the biomolecule and are used as features fed into machine learning models of the user's choice. Once trained, these models can be used to predict experimental outcomes for theoretical variants. The whole pipeline can be run with just a few lines of Python and modest computational resources. The compute-intensive steps are easily parallelized in the case of either large biomolecules or vast amounts of sequence variants. As an example application, we use the DynaSig-ML package to predict the maturation efficiency of human microRNA miR-125a variants from high-throughput enzymatic assays. AVAILABILITY AND IMPLEMENTATION: DynaSig-ML is open-source software available at https://github.com/gregorpatof/dynasigml_package.
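The idea of using per-position fluctuations as features for a regression model can be sketched in a few lines. The example below is only an illustration under stated assumptions: it does not call DynaSig-ML or ENCoM, the "signatures" are random numbers standing in for computed fluctuations, the "measurements" are synthetic, and `fit_ridge`, `predict` and `true_weights` are hypothetical names introduced here. Ridge regression is used as one possible model choice, not necessarily the one the package defaults to.

```python
import random

def fit_ridge(X, y, alpha=1.0):
    """Solve (X^T X + alpha * I) w = X^T y for the weight vector w."""
    n = len(X[0])
    # Regularized normal equations, stored as an augmented matrix [A | b].
    aug = []
    for i in range(n):
        row = [sum(r[i] * r[j] for r in X) + (alpha if i == j else 0.0)
               for j in range(n)]
        row.append(sum(r[i] * t for r, t in zip(X, y)))
        aug.append(row)
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution.
    w = [0.0] * n
    for i in reversed(range(n)):
        rest = sum(aug[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (aug[i][n] - rest) / aug[i][i]
    return w

def predict(X, w):
    """Linear prediction: dot product of each feature row with w."""
    return [sum(x * wi for x, wi in zip(row, w)) for row in X]

# Synthetic stand-in data (assumption): 200 variants, 5 positions each.
rng = random.Random(0)
true_weights = [0.5, -1.0, 2.0, 0.0, 1.5]  # hypothetical ground truth
signatures = [[rng.random() for _ in true_weights] for _ in range(200)]
measurements = [sum(s * w for s, w in zip(row, true_weights)) + rng.gauss(0, 0.01)
                for row in signatures]

# Train on the first 150 variants, evaluate on the remaining 50.
weights = fit_ridge(signatures[:150], measurements[:150], alpha=1e-3)
predictions = predict(signatures[150:], weights)
rmse = (sum((p - t) ** 2 for p, t in zip(predictions, measurements[150:]))
        / len(predictions)) ** 0.5
print("learned weights:", [round(w, 2) for w in weights])
print("test RMSE:", round(rmse, 4))
```

With real data, each row would hold the fluctuation predicted at every position of one variant, and the learned weights would indicate which positions contribute most to the measured property.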
Subjects
Machine Learning, Software, Humans
ABSTRACT
The Elastic Network Contact Model (ENCoM) is a coarse-grained normal mode analysis (NMA) model unique in its all-atom sensitivity to the sequence of the studied macromolecule and thus to the effect of mutations. We adapted ENCoM to simulate the dynamics of ribonucleic acid (RNA) molecules, benchmarked its performance against other popular NMA models and used it to study the 3D structural dynamics of human microRNA miR-125a, leveraging high-throughput experimental maturation efficiency data of over 26 000 sequence variants. We also introduce a novel way of using dynamical information from NMA to train multivariate linear regression models, with the purpose of highlighting the most salient contributions of dynamics to function. ENCoM has a similar performance profile on RNA as on proteins when compared to the Anisotropic Network Model (ANM), the most widely used coarse-grained NMA model; it has the advantage in predicting large-scale motions while ANM performs better on B-factor prediction. A stringent benchmark from the miR-125a maturation dataset, in which the training set contains no sequence information in common with the testing set, reveals that ENCoM is the only tested model able to capture signal beyond the sequence. This ability translates to better predictive power on a second benchmark in which sequence features are shared between the train and test sets. When training the linear regression model using all available data, the dynamical features identified as necessary for miR-125a maturation point to known patterns but also offer new insights into the biogenesis of microRNAs. Our novel approach combining NMA with multivariate linear regression is generalizable to any macromolecule for which relatively high-throughput mutational data is available.
Subjects
MicroRNAs, Humans, MicroRNAs/chemistry, Motion, Protein Conformation, Proteins/chemistry, Linear Models
ABSTRACT
RNA-Puzzles is a collective endeavor dedicated to the advancement and improvement of RNA 3D structure prediction. With agreement from crystallographers, the RNA structures are predicted by various groups before the publication of the crystal structures. We now report the prediction of 3D structures for six RNA sequences: four nucleolytic ribozymes and two riboswitches. Systematic protocols for comparing models and crystal structures are described and analyzed. In these six puzzles, we discuss (i) the comparison between the automated web servers and human experts; (ii) the prediction of coaxial stacking; (iii) the prediction of structural details and ligand binding; (iv) the development of novel prediction methods; and (v) the potential improvements to be made. We show that correct prediction of coaxial stacking and tertiary contacts is essential for the prediction of RNA architecture, while ligand binding modes can only be predicted with low resolution and simultaneous prediction of RNA structure with accurate ligand binding still remains out of reach. All the predicted models are available for the future development of force field parameters and the improvement of comparison and assessment tools.
Subjects
Nucleotide Aptamers/chemistry, Catalytic RNA/chemistry, RNA/chemistry, Base Sequence, Ligands, Nucleic Acid Conformation, Riboswitch/genetics
ABSTRACT
MHC-I associated peptides (MAPs) play a central role in the elimination of virus-infected and neoplastic cells by CD8 T cells. However, accurately predicting the MAP repertoire remains difficult, because only a fraction of the transcriptome generates MAPs. In this study, we investigated whether codon arrangement (usage and placement) regulates MAP biogenesis. We developed an artificial neural network called Codon Arrangement MAP Predictor (CAMAP), predicting MAP presentation solely from mRNA sequences flanking the MAP-coding codons (MCCs), while excluding the MCC per se. CAMAP predictions were significantly more accurate when using original codon sequences than shuffled codon sequences which reflect amino acid usage. Furthermore, predictions were independent of mRNA expression and MAP binding affinity to MHC-I molecules and applied to several cell types and species. Combining MAP ligand scores, transcript expression level and CAMAP scores was particularly useful to increase MAP prediction accuracy. Using an in vitro assay, we showed that varying the synonymous codons in the regions flanking the MCCs (without changing the amino acid sequence) resulted in significant modulation of MAP presentation at the cell surface. Taken together, our results demonstrate the role of codon arrangement in the regulation of MAP presentation and support integration of both translational and post-translational events in predictive algorithms to ameliorate modeling of the immunopeptidome.
Subjects
Codon, Computational Biology/methods, Histocompatibility Antigens Class I, Neural Networks (Computer), Algorithms, Amino Acid Sequence, Codon/chemistry, Codon/genetics, Codon/metabolism, Histocompatibility Antigens Class I/chemistry, Histocompatibility Antigens Class I/genetics, Histocompatibility Antigens Class I/metabolism, Humans
ABSTRACT
MicroRNAs (miRNAs) are ribonucleic acids (RNAs) of ~21 nucleotides that interfere with the translation of messenger RNAs (mRNAs) and play significant roles in development and diseases. In bilaterian animals, the specificity of miRNA targeting is determined by sequence complementarity involving the seed. However, the role of the remaining nucleotides (non-seed) is only vaguely defined, impacting negatively on our ability to efficiently use miRNAs exogenously to control gene expression. Here, using reporter assays, we deciphered the role of the base pairs formed between the non-seed region and target mRNA. We used molecular modeling to reveal that this mechanism corresponds to the formation of base pairs mediated by ordered motions of the miRNA-induced silencing complex. Subsequently, we developed an algorithm based on this distinctive recognition to predict from sequence the levels of mRNA downregulation with high accuracy (r² > 0.5, P-value < 10⁻¹²). Overall, our discovery improves the design of miRNA-guide sequences used to simultaneously downregulate the expression of multiple predetermined target genes.
Subjects
Argonaute Proteins/genetics, MicroRNAs/genetics, Nucleotides/genetics, Messenger RNA/genetics, Gene Expression Regulation/genetics, Gene Silencing, Humans, Molecular Models, Nucleotides/chemistry, Protein Conformation
ABSTRACT
RNA-Puzzles is a collective experiment in blind 3D RNA structure prediction. We report here a third round of RNA-Puzzles. Five puzzles, 4, 8, 12, 13, 14, all structures of riboswitch aptamers and puzzle 7, a ribozyme structure, are included in this round of the experiment. The riboswitch structures include biological binding sites for small molecules (S-adenosyl methionine, cyclic diadenosine monophosphate, 5-amino 4-imidazole carboxamide riboside 5'-triphosphate, glutamine) and proteins (YbxF), and one set describes large conformational changes between ligand-free and ligand-bound states. The Varkud satellite ribozyme is the most recently solved structure of a known large ribozyme. All puzzles have established biological functions and require structural understanding to appreciate their molecular mechanisms. Through the use of fast-track experimental data, including multidimensional chemical mapping, and accurate prediction of RNA secondary structure, a large portion of the contacts in 3D have been predicted correctly, leading to similar topologies for the top-ranking predictions. Template-based and homology-derived predictions could predict structures to particularly high accuracies. However, achieving biological insights from de novo prediction of RNA 3D structures still depends on the size and complexity of the RNA. Blind computational predictions of RNA structures already appear to provide useful structural information in many cases. As in the previous RNA-Puzzles Round II experiment, the prediction of non-Watson-Crick interactions and the observed high atomic clash scores reveal a notable need for algorithmic improvement. All prediction models and assessment results are available at http://ahsoka.u-strasbg.fr/rnapuzzles/.
Subjects
Catalytic RNA/chemistry, Riboswitch, Aminoimidazole Carboxamide/chemistry, Aminoimidazole Carboxamide/metabolism, Nucleotide Aptamers/chemistry, Nucleotide Aptamers/metabolism, Dinucleoside Phosphates/metabolism, Endoribonucleases/chemistry, Endoribonucleases/metabolism, Glutamine/chemistry, Glutamine/metabolism, Ligands, Molecular Models, Nucleic Acid Conformation, Catalytic RNA/metabolism, Ribonucleotides/chemistry, Ribonucleotides/metabolism, S-Adenosylmethionine/chemistry, S-Adenosylmethionine/metabolism
ABSTRACT
Hyperconnectivity of neuronal circuits due to increased synaptic protein synthesis is thought to cause autism spectrum disorders (ASDs). The mammalian target of rapamycin (mTOR) is strongly implicated in ASDs by means of upstream signalling; however, downstream regulatory mechanisms are ill-defined. Here we show that knockout of the eukaryotic translation initiation factor 4E-binding protein 2 (4E-BP2), an eIF4E repressor downstream of mTOR, or eIF4E overexpression leads to increased translation of neuroligins, which are postsynaptic proteins that are causally linked to ASDs. Mice that have the gene encoding 4E-BP2 (Eif4ebp2) knocked out exhibit an increased ratio of excitatory to inhibitory synaptic inputs and autistic-like behaviours (that is, social interaction deficits, altered communication and repetitive/stereotyped behaviours). Pharmacological inhibition of eIF4E activity or normalization of neuroligin 1, but not neuroligin 2, protein levels restores the normal excitation/inhibition ratio and rectifies the social behaviour deficits. Thus, translational control by eIF4E regulates the synthesis of neuroligins, maintaining the excitation-to-inhibition balance, and its dysregulation engenders ASD-like phenotypes.
Subjects
Autistic Disorder/genetics, Autistic Disorder/physiopathology, Eukaryotic Initiation Factor-4E/metabolism, Protein Biosynthesis, Animals, Neuronal Cell Adhesion Molecules/genetics, Neuronal Cell Adhesion Molecules/metabolism, Eukaryotic Initiation Factor-4E/antagonists & inhibitors, Eukaryotic Initiation Factors/deficiency, Eukaryotic Initiation Factors/genetics, Eukaryotic Initiation Factors/metabolism, Male, Mice, Knockout Mice, Phenotype, Synapses/metabolism
ABSTRACT
RNA structures are hierarchically organized. The secondary structure is articulated around sophisticated local three-dimensional (3D) motifs shaping the full 3D architecture of the molecule. Recent contributions have identified and organized recurrent local 3D motifs, but the application of this knowledge for predictive purposes is still in its infancy. We recently developed a computational framework, named RNA-MoIP, to reconcile RNA secondary structure and local 3D motif information available in databases. In this paper, we introduce a web service using our software for predicting RNA hybrid 2D-3D structures from sequence data only. Optionally, it can be used for (i) local 3D motif prediction or (ii) the refinement of user-defined secondary structures. Importantly, our web server automatically generates a script for the MC-Sym software, which can be immediately used to quickly predict all-atom RNA 3D models. The web server is available at http://rnamoip.cs.mcgill.ca.
Subjects
Nucleotide Motifs, RNA/chemistry, Software, Base Sequence, Internet, Molecular Models, Nucleic Acid Conformation
ABSTRACT
MicroRNAs (miRNAs) are crucial gene expression regulators and first-order suspects in the development and progression of many diseases. Comparative analysis of cancer cell expression data highlights many deregulated miRNAs. Low expression of miR-125a was related to poor breast cancer prognosis. Interestingly, a single nucleotide polymorphism (SNP) in miR-125a was located within a minor allele expressed by breast cancer patients. The SNP is not predicted to affect the ground state structure of the primary transcript or precursor, but neither the precursor nor mature product is detected by RT-qPCR. How this SNP modulates the maturation of miR-125a is poorly understood. Here, building upon a model of RNA dynamics derived from nuclear magnetic resonance studies, we developed a quantitative model enabling the visualization and comparison of networks of transient structures. We observed a high correlation between the distance separating the network of a variant from that of its wild type and the variant's degree of maturation relative to the wild type, suggesting an important role of transient structures in miRNA homeostasis. We classified the human miRNAs according to pairwise distances between their networks of transient structures.
Subjects
MicroRNAs/chemistry, MicroRNAs/genetics, Nucleic Acid Conformation, Post-Transcriptional RNA Processing, Genetic Transcription, Base Pairing, Cell Line, Humans, Magnetic Resonance Spectroscopy, MicroRNAs/metabolism, Single Nucleotide Polymorphism, Structure-Activity Relationship
ABSTRACT
This paper is a report of a second round of RNA-Puzzles, a collective and blind experiment in three-dimensional (3D) RNA structure prediction. Three puzzles, Puzzles 5, 6, and 10, represented sequences of three large RNA structures with limited or no homology with previously solved RNA molecules. A lariat-capping ribozyme, as well as riboswitches complexed to adenosylcobalamin and tRNA, were predicted by seven groups using RNAComposer, ModeRNA/SimRNA, Vfold, Rosetta, DMD, MC-Fold, 3dRNA, and AMBER refinement. Some groups derived models using data from state-of-the-art chemical-mapping methods (SHAPE, DMS, CMCT, and mutate-and-map). The comparisons between the predictions and the three subsequently released crystallographic structures, solved at diffraction resolutions of 2.5-3.2 Å, were carried out automatically using various sets of quality indicators. The comparisons clearly demonstrate the state of present-day de novo prediction abilities as well as the limitations of these state-of-the-art methods. All of the best prediction models have similar topologies to the native structures, which suggests that computational methods for RNA structure prediction can already provide useful structural information for biological problems. However, the prediction accuracy for non-Watson-Crick interactions, key to proper folding of RNAs, is low and some predicted models had high Clash Scores. These two difficulties point to some of the continuing bottlenecks in RNA structure prediction. All submitted models are available for download at http://ahsoka.u-strasbg.fr/rnapuzzles/.
Subjects
Computational Biology/methods, RNA/chemistry, X-Ray Crystallography, Molecular Models, Nucleic Acid Conformation, Messenger RNA/chemistry, Transfer RNA/chemistry, Software
ABSTRACT
In eukaryotes, gene expression is regulated by microRNAs (miRNAs) which bind to messenger RNAs (mRNAs) and interfere with their translation into proteins, either by promoting their degradation or inducing their repression. We study the effect of miRNA interference on each gene using experimental methods, such as microarrays and RNA-seq at the mRNA level, or luciferase reporter assays and variations of SILAC at the protein level. Alternatively, computational predictions would provide clear benefits. However, no algorithm for this task has ever been proposed. Here, we introduce a new algorithm to predict genome-wide expression data from initial transcriptome abundance. The algorithm simulates the miRNA and mRNA hybridization competition that occurs in given cellular conditions, and derives the whole set of miRNA::mRNA interactions at equilibrium (microtargetome). Interestingly, solving the competition improves the accuracy of miRNA target predictions. Furthermore, this model implements a previously reported and fundamental property of the microtargetome: the binding between a miRNA and an mRNA depends on their sequence complementarity, but also on the abundance of all RNAs expressed in the cell, i.e. the stoichiometry of all the miRNA sites and all the miRNAs given their respective abundance. This model generalizes the miRNA-induced synchronistic silencing previously observed, and described as sponges and competitive endogenous RNAs.
Subjects
Algorithms, Gene Silencing, MicroRNAs/metabolism, Cell Line, Humans, MicroRNAs/chemistry, Messenger RNA/chemistry, Messenger RNA/metabolism, Transcriptome
ABSTRACT
Anti-infection drugs target vital functions of infectious agents, including their ribosome and other essential non-coding RNAs. One reason infectious agents become resistant to drugs is the emergence of mutations that eliminate drug-binding affinity while maintaining vital elements. Identifying these elements is based on the determination of viable and lethal mutants and associated structures. However, determining the structure of enough mutants at high resolution is not always possible. Here, we introduce a new computational method, MC-3DQSAR, to determine the vital elements of target RNA structure from mutagenesis and available high-resolution data. We applied the method to further characterize the structural determinants of the bacterial 23S ribosomal RNA sarcin-ricin loop (SRL), as well as those of the lead-activated and hammerhead ribozymes. The method was accurate in confirming experimentally determined essential structural elements and predicting the viability of new SRL variants, which were either observed in bacteria or validated in bacterial growth assays. Our results indicate that MC-3DQSAR could be used systematically to evaluate the drug-target potential of any RNA site using current high-resolution structural data.
Subjects
Quantitative Structure-Activity Relationship, RNA/chemistry, Computational Biology/methods, Molecular Models, Bacterial RNA/chemistry, Bacterial RNA/metabolism, Catalytic RNA/chemistry, Catalytic RNA/metabolism, 23S Ribosomal RNA/chemistry, 23S Ribosomal RNA/metabolism
ABSTRACT
ADARs (Adenosine deaminases that act on RNA) "edit" RNA by converting adenosines to inosines within double-stranded regions. The primary targets of ADARs are long duplexes present within noncoding regions of mRNAs, such as introns and 3' untranslated regions (UTRs). Because adenosine and inosine have different base-pairing properties, editing within these regions can alter splicing and recognition by small RNAs. However, despite numerous studies identifying multiple editing sites in these genomic regions, little is known about the extent to which editing sites co-occur on individual transcripts or the functional output of these combinatorial editing events. To begin to address these questions, we performed an ultra-deep sequencing analysis of 4 Caenorhabditis elegans 3' UTRs that are known ADAR targets. Synchronous editing events were determined for the long duplexes in vivo. Furthermore, the validity of each editing event was confirmed by sequencing the same regions of mRNA from worms that lack A-to-I editing. This analysis identified a large number of editing sites that can occur within each 3' UTR, but interestingly, each individual transcript contained only a small fraction of these A-to-I editing events. In addition, editing patterns were not random, indicating that an editing event can affect the efficiency of editing at subsequent adenosines. Furthermore, we identified specific sites that can be both positively and negatively correlated with additional sites leading to mutually exclusive editing patterns. These results suggest that editing in noncoding regions is selective and hyper-editing of cellular RNAs is rare.
Subjects
Adenosine Deaminase/metabolism, Adenosine/metabolism, Caenorhabditis elegans Proteins/metabolism, Caenorhabditis elegans/metabolism, Inosine/metabolism, RNA Editing, Helminth RNA/metabolism, 3' Untranslated Regions, Adenosine Deaminase/genetics, Animals, Base Pairing, Base Sequence, Caenorhabditis elegans/genetics, Caenorhabditis elegans Proteins/genetics, Deamination, Exons, High-Throughput Nucleotide Sequencing, Introns, Molecular Sequence Data, Nucleic Acid Conformation, Open Reading Frames, Helminth RNA/genetics
ABSTRACT
We report the results of a first, collective, blind experiment in RNA three-dimensional (3D) structure prediction, encompassing three prediction puzzles. The goals are to assess the leading edge of RNA structure prediction techniques; compare existing methods and tools; and evaluate their relative strengths, weaknesses, and limitations in terms of sequence length and structural complexity. The results should give potential users insight into the suitability of available methods for different applications and support the RNA structure prediction community's ongoing efforts to improve prediction tools. We also report the creation of an automated evaluation pipeline to facilitate the analysis of future RNA structure prediction exercises.
Subjects
Nucleic Acid Conformation, RNA/chemistry, Base Sequence, Dimerization, Molecular Models, Molecular Sequence Data
ABSTRACT
The classical RNA secondary structure model considers A.U and G.C Watson-Crick as well as G.U wobble base pairs. Here we replace it with a new one, in which sets of nucleotide cyclic motifs define RNA structures. This model allows us to unify all base pairing energetic contributions in an effective scoring function to tackle the problem of RNA folding. We show how pipelining two computer algorithms based on nucleotide cyclic motifs, MC-Fold and MC-Sym, reproduces a series of experimentally determined RNA three-dimensional structures from the sequence. This demonstrates how crucial the consideration of all base-pairing interactions is in filling the gap between sequence and structure. We use the pipeline to define rules of precursor microRNA folding in double helices, despite the presence of a number of presumed mismatches and bulges, and to propose a new model of the human immunodeficiency virus type 1 (HIV-1) -1 frameshifting element.
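To make the contrast concrete, the classical model's notion of an allowed base pair (A.U, G.C and G.U) can be written as a small check. The sketch below is not MC-Fold or MC-Sym and does not implement nucleotide cyclic motifs; it only lists the pairs in a dot-bracket structure that the classical model would not account for. The function names and the example sequences are hypothetical.

```python
# The three pair types of the classical secondary structure model.
CANONICAL_PAIRS = {frozenset("AU"), frozenset("GC"), frozenset("GU")}

def is_canonical(a, b):
    """True if bases a and b form an A.U, G.C or G.U pair (either order)."""
    return frozenset((a.upper(), b.upper())) in CANONICAL_PAIRS

def noncanonical_pairs(sequence, dot_bracket):
    """Return (i, j, base_i, base_j) for each bracketed pair that is not canonical.

    Positions are 0-based; only '(' , ')' and '.' are interpreted.
    """
    if len(sequence) != len(dot_bracket):
        raise ValueError("sequence and structure must have the same length")
    open_positions, found = [], []
    for j, symbol in enumerate(dot_bracket):
        if symbol == "(":
            open_positions.append(j)
        elif symbol == ")":
            if not open_positions:
                raise ValueError("unbalanced structure: extra ')'")
            i = open_positions.pop()
            if not is_canonical(sequence[i], sequence[j]):
                found.append((i, j, sequence[i], sequence[j]))
    if open_positions:
        raise ValueError("unbalanced structure: extra '('")
    return found

# A hairpin whose stem uses only classical pairs (G.C, G.C, G.U).
print(noncanonical_pairs("GGGAAAUCC", "(((...)))"))
# The same hairpin with an A-A mismatch in the middle of the stem.
print(noncanonical_pairs("GAGAAAUAC", "(((...)))"))
```

Pairs flagged by such a check are exactly the "presumed mismatches" that the classical model leaves unexplained and that a model considering all base-pairing interactions is meant to cover.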
Subjects
Computational Biology, Nucleic Acid Conformation, RNA/chemistry, RNA/genetics, Software, Algorithms, Base Pairing, Base Sequence, Ribosomal Frameshifting, gag Genes/genetics, pol Genes/genetics, HIV-1/genetics, Humans, MicroRNAs/chemistry, MicroRNAs/metabolism, Genetic Models, Molecular Models, Molecular Sequence Data, RNA Precursors/chemistry, RNA Precursors/metabolism, Viral RNA/chemistry, Viral RNA/genetics, Viral RNA/metabolism, Thermodynamics
ABSTRACT
The majority of cancer deaths are caused by solid tumors, and the four most prevalent cancers (breast, lung, colorectal and prostate) account for more than 60% of all cases (1). Tumor cell heterogeneity driven by variable cancer microenvironments, such as hypoxia, is a key determinant of therapeutic outcome. We developed a novel culture protocol, termed the Long-Term Hypoxia (LTHY) time course, to recapitulate the gradual development of severe hypoxia seen in vivo and mimic conditions observed in primary tumors. Cells subjected to LTHY underwent a non-canonical epithelial to mesenchymal transition (EMT) based on miRNA and mRNA signatures, and displayed EMT-like morphological changes. Concomitant to this, we report production of a novel truncated isoform of WT1 transcription factor (tWt1), a non-canonical EMT driver, with expression driven by a yet undescribed intronic promoter through hypoxia-responsive elements (HREs). We further demonstrated that tWt1 initiates translation from an intron-derived start codon, and retains proper subcellular localization and DNA binding. A similar tWt1 is also expressed in LTHY-cultured human cancer cell lines as well as primary cancers and predicts long-term patient survival. Our study not only demonstrates the importance of culture conditions that better mimic those observed in primary cancers, especially with regard to hypoxia, but also identifies a novel isoform of WT1 which correlates with poor long-term survival in ovarian cancer.
Subjects
Epithelial-Mesenchymal Transition, Protein Isoforms, WT1 Proteins, Humans, Epithelial-Mesenchymal Transition/genetics, WT1 Proteins/metabolism, WT1 Proteins/genetics, Protein Isoforms/genetics, Protein Isoforms/metabolism, Tumor Cell Line, Neoplasms/metabolism, Neoplasms/genetics, Neoplasms/pathology, Neoplastic Gene Expression Regulation
ABSTRACT
The NMR solution structure is reported of a duplex, 5'GUGAAGCCCGU/3'UCACAGGAGGC, containing a 4 × 4 nucleotide internal loop from an R2 retrotransposon RNA. The loop contains three sheared purine-purine pairs and reveals a structural element found in other RNAs, which we refer to as the 3RRs motif. Optical melting measurements of the thermodynamics of the duplex indicate that the internal loop is 1.6 kcal/mol more stable at 37°C than predicted. The results identify the 3RRs motif as a common structural element that can facilitate prediction of 3D structure. Known examples include internal loops having the pairings: 5'GAA/3'AGG, 5'GAG/3'AGG, 5'GAA/3'AAG, and 5'AAG/3'AGG. The structural information is compared with predictions made with the MC-Sym program.
Subjects
Biomolecular Nuclear Magnetic Resonance/methods, Nucleic Acid Conformation, Purine Nucleotides/chemistry, RNA/chemistry, Retroelements, Adenine/chemistry, Amino Acid Motifs, Base Pairing, Protein Interaction Domains and Motifs, RNA/genetics, RNA Sequence Analysis, Thermodynamics
ABSTRACT
MOTIVATION: Predicting RNA 3D structures from sequence alone is a milestone toward RNA function analysis and prediction. In recent years, many methods have addressed this challenge, ranging from cycle decomposition and fragment assembly to molecular dynamics simulations. However, their predictions remain fragile and limited to small RNAs. To expand the range and accuracy of these techniques, we need to develop algorithms that can use all the structural information available. In particular, the energetic contribution of secondary structure interactions is now well documented, but the quantification of non-canonical interactions, those shaping the tertiary structure, is poorly understood. Nonetheless, even if a complete RNA tertiary structure energy model is currently unavailable, we now have catalogues of local 3D structural motifs including non-canonical base pairings. A practical objective is thus to develop techniques enabling us to use this knowledge for robust RNA tertiary structure predictors. RESULTS: In this work, we introduce RNA-MoIP, a program that benefits from the progress made over the last 30 years in the field of RNA secondary structure prediction and expands these methods to incorporate the novel local motif information available in databases. Using an integer programming framework, our method refines predicted secondary structures (i.e. removes incorrect canonical base pairs) to accommodate the insertion of RNA 3D motifs (i.e. hairpins, internal loops and k-way junctions). Then, we use predictions as templates to generate complete 3D structures with the MC-Sym program. We benchmarked RNA-MoIP on a set of 9 RNAs with sizes varying from 53 to 128 nucleotides. We show that our approach (i) improves the accuracy of canonical base pair predictions; (ii) identifies the best secondary structures in a pool of suboptimal structures; and (iii) predicts accurate 3D structures of large RNA molecules.
AVAILABILITY: RNA-MoIP is publicly available at: http://csb.cs.mcgill.ca/RNAMoIP.
Subjects
Algorithms, Nucleic Acid Conformation, Nucleotide Motifs, RNA/chemistry, Software, Base Pairing, Nucleic Acid Databases, Theoretical Models, RNA/genetics
ABSTRACT
Introduction: Prediction of RNA secondary structure from single sequences still needs substantial improvements. The application of machine learning (ML) to this problem has become increasingly popular. However, ML algorithms are prone to overfitting, limiting the ability to learn more about the inherent mechanisms governing RNA folding. It is natural to use high-capacity models when solving such a difficult task, but poor generalization is expected when too few examples are available. Methods: Here, we report the relation between capacity and performance on a fundamental related problem: determining whether two sequences are fully complementary. Our analysis focused on the impact of model architecture and capacity as well as dataset size and nature on classification accuracy. Results: We observed that low-capacity models are better suited for learning with mislabelled training examples, while large capacities improve the ability to generalize to structurally dissimilar data. It turns out that neural networks struggle to grasp the fundamental concept of base complementarity, especially when extrapolating across sequence lengths. Discussion: Given a more complex task like RNA folding, it comes as no surprise that the scarcity of usable examples hinders the applicability of machine learning techniques to this field.
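For reference, the classification target described in the Methods has a simple rule-based solution, sketched below. The sketch assumes that "fully complementary" means strict Watson-Crick pairing in antiparallel orientation (no G.U wobble); the abstract does not spell out the definition, so this is an assumption. The helper `make_example`, which produces labelled pairs of the kind such a study could train on, is hypothetical and not taken from the paper.

```python
import random

# Watson-Crick partners only (assumption: G.U wobble pairs are not counted).
WATSON_CRICK = {"A": "U", "U": "A", "G": "C", "C": "G"}

def reverse_complement(seq):
    """Return the antiparallel Watson-Crick complement of an RNA sequence."""
    return "".join(WATSON_CRICK[base] for base in reversed(seq))

def fully_complementary(a, b):
    """True if b pairs with every base of a when the strands run antiparallel."""
    return len(a) == len(b) and reverse_complement(a) == b

def make_example(length, positive, rng):
    """Build one labelled pair; negatives differ from a positive at one position."""
    a = "".join(rng.choice("ACGU") for _ in range(length))
    b = reverse_complement(a)
    if not positive:
        pos = rng.randrange(length)
        substitute = rng.choice([x for x in "ACGU" if x != b[pos]])
        b = b[:pos] + substitute + b[pos + 1:]
    return a, b, positive

rng = random.Random(0)
for label in (True, False):
    a, b, expected = make_example(12, label, rng)
    print(a, b, fully_complementary(a, b), "expected:", expected)
```

The rule takes a handful of lines and works for any length, which gives a sense of why a learned model's difficulty with length extrapolation on this task is notable.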