ABSTRACT
MOTIVATION: Up to 75% of the human genome encodes RNAs. The function of many non-coding RNAs relies on their ability to fold into 3D structures. Specifically, nucleotides inside secondary structure loops form non-canonical base pairs that help stabilize complex local 3D structures. These RNA 3D motifs can promote specific interactions with other molecules or serve as catalytic sites. RESULTS: We introduce PERFUMES, a computational pipeline to identify 3D motifs that can be associated with observable features. Given a set of RNA sequences with associated binary experimental measurements, PERFUMES searches for RNA 3D motifs using BayesPairing2 and extracts those that are over-represented in the set of positive sequences. It also conducts a thermodynamics analysis of the structural context that can support the interpretation of the predictions. We illustrate PERFUMES' usage on the SNRPA protein binding site, for which the tool retrieved both previously known binder motifs and new ones. AVAILABILITY AND IMPLEMENTATION: PERFUMES is an open-source Python package (https://jwgitlab.cs.mcgill.ca/arnaud_chol/perfumes).
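The over-representation step described above can be illustrated with a one-sided hypergeometric (Fisher-type) test. This is only a sketch under stated assumptions: the abstract does not say which statistical test PERFUMES applies, and the function name and count-based inputs below are hypothetical, not part of the PERFUMES package.

```python
from math import comb


def hypergeom_enrichment_pvalue(pos_with, pos_total, neg_with, neg_total):
    """One-sided p-value that a motif is over-represented in positives.

    pos_with / neg_with: number of positive / negative sequences in which
    the motif was predicted (hypothetical inputs for illustration).
    Returns P(X >= pos_with) under the hypergeometric null, i.e. when the
    motif-bearing sequences are spread at random over both sets.
    """
    population = pos_total + neg_total      # all sequences
    successes = pos_with + neg_with         # sequences carrying the motif
    denom = comb(population, pos_total)     # ways to choose the positive set
    upper = min(pos_total, successes)
    tail = sum(
        comb(successes, k) * comb(population - successes, pos_total - k)
        for k in range(pos_with, upper + 1)
    )
    return tail / denom


# Motif found in 8 of 10 positive sequences but only 2 of 10 negatives.
p_value = hypergeom_enrichment_pvalue(8, 10, 2, 10)
print(f"enrichment p-value: {p_value:.4f}")
```

A small p-value suggests the motif occurs in the positive set more often than chance would explain; when many motifs are tested, a multiple-testing correction would normally follow.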
Subjects
Perfume , Humans , Nucleic Acid Conformation , Nucleotide Motifs , Base Pairing , RNA/chemistry
ABSTRACT
Over the past two decades, scientists have increasingly realized the importance of the three-dimensional (3D) genome organization in regulating cellular activity. Hi-C and related experiments yield 2D contact matrices that can be used to infer 3D models of chromosome structure. Visualizing and analyzing genomes in 3D space remains challenging. Here, we present ARGV, an augmented reality 3D Genome Viewer. ARGV contains more than 350 pre-computed and annotated genome structures inferred from Hi-C and imaging data. It offers interactive and collaborative visualization of genomes in 3D space, using standard mobile phones or tablets. A user study comparing ARGV to existing tools demonstrates its benefits.
Subjects
Augmented Reality , Genome , Three-Dimensional Imaging/methods , Software , Humans , Genomics/methods
ABSTRACT
SUMMARY: RNA 3D architectures are stabilized by sophisticated networks of (non-canonical) base pair interactions, which can be conveniently encoded as multi-relational graphs and efficiently exploited by graph theoretical approaches and recent progress in machine learning techniques. RNAglib is a library that eases the use of this representation, by providing clean data, methods to load it in machine learning pipelines and graph-based deep learning models suited for this representation. RNAglib also offers other utilities to model RNA with 2.5D graphs, such as drawing tools, comparison functions or baseline performances on RNA applications. AVAILABILITY AND IMPLEMENTATION: The method is distributed as a pip package, RNAglib. Data are available in a repository and can be accessed on rnaglib's web page. The source code, data and documentation are available at https://rnaglib.cs.mcgill.ca. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
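To make the multi-relational graph encoding concrete, the following sketch stores nucleotides as nodes and typed interactions as labelled edges. It does not use the RNAglib API; the class, the toy hairpin and the edge labels (B53 for backbone, CWW for a canonical cis Watson-Crick pair, TSH for a trans sugar-Hoogsteen pair) are illustrative assumptions, and edges are kept undirected for simplicity.

```python
from collections import defaultdict


class MultiRelationalGraph:
    """Undirected graph whose edges carry a relation (interaction) type."""

    def __init__(self):
        # relation -> node -> set of neighbouring nodes
        self._adj = defaultdict(lambda: defaultdict(set))

    def add_edge(self, u, v, relation):
        self._adj[relation][u].add(v)
        self._adj[relation][v].add(u)

    def neighbors(self, node, relation):
        # .get avoids creating empty entries for unknown relations or nodes
        return sorted(self._adj.get(relation, {}).get(node, set()))

    def relations(self):
        return sorted(self._adj)


def toy_hairpin():
    """Four nucleotides: a backbone chain, one canonical and one non-canonical pair."""
    graph = MultiRelationalGraph()
    for i in range(1, 4):
        graph.add_edge(i, i + 1, "B53")   # backbone 1-2-3-4
    graph.add_edge(1, 4, "CWW")           # canonical pair closing the loop
    graph.add_edge(2, 3, "TSH")           # non-canonical pair inside the loop
    return graph


graph = toy_hairpin()
print(graph.relations())
print(graph.neighbors(2, "B53"))
```

Keeping one adjacency map per relation type mirrors how relational graph models typically treat each interaction type as its own edge set.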
Subjects
Libraries , Software , Machine Learning , Documentation , Gene Library
ABSTRACT
MOTIVATION: RNA 3D motifs are recurrent substructures, modeled as networks of base pair interactions, which are crucial for understanding structure-function relationships. The task of automatically identifying such motifs is computationally hard, and remains a key challenge in the field of RNA structural biology and network analysis. State-of-the-art methods solve special cases of the motif problem by constraining the structural variability in occurrences of a motif, and narrowing the substructure search space. RESULTS: Here, we relax these constraints by posing the motif finding problem as a graph representation learning and clustering task. This framing takes advantage of the continuous nature of graph representations to model the flexibility and variability of RNA motifs in an efficient manner. We propose a set of node similarity functions, clustering methods and motif construction algorithms to recover flexible RNA motifs. Our tool, Vernal, can be easily customized by users to desired levels of motif flexibility, abundance and size. We show that Vernal is able to retrieve and expand known classes of motifs, as well as to propose novel motifs. AVAILABILITY AND IMPLEMENTATION: The source code, data and a webserver are available at vernal.cs.mcgill.ca. We also provide a flexible interface and a user-friendly webserver to browse and download our results. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subjects
Algorithms , RNA , RNA/chemistry , Nucleotide Motifs , Software , Base Pairing , Computational Biology
ABSTRACT
RNA tertiary structure is crucial to its many non-coding molecular functions. RNA architecture is shaped by its secondary structure, composed of stems (stacked canonical base pairs) enclosing loops. While stems are precisely captured by free-energy models, loops composed of non-canonical base pairs are not. Nor are distant interactions linking together those secondary structure elements (SSEs). Databases of conserved 3D geometries (a.k.a. modules) not captured by energetic models are leveraged for structure prediction and design, but the computational complexity has limited their study to local elements, namely loops. Representing the RNA structure as a graph has recently allowed this work to be expanded to pairs of SSEs, uncovering a hierarchical organization of these 3D modules, at great computational cost. Systematically capturing recurrent patterns on a large scale is a main challenge in the study of RNA structures. In this paper, we present an efficient algorithm to compute maximal isomorphisms in edge-colored graphs. We extend this algorithm to a framework well suited to identify RNA modules, and fast enough to considerably generalize previous approaches. To exhibit the versatility of our framework, we first reproduce results identifying all common modules spanning more than 2 SSEs, in a few hours instead of weeks. The efficiency of our new algorithm is demonstrated by computing the maximal modules between any pair of entire RNAs in the non-redundant corpus of known RNA 3D structures. We observe that the biggest modules our method uncovers form large shared substructures spanning hundreds of nucleotides and base pairs between the ribosomes of Thermus thermophilus, Escherichia coli, and Pseudomonas aeruginosa.
Subjects
Nucleic Acid Conformation , RNA/chemistry , Algorithms , Base Pairing , Computational Biology/methods
ABSTRACT
RNA-small molecule binding is a key regulatory mechanism which can stabilize 3D structures and activate molecular functions. The discovery of RNA-targeting compounds is thus a current topic of interest for novel therapies. Our work is a first attempt at bringing the scalability and generalization abilities of machine learning methods to the problem of RNA drug discovery, as well as a step towards understanding the interactions which drive binding specificity. Our tool, RNAmigos, builds and encodes a network representation of RNA structures to predict likely ligands for novel binding sites. We subject ligand predictions to virtual screening and show that we are able to place the true ligand in the 71st-73rd percentile in two decoy libraries, showing a significant improvement over several baselines and a state-of-the-art method. Furthermore, we observe that augmenting structural networks with non-canonical base pairing data is the only representation able to uncover a significant signal, suggesting that such interactions are a necessary source of binding specificity. We also find that pre-training with an auxiliary graph representation learning task significantly boosts performance of ligand prediction. This finding can serve as a general principle for RNA structure-function prediction when data are scarce. RNAmigos shows that RNA binding data contain structural patterns with potential for drug discovery, and provides methodological insights for possible applications to other structure-function learning tasks. The source code, data and a Web server are freely available at http://rnamigos.cs.mcgill.ca.
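The percentile figure quoted above can be read as the rank of the true ligand among decoys in a virtual screen. The sketch below shows one common way to compute such a rank; the exact scoring and ranking procedure used by RNAmigos is not given in the abstract, so the function name, the tie handling and the "higher score is better" convention are assumptions.

```python
def percentile_rank(true_score, decoy_scores):
    """Percentage of decoys scored below the true ligand (ties count half).

    Assumes a higher score means a more likely binder.
    """
    if not decoy_scores:
        raise ValueError("at least one decoy score is required")
    below = sum(1 for s in decoy_scores if s < true_score)
    ties = sum(1 for s in decoy_scores if s == true_score)
    return 100.0 * (below + 0.5 * ties) / len(decoy_scores)


# Hypothetical similarity scores between a predicted fingerprint and each compound.
decoys = [0.1, 0.5, 0.9, 0.7]
print(percentile_rank(0.8, decoys))
```

Under this convention a random predictor would average about the 50th percentile, which is the baseline a reported rank should be compared with.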
Subjects
RNA/chemistry , Software , Base Pairing , Binding Sites , Ligands , Nucleic Acid Conformation
ABSTRACT
The RNA world hypothesis relies on the ability of ribonucleic acids to spontaneously acquire complex structures capable of supporting essential biological functions. Multiple sophisticated evolutionary models have been proposed for their emergence, but they often assume specific conditions. In this work, we explore a simple and parsimonious scenario describing the emergence of complex molecular structures at the early stages of life. We show that at specific GC content regimes, an undirected replication model is sufficient to explain the appearance of multibranched RNA secondary structures, a structural signature of many essential ribozymes. We ran a large-scale computational study to map energetically stable structures on complete mutational networks of 50-nt-long RNA sequences. Our results reveal that the sequence landscape with stable structures is enriched with multibranched structures at a length scale coinciding with the appearance of complex structures in RNA databases. A random replication mechanism preserving a 50% GC content may suffice to explain a natural enrichment of stable complex structures in populations of functional RNAs. In contrast, an evolutionary mechanism eliciting the most stable folds at each generation appears to help reach multibranched structures at the highest GC content.
Subjects
Nucleic Acid Conformation , RNA/chemistry , Base Composition , Base Sequence , Molecular Evolution , Mutation , RNA/genetics , RNA Folding , RNA Stability , Structure-Activity Relationship , Genetic Transcription
ABSTRACT
MOTIVATION: Protein folding is a dynamic process through which polypeptide chains reach their native 3D structures. Although the importance of this mechanism is widely acknowledged, very few high-throughput computational methods have been developed to study it. RESULTS: In this paper, we report a computational platform named P3Fold that combines statistical and evolutionary information for predicting and analyzing protein folding routes. P3Fold uses coarse-grained modeling and efficient combinatorial schemes to predict residue contacts and evaluate the folding routes of a protein sequence within minutes or hours. To facilitate access to this technology, we devise graphical representations and implement an interactive web interface that allows end-users to leverage P3Fold predictions. Finally, we use P3Fold to conduct large and short scale experiments on the human proteome that reveal the broad conservation and variations of structural intermediates within protein families. AVAILABILITY AND IMPLEMENTATION: A Web server of P3Fold is freely available at http://csb.cs.mcgill.ca/P3Fold. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subjects
Protein Folding , Software , Amino Acid Sequence , Computers , Humans , Proteome
ABSTRACT
SUMMARY: RNA design has conceptually evolved from the inverse RNA folding problem. In the classical inverse RNA problem, the user inputs an RNA secondary structure and receives an output RNA sequence that folds into it. Although modern RNA design methods are based on the same principle, a finer control over the resulting sequences is sought. As an important example, a substantial number of non-coding RNA families show high preservation in specific regions while being more flexible in others, and this information should be utilized in the design. By using the additional information, RNA design tools can help solve problems of practical interest in the growing fields of synthetic biology and nanotechnology. incaRNAfbinv 2.0 utilizes a fragment-based approach, enabling control of specific RNA secondary structure motifs. The new version allows significantly more control over the general RNA shape, and also allows specific restrictions to be expressed for each motif separately, in addition to other advanced features. AVAILABILITY AND IMPLEMENTATION: incaRNAfbinv 2.0 is available through a standalone package and a web-server at https://www.cs.bgu.ac.il/incaRNAfbinv. Source code, command-line and GUI wrappers can be found at https://github.com/matandro/RNAsfbinv. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subjects
RNA , Software , Nucleotide Motifs , RNA/genetics , RNA Folding , RNA Sequence Analysis
ABSTRACT
RNA structures possess multiple levels of structural organization. A secondary structure, made of Watson-Crick helices connected by loops, forms a scaffold for the tertiary structure. The 3D structures adopted by these loops are therefore critical determinants shaping the global 3D architecture. Earlier studies showed that these local 3D structures can be described as conserved sets of ordered non-Watson-Crick base pairs called RNA structural modules. Unfortunately, the computational efficiency and scope of the current 3D module identification methods are too limited yet to benefit from all the knowledge accumulated in the module databases. We present BayesPairing, an automated, efficient and customizable tool for (i) building Bayesian networks representing RNA 3D modules and (ii) rapid identification of 3D modules in sequences. BayesPairing uses a flexible definition of RNA 3D modules that allows us to consider complex architectures such as multi-branched loops and features multiple algorithmic improvements. We benchmarked our methods using cross-validation techniques on 3409 RNA chains and show that BayesPairing achieves up to ~70% identification accuracy on module positions and base pair interactions. BayesPairing can handle a broader range of motifs (versatility) and offers considerable running time improvements (efficiency), opening the door to a broad range of large-scale applications.
Subjects
Base Pairing , Bayes Theorem , RNA/chemistry , Automation , Genetic Databases , Datasets as Topic , Reproducibility of Results , Time Factors
ABSTRACT
Computational programs for predicting RNA sequences with desired folding properties have been extensively developed and expanded in the past several years. Given a secondary structure, these programs aim to predict sequences that fold into a target minimum free energy secondary structure, while considering various constraints. This procedure is called inverse RNA folding. Inverse RNA folding has been traditionally used to design optimized RNAs with favorable properties, an application that is expected to grow considerably in the future in light of advances in the expanding new fields of synthetic biology and RNA nanostructures. Moreover, it was recently demonstrated that inverse RNA folding can successfully be used as a valuable preprocessing step in computational detection of novel noncoding RNAs. This review describes the most popular freeware programs that have been developed for such purposes, starting from RNAinverse, which was devised when formulating the inverse RNA folding problem. The most recently published ones that consider RNA secondary structure as input are antaRNA, RNAiFold and incaRNAfbinv, each having different features that could be beneficial to specific biological problems in practice. The various programs also use distinct approaches, ranging from ant colony optimization to constraint programming, in addition to adaptive walk, simulated annealing and Boltzmann sampling. This review compares the various programs and provides a simple description of the various possibilities that would benefit practitioners in selecting the most suitable program. It is geared for specific tasks requiring RNA design based on input secondary structure, with an outlook toward the future of RNA design programs.
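As a minimal illustration of the starting point shared by inverse folding methods, the sketch below parses a dot-bracket secondary structure and draws a random sequence whose paired positions can form canonical or G-U wobble pairs. It is not the algorithm of any program named above: compatibility does not guarantee that the sequence folds into the target, which is why the reviewed tools add an optimization or sampling step. All function names are hypothetical.

```python
import random

# Ordered (5' base, 3' base) combinations allowed at paired positions.
ALLOWED_PAIRS = [("G", "C"), ("C", "G"), ("A", "U"),
                 ("U", "A"), ("G", "U"), ("U", "G")]


def parse_dot_bracket(structure):
    """Return the list of (i, j) base pairs encoded by a dot-bracket string."""
    stack, pairs = [], []
    for i, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i}")
            pairs.append((stack.pop(), i))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {i}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return pairs


def compatible_sequence(structure, rng=None):
    """Sample a sequence that can form every base pair of the target structure."""
    if rng is None:
        rng = random.Random()
    sequence = [rng.choice("ACGU") for _ in structure]   # unpaired default
    for i, j in parse_dot_bracket(structure):
        sequence[i], sequence[j] = rng.choice(ALLOWED_PAIRS)
    return "".join(sequence)


target = "((((...))))"
print(target)
print(compatible_sequence(target, random.Random(0)))
```

A design program would then fold candidate sequences with an energy model and mutate them until the predicted structure matches the target or its coarse-grained shape.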
Subjects
Algorithms , Nucleic Acid Conformation , RNA Folding , RNA/chemistry , Software , Animals , Computational Biology/methods , Humans , Molecular Models
ABSTRACT
Ligand-based drug design has recently benefited from the development of deep generative models. These models enable extensive explorations of the chemical space and provide a platform for molecular optimization. However, the vast majority of current methods do not leverage the structure of the binding target, which potentiates the binding of small molecules and plays a key role in the interaction. We propose an optimization pipeline that leverages complementary structure-based and ligand-based methods. Instead of performing docking on a fixed chemical library, we iteratively select promising compounds in the full chemical space using a ligand-centered generative model. Molecular docking is then used as an oracle to guide compound optimization. This allows for iterative generation of compounds that fit the target structure progressively better, without prior knowledge about bioactives. For this purpose, we introduce a new graph-to-SELFIES Variational Autoencoder (VAE) which benefits from an 18-fold faster decoding than the graph-to-graph state of the art, while achieving a similar performance. We then successfully optimize the generation of molecules toward high docking scores, enabling a 10-fold enrichment of high-scoring compounds found with a fixed computational cost.
Subjects
Drug Discovery , Timolol , Drug Design , Ligands , Molecular Docking Simulation
ABSTRACT
The wealth of the combinatorics of nucleotide base pairs enables RNA molecules to assemble into sophisticated interaction networks, which are used to create complex 3D substructures. These interaction networks are essential to shape the 3D architecture of the molecule, and also to provide the key elements to carry molecular functions such as protein or ligand binding. They are made of organised sets of long-range tertiary interactions which connect distinct secondary structure elements in 3D structures. Here, we present a de novo data-driven approach to extract automatically from large data sets of full RNA 3D structures the recurrent interaction networks (RINs). Our methodology enables us for the first time to detect the interaction networks connecting distinct components of the RNA structure, highlighting their diversity and conservation through non-related functional RNAs. We use a graphical model to perform pairwise comparisons of all RNA structures available and to extract RINs and modules. Our analysis yields a complete catalog of RNA 3D structures available in the Protein Data Bank and reveals the intricate hierarchical organization of the RNA interaction networks and modules. We assembled our results in an online database (http://carnaval.lri.fr) which will be regularly updated. Within the site, a tool allows users with a novel RNA structure to detect automatically whether the novel structure contains previously observed RINs.
Subjects
Nucleic Acid Databases/statistics & numerical data , Nucleic Acid Conformation , RNA/chemistry , Algorithms , Base Pairing , Computational Biology/methods , Data Mining/methods , Protein Databases/statistics & numerical data , Molecular Models , RNA Folding , Software
ABSTRACT
Over the past two decades, interest in DNA and RNA as drug targets has been growing rapidly. Following the trends observed with protein drug targets, computational approaches for drug design have been developed for this new class of molecules. Our efforts toward the development of a universal docking program, Fitted, led us to focus on nucleic acids. Throughout the development of this docking program, efforts were directed toward displaceable water molecules, which must be accurately located for optimal docking-based drug discovery. However, although there is a plethora of methods to place water molecules in and around protein structures, there is, to the best of our knowledge, no such fully automated method for nucleic acids, which are significantly more polar and solvated than proteins. We report herein a new method, Splash'Em (Solvation Potential Laid around Statistical Hydration on Entire Macromolecules), developed to place water molecules within the binding cavity of nucleic acids. This fast method was shown to have high agreement with water positions in crystal structures and will therefore provide essential information to medicinal chemists.
Subjects
DNA/chemistry , DNA/metabolism , RNA/chemistry , RNA/metabolism , Water/chemistry , Hydrogen Bonding , Ligands , Molecular Models , Nucleic Acid Conformation
ABSTRACT
The field of 3D genomics has grown at an increasing rate over the last decade. The volume and complexity of the 2D and 3D data produced are progressively outpacing the capacities of the technology previously used for distributing genome sequences. The emergence of new technologies also provides novel opportunities for the development of innovative approaches. In this paper, we review the state-of-the-art computing technology, as well as the solutions adopted by the platforms currently available.
Subjects
Big Data , Chromosome Mapping , Data Analysis , Genome/genetics , Three-Dimensional Imaging , Cloud Computing , DNA/chemistry , DNA/genetics , Genetic Databases , Genomics/methods , Nucleic Acid Conformation
ABSTRACT
RNA structures are hierarchically organized. The secondary structure is articulated around sophisticated local three-dimensional (3D) motifs shaping the full 3D architecture of the molecule. Recent contributions have identified and organized recurrent local 3D motifs, but the application of this knowledge for predictive purposes is still in its infancy. We recently developed a computational framework, named RNA-MoIP, to reconcile RNA secondary structure and local 3D motif information available in databases. In this paper, we introduce a web service using our software for predicting RNA hybrid 2D-3D structures from sequence data only. Optionally, it can be used for (i) local 3D motif prediction or (ii) the refinement of user-defined secondary structures. Importantly, our web server automatically generates a script for the MC-Sym software, which can be immediately used to quickly predict all-atom RNA 3D models. The web server is available at http://rnamoip.cs.mcgill.ca.
Subjects
Nucleotide Motifs , RNA/chemistry , Software , Base Sequence , Internet , Molecular Models , Nucleic Acid Conformation
ABSTRACT
Systematic structure probing experiments (e.g. SHAPE) of RNA mutants, such as the mutate-and-map (MaM) protocol, give us direct access to the genetic robustness of ncRNA structures. Comparative studies of homologous sequences provide a distinct, yet complementary, approach to analyze structural and functional properties of non-coding RNAs. In this paper, we introduce a formal framework to combine the biochemical signal collected from MaM experiments with the evolutionary information available in multiple sequence alignments. We apply neutral theory principles to detect complex long-range dependencies between nucleotides of a single-stranded RNA, and implement these ideas in software called aRNhAck. We illustrate the biological significance of this signal and show that the nucleotide networks calculated with aRNhAck are correlated with nucleotides located in RNA-RNA, RNA-protein, RNA-DNA and RNA-ligand interfaces. aRNhAck is freely available at http://csb.cs.mcgill.ca/arnhack.
Subjects
Molecular Evolution , Mutation , Nucleic Acid Conformation , RNA/genetics , Software , Algorithms , Binding Sites , Computational Biology/methods , DNA/chemistry , Molecular Models , Protein Binding , Protein Conformation , Proteins/chemistry , RNA/chemistry , Web Browser
ABSTRACT
In recent years, new methods for computational RNA design have been developed and applied to various problems in synthetic biology and nanotechnology. Lately, there is considerable interest in incorporating essential biological information when solving the inverse RNA folding problem. Correspondingly, RNAfbinv aims at including biologically meaningful constraints and is the only program to date that performs a fragment-based design of RNA sequences. In doing so, it allows the design of sequences that do not necessarily fold exactly into the target, as long as the overall coarse-grained tree graph shape is preserved. Augmented by the weighted sampling algorithm of incaRNAtion, our web server called incaRNAfbinv implements the method devised in RNAfbinv and offers an interactive environment for the inverse folding of RNA using a fragment-based design approach. It takes as input: a target RNA secondary structure; optional sequence and motif constraints; optional target minimum free energy, neutrality and GC content. In addition to the design of synthetic regulatory sequences, it can be used as a pre-processing step for the detection of novel naturally occurring RNAs. The two complementary methodologies RNAfbinv and incaRNAtion are merged together and fully implemented in our web server incaRNAfbinv, available at http://www.cs.bgu.ac.il/incaRNAfbinv.
Subjects
Nucleic Acid Conformation , RNA Folding , RNA/chemistry , Software , Algorithms , Base Composition , Base Pairing , Base Sequence , Computer Graphics , Internet , Mutation , RNA/genetics , RNA Sequence Analysis , Thermodynamics
ABSTRACT
Recent releases of genome three-dimensional (3D) structures have the potential to transform our understanding of genomes. Nonetheless, the storage technology and visualization tools need to evolve to offer the scientific community fast and convenient access to these data. We simultaneously introduce a database system to store and query 3D genomic data (3DBG), and a 3D genome browser to visualize and explore 3D genome structures (3DGB). We benchmark 3DBG against state-of-the-art systems and demonstrate that it is faster than previous solutions and, importantly, scales gracefully with the size of the data. We also illustrate the usefulness of our 3D genome Web browser to explore human genome structures. The 3D genome browser is available at http://3dgb.cs.mcgill.ca/.
Subjects
Genetic Databases , Genomics , Computer Graphics , Genes , Retinoblastoma Genes , Human Genome , Humans , Internet , Molecular Models , Single Nucleotide Polymorphism
ABSTRACT
BACKGROUND: Secondary structures form the scaffold of multiple sequence alignments of non-coding RNA (ncRNA) families. An accurate reconstruction of ancestral ncRNAs must use this structural signal. However, the inference of ancestors of a single ncRNA family with a single consensus structure may bias the results towards sequences with high affinity to this structure, which are far from the true ancestors. METHODS: In this paper, we introduce achARNement, a maximum parsimony approach that, given two alignments of homologous ncRNA families with consensus secondary structures and a phylogenetic tree, simultaneously calculates ancestral RNA sequences for these two families. RESULTS: We test our methodology on simulated data sets, and show that achARNement not only outperforms classical maximum parsimony approaches in terms of accuracy, but also reduces the number of candidate sequences by several orders of magnitude. To conclude this study, we apply our algorithms to the Glm clan and the FinP-traJ clan from the Rfam database. CONCLUSIONS: Our results show that our methods reconstruct small sets of high-quality candidate ancestors with better agreement to the two target structures than with classical approaches. Our program is freely available at: http://csb.cs.mcgill.ca/acharnement.