RNA3DB: A structurally-dissimilar dataset split for training and benchmarking deep learning models for RNA structure prediction.
Szikszai, Marcell; Magnus, Marcin; Sanghi, Siddhant; Kadyan, Sachin; Bouatta, Nazim; Rivas, Elena.
Affiliation
  • Szikszai M; Department of Molecular and Cellular Biology, Harvard University, Cambridge, 02138, MA, USA.
  • Magnus M; Department of Molecular and Cellular Biology, Harvard University, Cambridge, 02138, MA, USA.
  • Sanghi S; Department of Systems Biology, Columbia University, New York 10027, NY, USA; College of Biological Sciences, UC Davis, Davis 95616, CA, USA.
  • Kadyan S; Department of Systems Biology, Columbia University, New York 10027, NY, USA.
  • Bouatta N; Laboratory of Systems Pharmacology, Harvard Medical School, Boston 02115, MA, USA.
  • Rivas E; Department of Molecular and Cellular Biology, Harvard University, Cambridge, 02138, MA, USA.
J Mol Biol ; 436(17): 168552, 2024 Sep 01.
Article in En | MEDLINE | ID: mdl-38552946
ABSTRACT
With advances in protein structure prediction thanks to deep learning models like AlphaFold, RNA structure prediction has recently received increased attention from deep learning researchers. RNAs introduce substantial challenges due to the sparser availability and lower structural diversity of experimentally resolved RNA structures compared with protein structures. These challenges are often poorly addressed in the existing literature, much of which reports inflated performance because the training and testing sets used overlap significantly in structure. Further, the most recent Critical Assessment of Structure Prediction (CASP15) has shown that deep learning models for RNA structure are currently outperformed by traditional methods. In this paper we present RNA3DB, a dataset of structured RNAs, derived from the Protein Data Bank (PDB), that is designed for training and benchmarking deep learning models. The RNA3DB method arranges the RNA 3D chains into distinct groups (Components) that are non-redundant with regard to both sequence and structure, providing a robust way of dividing training, validation, and testing sets. Any split of these structurally-dissimilar Components is guaranteed to produce test and validation sets that are distinct by sequence and structure from those in the training set. We provide the RNA3DB dataset, a particular train/test split of the RNA3DB Components (in an approximate 70/30 ratio) that will be updated periodically. We also provide the RNA3DB methodology along with the source code, with the goal of creating a reproducible and customizable tool for producing structurally-dissimilar dataset splits for structural RNAs.
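The idea of splitting at the level of Components rather than individual chains can be illustrated with a minimal sketch. This is not the RNA3DB implementation: the function name `split_components`, the greedy assignment rule, and the toy component and chain identifiers are all assumptions made for illustration. It only shows that when whole groups are assigned to one side or the other, no group ever straddles the train/test boundary, and the resulting ratio is approximate.

```python
from typing import Dict, List, Tuple


def split_components(
    components: Dict[str, List[str]],
    train_fraction: float = 0.7,
) -> Tuple[List[str], List[str]]:
    """Assign whole components to train or test; a component is never split.

    Hypothetical helper for illustration only, not part of RNA3DB.
    """
    total = sum(len(chains) for chains in components.values())
    train: List[str] = []
    test: List[str] = []
    # Visit the largest components first (ties broken by name for determinism).
    for name in sorted(components, key=lambda k: (-len(components[k]), k)):
        chains = components[name]
        # Add the component to train only if train stays within its target size.
        if len(train) + len(chains) <= train_fraction * total:
            train.extend(chains)
        else:
            test.extend(chains)
    return train, test


# Toy input: component names and chain IDs are made up.
toy = {
    "component_0": ["c0_a", "c0_b", "c0_c", "c0_d"],
    "component_1": ["c1_a", "c1_b", "c1_c"],
    "component_2": ["c2_a", "c2_b"],
    "component_3": ["c3_a"],
    "component_4": ["c4_a"],
}
train_chains, test_chains = split_components(toy)
print(len(train_chains), len(test_chains))
```

With this toy input, 7 of 11 chains land in the training set, which is close to but not exactly the 70% target; an exact ratio is generally not achievable when groups must stay intact.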

Full text: 1 Database: MEDLINE Main subject: RNA / Benchmarking / Deep Learning / Nucleic Acid Conformation Language: En Publication year: 2024 Document type: Article
