ABSTRACT
RNA molecules are known to fold into specific structures which often play a central role in their functions and regulation. In silico folding of RNA transcripts, especially when assisted with structure profiling (SP) data, is capable of accurately elucidating relevant structural conformations. However, such methods scale poorly to the swaths of SP data generated by transcriptome-wide experiments, which are becoming more commonplace and advancing our understanding of RNA structure and its regulation at global and local levels. This has created a need for tools capable of rapidly deriving structural assessments from SP data in a scalable manner. One such tool we previously introduced that aims to process such data is patteRNA, a statistical learning algorithm capable of rapidly mining big SP datasets for structural elements. Here, we present a reformulation of patteRNA's pattern recognition scheme that sees significantly improved precision without major compromises to computational overhead. Specifically, we developed a data-driven logistic classifier which interprets patteRNA's statistical characterizations of SP data in addition to local sequence properties as measured with a nearest neighbour thermodynamic model. Application of the classifier to human structurome data reveals a marked association between detected stem-loops and RNA binding protein (RBP) footprints. The results of our application demonstrate that upwards of 30% of RBP footprints occur within loops of stable stem-loop elements. Overall, our work arrives at a rapid and accurate method for automatically detecting families of RNA structure motifs and demonstrates the functional relevance of identifying them transcriptome-wide.
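The classifier described above combines a statistical score derived from SP data with a thermodynamic measure of the local sequence. As a rough illustration only, a two-feature logistic model might look like the sketch below. The function names, the feature scaling, and the weights are hypothetical placeholders, not the fitted values or the interface of patteRNA.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def stem_loop_probability(sp_score, free_energy,
                          w_score=1.0, w_energy=-0.5, bias=-2.0):
    """Probability that a candidate site is a true stem-loop.

    sp_score    -- statistical score of the site from SP data (assumed feature)
    free_energy -- nearest-neighbour folding free energy in kcal/mol (assumed
                   feature; more negative means more stable)
    The weights and bias are illustrative placeholders, not fitted values.
    """
    z = bias + w_score * sp_score + w_energy * free_energy
    return sigmoid(z)


# A site with a high data score and a stable predicted stem scores higher
# than a site with weak data support and no thermodynamic stability.
strong = stem_loop_probability(sp_score=3.0, free_energy=-8.0)
weak = stem_loop_probability(sp_score=0.5, free_energy=0.0)
```

In practice the weights would be learned from sites with known structure, which is what makes the classifier data-driven.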
Subject(s)
Algorithms; Computational Biology/methods; Nucleic Acid Conformation; Nucleotide Motifs; RNA-Binding Proteins/metabolism; RNA/chemistry; RNA/metabolism; Binding Sites; Hep G2 Cells; Humans; K562 Cells; Protein Binding; RNA/genetics; Sequence Analysis, RNA; Transcriptome
ABSTRACT
In single stranded (+)-sense RNA viruses, RNA structural elements (SEs) play essential roles in the infection process from replication to encapsidation. Using selective 2'-hydroxyl acylation analyzed by primer extension sequencing (SHAPE-Seq) and covariation analysis, we explore the structural features of the third genome segment of cucumber mosaic virus (CMV), RNA3 (2216 nt), both in vitro and in plant cell lysates. Comparing SHAPE-Seq and covariation analysis results revealed multiple SEs in the coat protein open reading frame and 3' untranslated region. Four of these SEs were mutated and serially passaged in Nicotiana tabacum plants to identify biologically selected changes to the original mutated sequences. After passaging, loop mutants showed partial reversion to their wild-type sequence and SEs that were structurally disrupted by mutations were restored to wild-type-like structures via synonymous mutations in planta. These results support the existence and selection of virus open reading frame SEs in the host organism and provide a framework for further studies on the role of RNA structure in viral infection. Additionally, this work demonstrates the applicability of high-throughput chemical probing in plant cell lysates and presents a new method for calculating SHAPE reactivities from overlapping reverse transcriptase priming sites.
Subject(s)
Cucumovirus/genetics; RNA, Viral/chemistry; Mutation; Nucleic Acid Conformation
ABSTRACT
Structure dictates the function of many RNAs, but secondary RNA structure analysis is either labor intensive and costly or relies on computational predictions that are often inaccurate. These limitations are alleviated by integration of structure probing data into prediction algorithms. However, existing algorithms are optimized for a specific type of probing data. Recently, new chemistries combined with advances in sequencing have facilitated structure probing at unprecedented scale and sensitivity. These novel technologies and anticipated wealth of data highlight a need for algorithms that readily accommodate more complex and diverse input sources. We implemented and investigated a recently outlined probabilistic framework for RNA secondary structure prediction and extended it to accommodate further refinement of structural information. This framework utilizes direct likelihood-based calculations of pseudo-energy terms per considered structural context and can readily accommodate diverse data types and complex data dependencies. We use real data in conjunction with simulations to evaluate performances of several implementations and to show that proper integration of structural contexts can lead to improvements. Our tests also reveal discrepancies between real data and simulations, which we show can be alleviated by refined modeling. We then propose statistical preprocessing approaches to standardize data interpretation and integration into such a generic framework. We further systematically quantify the information content of data subsets, demonstrating that high reactivities are major drivers of SHAPE-directed predictions and that better understanding of less informative reactivities is key to further improvements. Finally, we provide evidence for the adaptive capability of our framework using mock probe simulations.
Subject(s)
Models, Chemical; Nucleic Acid Conformation; Probability; RNA/chemistry; Likelihood Functions
ABSTRACT
Summary: To serve numerous functional roles, RNA must fold into specific structures. Determining these structures is thus of paramount importance. The recent advent of high-throughput sequencing-based structure profiling experiments has provided important insights into RNA structure and widened the scope of RNA studies. However, as a broad range of approaches continues to emerge, a universal framework is needed to quantitatively ensure consistent and high-quality data. We present SEQualyzer, a visual and interactive application that makes it easy and efficient to gauge data quality, screen for transcripts with high-quality information and identify discordant replicates in structure profiling experiments. Our methods rely on features common to a wide range of protocols and can serve as standards for quality control and analyses. Availability and Implementation: SEQualyzer is written in R, is platform-independent, and is freely available at http://bme.ucdavis.edu/aviranlab/SEQualyzer. Contact: saviran@ucdavis.edu Supplementary Information: Supplementary data are available at Bioinformatics online.
Subject(s)
High-Throughput Nucleotide Sequencing/methods; Quality Control; Sequence Analysis, RNA/methods; Software; High-Throughput Nucleotide Sequencing/standards; Sequence Analysis, RNA/standards
ABSTRACT
Most RNA molecules form internal base pairs, leading to a folded secondary structure. Some of these structures have been demonstrated to be functionally significant. High-throughput RNA structure chemical probing methods generate millions of sequencing reads to provide structural constraints for RNA secondary structure prediction. At present, processed data from these experiments are difficult to access without computational expertise. Here we present FoldAtlas, a web interface for accessing raw and processed structural data across thousands of transcripts. FoldAtlas allows a researcher to easily locate, view, and retrieve probing data for a given RNA molecule. We also provide in silico and in vivo secondary structure predictions for comparison, visualized in the browser as circle plots and topology diagrams. Data currently integrated into FoldAtlas are from a new high-depth Structure-seq data analysis in Arabidopsis thaliana, released with this work. AVAILABILITY AND IMPLEMENTATION: The FoldAtlas website can be accessed at www.foldatlas.com. Source code is freely available at github.com/mnori/foldatlas under the MIT license. Raw reads data are available under the NCBI SRA accession SRP066985. CONTACT: yiliang.ding@jic.ac.uk or matthew.norris@jic.ac.uk SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Databases, Nucleic Acid; RNA/metabolism; Arabidopsis/metabolism; Computer Simulation; Nucleic Acid Conformation; RNA/chemistry; RNA, Plant/chemistry; RNA, Plant/metabolism; Sequence Analysis, RNA
ABSTRACT
MOTIVATION: The diverse functionalities of RNA can be attributed to its capacity to form complex and varied structures. The recent proliferation of new structure probing techniques coupled with high-throughput sequencing has helped RNA studies expand in both scope and depth. Despite differences in techniques, most experiments face similar challenges in reproducibility due to the stochastic nature of chemical probing and sequencing. As these protocols expand to transcriptome-wide studies, quality control becomes a more daunting task. General and efficient methodologies are needed to quantify variability and quality in the wide range of current and emerging structure probing experiments. RESULTS: We develop metrics to rapidly and quantitatively evaluate data quality from structure probing experiments, demonstrating their efficacy on both small synthetic libraries and transcriptome-wide datasets. We use a signal-to-noise ratio concept to evaluate replicate agreement, which has the capacity to identify high-quality data. We also consider and compare two methods to assess variability inherent in probing experiments, which we then utilize to evaluate the coverage adjustments needed to meet desired quality. The developed metrics and tools will be useful in summarizing large-scale datasets and will help standardize quality control in the field. AVAILABILITY AND IMPLEMENTATION: The data and methods used in this article are freely available at: http://bme.ucdavis.edu/aviranlab/SPEQC_software CONTACT: saviran@ucdavis.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
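The abstract does not spell out how the signal-to-noise ratio is computed. One plausible reading, offered here only as an assumption, is a per-nucleotide ratio of the mean reactivity across replicates to its standard deviation. The sketch below illustrates that reading; the function name and the toy data are hypothetical and do not reproduce the published software.

```python
from statistics import mean, stdev


def per_nucleotide_snr(replicates):
    """Return mean/stdev across replicates for each nucleotide position.

    replicates -- list of equal-length lists of reactivities, one per replicate.
    This mean-over-standard-deviation definition is an assumed, illustrative
    form of a signal-to-noise ratio, not necessarily the published metric.
    """
    if len(replicates) < 2:
        raise ValueError("at least two replicates are needed")
    if len({len(r) for r in replicates}) != 1:
        raise ValueError("replicates must have equal length")
    snrs = []
    for values in zip(*replicates):
        sd = stdev(values)
        # A position with identical values in all replicates has no noise.
        snrs.append(float("inf") if sd == 0 else mean(values) / sd)
    return snrs


# Three toy replicates over three nucleotides (made-up numbers).
toy_replicates = [
    [1.0, 0.2, 0.0],
    [1.2, 0.1, 0.4],
    [0.8, 0.3, 0.2],
]
toy_snr = per_nucleotide_snr(toy_replicates)
```

Positions with high, reproducible reactivity receive a high ratio, while positions where replicates disagree relative to their mean receive a low one, which is the sense in which such a ratio can flag discordant replicates.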
Subject(s)
Computational Biology/methods; RNA/chemistry; Sequence Analysis, RNA/methods; Models, Statistical; Quality Control; Reproducibility of Results; Sequence Analysis, RNA/standards; Signal-To-Noise Ratio
ABSTRACT
Structure mapping is a classic experimental approach for determining nucleic acid structure that has gained renewed interest in recent years following advances in chemistry, genomics, and informatics. The approach encompasses numerous techniques that use different means to introduce nucleotide-level modifications in a structure-dependent manner. Modifications are assayed via cDNA fragment analysis, using electrophoresis or next-generation sequencing (NGS). The recent advent of NGS has dramatically increased the throughput, multiplexing capacity, and scope of RNA structure mapping assays, thereby opening new possibilities for genome-scale, de novo, and in vivo studies. From an informatics standpoint, NGS is more informative than prior technologies by virtue of delivering direct molecular measurements in the form of digital sequence counts. Motivated by these new capabilities, we introduce a novel model-based in silico approach for quantitative design of large-scale multiplexed NGS structure mapping assays, which takes advantage of the direct and digital nature of NGS readouts. We use it to characterize the relationship between controllable experimental parameters and the precision of mapping measurements. Our results highlight the complexity of these dependencies and shed light on relevant tradeoffs and pitfalls, which can be difficult to discern by intuition alone. We demonstrate our approach by quantitatively assessing the robustness of SHAPE-Seq measurements, obtained by multiplexing SHAPE (selective 2'-hydroxyl acylation analyzed by primer extension) chemistry in conjunction with NGS. We then utilize it to elucidate design considerations in advanced genome-wide approaches for probing the transcriptome, which recently obtained in vivo information using dimethyl sulfate (DMS) chemistry.
Subject(s)
Genomics; High-Throughput Nucleotide Sequencing; Nucleic Acid Conformation; Transcriptome/genetics; Computational Biology/methods; DNA, Complementary/genetics; Sequence Analysis, RNA
ABSTRACT
New regulatory roles continue to emerge for both natural and engineered noncoding RNAs, many of which have specific secondary and tertiary structures essential to their function. Thus there is a growing need to develop technologies that enable rapid characterization of structural features within complex RNA populations. We have developed a high-throughput technique, SHAPE-Seq, that can simultaneously measure quantitative, single nucleotide-resolution secondary and tertiary structural information for hundreds of RNA molecules of arbitrary sequence. SHAPE-Seq combines selective 2'-hydroxyl acylation analyzed by primer extension (SHAPE) chemistry with multiplexed paired-end deep sequencing of primer extension products. This generates millions of sequencing reads, which are then analyzed using a fully automated data analysis pipeline, based on a rigorous maximum likelihood model of the SHAPE-Seq experiment. We demonstrate the ability of SHAPE-Seq to accurately infer secondary and tertiary structural information, detect subtle conformational changes due to single nucleotide point mutations, and simultaneously measure the structures of a complex pool of different RNA molecules. SHAPE-Seq thus represents a powerful step toward making the study of RNA secondary and tertiary structures high throughput and accessible to a wide array of scientific pursuits, from fundamental biological investigations to engineering RNA for synthetic biological systems.
Subject(s)
High-Throughput Nucleotide Sequencing/methods; Nucleic Acid Conformation; RNA/chemistry; RNA/genetics; Sequence Analysis, RNA/methods; Bacillus subtilis/enzymology; Bacillus subtilis/genetics; Base Sequence; Computational Biology; DNA Barcoding, Taxonomic; DNA Primers/genetics; High-Throughput Nucleotide Sequencing/statistics & numerical data; Models, Molecular; Molecular Probes; Molecular Sequence Data; Molecular Structure; Point Mutation; RNA, Catalytic/chemistry; RNA, Catalytic/genetics; Ribonuclease P/chemistry; Ribonuclease P/genetics; Sequence Analysis, RNA/statistics & numerical data
ABSTRACT
Sequence census methods reduce molecular measurements such as transcript abundance and protein-nucleic acid interactions to counting problems via DNA sequencing. We focus on a novel assay utilizing this approach, called selective 2'-hydroxyl acylation analyzed by primer extension sequencing (SHAPE-Seq), that can be used to characterize RNA secondary and tertiary structure. We describe a fully automated data analysis pipeline for SHAPE-Seq analysis that includes read processing, mapping, and structural inference based on a model of the experiment. Our methods rely on the solution of a series of convex optimization problems for which we develop efficient and effective numerical algorithms. Our results can be easily extended to other chemical probes of RNA structure, and also generalized to modeling polymerase drop-off in other sequence census-based experiments.
Subject(s)
Nucleic Acid Conformation; RNA/chemistry; RNA/genetics; Sequence Analysis, RNA/methods; Algorithms; Automation; Computational Biology; Likelihood Functions; Models, Molecular; Plasmids/chemistry; Plasmids/genetics; RNA, Bacterial/chemistry; RNA, Bacterial/genetics; Sequence Analysis, RNA/statistics & numerical data; Staphylococcus aureus/chemistry; Staphylococcus aureus/genetics
ABSTRACT
RNA structure probing experiments have emerged over the last decade as a straightforward way to determine the structure of RNA molecules in a number of different contexts. Although powerful, the ability of RNA to dynamically interconvert between, and to simultaneously populate, alternative structural configurations, poses a nontrivial challenge to the interpretation of data derived from these experiments. Recent efforts aimed at developing computational methods for the reconstruction of coexisting alternative RNA conformations from structure probing data are paving the way to the study of RNA structure ensembles, even in the context of living cells. In this review, we critically discuss these methods, their limitations and possible future improvements.
Subject(s)
Computational Biology; Nucleic Acid Conformation; RNA; Computational Biology/methods; RNA/chemistry; RNA/genetics; Thermodynamics
ABSTRACT
Background: Although combination immunotherapies incorporating local and systemic components have shown promising results in treating solid tumors, varied tumor microenvironments (TMEs) can impact immunotherapeutic efficacy. Method: We designed and evaluated treatment strategies for breast and pancreatic cancer combining magnetic resonance-guided focused ultrasound (MRgFUS) ablation and antibody therapies. With a combination of single-cell sequencing, spectral flow cytometry, and histological analyses, we profiled an immune-suppressed KPC (Kras+/LSL-G12D; Trp53+/LSL-R172H; Pdx1-Cre) pancreatic adenocarcinoma (MT4) model and a dense epithelial neu deletion (NDL) HER2+ mammary adenocarcinoma model with a greater fraction of lymphocytes, natural killer cells and activated dendritic cells. We then performed gene ontology analysis, spectral and digital cytometry to assess the immune response to combination immunotherapies and correlation with survival studies. Result: Based on gene ontology analysis, adding ablation to immunotherapy enriched immune cell migration pathways in the pancreatic cancer model and extensively enriched wound healing pathways in the breast cancer model. With CIBERSORTx digital cytometry, aCD40 + aPD-1 immunotherapy combinations enhanced dendritic cell activation in both models. In the MT4 TME, adding the combination of aCD40 antibody and checkpoint inhibitors (aPD-1 and aCTLA-4) with ablation was synergistic, increasing activated natural killer cells and T cells in distant tumors. Furthermore, ablation with immunotherapy upregulated critical Ly6c myeloid remodeling phenotypes that enhance T-cell effector function and increased granzyme and protease encoding genes by as much as 100-fold. Ablation combined with immunotherapy then extended survival in the MT4 model to a greater extent than immunotherapy alone. 
Conclusion: In summary, TME profiling informed a successful multicomponent treatment protocol incorporating ablation and facilitated differentiation of TMEs in which ablation is most effective.
Subject(s)
Adenocarcinoma; Pancreatic Neoplasms; Mice; Animals; Pancreatic Neoplasms/therapy; Immunotherapy; Immunologic Factors; Tumor Microenvironment; Pancreatic Neoplasms
ABSTRACT
Gene therapy is an emerging alternative to conventional anti-HIV-1 drugs, and can potentially control the virus while alleviating major limitations of current approaches. Yet, HIV-1's ability to rapidly acquire mutations and escape therapy presents a critical challenge to any novel treatment paradigm. Viral escape is thus a key consideration in the design of any gene-based technique. We develop a computational model of HIV's evolutionary dynamics in vivo in the presence of a genetic therapy to explore the impact of therapy parameters and strategies on the development of resistance. Our model is generic and captures the properties of a broad class of gene-based agents that inhibit early stages of the viral life cycle. We highlight the differences in viral resistance dynamics between gene and standard antiretroviral therapies, and identify key factors that impact long-term viral suppression. In particular, we underscore the importance of mutationally induced viral fitness losses in cells that are not genetically modified, as these can severely constrain the replication of resistant virus. We also propose and investigate a novel treatment strategy that leverages gene therapy's unique capacity to deliver different genes to distinct cell populations, and we find that such a strategy can dramatically improve efficacy when used judiciously within a certain parametric regime. Finally, we revisit a previously suggested idea of improving clinical outcomes by boosting the proliferation of the genetically modified cells, but we find that such an approach has mixed effects on resistance dynamics. Our results provide insights into the short- and long-term effects of gene therapy and the role of its key properties in the evolution of resistance, which can serve as guidelines for the choice and optimization of effective therapeutic agents.
Subject(s)
Genetic Therapy/methods; HIV Infections/therapy; HIV-1/genetics; Models, Genetic; Anti-HIV Agents/therapeutic use; Antiretroviral Therapy, Highly Active; Cell Proliferation; Drug Resistance, Viral/drug effects; Drug Resistance, Viral/genetics; Genetic Fitness; HIV Infections/drug therapy; HIV Infections/virology; HIV-1/drug effects; Humans; Mutation; Time; Virus Replication/genetics
ABSTRACT
The functions of RNA are often tied to its structure, hence analyzing structure is of significant interest when studying cellular processes. Recently, large-scale structure probing (SP) studies have enabled assessment of global structure-function relationships via standard data summarizations or local folding. Here, we approach structure quantification from a hairpin-centric perspective where putative hairpins are identified in SP datasets and used as a means to capture local structural effects. This has the advantage of rapid processing of big (e.g. transcriptome-wide) data as RNA folding is circumvented, yet it captures more information than simple data summarizations. We reformulate a statistical learning algorithm we previously developed to significantly improve precision of hairpin detection, then introduce a novel nucleotide-wise measure, termed the hairpin-derived structure level (HDSL), which captures local structuredness by accounting for the presence of likely hairpin elements. Applying HDSL to data from recent studies recapitulates, strengthens and expands on their findings which were obtained by more comprehensive folding algorithms, yet our analyses are orders of magnitude faster. These results demonstrate that hairpin detection is a promising avenue for global and rapid structure-function analysis, furthering our understanding of RNA biology and the principal features which drive biological insights from SP data.
ABSTRACT
RNase P and MRP are highly conserved, multi-protein/RNA complexes with essential roles in processing ribosomal and tRNAs. Three proteins found in both complexes, Pop1, Pop6, and Pop7, are also telomerase-associated. Here, we determine how temperature-sensitive POP1 and POP6 alleles affect yeast telomerase. At permissive temperatures, mutant Pop1/6 have little or no effect on cell growth, global protein levels, the abundance of Est1 and Est2 (telomerase proteins), and the processing of TLC1 (telomerase RNA). However, in pop mutants, TLC1 is more abundant, telomeres are short, and TLC1 accumulates in the cytoplasm. Although Est1/2 binding to TLC1 occurs at normal levels, Est1 (and hence Est3) binding is highly unstable. We propose that Pop-mediated stabilization of Est1 binding to TLC1 is a prerequisite for formation and nuclear localization of the telomerase holoenzyme. Furthermore, Pop proteins affect TLC1 and the RNA subunits of RNase P/MRP in very different ways.
Subject(s)
Ribonuclease P/metabolism; Ribonucleoproteins/metabolism; Saccharomyces cerevisiae Proteins/metabolism; Saccharomyces cerevisiae/metabolism; Telomerase/metabolism; Telomere/metabolism; Cell Nucleus/genetics; Cell Nucleus/metabolism; DNA-Binding Proteins/genetics; DNA-Binding Proteins/metabolism; Methylation; Protein Binding; RNA/metabolism; RNA 3' End Processing/genetics; Ribonuclease P/genetics; Ribonucleoproteins/genetics; Saccharomyces cerevisiae Proteins/genetics; Telomerase/genetics; Telomere/chemistry
ABSTRACT
RNA biology is being revolutionized by recent developments of diverse high-throughput technologies for transcriptome-wide profiling of molecular RNA structures. RNA structurome profiling data can be used to identify differentially structured regions between groups of samples. Existing methods are limited in scope to specific technologies and/or do not account for biological variation. Here, we present dStruct, which is the first broadly applicable method for differential analysis accounting for biological variation in structurome profiling data. dStruct is compatible with diverse profiling technologies, is validated with experimental data and simulations, and outperforms existing methods.
Subject(s)
Genomics/methods; RNA/chemistry; RNA/metabolism; Software; Molecular Structure; Polymorphism, Single Nucleotide; RNA/genetics; Transcriptome
ABSTRACT
RNA helicases are a class of enzymes that unwind RNA duplexes in vitro but whose cellular functions are largely enigmatic. Here, we provide evidence that the DEAD-box protein Dbp2 remodels RNA-protein complex (RNP) structure to facilitate efficient termination of transcription in Saccharomyces cerevisiae via the Nrd1-Nab3-Sen1 (NNS) complex. First, we find that loss of DBP2 results in RNA polymerase II accumulation at the 3' ends of small nucleolar RNAs and a subset of mRNAs. In addition, Dbp2 associates with RNA sequence motifs and regions bound by Nrd1 and can promote its recruitment to NNS-targeted regions. Using Structure-seq, we find altered RNA/RNP structures in dbp2∆ cells that correlate with inefficient termination. We also show a positive correlation between the stability of structures in the 3' ends and a requirement for Dbp2 in termination. Taken together, these studies provide a role for RNA remodeling by Dbp2 and further suggest a mechanism whereby RNA structure is exploited for gene regulation.
Subject(s)
DEAD-box RNA Helicases/metabolism; RNA, Messenger/metabolism; RNA, Small Nucleolar/metabolism; RNA-Binding Proteins/metabolism; Saccharomyces cerevisiae Proteins/metabolism; Saccharomyces cerevisiae/metabolism; Transcription Termination, Genetic; DNA Helicases/metabolism; Gene Expression Regulation, Fungal; Nuclear Proteins/metabolism; RNA Helicases/metabolism; RNA Polymerase II/metabolism; Saccharomyces cerevisiae/enzymology; Saccharomyces cerevisiae/genetics; Sequence Analysis, RNA
ABSTRACT
The originally published version of this Article contained an error in Figure 2, due to a typesetting error. Panels d and e were positioned such that the locations of the mutations in panel d did not align correctly with the corresponding nucleotides in the reactivity profile in panel e. This has now been corrected in both the PDF and HTML versions of the Article.
ABSTRACT
Establishing a link between RNA structure and function remains a great challenge in RNA biology. The emergence of high-throughput structure profiling experiments is revolutionizing our ability to decipher structure, yet principled approaches for extracting information on structural elements directly from these data sets are lacking. We present PATTERNA, an unsupervised pattern recognition algorithm that rapidly mines RNA structure motifs from profiling data. We demonstrate that PATTERNA detects motifs with an accuracy comparable to commonly used thermodynamic models and highlight its utility in automating data-directed structure modeling from large data sets. PATTERNA is versatile and compatible with diverse profiling techniques and experimental conditions.
Subject(s)
Algorithms; RNA/chemistry; Transcriptome; Models, Statistical; Nucleotide Motifs
ABSTRACT
RNA plays key regulatory roles in diverse cellular processes, where its functionality often derives from folding into and converting between structures. Many RNAs further rely on co-existence of alternative structures, which govern their response to cellular signals. However, characterizing heterogeneous landscapes is difficult, both experimentally and computationally. Recently, structure profiling experiments have emerged as powerful and affordable structure characterization methods, which improve computational structure prediction. To date, efforts have centered on predicting one optimal structure, with much less progress made on multiple-structure prediction. Here, we report a probabilistic modeling approach that predicts a parsimonious set of co-existing structures and estimates their abundances from structure profiling data. We demonstrate robust landscape reconstruction and quantitative insights into structural dynamics by analyzing numerous data sets. This work establishes a framework for data-directed characterization of structure landscapes to aid experimentalists in performing structure-function studies.
Subject(s)
Models, Chemical; Models, Statistical; Molecular Structure; RNA/chemistry; Riboswitch
ABSTRACT
RNA SHAPE experiments have become important and successful sources of information for RNA structure prediction. In such experiments, chemical reagents are used to probe RNA backbone flexibility at the nucleotide level, which in turn provides information on base pairing and therefore secondary structure. Little is known, however, about the statistics of such SHAPE data. In this work, we explore different representations of noise in SHAPE data and propose a statistically sound framework for extracting reliable reactivity information from multiple SHAPE replicates. Our analyses of RNA SHAPE experiments underscore that a normal noise model is not adequate to represent their data. We propose instead a log-normal representation of noise and discuss its relevance. Under this assumption, we observe that processing simulated SHAPE data by directly averaging different replicates leads to bias. Such bias can be reduced by analyzing the data following a log transformation, either by log-averaging or Kalman filtering. Application of Kalman filtering has the additional advantage that a prior on the nucleotide reactivities can be introduced. We show that the performance of Kalman filtering is then directly dependent on the quality of that prior. We conclude the paper with guidelines on signal processing of RNA SHAPE data.
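The bias from direct averaging under log-normal noise can be seen in a small simulation. The sketch below assumes multiplicative noise of the form exp(N(0, sigma^2)) applied to a single true reactivity. The noise level, sample size, and function names are illustrative choices, not values from the paper, and Kalman filtering is not shown.

```python
import math
import random


def simulate_replicates(true_reactivity, n_replicates, sigma, rng):
    """Draw replicates with multiplicative log-normal noise (assumed model)."""
    return [true_reactivity * math.exp(rng.gauss(0.0, sigma))
            for _ in range(n_replicates)]


def direct_average(values):
    """Arithmetic mean of the raw reactivities."""
    return sum(values) / len(values)


def log_average(values):
    """Average in log space, then transform back (geometric mean)."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


rng = random.Random(0)   # fixed seed so the run is repeatable
true_reactivity = 1.0    # illustrative true value
sigma = 0.5              # illustrative noise level in log space

replicates = simulate_replicates(true_reactivity, 20000, sigma, rng)
direct = direct_average(replicates)
log_avg = log_average(replicates)

# Under this model the raw mean converges to true * exp(sigma**2 / 2),
# i.e. it overestimates the true value, whereas the log-average converges
# to the true value itself.
expected_direct = true_reactivity * math.exp(sigma ** 2 / 2)
```

With many replicates the direct average settles near `expected_direct`, above the true value, while the log-average settles near the true reactivity. Real experiments have far fewer replicates, so both estimates are noisier, but the direction of the bias is the same under this noise model.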