ABSTRACT
The Long-read RNA-Seq Genome Annotation Assessment Project Consortium was formed to evaluate the effectiveness of long-read approaches for transcriptome analysis. Using different protocols and sequencing platforms, the consortium generated over 427 million long-read sequences from complementary DNA and direct RNA datasets, encompassing human, mouse and manatee species. Developers utilized these data to address challenges in transcript isoform detection, quantification and de novo transcript detection. The study revealed that libraries with longer, more accurate sequences produce more accurate transcripts than those with increased read depth, whereas greater read depth improved quantification accuracy. In well-annotated genomes, tools based on reference sequences demonstrated the best performance. Incorporating additional orthogonal data and replicate samples is advised when aiming to detect rare and novel transcripts or using reference-free approaches. This collaborative study offers a benchmark for current practices and provides direction for future method development in transcriptome analysis.
Subject(s)
Gene Expression Profiling , RNA-Seq , Humans , Animals , Mice , RNA-Seq/methods , Gene Expression Profiling/methods , Transcriptome , Sequence Analysis, RNA/methods , Molecular Sequence Annotation/methods
ABSTRACT
SUMMARY: Long non-coding RNAs (lncRNAs) can act as molecular sponges or decoys for an RNA-binding protein (RBP) through their RBP-binding sites, thereby modulating the expression of all target genes of the corresponding RBP of interest. Here, we present a web tool named RBPSponge to explore lncRNAs based on their potential to act as a sponge for an RBP of interest. RBPSponge identifies the occurrences of RBP-binding sites and CLIP peaks on lncRNAs, and enables users to run statistical analyses to investigate the regulatory network between lncRNAs, RBPs and targets of RBPs. AVAILABILITY AND IMPLEMENTATION: The web server is available at https://www.RBPSponge.com. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
RNA, Long Noncoding/genetics , Binding Sites , Genome , Protein Binding , RNA-Binding Proteins , Software
ABSTRACT
Recent studies show that RNA-binding proteins (RBPs) and microRNAs (miRNAs) function in coordination with each other to control post-transcriptional regulation (PTR). Despite this, the majority of research to date has focused on the regulatory effect of individual RBPs or miRNAs. Here, we mapped both RBP and miRNA binding sites on human 3'UTRs and utilized this collection to better understand PTR. We show that the transcripts that lack competition for HuR binding are destabilized more after HuR depletion. We also confirm this finding for PUM1(2) by measuring genome-wide expression changes following the knockdown of PUM1(2) in HEK293 cells. Next, to find potential cooperative interactions, we identified the pairs of factors whose sites co-localize more often than expected by random chance. Upon examining these results for PUM1(2), we found that transcripts where the sites of PUM1(2) and its interacting miRNA form a stem-loop are more stabilized upon PUM1(2) depletion. Finally, using dinucleotide frequency and counts of regulatory sites as features in a regression model, we achieved an AU-ROC of 0.86 in predicting mRNA half-life in BEAS-2B cells. Altogether, our results suggest that future studies of PTR must consider the combined effects of RBPs and miRNAs, as well as their interactions.
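One way to read "co-localize more often than expected by random chance" is as a permutation test on site positions. The sketch below is only an illustration under stated assumptions, not the authors' procedure: the function names, the 50-nt window, and the uniform shuffling of one factor's sites within each 3'UTR are all hypothetical choices.

```python
import random

def count_colocalized(sites_a, sites_b, max_dist):
    """Count A sites that have at least one B site within max_dist nucleotides."""
    return sum(any(abs(a - b) <= max_dist for b in sites_b) for a in sites_a)

def colocalization_pvalue(utrs, max_dist=50, n_perm=1000, seed=0):
    """Empirical test for co-localization of two factors' binding sites.

    utrs: list of (utr_length, sites_a, sites_b), positions 0-based.
    Null model (an assumption of this sketch): B sites are placed
    uniformly at random within the same UTR; A sites stay fixed.
    Returns (observed_count, empirical_p_value).
    """
    rng = random.Random(seed)
    observed = sum(count_colocalized(a, b, max_dist) for _, a, b in utrs)
    hits = 0
    for _ in range(n_perm):
        null = 0
        for length, a, b in utrs:
            shuffled_b = [rng.randrange(length) for _ in b]
            null += count_colocalized(a, shuffled_b, max_dist)
        if null >= observed:
            hits += 1
    # Add-one correction keeps the empirical p-value strictly positive.
    return observed, (hits + 1) / (n_perm + 1)
```

A real analysis would likely need a null model that preserves sequence composition and site density, which this uniform shuffle does not.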
Subject(s)
MicroRNAs/genetics , RNA Processing, Post-Transcriptional/genetics , RNA, Messenger/genetics , RNA-Binding Proteins/metabolism , 3' Untranslated Regions/genetics , Binding Sites/genetics , Cell Line, Tumor , Chromosome Mapping , Computational Biology/methods , HEK293 Cells , Half-Life , HeLa Cells , Humans , MCF-7 Cells , Nucleic Acid Conformation , RNA Interference , RNA, Small Interfering/genetics , RNA-Binding Proteins/genetics , Transcription, Genetic/genetics
ABSTRACT
Enabled by the explosion of data and a substantial increase in computational power, deep learning has transformed fields such as computer vision and natural language processing (NLP), and it has been applied successfully to many transcriptomic analysis tasks. A core advantage of deep learning is its inherent capability to incorporate feature computation within the machine learning models. This results in a comprehensive and machine-readable representation of sequences, facilitating downstream classification and clustering tasks. Compared to machine translation problems in NLP, feature embedding is particularly challenging for transcriptomic studies because the sequences are strings thousands of nucleotides long, which makes long-term dependencies between features from different parts of the sequence even more difficult to capture. This highlights the need for nucleotide sequence embedding methods that are capable of learning input sequence features implicitly. Here we introduce ntEmbd, a deep learning embedding tool that captures dependencies between different features of the sequences and learns a latent representation for given nucleotide sequences. We further provide two sample use cases, describing how learned RNA features can be used in downstream analysis. The first use case demonstrates ntEmbd's utility in classifying coding and noncoding RNA, benchmarked against existing tools, and the second explores the utility of learned representations in identifying adapter sequences in nanopore RNA-seq reads. The tool and the trained models are freely available on GitHub at https://github.com/bcgsc/ntEmbd.
ABSTRACT
BACKGROUND: Nanopore sequencing is crucial to metagenomic studies as its kilobase-long reads can contribute to resolving genomic structural differences among microbes. However, sequencing platform-specific challenges, including high base-call error rate, nonuniform read lengths, and the presence of chimeric artifacts, necessitate specifically designed analytical algorithms. The use of simulated datasets with characteristics that are true to the sequencing platform under evaluation is a cost-effective way to assess the performance of bioinformatics tools with the ground truth in a controlled environment. RESULTS: Here, we present Meta-NanoSim, a fast and versatile utility that characterizes and simulates the unique properties of nanopore metagenomic reads. It improves upon state-of-the-art methods on microbial abundance estimation through a base-level quantification algorithm. Meta-NanoSim can simulate complex microbial communities composed of both linear and circular genomes and can stream reference genomes from online servers directly. Simulated datasets showed high congruence with experimental data in terms of read length, error profiles, and abundance levels. We demonstrate that Meta-NanoSim simulated data can facilitate the development of metagenomic algorithms and guide experimental design through a metagenome assembly benchmarking task. CONCLUSIONS: The Meta-NanoSim characterization module investigates read features, including chimeric information and abundance levels, while the simulation module simulates large and complex multisample microbial communities with different abundance profiles. All trained models and the software are freely accessible at GitHub: https://github.com/bcgsc/NanoSim.
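To illustrate why a base-level quantification can differ from a read-count one when read lengths are nonuniform, here is a minimal sketch. It is not Meta-NanoSim's algorithm; the `abundance` function and its input format (one `(genome_id, aligned_length)` pair per aligned read) are assumptions made for this example.

```python
from collections import defaultdict

def abundance(alignments, level="base"):
    """Relative abundance per genome from (genome_id, aligned_length) pairs.

    level="base": each alignment contributes its number of aligned bases.
    level="read": each alignment contributes one count, whatever its length.
    """
    totals = defaultdict(float)
    for genome, aligned_len in alignments:
        totals[genome] += aligned_len if level == "base" else 1
    grand_total = sum(totals.values())
    return {genome: value / grand_total for genome, value in totals.items()}

# Hypothetical toy input: genome A yields two short reads, genome B one long read.
toy = [("A", 1000), ("A", 1000), ("B", 6000)]
read_level = abundance(toy, level="read")   # A: 2/3, B: 1/3
base_level = abundance(toy, level="base")   # A: 0.25, B: 0.75
```

In this toy input, counting reads ranks genome A higher, while counting aligned bases ranks genome B higher, which is the kind of discrepancy nonuniform read lengths can introduce.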
Subject(s)
Nanopore Sequencing , Nanopores , Metagenome , Nanopore Sequencing/methods , Sequence Analysis, DNA/methods , Computer Simulation , Metagenomics/methods , Software , Algorithms , High-Throughput Nucleotide Sequencing/methods
ABSTRACT
Long-read sequencing technologies have improved significantly since their emergence. Their read lengths, potentially spanning entire transcripts, are advantageous for reconstructing transcriptomes. Existing long-read transcriptome assembly methods are primarily reference-based and, to date, there has been little focus on reference-free transcriptome assembly. We introduce RNA-Bloom2 (https://github.com/bcgsc/RNA-Bloom), a reference-free assembly method for long-read transcriptome sequencing data. Using simulated datasets and spike-in control data, we show that the transcriptome assembly quality of RNA-Bloom2 is competitive with that of reference-based methods. Furthermore, we find that RNA-Bloom2 requires 27.0 to 80.6% of the peak memory and 3.6 to 10.8% of the total wall-clock runtime of a competing reference-free method. Finally, we showcase RNA-Bloom2 in assembling a transcriptome sample of Picea sitchensis (Sitka spruce). Since our method does not rely on a reference, it further sets the groundwork for large-scale comparative transcriptomics where high-quality draft genome assemblies are not readily available.
Subject(s)
RNA , Transcriptome , Transcriptome/genetics , High-Throughput Nucleotide Sequencing/methods , Gene Expression Profiling/methods , Sequence Analysis, RNA/methods
ABSTRACT
The Long-read RNA-Seq Genome Annotation Assessment Project (LRGASP) Consortium was formed to evaluate the effectiveness of long-read approaches for transcriptome analysis. The consortium generated over 427 million long-read sequences from cDNA and direct RNA datasets, encompassing human, mouse, and manatee species, using different protocols and sequencing platforms. These data were utilized by developers to address challenges in transcript isoform detection and quantification, as well as de novo transcript isoform identification. The study revealed that libraries with longer, more accurate sequences produce more accurate transcripts than those with increased read depth, whereas greater read depth improved quantification accuracy. In well-annotated genomes, tools based on reference sequences demonstrated the best performance. When aiming to detect rare and novel transcripts or when using reference-free approaches, incorporating additional orthogonal data and replicate samples is advised. This collaborative study offers a benchmark for current practices and provides direction for future method development in transcriptome analysis.
ABSTRACT
Recent advances in single-cell RNA sequencing technologies have made detection of transcripts in single cells possible. The level of resolution provided by these technologies can be used to study changes in transcript usage across cell populations and help investigate new biology. Here, we introduce RNA-Scoop, an interactive cell cluster and transcriptome visualization tool to analyze transcript usage across cell categories and clusters. The tool allows users to examine differential transcript expression across clusters and investigate how usage of specific transcript expression mechanisms varies across cell groups.
ABSTRACT
BACKGROUND: Compared with second-generation sequencing technologies, third-generation single-molecule RNA sequencing has unprecedented advantages; the long reads it generates facilitate isoform-level transcript characterization. In particular, the Oxford Nanopore Technologies sequencing platforms have become more popular in recent years owing to their relatively high affordability and portability compared with other third-generation sequencing technologies. To aid the development of analytical tools that leverage the power of this technology, simulated data provide a cost-effective solution with ground truth. However, a nanopore sequence simulator targeting transcriptomic data is not available yet. FINDINGS: We introduce Trans-NanoSim, a tool that simulates reads with technical and transcriptome-specific features learnt from nanopore RNA-sequencing data. We comprehensively benchmarked Trans-NanoSim on direct RNA and complementary DNA datasets describing human and mouse transcriptomes. Through comparison against other nanopore read simulators, we show the unique advantage and robustness of Trans-NanoSim in capturing the characteristics of nanopore complementary DNA and direct RNA reads. CONCLUSIONS: As a cost-effective alternative to sequencing real transcriptomes, Trans-NanoSim will facilitate the rapid development of analytical tools for nanopore RNA-sequencing data. Trans-NanoSim and its pre-trained models are freely accessible at https://github.com/bcgsc/NanoSim.
Subject(s)
Computational Biology/methods , Genomics/methods , High-Throughput Nucleotide Sequencing , Software , Gene Expression Profiling/methods , High-Throughput Nucleotide Sequencing/methods , Reproducibility of Results , Transcriptome , Workflow
ABSTRACT
BACKGROUND: Neuroblastoma is a heterogeneous disease with diverse clinical outcomes. Current risk group models require improvement, as patients within the same risk group can still show variable prognosis. Recently collected genome-wide datasets provide opportunities to infer neuroblastoma subtypes in a more unified way. Within this context, data integration is critical, as different molecular characteristics can contain complementary signals. To this end, we utilized the genomic datasets available for the SEQC cohort patients to develop supervised and unsupervised models that can predict disease prognosis. RESULTS: Our supervised model trained on the SEQC cohort can accurately predict overall survival and event-free survival profiles of patients in two independent cohorts. We also performed extensive experiments to assess the prediction accuracy for high-risk patients and patients without MYCN amplification. These results suggest that clinical endpoints can be predicted accurately across multiple cohorts. To explore the data in an unsupervised manner, we used an integrative clustering strategy named multi-view kernel k-means (MVKKM) that can effectively integrate multiple high-dimensional datasets with varying weights. We observed that integrating different gene expression datasets results in better patient stratification than using these datasets individually. Also, our identified subgroups provide a better Cox regression model fit than the existing risk group definitions. CONCLUSION: Altogether, our results indicate that integration of multiple genomic characterizations enables the discovery of subtypes that improve over existing definitions of risk groups. Effective prediction of survival times will have a direct impact on choosing the right therapies for patients. REVIEWERS: This article was reviewed by Susmita Datta, Wenzhong Xiao and Ziv Shkedy.
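The idea of integrating several datasets "with varying weights" can be sketched as kernel k-means run on a weighted sum of per-view kernel matrices. This is a simplified illustration, not the MVKKM method used in the study: the view weights here are fixed by hand rather than learned, and all function names, the toy data, and the initial labels are assumptions made for the example.

```python
def linear_kernel(X):
    """Gram matrix of dot products between the rows of X (list of lists)."""
    return [[sum(a * b for a, b in zip(x, y)) for y in X] for x in X]

def combine_kernels(kernels, weights):
    """Weighted sum of per-view kernel matrices; weights are normalized to sum to 1."""
    total = sum(weights)
    n = len(kernels[0])
    return [[sum(w / total * K[i][j] for w, K in zip(weights, kernels))
             for j in range(n)] for i in range(n)]

def kernel_kmeans(K, init_labels, n_iter=50):
    """Kernel k-means on a precomputed kernel K, starting from init_labels."""
    n = len(K)
    labels = list(init_labels)
    clusters = sorted(set(labels))
    for _ in range(n_iter):
        members = {c: [i for i in range(n) if labels[i] == c] for c in clusters}
        new_labels = []
        for i in range(n):
            best, best_d = labels[i], float("inf")
            for c, idx in members.items():
                if not idx:
                    continue  # skip clusters that have become empty
                m = len(idx)
                # ||phi(x_i) - mu_c||^2 = K_ii - (2/m) sum_j K_ij + (1/m^2) sum_jl K_jl
                d = (K[i][i] - 2 * sum(K[i][j] for j in idx) / m
                     + sum(K[j][l] for j in idx for l in idx) / m ** 2)
                if d < best_d:
                    best, best_d = c, d
            new_labels.append(best)
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
    return labels

# Hypothetical toy data: view 1 separates two groups, view 2 is uninformative.
view1 = [[0.0], [0.1], [0.2], [10.0], [10.1], [10.2]]
view2 = [[0.0], [1.0], [0.0], [1.0], [0.0], [1.0]]
K = combine_kernels([linear_kernel(view1), linear_kernel(view2)], [0.9, 0.1])
labels = kernel_kmeans(K, init_labels=[0, 1, 0, 1, 0, 1])
print(labels)
```

Giving the informative view most of the weight lets the clustering recover the two groups even though the starting labels follow the uninformative view.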