1.

RNA-clique: a method for computing genetic distances from RNA-seq data.

Tapia, Andrew C; Jaromczyk, Jerzy W; Moore, Neil; Schardl, Christopher L.

BMC Bioinformatics ; 25(1): 205, 2024 Jun 04.

Article En | MEDLINE | ID: mdl-38834962

BACKGROUND: Although RNA-seq data are traditionally used for quantifying gene expression levels, the same data could be useful in an integrated approach to compute genetic distances as well. Challenges to using mRNA sequences for computing genetic distances include the relatively high conservation of coding sequences and the presence of paralogous and, in some species, homeologous genes. RESULTS: We developed a new computational method, RNA-clique, for calculating genetic distances using assembled RNA-seq data and assessed the efficacy of the method using biological and simulated data. The method employs reciprocal BLASTn followed by graph-based filtering to ensure that only orthologous genes are compared. Each vertex in the graph constructed for filtering represents a gene in a specific sample under comparison, and an edge connects a pair of vertices if the genes they represent are best matches for each other in their respective samples. The distance computation is a function of the BLAST alignment statistics and the constructed graph and incorporates only those genes that are present in some complete connected component of this graph. As a biological testbed we used RNA-seq data of tall fescue (Lolium arundinaceum), an allohexaploid plant (2n = 14 Gb), and bluehead wrasse (Thalassoma bifasciatum), a teleost fish. RNA-clique reliably distinguished individual tall fescue plants by genotype and distinguished bluehead wrasse RNA-seq samples by individual. In tests with simulated RNA-seq data, the ground truth phylogeny was accurately recovered from the computed distances. Moreover, tests of the algorithm parameters indicated that, even with stringent filtering for orthologs, sufficient sequence data were retained for the distance computations.
Although comparisons with an alternative method revealed that RNA-clique has relatively high time and memory requirements, the comparisons also showed that RNA-clique's results were at least as reliable as the alternative's for tall fescue data and were much more reliable for the bluehead wrasse data. CONCLUSION: Results of this work indicate that RNA-clique works well as a way of deriving genetic distances from RNA-seq data, thus providing a methodological integration of functional and genetic diversity studies.
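
The graph-based filtering step described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `best_hit` table (a stand-in for parsed reciprocal BLASTn output), and the toy gene labels are all hypothetical, and the distance computation itself is not shown. The sketch joins two genes by an edge only when each is the other's best match, then keeps only connected components in which every pair of vertices is adjacent.

```python
from itertools import combinations

def reciprocal_best_hit_edges(best_hit):
    """best_hit maps (sample, gene) -> {other_sample: best-matching gene there}.
    Returns the set of edges whose endpoints are best matches for each other."""
    edges = set()
    for (sample, gene), hits in best_hit.items():
        for other, match in hits.items():
            back = best_hit.get((other, match), {}).get(sample)
            if back == gene:
                edges.add(frozenset([(sample, gene), (other, match)]))
    return edges

def complete_components(vertices, edges):
    """Return the connected components in which every vertex pair shares an edge."""
    adj = {v: set() for v in vertices}
    for edge in edges:
        a, b = tuple(edge)
        adj[a].add(b)
        adj[b].add(a)
    seen, result = set(), []
    for start in vertices:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:  # depth-first traversal of one component
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(adj[v] - comp)
        seen |= comp
        if len(comp) > 1 and all(b in adj[a] for a, b in combinations(comp, 2)):
            result.append(comp)
    return result

# Toy data (hypothetical): g1 is a clean ortholog in samples A, B and C;
# g2 in sample C has a different gene as its best match in sample A.
best_hit = {
    ("A", "g1"): {"B": "g1", "C": "g1"},
    ("B", "g1"): {"A": "g1", "C": "g1"},
    ("C", "g1"): {"A": "g1", "B": "g1"},
    ("A", "g2"): {"B": "g2", "C": "g2"},
    ("B", "g2"): {"A": "g2", "C": "g2"},
    ("C", "g2"): {"A": "g1", "B": "g2"},
}
edges = reciprocal_best_hit_edges(best_hit)
kept = complete_components(list(best_hit), edges)
print(kept)
```

In this toy input the g2 genes form a connected path but not a complete component, so only the g1 triple survives the filter. Whether a retained component must also span every sample is left open here.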

RNA-Seq , RNA-Seq/methods , Sequence Analysis, RNA/methods , Computational Biology/methods , Algorithms

2.

Differential effects of follicle-stimulating hormone glycoforms on the transcriptome profile of cultured rat granulosa cells as disclosed by RNA-seq.

Zariñán, Teresa; Espinal-Enriquez, Jesús; De Anda-Jáuregui, Guillermo; Lira-Albarrán, Saúl; Hernández-Montes, Georgina; Gutiérrez-Sagal, Rubén; Rebollar-Vega, Rosa G; Bousfield, George R; Butnev, Viktor Y; Hernández-Lemus, Enrique; Ulloa-Aguirre, Alfredo.

PLoS One ; 19(6): e0293688, 2024.

Article En | MEDLINE | ID: mdl-38843139

It has been documented that variations in glycosylation on glycoprotein hormones confer distinctly different biological features to the corresponding glycoforms when multiple in vitro biochemical readings are analyzed. We here applied next-generation RNA sequencing to explore changes in the transcriptome of rat granulosa cells exposed for 0, 6, and 12 h to 100 ng/ml of four highly purified follicle-stimulating hormone (FSH) glycoforms, each exhibiting different glycosylation patterns: a. human pituitary FSH18/21 (hypo-glycosylated); b. human pituitary FSH24 (fully glycosylated); c. equine FSH (eqFSH) (hypo-glycosylated); and d. Chinese-hamster ovary cell-derived human recombinant FSH (recFSH) (fully glycosylated). Total RNA from triplicate incubations was prepared from FSH glycoform-exposed cultured granulosa cells obtained from DES-pretreated immature female rats, and RNA libraries were sequenced in a HiSeq 2500 sequencer (2 x 125 bp paired-end format, 10-15 x 10^6 reads/sample). The computational workflow focused on investigating differences among the four FSH glycoforms at three levels: gene expression, enriched biological processes, and perturbed pathways. Among the top 200 differentially expressed genes, only 4 (0.6%) were shared by all 4 glycoforms at 6 h, whereas 118 genes (40%) were shared at 12 h. Follicle-stimulating hormone glycoforms stimulated different patterns of exclusive and associated upregulated biological processes in a glycoform- and time-dependent fashion, with more shared biological processes after 12 h of exposure and fewer treatment-specific ones, except for recFSH, which exhibited stronger responses with more specifically associated processes at this time. Similar results were found for down-regulated processes, with a greater number of processes at 6 h or 12 h, depending on the particular glycoform.
In general, there were fewer downregulated than upregulated processes at both 6 h and 12 h, with FSH18/21 exhibiting the largest number of down-regulated associated processes at 6 h while eqFSH exhibited the greatest number at 12 h. Signaling cascades, largely linked to cAMP-PKA, MAPK, and PI3/AKT pathways were detected as differentially activated by the glycoforms, with each glycoform exhibiting its own molecular signature. These data extend previous observations demonstrating glycosylation-dependent distinctly different regulation of gene expression and intracellular signaling pathways triggered by FSH in granulosa cells. The results also suggest the importance of individual FSH glycoform glycosylation for the conformation of the ligand-receptor complex and induced signalling pathways.

Follicle Stimulating Hormone , Granulosa Cells , Transcriptome , Animals , Female , Granulosa Cells/metabolism , Granulosa Cells/drug effects , Follicle Stimulating Hormone/pharmacology , Follicle Stimulating Hormone/metabolism , Rats , Glycosylation , Transcriptome/drug effects , Humans , Cells, Cultured , RNA-Seq/methods , CHO Cells , Cricetulus

3.

A single-cell and spatial RNA-seq database for Alzheimer's disease (ssREAD).

Wang, Cankun; Acosta, Diana; McNutt, Megan; Bian, Jiang; Ma, Anjun; Fu, Hongjun; Ma, Qin.

Nat Commun ; 15(1): 4710, 2024 Jun 06.

Article En | MEDLINE | ID: mdl-38844475

Alzheimer's Disease (AD) pathology has been increasingly explored through single-cell and single-nucleus RNA-sequencing (scRNA-seq & snRNA-seq) and spatial transcriptomics (ST). However, the surge in data demands a comprehensive, user-friendly repository. Addressing this, we introduce a single-cell and spatial RNA-seq database for Alzheimer's disease (ssREAD). It offers a broader spectrum of AD-related datasets, an optimized analytical pipeline, and improved usability. The database encompasses 1,053 samples (277 integrated datasets) from 67 AD-related scRNA-seq & snRNA-seq studies, totaling 7,332,202 cells. Additionally, it archives 381 ST datasets from 18 human and mouse brain studies. Each dataset is annotated with details such as species, gender, brain region, disease/control status, age, and AD Braak stages. ssREAD also provides an analysis suite for cell clustering, identification of differentially expressed and spatially variable genes, cell-type-specific marker genes and regulons, and spot deconvolution for integrative analysis. ssREAD is freely available at https://bmblx.bmi.osumc.edu/ssread/ .

Alzheimer Disease , RNA-Seq , Single-Cell Analysis , Alzheimer Disease/genetics , Humans , Single-Cell Analysis/methods , Animals , Mice , RNA-Seq/methods , Brain/metabolism , Brain/pathology , Databases, Genetic , Transcriptome , Sequence Analysis, RNA/methods , Gene Expression Profiling/methods , Male

4.

Systematic evaluation with practical guidelines for single-cell and spatially resolved transcriptomics data simulation under multiple scenarios.

Duo, Hongrui; Li, Yinghong; Lan, Yang; Tao, Jingxin; Yang, Qingxia; Xiao, Yingxue; Sun, Jing; Li, Lei; Nie, Xiner; Zhang, Xiaoxi; Liang, Guizhao; Liu, Mingwei; Hao, Youjin; Li, Bo.

Genome Biol ; 25(1): 145, 2024 Jun 03.

Article En | MEDLINE | ID: mdl-38831386

BACKGROUND: Single-cell RNA sequencing (scRNA-seq) and spatially resolved transcriptomics (SRT) have led to groundbreaking advancements in life sciences. To develop bioinformatics tools for scRNA-seq and SRT data and perform unbiased benchmarks, data simulation has been widely adopted by providing explicit ground truth and generating customized datasets. However, the performance of simulation methods under multiple scenarios has not been comprehensively assessed, making it challenging to choose suitable methods without practical guidelines. RESULTS: We systematically evaluated 49 simulation methods developed for scRNA-seq and/or SRT data in terms of accuracy, functionality, scalability, and usability using 152 reference datasets derived from 24 platforms. SRTsim, scDesign3, ZINB-WaVE, and scDesign2 have the best accuracy performance across various platforms. Unexpectedly, some methods tailored to scRNA-seq data have potential compatibility for simulating SRT data. Lun, SPARSim, and scDesign3-tree outperform other methods under corresponding simulation scenarios. Phenopath, Lun, Simple, and MFA yield high scalability scores but they cannot generate realistic simulated data. Users should consider the trade-offs between method accuracy and scalability (or functionality) when making decisions. Additionally, execution errors are mainly caused by failed parameter estimations and appearance of missing or infinite values in calculations. We provide practical guidelines for method selection, a standard pipeline Simpipe ( https://github.com/duohongrui/simpipe ; https://doi.org/10.5281/zenodo.11178409 ), and an online tool Simsite ( https://www.ciblab.net/software/simshiny/ ) for data simulation. CONCLUSIONS: No method performs best on all criteria, thus a good-yet-not-the-best method is recommended if it solves problems effectively and reasonably. 
Our comprehensive work provides crucial insights for developers on modeling gene expression data and fosters the simulation process for users.

Gene Expression Profiling , Single-Cell Analysis , Single-Cell Analysis/methods , Gene Expression Profiling/methods , Humans , Software , Computer Simulation , Transcriptome , Computational Biology/methods , Sequence Analysis, RNA/methods , RNA-Seq/methods , RNA-Seq/standards

5.

Comparative RNA-seq analysis of Arabidopsis thaliana response to AtPep1 and flg22, reveals the identification of PP2-B13 and ACLP1 as new members in pattern-triggered immunity.

Safaeizadeh, Mehdi; Boller, Thomas; Becker, Claude.

PLoS One ; 19(6): e0297124, 2024.

Article En | MEDLINE | ID: mdl-38833485

In this research, a high-throughput RNA sequencing-based transcriptome analysis technique (RNA-Seq) was used to evaluate differentially expressed genes (DEGs) in wild-type Arabidopsis seedlings in response to AtPep1, a well-known peptide representing an endogenous damage-associated molecular pattern (DAMP), and flg22, a well-known microbe-associated molecular pattern (MAMP). We compared and dissected the global transcriptional landscape of Arabidopsis thaliana in response to AtPep1 and flg22 and could identify shared and unique DEGs in response to these elicitors. We found that while a remarkable number of flg22 up-regulated genes were also induced by AtPep1, 256 genes were exclusively up-regulated in response to flg22, and 328 were exclusively up-regulated in response to AtPep1. Furthermore, among down-regulated DEGs upon flg22 treatment, 107 genes were exclusively down-regulated by flg22 treatment, while 411 genes were exclusively down-regulated by AtPep1. We found a number of hitherto overlooked genes to be induced upon treatment with either flg22 or AtPep1, indicating their possible involvement in general pathways in innate immunity. Here, we characterized two of them, namely PP2-B13 and ACLP1. pp2-b13 and aclp1 mutants showed increased susceptibility to infection by the virulent pathogen Pseudomonas syringae DC3000 and its mutant Pst DC3000 hrcC (lacking the type III secretion system), as evidenced by increased proliferation of the two pathogens in planta. Further, we present evidence that the aclp1 mutant is deficient in ethylene production upon flg22 treatment, while the pp2-b13 mutant is deficient in the production of reactive oxygen species (ROS). The results from this research provide new information for a better understanding of the immune system in Arabidopsis.

Arabidopsis Proteins , Arabidopsis , Gene Expression Regulation, Plant , Arabidopsis/genetics , Arabidopsis/immunology , Arabidopsis/microbiology , Arabidopsis Proteins/genetics , Arabidopsis Proteins/metabolism , Plant Immunity/genetics , RNA-Seq/methods , Pseudomonas syringae/pathogenicity , Gene Expression Profiling , Innate Immunity Recognition

6.

A comparison of RNA-Seq data preprocessing pipelines for transcriptomic predictions across independent studies.

Van, Richard; Alvarez, Daniel; Mize, Travis; Gannavarapu, Sravani; Chintham Reddy, Lohitha; Nasoz, Fatma; Han, Mira V.

BMC Bioinformatics ; 25(1): 181, 2024 May 08.

Article En | MEDLINE | ID: mdl-38720247

BACKGROUND: RNA sequencing combined with machine learning techniques has provided a modern approach to the molecular classification of cancer. Class predictors, reflecting the disease class, can be constructed for known tissue types using the gene expression measurements extracted from cancer patients. One challenge of current cancer predictors is that they often have suboptimal performance estimates when integrating molecular datasets generated from different labs. Often, the quality of the data is variable, procured differently, and contains unwanted noise hampering the ability of a predictive model to extract useful information. Data preprocessing methods can be applied in attempts to reduce these systematic variations and harmonize the datasets before they are used to build a machine learning model for resolving tissue of origins. RESULTS: We aimed to investigate the impact of data preprocessing steps-focusing on normalization, batch effect correction, and data scaling-through trial and comparison. Our goal was to improve the cross-study predictions of tissue of origin for common cancers on large-scale RNA-Seq datasets derived from thousands of patients and over a dozen tumor types. The results showed that the choice of data preprocessing operations affected the performance of the associated classifier models constructed for tissue of origin predictions in cancer. CONCLUSION: By using TCGA as a training set and applying data preprocessing methods, we demonstrated that batch effect correction improved performance measured by weighted F1-score in resolving tissue of origin against an independent GTEx test dataset. On the other hand, the use of data preprocessing operations worsened classification performance when the independent test dataset was aggregated from separate studies in ICGC and GEO. 
Therefore, based on our findings with these publicly available large-scale RNA-Seq datasets, the application of data preprocessing techniques to a machine learning pipeline is not always appropriate.
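
The weighted F1-score named in this abstract averages the per-class F1 scores, weighting each class by its number of true samples. The Python sketch below works through that arithmetic on invented tissue labels. The function name and the data are illustrative assumptions, not taken from the study. The calculation follows the same definition as scikit-learn's `f1_score` with `average="weighted"`.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)  # number of true samples per class
    total = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += n * f1
    return total / len(y_true)

# Hypothetical tissue-of-origin labels for six samples.
y_true = ["lung", "lung", "lung", "breast", "breast", "colon"]
y_pred = ["lung", "lung", "breast", "breast", "breast", "lung"]
score = weighted_f1(y_true, y_pred)
print(round(score, 3))  # per-class F1: lung 2/3, breast 0.8, colon 0 -> 0.6
```

Because each class is weighted by its support, a rare class that is never predicted correctly (colon here) lowers the score less than it would under an unweighted macro average.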

Machine Learning , Neoplasms , RNA-Seq , Humans , RNA-Seq/methods , Neoplasms/genetics , Transcriptome/genetics , Sequence Analysis, RNA/methods , Gene Expression Profiling/methods , Computational Biology/methods

7.

scCRT: a contrastive-based dimensionality reduction model for scRNA-seq trajectory inference.

Shi, Yuchen; Wan, Jian; Zhang, Xin; Liang, Tingting; Yin, Yuyu.

Brief Bioinform ; 25(3), 2024 Mar 27.

Article En | MEDLINE | ID: mdl-38701412

Trajectory inference is a crucial task in single-cell RNA-sequencing downstream analysis, which can reveal the dynamic processes of biological development, including cell differentiation. Dimensionality reduction is an important step in the trajectory inference process. However, most existing trajectory methods rely on cell features derived from traditional dimensionality reduction methods, such as principal component analysis and uniform manifold approximation and projection. These methods are not specifically designed for trajectory inference and fail to fully leverage prior information from upstream analysis, limiting their performance. Here, we introduce scCRT, a novel dimensionality reduction model for trajectory inference. In order to utilize prior information to learn accurate cell representations, scCRT integrates two feature learning components: a cell-level pairwise module and a cluster-level contrastive module. The cell-level module focuses on learning accurate cell representations in a reduced-dimensionality space while maintaining the cell-cell positional relationships in the original space. The cluster-level contrastive module uses prior cell state information to aggregate similar cells, preventing excessive dispersion in the low-dimensional space. Experimental findings from 54 real and 81 synthetic datasets, totaling 135 datasets, highlighted the superior performance of scCRT compared with commonly used trajectory inference methods. Additionally, an ablation study revealed that both cell-level and cluster-level modules enhance the model's ability to learn accurate cell features, facilitating cell lineage inference. The source code of scCRT is available at https://github.com/yuchen21-web/scCRT-for-scRNA-seq.

Algorithms , Single-Cell Analysis , Single-Cell Analysis/methods , Humans , RNA-Seq/methods , Computational Biology/methods , Software , Sequence Analysis, RNA/methods , Animals , Single-Cell Gene Expression Analysis

8.

scEWE: high-order element-wise weighted ensemble clustering for heterogeneity analysis of single-cell RNA-sequencing data.

Huang, Yixiang; Jiang, Hao; Ching, Wai-Ki.

Brief Bioinform ; 25(3), 2024 Mar 27.

Article En | MEDLINE | ID: mdl-38701413

With the emergence of large amounts of single-cell RNA sequencing (scRNA-seq) data, the exploration of computational methods has become critical in revealing biological mechanisms. Clustering is a representative approach for deciphering cellular heterogeneity embedded in scRNA-seq data. However, due to the diversity of datasets, none of the existing single-cell clustering methods shows overwhelming performance on all datasets. Weighted ensemble methods are proposed to integrate multiple results to improve heterogeneity analysis performance. These methods are usually weighted by considering the reliability of the base clustering results, ignoring the performance difference of the same base clustering on different cells. In this paper, we propose scEWE, a self-representative ensemble learning framework based on a high-order element-wise weighting strategy. By assigning different base clustering weights to individual cells, we construct and optimize the consensus matrix in a careful and exquisite way. In addition, we extracted the high-order information between cells, which enhanced the ability to represent the similarity relationship between cells. scEWE is experimentally shown to significantly outperform the state-of-the-art methods, which strongly demonstrates the effectiveness of the method and supports the potential applications in complex single-cell data analytical problems.

Sequence Analysis, RNA , Single-Cell Analysis , Single-Cell Analysis/methods , Cluster Analysis , Sequence Analysis, RNA/methods , Algorithms , Computational Biology/methods , Humans , RNA-Seq/methods

9.

Analysis of Bladder Cancer Staging Prediction Using Deep Residual Neural Network, Radiomics, and RNA-Seq from High-Definition CT Images.

Zhou, Yao; Zheng, Xingju; Sun, Zhucheng; Wang, Bo.

Genet Res (Camb) ; 2024: 4285171, 2024.

Article En | MEDLINE | ID: mdl-38715622

Bladder cancer has recently seen an alarming increase in global diagnoses, ascending as a predominant cause of cancer-related mortalities. Given this pressing scenario, there is a burgeoning need to identify effective biomarkers for both the diagnosis and therapeutic guidance of bladder cancer. This study focuses on evaluating the potential of high-definition computed tomography (CT) imagery coupled with RNA-sequencing analysis to accurately predict bladder tumor stages, utilizing deep residual networks. Data for this study, including CT images and RNA-Seq datasets for 82 high-grade bladder cancer patients, were sourced from the TCIA and TCGA databases. We employed Cox and lasso regression analyses to determine radiomics and gene signatures, leading to the identification of a three-factor radiomics signature and a four-gene signature in our bladder cancer cohort. ROC curve analyses underscored the strong predictive capacities of both these signatures. Furthermore, we formulated a nomogram integrating clinical features, radiomics, and gene signatures. This nomogram's AUC scores stood at 0.870, 0.873, and 0.971 for 1-year, 3-year, and 5-year predictions, respectively. Our model, leveraging radiomics and gene signatures, presents significant promise for enhancing diagnostic precision in bladder cancer prognosis, advocating for its clinical adoption.

Neoplasm Staging , Neural Networks, Computer , Tomography, X-Ray Computed , Urinary Bladder Neoplasms , Urinary Bladder Neoplasms/genetics , Urinary Bladder Neoplasms/diagnostic imaging , Urinary Bladder Neoplasms/pathology , Humans , Tomography, X-Ray Computed/methods , Male , Female , RNA-Seq/methods , Aged , Nomograms , Middle Aged , Biomarkers, Tumor/genetics , ROC Curve , Prognosis , Transcriptome , Radiomics

10.

Single-cell RNA sequencing data imputation using bi-level feature propagation.

Lee, Junseok; Yun, Sukwon; Kim, Yeongmin; Chen, Tianlong; Kellis, Manolis; Park, Chanyoung.

Brief Bioinform ; 25(3), 2024 Mar 27.

Article En | MEDLINE | ID: mdl-38706317

Single-cell RNA sequencing (scRNA-seq) enables the exploration of cellular heterogeneity by analyzing gene expression profiles in complex tissues. However, scRNA-seq data often suffer from technical noise, dropout events and sparsity, hindering downstream analyses. Although existing works attempt to mitigate these issues by utilizing graph structures for data denoising, they involve the risk of propagating noise and fall short of fully leveraging the inherent data relationships, relying mainly on one of cell-cell or gene-gene associations and graphs constructed by initial noisy data. To this end, this study presents single-cell bilevel feature propagation (scBFP), a two-step graph-based feature propagation method. It initially imputes zero values using non-zero values, ensuring that the imputation process does not affect the non-zero values due to dropout. Subsequently, it denoises the entire dataset by leveraging gene-gene and cell-cell relationships in the respective steps. Extensive experimental results on scRNA-seq data demonstrate the effectiveness of scBFP in various downstream tasks, uncovering valuable biological insights.

Sequence Analysis, RNA , Single-Cell Analysis , Single-Cell Analysis/methods , Sequence Analysis, RNA/methods , Humans , Algorithms , Gene Expression Profiling/methods , Computational Biology/methods , RNA-Seq/methods

11.

Dual RNA-seq Analysis of Patients' Cells and Viral Genome After Measles Infection.

Pogka, Vasiliki; Mentis, Andreas; Karamitros, Timokratis.

Methods Mol Biol ; 2808: 121-127, 2024.

Article En | MEDLINE | ID: mdl-38743366

During the infection of a host cell by an infectious agent, a series of gene expression changes occurs as a consequence of host-pathogen interactions. Unraveling this complex interplay is key to understanding microbial virulence and host response pathways, thus providing the basis for new molecular insights into the mechanisms of pathogenesis and the corresponding immune response. Dual RNA sequencing (dual RNA-seq) has been developed to simultaneously determine pathogen and host transcriptomes, enabling both differential and coexpression analyses between the two partners as well as genome characterization in the case of RNA viruses. Here, we provide a detailed laboratory protocol and bioinformatics analysis guidelines for dual RNA-seq experiments focusing on - but not restricted to - measles virus (MeV) as a pathogen of interest. The application of dual RNA-seq technologies in MeV-infected patients can potentially provide valuable information on the structure of the viral RNA genome and on cellular innate immune responses and drive the discovery of new targets for antiviral therapy.

Genome, Viral , Host-Pathogen Interactions , Measles virus , Measles , RNA, Viral , Humans , Measles/virology , Measles/immunology , Measles/genetics , Measles virus/genetics , Measles virus/pathogenicity , RNA, Viral/genetics , Host-Pathogen Interactions/genetics , Host-Pathogen Interactions/immunology , Computational Biology/methods , Sequence Analysis, RNA/methods , RNA-Seq/methods , Transcriptome , Gene Expression Profiling/methods , High-Throughput Nucleotide Sequencing/methods

12.

GRouNdGAN: GRN-guided simulation of single-cell RNA-seq data using causal generative adversarial networks.

Zinati, Yazdan; Takiddeen, Abdulrahman; Emad, Amin.

Nat Commun ; 15(1): 4055, 2024 May 14.

Article En | MEDLINE | ID: mdl-38744843

We introduce GRouNdGAN, a gene regulatory network (GRN)-guided reference-based causal implicit generative model for simulating single-cell RNA-seq data, in silico perturbation experiments, and benchmarking GRN inference methods. Through the imposition of a user-defined GRN in its architecture, GRouNdGAN simulates steady-state and transient-state single-cell datasets where genes are causally expressed under the control of their regulating transcription factors (TFs). Training on six experimental reference datasets, we show that our model captures non-linear TF-gene dependencies and preserves gene identities, cell trajectories, pseudo-time ordering, and technical and biological noise, with no user manipulation and only implicit parameterization. GRouNdGAN can synthesize cells under new conditions to perform in silico TF knockout experiments. Benchmarking various GRN inference algorithms reveals that GRouNdGAN effectively bridges the existing gap between simulated and biological data benchmarks of GRN inference algorithms, providing gold standard ground truth GRNs and realistic cells corresponding to the biological system of interest.

Algorithms , Computer Simulation , Gene Regulatory Networks , RNA-Seq , Single-Cell Analysis , Single-Cell Analysis/methods , RNA-Seq/methods , Humans , Transcription Factors/metabolism , Transcription Factors/genetics , Computational Biology/methods , Benchmarking , Sequence Analysis, RNA/methods , Single-Cell Gene Expression Analysis

13.

Assessment of Gene Set Enrichment Analysis using curated RNA-seq-based benchmarks.

Candia, Julián; Ferrucci, Luigi.

PLoS One ; 19(5): e0302696, 2024.

Article En | MEDLINE | ID: mdl-38753612

Pathway enrichment analysis is a ubiquitous computational biology method to interpret a list of genes (typically derived from the association of large-scale omics data with phenotypes of interest) in terms of higher-level, predefined gene sets that share biological function, chromosomal location, or other common features. Among many tools developed so far, Gene Set Enrichment Analysis (GSEA) stands out as one of the pioneering and most widely used methods. Although originally developed for microarray data, GSEA is nowadays extensively utilized for RNA-seq data analysis. Here, we quantitatively assessed the performance of a variety of GSEA modalities and provide guidance in the practical use of GSEA in RNA-seq experiments. We leveraged harmonized RNA-seq datasets available from The Cancer Genome Atlas (TCGA) in combination with large, curated pathway collections from the Molecular Signatures Database to obtain cancer-type-specific target pathway lists across multiple cancer types. We carried out a detailed analysis of GSEA performance using both gene-set and phenotype permutations combined with four different choices for the Kolmogorov-Smirnov enrichment statistic. Based on our benchmarks, we conclude that the classic/unweighted gene-set permutation approach offered comparable or better sensitivity-vs-specificity tradeoffs across cancer types compared with other, more complex and computationally intensive permutation methods. Finally, we analyzed other large cohorts for thyroid cancer and hepatocellular carcinoma. We utilized a new consensus metric, the Enrichment Evidence Score (EES), which showed a remarkable agreement between pathways identified in TCGA and those from other sources, despite differences in cancer etiology. This finding suggests an EES-based strategy to identify a core set of pathways that may be complemented by an expanded set of pathways for downstream exploratory analysis. 
This work fills the existing gap in current guidelines and benchmarks for the use of GSEA with RNA-seq data and provides a framework to enable detailed benchmarking of other RNA-seq-based pathway analysis tools.
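
The Kolmogorov-Smirnov-style enrichment statistic discussed in this abstract can be sketched in a few lines of Python. This is an illustrative sketch rather than the GSEA software's implementation: the function name, the exponent parameter `p`, and the toy ranking are assumptions made for the example. The sketch walks down a ranked gene list, stepping up at genes in the set and down at all others, and reports the largest deviation from zero. With `p = 0` every hit counts equally, which corresponds to the classic/unweighted statistic. A larger `p` weights hits by the magnitude of their ranking score.

```python
def enrichment_score(ranked, gene_set, p=0.0):
    """ranked: list of (gene, score) pairs sorted by decreasing score.
    Returns the signed maximum deviation of the running sum from zero."""
    hit_weights = [abs(s) ** p if g in gene_set else 0.0 for g, s in ranked]
    total_hit = sum(hit_weights)
    n_miss = sum(1 for g, _ in ranked if g not in gene_set)
    if total_hit == 0 or n_miss == 0:
        raise ValueError("need at least one weighted hit and one miss")
    running, best = 0.0, 0.0
    for (gene, _), w in zip(ranked, hit_weights):
        if gene in gene_set:
            running += w / total_hit   # step up at a hit
        else:
            running -= 1.0 / n_miss    # step down at a miss
        if abs(running) > abs(best):
            best = running
    return best

# Toy ranking (hypothetical genes and scores).
ranked = [("a", 3.0), ("b", 2.0), ("c", 1.0), ("d", -1.0), ("e", -2.0), ("f", -3.0)]
es_top = enrichment_score(ranked, {"a", "b"})     # set concentrated at the top
es_bottom = enrichment_score(ranked, {"e", "f"})  # set concentrated at the bottom
print(es_top, es_bottom)
```

Significance testing by gene-set or phenotype permutation, which the abstract compares, would repeat this calculation on permuted inputs and is not shown.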

Benchmarking , RNA-Seq , Humans , RNA-Seq/methods , Computational Biology/methods , Neoplasms/genetics , Databases, Genetic , Gene Expression Profiling/methods

14.

Data-driven selection of analysis decisions in single-cell RNA-seq trajectory inference.

Dong, Xiaoru; Leary, Jack R; Yang, Chuanhao; Brusko, Maigan A; Brusko, Todd M; Bacher, Rhonda.

Brief Bioinform ; 25(3), 2024 Mar 27.

Article En | MEDLINE | ID: mdl-38725155

Single-cell RNA sequencing (scRNA-seq) experiments have become instrumental in developmental and differentiation studies, enabling the profiling of cells at a single or multiple time-points to uncover subtle variations in expression profiles reflecting underlying biological processes. Benchmarking studies have compared many of the computational methods used to reconstruct cellular dynamics; however, researchers still encounter challenges in their analysis due to uncertainty with respect to selecting the most appropriate methods and parameters. Even among universal data processing steps used by trajectory inference methods such as feature selection and dimension reduction, trajectory methods' performances are highly dataset-specific. To address these challenges, we developed Escort, a novel framework for evaluating a dataset's suitability for trajectory inference and quantifying trajectory properties influenced by analysis decisions. Escort evaluates the suitability of trajectory analysis and the combined effects of processing choices using trajectory-specific metrics. Escort navigates single-cell trajectory analysis through these data-driven assessments, reducing uncertainty and much of the decision burden inherent to trajectory inference analyses. Escort is implemented in an accessible R package and R/Shiny application, providing researchers with the necessary tools to make informed decisions during trajectory analysis and enabling new insights into dynamic biological processes at single-cell resolution.

RNA-Seq , Single-Cell Analysis , Single-Cell Analysis/methods , RNA-Seq/methods , Humans , Computational Biology/methods , Sequence Analysis, RNA/methods , Software , Algorithms , Gene Expression Profiling/methods , Single-Cell Gene Expression Analysis

15.

Quantifying 3'UTR length from scRNA-seq data reveals changes independent of gene expression.

Fansler, Mervin M; Mitschka, Sibylle; Mayr, Christine.

Nat Commun ; 15(1): 4050, 2024 May 14.

Article En | MEDLINE | ID: mdl-38744866

Although more than half of all genes generate transcripts that differ in 3'UTR length, current analysis pipelines only quantify the amount but not the length of mRNA transcripts. 3'UTR length is determined by 3' end cleavage sites (CS). We map CS in more than 200 primary human and mouse cell types and increase CS annotations relative to the GENCODE database by 40%. Approximately half of all CS are used in few cell types, revealing that most genes only have one or two major 3' ends. We incorporate the CS annotations into a computational pipeline, called scUTRquant, for rapid, accurate, and simultaneous quantification of gene and 3'UTR isoform expression from single-cell RNA sequencing (scRNA-seq) data. When applying scUTRquant to data from 474 cell types and 2134 perturbations, we discover extensive 3'UTR length changes across cell types that are as widespread and coordinately regulated as gene expression changes but affect mostly different genes. Our data indicate that mRNA abundance and mRNA length are two largely independent axes of gene regulation that together determine the amount and spatial organization of protein synthesis.

3' Untranslated Regions , RNA, Messenger , Single-Cell Analysis , 3' Untranslated Regions/genetics , Humans , Animals , Mice , RNA, Messenger/genetics , RNA, Messenger/metabolism , Single-Cell Analysis/methods , Sequence Analysis, RNA/methods , Gene Expression Regulation , RNA-Seq/methods , Computational Biology/methods , Gene Expression Profiling/methods , Single-Cell Gene Expression Analysis

16.

Multi-omics integration of scRNA-seq time series data predicts new intervention points for Parkinson's disease.

Mihajlovic, Katarina; Ceddia, Gaia; Malod-Dognin, Noël; Novak, Gabriela; Kyriakis, Dimitrios; Skupin, Alexander; Przulj, Natasa.

Sci Rep ; 14(1): 10983, 2024 May 14.

Article En | MEDLINE | ID: mdl-38744869

Parkinson's disease (PD) is a complex neurodegenerative disorder without a cure. The onset of PD symptoms corresponds to 50% loss of midbrain dopaminergic (mDA) neurons, limiting early-stage understanding of PD. To shed light on early PD development, we study time series scRNA-seq datasets of mDA neurons obtained from patient-derived induced pluripotent stem cell differentiation. We develop a new data integration method based on Non-negative Matrix Tri-Factorization that integrates these datasets with molecular interaction networks, producing condition-specific "gene embeddings". By mining these embeddings, we predict 193 PD-related genes that are largely supported (49.7%) in the literature and are specific to the investigated PINK1 mutation. Enrichment analysis in Kyoto Encyclopedia of Genes and Genomes pathways highlights 10 PD-related molecular mechanisms perturbed during early PD development. Finally, investigating the top 20 prioritized genes reveals 12 previously unrecognized genes associated with PD that represent interesting drug targets.

Dopaminergic Neurons , Parkinson Disease , Parkinson Disease/genetics , Parkinson Disease/pathology , Humans , Dopaminergic Neurons/metabolism , Dopaminergic Neurons/pathology , RNA-Seq/methods , Induced Pluripotent Stem Cells/metabolism , Mesencephalon/metabolism , Mesencephalon/pathology , Gene Regulatory Networks , Mutation , Cell Differentiation/genetics , Multiomics , Single-Cell Gene Expression Analysis
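
The abstract of entry 16 names Non-negative Matrix Tri-Factorization (NMTF) as the basis of its integration method. As a rough illustration of that building block only, the sketch below factorizes a single non-negative matrix X into F S G^T using standard multiplicative updates for the Frobenius-norm objective. It is not the authors' method: their coupling of scRNA-seq data with molecular interaction networks is not reproduced here, and the function name `nmtf`, the ranks `k1` and `k2`, and the random toy matrix are all assumptions made for the example.

```python
import numpy as np


def nmtf(X, k1, k2, n_iter=200, eps=1e-9, seed=0):
    """Plain NMTF sketch: X (n x m) ~ F (n x k1) @ S (k1 x k2) @ G.T (k2 x m).

    Hypothetical helper for illustration; uses generic multiplicative
    updates, not the integration scheme described in the abstract.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    # Strictly positive random initialisation keeps all factors non-negative.
    F = rng.random((n, k1)) + eps
    S = rng.random((k1, k2)) + eps
    G = rng.random((m, k2)) + eps
    errors = []
    for _ in range(n_iter):
        # Each update multiplies by (negative gradient part) / (positive part).
        F *= (X @ G @ S.T) / (F @ S @ G.T @ G @ S.T + eps)
        G *= (X.T @ F @ S) / (G @ S.T @ F.T @ F @ S + eps)
        S *= (F.T @ X @ G) / (F.T @ F @ S @ G.T @ G + eps)
        # Track the Frobenius reconstruction error after each sweep.
        errors.append(np.linalg.norm(X - F @ S @ G.T))
    return F, S, G, errors


# Toy usage on a random non-negative matrix (e.g. cells x genes).
X_demo = np.random.default_rng(1).random((30, 20))
F, S, G, errors = nmtf(X_demo, k1=4, k2=3)
print(f"error after first sweep: {errors[0]:.3f}, after last: {errors[-1]:.3f}")
```

In a setting like the one the abstract describes, the rows of G would play the role of per-gene embeddings, but how the published method constrains or shares factors across datasets is not stated in the abstract and is left out of this sketch.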

17.

scCDC: a computational method for gene-specific contamination detection and correction in single-cell and single-nucleus RNA-seq data.

Wang, Weijian; Cen, Yihui; Lu, Zezhen; Xu, Yueqing; Sun, Tianyi; Xiao, Ying; Liu, Wanlu; Li, Jingyi Jessica; Wang, Chaochen.

Genome Biol ; 25(1): 136, 2024 May 23.

Article En | MEDLINE | ID: mdl-38783325

In droplet-based single-cell and single-nucleus RNA-seq assays, systematic contamination by ambient RNA molecules biases the quantification of gene expression levels. Existing methods correct the contamination for all genes globally. However, a specific evaluation of correction efficacy across varying contamination levels has been lacking. Here, we show that DecontX and CellBender under-correct highly contaminating genes, while SoupX and scAR over-correct lowly/non-contaminating genes. We develop scCDC as the first method to detect the contamination-causing genes and only correct expression levels of these genes, some of which are cell-type markers. Compared with existing decontamination methods, scCDC excels in decontaminating highly contaminating genes while avoiding over-correction of other genes.

RNA-Seq , Single-Cell Analysis , Single-Cell Analysis/methods , RNA-Seq/methods , Humans , Computational Biology/methods , Sequence Analysis, RNA/methods , Cell Nucleus/genetics , Software , Animals

18.

Comparative Methods for Demystifying Spatial Transcriptomics.

Sammeth, Michael; Mudra, Susann; Bialdiga, Sina; Hartmannsberger, Beate; Kramer, Sofia; Rittner, Heike.

Methods Mol Biol ; 2802: 515-546, 2024.

Article En | MEDLINE | ID: mdl-38819570

Spatial Transcriptomics (ST), the term coined for parallel RNA-Seq on cell populations ordered spatially on a histological tissue section, has recently become increasingly popular, especially in experiments where microfluidics-based single-cell sequencing fails, such as assays on neurons. ST platforms, like the 10x Visium technology investigated herein, produce thousands of RNA readouts simultaneously in a single experiment, captured by an array of micrometer-scale spots under the histological section. A central challenge of ST experiments therefore consists of analyzing the gene expression morphology of all spots to delineate clusters of similar cell mixtures, which are then compared to each other to identify up- or down-regulated marker genes. Moreover, another level of complexity in ST experiments, compared to traditional RNA-Seq, is imposed by staining the tissue section with protein markers of cells or cell components to identify afterward the spots that provide relevant information. The corresponding microscopy images need to be analyzed in addition to the RNA-Seq read mappings on the reference genome and transcriptome sequences. Focusing on the software suite provided by the Visium platform manufacturer, we break down the ST analysis pipeline into its four essential steps (image analysis, read alignment, gene quantification, and spot clustering) and compare results obtained when using reads from different subsets of spots and/or when employing alternative genome or transcriptome references. Our comparative analyses demonstrate the impact of spot selection and the choice of genome/transcriptome references on the analysis results when employing the manufacturer's pipeline.

Gene Expression Profiling , Software , Transcriptome , Gene Expression Profiling/methods , Single-Cell Analysis/methods , Image Processing, Computer-Assisted/methods , Humans , RNA-Seq/methods , Animals , Sequence Analysis, RNA/methods , Computational Biology/methods , High-Throughput Nucleotide Sequencing/methods

19.

RNA-Seq Data Analysis: A Practical Guide for Model and Non-Model Organisms.

Pola-Sánchez, Enrique; Hernández-Martínez, Karen Magdalena; Pérez-Estrada, Rafael; Sélem-Mójica, Nelly; Simpson, June; Abraham-Juárez, María Jazmín; Herrera-Estrella, Alfredo; Villalobos-Escobedo, José Manuel.

Curr Protoc ; 4(5): e1054, 2024 May.

Article En | MEDLINE | ID: mdl-38808970

RNA sequencing (RNA-seq) has emerged as a powerful tool for assessing genome-wide gene expression, revolutionizing various fields of biology. However, analyzing large RNA-seq datasets can be challenging, especially for students or researchers lacking bioinformatics experience. To address these challenges, we present a comprehensive guide that provides step-by-step workflows for analyzing RNA-seq data, from raw reads to functional enrichment analysis, starting with considerations for experimental design. The guide is designed to aid students and researchers working with any organism, irrespective of whether an assembled genome is available. Within this guide, we employ various recognized bioinformatics tools to navigate the landscape of RNA-seq analysis and discuss the advantages and disadvantages of different tools for the same task. Our protocol focuses on clarity, reproducibility, and practicality to enable users to navigate the complexities of RNA-seq data analysis easily and gain valuable biological insights from the datasets. Additionally, all scripts and a sample dataset are available in a GitHub repository to facilitate the implementation of the analysis pipeline. © 2024 The Authors. Current Protocols published by Wiley Periodicals LLC. Basic Protocol 1: Analysis of data from a model plant with an available reference genome. Basic Protocol 2: Gene ontology enrichment analysis. Basic Protocol 3: De novo assembly of data from non-model plants.

RNA-Seq , RNA-Seq/methods , Computational Biology/methods , Sequence Analysis, RNA/methods , Software

20.

MultiRNAflow: integrated analysis of temporal RNA-seq data with multiple biological conditions.

Loubaton, Rodolphe; Champagnat, Nicolas; Vallois, Pierre; Vallat, Laurent.

Bioinformatics ; 40(5), 2024 May 02.

Article En | MEDLINE | ID: mdl-38810104

MOTIVATION: The dynamic transcriptional mechanisms that govern eukaryotic cell function can now be analyzed by RNA sequencing. However, the packages currently available for the analysis of raw sequencing data do not provide automatic analysis of complex experimental designs with multiple biological conditions and multiple analysis time-points. RESULTS: The MultiRNAflow suite combines several packages in a unified framework allowing exploratory and supervised statistical analyses of temporal data for multiple biological conditions. AVAILABILITY AND IMPLEMENTATION: The R package MultiRNAflow is freely available on Bioconductor (https://bioconductor.org/packages/MultiRNAflow/), and the latest version of the source code is available on a GitHub repository (https://github.com/loubator/MultiRNAflow).

RNA-Seq , Software , RNA-Seq/methods , Sequence Analysis, RNA/methods , Computational Biology/methods