ABSTRACT
With the recent advent of single-cell-level biological understanding, there is growing interest in identifying cell states or subtypes that are homogeneous in terms of gene expression and are also enriched in certain biological conditions, such as disease samples versus normal samples (condition-specific cell subtypes). Despite the importance of identifying condition-specific cell subtypes, existing methods have the following limitations: because they train models on gene expression and biological condition information separately, (1) they do not consider potential interactions between the two, and (2) the weights given to the two types of information are not properly controlled. In addition, (3) they do not consider non-linear relationships in the gene expression and the biological condition. To address these limitations and accurately identify condition-specific cell subtypes, we develop scDeepJointClust, the first method that jointly trains on both types of information via a deep neural network. scDeepJointClust takes the results of state-of-the-art gene-expression-based clustering methods as input, leveraging their sophistication and accuracy. We evaluated scDeepJointClust on both simulation data in diverse scenarios and biological data of different diseases (melanoma and non-small-cell lung cancer) and showed that scDeepJointClust outperforms existing methods in terms of sensitivity and specificity. scDeepJointClust exhibits significant promise in advancing our understanding of cellular states and their implications in complex biological systems.
Subject(s)
Carcinoma, Non-Small-Cell Lung , Lung Neoplasms , Humans , Lung Neoplasms/genetics , Neural Networks, Computer
ABSTRACT
Alternative polyadenylation (APA) in breast tumor samples results in the removal or addition of cis-regulatory elements such as microRNA (miRNA) target sites in the 3'-untranslated regions (3'-UTRs) of genes. Whereas previous computational APA studies focused on a subset of genes strongly affected by APA (APA genes), we identify miRNAs for which widespread APA events collectively increase or decrease the number of target sites [probabilistic inference of microRNA target site modification through APA (PRIMATA-APA)]. Using PRIMATA-APA on The Cancer Genome Atlas (TCGA) breast cancer data, we found that global APA events change the number of target sites of particular miRNAs [target sites modified miRNA (tamoMiRNA)] that are enriched for cancer development and treatments. We also found that although knockdown (KD) of NUDT21 in HeLa cells induces a different set of widespread 3'-UTR shortening events than that in the TCGA breast cancer data, it changes the target sites of the common tamoMiRNAs. Since the NUDT21 KD experiment previously demonstrated the tumorigenic role of APA events in a miRNA-dependent fashion, this result suggests that APA-initiated tumorigenesis is attributable to the miRNA target site changes, not the APA events themselves. Further, we found that the miRNA target site changes identify tumor cell proliferation and immune cell infiltration into the tumor microenvironment better than the miRNA expression levels or the APA events themselves. Altogether, our computational analyses provide a proof-of-concept demonstration that miRNA target site information indicates the effect of global APA events, with potential as a predictive biomarker.
Subject(s)
3' Untranslated Regions/genetics , Breast Neoplasms/genetics , MicroRNAs/genetics , Polyadenylation/genetics , Tumor Escape/genetics , Algorithms , Binding Sites/genetics , Breast Neoplasms/metabolism , Cell Proliferation/genetics , Cleavage And Polyadenylation Specificity Factor/genetics , Cleavage And Polyadenylation Specificity Factor/metabolism , Gene Expression Regulation, Neoplastic , HeLa Cells , Humans , Models, Genetic , RNA-Seq/methods , Tumor Microenvironment/genetics
ABSTRACT
BACKGROUND: Thrombotic microangiopathy-induced thrombocytopenia-associated multiple organ failure and hyperinflammatory macrophage activation syndrome are important causes of late pediatric sepsis mortality that are often missed or diagnosed late. The National Institute of General Medical Sciences sepsis research working group recommendations call for application of new research approaches in extant clinical data sets to improve efficiency of early trials of new sepsis therapies. Our objective is to apply machine learning approaches to derive computable 24-h sepsis phenotypes to facilitate personalized enrollment in early anti-inflammatory trials targeting these conditions. METHODS: We applied consensus k-means clustering analysis to our extant PHENOtyping sepsis-induced Multiple organ failure Study (PHENOMS) dataset of 404 children. Computable 24-h phenotypes are derived using 25 available bedside variables including C-reactive protein and ferritin. RESULTS: Four computable phenotypes (PedSep-A, B, C, and D) are derived. Compared to all other phenotypes, PedSep-A patients (n = 135; 2% mortality) were younger and previously healthy, with the lowest C-reactive protein and ferritin levels, the highest lymphocyte and platelet counts, highest heart rate, and lowest creatinine (p < 0.05); PedSep-B patients (n = 102; 12% mortality) were most likely to be intubated and had the lowest Glasgow Coma Scale Score (p < 0.05); PedSep-C patients (n = 110; 10% mortality) had the highest temperature and Glasgow Coma Scale Score, least pulmonary failure, and lowest lymphocyte counts (p < 0.05); and PedSep-D patients (n = 56; 34% mortality) had the highest creatinine and number of organ failures, including renal, hepatic, and hematologic organ failure, with the lowest platelet counts (p < 0.05).
PedSep-D had the highest likelihood of developing thrombocytopenia-associated multiple organ failure (Adj OR 47.51 95% CI [18.83-136.83], p < 0.0001) and macrophage activation syndrome (Adj OR 38.63 95% CI [13.26-137.75], p < 0.0001). CONCLUSIONS: Four computable phenotypes are derived, with PedSep-D being optimal for enrollment in early personalized anti-inflammatory trials targeting thrombocytopenia-associated multiple organ failure and macrophage activation syndrome in pediatric sepsis. A computer tool for identification of individual patient membership ( www.pedsepsis.pitt.edu ) is provided. Reproducibility will be assessed at completion of two ongoing pediatric sepsis studies.
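The consensus k-means step named in METHODS can be illustrated with a minimal sketch. It is not the PHENOMS pipeline: the subsampling fraction, number of resamples, random seed, plain Lloyd's k-means, and the one-variable toy data are all assumptions made here for illustration, and the function names `kmeans` and `consensus_matrix` are hypothetical. The idea shown is that k-means is rerun on random subsamples and, for each pair of patients, the fraction of runs in which they share a cluster is recorded.

```python
import random

def kmeans(points, k, rng, iters=20):
    """Lloyd's k-means on a list of equal-length tuples; returns one label per point."""
    centers = rng.sample(points, k)  # initialise from k distinct observations
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each non-empty cluster's center to its mean.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

def consensus_matrix(points, k, runs=30, frac=0.8, seed=0):
    """Fraction of subsampled runs in which each pair of points shares a cluster."""
    rng = random.Random(seed)
    n = len(points)
    together = [[0] * n for _ in range(n)]  # times i and j were co-clustered
    sampled = [[0] * n for _ in range(n)]   # times i and j were drawn together
    for _ in range(runs):
        idx = rng.sample(range(n), max(k, int(frac * n)))
        labels = kmeans([points[i] for i in idx], k, rng)
        for a, i in enumerate(idx):
            for b, j in enumerate(idx):
                sampled[i][j] += 1
                together[i][j] += labels[a] == labels[b]
    return [
        [together[i][j] / sampled[i][j] if sampled[i][j] else 0.0 for j in range(n)]
        for i in range(n)
    ]

# Toy data (made up): six "patients", one standardised variable, two obvious groups.
toy = [(0.0,), (0.2,), (0.4,), (5.0,), (5.3,), (5.6,)]
consensus = consensus_matrix(toy, k=2)
```

Entries near 1 mark pairs that are stably grouped across resamples and entries near 0 mark pairs that are stably separated. In practice the matrix would be computed over all bedside variables and several candidate values of k before settling on a final number of phenotypes.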
Subject(s)
Macrophage Activation Syndrome , Sepsis , Thrombocytopenia , Anti-Inflammatory Agents , C-Reactive Protein , Child , Clinical Trials as Topic , Creatinine , Ferritins , Humans , Machine Learning , Macrophage Activation Syndrome/complications , Multiple Organ Failure/etiology , Organ Dysfunction Scores , Phenotype , Reproducibility of Results
ABSTRACT
The ALOG (Arabidopsis LSH1 and Oryza G1) family proteins, namely DUF640 domain-containing proteins, have been reported to function as transcription factors in various plants. However, the response and function of ALOG family genes during reproductive development and under abiotic stress remain poorly understood. In this study, we comprehensively analyzed the structural characteristics of ALOG family proteins and their expression profiles during inflorescence development and under abiotic stress in rice. The results showed that OsG1/OsG1L1/2/3/4/5/6/7/8/9 all had four conserved helical structures and an inserted Zinc-Ribbon (ZnR), whereas the other four proteins, OsG1L10/11/12/13, lacked complete Helix-1 and Helix-2. The ALOG gene promoters contained abundant cis-acting elements, including ABA-, MeJA-, and drought-responsive elements. Most ALOG genes showed a decrease in expression levels within 24 h under ABA and drought treatments, whereas OsG1L2 expression showed an upregulated trend under the same treatments. Expression analysis at different stages of inflorescence development indicated that OsG1L1/2/3/8/11 were mainly expressed in the P1 stage, whereas OsG1/OsG1L4/5/9/12 had higher expression levels in the P4 stage. These results lay a good foundation for further studying the expression of rice ALOG family genes under abiotic stresses, and provide important experimental support for their functional research.
ABSTRACT
[This corrects the article DOI: 10.3389/fgene.2024.1381690.]
ABSTRACT
BACKGROUND: Learning the causal structure helps identify risk factors, disease mechanisms, and candidate therapeutics for complex diseases. However, although complex biological systems are characterized by nonlinear associations, existing bioinformatic methods of causal inference can neither identify nonlinear relationships nor estimate their effect size. RESULTS: To overcome these limitations, we developed the first computational method that explicitly learns nonlinear causal relations and estimates their effect size using a deep neural network approach coupled with the knockoff framework, named causal directed acyclic graphs using deep learning variable selection (DAG-deepVASE). Using simulation data of diverse scenarios and identifying known and novel causal relations in molecular and clinical data of various diseases, we demonstrated that DAG-deepVASE consistently outperforms existing methods in identifying true and known causal relations. In the analyses, we also illustrate how identifying nonlinear causal relations and estimating their effect size helps in understanding complex disease pathobiology, which is not possible using other methods. CONCLUSIONS: With these advantages, the application of DAG-deepVASE can help identify driver genes and therapeutic agents in biomedical studies and clinical trials.
Subject(s)
Neural Networks, Computer , Computer Simulation , Causality
ABSTRACT
BACKGROUND: Alternative polyadenylation (APA) causes shortening or lengthening of the 3'-untranslated region (3'-UTR) of genes (APA genes) in diverse cellular processes such as cell proliferation and differentiation. Current bioinformatic methods for identifying cell-type-specific APA genes in scRNA-Seq data have several limitations. First, they assume certain read coverage shapes in the scRNA-Seq data, which can be violated in multiple APA genes. Second, their identification is limited to comparisons between 2 cell types and is not directly applicable to data of multiple cell types. Third, they do not control for undesired sources of variance, which potentially introduce noise into the cell-type-specific identification of APA genes. FINDINGS: We developed a combination of a computational change-point algorithm and a statistical model, single-cell Multi-group identification of APA (scMAPA). To avoid assumptions on the read coverage shape, scMAPA formulates a change-point problem after transforming the 3'-biased scRNA-Seq data to represent the full-length 3'-UTR signal. To identify cell-type-specific APA genes while adjusting for undesired sources of variation, scMAPA models APA isoforms in consideration of both the cell types and the undesired sources. In our novel simulation data and data from human peripheral blood mononuclear cells, scMAPA outperforms existing methods in sensitivity, robustness, and stability. In mouse brain data consisting of multiple cell types sampled from multiple regions, scMAPA identifies cell-type-specific APA genes, elucidating novel roles of APA for dividing immune cells and differentiated neuron cells and in multiple brain disorders. CONCLUSIONS: scMAPA elucidates the cell-type-specific function of APA events and sheds novel insights into the functional roles of APA events in complex tissues.
Subject(s)
Leukocytes, Mononuclear , Polyadenylation , 3' Untranslated Regions , Animals , Cell Proliferation , Mice , Sequence Analysis, RNA/methods
ABSTRACT
Shortening of 3'UTRs (3'US) through alternative polyadenylation is a post-transcriptional mechanism that regulates the expression of hundreds of genes in human cancers. In breast cancer, different subtypes of tumor samples, such as estrogen receptor positive and negative (ER+ and ER-), are characterized by distinct molecular mechanisms, suggesting possible differences in the post-transcriptional regulation between the subtype tumors. In this study, based on the profound tumorigenic role of 3'US interacting with competing-endogenous RNA (ceRNA) network (3'US-ceRNA effect), we hypothesize that the 3'US-ceRNA effect drives subtype-specific tumor growth. However, we found that the subtypes are available in different sample sizes, biasing the ceRNA network size and disabling the fair comparison of the 3'US-ceRNA effect. Using normalized Laplacian matrix eigenvalue distribution, we addressed this bias and built tumor ceRNA networks comparable between the subtypes. Based on the comparison, we identified a novel role of housekeeping (HK) genes as stable and strong miRNA sponges (sponge HK genes) that synchronize the ceRNA networks of normal samples (adjacent to ER+ and ER- tumor samples). We further found that distinct 3'US events in the ER- tumor break the stable sponge effect of HK genes in a subtype-specific fashion, especially in association with the aggressive and metastatic phenotypes. Knockdown of NUDT21 further suggested the role of 3'US-ceRNA effect in repressing HK genes for tumor growth. In this study, we identified 3'US-ceRNA effect on the sponge HK genes for subtype-specific growth of ER- tumors.
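One reason a normalized Laplacian eigenvalue distribution suits networks of unequal size is that its eigenvalues always lie in [0, 2], whatever the number of nodes. The sketch below illustrates only that property; it is not the authors' network-construction procedure. The function names, the bin count, and the two toy graphs are assumptions for illustration, and the adjacency matrices are assumed to be symmetric.

```python
import numpy as np

def normalized_laplacian_spectrum(adj):
    """Eigenvalues of L = I - D^{-1/2} A D^{-1/2} for a symmetric adjacency matrix."""
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1)
    connected = deg > 0
    inv_sqrt = np.zeros_like(deg)
    inv_sqrt[connected] = 1.0 / np.sqrt(deg[connected])
    # Isolated nodes get a zero row and column (the usual convention).
    lap = np.diag(connected.astype(float)) - inv_sqrt[:, None] * adj * inv_sqrt[None, :]
    return np.linalg.eigvalsh(lap)  # ascending order, all within [0, 2]

def spectral_density(adj, bins=10):
    """Histogram of the spectrum on the fixed range [0, 2], normalised to unit area."""
    hist, edges = np.histogram(
        normalized_laplacian_spectrum(adj), bins=bins, range=(0.0, 2.0), density=True
    )
    return hist, edges

# Two toy networks of different size (made up): a 3-node path and a 4-node clique.
path3 = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
clique4 = [[0, 1, 1, 1], [1, 0, 1, 1], [1, 1, 0, 1], [1, 1, 1, 0]]
density_small, _ = spectral_density(path3)
density_large, _ = spectral_density(clique4)
```

Because both densities are defined on the same bins, they can be compared directly (for example by a distance between the two histograms) even though the underlying networks differ in node count.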
ABSTRACT
OBJECTIVE: Ovarian cancer (OC) is one of the most common types of cancer in women. Accurate discrimination between benign ovarian tumors (BOT) and OC has important practical value. METHODS: Our dataset consists of 349 Chinese patients with 49 variables including demographics, routine blood tests, general chemistry, and tumor markers. The Minimum Redundancy - Maximum Relevance (MRMR) machine learning feature selection method was applied to the data of 235 patients (89 BOT and 146 OC) to select the most relevant features, with which a simple decision tree model was constructed. The model was tested on the remaining 114 patients (89 BOT and 25 OC). The results were compared with the predictions produced by the risk of ovarian malignancy algorithm (ROMA) and a logistic regression model. RESULTS: Eight notable features were selected by MRMR, among which two were identified as the top features by the decision tree model: human epididymis protein 4 (HE4) and carcinoembryonic antigen (CEA). In particular, CEA is a valuable marker for OC prediction in patients with low HE4. The model also yields better prediction results than ROMA. CONCLUSION: Machine learning approaches were able to accurately classify BOT and OC. Our goal was to derive a simple predictive model that also performs well. Using our approach, we obtained a model that consists of just two biomarkers, HE4 and CEA. The model is simple to interpret and outperforms the existing OC prediction methods. It demonstrates that the machine learning approach has good potential in predictive modeling for complex diseases.
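The structure of a two-marker tree of the kind described, with HE4 splitting first and CEA refining the low-HE4 branch, can be written as a few lines of code. This is a sketch only: the abstract reports no cut-off values, so `HE4_CUTOFF` and `CEA_CUTOFF` below are hypothetical placeholders in arbitrary units, the sample values are made up, and the function names are illustrative.

```python
# Hypothetical cut-offs in arbitrary units; the abstract does not report thresholds.
HE4_CUTOFF = 70.0
CEA_CUTOFF = 5.0

def classify(he4, cea):
    """Two-marker tree: HE4 splits first, CEA refines the low-HE4 branch."""
    if he4 > HE4_CUTOFF:
        return "OC"
    return "OC" if cea > CEA_CUTOFF else "BOT"

def sensitivity_specificity(samples):
    """samples: iterable of (he4, cea, true_label); OC is treated as the positive class."""
    tp = fn = tn = fp = 0
    for he4, cea, truth in samples:
        pred = classify(he4, cea)
        if truth == "OC":
            tp += pred == "OC"
            fn += pred != "OC"
        else:
            tn += pred == "BOT"
            fp += pred != "BOT"
    return tp / (tp + fn), tn / (tn + fp)

# Made-up example patients: (HE4, CEA, true label).
toy_patients = [
    (120.0, 2.0, "OC"),
    (40.0, 9.0, "OC"),   # low HE4, caught by the CEA branch
    (45.0, 1.5, "OC"),   # low HE4 and low CEA, missed by this toy tree
    (35.0, 2.0, "BOT"),
    (50.0, 3.0, "BOT"),
    (80.0, 1.0, "BOT"),  # high HE4, false positive under the toy cut-off
]
sens, spec = sensitivity_specificity(toy_patients)
```

A model of this shape is easy to audit because every prediction traces back to at most two comparisons, which is the interpretability argument made in the CONCLUSION.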
Subject(s)
CA-125 Antigen , Ovarian Neoplasms , Algorithms , Biomarkers, Tumor , Carcinoma, Ovarian Epithelial , Female , Humans , Machine Learning , Ovarian Neoplasms/diagnosis
ABSTRACT
Cultured cell models are an essential complement to dissecting kidney proximal tubule (PT) function in health and disease but do not fully recapitulate key features of this nephron segment. We recently determined that culture of opossum kidney (OK) cells under continuous orbital shear stress (OSS) significantly augments their morphological and functional resemblance to PTs in vivo. Here we used RNASeq to identify temporal transcriptional changes upon cell culture under static or shear stress conditions. Comparison of gene expression in cells cultured under static or OSS conditions with a database of rat nephron segment gene expression confirms that OK cells cultured under OSS are more similar to the PT in vivo compared with cells maintained under static conditions. Both improved oxygenation and mechanosensitive stimuli contribute to the enhanced differentiation in these cells, and we identified temporal changes in gene expression of known mechanosensitive targets. We observed changes in mRNA and protein levels of membrane trafficking components that may contribute to the enhanced endocytic capacity of cells cultured under OSS. Our data reveal pathways that may be critical for PT differentiation in vivo and validate the utility of this improved cell culture model as a tool to study PT function.