1.
Brief Bioinform ; 25(3)2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38622357

ABSTRACT

Pseudouridine is an RNA modification that is widely distributed in both prokaryotes and eukaryotes, and plays a critical role in numerous biological activities. Despite its importance, the precise identification of pseudouridine sites through experimental approaches poses significant challenges, requiring substantial time and resources. Therefore, there is a growing need for computational techniques that can reliably and quickly identify pseudouridine sites from vast amounts of RNA sequencing data. In this study, we propose fuzzy kernel evidence Random Forest (FKeERF) to identify pseudouridine sites. The resulting method, PseU-FKeERF, demonstrates high accuracy in identifying pseudouridine sites from RNA sequencing data. The PseU-FKeERF model selected four RNA feature coding schemes with relatively good performance for feature combination, and then input them into the newly proposed FKeERF method for category prediction. FKeERF not only uses fuzzy logic to expand the original feature space, but also combines kernel methods that are easy to interpret in general for category prediction. Both cross-validation tests and independent tests on benchmark datasets have shown that PseU-FKeERF has better predictive performance than several state-of-the-art methods. This new method not only improves the accuracy of pseudouridine site identification, but also provides a certain reference for disease control and related drug development in the future.


Subject(s)
Pseudouridine , Random Forest , Pseudouridine/genetics , RNA/genetics , Base Sequence
2.
J Biol Chem ; 300(1): 105564, 2024 Jan.
Article in English | MEDLINE | ID: mdl-38103644

ABSTRACT

The polysialyltransferases ST8SIA2 and ST8SIA4 and their product, polysialic acid (polySia), are known to be related to cancers and mental disorders. ST8SIA2 and ST8SIA4 have conserved amino acid (AA) sequence motifs essential for the synthesis of the polySia structures on the neural cell adhesion molecule. To search for a new motif in the polysialyltransferases, we adopted the in silico Individual Meta Random Forest program that can predict disease-related AA substitutions. The Individual Meta Random Forest program predicted a new eight-amino-acid sequence motif consisting of highly pathogenic AA residues, thus designated as the pathogenic (P) motif. A series of alanine point mutation experiments in the pathogenic motif (P motif) showed that most P motif mutants lost the polysialylation activity without changing the proper enzyme expression levels or localization in the Golgi. In addition, we evaluated the enzyme stability of the P motif mutants using newly established calculations of mutation energy, demonstrating that the subtle change of the conformational energy regulates the activity. In the AlphaFold2 model, we found that the P motif was a buried β-strand underneath the known surface motifs unique to ST8SIA2 and ST8SIA4. Taken together, the P motif is a novel buried β-strand that regulates the full activity of polysialyltransferases from the inside of the molecule.


Subject(s)
Mutation , Sialyltransferases , Humans , Amino Acid Motifs/genetics , Amino Acid Substitution , Computer Simulation , Golgi Apparatus/enzymology , Golgi Apparatus/metabolism , Neural Cell Adhesion Molecules/chemistry , Neural Cell Adhesion Molecules/metabolism , Point Mutation , Protein Conformation, beta-Strand , Protein Transport , Random Forest , Sialic Acids/metabolism , Sialyltransferases/chemistry , Sialyltransferases/genetics , Sialyltransferases/metabolism
3.
Brief Bioinform ; 24(2)2023 03 19.
Article in English | MEDLINE | ID: mdl-36653905

ABSTRACT

In longitudinal studies variables are measured repeatedly over time, leading to clustered and correlated observations. If the goal of the study is to develop prediction models, machine learning approaches such as the powerful random forest (RF) are often promising alternatives to standard statistical methods, especially in the context of high-dimensional data. In this paper, we review extensions of the standard RF method for the purpose of longitudinal data analysis. Extension methods are categorized according to the data structures for which they are designed. We consider both univariate and multivariate response longitudinal data and further categorize the repeated measurements according to whether the time effect is relevant. Even though most extensions are proposed for low-dimensional data, some can be applied to high-dimensional data. Information of available software implementations of the reviewed extensions is also given. We conclude with discussions on the limitations of our review and some future research directions.


Subject(s)
Random Forest , Software , Longitudinal Studies , Data Analysis
4.
Brief Bioinform ; 25(1)2023 11 22.
Article in English | MEDLINE | ID: mdl-38033292

ABSTRACT

Throughout evolution, pathogenic viruses have developed different strategies to evade the response of the adaptive immune system. To carry out successful replication, some pathogenic viruses encode different proteins that manipulate the molecular mechanisms of host cells. Currently, there are different bioinformatics tools for virus research; however, none of them focus on predicting viral proteins that evade the adaptive system. In this work, we have developed a novel tool based on machine and deep learning for predicting this type of viral protein named VirusHound-I. This tool is based on a model developed with the multilayer perceptron algorithm using the dipeptide composition molecular descriptor. In this study, we have also demonstrated the robustness of our strategy for data augmentation of the positive dataset based on generative adversarial networks. During the 10-fold cross-validation step in the training dataset, the predictive model showed 0.947 accuracy, 0.994 precision, 0.943 F1 score, 0.995 specificity, 0.896 sensitivity, 0.894 kappa, 0.898 Matthews correlation coefficient and 0.989 AUC. On the other hand, during the testing step, the model showed 0.964 accuracy, 1.0 precision, 0.967 F1 score, 1.0 specificity, 0.936 sensitivity, 0.929 kappa, 0.931 Matthews correlation coefficient and 1.0 AUC. Taking this model into account, we have developed a tool called VirusHound-I that makes it possible to predict viral proteins that evade the host's adaptive immune system. We believe that VirusHound-I can be very useful in accelerating studies on the molecular mechanisms of evasion of pathogenic viruses, as well as in the discovery of therapeutic targets.


Subject(s)
Viral Proteins , Viruses , Viral Proteins/genetics , Viral Proteins/chemistry , Random Forest , Neural Networks, Computer , Algorithms , Viruses/genetics
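
The dipeptide composition descriptor named in the abstract above can be illustrated with a short sketch. The abstract does not give the exact definition used by VirusHound-I, so this assumes the common one: the fraction of each of the 400 ordered pairs of adjacent standard residues in a protein sequence. The names `AMINO_ACIDS` and `dipeptide_composition` are illustrative and do not come from the tool.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def dipeptide_composition(sequence):
    """Return the fraction of each of the 400 ordered residue pairs.

    Assumption: pairs containing non-standard symbols (e.g. 'X') are skipped
    and do not count toward the total.
    """
    seq = sequence.upper()
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        # Sequence too short or no valid pairs: return an all-zero vector.
        return {pair: 0.0 for pair in counts}
    return {pair: n / total for pair, n in counts.items()}
```

A fixed-length vector of this kind is what allows sequences of different lengths to be fed to a classifier such as a multilayer perceptron.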
5.
Bioinformatics ; 40(Suppl 2): ii198-ii207, 2024 09 01.
Article in English | MEDLINE | ID: mdl-39230698

ABSTRACT

MOTIVATION: In the realm of precision medicine, effective patient stratification and disease subtyping demand innovative methodologies tailored for multi-omics data. Clustering techniques applied to multi-omics data have become instrumental in identifying distinct subgroups of patients, enabling a finer-grained understanding of disease variability. Meanwhile, clinical datasets are often small and must be aggregated from multiple hospitals. Online data sharing, however, is seen as a significant challenge due to privacy concerns, potentially impeding big data's role in medical advancements using machine learning. This work establishes a powerful framework for advancing precision medicine through unsupervised random forest-based clustering in combination with federated computing. RESULTS: We introduce a novel multi-omics clustering approach utilizing unsupervised random forests. The unsupervised nature of the random forest enables the determination of cluster-specific feature importance, unraveling key molecular contributors to distinct patient groups. Our methodology is designed for federated execution, a crucial aspect in the medical domain where privacy concerns are paramount. We have validated our approach on machine learning benchmark datasets as well as on cancer data from The Cancer Genome Atlas. Our method is competitive with the state-of-the-art in terms of disease subtyping, but at the same time substantially improves the cluster interpretability. Experiments indicate that local clustering performance can be improved through federated computing. AVAILABILITY AND IMPLEMENTATION: The proposed methods are available as an R-package (https://github.com/pievos101/uRF).


Subject(s)
Precision Medicine , Humans , Cluster Analysis , Precision Medicine/methods , Unsupervised Machine Learning , Machine Learning , Neoplasms , Privacy , Algorithms , Random Forest
6.
PLoS Comput Biol ; 20(6): e1011361, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38875302

ABSTRACT

Tumor microenvironments (TMEs) contain vast amounts of information on patient's cancer through their cellular composition and the spatial distribution of tumor cells and immune cell populations. Exploring variations in TMEs between patient groups, as well as determining the extent to which this information can predict outcomes such as patient survival or treatment success with emerging immunotherapies, is of great interest. Moreover, in the face of a large number of cell interactions to consider, we often wish to identify specific interactions that are useful in making such predictions. We present an approach to achieve these goals based on summarizing spatial relationships in the TME using spatial K functions, and then applying functional data analysis and random forest models to both predict outcomes of interest and identify important spatial relationships. This approach is shown to be effective in simulation experiments at both identifying important spatial interactions while also controlling the false discovery rate. We further used the proposed approach to interrogate two real data sets of Multiplexed Ion Beam Images of TMEs in triple negative breast cancer and lung cancer patients. The methods proposed are publicly available in a companion R package funkycells.


Subject(s)
Cell Communication , Tumor Microenvironment , Tumor Microenvironment/physiology , Humans , Cell Communication/physiology , Computational Biology/methods , Lung Neoplasms/immunology , Lung Neoplasms/pathology , Algorithms , Computer Simulation , Triple Negative Breast Neoplasms/pathology , Triple Negative Breast Neoplasms/immunology , Neoplasms/immunology , Neoplasms/pathology , Models, Biological , Female , Random Forest
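
As a rough illustration of the spatial K functions mentioned in the abstract above, the sketch below computes a naive cross-type K estimate between two cell populations. It assumes no edge correction and a known observation-window area. The function name `cross_k` is hypothetical and this is not the funkycells implementation.

```python
import math


def cross_k(points_a, points_b, radius, area):
    """Naive cross-type K estimate at a single radius.

    K_ab(r) = area / (n_a * n_b) * #{(a, b) : dist(a, b) <= r}
    Assumption: no edge correction is applied, so values are biased
    downward near the window boundary.
    """
    if not points_a or not points_b:
        raise ValueError("both point sets must be non-empty")
    close_pairs = sum(
        1
        for a in points_a
        for b in points_b
        if math.dist(a, b) <= radius
    )
    return area * close_pairs / (len(points_a) * len(points_b))
```

Under complete spatial randomness the expected value is roughly pi * r**2, so estimates well above that suggest that the two cell types cluster together at distance r. Evaluating the function over a grid of radii yields a curve of the kind that functional data analysis can then summarize.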
7.
Proteomics ; 24(6): e2300231, 2024 Mar.
Article in English | MEDLINE | ID: mdl-37525341

ABSTRACT

Non-invasive diagnostics and therapies are crucial to prevent patients from undergoing painful procedures. Exosomal proteins can serve as important biomarkers for such advancements. In this study, we attempted to build a model to predict exosomal proteins. All models are trained, tested, and evaluated on a non-redundant dataset comprising 2831 exosomal and 2831 non-exosomal proteins, where no two proteins have more than 40% similarity. Initially, the standard similarity-based method Basic Local Alignment Search Tool (BLAST) was used to predict exosomal proteins, which failed due to low-level similarity in the dataset. To overcome this challenge, machine learning (ML) based models were developed using compositional and evolutionary features of proteins achieving an area under the receiver operating characteristics (AUROC) of 0.73. Our analysis also indicated that exosomal proteins have a variety of sequence-based motifs which can be used to predict exosomal proteins. Hence, we developed a hybrid method combining motif-based and ML-based approaches for predicting exosomal proteins, achieving a maximum AUROC of 0.85 and MCC of 0.56 on an independent dataset. This hybrid model performs better than presently available methods when assessed on an independent dataset. A web server and a standalone software ExoProPred (https://webs.iiitd.edu.in/raghava/exopropred/) have been created to help scientists predict and discover exosomal proteins and find functional motifs present in them.


Subject(s)
Random Forest , Sequence Analysis, Protein , Humans , Amino Acid Sequence , Sequence Analysis, Protein/methods , Proteins/metabolism , Software
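
The AUROC values reported in the abstract above can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. A minimal sketch of that pairwise definition follows. The function name `auroc` and the 0/1 label convention are assumptions for illustration, not part of ExoProPred.

```python
def auroc(labels, scores):
    """AUROC via pairwise comparison of positive and negative scores.

    Ties between a positive and a negative score count as half a win.
    Runs in O(n_pos * n_neg), which is fine for small illustrative inputs.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

For example, `auroc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])` gives 0.75, because three of the four positive-negative pairs are ranked correctly.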
8.
BMC Bioinformatics ; 25(1): 78, 2024 Feb 20.
Article in English | MEDLINE | ID: mdl-38378437

ABSTRACT

BACKGROUND: In recent years, the extensive use of drugs and antibiotics has led to increasing microbial resistance. Therefore, it becomes crucial to explore deep connections between drugs and microbes. However, traditional biological experiments are very expensive and time-consuming. Therefore, it is meaningful to develop efficient computational models to forecast potential microbe-drug associations. RESULTS: In this manuscript, we proposed a novel prediction model called GARFMDA by combining graph attention networks and bilayer random forest to infer probable microbe-drug correlations. In GARFMDA, through integrating different microbe-drug-disease correlation indices, we constructed two different microbe-drug networks first. And then, based on multiple measures of similarity, we constructed a unique feature matrix for drugs and microbes respectively. Next, we fed these newly-obtained microbe-drug networks together with feature matrices into the graph attention network to extract the low-dimensional feature representations for drugs and microbes separately. Thereafter, these low-dimensional feature representations, along with the feature matrices, would be further inputted into the first layer of the Bilayer random forest model to obtain the contribution values of all features. And then, after removing features with low contribution values, these contribution values would be fed into the second layer of the Bilayer random forest to detect potential links between microbes and drugs. CONCLUSIONS: Experimental results and case studies show that GARFMDA can achieve better prediction performance than state-of-the-art approaches, which means that GARFMDA may be a useful tool in the field of microbe-drug association prediction in the future. Besides, the source code of GARFMDA is available at https://github.com/KuangHaiYue/GARFMDA.git.


Subject(s)
Anti-Bacterial Agents , Random Forest , Probability , Software
9.
BMC Bioinformatics ; 25(1): 253, 2024 Aug 01.
Article in English | MEDLINE | ID: mdl-39090608

ABSTRACT

BACKGROUND: Conditional logistic regression trees have been proposed as a flexible alternative to the standard method of conditional logistic regression for the analysis of matched case-control studies. While they avoid the strict assumption of linearity and automatically incorporate interactions, conditional logistic regression trees may suffer from a relatively high variability. Further machine learning methods for the analysis of matched case-control studies are missing because conventional machine learning methods cannot handle the matched structure of the data. RESULTS: A random forest method for the analysis of matched case-control studies based on conditional logistic regression trees is proposed, which overcomes the issue of high variability. It provides an accurate estimation of exposure effects while being more flexible in the functional form of covariate effects. The efficacy of the method is illustrated in a simulation study and within an application to real-world data from a matched case-control study on the effect of regular participation in cervical cancer screening on the development of cervical cancer. CONCLUSIONS: The proposed random forest method is a promising add-on to the toolbox for the analysis of matched case-control studies and addresses the need for machine-learning methods in this field. It provides a more flexible approach compared to the standard method of conditional logistic regression, but also compared to conditional logistic regression trees. It allows for non-linearity and the automatic inclusion of interaction effects and is suitable both for exploratory and explanatory analyses.


Subject(s)
Machine Learning , Random Forest , Female , Humans , Case-Control Studies , Logistic Models , Uterine Cervical Neoplasms
10.
BMC Bioinformatics ; 25(1): 18, 2024 Jan 11.
Article in English | MEDLINE | ID: mdl-38212697

ABSTRACT

BACKGROUND: Metabolic syndrome (MetS) is a cluster of metabolic abnormalities (including obesity, insulin resistance, hypertension, and dyslipidemia), which can be used to identify at-risk populations for diabetes and cardiovascular diseases, the main causes of morbidity and mortality worldwide. A simple approach for diagnosing MetS without biochemical tests would be valuable. The present study aimed to predict MetS using non-invasive features based on a random forest learning algorithm. Also, to deal with the problem of data imbalance that naturally exists in this type of data, the effect of two different data balancing approaches, including the Synthetic Minority Over-sampling Technique (SMOTE) and Random Splitting data balancing (SplitBal), on model performance is investigated. RESULTS: The most important determinant for MetS prediction was waist circumference. Applying a random forest learning algorithm to imbalanced data, the trained models reach 86.9% and 79.4% accuracies and 37.1% and 38.2% sensitivities in men and women, respectively. The best results were obtained by applying the SplitBal data balancing technique: although the accuracy of the trained models decreased by 7.8% and 11.3%, their sensitivity improved significantly to 82.3% and 73.7% in men and women, respectively. CONCLUSIONS: The random forest learning method, along with data balancing techniques, especially SplitBal, could create MetS prediction models with promising results that can be applied as a useful prognostic tool in health screening programs.


Subject(s)
Insulin Resistance , Metabolic Syndrome , Male , Humans , Female , Metabolic Syndrome/diagnosis , Random Forest , Risk Factors , Obesity
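
The gap between accuracy and sensitivity reported in the abstract above is typical of imbalanced data, and a small sketch makes the arithmetic concrete. The confusion-matrix counts in the usage note are hypothetical and are not taken from the study. The function name `confusion_metrics` is likewise illustrative.

```python
import math


def confusion_metrics(tp, fp, tn, fn):
    """Compute common binary-classification metrics from confusion counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # Matthews correlation coefficient; defined as 0 when a margin is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }
```

With hypothetical counts of 37 true positives, 63 false negatives, 20 false positives and 880 true negatives, `confusion_metrics(37, 20, 880, 63)` yields an accuracy of 0.917 but a sensitivity of only 0.37. The majority class dominates the accuracy, which is why balancing techniques are judged mainly by how much sensitivity they recover.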
11.
BMC Bioinformatics ; 25(1): 108, 2024 Mar 12.
Article in English | MEDLINE | ID: mdl-38475723

ABSTRACT

RNA-protein interaction (RPI) is crucial to the life processes of diverse organisms. Various researchers have identified RPI through long-term and high-cost biological experiments. Although numerous machine learning and deep learning-based methods for predicting RPI currently exist, their robustness and generalizability have significant room for improvement. This study proposes LPI-MFF, an RPI prediction model based on multi-source information fusion, to address these issues. The LPI-MFF employed protein-protein interactions features, sequence features, secondary structure features, and physical and chemical properties as the information sources with the corresponding coding scheme, followed by the random forest algorithm for feature screening. Finally, all information was combined and a classification method based on convolutional neural networks is used. The experimental results of fivefold cross-validation demonstrated that the accuracy of LPI-MFF on RPI1807 and NPInter was 97.60% and 97.67%, respectively. In addition, the accuracy rate on the independent test set RPI1168 was 84.9%, and the accuracy rate on the Mus musculus dataset was 90.91%. Accordingly, LPI-MFF demonstrated greater robustness and generalization than other prevalent RPI prediction methods.


Subject(s)
Deep Learning , RNA, Long Noncoding , Animals , Mice , RNA, Long Noncoding/chemistry , Random Forest , Neural Networks, Computer , Machine Learning , Computational Biology/methods
12.
Oncologist ; 29(1): e68-e80, 2024 Jan 05.
Article in English | MEDLINE | ID: mdl-37669005

ABSTRACT

BACKGROUND: We aimed to develop a machine-learning model for predicting treatment response to radioiodine (131I) therapy and thyrotropin (TSH) suppression therapy in patients with differentiated thyroid cancer (DTC) but without structural disease, based on pre-treatment information. PATIENTS AND METHODS: Overall, 597 and 326 patients with DTC but without structural disease were randomly assigned to "training" cohorts for predicting treatment response to 131I therapy and TSH suppression therapy, respectively. Six supervised algorithms, including Logistic Regression, Support Vector Machine, Random Forest (RF), Neural Networks, Adaptive Boosting, and Gradient Boost, were used to predict effective response (ER) to 131I therapy and biochemical remission (BR) to TSH suppression therapy. RESULTS: Stimulated and suppressed thyroglobulin (Tg) and radioiodine uptake before the current course of 131I therapy contributed most to ER to 131I therapy, while thyroid remnant available on the post-therapeutic whole-body scan at the last course of 131I therapy and TSH contributed greatly to Tg decline under TSH suppression therapy. RF showed the best performance among all models. The accuracy and area under the receiver operating characteristic curve (AUC) for segregating ER from non-ER during 131I therapy with RF were 81.3% and 0.896, respectively. The accuracy and AUC for predicting BR to TSH suppression therapy with RF were 78.7% and 0.857, respectively. CONCLUSION: This study demonstrates that machine learning models, especially the RF algorithm, are useful tools that may predict treatment response to 131I therapy and TSH suppression therapy in DTC patients without structural disease based on pre-treatment routine clinical variables and biochemical markers.


Subject(s)
Iodine Radioisotopes , Thyroid Neoplasms , Humans , Iodine Radioisotopes/therapeutic use , Random Forest , Thyroglobulin/therapeutic use , Thyroid Neoplasms/drug therapy , Thyroid Neoplasms/radiotherapy , Thyroidectomy , Thyrotropin/therapeutic use
13.
Anal Chem ; 96(35): 14168-14177, 2024 Sep 03.
Article in English | MEDLINE | ID: mdl-39163401

ABSTRACT

Antibiotic resistance can rapidly spread through bacterial populations via bacterial conjugation. The bacterial membrane has an important role in facilitating conjugation, thus investigating the effects on the bacterial membrane caused by conjugative plasmids, antibiotic resistance, and genes involved in conjugation is of interest. Analysis of bacterial membranes was conducted using gas cluster ion beam-secondary ion mass spectrometry (GCIB-SIMS). The complexity of the data means that data analysis is important for the identification of changes in the membrane composition. Preprocessing of data and several analytical methods for identification of changes in bacterial membranes have been investigated. GCIB-SIMS data from Escherichia coli samples were subjected to principal components analysis (PCA), principal components-canonical variate analysis (PC-CVA), and Random Forests (RF) data analysis with the aim of extracting the maximum biological information. The influence of increasing replicate data was assessed, and the effect of diminishing biological variation was studied. Optimized m/z region-specific scaling provided improved clustering, with an increase in biologically significant peaks contributing to the loadings. PC-CVA improved clustering, provided clearer loadings, and benefited from larger data sets collected over several months. RF required larger sample numbers and while showing overlap with the PC-CVA, produced additional peaks of interest. The combination of PC-CVA and RF allowed very subtle differences between bacterial strains and growth conditions to be elucidated for the first time. Specifically, comparative analysis of an E. coli strain with and without the F-plasmid revealed changes in cyclopropanation of fatty acids, where the addition of the F-plasmid led to a reduction in cyclopropanation.


Subject(s)
Escherichia coli , Principal Component Analysis , Spectrometry, Mass, Secondary Ion , Escherichia coli/drug effects , Spectrometry, Mass, Secondary Ion/methods , Anti-Bacterial Agents/pharmacology , Cell Membrane/metabolism , Cell Membrane/chemistry , Drug Resistance, Bacterial , Drug Resistance, Microbial , Random Forest
14.
J Gene Med ; 26(1): e3593, 2024 Jan.
Article in English | MEDLINE | ID: mdl-37730948

ABSTRACT

BACKGROUND: The dysfunction of secretory pathways may represent biomarkers or therapeutic targets of cancer. The hepatocellular carcinoma (HCC) phenotype was studied in relation to the genes in the secretory pathway and to screen for a combination of genes that may be a viable therapeutic target for HCC and connected to the pathophysiological features of the tumor. METHODS: Using the HCC information from The Cancer Genome Atlas, somatic mutation and prognostic association analysis were performed on the secretory pathway genes. Based on prognostic genes in the secretory pathway, the samples were consensus clustered, and a Random Forest model was built. The clinical characteristics, tumor mutation burden, functional status and potential responses to immunotherapy and tumor suppressor medications of various subtypes and risk groups were discussed. RESULTS: Of the 84 genes for secretory pathway, 32 were prognostic genes related to HCC, which divided HCC into two categories: C1 and C2. By comparing the two types of HCC samples, it was found that the survival outcome of C1 was inferior, with stronger adaptive and innate immunity, but less sensitive to immunotherapy than C2. The constructed prognostic signature included seven of the 32 prognostic genes in the secretory pathway, which showed significant correlation with the prognosis, somatic mutation, biological pathway status, potential response to immunotherapy and sensitivity of 72 tumor suppressor drugs from different HCC cohorts, and had a feasible prognostic effect for 31 types of cancer and immunotherapy cohorts. CONCLUSIONS: In this study, HCC was divided into two molecular subtypes according to prognostic genes in the secretory pathway, and seven of them were combined into one signature, which produced significant results in evaluating the prognosis of different HCC cohorts, pan-cancer cohorts and immunotherapy cohorts, and had potential guiding significance for prophylactic immunotherapy in patients with HCC.


Subject(s)
Carcinoma, Hepatocellular , Liver Neoplasms , Humans , Carcinoma, Hepatocellular/genetics , Carcinoma, Hepatocellular/therapy , Random Forest , Secretory Pathway , Liver Neoplasms/genetics , Liver Neoplasms/therapy , Immunotherapy
15.
Cancer Immunol Immunother ; 73(6): 112, 2024 May 02.
Article in English | MEDLINE | ID: mdl-38693422

ABSTRACT

OBJECTIVE: The high mortality rate of gastric cancer, traditionally managed through surgery, underscores the urgent need for advanced therapeutic strategies. Despite advancements in treatment modalities, outcomes remain suboptimal, necessitating the identification of novel biomarkers to predict sensitivity to immunotherapy. This study focuses on utilizing single-cell sequencing for gene identification and developing a random forest model to predict immunotherapy sensitivity in gastric cancer patients. METHODS: Differentially expressed genes were identified using single-cell RNA sequencing (scRNA-seq) and gene set enrichment analysis (GSEA). A random forest model was constructed based on these genes, and its effectiveness was validated through prognostic analysis. Further, analyses of immune cell infiltration, immune checkpoints, and the random forest model provided deeper insights. RESULTS: High METTL1 expression was found to correlate with improved survival rates in gastric cancer patients (P = 0.042), and the random forest model, based on METTL1 and associated prognostic genes, achieved a significant predictive performance (AUC = 0.863). It showed associations with various immune cell types and negative correlations with CTLA4 and PDCD1 immune checkpoints. Experiments in vitro and in vivo demonstrated that METTL1 enhances gastric cancer cell activity by suppressing T cell proliferation and upregulating CTLA4 and PDCD1. CONCLUSION: The random forest model, based on scRNA-seq, shows high predictive value for survival and immunotherapy sensitivity in gastric cancer patients. This study underscores the potential of METTL1 as a biomarker in enhancing the efficacy of gastric cancer immunotherapy.


Subject(s)
Immunotherapy , Single-Cell Analysis , Stomach Neoplasms , Stomach Neoplasms/genetics , Stomach Neoplasms/therapy , Stomach Neoplasms/immunology , Stomach Neoplasms/mortality , Humans , Single-Cell Analysis/methods , Immunotherapy/methods , Animals , Mice , Prognosis , Biomarkers, Tumor/genetics , Sequence Analysis, RNA/methods , Female , Male , Gene Expression Regulation, Neoplastic , Xenograft Model Antitumor Assays , Cell Line, Tumor , Random Forest
16.
Bioinformatics ; 39(6)2023 06 01.
Article in English | MEDLINE | ID: mdl-37326960

ABSTRACT

MOTIVATION: Interpretable deep learning (DL) models that can provide biological insights, in addition to accurate predictions, are of great interest to the biomedical community. Recently, interpretable DL models that incorporate signaling pathways have been proposed for drug response prediction (DRP). While these models improve interpretability, it is unclear whether this comes at the cost of less accurate DRPs, or a prediction improvement can also be obtained. RESULTS: We comprehensively and systematically assessed four state-of-the-art interpretable DL models using three pathway collections to assess their ability in making accurate predictions on unseen samples from the same dataset, as well as their generalizability to an independent dataset. Our results showed that models that explicitly incorporate pathway information in the form of a latent layer perform worse compared to models that incorporate this information implicitly. However, in most evaluation setups, the best performance was achieved using a black-box multilayer perceptron, and the performance of a random forests baseline was comparable to those of the interpretable models. Replacing the signaling pathways with randomly generated pathways showed a comparable performance for the majority of the models. Finally, the performance of all models deteriorated when applied to an independent dataset. These results highlight the importance of systematic evaluation of newly proposed models using carefully selected baselines. We provide different evaluation setups and baseline models that can be used to achieve this goal. AVAILABILITY AND IMPLEMENTATION: Implemented models and datasets are provided at https://doi.org/10.5281/zenodo.7787178 and https://doi.org/10.5281/zenodo.7101665, respectively.


Subject(s)
Deep Learning , Neural Networks, Computer , Random Forest
17.
Bioinformatics ; 39(39 Suppl 1): i368-i376, 2023 06 30.
Article in English | MEDLINE | ID: mdl-37387178

ABSTRACT

MOTIVATION: Single-cell RNA sequencing (scRNA-seq) offers a powerful tool to dissect the complexity of biological tissues through cell sub-population identification in combination with clustering approaches. Feature selection is a critical step for improving the accuracy and interpretability of single-cell clustering. Existing feature selection methods underutilize the discriminatory potential of genes across distinct cell types. We hypothesize that incorporating such information could further boost the performance of single cell clustering. RESULTS: We develop CellBRF, a feature selection method that considers genes' relevance to cell types for single-cell clustering. The key idea is to identify genes that are most important for discriminating cell types through random forests guided by predicted cell labels. Moreover, it proposes a class balancing strategy to mitigate the impact of unbalanced cell type distributions on feature importance evaluation. We benchmark CellBRF on 33 scRNA-seq datasets representing diverse biological scenarios and demonstrate that it substantially outperforms state-of-the-art feature selection methods in terms of clustering accuracy and cell neighborhood consistency. Furthermore, we demonstrate the outstanding performance of our selected features through three case studies on cell differentiation stage identification, non-malignant cell subtype identification, and rare cell identification. CellBRF provides a new and effective tool to boost single-cell clustering accuracy. AVAILABILITY AND IMPLEMENTATION: All source codes of CellBRF are freely available at https://github.com/xuyp-csu/CellBRF.


Subject(s)
Benchmarking , Random Forest , Cell Differentiation , Cluster Analysis
18.
Bioinformatics ; 39(8)2023 08 01.
Article in English | MEDLINE | ID: mdl-37522865

ABSTRACT

MOTIVATION: Random forest is a popular machine learning approach for the analysis of high-dimensional data because it is flexible and provides variable importance measures for the selection of relevant features. However, the complex relationships between the features are usually not considered for the selection and thus also neglected for the characterization of the analysed samples. RESULTS: Here we propose two novel approaches that focus on the mutual impact of features in random forests. Mutual forest impact (MFI) is a relation parameter that evaluates the mutual association of the features to the outcome and, hence, goes beyond the analysis of correlation coefficients. Mutual impurity reduction (MIR) is an importance measure that combines this relation parameter with the importance of the individual features. MIR and MFI are implemented together with testing procedures that generate P-values for the selection of related and important features. Applications to one experimental and various simulated datasets and the comparison to other methods for feature selection and relation analysis show that MFI and MIR are very promising to shed light on the complex relationships between features and outcome. In addition, they are not affected by common biases, e.g. that features with many possible splits or high minor allele frequencies are preferred. AVAILABILITY AND IMPLEMENTATION: The approaches are implemented in Version 0.3.3 of the R package RFSurrogates that is available at github.com/AGSeifert/RFSurrogates and the data are available at doi.org/10.25592/uhhfdm.12620.


Subject(s)
Machine Learning , Random Forest , Bias , Gene Frequency
19.
Bioinformatics ; 39(10)2023 10 03.
Article in English | MEDLINE | ID: mdl-37851379

ABSTRACT

MOTIVATION: Gene regulatory networks (GRNs) are a way of describing the interactions between genes, which helps reveal the different biological mechanisms in the cell. Reconstructing GRNs based on gene expression data has been a central computational problem in systems biology. However, due to the high dimensionality and non-linearity of large-scale GRNs, accurately and efficiently inferring GRNs is still a challenging task. RESULTS: In this article, we propose a new approach, iLSGRN, to reconstruct large-scale GRNs from steady-state and time-series gene expression data based on non-linear ordinary differential equations. First, the regulatory gene recognition algorithm calculates the Maximal Information Coefficient between genes and excludes redundant regulatory relationships to achieve dimensionality reduction. Then, the feature fusion algorithm constructs a model leveraging the feature importance derived from XGBoost (eXtreme Gradient Boosting) and RF (Random Forest) models, which can effectively train the non-linear ordinary differential equations model of GRNs and improve the accuracy and stability of the inference algorithm. Extensive experiments on datasets of different scales show that our method achieves a noticeable improvement over state-of-the-art methods. Furthermore, we perform cross-validation experiments on real gene datasets to validate the robustness and effectiveness of the proposed method. AVAILABILITY AND IMPLEMENTATION: The proposed method is written in Python and is available at: https://github.com/lab319/iLSGRN.
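The idea of fusing feature importances from two models can be sketched as follows. The abstract does not give iLSGRN's fusion rule, so the normalize-then-average scheme here is an assumption, as are the function names, the `weight_a` parameter, and the toy scores. In practice the two input dictionaries would come from fitted XGBoost and random forest models.

```python
def normalize(scores):
    """Scale non-negative importance scores so they sum to 1."""
    total = sum(scores.values())
    if total == 0:
        return {gene: 0.0 for gene in scores}
    return {gene: value / total for gene, value in scores.items()}

def fuse_importances(imp_a, imp_b, weight_a=0.5):
    """Weighted average of two normalized importance dictionaries.

    Assumption: a candidate regulator missing from one model gets 0 there.
    """
    a, b = normalize(imp_a), normalize(imp_b)
    return {
        gene: weight_a * a.get(gene, 0.0) + (1.0 - weight_a) * b.get(gene, 0.0)
        for gene in set(a) | set(b)
    }

# Toy importances of three candidate regulators for one target gene.
boosting_imp = {"TF1": 0.6, "TF2": 0.3, "TF3": 0.1}
forest_imp = {"TF1": 0.4, "TF2": 0.4, "TF3": 0.2}
fused = fuse_importances(boosting_imp, forest_imp)
ranked = sorted(fused, key=fused.get, reverse=True)
```

Normalizing first keeps either model from dominating purely because its importance scores are on a larger scale.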


Subject(s)
Algorithms , Gene Regulatory Networks , Systems Biology , Random Forest , Time Factors , Computational Biology/methods
20.
J Transl Med ; 22(1): 640, 2024 Jul 08.
Article in English | MEDLINE | ID: mdl-38978066

ABSTRACT

BACKGROUND: The tumor microenvironment (TME) plays a key role in lung cancer initiation, proliferation, invasion, and metastasis. Artificial intelligence (AI) methods could potentially accelerate TME analysis. The aims of this study were to (1) assess the feasibility of using hematoxylin and eosin (H&E)-stained whole slide images (WSI) to develop an AI model for evaluating the TME and (2) characterize the TME of adenocarcinoma (ADCA) and squamous cell carcinoma (SCCA) in fibrotic and non-fibrotic lung. METHODS: The cohort was derived from chest CT scans of patients presenting with lung neoplasms, with and without background fibrosis. WSI were generated from slides of all 76 available pathology cases with ADCA (n = 53) or SCCA (n = 23) in fibrotic (n = 47) or non-fibrotic (n = 29) lung. Detailed ground-truth annotations, including stroma (i.e., fibrosis, vessels, inflammation), necrosis and background, were performed on WSI and optimized via an expert-in-the-loop (EITL) iterative procedure using a lightweight random forest (RF) classifier. A convolutional neural network (CNN)-based model was used to achieve tissue-level multiclass segmentation. The model was trained on 25 annotated WSI from 13 cases of ADCA and SCCA with and without fibrosis and then applied to the 76-case cohort. The TME analysis included tumor stroma ratio (TSR), tumor fibrosis ratio (TFR), tumor inflammation ratio (TIR), tumor vessel ratio (TVR), tumor necrosis ratio (TNR), and tumor background ratio (TBR). RESULTS: The model's overall precision, sensitivity, and F1-score for classification were 94%, 90%, and 91%, respectively. Statistically significant differences were noted in TSR (p = 0.041) and TFR (p = 0.001) between fibrotic and non-fibrotic ADCA. Within fibrotic lung, statistically significant differences were present in TFR (p = 0.039), TIR (p = 0.003), TVR (p = 0.041), TNR (p = 0.0003), and TBR (p = 0.020) between ADCA and SCCA.
CONCLUSION: The combined EITL-RF CNN model using only H&E WSI can facilitate multiclass evaluation and quantification of the TME. There are significant differences in the TME of ADCA and SCCA present with or without background fibrosis. Future studies are needed to determine the significance of TME on prognosis and treatment.
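For readers unfamiliar with the metrics reported in the results, precision, sensitivity (recall), and F1-score follow from true-positive, false-positive, and false-negative counts using the standard definitions. The sketch below computes them for a single class. The counts are invented for illustration and are not from the study, and the abstract does not say how per-class values were aggregated into the overall figures.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard single-class metrics from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + sensitivity
    # F1 is the harmonic mean of precision and sensitivity.
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return precision, sensitivity, f1

# Hypothetical pixel counts for one tissue class.
precision, sensitivity, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
```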


Subject(s)
Artificial Intelligence , Carcinoma, Non-Small-Cell Lung , Fibrosis , Lung Neoplasms , Tumor Microenvironment , Humans , Lung Neoplasms/pathology , Lung Neoplasms/diagnostic imaging , Carcinoma, Non-Small-Cell Lung/pathology , Carcinoma, Non-Small-Cell Lung/diagnostic imaging , Male , Female , Middle Aged , Aged , Neural Networks, Computer , Image Processing, Computer-Assisted/methods , Random Forest