Search | VHL Search Portal

1.

MiRAGE: mining relationships for advanced generative evaluation in drug repositioning.

Hassanali Aragh, Aria; Givehchian, Pegah; Moslemi Amirani, Razieh; Masumshah, Raziyeh; Eslahchi, Changiz.

Brief Bioinform ; 25(4)2024 May 23.

Article in English | MEDLINE | ID: mdl-39038932

ABSTRACT

MOTIVATION: Drug repositioning, the identification of new therapeutic uses for existing drugs, is crucial for accelerating drug discovery and reducing development costs. Some methods rely on heterogeneous networks, which may not fully capture the complex relationships between drugs and diseases. However, integrating diverse biological data sources offers promise for discovering new drug-disease associations (DDAs). Previous evidence indicates that the combination of information would be conducive to the discovery of new DDAs. However, the challenge lies in effectively integrating different biological data sources to identify the most effective drugs for a certain disease based on drug-disease coupled mechanisms. RESULTS: In response to this challenge, we present MiRAGE, a novel computational method for drug repositioning. MiRAGE leverages a three-step framework, comprising negative sampling using hard negative mining, classification employing random forest models, and feature selection based on feature importance. We evaluate MiRAGE on multiple benchmark datasets, demonstrating its superiority over state-of-the-art algorithms across various metrics. Notably, MiRAGE consistently outperforms other methods in uncovering novel DDAs. Case studies focusing on Parkinson's disease and schizophrenia showcase MiRAGE's ability to identify top candidate drugs supported by previous studies. Overall, our study underscores MiRAGE's efficacy and versatility as a computational tool for drug repositioning, offering valuable insights for therapeutic discoveries and addressing unmet medical needs.

Subject(s)

Algorithms , Data Mining , Drug Repositioning , Drug Repositioning/methods , Data Mining/methods , Humans , Computational Biology/methods , Schizophrenia/drug therapy , Parkinson Disease/drug therapy , Drug Discovery/methods

2.

Fuzzy kernel evidence Random Forest for identifying pseudouridine sites.

Chen, Mingshuai; Sun, Mingai; Su, Xi; Tiwari, Prayag; Ding, Yijie.

Brief Bioinform ; 25(3)2024 Mar 27.

Article in English | MEDLINE | ID: mdl-38622357

ABSTRACT

Pseudouridine is an RNA modification that is widely distributed in both prokaryotes and eukaryotes, and plays a critical role in numerous biological activities. Despite its importance, the precise identification of pseudouridine sites through experimental approaches poses significant challenges, requiring substantial time and resources. Therefore, there is a growing need for computational techniques that can reliably and quickly identify pseudouridine sites from vast amounts of RNA sequencing data. In this study, we propose fuzzy kernel evidence Random Forest (FKeERF) to identify pseudouridine sites. This method is called PseU-FKeERF, which demonstrates high accuracy in identifying pseudouridine sites from RNA sequencing data. The PseU-FKeERF model selected four RNA feature coding schemes with relatively good performance for feature combination, and then input them into the newly proposed FKeERF method for category prediction. FKeERF not only uses fuzzy logic to expand the original feature space, but also combines kernel methods that are easy to interpret in general for category prediction. Both cross-validation tests and independent tests on benchmark datasets have shown that PseU-FKeERF has better predictive performance than several state-of-the-art methods. This new method not only improves the accuracy of pseudouridine site identification, but also provides a certain reference for disease control and related drug development in the future.

Subject(s)

Pseudouridine , Random Forest , Pseudouridine/genetics , RNA/genetics , Base Sequence

3.

StrVCTVRE: A supervised learning method to predict the pathogenicity of human genome structural variants.

Sharo, Andrew G; Hu, Zhiqiang; Sunyaev, Shamil R; Brenner, Steven E.

Am J Hum Genet ; 109(2): 195-209, 2022 02 03.

Article in English | MEDLINE | ID: mdl-35032432

ABSTRACT

Whole-genome sequencing resolves many clinical cases where standard diagnostic methods have failed. However, at least half of these cases remain unresolved after whole-genome sequencing. Structural variants (SVs; genomic variants larger than 50 base pairs) of uncertain significance are the genetic cause of a portion of these unresolved cases. As sequencing methods using long or linked reads become more accessible and SV detection algorithms improve, clinicians and researchers are gaining access to thousands of reliable SVs of unknown disease relevance. Methods to predict the pathogenicity of these SVs are required to realize the full diagnostic potential of long-read sequencing. To address this emerging need, we developed StrVCTVRE to distinguish pathogenic SVs from benign SVs that overlap exons. In a random forest classifier, we integrated features that capture gene importance, coding region, conservation, expression, and exon structure. We found that features such as expression and conservation are important but are absent from SV classification guidelines. We leveraged multiple resources to construct a size-matched training set of rare, putatively benign and pathogenic SVs. StrVCTVRE performs accurately across a wide SV size range on independent test sets, which will allow clinicians and researchers to eliminate about half of SVs from consideration while retaining a 90% sensitivity. We anticipate clinicians and researchers will use StrVCTVRE to prioritize SVs in probands where no SV is immediately compelling, empowering deeper investigation into novel SVs to resolve cases and understand new mechanisms of disease. StrVCTVRE runs rapidly and is publicly available.
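The sensitivity trade-off quoted in this abstract can be illustrated with a small sketch. It assumes that each SV carries a pathogenicity score between 0 and 1; the scores below are invented, and `threshold_for_sensitivity` and `fraction_eliminated` are hypothetical helpers, not part of StrVCTVRE. The idea is to pick the highest cutoff that still retains the target share of known pathogenic SVs, then count how many benign SVs fall below it.

```python
import math

def threshold_for_sensitivity(pathogenic_scores, target=0.9):
    """Return the highest cutoff that keeps at least `target` of pathogenic SVs.

    An SV is retained when its score is >= the cutoff.
    """
    ranked = sorted(pathogenic_scores, reverse=True)
    # Number of pathogenic SVs that must stay at or above the cutoff.
    needed = math.ceil(target * len(ranked))
    return ranked[needed - 1]

def fraction_eliminated(scores, cutoff):
    """Fraction of SVs whose score falls below the cutoff."""
    return sum(s < cutoff for s in scores) / len(scores)

# Invented scores, for illustration only.
pathogenic = [0.95, 0.91, 0.88, 0.84, 0.80, 0.77, 0.71, 0.66, 0.58, 0.20]
benign = [0.05, 0.10, 0.15, 0.30, 0.45, 0.55, 0.60, 0.70, 0.75, 0.90]

cutoff = threshold_for_sensitivity(pathogenic, target=0.9)
sensitivity = 1 - fraction_eliminated(pathogenic, cutoff)
benign_removed = fraction_eliminated(benign, cutoff)
print(cutoff, sensitivity, benign_removed)
```

On real data the cutoff would be chosen on a held-out set, since a threshold tuned on the same SVs used for evaluation overstates the benefit.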

Subject(s)

Algorithms , Genome, Human , Genomic Structural Variation , Software , Supervised Machine Learning , Datasets as Topic , Exons , Genomics/methods , Humans , ROC Curve , Whole Genome Sequencing/statistics & numerical data

4.

Estimation of optimal treatment regimes with electronic medical record data using the residual life value estimator.

Rhodes, Grace; Davidian, Marie; Lu, Wenbin.

Biostatistics ; 2024 Feb 09.

Article in English | MEDLINE | ID: mdl-38332633

ABSTRACT

Clinicians and patients must make treatment decisions at a series of key decision points throughout disease progression. A dynamic treatment regime is a set of sequential decision rules that return treatment decisions based on accumulating patient information, like that commonly found in electronic medical record (EMR) data. When applied to a patient population, an optimal treatment regime leads to the most favorable outcome on average. Identifying optimal treatment regimes that maximize residual life is especially desirable for patients with life-threatening diseases such as sepsis, a complex medical condition that involves severe infections with organ dysfunction. We introduce the residual life value estimator (ReLiVE), an estimator for the expected value of cumulative restricted residual life under a fixed treatment regime. Building on ReLiVE, we present a method for estimating an optimal treatment regime that maximizes expected cumulative restricted residual life. Our proposed method, ReLiVE-Q, conducts estimation via the backward induction algorithm Q-learning. We illustrate the utility of ReLiVE-Q in simulation studies, and we apply ReLiVE-Q to estimate an optimal treatment regime for septic patients in the intensive care unit using EMR data from the Multiparameter Intelligent Monitoring Intensive Care database. Ultimately, we demonstrate that ReLiVE-Q leverages accumulating patient information to estimate personalized treatment regimes that optimize a clinically meaningful function of residual life.

5.

dSCOPE: a software to detect sequences critical for liquid-liquid phase separation.

Yu, Kai; Liu, Zekun; Cheng, Haoyang; Li, Shihua; Zhang, Qingfeng; Liu, Jia; Ju, Huai-Qiang; Zuo, Zhixiang; Zhao, Qi; Kang, Shiyang; Liu, Ze-Xian.

Brief Bioinform ; 24(1)2023 01 19.

Article in English | MEDLINE | ID: mdl-36528388

ABSTRACT

Membrane-based cells are the fundamental structural and functional units of organisms, while evidence demonstrates that liquid-liquid phase separation (LLPS) is associated with the formation of membraneless organelles, such as P-bodies, nucleoli and stress granules. Many studies have been undertaken to explore the functions of protein phase separation (PS), but these studies lacked an effective tool to identify the sequence segments that are critical for LLPS. In this study, we presented a novel software called dSCOPE (http://dscope.omicsbio.info) to predict the PS-driving regions. To develop the predictor, we curated experimentally identified sequence segments that can drive LLPS from published literature. Then physiological, biochemical, structural and coding features based on a sliding sequence window were integrated by a random forest algorithm to perform prediction. Through rigorous evaluation, dSCOPE was demonstrated to achieve satisfactory performance. Furthermore, large-scale analysis of the human proteome based on dSCOPE showed that the predicted PS-driving regions were enriched in various protein post-translational modifications and cancer mutations, and the proteins that contain predicted PS-driving regions were enriched in critical cellular signaling pathways. Taken together, dSCOPE precisely predicted the protein sequence segments critical for LLPS, with various helpful information visualized in the webserver to facilitate LLPS-related research.
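The sliding-window feature extraction mentioned in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the window width, the residue groupings, and the example sequence are invented, and the helper names are hypothetical rather than taken from dSCOPE. Each window yields one feature row that a classifier such as a random forest could then score.

```python
def sliding_windows(sequence, width, step=1):
    """Yield (start, segment) pairs for every full-width window."""
    for start in range(0, len(sequence) - width + 1, step):
        yield start, sequence[start:start + width]

def window_features(segment):
    """Toy per-window features: fractions of a few residue groups."""
    n = len(segment)
    return {
        "frac_aromatic": sum(segment.count(r) for r in "FWY") / n,
        "frac_charged": sum(segment.count(r) for r in "DEKR") / n,
        "frac_gly_ser": sum(segment.count(r) for r in "GS") / n,
    }

protein = "MSGRGYGGSRDEFWKK"  # invented 16-residue sequence
rows = [(start, window_features(segment))
        for start, segment in sliding_windows(protein, width=8)]
print(len(rows), rows[0])
```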

Subject(s)

Proteins , Software , Humans , Proteins/chemistry

6.

Artificial intelligence-enabled microbiome-based diagnosis models for a broad spectrum of cancer types.

Xu, Wei; Wang, Teng; Wang, Nan; Zhang, Haohong; Zha, Yuguo; Ji, Lei; Chu, Yuwen; Ning, Kang.

Brief Bioinform ; 24(3)2023 05 19.

Article in English | MEDLINE | ID: mdl-37141141

ABSTRACT

Microbiome-based diagnosis of cancer is an increasingly important supplement for the genomics approach in cancer diagnosis, yet current models for microbiome-based diagnosis of cancer face difficulties in generality: not only can diagnosis models not be adapted from one cancer to another, but models built on microbes from tissues cannot be adapted for diagnosis based on microbes from blood. Therefore, a microbiome-based model suitable for a broad spectrum of cancer types is urgently needed. Here we have introduced DeepMicroCancer, a diagnosis model using artificial intelligence techniques for a broad spectrum of cancer types. Built on random forest models, it has achieved superior performance on tissue samples from more than twenty types of cancer. By using transfer learning techniques, improved accuracies could be obtained, especially for cancer types with only a few samples, which could satisfy the requirements of clinical scenarios. Moreover, transfer learning techniques have enabled high diagnosis accuracy to be achieved for blood samples as well. These results indicated that certain sets of microbes could, if excavated using advanced artificial intelligence techniques, reveal the intricate differences among cancers and healthy individuals. Collectively, DeepMicroCancer has provided a new avenue for accurate diagnosis of cancer based on tissue and blood materials, which could potentially be used in clinics.

Subject(s)

Body Fluids , Microbiota , Neoplasms , Humans , Artificial Intelligence , Neoplasms/diagnosis , Genomics

7.

Predicting potential microbe-disease associations based on multi-source features and deep learning.

Wang, Liugen; Wang, Yan; Xuan, Chenxu; Zhang, Bai; Wu, Hanwen; Gao, Jie.

Brief Bioinform ; 24(4)2023 07 20.

Article in English | MEDLINE | ID: mdl-37406190

ABSTRACT

Studies have confirmed that the occurrence of many complex diseases in the human body is closely related to the microbial community, and microbes can affect tumorigenesis and metastasis by regulating the tumor microenvironment. However, there are still large gaps in the clinical observation of the microbiota in disease. Although biological experiments are accurate in identifying disease-associated microbes, they are also time-consuming and expensive. Computational models that effectively identify disease-related microbes can shorten this process and reduce capital and time costs. Based on this, a model named DSAE_RF is presented in this paper to predict latent microbe-disease associations by combining multi-source features and deep learning. DSAE_RF calculates four similarities between microbes and diseases, which are then used as feature vectors for the disease-microbe pairs. Later, reliable negative samples are screened by k-means clustering, and a deep sparse autoencoder neural network is further used to extract effective features of the disease-microbe pairs. On this foundation, a random forest classifier is used to predict the associations between microbes and diseases. To assess the performance of the model in this paper, 10-fold cross-validation is implemented on the same dataset. As a result, the AUC and AUPR of the model are 0.9448 and 0.9431, respectively. Furthermore, we also conduct a variety of experiments, including comparison of negative sample selection methods, comparison with different models and classifiers, Kolmogorov-Smirnov test and t-test, ablation experiments, robustness analysis, and case studies on Covid-19 and colorectal cancer. The results fully demonstrate the reliability and availability of our model.
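For readers unfamiliar with the AUC figure reported in this abstract, the sketch below computes the area under the ROC curve as the probability that a randomly chosen positive pair is scored above a randomly chosen negative pair, with ties counted as one half. The labels and scores are invented, and `roc_auc` is a hypothetical helper; this is not the evaluation code of DSAE_RF.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Invented predictions for six microbe-disease pairs.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = roc_auc(labels, scores)
print(round(auc, 4))
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a demonstration; a rank-based formulation would be preferable for large datasets.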

Subject(s)

COVID-19 , Deep Learning , Microbiota , Humans , Reproducibility of Results , Algorithms , Computational Biology/methods

8.

Identification of species-specific RNA N6-methyladinosine modification sites from RNA sequences.

Wang, Rulan; Chung, Chia-Ru; Huang, Hsien-Da; Lee, Tzong-Yi.

Brief Bioinform ; 24(2)2023 03 19.

Article in English | MEDLINE | ID: mdl-36715277

ABSTRACT

N6-methyladinosine (m6A) modification is the most abundant co-transcriptional modification in eukaryotic RNA and plays important roles in cellular regulation. Traditional high-throughput sequencing experiments used to explore functional mechanisms are time-consuming and labor-intensive, and most of the proposed methods focused on limited species types. To further understand the relevant biological mechanisms among different species with the same RNA modification, it is necessary to develop a computational scheme that can be applied to different species. To achieve this, we proposed an attention-based deep learning method, adaptive-m6A, which consists of convolutional neural network, bi-directional long short-term memory and an attention mechanism, to identify m6A sites in multiple species. In addition, three conventional machine learning (ML) methods, including support vector machine, random forest and logistic regression classifiers, were considered in this work. In addition to the performance of ML methods for multi-species prediction, the optimal performance of adaptive-m6A yielded an accuracy of 0.9832 and the area under the receiver operating characteristic curve of 0.98. Moreover, the motif analysis and cross-validation among different species were conducted to test the robustness of one model towards multiple species, which helped improve our understanding about the sequence characteristics and biological functions of RNA modifications in different species.

Subject(s)

Machine Learning , RNA , Base Sequence , RNA/genetics , Neural Networks, Computer

9.

ProsmORF-pred: a machine learning-based method for the identification of small ORFs in prokaryotic genomes.

Khanduja, Akshay; Kumar, Manish; Mohanty, Debasisa.

Brief Bioinform ; 24(3)2023 05 19.

Article in English | MEDLINE | ID: mdl-36988160

ABSTRACT

Small open reading frames (smORFs) encoding proteins less than 100 amino acids (aa) are known to be important regulators of key cellular processes. However, their computational identification remains a challenge. Based on a comprehensive analysis of known prokaryotic small ORFs, we have developed the ProsmORF-pred resource which uses a machine learning (ML)-based method for prediction of smORFs in the prokaryotic genome sequences. ProsmORF-pred consists of two ML models, one for initiation site recognition in nucleic acid sequences upstream of putative start codons and the other uses translated amino acid sequences to decipher functional protein like sequences. The nucleotide sequence-based initiation site recognition model has been trained using longer ORFs (>100 aa) in the same genome while the ML model for identification of protein like sequences has been trained using annotated smORFs from Escherichia coli. Comprehensive benchmarking of ProsmORF-pred reveals that its performance is comparable to other state-of-the-art approaches on the annotated smORF set derived from 32 prokaryotic genomes. Its performance is distinctly superior to other tools like PRODIGAL and RANSEPS for prediction of newly identified smORFs which have a length range of 10-30 aa, where prediction of smORFs has been a major challenge. Apart from identification of smORFs in genomic sequences, ProsmORF-pred can also aid in functional annotation of the predicted smORFs based on sequence similarity and genomic neighbourhood similarity searches in ProsmORFDB, a well-curated database of known smORFs. ProsmORF-pred along with its backend database ProsmORFDB is available as a user-friendly web server (http://www.nii.ac.in/prosmorfpred.html).

Subject(s)

Genome , Proteins , Open Reading Frames , Proteins/genetics , Genomics , Amino Acid Sequence

10.

Identification and validation of genes associated with aging-related cardiovascular disease.

Li, Jing; Jiang, Shengping; Huang, Chengyun; Lu, Baihui; Yang, Xiaolong.

FASEB J ; 38(1): e23370, 2024 01.

Article in English | MEDLINE | ID: mdl-38168496

ABSTRACT

Aging is acknowledged as the most significant risk factor for cardiovascular disease (CVD). This study sought to identify and validate potential aging-related genes associated with CVD by using bioinformatics. The confluence of the limma test, weighted correlation network analysis (WGCNA), and 2129 aging and senescence-associated genes led to the identification of aging-related differential expression genes (ARDEGs). By using Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG), potential biological roles and pathways of ARDEGs were identified. To find the significantly different functions between CVD and non-cardiovascular disease (nCVD) and to calculate process scores, enrichment analysis of all genes was carried out using gene set enrichment analysis (GSEA) and gene set variation analysis (GSVA). To evaluate the immune cell composition of the immune microenvironment, we performed an immune infiltration analysis on the dataset from the training group. We were able to acquire four ARDEGs (PTGS2, MMP9, HBEGF, and FN1). Aging, cellular senescence, and nitric oxide signal transduction were selected for biological function analysis. The diagnostic value of the four ARDEGs in distinguishing CVD from nCVD samples was deemed to be favorable. This research identified four ARDEGs that are associated with CVD. This study provides insight into prospective novel biomarkers for aging-related CVD diagnosis and progression monitoring.

Subject(s)

Cardiovascular Diseases , Cardiovascular System , Humans , Cardiovascular Diseases/genetics , Prospective Studies , Cellular Senescence , Computational Biology

11.

DBPboost: A method of classification of DNA-binding proteins based on improved differential evolution algorithm and feature extraction.

Sun, Ailun; Li, Hongfei; Dong, Guanghui; Zhao, Yuming; Zhang, Dandan.

Methods ; 223: 56-64, 2024 Mar.

Article in English | MEDLINE | ID: mdl-38237792

ABSTRACT

DNA-binding proteins are a class of proteins that can interact with DNA molecules through physical and chemical interactions. Their main functions include regulating gene expression, maintaining chromosome structure and stability, and more. DNA-binding proteins play a crucial role in cellular and molecular biology, as they are essential for maintaining normal cellular physiological functions and adapting to environmental changes. The prediction of DNA-binding proteins has been a hot topic in the field of bioinformatics. The key to accurately classifying DNA-binding proteins is to find suitable feature sources and explore the information they contain. Although there are already many models for predicting DNA-binding proteins, there is still room for improvement in mining feature source information and calculation methods. In this study, we created a model called DBPboost to better identify DNA-binding proteins. The innovation of this study lies in the use of eight feature extraction methods, the improvement of the feature selection step, which involves selecting some features first and then performing feature selection again after feature fusion, and the optimization of the differential evolution algorithm in feature fusion, which improves the performance of feature fusion. The experimental results show that the prediction accuracy of the model on the UniSwiss dataset is 89.32%, and the sensitivity is 89.01%, which is better than most existing models.
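As background to the differential evolution step named in this abstract, the sketch below shows the classic DE/rand/1/bin scheme in plain Python. It is not the improved variant proposed for DBPboost, whose details the abstract does not give. The objective is a stand-in: in a feature-fusion setting it would be something like a cross-validated error as a function of fusion weights, whereas here it is simply the squared distance to an invented target vector. All names and parameter values are assumptions.

```python
import random

def differential_evolution(objective, bounds, pop_size=20, f=0.8, cr=0.9,
                           generations=200, seed=0):
    """Minimize `objective` with classic DE/rand/1/bin (illustrative only)."""
    rng = random.Random(seed)
    dim = len(bounds)
    population = [[rng.uniform(lo, hi) for lo, hi in bounds]
                  for _ in range(pop_size)]
    fitness = [objective(ind) for ind in population]
    for _ in range(generations):
        for i in range(pop_size):
            # Three distinct individuals, all different from the current one.
            others = [j for j in range(pop_size) if j != i]
            a, b, c = rng.sample(others, 3)
            forced = rng.randrange(dim)  # at least one gene comes from the mutant
            trial = []
            for d in range(dim):
                if d == forced or rng.random() < cr:
                    value = population[a][d] + f * (population[b][d] - population[c][d])
                    lo, hi = bounds[d]
                    value = min(max(value, lo), hi)  # clip to the search bounds
                else:
                    value = population[i][d]
                trial.append(value)
            trial_fitness = objective(trial)
            # Greedy selection: keep the trial only if it is no worse.
            if trial_fitness <= fitness[i]:
                population[i], fitness[i] = trial, trial_fitness
    best = min(range(pop_size), key=fitness.__getitem__)
    return population[best], fitness[best]

# Stand-in objective: squared distance to invented "ideal" fusion weights.
target = [0.2, 0.5, 0.3]

def toy_objective(weights):
    return sum((w - t) ** 2 for w, t in zip(weights, target))

best_weights, best_score = differential_evolution(toy_objective, [(0.0, 1.0)] * 3)
print(best_weights, best_score)
```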

Subject(s)

DNA-Binding Proteins , Support Vector Machine , DNA-Binding Proteins/chemistry , Algorithms , DNA/chemistry , Computational Biology/methods

12.

The effect of data balancing approaches on the prediction of metabolic syndrome using non-invasive parameters based on random forest.

Mohseni-Takalloo, Sahar; Mohseni, Hadis; Mozaffari-Khosravi, Hassan; Mirzaei, Masoud; Hosseinzadeh, Mahdieh.

BMC Bioinformatics ; 25(1): 18, 2024 Jan 11.

Article in English | MEDLINE | ID: mdl-38212697

ABSTRACT

BACKGROUND: Metabolic syndrome (MetS) is a cluster of metabolic abnormalities (including obesity, insulin resistance, hypertension, and dyslipidemia), which can be used to identify at-risk populations for diabetes and cardiovascular diseases, the main causes of morbidity and mortality worldwide. A simple approach for diagnosing MetS without biochemical tests would be highly valuable. The present study aimed to predict MetS using non-invasive features based on a successful random forest learning algorithm. Also, to deal with the problem of data imbalance that naturally exists in this type of data, the effect of two different data balancing approaches, including the Synthetic Minority Over-sampling Technique (SMOTE) and Random Splitting data balancing (SplitBal), on model performance is investigated. RESULTS: The most important determinant for MetS prediction was waist circumference. Applying a random forest learning algorithm to imbalanced data, the trained models reach 86.9% and 79.4% accuracies and 37.1% and 38.2% sensitivities in men and women, respectively. However, the best results were obtained by applying the SplitBal data balancing technique: although the accuracy of the trained models decreased by 7.8% and 11.3%, their sensitivity improved significantly to 82.3% and 73.7% in men and women, respectively. CONCLUSIONS: The random forest learning method, along with data balancing techniques, especially SplitBal, could create MetS prediction models with promising results that can be applied as a useful prognostic tool in health screening programs.
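The pattern reported in this abstract, high accuracy alongside low sensitivity on imbalanced data, follows directly from how the two metrics are defined. The sketch below uses invented confusion-matrix counts (1000 people, 150 with MetS) chosen only to resemble that pattern; they are not the study's data, and `accuracy_and_sensitivity` is a hypothetical helper.

```python
def accuracy_and_sensitivity(tp, fn, tn, fp):
    """Return (accuracy, sensitivity) from confusion-matrix counts."""
    total = tp + fn + tn + fp
    accuracy = (tp + tn) / total   # share of all predictions that are correct
    sensitivity = tp / (tp + fn)   # share of true cases that are detected
    return accuracy, sensitivity

# Invented counts: a model trained on imbalanced data mostly predicts "no MetS".
imbalanced = accuracy_and_sensitivity(tp=55, fn=95, tn=815, fp=35)

# Invented counts: after balancing, more cases are found at the cost of
# more false alarms, so accuracy drops while sensitivity rises.
balanced = accuracy_and_sensitivity(tp=123, fn=27, tn=670, fp=180)

print(imbalanced)
print(balanced)
```

Because the majority class dominates the accuracy numerator, a screening tool is usually judged on sensitivity (and specificity) rather than on accuracy alone.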

Subject(s)

Insulin Resistance , Metabolic Syndrome , Male , Humans , Female , Metabolic Syndrome/diagnosis , Random Forest , Risk Factors , Obesity

13.

A novel microbe-drug association prediction model based on graph attention networks and bilayer random forest.

Kuang, Haiyue; Zhang, Zhen; Zeng, Bin; Liu, Xin; Zuo, Hao; Xu, Xingye; Wang, Lei.

BMC Bioinformatics ; 25(1): 78, 2024 Feb 20.

Article in English | MEDLINE | ID: mdl-38378437

ABSTRACT

BACKGROUND: In recent years, the extensive use of drugs and antibiotics has led to increasing microbial resistance. Therefore, it becomes crucial to explore deep connections between drugs and microbes. However, traditional biological experiments are very expensive and time-consuming. Therefore, it is meaningful to develop efficient computational models to forecast potential microbe-drug associations. RESULTS: In this manuscript, we proposed a novel prediction model called GARFMDA by combining graph attention networks and bilayer random forest to infer probable microbe-drug correlations. In GARFMDA, through integrating different microbe-drug-disease correlation indices, we constructed two different microbe-drug networks first. And then, based on multiple measures of similarity, we constructed a unique feature matrix for drugs and microbes respectively. Next, we fed these newly-obtained microbe-drug networks together with feature matrices into the graph attention network to extract the low-dimensional feature representations for drugs and microbes separately. Thereafter, these low-dimensional feature representations, along with the feature matrices, would be further inputted into the first layer of the Bilayer random forest model to obtain the contribution values of all features. And then, after removing features with low contribution values, these contribution values would be fed into the second layer of the Bilayer random forest to detect potential links between microbes and drugs. CONCLUSIONS: Experimental results and case studies show that GARFMDA can achieve better prediction performance than state-of-the-art approaches, which means that GARFMDA may be a useful tool in the field of microbe-drug association prediction in the future. Besides, the source code of GARFMDA is available at https://github.com/KuangHaiYue/GARFMDA.git.

Subject(s)

Anti-Bacterial Agents , Random Forest , Probability , Software

14.

Predicting lncRNA-protein interactions through deep learning framework employing multiple features and random forest algorithm.

Liang, Ying; Yin, XingRui; Zhang, YangSen; Guo, You; Wang, YingLong.

BMC Bioinformatics ; 25(1): 108, 2024 Mar 12.

Article in English | MEDLINE | ID: mdl-38475723

ABSTRACT

RNA-protein interaction (RPI) is crucial to the life processes of diverse organisms. Various researchers have identified RPI through long-term and high-cost biological experiments. Although numerous machine learning and deep learning-based methods for predicting RPI currently exist, their robustness and generalizability have significant room for improvement. This study proposes LPI-MFF, an RPI prediction model based on multi-source information fusion, to address these issues. LPI-MFF employed protein-protein interaction features, sequence features, secondary structure features, and physical and chemical properties as the information sources with the corresponding coding scheme, followed by the random forest algorithm for feature screening. Finally, all information was combined and a classification method based on convolutional neural networks was used. The experimental results of fivefold cross-validation demonstrated that the accuracy of LPI-MFF on RPI1807 and NPInter was 97.60% and 97.67%, respectively. In addition, the accuracy rate on the independent test set RPI1168 was 84.9%, and the accuracy rate on the Mus musculus dataset was 90.91%. Accordingly, LPI-MFF demonstrated greater robustness and generalization than other prevalent RPI prediction methods.

Subject(s)

Deep Learning , RNA, Long Noncoding , Animals , Mice , RNA, Long Noncoding/chemistry , Random Forest , Neural Networks, Computer , Machine Learning , Computational Biology/methods

15.

PredictEFC: a fast and efficient multi-label classifier for predicting enzyme family classes.

Chen, Lei; Zhang, Chenyu; Xu, Jing.

BMC Bioinformatics ; 25(1): 50, 2024 Jan 30.

Article in English | MEDLINE | ID: mdl-38291384

ABSTRACT

BACKGROUND: Enzymes play an irreplaceable and important role in maintaining the lives of living organisms. The Enzyme Commission (EC) number of an enzyme indicates its essential functions. Correct identification of the first digit (family class) of the EC number for a given enzyme has been a hot topic for the past twenty years. Several previous methods adopted functional domain composition to represent enzymes. However, this leads to the curse of dimensionality, thereby reducing the efficiency of the methods. On the other hand, most previous methods can only deal with enzymes belonging to one family class. In fact, several enzymes belong to two or more family classes. RESULTS: In this study, a fast and efficient multi-label classifier, named PredictEFC, was designed. To construct this classifier, a novel feature extraction scheme was designed for processing functional domain information of enzymes, which counts the distribution of each functional domain entry across seven family classes in the training dataset. Based on this scheme, each training or test enzyme was encoded into a 7-dimensional vector by fusing its functional domain information and the above statistical results. Random k-labelsets (RAKEL) was adopted to build the classifier, where random forest was selected as the base classification algorithm. The two tenfold cross-validation results on the training dataset showed that the accuracy of PredictEFC can reach 0.8493 and 0.8370. The independent test on two datasets indicated accuracy values of 0.9118 and 0.8777. CONCLUSION: The performance of PredictEFC was slightly lower than that of the classifier directly using functional domain composition. However, its efficiency was sharply improved. The running time was less than one-tenth of the time of the classifier directly using functional domain composition. In addition, the utility of PredictEFC was superior to the classifiers using traditional dimensionality reduction methods and some previous methods, and this classifier can be transplanted for predicting enzyme family classes of other species. Finally, a web-server available at http://124.221.158.221/ was set up for easy usage.
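The encoding idea described in this abstract can be sketched as follows. The abstract states only that per-domain class distributions are counted on the training set and then fused with an enzyme's domain information into a 7-dimensional vector; the fusion rule used here (averaging the distributions of an enzyme's known domains) is an assumption, as are the helper names, the domain identifiers, and the toy training set.

```python
from collections import defaultdict

N_CLASSES = 7  # the seven enzyme family classes

def domain_class_distribution(training_set):
    """Map each domain to its normalized distribution over family classes.

    `training_set` is a list of (domains, classes) pairs, where `classes`
    holds class indices 0..6 and may contain more than one label.
    """
    counts = defaultdict(lambda: [0] * N_CLASSES)
    for domains, classes in training_set:
        for domain in domains:
            for c in classes:
                counts[domain][c] += 1
    return {domain: [v / sum(row) for v in row] for domain, row in counts.items()}

def encode(domains, distribution):
    """Encode an enzyme as the mean distribution of its known domains.

    Averaging is an assumed fusion rule; unseen domains are ignored.
    """
    known = [distribution[d] for d in domains if d in distribution]
    if not known:
        return [0.0] * N_CLASSES
    return [sum(column) / len(known) for column in zip(*known)]

# Toy training set with hypothetical domain identifiers.
training = [
    (["DOM_A"], [0]),
    (["DOM_A", "DOM_B"], [0, 2]),
    (["DOM_B"], [2]),
]
distribution = domain_class_distribution(training)
vector = encode(["DOM_A", "DOM_B"], distribution)
print(vector)
```

Whatever the number of distinct domains, the encoded vector stays at seven dimensions, which is the source of the efficiency gain over a one-column-per-domain representation.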

Subject(s)

Algorithms , Enzymes , Enzymes/classification

16.

Utilizing genomic signatures to gain insights into the dynamics of SARS-CoV-2 through Machine and Deep Learning techniques.

Elsherbini, Ahmed M A; Elkholy, Amr Hassan; Fadel, Youssef M; Goussarov, Gleb; Elshal, Ahmed Mohamed; El-Hadidi, Mohamed; Mysara, Mohamed.

BMC Bioinformatics ; 25(1): 131, 2024 Mar 27.

Article in English | MEDLINE | ID: mdl-38539073

ABSTRACT

The global spread of the SARS-CoV-2 pandemic, originating in Wuhan, China, has had profound consequences on both health and the economy. Traditional alignment-based phylogenetic tree methods for tracking epidemic dynamics demand substantial computational power due to the growing number of sequenced strains. Consequently, there is a pressing need for an alignment-free approach to characterize these strains and monitor the dynamics of various variants. In this work, we introduce a swift and straightforward tool named GenoSig, implemented in C++. The tool exploits the Di and Tri nucleotide frequency signatures to delineate the taxonomic lineages of SARS-CoV-2 by employing diverse machine learning (ML) and deep learning (DL) models. Our approach achieved a tenfold cross-validation accuracy of 87.88% (± 0.013) for DL and 86.37% (± 0.0009) for Random Forest (RF) model, surpassing the performance of other ML models. Validation using an additional unexposed dataset yielded comparable results. Despite variations in architectures between DL and RF, it was observed that later clades, specifically GRA, GRY, and GK, exhibited superior performance compared to earlier clades G and GH. As for the continental origin of the virus, both DL and RF models exhibited lower performance than in predicting clades. However, both models demonstrated relatively higher accuracy for Europe, North America, and South America compared to other continents, with DL outperforming RF. Both models consistently demonstrated a preference for cytosine and guanine over adenine and thymine in both clade and continental analyses, in both Di and Tri nucleotide frequencies signatures. Our findings suggest that GenoSig provides a straightforward approach to address taxonomic, epidemiological, and biological inquiries, utilizing a reductive method applicable not only to SARS-CoV-2 but also to similar research questions in an alignment-free context.
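The di- and trinucleotide frequency signatures described in this abstract can be illustrated with a short sketch. GenoSig itself is implemented in C++; the Python below is an independent illustration, and the function names, the handling of ambiguous bases, and the example sequence are assumptions. Each genome is reduced to a fixed-length vector of 16 dinucleotide and 64 trinucleotide frequencies that a classifier could consume without any alignment.

```python
from itertools import product

def kmer_frequencies(sequence, k):
    """Relative frequency of every A/C/G/T k-mer, in lexicographic order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = sequence.upper()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skips windows containing N or other ambiguity codes
            counts[window] += 1
    total = sum(counts.values())
    return [counts[m] / total if total else 0.0 for m in kmers]

def signature(sequence):
    """Concatenate dinucleotide (16) and trinucleotide (64) frequencies."""
    return kmer_frequencies(sequence, 2) + kmer_frequencies(sequence, 3)

sig = signature("ACGTACGTNACG")  # invented fragment with one ambiguous base
print(len(sig), sig[:4])
```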

Subject(s)

COVID-19 , Deep Learning , Humans , SARS-CoV-2/genetics , Phylogeny , COVID-19/epidemiology , Genomics , Nucleotides

17.

Prediction of anxious depression using multimodal neuroimaging and machine learning.

Zhou, Enqi; Wang, Wei; Ma, Simeng; Xie, Xinhui; Kang, Lijun; Xu, Shuxian; Deng, Zipeng; Gong, Qian; Nie, Zhaowen; Yao, Lihua; Bu, Lihong; Wang, Fei; Liu, Zhongchun.

Neuroimage ; 285: 120499, 2024 Jan.

Article in English | MEDLINE | ID: mdl-38097055

ABSTRACT

Anxious depression is a common subtype of major depressive disorder (MDD) associated with adverse outcomes and severely impaired social function. It is important to clarify the underlying neurobiology of anxious depression to refine the diagnosis and stratify patients for therapy. Here we explored associations between anxiety and brain structure/function in MDD patients. A total of 260 MDD patients and 127 healthy controls underwent three-dimensional T1-weighted structural scanning and resting-state functional magnetic resonance imaging. Demographic data were collected from all participants. Differences in gray matter volume (GMV), (fractional) amplitude of low-frequency fluctuation ((f)ALFF), regional homogeneity (ReHo), and seed point-based functional connectivity were compared between anxious MDD patients, non-anxious MDD patients, and healthy controls. A random forest model was used to predict anxiety in MDD patients using neuroimaging features. Compared with healthy controls, anxious MDD patients showed significant differences in GMV in the left middle temporal gyrus and in ReHo in the right superior parietal gyrus and the left precuneus. Compared with non-anxious MDD patients, patients with anxious MDD showed significantly different GMV in the left inferior temporal gyrus, left superior temporal gyrus, left superior frontal gyrus (orbital part), and left dorsolateral superior frontal gyrus; fALFF in the left middle temporal gyrus; ReHo in the inferior temporal gyrus and the superior frontal gyrus (orbital part); and functional connectivity between the left superior temporal gyrus (temporal pole) and left medial superior frontal gyrus. A diagnostic predictive random forest model built using imaging features and validated by 10-fold cross-validation distinguished anxious from non-anxious MDD with an AUC of 0.802. Patients with anxious depression exhibit dysregulation of brain regions associated with emotion regulation, cognition, and decision-making, and our diagnostic model paves the way for more accurate, objective clinical diagnosis of anxious depression.

Subject(s)

Depressive Disorder, Major , Humans , Depression , Magnetic Resonance Imaging/methods , Brain , Neuroimaging , Machine Learning

18.

Perinatal asphyxia leads to acute kidney damage and increased renal susceptibility in adulthood.

Lakat, Tamas; Fekete, Andrea; Demeter, Kornel; Toth, Akos R; Varga, Zoltan K; Patonai, Attila; Kelemen, Hanga; Budai, Andras; Szabo, Miklos; Szabo, Attila J; Kaila, Kai; Denes, Adam; Mikics, Eva; Hosszu, Adam.

Am J Physiol Renal Physiol ; 2024 Jun 27.

Article in English | MEDLINE | ID: mdl-38932694

ABSTRACT

Perinatal asphyxia (PA) poses a significant threat to multiple organs, particularly the kidneys. Diagnosing PA-associated kidney injury remains challenging and treatment options are inadequate. Furthermore, there is a lack of long-term follow-up data regarding the renal implications of PA. In this study, 7-day-old male Wistar rats were exposed to PA using a gas mixture (4% O2; 20% CO2 in N2 for 15 minutes) to investigate molecular pathways linked to renal tubular damage, hypoxia, angiogenesis, heat-shock response, inflammation, and fibrosis in the kidney. In a second experiment, adult rats with a history of PA were subjected to moderate renal ischemia-reperfusion injury (IRI) to test the hypothesis that PA exacerbates renal susceptibility. Our results revealed an increased gene expression of renal injury markers (KIM-1, NGAL), hypoxic and heat-shock factors (HIF-1α, HSF-1, HSP-27), pro-inflammatory cytokines (IL-1β, IL-6, TNF-α, MCP-1), and fibrotic markers (TGF-β, CTGF, Fibronectin) promptly after PA. Moreover, a machine learning model was identified through Random Forest analysis, demonstrating an impressive classification accuracy (95.5%) for PA. Post-PA rats showed exacerbated functional decline and tubular injury and more intense hypoxic, heat-shock, pro-inflammatory, and pro-fibrotic responses after renal IRI compared to controls. In conclusion, PA leads to subclinical kidney injury, which may increase the susceptibility to subsequent renal damage later in life. Additionally, the parameters identified through Random Forest analysis provide a robust foundation for future biomarker research in the context of PA.

19.

Predicting Environmental and Ecological Drivers of Human Population Structure.

Pless, Evlyn; Eckburg, Anders M; Henn, Brenna M.

Mol Biol Evol ; 40(5)2023 May 02.

Article in English | MEDLINE | ID: mdl-37146165

ABSTRACT

Landscape, climate, and culture can all structure human populations, but few existing methods are designed to simultaneously disentangle among a large number of variables in explaining genetic patterns. We developed a machine learning method for identifying the variables which best explain migration rates, as measured by the coalescent-based program MAPS that uses shared identical by descent tracts to infer spatial migration across a region of interest. We applied our method to 30 human populations in eastern Africa with high-density single nucleotide polymorphism array data. The remarkable diversity of ethnicities, languages, and environments in this region offers a unique opportunity to explore the variables that shape migration and genetic structure. We explored more than 20 spatial variables relating to landscape, climate, and presence of tsetse flies. The full model explained ~40% of the variance in migration rate over the past 56 generations. Precipitation, minimum temperature of the coldest month, and elevation were the variables with the highest impact. Among the three groups of tsetse flies, the most impactful was fusca which transmits livestock trypanosomiasis. We also tested for adaptation to high elevation among Ethiopian populations. We did not identify well-known genes related to high elevation, but we did find signatures of positive selection related to metabolism and disease. We conclude that the environment has influenced the migration and adaptation of human populations in eastern Africa; the remaining variance in structure is likely due in part to cultural or other factors not captured in our model.

Subject(s)

Human Migration , Models, Genetic , Humans , Climate , Animals , Tsetse Flies , Genome-Wide Association Study , Africa, Eastern , Human Genetics , Genomics , Language

20.

Machine learning for detection of heterogeneous effects of Medicaid coverage on depression.

Goto, Ryunosuke; Inoue, Kosuke; Osawa, Itsuki; Baicker, Katherine; Fleming, Scott L; Tsugawa, Yusuke.

Am J Epidemiol ; 193(7): 951-958, 2024 Jul 08.

Article in English | MEDLINE | ID: mdl-38400644

ABSTRACT

In 2008, Oregon expanded its Medicaid program using a lottery, creating a rare opportunity to study the effects of Medicaid coverage using a randomized controlled design (Oregon Health Insurance Experiment). Analysis showed that Medicaid coverage lowered the risk of depression. However, this effect may vary between individuals, and the identification of individuals likely to benefit the most has the potential to improve the effectiveness and efficiency of the Medicaid program. By applying the machine learning causal forest to data from this experiment, we found substantial heterogeneity in the effect of Medicaid coverage on depression; individuals with high predicted benefit were older and had more physical or mental health conditions at baseline. Expanding coverage to individuals with high predicted benefit generated greater reduction in depression prevalence than expanding to all eligible individuals (21.5 vs 8.8 percentage-point reduction; adjusted difference = +12.7 [95% CI, +4.6 to +20.8]; P = 0.003), at substantially lower cost per case prevented ($16 627 vs $36 048; adjusted difference = -$18 598 [95% CI, -156 953 to -3120]; P = 0.04). Medicaid coverage reduces depression substantially more in a subset of the population than others, in ways that are predictable in advance. Targeting coverage on those most likely to benefit could improve the effectiveness and efficiency of insurance expansion. This article is part of a Special Collection on Mental Health.

Subject(s)

Depression , Insurance Coverage , Machine Learning , Medicaid , Humans , Medicaid/statistics & numerical data , United States , Female , Male , Adult , Oregon , Middle Aged , Insurance Coverage/statistics & numerical data , Young Adult
