ABSTRACT
Most computational methods for predicting driver mutations have been trained using positive samples, while negative samples are typically derived from statistical methods or putative samples. The representativeness of these negative samples in capturing the diversity of passenger mutations remains to be determined. To tackle these issues, we curated a balanced dataset comprising driver mutations sourced from the COSMIC database and high-quality passenger mutations obtained from the Cancer Passenger Mutation database. Subsequently, we encoded the distinctive features of these mutations. Utilizing feature correlation analysis, we developed a cancer driver missense mutation predictor called CDMPred employing feature selection through the ensemble learning technique XGBoost. The proposed CDMPred method, utilizing the top 10 features and XGBoost, achieved an area under the receiver operating characteristic curve (AUC) value of 0.83 and 0.80 on the training and independent test sets, respectively. Furthermore, CDMPred demonstrated superior performance compared to existing state-of-the-art methods for cancer-specific and general diseases, as measured by AUC and area under the precision-recall curve. Including high-quality passenger mutations in the training data proves advantageous for CDMPred's prediction performance. We anticipate that CDMPred will be a valuable tool for predicting cancer driver mutations, furthering our understanding of personalized therapy.
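The AUC values quoted above can be read as the probability that a randomly chosen driver mutation receives a higher score than a randomly chosen passenger mutation. The following is a minimal sketch of that rank-based definition in plain Python; the function name `roc_auc` and the toy labels and scores are illustrative assumptions, not part of CDMPred.

```python
def roc_auc(labels, scores):
    """Rank-based AUC: P(score of a positive > score of a negative).

    labels: iterable of 0/1 (1 = driver, 0 = passenger; hypothetical encoding)
    scores: predicted scores, higher means "more likely a driver"
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy example with made-up scores
toy_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The pairwise loop is quadratic in the number of samples; it is written this way for readability, not for use on large benchmark sets.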
Subject(s)
Missense Mutation , Neoplasms , Humans , Neoplasms/genetics , Computational Biology/methods , Genetic Databases , ROC Curve , Machine Learning
ABSTRACT
Cell-cell communication is crucial for the regulation of cellular life and the establishment of cellular relationships. Most approaches for inferring intercellular communications from single-cell RNA sequencing (scRNA-seq) data lack a comprehensive global network view of multilayered communications. In this context, we propose scHyper, a new method that can infer intercellular communications from a global network perspective and identify the potential impact of all cells, ligand, and receptor expression on the communication score. scHyper introduces a new way to represent tripartite relationships by extracting a heterogeneous hypergraph that includes the source (ligand expression), the target (receptor expression), and the relevant ligand-receptor (L-R) pairs. scHyper is based on hypergraph representation learning, which measures the degree of match between the intrinsic attributes (static embeddings) of nodes and their observed behaviors (dynamic embeddings) in the context (hyperedges), quantifies the probability of forming hyperedges, and thus reconstructs the cell-cell communication score. Additionally, to effectively mine the key mechanisms of signal transmission, we collect a rich dataset of multisubunit complex L-R pairs and propose a nonparametric test to determine significant intercellular communications. Comparison with other tools indicates that scHyper exhibits superior performance and functionality. Experimental results on the human tumor microenvironment and immune cells demonstrate that scHyper offers reliable and unique capabilities for analyzing intercellular communication networks. Therefore, we introduce an effective strategy that can build high-order interaction patterns, surpassing the limitations of most methods that can only handle low-order interactions, thus more accurately interpreting the complexity of intercellular communications.
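To make the tripartite representation concrete, the sketch below builds candidate hyperedges of the form (source cell type, target cell type, L-R pair) from a small expression table. It is an illustration only: the function `build_hyperedges`, the expression threshold, and the toy cell types and genes are assumptions, and the abstract does not state how scHyper actually constructs or scores its hyperedges.

```python
from itertools import product


def build_hyperedges(expr, lr_pairs, threshold=0.0):
    """Return (source, target, (ligand, receptor)) triples.

    expr: dict mapping cell type -> {gene: mean expression} (hypothetical layout)
    lr_pairs: list of (ligand, receptor) gene-name tuples
    A triple is kept when the ligand is expressed in the source and the
    receptor in the target above `threshold` (an assumed, simplistic rule).
    """
    edges = []
    for (src, tgt), (lig, rec) in product(product(expr, expr), lr_pairs):
        if expr[src].get(lig, 0.0) > threshold and expr[tgt].get(rec, 0.0) > threshold:
            edges.append((src, tgt, (lig, rec)))
    return edges


# Toy data: T cells express the ligand, macrophages express the receptor
toy_expr = {"T": {"CCL5": 2.0}, "Mac": {"CCR5": 1.5}}
toy_edges = build_hyperedges(toy_expr, [("CCL5", "CCR5")])
```

Each triple links three node types at once, which is what distinguishes a hyperedge from the pairwise edges of an ordinary cell-cell graph.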
Subject(s)
Cell Communication , Computer Neural Networks , Humans , Computational Biology/methods , Single-Cell Analysis/methods , Algorithms
ABSTRACT
Intervertebral disc degeneration (IDD) causes a variety of symptoms such as low back pain, disc herniation, and spinal stenosis, which can lead to high social and economic costs. Alpinetin has anti-inflammatory potential, but its effect on IDD is unclear. Herein, we investigated the effect of alpinetin on IDD. To establish an in vitro model of IDD, nucleus pulposus cells (NPCs) were exposed to interleukin 1β (IL-1β). The viability of NPCs was assessed by CCK-8 assay. The expression of Toll-like receptor 4 (TLR4), myeloid differentiation primary response protein 88 (MyD88), aggrecan, collagen-2, and matrix metalloproteinase-3 (MMP-3) was examined by qRT-PCR and western blotting. The protein levels of B cell lymphoma-2 (Bcl-2), Bcl-2-associated protein X (Bax), and cleaved caspase-3 were assessed by western blotting. Flow cytometry was performed to assess apoptosis of NPCs. The contents of inflammatory factors were examined by ELISA kits. Results showed that alpinetin repressed IL-1β-induced activation of the TLR4/MyD88 pathway and apoptosis in NPCs. Alpinetin alleviated IL-1β-induced inflammatory responses and oxidative stress in NPCs. Moreover, alpinetin lessened IL-1β-induced extracellular matrix (ECM) degradation in NPCs by enhancing the expression of aggrecan and collagen-2 and reducing the expression of MMP-3. The effects of alpinetin on IL-1β-exposed NPCs were neutralized by TLR4 upregulation. In conclusion, alpinetin repressed IL-1β-induced apoptosis, inflammatory responses, oxidative stress, and ECM degradation in NPCs through the inactivation of the TLR4/MyD88 pathway.
ABSTRACT
Osteoarthritis (OA) is a worldwide joint disease that leads to physical pain, stiffness, and even disability. Lactate dehydrogenase A (LDHA) is known as a lactylation mediator that can regulate histone lactylation of its target genes. However, the role of LDHA-mediated histone H3 lysine 18 lactylation (H3K18la) in OA progression is yet to be clarified. Our study aims to reveal the role and mechanism of LDHA-mediated histone lactylation in the glycolysis of chondrocytes. In this study, we first determined that the H3K18la level was enhanced in OA. Energy metabolism such as glycolysis is often altered during OA progression. Therefore, we further explored the mechanism mediating glycolysis and thus promoting OA progression. Moreover, glycolysis was enhanced in an LPS-induced OA cell model, as evidenced by increased glucose consumption and lactate production. Furthermore, we silenced LDHA for loss-of-function assays. The results showed that knockdown of LDHA suppressed glycolysis in LPS-induced chondrocytes. An in vivo animal study demonstrated that knockout of LDHA alleviated cartilage injury in OA mice. Mechanistically, we uncovered that LDHA-mediated H3K18la in the TPI1 promoter enhanced the transcription activity of TPI1. Mutation of the K69 site was found to ameliorate LPS-induced glycolysis in the OA cell model. In conclusion, our study reveals the role of LDHA-mediated H3K18la of the TPI1 promoter in OA progression.
Subject(s)
Chondrocytes , Glycolysis , Histones , Osteoarthritis , Osteoarthritis/metabolism , Osteoarthritis/genetics , Osteoarthritis/pathology , Animals , Mice , Histones/metabolism , Humans , Chondrocytes/metabolism , Animal Disease Models , Lactate Dehydrogenase 5/metabolism , Male , Gene Expression Regulation , Knockout Mice , Genetic Promoter Regions , Genetic Transcription
ABSTRACT
Recent methods often introduce attention mechanisms into the skip connections of U-shaped networks to capture features. However, these methods usually overlook spatial information extraction in skip connections and exhibit inefficiency in capturing spatial and channel information. This issue prompts us to reevaluate the design of the skip-connection mechanism and propose a new deep-learning network called the Fusing Spatial and Channel Attention Network, abbreviated as FSCA-Net. FSCA-Net is a novel U-shaped network architecture that utilizes the Parallel Attention Transformer (PAT) to enhance the extraction of spatial and channel features in the skip-connection mechanism, further compensating for downsampling losses. We design the Cross-Attention Bridge Layer (CAB) to mitigate excessive feature and resolution loss when downsampling to the lowest level, ensuring meaningful information fusion during upsampling at the lowest level. Finally, we construct the Dual-Path Channel Attention (DPCA) module to guide channel and spatial information filtering for Transformer features, eliminating ambiguities with decoder features and better concatenating features with semantic inconsistencies between the Transformer and the U-Net decoder. FSCA-Net is designed explicitly for fine-grained segmentation tasks of multiple organs and regions. Our approach achieves over 48% reduction in FLOPs and over 32% reduction in parameters compared to the state-of-the-art method. Moreover, FSCA-Net outperforms existing segmentation methods on seven public datasets, demonstrating exceptional performance.
Subject(s)
Deep Learning , Humans , Computer-Assisted Image Processing/methods , Algorithms , Computer-Assisted Image Interpretation/methods
ABSTRACT
Single-cell RNA sequencing (scRNA-seq) is a potent advancement for analyzing gene expression at the individual cell level, allowing for the identification of cellular heterogeneity and subpopulations. However, it suffers from technical limitations that result in sparse and heterogeneous data. Here, we propose scVSC, an unsupervised clustering algorithm built on deep representation neural networks. The method incorporates the variational inference into the subspace model, which imposes regularization constraints on the latent space and further prevents overfitting. In a series of experiments across multiple datasets, scVSC outperforms existing state-of-the-art unsupervised and semi-supervised clustering tools regarding clustering accuracy and running efficiency. Moreover, the study indicates that scVSC could visually reveal the state of trajectory differentiation, accurately identify differentially expressed genes, and further discover biologically critical pathways.
Subject(s)
Algorithms , Computational Biology , Gene Expression Profiling , Single-Cell Analysis , Transcriptome , Single-Cell Analysis/methods , Cluster Analysis , Humans , Computational Biology/methods , Gene Expression Profiling/methods , Transcriptome/genetics , Computer Neural Networks , RNA Sequence Analysis/methods
ABSTRACT
The inference of gene regulatory networks (GRNs) from gene expression profiles has been a key issue in systems biology, prompting many researchers to develop diverse computational methods. However, most of these methods do not reconstruct directed GRNs with regulatory types because of the lack of benchmark datasets or defects in the computational methods. Here, we collect benchmark datasets and propose a deep learning-based model, DeepFGRN, for reconstructing fine gene regulatory networks (FGRNs) with both regulation types and directions. In addition, the GRNs of real species are always large graphs with direction and high sparsity, which impede the advancement of GRN inference. Therefore, DeepFGRN builds a node bidirectional representation module to capture the directed graph embedding representation of the GRN. Specifically, the source and target generators are designed to learn the low-dimensional dense embedding of the source and target neighbors of a gene, respectively. An adversarial learning strategy is applied to iteratively learn the real neighbors of each gene. In addition, because the expression profiles of genes with regulatory associations are correlative, a correlation analysis module is designed. Specifically, this module not only fully extracts gene expression features, but also captures the correlation between regulators and target genes. Experimental results show that DeepFGRN has a competitive capability for both GRN and FGRN inference. Potential biomarkers and therapeutic drugs for breast cancer, liver cancer, lung cancer and coronavirus disease 2019 are identified based on the candidate FGRNs, providing a possible opportunity to advance our knowledge of disease treatments.
Subject(s)
Gene Regulatory Networks , Liver Neoplasms , Humans , Systems Biology/methods , Transcriptome , Algorithms , Computational Biology/methods
ABSTRACT
Radiation therapy (RT) is one of the primary treatment modalities of cancer, with 40-60% of cancer patients benefiting from RT during their treatment course. The intrinsic radiosensitivity or acquired radioresistance of tumor cells would affect the response to RT and clinical outcomes in patients. Thus, mining the regulatory mechanisms in tumor radiosensitivity or radioresistance that have been verified by biological experiments and computational analysis methods will enhance the overall understanding of RT. Here, we describe a comprehensive database dbCRAF (http://dbCRAF.xialab.info/) to document and annotate the factors (1,677 genes, 49 proteins and 612 radiosensitizers) linked with radiation response, including radiosensitivity, radioresistance in cancer cells and prognosis in cancer patients receiving RT. On the one hand, dbCRAF enables researchers to directly access knowledge for regulation of radiation response in human cancer buried in the vast literature. On the other hand, dbCRAF provides four flexible modules to analyze and visualize the functional relationship between these factors and clinical outcome, KEGG pathway and target genes. In conclusion, dbCRAF serves as a valuable resource for elucidating the regulatory mechanisms of radiation response in human cancers as well as for the improvement of RT options.
ABSTRACT
Accurate identification of driver mutations is crucial in genetic studies of human cancers. While numerous cancer driver missense mutations have been identified, research into potential cancer drivers for synonymous mutations has shown limited success to date. Here, we developed a novel machine learning framework, epSMic, for predicting cancer driver synonymous mutations. epSMic employs an iterative feature representation scheme that facilitates the learning of discriminative features from various sequential models in a supervised iterative mode. We constructed the benchmark datasets and encoded the embedding sequence, physicochemical property, and basic information such as conservation and splicing feature. The evaluation results on benchmark test datasets demonstrate that epSMic outperforms existing methods, making it a valuable tool for researchers in identifying functional synonymous mutations in cancer. We hope epSMic can enable researchers to concentrate on synonymous mutations that have a functional impact on cancer.
Subject(s)
Neoplasms , Silent Mutation , Humans , Neoplasms/genetics , Machine Learning
ABSTRACT
As one of the most important post-translational modifications (PTM), lysine acetylation (Kace) plays an important role in various biological activities. Traditional experimental methods for identifying Kace sites are inefficient and expensive. Instead, several machine learning methods have been developed for Kace site prediction, and hand-crafted features have been used to encode the protein sequences. However, there are still two challenges: the complex biological information may be under-represented by these manmade features and the small sample issue of some species needs to be addressed. We propose a novel model, MSTL-Kace, which was developed based on transfer learning strategy with pretrained bidirectional encoder representations from transformers (BERT) model. In this model, the high-level embeddings were extracted from species-specific BERT models, and a two-stage fine-tuning strategy was used to deal with small sample issue. Specifically, a domain-specific BERT model was pretrained using all of the sequences in our data sets, which was then fine-tuned, or two-stage fine-tuned based on the training data set of each species to obtain the species-specific BERT models. Afterward, the embeddings of residues were extracted from the fine-tuned model and fed to the different downstream learning algorithms. After comparison, the best model for the six prokaryotic species was built by using a random forest. The results for the independent test sets show that our model outperforms the state-of-the-art methods on all six species. The source codes and data for MSTL-Kace are available at https://github.com/leo97king/MSTL-Kace.
ABSTRACT
Alternative splicing (AS) regulation has an essential role in complex diseases. However, the AS profiles in the hippocampal (HIPPO) region of the human brain are underexplored. Here, we investigated cis-acting sQTLs of the HIPPO region in 264 samples and identified thousands of significant sQTLs. By enrichment analysis and functional characterization of these sQTLs, we found that the HIPPO sQTLs were enriched among histone-marked regions, transcription factor binding sites, RNA-binding protein sites, and brain disorder-associated loci. Comparative analyses with the dorsolateral prefrontal cortex revealed the importance of AS regulation in HIPPO (rg = 0.87). Furthermore, we performed a transcriptome-wide association study of Alzheimer's disease and identified 16 significant genes whose genetically regulated splicing levels may have a causal role in Alzheimer's disease. Overall, our study improves our knowledge of transcriptome regulation in the HIPPO region and provides novel insights into the pathogenesis of potential genes associated with brain disorders.
ABSTRACT
BACKGROUND AND HYPOTHESIS: Psychiatric disorders impose a huge health and economic burden on modern society. However, there is currently no proven completely effective treatment available, partly owing to the inefficiency of drug target identification and validation. We aim to identify therapeutic targets relevant to psychiatric disorders by conducting Mendelian randomization (MR) analysis. STUDY DESIGN: We performed genome-wide MR analysis by integrating expression quantitative trait loci (eQTL) of 4479 actionable genes that encode druggable proteins and genetic summary statistics from genome-wide association studies of psychiatric disorders. After conducting colocalization analysis on the brain MR findings, we employed protein quantitative trait loci (pQTL) data as proposed genetic instruments for intersecting the colocalized genes to provide further genetic evidence. STUDY RESULTS: By performing MR and colocalization analysis with eQTL genetic instruments, we obtained 31 promising drug targets for psychiatric disorders, including 21 significant genes for schizophrenia, 7 for bipolar disorder, 2 for depression, 1 for attention deficit hyperactivity disorder (ADHD) and none for autism spectrum disorder. Combining MR results using pQTL genetic instruments, we finally proposed 8 drug-targeting genes supported by the strongest MR evidence, including ACE, BTN3A3, HAPLN4, MAPK3 and NEK4 for schizophrenia, NEK4 and HAPLN4 for bipolar disorder, and TIE1 for ADHD. CONCLUSIONS: Our findings with genetic support are more likely to succeed in clinical trials. In addition, our study prioritizes approved drug targets for the development of new therapies and provides critical drug repurposing opportunities for psychiatric disorders.
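As background for the MR step, a common single-instrument estimator is the Wald ratio: the variant's effect on the outcome divided by its effect on the exposure (here, gene expression). The sketch below is a generic illustration and an assumption on our part; the abstract does not say which MR estimator the authors used, and the function name `wald_ratio` and the toy effect sizes are hypothetical.

```python
import math


def wald_ratio(beta_exposure, beta_outcome, se_outcome):
    """Single-instrument MR estimate with a first-order standard error.

    beta_exposure: eQTL effect of the variant on gene expression
    beta_outcome:  GWAS effect of the same variant on the disorder
    se_outcome:    standard error of beta_outcome
    Returns (estimate, standard_error, two_sided_p).
    """
    if beta_exposure == 0:
        raise ValueError("instrument has no effect on the exposure")
    estimate = beta_outcome / beta_exposure
    # First-order delta-method approximation, ignoring uncertainty in the eQTL effect
    se = se_outcome / abs(beta_exposure)
    z = estimate / se
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return estimate, se, p_value


# Toy numbers, not taken from the study
toy_est, toy_se, toy_p = wald_ratio(0.5, 0.1, 0.02)
```

With several instruments per gene, per-variant ratios are usually combined (for example by inverse-variance weighting) rather than used individually.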
Subject(s)
Attention Deficit Hyperactivity Disorder , Autism Spectrum Disorder , Bipolar Disorder , Humans , Genome-Wide Association Study/methods , Mendelian Randomization Analysis/methods , Bipolar Disorder/drug therapy , Bipolar Disorder/genetics , Attention Deficit Hyperactivity Disorder/genetics , Single Nucleotide Polymorphism/genetics
ABSTRACT
Gene regulatory networks (GRNs) participate in many biological processes, and reconstructing them plays an important role in systems biology. Although many advanced methods have been proposed for GRN reconstruction, their predictive performance is far from the ideal standard, so it is urgent to design a more effective method to reconstruct GRN. Moreover, most methods only consider the gene expression data, ignoring the network structure information contained in GRN. In this study, we propose a supervised model named CNNGRN, which infers GRN from bulk time-series expression data via convolutional neural network (CNN) model, with a more informative feature. Bulk time series gene expression data imply the intricate regulatory associations between genes, and the network structure feature of ground-truth GRN contains rich neighbor information. Hence, CNNGRN integrates the above two features as model inputs. In addition, CNN is adopted to extract intricate features of genes and infer the potential associations between regulators and target genes. Moreover, feature importance visualization experiments are implemented to seek the key features. Experimental results show that CNNGRN achieved competitive performance on benchmark datasets compared to the state-of-the-art computational methods. Finally, hub genes identified based on CNNGRN have been confirmed to be involved in biological processes through literature.
Subject(s)
Algorithms , Gene Regulatory Networks , Gene Regulatory Networks/genetics , Time Factors , Computer Neural Networks , Systems Biology , Computational Biology/methods
ABSTRACT
MOTIVATION: With the great number of peptide sequences produced in the postgenomic era, it is highly desirable to identify the various functions of therapeutic peptides quickly. Furthermore, it is a great challenge to accurately predict multi-functional therapeutic peptides (MFTP) via sequence-based computational tools. RESULTS: Here, we propose a novel multi-label-based method, named ETFC, to predict 21 categories of therapeutic peptides. The method utilizes a deep learning-based model architecture, which consists of four blocks: embedding, text convolutional neural network, feed-forward network, and classification blocks. This method also adopts an imbalanced learning strategy with a novel multi-label focal dice loss function. Multi-label focal dice loss is applied in the ETFC method to solve the inherent imbalance problem in the multi-label dataset and achieve competitive performance. The experimental results show that the ETFC method is significantly better than the existing methods for MFTP prediction. With the established framework, we use teacher-student-based knowledge distillation to obtain the attention weights from the self-attention mechanism in the MFTP prediction and quantify their contributions toward each of the investigated activities. AVAILABILITY AND IMPLEMENTATION: The source code and dataset are available via: https://github.com/xialab-ahu/ETFC.
Subject(s)
Deep Learning , Humans , Computer Neural Networks , Peptides/therapeutic use , Software
ABSTRACT
Introduction: Driver mutations play a critical role in the occurrence and development of human cancers. Most studies have focused on missense mutations that function as drivers in cancer. However, accumulating experimental evidence indicates that synonymous mutations can also act as driver mutations. Methods: Here, we proposed a computational method called PredDSMC to accurately predict driver synonymous mutations in human cancers. We first systematically explored four categories of multimodal features, including sequence features, splicing features, conservation scores, and functional scores. Further feature selection was carried out to remove redundant features and improve the model performance. Finally, we utilized the random forest classifier to build PredDSMC. Results: The results of two independent test sets indicated that PredDSMC outperformed the state-of-the-art methods in differentiating driver synonymous mutations from passenger mutations. Discussion: In conclusion, we expect that PredDSMC, as a driver synonymous mutation prediction method, will be a valuable method for gaining a deeper understanding of synonymous mutations in human cancers.
ABSTRACT
MOTIVATION: Phage genome annotation plays a key role in the design of phage therapy. To date, there have been various genome annotation tools for phages, but most of these tools focus on mono-functional annotation and have complex operational processes. Accordingly, comprehensive and user-friendly platforms for phage genome annotation are needed. RESULTS: Here, we propose PhaGAA, an online integrated platform for phage genome annotation and analysis. By incorporating several annotation tools, PhaGAA is constructed to annotate the prophage genome at DNA and protein levels and provide the analytical results. Furthermore, PhaGAA could mine and annotate phage genomes from bacterial genome or metagenome. In summary, PhaGAA will be a useful resource for experimental biologists and help advance the phage synthetic biology in basic and application research. AVAILABILITY AND IMPLEMENTATION: PhaGAA is freely available at http://phage.xialab.info/.
Subject(s)
Bacteriophages , Bacteriophages/genetics , Software , Computers , Metagenome , Bacterial Genome , Molecular Sequence Annotation
ABSTRACT
The advances in single-cell ribonucleic acid sequencing (scRNA-seq) allow researchers to explore cellular heterogeneity and human diseases at cell resolution. Cell clustering is a prerequisite in scRNA-seq analysis since it can recognize cell identities. However, the high dimensionality, noises and significant sparsity of scRNA-seq data have made it a big challenge. Although many methods have emerged, they still fail to fully explore the intrinsic properties of cells and the relationship among cells, which seriously affects the downstream clustering performance. Here, we propose a new deep contrastive clustering algorithm called scDCCA. It integrates a denoising auto-encoder and a dual contrastive learning module into a deep clustering framework to extract valuable features and realize cell clustering. Specifically, to better characterize and learn data representations robustly, scDCCA utilizes a denoising Zero-Inflated Negative Binomial model-based auto-encoder to extract low-dimensional features. Meanwhile, scDCCA incorporates a dual contrastive learning module to capture the pairwise proximity of cells. By increasing the similarities between positive pairs and the differences between negative ones, the contrasts at both the instance and the cluster level help the model learn more discriminative features and achieve better cell segregation. Furthermore, scDCCA joins feature learning with clustering, which realizes representation learning and cell clustering in an end-to-end manner. Experimental results of 14 real datasets validate that scDCCA outperforms eight state-of-the-art methods in terms of accuracy, generalizability, scalability and efficiency. Cell visualization and biological analysis demonstrate that scDCCA significantly improves clustering and facilitates downstream analysis for scRNA-seq data. The code is available at https://github.com/WJ319/scDCCA.
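The idea of pulling positive pairs together and pushing negative pairs apart is often expressed with an InfoNCE-style loss. The sketch below shows that generic loss for one anchor embedding in plain Python; it is an assumption used for illustration, since the abstract does not give the exact loss that scDCCA uses, and the names `cosine`, `info_nce` and the temperature value are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.5):
    """Contrastive loss for one anchor: low when the positive is the closest vector."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


# Toy 2-D embeddings: an aligned positive gives a small loss, a flipped one a large loss
loss_good = info_nce([1.0, 0.0], [1.0, 0.0], [[-1.0, 0.0]])
loss_bad = info_nce([1.0, 0.0], [-1.0, 0.0], [[1.0, 0.0]])
```

At the instance level the vectors would be cell embeddings from two augmented views; at the cluster level the same form can be applied to columns of the soft cluster-assignment matrix.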
Subject(s)
Gene Expression Profiling , Single-Cell Gene Expression Analysis , Humans , Gene Expression Profiling/methods , RNA Sequence Analysis/methods , Single-Cell Analysis/methods , Algorithms , Cluster Analysis
ABSTRACT
Although previous studies have revealed that synonymous mutations contribute to various human diseases, distinguishing deleterious synonymous mutations from benign ones is still a challenge in medical genomics. Recently, computational tools have been introduced to predict the harmfulness of synonymous mutations. However, most of these computational tools rely on balanced training sets without considering abundant negative samples that could result in deficient performance. In this study, we propose a computational model that uses a selective ensemble to predict deleterious synonymous mutations (seDSM). We construct several candidate base classifiers for the ensemble using balanced training subsets randomly sampled from the imbalanced benchmark training sets. The diversity measures of the base classifiers are calculated by the pairwise diversity metrics, and the classifiers with the highest diversities are selected for integration using soft voting for synonymous mutation prediction. We also design two strategies for filling in missing values in the imbalanced dataset and constructing models using different pairwise diversity metrics. The experimental results show that a selective ensemble based on double fault with the ensemble strategy EKNNI for filling in missing values is the most effective scheme. Finally, using 40-dimensional biology features, we propose a novel model based on a selective ensemble for predicting deleterious synonymous mutations (seDSM). seDSM outperformed other state-of-the-art methods on the independent test sets according to multiple evaluation indicators, indicating that it has an outstanding predictive performance for deleterious synonymous mutations. We hope that seDSM will be useful for studying deleterious synonymous mutations and advancing our understanding of synonymous mutations. The source code of seDSM is freely accessible at https://github.com/xialab-ahu/seDSM.git.
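Two ingredients named above, the double-fault diversity measure and soft voting, have simple standard definitions. The sketch below illustrates them in plain Python on toy predictions; the function names and data are assumptions, and the selection procedure in seDSM itself may differ in detail.

```python
def double_fault(y_true, pred_a, pred_b):
    """Fraction of samples misclassified by BOTH classifiers.

    Lower values mean the two classifiers fail on different samples,
    i.e. they are more diverse.
    """
    both_wrong = sum(
        1 for y, a, b in zip(y_true, pred_a, pred_b) if a != y and b != y
    )
    return both_wrong / len(y_true)


def soft_vote(prob_lists, threshold=0.5):
    """Average per-classifier probabilities, then threshold to 0/1 labels."""
    n_models = len(prob_lists)
    averaged = [sum(ps) / n_models for ps in zip(*prob_lists)]
    return [1 if p >= threshold else 0 for p in averaged]


# Toy labels and predictions from two hypothetical base classifiers
toy_df = double_fault([1, 0, 1, 0], [1, 1, 0, 0], [1, 0, 0, 1])
toy_votes = soft_vote([[0.9, 0.2], [0.6, 0.5]])
```

A selective ensemble would compute the pairwise measure for all candidate pairs, keep the most diverse classifiers, and pass only their probabilities to `soft_vote`.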
Subject(s)
Genomics , Silent Mutation , Humans , Genomics/methods , Software , Algorithms
ABSTRACT
With the discovery of causality between synonymous mutations and diseases, it has become increasingly important to identify deleterious synonymous mutations for better understanding of their functional mechanisms. Although several machine learning methods have been proposed to solve the task, an effective feature representation method that can make use of the inner difference and relevance between deleterious and benign synonymous mutations is still challenging considering the vast number of synonymous mutations in the human genome. In this work, we developed a robust and accurate predictor called frDSM for deleterious synonymous mutation prediction using logistic regression. More specifically, we introduced an effective feature representation learning method that exploits multiple feature descriptors from different perspectives, including functional scores obtained from previous computational methods, evolutionary conservation, splicing and sequence feature descriptors. These feature descriptors were input into 76 XGBoost classifiers to obtain predictive probability values. These probabilities were concatenated to generate a new 76-dimensional feature vector, and a feature selection method was used to remove redundant and irrelevant features. Experimental results show that frDSM, with 31 optimal features, enables more robust and accurate prediction than the competing prediction methods, which demonstrates the effectiveness of the feature representation learning method. frDSM is freely available at http://frdsm.xialab.info.
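The two-level scheme described above (base classifiers emit probabilities, the probabilities become the new feature vector, and logistic regression makes the final call) can be sketched in a few lines. This is a schematic only: the base models here are stand-in callables rather than trained XGBoost classifiers, and the names `stack_probabilities` and `logistic_predict`, the weights, and the toy values are all hypothetical.

```python
import math


def stack_probabilities(base_models, sample):
    """Concatenate each base model's predicted probability into one feature vector."""
    return [model(sample) for model in base_models]


def logistic_predict(weights, bias, features):
    """Logistic-regression probability for a stacked feature vector."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


# Two stand-in base models; in the described method there would be 76 of them
toy_models = [lambda s: 0.2, lambda s: 0.8]
toy_vector = stack_probabilities(toy_models, sample=None)
toy_prob = logistic_predict([0.0, 0.0], 0.0, toy_vector)
```

Feature selection would then operate on the stacked vector, dropping entries whose base-model probabilities add little beyond the others.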
Subject(s)
Human Genome , Silent Mutation , Humans , Human Genome/genetics , Machine Learning , Algorithms
ABSTRACT
Prediction of therapeutic peptides is a significant step in the discovery of promising therapeutic drugs. Most existing studies have focused on mono-functional therapeutic peptide prediction. However, the number of multi-functional therapeutic peptides (MFTP) is growing rapidly, which calls for new computational schemes to facilitate MFTP discovery. In this study, based on a multi-head self-attention mechanism and a class weight optimization algorithm, we propose a novel model called PrMFTP for MFTP prediction. PrMFTP exploits a multi-scale convolutional neural network, bi-directional long short-term memory, and multi-head self-attention mechanisms to fully extract and learn informative features of peptide sequences to predict MFTP. In addition, we design a class weight optimization scheme to address the problem of label-imbalanced data. Comprehensive evaluation demonstrates that PrMFTP is superior to other state-of-the-art computational methods for predicting MFTP. We provide a user-friendly web server of PrMFTP, which is available at http://bioinfo.ahu.edu.cn/PrMFTP.