1.
Brief Bioinform ; 25(2)2024 Jan 22.
Article in English | MEDLINE | ID: mdl-38517693

ABSTRACT

Numerous investigations increasingly indicate the significance of microRNA (miRNA) in human diseases. Hence, uncovering associations between miRNAs and diseases can contribute to precise diagnosis and effective treatment of medical conditions. Detecting miRNA-disease associations (MDAs) with computational techniques that use biological information has emerged as a cost-effective and highly efficient approach. Here, we introduce a computational framework named ReHoGCNES, designed for prospective miRNA-disease association prediction (ReHoGCNES-MDA). This method constructs a homogenous graph convolutional network with a regular graph structure (ReHoGCN) encompassing the disease similarity network, the miRNA similarity network and the known MDA network, and was then tested on four experimental tasks. A random edge sampler strategy was used to speed up training and reduce its complexity. Experimental results demonstrate that the proposed ReHoGCNES-MDA method outperforms both the homogenous graph convolutional network and the heterogeneous graph convolutional network with non-regular graph structure in all four tasks, which implicitly reveals that a steady degree distribution of the graph plays an important role in enhancing model performance. In addition, ReHoGCNES-MDA is superior to several machine learning algorithms and state-of-the-art methods on MDA prediction. Furthermore, three case studies were conducted to further demonstrate the predictive ability of ReHoGCNES. Consequently, 93.3% (breast neoplasms), 90% (prostate neoplasms) and 93.3% (prostate neoplasms) of the top 30 forecasted miRNAs were validated by public databases. Hence, ReHoGCNES-MDA might serve as a dependable and beneficial model for predicting possible MDAs.


Subjects
MicroRNAs; Prostatic Neoplasms; Humans; Male; Algorithms; Computational Biology/methods; Databases, Genetic; MicroRNAs/genetics; Prospective Studies; Prostatic Neoplasms/genetics; Female
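
The abstract above mentions a random edge sampler used to reduce training cost. A minimal sketch of that general idea follows, assuming uniform sampling without replacement over an edge list. The function name `sample_edges` and the toy identifiers are hypothetical and are not taken from the ReHoGCNES code.

```python
import random

def sample_edges(edges, num_samples, seed=None):
    """Uniformly sample a subset of edges and list the nodes they touch."""
    rng = random.Random(seed)
    k = min(num_samples, len(edges))
    sampled = rng.sample(edges, k)  # sampling without replacement
    nodes = sorted({node for edge in sampled for node in edge})
    return sampled, nodes

# Toy miRNA-disease graph with hypothetical identifiers.
edges = [("miR-1", "D1"), ("miR-1", "D2"), ("miR-2", "D1"),
         ("miR-3", "D3"), ("miR-2", "D3")]
sub_edges, sub_nodes = sample_edges(edges, num_samples=3, seed=0)
```

In a training loop, a fresh subgraph like `sub_edges` would typically be drawn at each step so that each update touches only part of the graph.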
2.
Brief Bioinform ; 23(1)2022 01 17.
Article in English | MEDLINE | ID: mdl-34671814

ABSTRACT

One of the main problems with the joint use of multiple drugs is that it may cause adverse drug interactions and side effects that damage the body. Therefore, it is important to predict potential drug interactions. However, most of the available prediction methods can only predict whether two drugs interact or not, whereas few methods can predict interaction events between two drugs. Accurately predicting the interaction events of two drugs is more useful for researchers studying the mechanism of their interaction. In the present study, we propose a novel method, MDF-SA-DDI, which predicts drug-drug interaction (DDI) events based on multi-source drug fusion, multi-source feature fusion and a transformer self-attention mechanism. MDF-SA-DDI is mainly composed of two parts: multi-source drug fusion and multi-source feature fusion. First, we combine two drugs in four different ways and input the combined drug feature representation into four different drug fusion networks (a Siamese network, a convolutional neural network and two auto-encoders) to obtain the latent feature vectors of the drug pairs. The two auto-encoders have the same structure and differ mainly in the number of neurons in their input layers. Then, we use transformer blocks that include a self-attention mechanism to perform latent feature fusion. We conducted experiments on three different tasks with two datasets. On the small dataset, the area under the precision-recall curve (AUPR) and F1 scores of our method on task 1 reached 0.9737 and 0.8878, respectively, which were better than those of the state-of-the-art method. On the large dataset, the AUPR and F1 scores of our method on task 1 reached 0.9773 and 0.9117, respectively. On tasks 2 and 3 of both datasets, our method performed as well as or better than the state-of-the-art method. More importantly, case studies on five DDI events were conducted and achieved satisfactory performance.
The source codes and data are available at https://github.com/ShenggengLin/MDF-SA-DDI.


Subjects
Drug-Related Side Effects and Adverse Reactions; Neural Networks, Computer; Drug Interactions; Humans; Oligosaccharides; Software
3.
Brief Bioinform ; 23(4)2022 07 18.
Article in English | MEDLINE | ID: mdl-35830870

ABSTRACT

We construct a protein-protein interaction (PPI) targeted drug-likeness dataset and propose a deep molecular generative framework to generate novel drug-like molecules from the features of the seed compounds. This framework gains inspiration from published molecular generative models, uses the key features associated with PPI inhibitors as input and develops deep molecular generative models for de novo molecular design of PPI inhibitors. For the first time, a quantitative estimation index for compounds targeting PPI was applied to the evaluation of the molecular generation model for de novo design of PPI-targeted compounds. Our results indicated that the generated molecules had better PPI-targeted drug-likeness and drug-likeness. Additionally, our model exhibits performance comparable to several other state-of-the-art molecule generation models. The generated molecules share chemical space with iPPI-DB inhibitors, as demonstrated by chemical space analysis. The peptide characterization-oriented design of PPI inhibitors and the ligand-based design of PPI inhibitors are explored. Finally, we expect that this framework will be an important step forward for the de novo design of PPI-targeted therapeutics.


Subjects
Drug Design; Neural Networks, Computer; Ligands; Models, Molecular
4.
Brief Bioinform ; 22(1): 451-462, 2021 01 18.
Article in English | MEDLINE | ID: mdl-31885041

ABSTRACT

Drug-target interactions (DTIs) play a crucial role in target-based drug discovery and development. Computational prediction of DTIs can effectively complement experimental wet-lab techniques for the identification of DTIs, which are typically time- and resource-consuming. However, current DTI prediction approaches suffer from low precision and a high false-positive rate. In this study, we aim to improve prediction performance with a novel DTI prediction method based on a cascade deep forest (CDF) model, named DTI-CDF, which uses multiple similarity-based features between drugs and similarity-based features between target proteins extracted from the heterogeneous graph that contains known DTIs. In the experiments, we built five replicates of 10-fold cross-validation under three different experimental settings, in which the DTI values of certain drugs (SD), targets (ST), or drug-target pairs (SP) are missing from the training sets but present in the test sets. The experimental results demonstrate that our proposed approach DTI-CDF achieves significantly higher performance than traditional ensemble learning-based methods such as random forest and XGBoost, deep neural networks, and state-of-the-art methods such as DDR. Furthermore, 1352 newly predicted DTIs were confirmed by the KEGG and DrugBank databases. The data sets and source code are freely available at https://github.com/a96123155/DTI-CDF.


Subjects
Drug Development/methods; Proteomics/methods; Software; Humans; Molecular Docking Simulation/methods; Sequence Analysis, Protein/methods
5.
Brief Bioinform ; 22(3)2021 05 20.
Article in English | MEDLINE | ID: mdl-32964234

ABSTRACT

Identifying drug-target interactions (DTIs) is an important step for drug discovery and drug repositioning. To reduce the experimental cost, a large number of computational approaches have been proposed for this task. Machine learning-based models, especially binary classification models, have been developed to predict whether a drug-target pair interacts or not. However, there is still much room for improvement in the performance of current methods. Multi-label learning can overcome some difficulties caused by single-label learning and so improve predictive performance. The key challenge faced by multi-label learning is the exponential-sized output space, and considering label correlations can help to overcome this challenge. In this paper, we facilitate multi-label classification by introducing community detection methods for DTI prediction, in a method named DTI-MLCD. Moreover, we updated the gold standard data set, which has been widely used by most previously published DTI prediction methods since 2008, by adding 15,000 more positive DTI samples. The proposed DTI-MLCD is applied to both data sets, demonstrating its superiority over other machine learning methods and several existing methods. The data sets and source code of this study are freely available at https://github.com/a96123155/DTI-MLCD.


Subjects
Algorithms; Computational Biology/methods; Machine Learning; Pharmaceutical Preparations/metabolism; Proteins/metabolism; Computer Simulation; Drug Discovery/methods; Drug Repositioning/methods; Internet; Molecular Targeted Therapy/methods; Pharmaceutical Preparations/administration & dosage; Pharmaceutical Preparations/chemistry; Protein Binding; Proteins/antagonists & inhibitors; Proteins/chemistry; Reproducibility of Results
6.
Brief Bioinform ; 22(6)2021 11 05.
Article in English | MEDLINE | ID: mdl-34009265

ABSTRACT

Accurate identification of miRNA-disease associations (MDAs) helps to understand the etiology and mechanisms of various diseases. However, experimental methods are costly and time-consuming. Thus, computational methods for the prediction of MDAs are urgently needed. Based on graph theory, MDA prediction is regarded as a node classification task in the present study. To solve this task, we propose a novel method, MDA-GCNFTG, which predicts MDAs based on Graph Convolutional Networks (GCNs) via graph sampling through the Feature and Topology Graph to improve training efficiency and accuracy. This method models both the potential connections of feature space and the structural relationships of MDA data. The nodes of the graphs are represented by the disease semantic similarity, miRNA functional similarity and Gaussian interaction profile kernel similarity. Moreover, we considered six tasks simultaneously on the MDA prediction problem for the first time, which ensures that, under both balanced and unbalanced sample distributions, MDA-GCNFTG can predict not only new MDAs but also new diseases without known related miRNAs and new miRNAs without known related diseases. The results of 5-fold cross-validation show that the MDA-GCNFTG method achieved satisfactory performance on all six tasks and is significantly superior to classic machine learning methods and state-of-the-art MDA prediction methods. Moreover, the effectiveness of GCNs via the graph sampling strategy and of the feature and topology graph in MDA-GCNFTG has also been demonstrated. More importantly, case studies for two diseases and three miRNAs were conducted and achieved satisfactory performance.


Subjects
Biomarkers; Computational Biology/methods; Disease Susceptibility; Gene Expression Regulation; MicroRNAs/genetics; Software; Algorithms; Databases, Genetic; Humans; Reproducibility of Results; Workflow
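
The abstract above names Gaussian interaction profile (GIP) kernel similarity as one of the node representations. A minimal sketch of the commonly used form of that kernel follows, assuming each row of `profiles` is the binary association vector of one miRNA (or disease) and that the bandwidth is normalised by the mean squared norm of the profiles. The function name `gip_kernel` and the parameter `gamma_prime` are illustrative choices, not names from the MDA-GCNFTG code.

```python
import math

def gip_kernel(profiles, gamma_prime=1.0):
    """Gaussian interaction profile kernel between rows of a 0/1 matrix."""
    n = len(profiles)
    sq_norms = [sum(v * v for v in row) for row in profiles]
    mean_sq = sum(sq_norms) / n
    if mean_sq == 0:
        raise ValueError("all interaction profiles are empty")
    gamma = gamma_prime / mean_sq  # bandwidth scaled by mean squared norm
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dist_sq = sum((a - b) ** 2
                          for a, b in zip(profiles[i], profiles[j]))
            kernel[i][j] = math.exp(-gamma * dist_sq)
    return kernel

# Three hypothetical miRNAs described by associations with three diseases.
profiles = [[1, 0, 1], [1, 0, 1], [0, 1, 0]]
K = gip_kernel(profiles)
```

Identical profiles get similarity 1, and the similarity decays with the squared distance between profiles.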
7.
Brief Bioinform ; 22(6)2021 11 05.
Article in English | MEDLINE | ID: mdl-34396388

ABSTRACT

Neuropeptides, which act as signaling molecules in the nervous system of various animals, play crucial roles in a wide range of physiological functions and hormone regulation behaviors. Neuropeptides offer many opportunities for the discovery of new drugs and targets for the treatment of neurological diseases. In recent years, several data-driven computational predictors have been developed for various types of bioactive peptides, but little work has so far addressed neuropeptides. In this work, we developed an interpretable stacking model, named NeuroPpred-Fuse, for the prediction of neuropeptides by fusing a variety of sequence-derived features and feature selection methods. Specifically, we used six types of sequence-derived features to encode the peptide sequences and then combined them. In the first layer, we ensembled three base classifiers and four feature selection algorithms, which select non-redundant important features complementarily. In the second layer, the output of the first layer was merged and fed into a logistic regression (LR) classifier to train the model. Moreover, we analyzed the selected features and explained their feasibility. Experimental results show that our model achieved 90.6% accuracy and 95.8% AUC on the independent test set, outperforming the state-of-the-art models. In addition, we showed the distribution of the features selected by these tree models and compared the results on the training set with those on the test set. These results showed that our model has a certain generalization ability. Therefore, we expect that our model will provide important advances in the discovery of neuropeptides as new drugs for the treatment of neurological diseases.


Subjects
Models, Biological; Neuropeptides/chemistry; Algorithms; Computational Biology/methods; Machine Learning
8.
Mol Pharm ; 17(4): 1059-1070, 2020 04 06.
Article in English | MEDLINE | ID: mdl-31860321

ABSTRACT

Various types of nanocarriers modified with poly(ethylene glycol) (PEG) exhibit the accelerated blood clearance (ABC) phenomenon, resulting in reduced circulation time and abnormal increase in hepatic and splenic accumulations. Based on the abundance of esterases in the serum of rats, we developed cleavable methoxy PEG-cholesteryl methyl carbonate (mPEG-CHMC) with a carbonate linkage and noncleavable N-(carbonyl-methoxy PEG-n)-1,2-distearoyl-sn-glycero-3-phosphoethanolamine (mPEG-DSPE) with a carbamate linkage on the surface of the nanoemulsions (CHMCE and PE, respectively). Both PEG derivatives possessed PEG with six different molecular weights (n = 350, 550, 750, 1000, 2000, and 5000). The pharmacokinetic behaviors and biodistributions of single and repeated injections of the two types of PEGylated nanoemulsions were determined to investigate the influence of cleavable linkages and PEG molecular weights on the ABC phenomenon, in an attempt to find a potential strategy to eliminate it. CHMCEns (n = 1000, 2000, and 5000) exhibited the same pharmacokinetic behaviors as PE550 and PE750 and only alleviated the ABC phenomenon to a certain extent at the expense of shortened circulation time, indicating that the cleavable carbonate linkage was not an ideal strategy to eliminate the ABC phenomenon. As the molecular weights of PEG increased, the ABC phenomenon became more severe. Surprisingly, PE5000 induced a lower anti-PEG IgM level and a weaker ABC phenomenon compared with PE2000 while possessing a similarly long circulation time. The results suggested that increasing the molecular weight of PEG in the PEG derivatives could be a potential strategy for eliminating the ABC phenomenon while simultaneously guaranteeing longer circulation time.


Subjects
Cholesterol/metabolism; Emulsions/metabolism; Lipids/chemistry; Nanoparticles/metabolism; Phospholipids/metabolism; Polyethylene Glycols/metabolism; Animals; Drug Carriers/chemistry; Drug Carriers/metabolism; Emulsions/chemistry; Immunoglobulin M/metabolism; Kinetics; Male; Metabolic Clearance Rate/physiology; Molecular Weight; Nanoparticles/chemistry; Polyethylene Glycols/chemistry; Rats; Rats, Wistar; Spleen/metabolism
9.
J Chem Inf Model ; 59(11): 4577-4586, 2019 11 25.
Article in English | MEDLINE | ID: mdl-31603319

ABSTRACT

A drug may be metabolized by multiple cytochrome P450 (CYP450) isoforms. Predicting the metabolic fate of drugs is very important to prevent drug-drug interactions in the development of novel pharmaceuticals. Prediction of CYP450 enzyme-substrate selectivity is formulated as a multilabel learning task in this study. First, we compared the performance of feature combinations based on four different categories of features: physicochemical property descriptors, mol2vec descriptors, extended connectivity fingerprints, and molecular access system key fingerprints. After identifying the best combination of features, we applied seven different multilabel models: multilabel k-nearest neighbor (ML-kNN), multilabel twin support vector machine, and five network-based label space division (NLSD)-based methods (NLSD-MLP, NLSD-XGB, NLSD-EXT, NLSD-RF, and NLSD-SVM). All of the six models (ML-kNN, NLSD-MLP, NLSD-XGB, NLSD-EXT, NLSD-RF, and NLSD-SVM) in this paper perform better than the previous work. In addition, NLSD-XGB achieves the best performance, with an average top-1 prediction success of 91.1%, an average top-2 prediction success of 96.2%, and an average top-3 prediction success of 98.2%. Compared with the previous work, NLSD-XGB shows a significant improvement of over 11% on top-1 in the 10 times repeated 5-fold cross-validation test and of over 14% on top-1 in the 10 times repeated hold-out method. To the best of our knowledge, this is the first time the network-based label space division model has been introduced in drug metabolism, and it performs well in this task.


Subjects
Cytochrome P-450 Enzyme System/metabolism; Pharmaceutical Preparations/metabolism; Humans; Models, Biological; Neural Networks, Computer; Pharmaceutical Preparations/chemistry; Substrate Specificity; Support Vector Machine
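
The abstract above reports top-1, top-2 and top-3 prediction success. A minimal sketch of one plausible reading of that metric follows, assuming a compound counts as a success when at least one of its k highest-scoring isoforms is a true label. This definition, the function name `top_k_success` and the toy scores are assumptions for illustration; the paper's exact definition may differ.

```python
def top_k_success(scores, true_labels, k):
    """Fraction of samples whose top-k scored labels include a true label."""
    hits = 0
    for row, truth in zip(scores, true_labels):
        # Label indices ordered from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if any(i in truth for i in ranked[:k]):
            hits += 1
    return hits / len(scores)

# Hypothetical scores for three compounds over three isoforms.
scores = [[0.9, 0.1, 0.3], [0.2, 0.5, 0.4], [0.1, 0.2, 0.7]]
true_labels = [{0}, {2}, {0, 1}]
top1 = top_k_success(scores, true_labels, k=1)
top2 = top_k_success(scores, true_labels, k=2)
```

By construction the value cannot decrease as k grows, which matches the ordering of the three figures quoted in the abstract.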
10.
Nat Mach Intell ; 6(4): 449-460, 2024 Apr.
Article in English | MEDLINE | ID: mdl-38855263

ABSTRACT

The 5' UTR, a regulatory region at the beginning of an mRNA molecule, plays a crucial role in regulating the translation process and impacts the protein expression level. Language models have showcased their effectiveness in decoding the functions of protein and genome sequences. Here, we introduced a language model for 5' UTR, which we refer to as the UTR-LM. The UTR-LM is pre-trained on endogenous 5' UTRs from multiple species and is further augmented with supervised information including secondary structure and minimum free energy. We fine-tuned the UTR-LM in a variety of downstream tasks. The model outperformed the best known benchmark by up to 5% for predicting the Mean Ribosome Loading, and by up to 8% for predicting the Translation Efficiency and the mRNA Expression Level. The model also applies to identifying unannotated Internal Ribosome Entry Sites within the untranslated region and improves the AUPR from 0.37 to 0.52 compared to the best baseline. Further, we designed a library of 211 novel 5' UTRs with high predicted values of translation efficiency and evaluated them via a wet-lab assay. Experiment results confirmed that our top designs achieved a 32.5% increase in protein production level relative to well-established 5' UTR optimized for therapeutics.

11.
Protein Sci ; 33(1): e4841, 2024 Jan.
Article in English | MEDLINE | ID: mdl-37983648

ABSTRACT

Recognition by the T-cell receptor (TCR) on the surface of a T cell of a specific epitope presented by the major histocompatibility complex is the key to triggering the immune response. Identifying the binding rules of TCR-epitope pairs is crucial for developing immunotherapies, including neoantigen vaccines and drugs. Accurate prediction of TCR-epitope binding specificity via deep learning remains challenging, especially for test cases that are unseen in the training set. Here, we propose TEPCAM (TCR-EPitope identification based on Cross-Attention and Multi-channel convolution), a deep learning model that incorporates self-attention, a cross-attention mechanism, and multi-channel convolution to improve generalizability and enhance model interpretability. Experimental results demonstrate that our model outperformed several state-of-the-art models on two challenging tasks, a strictly split dataset and an external dataset. Furthermore, the model can learn some interaction patterns between TCR and epitope, as shown by extracting the interpretable matrix from the cross-attention layer and mapping it to the three-dimensional structures. The source code and data are freely available at https://github.com/Chenjw99/TEPCAM.


Subjects
Deep Learning; T-Lymphocytes; Receptors, Antigen, T-Cell; Protein Binding; Epitopes, T-Lymphocyte/chemistry
12.
J Chem Theory Comput ; 20(11): 4469-4480, 2024 Jun 11.
Article in English | MEDLINE | ID: mdl-38816696

ABSTRACT

Protein-protein interactions are the basis of many protein functions, and understanding the contacts and conformational changes of protein-protein interactions is crucial for linking protein structure to biological function. Because these changes are difficult to detect experimentally, molecular dynamics (MD) simulations are widely used to study the conformational ensembles and dynamics of protein-protein complexes, but they have significant limitations in sampling efficiency and computational cost. In this study, a generative neural network was trained on protein-protein complex conformations obtained from molecular simulations to directly generate novel conformations with physical realism. We demonstrated the use of a deep learning model based on the transformer architecture to explore the conformational ensembles of protein-protein complexes through MD simulations. The results showed that the learned latent space can be used to generate unsampled conformations of protein-protein complexes that complement pre-existing ones, so the model can serve as an exploratory tool for the analysis and enhancement of molecular simulations of protein-protein complexes.


Subjects
Molecular Dynamics Simulation; Protein Conformation; Proteins; Proteins/chemistry; Neural Networks, Computer; Protein Binding
13.
Interdiscip Sci ; 15(3): 405-418, 2023 Sep.
Article in English | MEDLINE | ID: mdl-37247186

ABSTRACT

DNA methylation-based precision early tumor diagnostics is emerging as a state-of-the-art technology that could capture early cancer signs 3-5 years in advance, even for clinically homogeneous groups. Presently, the sensitivity of early detection for many tumors is about 30%, which needs significant improvement. Nevertheless, based on genome-wide DNA methylation data, one could comprehensively characterize tumors' entire molecular genetic landscape and their subtle differences. Therefore, novel high-performance methods must be developed that use the unbiased information in the abundantly available DNA methylation data. To fill this gap, we have designed a computational model involving a self-attention graph convolutional network and a multi-class classification support vector machine to identify the 11 most common cancers using DNA methylation data. The self-attention graph convolutional network automatically learns key methylation sites in a data-driven way. Then, multi-tumor early diagnostics is realized by training a multi-class classification support vector machine on the selected methylation sites. We evaluated our model's performance through experiments on several data sets, and our results demonstrate the effectiveness of the selected key methylation sites, which are highly relevant for blood diagnosis. [Graphical abstract: the pipeline of the self-attention graph convolutional network-based computational framework.]


Subjects
DNA Methylation; Neoplasms; Humans; DNA Methylation/genetics; Neoplasms/diagnosis; Neoplasms/genetics; Protein Processing, Post-Translational; Support Vector Machine
14.
Proc Natl Acad Sci U S A ; 105(38): 14262-4, 2008 Sep 23.
Article in English | MEDLINE | ID: mdl-18776050

ABSTRACT

The recent discovery of superconductivity with a relatively high transition temperature (Tc) in the layered iron-based quaternary oxypnictide La[O(1-x)F(x)]FeAs by Kamihara et al. [Kamihara Y, Watanabe T, Hirano M, Hosono H (2008) Iron-based layered superconductor La[O1-xFx]FeAs (x = 0.05-0.12) with Tc = 26 K. J Am Chem Soc 130:3296-3297.] was a real surprise and has generated tremendous interest. Although superconductivity exists in alloys that contain the element Fe, LaOMPn (with M = Fe, Ni; and Pn = P and As) is the first system where Fe plays the key role in the occurrence of superconductivity. LaOMPn has a layered crystal structure with an Fe-based plane. It is quite natural to ask whether other Fe-based planar compounds exist that exhibit superconductivity. Here, we report the observation of superconductivity with a zero-resistance transition temperature of 8 K in the PbO-type alpha-FeSe compound. A key observation is that the clean superconducting phase exists only in those samples prepared with intentional Se deficiency. FeSe, compared with LaOFeAs, is less toxic and much easier to handle. What is truly striking is that this compound has the same, perhaps simpler, planar crystal sublattice as the layered oxypnictides. Therefore, this result provides an opportunity to better understand the underlying mechanism of superconductivity in this class of unconventional superconductors.

15.
Comput Biol Med ; 136: 104706, 2021 09.
Article in English | MEDLINE | ID: mdl-34371319

ABSTRACT

MicroRNAs (miRNAs) are significant regulators in various biological processes. They may become promising biomarkers or therapeutic targets, which provide a new perspective in diagnosis and treatment of multiple diseases. Since the experimental methods are always costly and resource-consuming, prediction of disease-related miRNAs using computational methods is in great need. In this study, we developed MDA-CF to identify underlying miRNA-disease associations based on a cascade forest model. In this method, multi-source information was integrated to represent miRNAs and diseases comprehensively, and the autoencoder was utilized for dimension reduction to obtain the optimal feature space. The cascade forest model was then employed for miRNA-disease association prediction. As a result, the average AUC of MDA-CF was 0.9464 on HMDD v3.2 in five-fold cross-validation. Compared with previous computational methods, MDA-CF performed better on HMDD v2.0 with an average AUC of 0.9258. Moreover, MDA-CF was implemented to investigate colon neoplasm, breast neoplasm, and gastric neoplasm, and 100%, 86%, 88% of the top 50 potential miRNAs were validated by authoritative databases. In conclusion, MDA-CF appears to be a reliable method to uncover disease-associated miRNAs. The source code of MDA-CF is available at https://github.com/a1622108/MDA-CF.


Subjects
MicroRNAs; Algorithms; Computational Biology; Forests; Genetic Predisposition to Disease; Humans; MicroRNAs/genetics; Software
16.
Front Microbiol ; 11: 580382, 2020.
Article in English | MEDLINE | ID: mdl-33072049

ABSTRACT

Type IV secreted effectors (T4SEs) can be translocated into the cytosol of host cells via the type IV secretion system (T4SS) and cause diseases. However, experimental approaches to identify T4SEs are time- and resource-consuming, and the existing computational tools based on machine learning techniques have some obvious limitations, such as the lack of interpretability in the prediction models. In this study, we proposed a new model, T4SE-XGB, which uses the eXtreme gradient boosting (XGBoost) algorithm for accurate identification of type IV effectors using optimal features derived from protein sequences. After trying 20 different types of features, we found that, in 5-fold cross-validation, the best performance in comparison with other machine learning methods was achieved when all features were fed into XGBoost. Then, the ReliefF algorithm was adopted to obtain the optimal feature set on our dataset, which further improved the model performance. T4SE-XGB exhibited the highest predictive performance on the independent test set and outperformed other published prediction tools. Furthermore, the SHAP method was used to interpret the contribution of features to model predictions. The identification of key features can contribute to improved understanding of multifactorial contributors to host-pathogen interactions and bacterial pathogenesis. In addition to type IV effector prediction, we believe that the proposed framework can provide instructive guidance for similar studies to construct prediction methods on related biological problems. The data and source code of this study can be freely accessed at https://github.com/CT001002/T4SE-XGB.

17.
Front Chem ; 7: 895, 2019.
Article in English | MEDLINE | ID: mdl-31998687

ABSTRACT

Drug discovery is an academic and commercial process of global importance. Accurate identification of drug-target interactions (DTIs) can significantly facilitate the drug discovery process. Compared to the costly, labor-intensive and time-consuming experimental methods, machine learning (ML) plays an increasingly important role in effective, efficient and high-throughput identification of DTIs. However, upstream feature extraction methods require tremendous human resources and expert insights, which limits the application of ML approaches. Inspired by unsupervised representation learning methods like Word2vec, we here proposed SPVec, a novel way to automatically represent raw data such as SMILES strings and protein sequences as continuous, information-rich and lower-dimensional vectors, so as to avoid the sparseness and bit collisions of cumbersome, manually extracted features. Visualization of SPVec illustrated that similar compounds or proteins occupy similar vector space, which indicated that SPVec not only encodes compound substructures or protein sequences efficiently, but also implicitly reveals some important biophysical and biochemical patterns. Compared with manually designed features like MACCS fingerprints and amino acid composition (AAC), SPVec showed better performance with several state-of-the-art machine learning classifiers such as Gradient Boosting Decision Tree, Random Forest and Deep Neural Network on BindingDB. The performance and robustness of SPVec were also confirmed on independent test sets obtained from the DrugBank database. Also, based on the whole DrugBank dataset, we predicted the possibilities of all unlabeled DTIs, where two of the top five predicted novel DTIs were supported by external evidence. These results indicated that SPVec can provide an effective and efficient way to discover reliable DTIs, which would be beneficial for drug reprofiling.
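
The abstract above describes treating protein sequences like text before learning Word2vec-style embeddings. A minimal sketch of the tokenization step alone follows, assuming overlapping k-mers with k = 3 serve as the "words". The choice of k, the function name `kmer_words` and the example sequence are assumptions for illustration and are not confirmed by the abstract; the embedding step itself is not shown.

```python
def kmer_words(sequence, k=3):
    """Split a sequence into overlapping k-mers, read as 'words'."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

# A short hypothetical amino acid sequence.
words = kmer_words("MKTAYIA")
# words == ["MKT", "KTA", "TAY", "AYI", "YIA"]
```

Each sequence then becomes a "sentence" of such words, which is the input form that word-embedding models expect.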
