ABSTRACT
Recently, various modern experimental screening pipelines and assays have been developed to find promising anticancer drug candidates. However, it is time-consuming and almost infeasible to screen an immense number of compounds for anticancer activity via experimental approaches. To partially address this issue, several computational advances have been proposed. In this study, we present iACP-GCR, a model based on multitask learning on graph convolutional residual neural networks with two types of shortcut connections, to identify multitarget anticancer compounds. In our architecture, the graph convolutional residual neural networks are shared by all the prediction tasks before being separately customized. The NCI-60 data set, one of the most reliable and well-known sources of experimentally verified compounds, was used to develop our model. From that data set, we collected and refined data about compounds screened across nine cancer types (panels), including breast, central nervous system, colon, leukemia, nonsmall cell lung, melanoma, ovarian, prostate, and renal, for model training and evaluation. The model performance evaluated on an independent test set shows that iACP-GCR surpasses the three advanced computational methods for multitask learning. The integration of two shortcut connection types in the shared networks also improves the prediction efficiency. We also deployed the model as a public web server to assist the research community in screening potential anticancer compounds.
Subject(s)
Antineoplastic Agents , Neural Networks, Computer , Antineoplastic Agents/chemistry , Antineoplastic Agents/pharmacology , Humans , Machine Learning , Drug Screening Assays, Antitumor , Drug Evaluation, Preclinical , Neoplasms/drug therapy
ABSTRACT
Short-length antimicrobial peptides (AMPs) have been demonstrated to have intensified antimicrobial activities against a wide spectrum of microbes. Therefore, exploration of novel and promising short AMPs is highly essential in developing various types of antimicrobial drugs or treatments. In addition to experimental approaches, computational methods have been developed to improve screening efficiency. Although existing computational methods have achieved satisfactory performance, there is still much room for model improvement. In this study, we proposed iAMP-DL, an efficient hybrid deep learning architecture, for predicting short AMPs. The model was constructed using two well-known deep learning architectures: the long short-term memory architecture and convolutional neural networks. To fairly assess the performance of the model, we compared our model with existing state-of-the-art methods using the same independent test set. Our comparative analysis shows that iAMP-DL outperformed other methods. Furthermore, to assess the robustness and stability of our model, the experiments were repeated 10 times to observe the variation in prediction efficiency. The results demonstrate that iAMP-DL is an effective, robust, and stable framework for detecting promising short AMPs. Another comparative study of different negative data sampling methods also confirms the effectiveness of our method and demonstrates that it can also be used to develop a robust model for predicting AMPs in general. The proposed framework was also deployed as an online web server with a user-friendly interface to support the research community in identifying short AMPs.
Subject(s)
Antimicrobial Peptides , Deep Learning , Antimicrobial Peptides/chemistry , Antimicrobial Peptides/pharmacology , Neural Networks, Computer , Computational Biology/methods , Antimicrobial Cationic Peptides/chemistry , Antimicrobial Cationic Peptides/pharmacology
ABSTRACT
Antibiotic resistance has become one of the most concerning problems directly affecting patient recovery. For years, numerous efforts have been made to use antimicrobial drugs efficiently, at appropriate doses, not only to exterminate microbes but also to stringently limit any chance of bacterial evolution. However, choosing proper antibiotics is neither straightforward nor fast, because well-defined drugs can only be given to patients after determining the microbial taxonomy and evaluating minimum inhibitory concentrations (MICs). Besides conventional methods, numerous computer-aided frameworks have recently been developed using computational advances and public data sources on clinical antimicrobial resistance. In this study, we introduce eMIC-AntiKP, a computational framework specifically designed to predict the MIC values of 20 antibiotics against Klebsiella pneumoniae. Our prediction models were constructed using convolutional neural networks and k-mer counting-based features. The model for cefepime has the most limited performance, with a test 1-tier accuracy of 0.49, while the model for ampicillin has the highest performance, with a test 1-tier accuracy of 1.00. Most models have satisfactory performance, with test accuracies ranging from about 0.70 to 0.90. The significance of eMIC-AntiKP lies in its effective use of computing resources, which makes it a compact and portable tool for most moderately configured computers. We provide users with two options: an online web server for basic analysis and an offline package for deeper analysis and technical modification.
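As a rough illustration of the k-mer counting-based features mentioned above, the following minimal sketch turns a DNA sequence into a fixed-length count vector. The function name `kmer_counts`, the default `k`, and the choice to ignore windows containing non-ACGT characters are assumptions made for this example; the abstract does not state the value of k or the preprocessing used by eMIC-AntiKP.

```python
from collections import Counter
from itertools import product


def kmer_counts(sequence: str, k: int = 3, alphabet: str = "ACGT") -> list[int]:
    """Count every possible k-mer; output order follows itertools.product."""
    sequence = sequence.upper()
    # Count each overlapping window of length k.
    observed = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # One slot per possible k-mer, so every sequence maps to len(alphabet)**k values.
    # Windows containing other characters (e.g. N) are never looked up (assumption).
    return [observed["".join(p)] for p in product(alphabet, repeat=k)]


# Hypothetical toy sequence, not data from the study.
features = kmer_counts("ACGTACGT", k=2)
```

A vector like `features` could then be fed to a convolutional network or any other classifier; how the actual models consume these counts is not described in the abstract.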
ABSTRACT
Nonclassical secreted proteins (NSPs) are proteins released into the extracellular environment through biological transport pathways other than the Sec/Tat system. As experimental determination of NSPs is often costly and requires skilled handling techniques, computational approaches are necessary. In this study, we introduce iNSP-GCAAP, a computational prediction framework, to identify NSPs. We propose encoding sequence data with the global composition of a customized set of amino acid properties and using the random forest (RF) algorithm for classification. We used the training dataset introduced by Zhang et al. (Bioinformatics, 36(3), 704-712, 2020) to develop our model and tested it on the independent test set from the same study. The area under the receiver operating characteristic curve on that test set was 0.9256, which outperformed other state-of-the-art methods using the same datasets. Our framework is also deployed as a user-friendly web-based application to support the research community in predicting NSPs.
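The idea of a global composition over amino acid properties can be sketched as follows: for each property group, compute the fraction of residues in the sequence that belong to it. The groups in `PROPERTY_GROUPS` are purely illustrative assumptions; the customized property set used by iNSP-GCAAP is not given in the abstract.

```python
# Illustrative property groups only (assumption); not the paper's customized set.
PROPERTY_GROUPS = {
    "hydrophobic": set("AVLIMFWP"),
    "polar": set("STNQYCG"),
    "positive": set("KRH"),
    "negative": set("DE"),
}


def global_composition(sequence: str) -> dict[str, float]:
    """Fraction of residues in the whole sequence that fall in each property group."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return {
        name: sum(residue in members for residue in seq) / len(seq)
        for name, members in PROPERTY_GROUPS.items()
    }


# Hypothetical toy peptide, not a sequence from the benchmark dataset.
vector = global_composition("MKDEAL")
```

Each protein is thus reduced to one number per property, regardless of its length, which is the kind of fixed-size input a random forest classifier expects.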
Subject(s)
Amino Acids , Proteins , Amino Acids/metabolism , Proteins/chemistry , Software , Computational Biology/methods , Algorithms
ABSTRACT
Malaria is a threatening disease that has claimed many lives and has a high annual prevalence. Over the past decade, many studies have sought effective antimalarial compounds to combat this disease. Alongside chemically synthesized compounds, a number of natural compounds have also been proven to have effective antimalarial properties. Besides experimental approaches for investigating antimalarial activities in natural products, computational methods have been developed with satisfactory outcomes. In this study, we propose a novel molecular encoding scheme based on Bidirectional Encoder Representations from Transformers and use our pretrained encoding model, called NPBERT, with four machine learning algorithms, including k-Nearest Neighbors (k-NN), Support Vector Machines (SVM), eXtreme Gradient Boosting (XGB), and Random Forest (RF), to develop prediction models for identifying antimalarial natural products. The results show that SVM models are the best-performing classifiers, followed by the XGB, k-NN, and RF models. Additionally, comparative analysis between our proposed molecular encoding scheme and existing state-of-the-art methods indicates that NPBERT is more effective than the others. Moreover, the use of transformers in constructing molecular encoders is not limited to this study and can be applied to other biomedical applications.
Subject(s)
Antimalarials , Biological Products , Antimalarials/pharmacology , Antimalarials/chemistry , Biological Products/pharmacology , Support Vector Machine , Machine Learning , Algorithms
ABSTRACT
Transcription factors (TFs) play an important role in gene expression and in the regulation of 3D genome conformation. TFs can bind to specific DNA fragments called enhancers and promoters. Some TFs bind to promoter DNA fragments near the transcription initiation site and form complexes that allow polymerase enzymes to bind and initiate transcription. Previous studies showed that methylated DNA could inhibit or prevent TFs from binding to DNA fragments. However, recent studies have found that some TFs can bind to methylated DNA fragments. The identification of these TFs is an important stepping stone to a better understanding of cellular gene expression mechanisms. However, as experimental methods are often time-consuming and labor-intensive, developing computational methods is essential. In this study, we propose two machine learning methods for two problems: (1) identifying TFs and (2) identifying TFs that prefer binding to methylated DNA targets (TFPMs). For the TF identification problem, the proposed method uses the position-specific scoring matrix for data representation and a deep convolutional neural network for modeling. This method achieved 90.56% sensitivity, 83.96% specificity, and an area under the receiver operating characteristic curve (AUC) of 0.9596 on an independent test set. For the TFPM identification problem, we propose using the reduced g-gap dipeptide composition for data representation and the support vector machine algorithm for modeling. This method achieved 82.61% sensitivity, 64.86% specificity, and an AUC of 0.8486 on another independent test set. These results are higher than those of other studies on the same problems.
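The g-gap dipeptide composition named above can be sketched as the normalized frequency of residue pairs separated by g intervening positions. This minimal example works on the full 20-letter amino acid alphabet; the "reduced" variant would first map residues onto a smaller alphabet, and that mapping, like the value of g, is not given in the abstract, so both are left as assumptions here.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def g_gap_dipeptide_composition(sequence: str, g: int = 1,
                                alphabet: str = AMINO_ACIDS) -> dict[str, float]:
    """Frequency of each residue pair (i, i + g + 1) over all such pairs."""
    seq = sequence.upper()
    n_pairs = len(seq) - g - 1  # number of pairs with exactly g residues between them
    if n_pairs <= 0:
        raise ValueError("sequence is too short for the chosen gap")
    counts = {a + b: 0 for a, b in product(alphabet, repeat=2)}
    for i in range(n_pairs):
        pair = seq[i] + seq[i + g + 1]
        if pair in counts:  # pairs with residues outside the alphabet are skipped
            counts[pair] += 1
    return {pair: count / n_pairs for pair, count in counts.items()}


# Hypothetical toy sequence: with g=1 the pairs are A-D, C-A and D-C.
composition = g_gap_dipeptide_composition("ACDAC", g=1)
```

Passing a smaller `alphabet` (after recoding the sequence) shrinks the feature vector from 400 entries to the square of the reduced alphabet size.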
ABSTRACT
BACKGROUND: Enhancers are non-coding DNA fragments that are crucial in gene regulation (e.g. transcription and translation). Because enhancers show high locational variation and are freely scattered across 98% of non-encoding genomes, identifying them is more complicated than identifying other genetic factors. To address this biological issue, several in silico studies have sought to identify and classify enhancer sequences among a myriad of DNA sequences using computational advances. Although recent studies have reported improved performance, shortfalls in these learning models remain. To overcome the limitations of existing learning models, we introduce iEnhancer-ECNN, an efficient prediction framework that uses one-hot encoding and k-mers for data transformation and ensembles of convolutional neural networks for model construction, to identify enhancers and classify their strength. The benchmark dataset from Liu et al.'s study was used to develop and evaluate the ensemble models. A comparative analysis between iEnhancer-ECNN and existing state-of-the-art methods was performed to fairly assess the model performance. RESULTS: Our experimental results demonstrate that iEnhancer-ECNN performs better than other state-of-the-art methods using the same dataset. The accuracies of the ensemble model for enhancer identification (layer 1) and enhancer classification (layer 2) are 0.769 and 0.678, respectively. Compared to other related studies, improvements in the Area Under the Receiver Operating Characteristic Curve (AUC), sensitivity, and Matthews correlation coefficient (MCC) of our models are remarkable, especially for the layer 2 model, with gains of about 11.0%, 46.5%, and 65.0%, respectively. CONCLUSIONS: iEnhancer-ECNN outperforms previously proposed methods, with significant improvement in most of the evaluation metrics. The strong gains in MCC for both layers are highly meaningful in assuring the stability of our models.
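For readers less familiar with the metrics reported above, accuracy and the Matthews correlation coefficient (MCC) can both be computed from the four confusion-matrix counts of a binary classifier. The sketch below uses the standard definitions; the counts in the example are hypothetical and are not taken from the study.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical counts for a balanced test set of 200 sequences.
acc = accuracy(tp=80, tn=70, fp=30, fn=20)
score = mcc(tp=80, tn=70, fp=30, fn=20)
```

Unlike accuracy, MCC uses all four counts symmetrically, which is why it is often preferred as a summary of classifier quality across both classes.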
Subject(s)
Enhancer Elements, Genetic , Neural Networks, Computer , Sequence Analysis, DNA/methods
ABSTRACT
OBJECTIVE: Diabetes is responsible for considerable morbidity, healthcare utilisation and mortality in both developed and developing countries. Currently, methods of treating diabetes are inadequate and costly, so prevention becomes an important step in reducing the burden of diabetes and its complications. Electronic health records (EHRs) for an individual or a population have become important tools in understanding developing trends of diseases. Using EHRs to predict the onset of diabetes could improve the quality and efficiency of medical care. In this paper, we apply a wide and deep learning model that combines the strength of a generalised linear model with various features and a deep feed-forward neural network to improve the prediction of the onset of type 2 diabetes mellitus (T2DM). MATERIALS AND METHODS: The proposed method was implemented by training various models with a logistic loss function using stochastic gradient descent. We applied this model to public hospital record data provided by the Practice Fusion EHRs for the United States population. The dataset consists of de-identified electronic health records for 9948 patients, of which 1904 have been diagnosed with T2DM. Prediction of diabetes in 2012 was based on data obtained from previous years (2009-2011). Class imbalance was handled with the Synthetic Minority Oversampling Technique (SMOTE), applied to each cross-validation training fold, to analyse the performance when synthetic examples for the minority class are created. We used SMOTE of 150 and 300 percent, in which 300 percent means that three new synthetic instances are created for each minority class instance. This results in approximate diabetes:non-diabetes distributions in the training set of 1:2 and 1:1, respectively. RESULTS: Our final ensemble model not using SMOTE obtained an accuracy of 84.28%, area under the receiver operating characteristic curve (AUC) of 84.13%, sensitivity of 31.17% and specificity of 96.85%. Using SMOTE of 150 and 300 percent did not improve AUC (83.33% and 82.12%, respectively) but increased sensitivity (49.40% and 71.57%, respectively) with a moderate decrease in specificity (90.16% and 76.59%, respectively). DISCUSSION AND CONCLUSIONS: Our algorithm has further optimised the prediction of diabetes onset using a novel state-of-the-art machine learning algorithm: the wide and deep learning neural network architecture.
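The class ratios quoted in the methods can be checked with simple arithmetic. The sketch below applies the stated SMOTE percentages to the whole-dataset counts (1904 T2DM cases among 9948 patients); the study applied SMOTE within each cross-validation training fold, so using the full counts is an assumption that only approximates the per-fold proportions. The helper name `class_ratio_after_smote` is hypothetical.

```python
def class_ratio_after_smote(n_minority: int, n_majority: int,
                            smote_percent: int) -> tuple[int, float]:
    """Return the minority size after oversampling and the majority:minority ratio."""
    # SMOTE of p percent adds p/100 synthetic instances per original minority instance.
    synthetic = round(n_minority * smote_percent / 100)
    minority_total = n_minority + synthetic
    return minority_total, n_majority / minority_total


n_diabetes = 1904
n_non_diabetes = 9948 - n_diabetes  # 8044 patients without a T2DM diagnosis

total_150, ratio_150 = class_ratio_after_smote(n_diabetes, n_non_diabetes, 150)
total_300, ratio_300 = class_ratio_after_smote(n_diabetes, n_non_diabetes, 300)
```

With these counts, 150 percent gives 4760 minority instances (about 1:1.7, reported as roughly 1:2) and 300 percent gives 7616 (about 1:1.06, reported as roughly 1:1).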
Subject(s)
Deep Learning , Diabetes Mellitus, Type 2/diagnosis , Electronic Health Records , Humans , Machine Learning
ABSTRACT
BACKGROUND: Pseudouridine modification is the most common of the various kinds of RNA modification occurring in both prokaryotes and eukaryotes. This biochemical event has been shown to occur in multiple types of RNAs, including rRNA, mRNA, tRNA, and nuclear/nucleolar RNA. Hence, gaining a holistic understanding of pseudouridine modification can contribute to the development of drug discovery and gene therapies. Although some laboratory techniques have achieved moderately good outcomes in pseudouridine identification, they are costly and require skilled work experience. We propose iPseU-NCP, an efficient computational framework to predict pseudouridine sites using the Random Forest (RF) algorithm combined with nucleotide chemical properties (NCP) generated from RNA sequences. The benchmark dataset collected from Chen et al. (2016) was used to develop iPseU-NCP and fairly compare its performance with other methods. RESULTS: Under the same experimental settings, compared with three state-of-the-art methods, iPseU-CNN, PseUI, and iRNA-PseU, the Matthews correlation coefficient (MCC) of our model increased by about 20.0%, 55.0%, and 109.0% when tested on the H. sapiens (H_200) dataset and by about 6.5%, 35.0%, and 150.0% when tested on the S. cerevisiae (S_200) dataset, respectively. This significant growth in MCC is very important since it ensures the stability and performance of our model. On those two independent test datasets, our model also presented higher accuracy, with a success rate boosted by 7.0%, 13.0%, and 20.0% and by 2.0%, 9.5%, and 25.0% when compared to iPseU-CNN, PseUI, and iRNA-PseU, respectively. For the majority of other evaluation metrics, iPseU-NCP demonstrated superior performance as well. CONCLUSIONS: iPseU-NCP, combining the RF algorithm and NCP-encoded features, showed better performance than other existing state-of-the-art methods in the identification of pseudouridine sites. This also offers an optimistic outlook for addressing biological issues related to human diseases.
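A minimal sketch of how RNA sequences might be turned into nucleotide chemical property (NCP) features is shown below. It assumes the three-bit scheme commonly used in the literature (ring structure, functional group, hydrogen-bond strength); the abstract does not spell out the exact table used by iPseU-NCP, so the `NCP` mapping and the function name `encode_ncp` should be read as assumptions.

```python
# Assumed encoding: (purine ring, amino group, weak hydrogen bonding).
NCP = {
    "A": (1, 1, 1),
    "C": (0, 1, 0),
    "G": (1, 0, 0),
    "U": (0, 0, 1),
}


def encode_ncp(rna: str) -> list[int]:
    """Concatenate the three property bits of every nucleotide in the sequence."""
    rna = rna.upper().replace("T", "U")  # accept DNA-style input as a convenience
    try:
        return [bit for base in rna for bit in NCP[base]]
    except KeyError as exc:
        raise ValueError(f"unsupported nucleotide: {exc.args[0]!r}") from exc


# Hypothetical toy sequence, not a sample from the benchmark dataset.
encoded = encode_ncp("AUGC")
```

A sequence of length n becomes a vector of 3n binary values, which can be passed directly to a random forest classifier.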