ABSTRACT
Multifactorial diseases demand therapeutics that can modulate multiple targets for enhanced safety and efficacy, yet the clinical approval of multitarget drugs remains rare. The integration of machine learning (ML) and deep learning (DL) in drug discovery has revolutionized virtual screening. This study investigates the synergy between ML/DL methodologies, molecular representations, and data augmentation strategies. Notably, we found that SVM can match or even surpass the performance of state-of-the-art DL methods. However, conventional data augmentation often involves a trade-off between the true positive rate and false positive rate. To address this, we introduce Negative-Augmented PU-bagging (NAPU-bagging) SVM, a novel semi-supervised learning framework. By leveraging ensemble SVM classifiers trained on resampled bags containing positive, negative, and unlabeled data, our approach is capable of managing false positive rates while maintaining high recall rates. We applied this method to the identification of multitarget-directed ligands (MTDLs), where high recall rates are critical for compiling a list of interaction candidate compounds. Case studies demonstrate that NAPU-bagging SVM can identify structurally novel MTDL hits for ALK-EGFR with favorable docking scores and binding modes, as well as pan-agonists for dopamine receptors. The NAPU-bagging SVM methodology should serve as a promising avenue to virtual screening, especially for the discovery of MTDLs.
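The bagging idea described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not give the bag composition, the SVM settings, or the aggregation rule. The sketch assumes each bag holds all positives, all known negatives, and a bootstrap sample of unlabeled points treated as provisional negatives, and it swaps the SVM for a nearest-centroid classifier so that it runs with the standard library only. The function name `napu_bagging_scores` and the toy data are hypothetical.

```python
import random
from statistics import fmean


def centroid(points):
    """Component-wise mean of equal-length feature vectors."""
    return [fmean(col) for col in zip(*points)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def napu_bagging_scores(positives, negatives, unlabeled, n_bags=50, seed=0):
    """Out-of-bag fraction of bags that vote 'positive' for each unlabeled point.

    Assumption: a nearest-centroid rule stands in for the SVM base learner.
    """
    rng = random.Random(seed)
    bag_size = len(positives)  # assumed: as many unlabeled draws as positives
    votes = [0] * len(unlabeled)
    counts = [0] * len(unlabeled)
    pos_c = centroid(positives)
    for _ in range(n_bags):
        # Bootstrap a set of unlabeled points and treat them as negatives.
        in_bag = {rng.randrange(len(unlabeled)) for _ in range(bag_size)}
        neg_c = centroid(negatives + [unlabeled[i] for i in in_bag])
        # Score only the unlabeled points that were left out of this bag.
        for i, x in enumerate(unlabeled):
            if i in in_bag:
                continue
            counts[i] += 1
            votes[i] += sq_dist(x, pos_c) < sq_dist(x, neg_c)
    return [v / c if c else None for v, c in zip(votes, counts)]


# Hypothetical toy data: positives near (1, 1), known negatives near (-1, -1).
positives = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.2)]
negatives = [(-1.0, -1.0), (-1.2, -0.8), (-0.9, -1.1)]
unlabeled = [(0.9, 1.1), (-1.0, -0.9), (1.2, 0.8), (-0.8, -1.1)]
scores = napu_bagging_scores(positives, negatives, unlabeled)
```

Unlabeled points that most bags classify as positive receive scores near 1. Keeping the known negatives in every bag is what distinguishes this from plain PU bagging, where each bag contains only positives and sampled unlabeled points.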
Subjects
Drug Discovery , Drug Discovery/methods , Humans , Molecular Docking Simulation , Ligands , Support Vector Machine , Deep Learning , Supervised Machine Learning , Machine Learning
ABSTRACT
BACKGROUND: Investigating and understanding drug-drug interactions (DDIs) is important for improving the effectiveness of clinical care. DDIs can occur when two or more drugs are administered together. Experimental DDI detection methods are costly and time-consuming. Hence, there is great interest in developing efficient and useful computational methods for inferring potential DDIs. Standard binary classifiers require both positives and negatives for training. In a DDI context, drug pairs that are known to interact can serve as positives for predictive methods. However, negatives, that is, drug pairs confirmed to have no interaction, are scarce. To address this lack of negatives, we introduce a Positive-Unlabeled Learning method for inferring potential DDIs. RESULTS: The proposed method consists of three steps: i) applying Growing Self Organizing Maps to infer negatives from the unlabeled dataset; ii) using a pairwise similarity function to quantify the overlap between individual features of drugs; and iii) using a support vector machine classifier to infer DDIs. We obtained 6036 DDIs from the DrugBank database. Using the proposed approach, we inferred 589 drug pairs that are likely not to interact with each other; these drug pairs are used as representative data for the negative class in binary classification for DDI prediction. Moreover, we classify the predicted DDIs as Cytochrome P450 (CYP) enzyme-Dependent or CYP-Independent interactions based on their locations on the Growing Self Organizing Map, because these enzymes are particularly important in clinically significant interaction effects. Further, we provide a case study on three predicted CYP-Dependent DDIs to evaluate the clinical relevance of this study. CONCLUSION: Depending on the choice of similarity function, our proposed approach showed an absolute improvement in F1-score of 14% and 38% in comparison to the method that randomly selects unlabeled data points as likely negatives. We inferred 5300 possible CYP-Dependent DDIs and 592 CYP-Independent DDIs with the highest posterior probabilities. Our discoveries can be used to improve clinical care as well as the research outcomes of drug development.
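Step ii) of the method, the pairwise similarity function, can be illustrated with a short sketch. The abstract does not name the similarity functions that were used, so the Jaccard index here is an assumption. The feature names (`targets`, `enzymes`) and the two drug records are hypothetical.

```python
def jaccard(a, b):
    """Jaccard overlap |A ∩ B| / |A ∪ B|; defined as 0.0 when both sets are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pair_features(drug_a, drug_b, feature_names):
    """One similarity value per feature type, usable as a classifier input vector."""
    return [jaccard(drug_a[name], drug_b[name]) for name in feature_names]


# Hypothetical drug records; each feature type is a set of identifiers.
drug_x = {"targets": {"T1", "T2"}, "enzymes": {"CYP3A4"}}
drug_y = {"targets": {"T2", "T3"}, "enzymes": {"CYP3A4", "CYP2D6"}}
vec = pair_features(drug_x, drug_y, ["targets", "enzymes"])
```

Each drug pair becomes a fixed-length vector regardless of how many identifiers each drug carries, which is what a support vector machine needs as input.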
Subjects
Drug Interactions/physiology , Pharmaceutical Preparations/metabolism , Support Vector Machine , Cluster Analysis , Cytochrome P-450 Enzyme System/metabolism , Databases, Factual , Humans , Pharmaceutical Preparations/chemistry , Protein Isoforms/genetics , Protein Isoforms/metabolism
ABSTRACT
Epilepsy is a chronic disorder caused by excessive electrical discharges. Currently, clinical experts identify the seizure onset zone (SOZ) channel by visually inspecting long-term intracranial electroencephalogram (iEEG) recordings, which is a time-consuming, difficult, and experience-dependent task. Therefore, there is a need for high-accuracy diagnostic aids to reduce the workload of clinical experts. In this article, we propose a method in which the iEEG is split into 20-s segments and, for each patient, clinical experts label part of the data, which is used to train a model that classifies the remaining iEEG data. In recent years, machine learning methods have been successfully applied to solve some medical problems. Filtering, entropy, and the short-time Fourier transform (STFT) are used for extracting features. We compare them to the wavelet transform (WT), empirical mode decomposition (EMD), and other traditional methods with the aim of obtaining the best possible discriminating features. Finally, we look for their medical interpretation, which is important for clinical experts. We achieve high-performance results for SOZ and non-SOZ data classification by using the labeled iEEG data with a support vector machine (SVM), a fully connected neural network (FCNN), and a convolutional neural network (CNN) as classification models. In addition, we introduce positive unlabeled (PU) learning to further reduce the workload of clinical experts. By using PU learning, we can learn a binary classifier with a small amount of labeled data and a large amount of unlabeled data. This can greatly reduce the amount and difficulty of annotation work by clinical experts. Altogether, we show that using 105 minutes of labeled data we achieve a classification result of 91.46% on average for multiple patients.
ABSTRACT
Long non-coding RNAs (lncRNAs) have been shown to play a regulatory role in various processes of human diseases. However, lncRNA experiments are inefficient, time-consuming and highly subjective, so that the number of experimentally verified associations between lncRNA and diseases is limited. In the era of big data, numerous machine learning methods have been proposed to predict the potential association between lncRNA and diseases, but the characteristics of the associated data were seldom explored. In these methods, negative samples are randomly selected for model training and the model is prone to learn the potential positive association error, thus affecting the prediction accuracy. In this paper, we proposed a cyclic optimization model of predicting lncRNA-disease associations (COPTLDA in short). In COPTLDA, the two-step training strategy is adopted to search for the samples with the greater probability of being negative examples from unlabeled samples and the determined samples are treated as negative samples, which are combined together with known positive samples to train the model. The searching and training steps are repeated until the best model is obtained as the final prediction model. In order to evaluate the performance of the model, 30% of the known positive samples are used to calculate the model accuracy and 10% of positive samples are used to calculate the recall rate of the model. The sampling strategy used in this paper can improve the accuracy and the AUC value reaches 0.9348. The results of case studies showed that the model could predict the potential associations between lncRNA and malignant tumors such as colorectal cancer, gastric cancer, and breast cancer. The predicted top 20 associated lncRNAs included 10 colorectal cancer lncRNAs, 2 gastric cancer lncRNAs, and 8 breast cancer lncRNAs.
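The search-then-train loop described in this abstract follows the general two-step pattern of PU learning, which the sketch below illustrates. It is not the COPTLDA model: the abstract specifies neither the underlying predictor nor the stopping criterion. The sketch assumes a nearest-centroid scorer, a fixed fraction of unlabeled samples kept as negatives, and stopping once the selected set no longer changes. The function name `two_step_negatives` and the toy data are hypothetical.

```python
from statistics import fmean


def centroid(points):
    """Component-wise mean of equal-length feature vectors."""
    return [fmean(col) for col in zip(*points)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def two_step_negatives(positives, unlabeled, keep_frac=0.5, max_rounds=10):
    """Return indices of unlabeled samples selected as likely negatives."""
    pos_c = centroid(positives)
    k = max(1, int(keep_frac * len(unlabeled)))
    indices = range(len(unlabeled))

    # Search step (initial): take the samples farthest from the positives.
    ranked = sorted(indices, key=lambda i: sq_dist(unlabeled[i], pos_c), reverse=True)
    neg_idx = set(ranked[:k])

    for _ in range(max_rounds):
        # Training step: refit the negative centroid on the current selection.
        neg_c = centroid([unlabeled[i] for i in neg_idx])

        def negativity(i):
            # Larger value means closer to the negatives than to the positives.
            return sq_dist(unlabeled[i], pos_c) - sq_dist(unlabeled[i], neg_c)

        # Search step (repeat): reselect the k most negative-looking samples.
        new_idx = set(sorted(indices, key=negativity, reverse=True)[:k])
        if new_idx == neg_idx:  # assumed stopping rule: selection is stable
            break
        neg_idx = new_idx
    return sorted(neg_idx)


# Hypothetical toy data: two unlabeled samples resemble the positives.
positives = [(1.0, 1.0), (1.2, 0.9)]
unlabeled = [(1.1, 1.0), (-1.0, -1.0), (0.9, 1.1), (-1.2, -0.8)]
selected = two_step_negatives(positives, unlabeled)
```

On this toy data the loop selects the two samples near (-1, -1) and leaves the positive-looking ones unlabeled, which is the behavior that random negative sampling cannot guarantee.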
ABSTRACT
Peptide toxins generally have extreme pharmacological activities and provide a rich source for the discovery of drug leads. However, determining the optimal activity of a new peptide can be a long and expensive process. In this study, peptide toxins were retrieved from Uniprot; three positive-unlabeled (PU) learning schemes, adaptive basis classifier, two-step method, and PU bagging were adopted to develop models for predicting the biological function of new peptide toxins. All three schemes were embedded with 14 machine learning classifiers. The prediction results of the adaptive base classifier and the two-step method were highly consistent. The models with top comprehensive performances were further optimized by feature selection and hyperparameter tuning, and the models were validated by making predictions for 61 three-finger toxins or the external HemoPI dataset. Biological functions that can be identified by these models include cardiotoxicity, vasoactivity, lipid binding, hemolysis, neurotoxicity, postsynaptic neurotoxicity, hypotension, and cytolysis, with relatively weak predictions for hemostasis and presynaptic neurotoxicity. These models are discovery-prediction tools for active peptide toxins and are expected to accelerate the development of peptide toxins as drugs.
Subjects
Toxins, Biological , Humans , Peptides/toxicity , Hemolysis , Cardiotoxicity , Cell Death
ABSTRACT
Creating a large-scale dataset of abnormality annotations on medical images is a labor-intensive and costly task. Leveraging weak supervision from readily available data such as radiology reports can compensate for the lack of large-scale data for anomaly detection methods. However, most current methods use only image-level pathological observations, failing to utilize the relevant anatomy mentions in reports. Furthermore, Natural Language Processing (NLP)-mined weak labels are noisy due to label sparsity and linguistic ambiguity. We propose an Anatomy-Guided chest X-ray Network (AGXNet) to address these issues of weak annotation. Our framework consists of a cascade of two networks, one responsible for identifying anatomical abnormalities and the second responsible for pathological observations. The critical component in our framework is an anatomy-guided attention module that aids the downstream observation network in focusing on the relevant anatomical regions generated by the anatomy network. We use Positive Unlabeled (PU) learning to account for the fact that lack of mention does not necessarily mean a negative label. Our quantitative and qualitative results on the MIMIC-CXR dataset demonstrate the effectiveness of AGXNet in disease and anatomical abnormality localization. Experiments on the NIH Chest X-ray dataset show that the learned feature representations are transferable and can achieve state-of-the-art performance in disease classification and competitive disease localization results. Our code is available at https://github.com/batmanlab/AGXNet.
ABSTRACT
INTRODUCTION: The prevalence of overweight and obesity has been increasing in recent decades, and with it the prevalence of associated diseases and health conditions such as diabetes, hypertension, and cardiovascular disease. However, hospital databases usually record neither such conditions in adults nor the anthropometric measures that would facilitate their identification. METHODS: We implemented a machine learning method based on PU (Positive and Unlabelled) Learning to identify obese patients without a diagnosis code of obesity in their health records. RESULTS: The algorithm showed high sensitivity (98%) and predicted that around 18% of the patients without a diagnosis were obese. This result is consistent with the WHO report.
Subjects
Electronic Health Records , Machine Learning , Obesity , Diabetes Mellitus , Humans
ABSTRACT
Complex diseases seriously affect people's physical and mental health. The discovery of disease-causing genes has become a target of research. With the emergence of bioinformatics and the rapid development of biotechnology, researchers have proposed many gene prioritization algorithms that use a large amount of biological data to mine pathogenic genes, overcoming the long experimental periods and high costs inherent in traditional biomedical methods. However, because the currently known gene-disease association matrix is still very sparse and lacks evidence that genes and diseases are unrelated, there are limits to the predictive performance of gene prioritization algorithms. Based on the hypothesis that functionally related gene mutations may lead to similar disease phenotypes, this paper proposes a PU induction matrix completion algorithm based on heterogeneous information fusion (PUIMCHIF) to predict candidate genes involved in the pathogenicity of human diseases. On the one hand, PUIMCHIF uses different compact feature learning methods to extract features of genes and diseases from multiple data sources, compensating for the sparsity of the data. On the other hand, based on the prior knowledge that most of the unknown gene-disease associations are unrelated, we use the PU-Learning strategy to treat the unknown unlabeled data as negative examples for biased learning. The experimental results of the PUIMCHIF algorithm on the three indexes of precision, recall, and mean percentile ranking (MPR) were significantly better than those of other algorithms. In the top 100 global prediction analysis of multiple genes and multiple diseases, the probability of recovering true gene associations using PUIMCHIF reached 50% and the MPR value was 10.94%. PUIMCHIF ranks true disease genes higher than other methods, such as IMC and CATAPULT.
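The mean percentile ranking (MPR) reported above can be made concrete with a short sketch. The abstract does not define the metric, so this assumes a common convention: a held-out true gene at the top of a disease's ranked list scores 0%, one at the bottom scores 100%, and lower averages are better. The function name `mean_percentile_rank` and the gene labels are hypothetical.

```python
def mean_percentile_rank(cases):
    """Average percentile position of each held-out true gene.

    cases: list of (ranked_genes, true_gene) pairs, where ranked_genes is
    ordered from most to least likely. Assumed convention: 0% is the top
    of the list and 100% is the bottom.
    """
    percentiles = []
    for ranked_genes, true_gene in cases:
        position = ranked_genes.index(true_gene)  # 0-based rank
        span = len(ranked_genes) - 1
        percentiles.append(100.0 * position / span if span else 0.0)
    return sum(percentiles) / len(percentiles)


# Hypothetical example: two diseases, each ranking the same five genes.
ranking = ["g1", "g2", "g3", "g4", "g5"]
mpr = mean_percentile_rank([(ranking, "g1"), (ranking, "g3")])
```

Here the first true gene sits at the top (0%) and the second in the middle (50%), giving an MPR of 25%. Under this convention a random ranking would be expected to score around 50%.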
ABSTRACT
In various real-world problems, we are presented with classification problems with positive and unlabeled data, referred to as presence-only responses. In this article we study variable selection in the context of presence only responses where the number of features or covariates p is large. The combination of presence-only responses and high dimensionality presents both statistical and computational challenges. In this article, we develop the PUlasso algorithm for variable selection and classification with positive and unlabeled responses. Our algorithm involves using the majorization-minimization framework which is a generalization of the well-known expectation-maximization (EM) algorithm. In particular to make our algorithm scalable, we provide two computational speed-ups to the standard EM algorithm. We provide a theoretical guarantee where we first show that our algorithm converges to a stationary point, and then prove that any stationary point within a local neighborhood of the true parameter achieves the minimax optimal mean-squared error under both strict sparsity and group sparsity assumptions. We also demonstrate through simulations that our algorithm outperforms state-of-the-art algorithms in the moderate p settings in terms of classification performance. Finally, we demonstrate that our PUlasso algorithm performs well on a biochemistry example. Supplementary materials for this article are available online.