ABSTRACT
Scaffold proteins drive liquid-liquid phase separation (LLPS) to form biomolecular condensates and organize various biochemical reactions in cells. Dysregulation of scaffolds can lead to aberrant condensate assembly and various complex diseases. However, bioinformatics predictors dedicated to scaffolds are still lacking, and their development suffers from an extreme imbalance between the limited number of experimentally identified scaffolds and the many unlabeled candidates. Here, using the joint distribution of hybrid multimodal features, we implemented a positive-unlabeled (PU) learning-based framework named PULPS that combines ProbTagging and penalty logistic regression (PLR) to profile the propensity of scaffolds. PULPS achieved the best AUC of 0.8353 and showed an area under the lift curve (AUL) of 0.8339 as an estimate of true performance. After reviewing recent experimentally verified scaffolds, we performed a partial recovery that increased AUL by 2.85%, from 0.8339 to 0.8577. PULPS showed a 45.7% improvement in AUL over PLR and an 8.2% improvement over other existing tools. Our study is the first to show that PU learning is more suitable for scaffold prediction, and it demonstrates the widespread existence of phase separation states. This profile also uncovered potential scaffolds that co-drive LLPS in the human proteome and generated candidates for further experiments. PULPS is free for academic research at http://pulps.zbiolab.cn.
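The AUL metric mentioned in this abstract can be illustrated with a small sketch. It assumes the lift curve plots the fraction of labeled positives captured (y) against the fraction of all samples ranked at or above the cut (x); the function name and the toy data are hypothetical, and this is not the PULPS implementation.

```python
def area_under_lift(labels, scores):
    """Trapezoidal area under the lift curve (assumed definition).

    x-axis: fraction of all samples ranked at or above the cut.
    y-axis: fraction of labeled positives captured at that cut.
    """
    n = len(labels)
    n_pos = sum(labels)
    if n == 0 or n_pos == 0:
        raise ValueError("need at least one sample and one labeled positive")
    # Rank samples from highest to lowest score (ties keep input order).
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    area, captured, prev_y = 0.0, 0, 0.0
    for i in order:
        captured += labels[i]
        y = captured / n_pos
        area += (prev_y + y) / 2.0 / n  # trapezoid of width 1/n
        prev_y = y
    return area


# Toy example: two labeled positives ranked above two unlabeled samples.
perfect = area_under_lift([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
```

Under this definition a perfect ranking does not reach 1.0: with 2 positives among 4 samples it gives 1 - 2/(2*4) = 0.75, which is why AUL values are best compared between models on the same data set.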
Subjects
Cell Physiological Phenomena, Proteome, Humans
ABSTRACT
MOTIVATION: Transcriptome data from gene knock-out (KO) experiments in mice provide crucial insights into the intricate interactions between genotype and phenotype. Differentially expressed gene (DEG) analysis and network propagation (NP) are well-established methods for analyzing transcriptome data. To determine genes related to phenotype changes from a KO experiment, we need to choose a cutoff value for the corresponding criterion based on the specific method. Using a rigorous cutoff value for DEG analysis and NP is likely to select mostly positive genes related to the phenotype, but many will be rejected as false negatives. On the other hand, using a loose cutoff value for either method is prone to include a number of genes that are not phenotype-related, which are false positives. Thus, the research problem at hand is how to deal with the trade-off between false negatives and false positives. RESULTS: We propose a novel framework called PONYTA for gene prioritization via positive-unlabeled (PU) learning on biological networks. Beginning with the selection of true phenotype-related genes using a rigorous cutoff value for DEG analysis and NP, we address the issue of handling false negatives by rescuing them through PU learning. Evaluations on transcriptome data from multiple studies show that our approach has superior gene prioritization ability compared to benchmark models. Therefore, PONYTA effectively prioritizes genes related to phenotypes derived from gene KO events and guides in vitro and in vivo gene KO experiments for increased efficiency. AVAILABILITY AND IMPLEMENTATION: The source code of PONYTA is available at https://github.com/Jun-Hyeong-Kim/PONYTA.
ABSTRACT
BACKGROUND: Compared to traditional supervised machine learning approaches employing fully labeled samples, positive-unlabeled (PU) learning techniques aim to classify "unlabeled" samples based on a smaller proportion of known positive examples. This more challenging modeling goal reflects many real-world scenarios in which negative examples are not available, posing direct challenges to defining prediction accuracy and robustness. While several studies have evaluated predictions learned from only definitive positive examples, few have investigated whether correct classification of a high proportion of known positive (KP) samples from among unlabeled samples can act as a surrogate to indicate model quality. RESULTS: In this study, we report a novel methodology combining multiple established PU learning-based strategies with permutation testing to evaluate the potential of KP samples to accurately classify unlabeled samples without using "ground truth" positive and negative labels for validation. Multivariate synthetic and real-world high-dimensional benchmark datasets were employed to demonstrate the suitability of the proposed pipeline to provide evidence of model robustness across varied underlying ground truth class label compositions among the unlabeled set and with different proportions of KP examples. Comparisons between model performance with actual and permuted labels could be used to distinguish reliable from unreliable models. CONCLUSIONS: As in fully supervised machine learning, permutation testing offers a means to set a baseline "no-information rate" benchmark in the context of semi-supervised PU learning inference tasks, providing a standard against which model performance can be compared.
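The comparison between actual and permuted labels described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the centroid-distance scorer is a hypothetical stand-in for a PU learner, and the function names, `top_k` recovery measure, and parameters are invented for the example.

```python
import random


def centroid_scores(X, labeled_idx):
    """Score every sample by negative squared distance to the labeled centroid."""
    dim = len(X[0])
    c = [sum(X[i][d] for i in labeled_idx) / len(labeled_idx) for d in range(dim)]
    return [-sum((x[d] - c[d]) ** 2 for d in range(dim)) for x in X]


def kp_recovery(X, train_pos, heldout_pos, top_k):
    """Fraction of held-out known positives ranked in the top_k of the unlabeled pool."""
    scores = centroid_scores(X, train_pos)
    train = set(train_pos)
    pool = [i for i in range(len(X)) if i not in train]
    top = sorted(pool, key=lambda i: scores[i], reverse=True)[:top_k]
    return len(set(top) & set(heldout_pos)) / len(heldout_pos)


def permutation_test(X, known_pos, n_heldout, top_k, n_perm=200, seed=0):
    """Compare KP recovery under actual labels with recovery under permuted labels."""
    kp = list(known_pos)
    if not 0 < n_heldout < len(kp):
        raise ValueError("n_heldout must leave at least one training positive")
    rng = random.Random(seed)
    rng.shuffle(kp)
    observed = kp_recovery(X, kp[n_heldout:], kp[:n_heldout], top_k)
    null = []
    for _ in range(n_perm):
        # Permuted labels: the "known positives" are drawn at random.
        fake = rng.sample(range(len(X)), len(kp))
        null.append(kp_recovery(X, fake[n_heldout:], fake[:n_heldout], top_k))
    # Empirical p-value with the usual +1 correction.
    p_value = (1 + sum(v >= observed for v in null)) / (n_perm + 1)
    return observed, p_value
```

The permuted runs play the role of the "no-information rate" baseline: a model whose recovery of held-out KP samples is no better than what random labels achieve would be flagged as unreliable.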
Subjects
Machine Learning, Supervised Machine Learning, Humans, Computational Biology/methods, Algorithms
ABSTRACT
Conventional supervised binary classification algorithms have been widely applied to address significant research questions using biological and biomedical data. This classification scheme requires two fully labeled classes of data (e.g. positive and negative samples) to train a classification model. However, in many bioinformatics applications, labeling data is laborious, and the negative samples might be mislabeled due to the limited sensitivity of the experimental equipment. The positive-unlabeled (PU) learning scheme was therefore proposed to enable the classifier to learn directly from limited positive samples and a large number of unlabeled samples (i.e. a mixture of positive and negative samples). To date, several PU learning algorithms have been developed to address various biological questions, such as sequence identification, functional site characterization and interaction prediction. In this paper, we revisit a collection of 29 state-of-the-art PU learning bioinformatic applications that address various biological questions. Various important aspects are extensively discussed, including PU learning methodology, biological application, classifier design and evaluation strategy. We also comment on the existing issues of PU learning and offer our perspectives for the future development of PU learning applications. We anticipate that our work will serve as an instrumental guideline for a better understanding of the PU learning framework in bioinformatics and for further developing next-generation PU learning frameworks for critical biological applications.
Subjects
Algorithms, Computational Biology, Computational Biology/methods, Supervised Machine Learning
ABSTRACT
The advent of full-length transcriptome sequencing technologies has accelerated the discovery of novel splicing isoforms. However, existing alternative splicing (AS) tools are either tailored for short-read RNA-Seq data or designed for human and animal studies. The disparities in AS patterns between plants and animals still pose a challenge to the reliable identification and functional exploration of novel isoforms in plants. Here, we developed integrated full-length alternative splicing analysis (iFLAS), a plant-optimized AS toolkit that introduced a semi-supervised machine learning method known as positive-unlabeled (PU) learning to accurately identify novel isoforms. iFLAS also enables the investigation of AS functions from various perspectives, such as differential AS, poly(A) tail length, and allele-specific AS (ASAS) analyses. By applying iFLAS to three full-length transcriptome sequencing datasets, we systematically identified and functionally characterized maize (Zea mays) AS patterns. We found intron retention not only introduces premature termination codons, resulting in lower expression levels of isoforms, but may also regulate the length of 3'UTR and poly(A) tail, thereby affecting the functional differentiation of isoforms. Moreover, we observed distinct ASAS patterns in two genes within heterosis offspring, highlighting their potential value in breeding. These results underscore the broad applicability of iFLAS in plant full-length transcriptome-based AS research.
Subjects
Alternative Splicing, Transcriptome, Humans, Alternative Splicing/genetics, Transcriptome/genetics, Zea mays/genetics, Gene Expression Profiling/methods, Plant Breeding, Protein Isoforms/genetics, Messenger RNA/genetics, RNA Sequence Analysis
ABSTRACT
In a standard binary supervised classification task, both negative and positive samples are required in the training dataset to construct a classification model. However, this condition is not met in certain applications where only one class of samples is obtainable. To overcome this problem, a different classification method, which learns from positive and unlabeled (PU) data, must be used. In this study, a novel method is presented: neighborhood-based positive unlabeled learning using decision tree (NPULUD). First, NPULUD uses the nearest neighborhood approach for the PU strategy and then employs a decision tree algorithm for the classification task by utilizing the entropy measure. Entropy plays a pivotal role in assessing the level of uncertainty in the training dataset as the decision tree is built for classification. We validated our method experimentally on 24 real-world datasets. The proposed method attained an average accuracy of 87.24%, while the traditional supervised learning approach obtained an average accuracy of 83.99% on the same datasets. Additionally, our method obtained a statistically notable improvement (7.74% on average) over state-of-the-art peers.
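The entropy measure that the decision tree relies on can be made concrete with a short sketch. It assumes the standard Shannon entropy and information-gain definitions; the abstract does not state the exact variant used by NPULUD, and the function names are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(labels, left, right):
    """Entropy reduction obtained by splitting `labels` into `left` and `right`."""
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted
```

A balanced two-class node has an entropy of 1 bit, and a split that separates the classes completely recovers all of it; a decision tree picks, at each node, the split with the largest such gain.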
ABSTRACT
Access to large-scale genomics datasets has increased the utility of hypothesis-free genome-wide analyses. However, gene signals are often insufficiently powered to reach experiment-wide significance, triggering a process of laborious triaging of genomic-association-study results. We introduce mantis-ml, a multi-dimensional, multi-step machine-learning framework that allows objective assessment of the biological relevance of genes to disease studies. Mantis-ml is an automated machine-learning framework that follows a multi-model approach of stochastic semi-supervised learning to rank disease-associated genes through iterative learning sessions on random balanced datasets across the protein-coding exome. When applied to a range of human diseases, including chronic kidney disease (CKD), epilepsy, and amyotrophic lateral sclerosis (ALS), mantis-ml achieved an average area under curve (AUC) prediction performance of 0.81-0.89. Critically, to prove its value as a tool that can be used to interpret exome-wide association studies, we overlapped mantis-ml predictions with data from published cohort-level association studies. We found a statistically significant enrichment of high mantis-ml predictions among the highest-ranked genes from hypothesis-free cohort-level statistics, indicating a substantial improvement over the performance of current state-of-the-art methods and pointing to the capture of true prioritization signals for disease-associated genes. Finally, we introduce a generic mantis-ml score (GMS) trained with over 1,200 features as a generic-disease-likelihood estimator, outperforming published gene-level scores. In addition to our tool, we provide a gene prioritization atlas that includes mantis-ml's predictions across ten disease areas and empowers researchers to interactively navigate through the gene-triaging framework. 
Mantis-ml is an intuitive tool that supports the objective triaging of large-scale genomic discovery studies and enhances our understanding of complex genotype-phenotype associations.
Subjects
Amyotrophic Lateral Sclerosis/genetics, Epilepsy/genetics, Genomics/methods, Chronic Renal Insufficiency/genetics, Supervised Machine Learning, Animals, Area Under Curve, Deep Learning, Animal Disease Models, Exome/genetics, Genetic Association Studies, Humans, Mice, Neural Networks (Computer), ROC Curve, Reproducibility of Results, Stochastic Processes
ABSTRACT
Accumulating research has revealed that Piwi-interacting RNAs (piRNAs) regulate the development of germ and stem cells and are closely associated with the progression of many diseases. As the number of detected piRNAs is increasing rapidly, it is important to computationally identify new piRNA-disease associations at low cost and provide candidate piRNA targets for disease treatment. However, it is challenging to learn effective association patterns from the positive piRNA-disease associations and the large number of unknown piRNA-disease pairs. In this study, we propose a computational predictor called iPiDi-PUL to identify piRNA-disease associations. iPiDi-PUL extracts the features of piRNA-disease associations from three biological data sources: piRNA sequence information, disease semantic terms and the available piRNA-disease association network. Principal component analysis (PCA) is then performed on these features to extract the key features. The training datasets are constructed from known positive associations and negative associations selected from the unknown pairs. Several random forest classifiers trained on these different training sets are merged to give the predictive results via an ensemble learning approach. Finally, the web server of iPiDi-PUL was established at http://bliulab.net/iPiDi-PUL to help researchers explore the diseases associated with newly discovered piRNAs.
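The idea of training several classifiers on different negative sets drawn from the unknown pairs and merging them can be sketched as below. This is only an illustration under assumptions: a nearest-centroid scorer stands in for the random forests, negatives are drawn uniformly at random (the abstract does not say how they are selected), and all names are hypothetical.

```python
import random


def centroid(rows):
    """Component-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pu_ensemble_scores(positives, unlabeled, n_models=25, seed=0):
    """Average score of each unlabeled sample over models trained on the
    positives versus a fresh random draw of pseudo-negatives."""
    rng = random.Random(seed)
    pos_c = centroid(positives)
    k = min(len(positives), len(unlabeled))
    totals = [0.0] * len(unlabeled)
    for _ in range(n_models):
        # Each training set pairs all positives with a different random
        # subset of unknown pairs treated as negatives.
        neg_c = centroid(rng.sample(unlabeled, k))
        for j, x in enumerate(unlabeled):
            # Stand-in base learner: the score is positive when x lies closer
            # to the positive centroid than to the pseudo-negative centroid.
            totals[j] += sq_dist(x, neg_c) - sq_dist(x, pos_c)
    return [t / n_models for t in totals]
```

Averaging over many draws dampens the effect of any hidden positives that happen to be sampled as negatives in a single training set.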
Subjects
Genetic Predisposition to Disease, Machine Learning, Small Interfering RNA/metabolism, Algorithms, Computational Biology/methods, Datasets as Topic, Humans, Principal Component Analysis
ABSTRACT
The insertion or deletion (indel) of amino acids has a variety of effects on protein function, ranging from disease-forming changes to gaining new functions. Despite their importance, indels have not been systematically characterized towards protein engineering or modification goals. In the present work, we focus on deletions composed of multiple contiguous amino acids (mAA-dels) and their effects on the protein (mutant) folding ability. Our analysis reveals that the mutant retains the native fold when the mAA-del obeys well-defined structural dynamics properties: localization in intrinsically flexible regions, showing low resistance to mechanical stress, and separation from allosteric signaling paths. Motivated by the possibility of distinguishing the features that underlie the adaptability of proteins to mAA-dels, and by the rapid evaluation of these features using elastic network models, we developed a positive-unlabeled learning-based classifier that can be adopted for protein design purposes. Trained on a consolidated set of features, including those reflecting the intrinsic dynamics of the regions where the mAA-dels occur, the new classifier yields a high recall of 84.3% for identifying mAA-dels that are stably tolerated by the protein. The comparative examination of the relative contribution of different features to the prediction reveals the dominant role of structural dynamics in enabling the adaptation of the mutant to mAA-del without disrupting the native fold.
Subjects
Amino Acids, Proteins, Amino Acids/genetics, Proteins/chemistry, INDEL Mutation, Protein Engineering
ABSTRACT
Identification of disease-associated circular RNAs (circRNAs) is of critical importance, especially with the dramatic increase in the amount of circRNAs. However, the availability of experimentally validated disease-associated circRNAs is limited, which restricts the development of effective computational methods. To our knowledge, systematic approaches for the prediction of disease-associated circRNAs are still lacking. In this study, we propose the use of deep forests combined with positive-unlabeled learning methods to predict potential disease-related circRNAs. In particular, a heterogeneous biological network involving 17 961 circRNAs, 469 miRNAs, and 248 diseases was constructed, and then 24 meta-path-based topological features were extracted. We applied 5-fold cross-validation on 15 disease data sets to benchmark the proposed approach and other competitive methods and used Recall@k and PRAUC@k to evaluate their performance. In general, our method performed better than the other methods. In addition, the performance of all methods improved with the accumulation of known positive labels. Our results provided a new framework to investigate the associations between circRNA and disease and might improve our understanding of its functions.
Subjects
Disease/genetics, Circular RNA/genetics, Computational Biology/methods, Humans
ABSTRACT
Drug-drug interactions (DDIs) are crucial for public health and patient safety and have aroused widespread concern in academia and industry. Existing computational DDI prediction methods fall mainly into four categories: literature extraction-based, similarity-based, matrix operations-based and network-based. A number of recent studies have revealed that integrating heterogeneous drug features is important for developing high-accuracy prediction models, and drugs that lack certain features can use other features to learn representations. However, this integration also brings new challenges such as incomplete data, non-linear relations and heterogeneous properties. In this paper, we propose a multi-modal deep auto-encoder-based drug representation learning method named DDI-MDAE to predict DDIs from large-scale, noisy and sparse data. Our method aims to learn unified drug representations from multiple drug feature networks simultaneously using multi-modal deep auto-encoders. Then, we apply four operators on the learned drug embeddings to represent drug-drug pairs and adopt a random forest classifier to train models for predicting DDIs. The experimental results demonstrate the effectiveness of our proposed method for DDI prediction and a significant improvement over other state-of-the-art benchmark methods. Moreover, we apply a specialized random forest classifier in the positive-unlabeled (PU) learning setting to enhance the prediction accuracy. Experimental results reveal that the model improved by PU learning outperforms the original DDI-MDAE by 7.1% and 6.2% in AUPR on 3-fold cross-validation (3-CV) and 5-fold cross-validation (5-CV), respectively. In F-measure, the improved model gains 10.4% and 8.4% over DDI-MDAE on 3-CV and 5-CV, respectively. The usefulness of DDI-MDAE is further demonstrated by case studies.
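The abstract does not name the four operators applied to the drug embeddings. As a purely illustrative sketch, the block below assumes the four element-wise operators commonly used to turn two node embeddings into a pair representation (average, Hadamard product, absolute difference, squared difference); the function name and dictionary keys are hypothetical.

```python
def pair_features(u, v):
    """Combine two drug embeddings into four symmetric pair representations.

    The choice of operators is an assumption; the source only says "four operators".
    """
    return {
        "average": [(a + b) / 2.0 for a, b in zip(u, v)],
        "hadamard": [a * b for a, b in zip(u, v)],
        "l1": [abs(a - b) for a, b in zip(u, v)],
        "l2": [(a - b) ** 2 for a, b in zip(u, v)],
    }


# Toy two-dimensional embeddings for a hypothetical drug pair.
features = pair_features([1.0, 2.0], [3.0, 0.0])
```

All four operators are symmetric in their arguments, so the pair (A, B) gets the same representation as (B, A), which suits an undirected interaction.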
Subjects
Computational Biology/methods, Deep Learning, Clinical Pharmacology/methods, Datasets as Topic, Drug Interactions, Combination Drug Therapy, Forecasting/methods, Humans
ABSTRACT
BACKGROUND: Circular RNA (circRNA) has been extensively identified in cells and tissues, and plays crucial roles in human diseases and biological processes. circRNA can act as a dynamic scaffolding molecule that modulates protein-protein interactions. The interactions between circRNA and RNA-binding proteins (RBPs) are also deemed to be an essential element underlying the functions of circRNA. Because biological experimental technologies are costly and labor-intensive, high-throughput experimental data have instead enabled the large-scale prediction and analysis of circRNA-RBP interactions. RESULTS: A computational framework is constructed by employing positive-unlabeled learning (P-U learning) to predict unknown circRNA-RBP interaction pairs with the kernel model MFNN (Matrix Factorization with Neural Networks). The neural network is employed to extract the latent factors of circRNA and RBP in the interaction matrix, and the P-U learning strategy is applied to alleviate the imbalanced characteristics of the data samples and predict unknown interaction pairs. For this purpose, the known circRNA-RBP interaction data samples are collected from the circRNAs in cancer cell lines database (CircRic), and the circRNA-RBP interaction matrix is constructed as the input of the model. The experimental results show that kernel MFNN outperforms the other deep kernel models. Interestingly, a deeper stack of hidden layers in the neural network does not necessarily yield better performance in our model. Finally, the unlabeled interactions are scored using P-U learning with the MFNN kernel, and the predicted interaction pairs are matched against the known interactions database. The results indicate that our method is an effective model for analyzing circRNA-RBP interactions. CONCLUSION: For the poorly studied circRNA-RBP interactions, we design a prediction framework based only on the interaction matrix by employing matrix factorization and a neural network. We demonstrate that MFNN achieves higher prediction accuracy and is an effective method.
Subjects
Neural Networks (Computer), Circular RNA/metabolism, RNA-Binding Proteins/metabolism, Area Under Curve, Tumor Cell Line, Factual Databases, Humans, Neoplasms/genetics, Neoplasms/pathology, ROC Curve
ABSTRACT
BACKGROUND: Personalized cancer vaccines are emerging as one of the most promising approaches to immunotherapy of advanced cancers. However, only a small proportion of the neoepitopes generated by somatic DNA mutations in cancer cells lead to tumor rejection. Since it is impractical to experimentally assess all candidate neoepitopes prior to vaccination, developing accurate methods for predicting tumor-rejection mediating neoepitopes (TRMNs) is critical for enabling routine clinical use of cancer vaccines. RESULTS: In this paper we introduce Positive-unlabeled Learning using AuTOml (PLATO), a general semi-supervised approach to improving accuracy of model-based classifiers. PLATO generates a set of high confidence positive calls by applying a stringent filter to model-based predictions, then rescores remaining candidates by using positive-unlabeled learning. To achieve robust performance on clinical samples with large patient-to-patient variation, PLATO further integrates AutoML hyper-parameter tuning, classification threshold selection based on spies, and support for bootstrapping. CONCLUSIONS: Experimental results on real datasets demonstrate that PLATO has improved performance compared to model-based approaches for two key steps in TRMN prediction, namely somatic variant calling from exome sequencing data and peptide identification from MS/MS data.
Subjects
Immunotherapy, Neoplasms/therapy, Peptides/analysis, Precision Medicine, Supervised Machine Learning, Epitopes/immunology, Epitopes/metabolism, Humans, Single Nucleotide Polymorphism, Tandem Mass Spectrometry, Exome Sequencing
ABSTRACT
BACKGROUND: As a new type of protein acylation modification, lysine glutarylation has been found to play a crucial role in metabolic processes and mitochondrial functions. To further explore the biological mechanisms and functions of glutarylation, it is significant to predict the potential glutarylation sites. In the existing glutarylation site predictors, experimentally verified glutarylation sites are treated as positive samples and non-verified lysine sites as the negative samples to train predictors. However, the non-verified lysine sites may contain some glutarylation sites which have not been experimentally identified yet. METHODS: In this study, experimentally verified glutarylation sites are treated as the positive samples, whereas the remaining non-verified lysine sites are treated as unlabeled samples. A bioinformatics tool named PUL-GLU was developed to identify glutarylation sites using a positive-unlabeled learning algorithm. RESULTS: Experimental results show that PUL-GLU significantly outperforms the current glutarylation site predictors. Therefore, PUL-GLU can be a powerful tool for accurate identification of protein glutarylation sites. CONCLUSION: A user-friendly web-server for PUL-GLU is available at http://bioinform.cn/pul_glu/.
ABSTRACT
BACKGROUND: Drug-drug interactions (DDIs) are a major concern in patients' medication. It is infeasible to identify all potential DDIs using experimental methods, which are time-consuming and expensive. Computational methods provide an effective strategy but face challenges due to the lack of experimentally verified negative samples. RESULTS: To address this problem, we propose a novel positive-unlabeled learning method named DDI-PULearn for large-scale drug-drug interaction predictions. DDI-PULearn first generates seeds of reliable negatives via OCSVM (one-class support vector machine) under a high-recall constraint and also via cosine-similarity-based KNN (k-nearest neighbors). Then, trained with all the labeled positives (i.e., the validated DDIs) and the generated seed negatives, DDI-PULearn employs an iterative SVM to identify the entire set of reliable negatives from the unlabeled samples (i.e., the unobserved DDIs). Following that, DDI-PULearn represents all the labeled positives and the identified negatives as vectors of abundant drug properties by a similarity-based method. Finally, DDI-PULearn transforms these vectors into a lower-dimensional space via PCA (principal component analysis) and uses the compressed vectors as input for binary classification. The performance of DDI-PULearn is evaluated on simulated prediction for 149,878 possible interactions between 548 drugs, in comparison with two baseline methods and five state-of-the-art methods. The experimental results show that the proposed representation method characterizes DDIs accurately. DDI-PULearn achieves superior performance owing to the identified reliable negatives, outperforming all other methods significantly. In addition, the predicted novel DDIs suggest that DDI-PULearn is capable of identifying novel DDIs. CONCLUSIONS: The results demonstrate that positive-unlabeled learning paves a new way to tackle the problem caused by the lack of experimentally verified negatives in the computational prediction of DDIs.
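The two-stage idea in this abstract (seed reliable negatives, then grow the set iteratively) can be sketched as follows. This is a simplified illustration under assumptions, not DDI-PULearn itself: only the cosine-similarity KNN seeding is shown, the exact seeding rule is a guess, a cosine-to-centroid rule stands in for the iterative SVM, and all names and parameters are hypothetical.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def centroid(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def seed_negatives(positives, unlabeled, k=3, n_seeds=2):
    """Indices of the unlabeled samples with the lowest mean cosine similarity
    to their k most similar labeled positives (assumed seeding rule)."""
    def knn_similarity(x):
        sims = sorted((cosine(x, p) for p in positives), reverse=True)[:k]
        return sum(sims) / len(sims)
    ranked = sorted(range(len(unlabeled)), key=lambda i: knn_similarity(unlabeled[i]))
    return ranked[:n_seeds]


def grow_negatives(positives, unlabeled, seeds, max_iter=10):
    """Iteratively add unlabeled samples that a stand-in classifier
    (cosine similarity to class centroids) assigns to the negative class."""
    negatives = set(seeds)
    pos_c = centroid(positives)
    for _ in range(max_iter):
        neg_c = centroid([unlabeled[i] for i in negatives])
        new = {i for i, x in enumerate(unlabeled)
               if i not in negatives and cosine(x, neg_c) > cosine(x, pos_c)}
        if not new:  # converged: no further reliable negatives were found
            break
        negatives |= new
    return sorted(negatives)
```

The loop stops as soon as an iteration adds nothing, so unlabeled samples that still resemble the labeled positives are left out of the negative set rather than forced into it.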
Subjects
Drug Interactions, Cluster Analysis, Humans, Support Vector Machine
ABSTRACT
BACKGROUND: Machine learning can effectively nominate novel genes for various research purposes in the laboratory. On a genome-wide scale, we implemented multiple databases and algorithms to predict and prioritize the human aging genes (PPHAGE). RESULTS: We fused data from 11 databases and used a Naïve Bayes classifier and positive-unlabeled learning (PUL) methods (NB, Spy, and Rocchio-SVM) to rank human genes with respect to their implication in aging. The PUL methods enabled us to identify a list of negative (non-aging) genes to use alongside the seed (known age-related) genes in the ranking process. Comparison of the PUL algorithms revealed that none of the methods for identifying a negative sample was advantageous over the others, and that their simultaneous use in the form of fusion was critical for obtaining optimal results (PPHAGE is publicly available at https://cbb.ut.ac.ir/pphage). CONCLUSION: We predict and prioritize over 3,000 candidate age-related genes in humans, based on significant ranking scores. The identified candidate genes are associated with pathways, ontologies, and diseases that are linked to aging, such as cancer and diabetes. Our data offer a platform for future experimental research on the genetic and biological aspects of aging. Additionally, we demonstrate that fusion of PUL methods and data sources can be successfully used for aging and disease candidate gene prioritization.
Subjects
Aging/genetics, Genomics/methods, Machine Learning, Data Analysis, Humans
ABSTRACT
MicroRNAs (miRNAs) play an important role in the prevention, diagnosis and treatment of complex human diseases. Predicting potential miRNA-disease associations could provide important prior information for medical researchers. Therefore, reliable computational models are expected to be an effective supplement for inferring associations between miRNAs and diseases. In this study, we developed a novel computational model named Negative Samples Extraction based MiRNA-Disease Association prediction (NSEMDA). NSEMDA filters reliable negative samples using two positive-unlabeled learning models, namely the Spy and Rocchio techniques, and calculates similarity weights for ambiguous samples. The positive samples, reliable negative samples and ambiguous samples with similarity weights are used to construct a Support Vector Machine-Similarity Weight model to predict miRNA-disease associations. NSEMDA improves the credibility of negative samples and reduces the impact of noisy samples by introducing ambiguous samples with similarity weights to train the prediction model. As a result, NSEMDA achieved an AUC of 0.8899 in global leave-one-out cross validation (LOOCV) and an AUC of 0.8353 under local LOOCV. In 100 repetitions of 5-fold cross validation, NSEMDA obtained an average AUC of 0.8878 with a standard deviation of 0.0014. These AUCs are higher than those of many classical models. We also carried out three kinds of case studies to evaluate the performance of NSEMDA. Among the top 50 potentially related miRNAs predicted by NSEMDA for esophageal neoplasms, lung neoplasms and hepatocellular carcinoma, 46, 50 and 45 miRNAs, respectively, were verified by experimental evidence to be associated with the investigated disease. Therefore, NSEMDA would be a reliable computational model for inferring miRNA-disease associations.
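The Spy technique named in this abstract can be illustrated with a minimal sketch: a few labeled positives ("spies") are hidden among the unlabeled samples, a classifier is trained on the remaining positives versus that mixed set, and unlabeled samples scoring below every spy are taken as reliable negatives. The centroid-based scorer, the strict minimum-spy threshold, and all names below are assumptions made for the example, not NSEMDA's implementation.

```python
import random


def spy_reliable_negatives(positives, unlabeled, spy_frac=0.2, seed=0):
    """Return indices of unlabeled samples that score below every spy."""
    if len(positives) < 2:
        raise ValueError("need at least two positives to hold out a spy")
    rng = random.Random(seed)
    n_spies = min(len(positives) - 1, max(1, int(len(positives) * spy_frac)))
    spy_idx = set(rng.sample(range(len(positives)), n_spies))
    spies = [p for i, p in enumerate(positives) if i in spy_idx]
    kept = [p for i, p in enumerate(positives) if i not in spy_idx]
    mixed = unlabeled + spies  # spies hide among the unlabeled samples

    def centroid(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    pos_c, mix_c = centroid(kept), centroid(mixed)

    def score(x):
        # Stand-in classifier: higher means closer to the remaining
        # positives than to the mixed (unlabeled + spies) set.
        return sq_dist(x, mix_c) - sq_dist(x, pos_c)

    # Spies are true positives, so anything scoring below all of them
    # is treated as a reliable negative.
    threshold = min(score(s) for s in spies)
    return [i for i, x in enumerate(unlabeled) if score(x) < threshold]
```

Because the spies are known positives, their scores show how a hidden positive in the unlabeled set would be scored, which is what makes the threshold meaningful without any labeled negatives.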
Subjects
Computational Biology, Gene Expression Profiling, Genetic Predisposition to Disease, MicroRNAs/genetics, Algorithms, Biomarkers, Computational Biology/methods, Genetic Databases, Gene Expression Profiling/methods, Humans, Neoplasms/etiology, Prognosis, ROC Curve, Reproducibility of Results, Transcriptome
ABSTRACT
MOTIVATION: Predicting drug-drug interactions (DDIs) has become a crucial step in the drug discovery and development process, owing to the rise in the number of drugs co-administered with other drugs. Consequently, computational methods for DDI prediction can greatly help reduce the costs of the in vitro experiments done during drug development. With many emerging data sources that describe the properties of, and relationships between, drugs and drug-related entities (gene, protein, disease, and side effects), an integrated approach that uses multiple data sources would be most effective. METHOD: We propose a semi-supervised learning framework that combines representation learning, positive-unlabeled (PU) learning and meta-learning to predict drug interactions. Information from multiple data sources is used to create feature networks, which are used to learn the meta-knowledge about the DDIs. Given that DDIs have only positive labeled data, a PU learning-based classifier is used to generate meta-knowledge from the feature networks. Finally, a meta-classifier is designed that combines the predicted probabilities of interaction from the learned meta-knowledge. RESULTS: Node2vec, a network representation learning method, and bagging SVM, a PU learning algorithm, are used in this work. Representation learning and PU learning improve the performance of the system by 22% and 12.7%, respectively. The meta-classifier performs better and predicts more reliable DDIs than the base classifiers.
Subjects
Drug Design, Drug Interactions, Supervised Machine Learning, Algorithms, Computational Biology, Drug Development, Drug-Related Side Effects and Adverse Reactions, Pharmaceutical Preparations, Predictive Value of Tests, Probability, Software, Pharmaceutical Technology
ABSTRACT
Predicting disease candidate genes from the human genome is a crucial part of current biomedical research. According to observations, diseases with the same phenotype have similar biological characteristics, and genes associated with the same diseases tend to share common functional properties. Therefore, by applying machine learning methods, new disease genes can be predicted based on previously known ones. In recent studies, semi-supervised learning methods called Positive-Unlabeled Learning (PU-Learning) have been used for predicting disease candidate genes. In this study, a novel method is introduced to predict disease candidate genes from gene expression profiles by learning hidden Markov models. To evaluate the proposed method, it is applied to a mixed set of 398 disease genes from three disease types and 12001 unlabeled genes. Compared to the other methods in the literature, the experimental results indicate a significant improvement in favor of the proposed method.
Subjects
Computational Biology/methods, Gene Expression Profiling, Genetic Predisposition to Disease, Markov Chains, Protein Interaction Mapping, Algorithms, Artificial Intelligence, Cluster Analysis, Humans, Statistical Models, Oligonucleotide Array Sequence Analysis, Phenotype, Probability, Software, Supervised Machine Learning, Transcriptome
ABSTRACT
Protein pupylation is a type of post-translational modification that plays a crucial role in the cellular function of bacteria. To gain better insight into the mechanisms underlying pupylation, an initial but important step is to identify pupylation sites. To date, several computational methods have been established for the prediction of pupylation sites; these usually construct the negative samples artificially from the verified pupylation proteins to train the classifiers. However, if this process is not done properly, it can dramatically affect the performance of the final predictor. In this work, unlike previous computational methods, we propose an enhanced positive-unlabeled learning algorithm (EPuL) for the pupylation site prediction problem, which uses only positive and unlabeled samples. First, we separate the training dataset into the positive dataset and the unlabeled dataset, which contains the remaining non-annotated lysine residues. Then, the EPuL algorithm is used to select an initial set of reliable negatives and to iteratively pick out the non-pupylation sites. In 10-fold cross-validation, the proposed method achieved an accuracy of 90.24%, an area under the curve (AUC) of 0.93 and an MCC of 0.81. A user-friendly web server for predicting pupylation sites was developed and is freely available at http://59.73.198.144:8080/EPuL.