ABSTRACT
A graph is a natural data structure for describing complex systems, consisting of a set of objects and the relationships between them. Many real-life biomedical problems can be modeled as graph analytics tasks. Machine learning, especially deep learning, has succeeded in a wide range of bioinformatics scenarios in which data are represented in the Euclidean domain. However, rich relational information between biological elements is retained in non-Euclidean biomedical graphs, which are not readily learnable by classic machine learning methods. Graph representation learning aims to embed a graph into a low-dimensional space while preserving graph topology and node properties. It bridges biomedical graphs and modern machine learning methods and has recently attracted widespread interest in both the machine learning and bioinformatics communities. In this work, we summarize the advances in graph representation learning and its representative applications in bioinformatics. To provide a comprehensive and structured analysis and perspective, we first categorize and analyze both graph embedding methods (homogeneous graph embedding, heterogeneous graph embedding, attribute graph embedding) and graph neural networks. Furthermore, we summarize their representative applications from the molecular level to the genomics, pharmaceutical and healthcare systems levels. Moreover, we provide open resource platforms and libraries for implementing these graph representation learning methods and discuss the challenges and opportunities of graph representation learning in bioinformatics. This work provides a comprehensive survey of emerging graph representation learning algorithms and their applications in bioinformatics. We anticipate that it will offer valuable insights for researchers contributing to graph representation learning and future-oriented bioinformatics studies.
Subject(s)
Computational Biology , Neural Networks, Computer , Algorithms , Computational Biology/methods , Knowledge , Machine Learning
ABSTRACT
BACKGROUND: Accurately identifying the risk level of drug combinations is of great significance in investigating the mechanisms of combination medication and adverse reactions. Most existing methods can only predict whether two drugs interact, but cannot directly determine the risk level of that interaction. METHODS: In this study, we propose a multi-class drug combination risk prediction model named AERGCN-DDI, which uses a relational graph convolutional network with a multi-head attention mechanism. Drug-drug interaction events with varying risk levels are modeled as a heterogeneous information graph. Attribute features of drug nodes and links are learned from compound chemical structure information. Finally, AERGCN-DDI predicts the drug combination risk level using a heterogeneous graph neural network and multi-head attention modules. RESULTS: To evaluate the effectiveness of the proposed method, five-fold cross-validation and an ablation study were conducted. Furthermore, we compared its predictive performance with baseline models and other state-of-the-art methods on two benchmark datasets. Empirical studies demonstrated the superior performance of AERGCN-DDI. CONCLUSIONS: AERGCN-DDI is a valuable tool for predicting the risk levels of drug combinations, thereby aiding clinical medication decision-making, mitigating severe drug side effects, and improving patient clinical prognosis.
Subject(s)
Neural Networks, Computer , Humans , Drug Interactions , Drug Combinations , Risk Assessment , Drug-Related Side Effects and Adverse Reactions , Reproducibility of Results , Computer Graphics
ABSTRACT
Effectively representing Medical Subject Headings (MeSH) terms, such as diseases and drugs, as discriminative vectors could greatly improve the performance of downstream computational prediction models. However, these terms are often abstract and difficult to quantify. In this paper, we converted the MeSH tree structure into a relationship network and applied several graph embedding algorithms to it to represent these terms. Specifically, the relationship network consists of nodes (MeSH headings) and edges (relationships) and can be constructed from the tree numbers. Then, five graph embedding algorithms (DeepWalk, LINE, SDNE, LAP and HOPE) were run on the relationship network to represent MeSH headings as vectors. To evaluate the performance of the proposed methods, we carried out node classification and relationship prediction tasks. The results show that MeSH headings characterized by graph embedding algorithms can not only serve as an independent representation, but can also be used as additional information to enhance the representation ability of other vectors. Thus, they can serve as input to any computational model related to diseases, drugs, microbes, etc. In addition, we hope our method will inspire researchers to study the representation of terms from this network perspective.
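The construction of a relationship network from tree numbers can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that the parent of a dot-separated tree number is obtained by dropping its last segment, and the function name `tree_numbers_to_edges` and the sample entries are hypothetical and not checked against a current MeSH release.

```python
def tree_numbers_to_edges(heading_by_tree_number):
    """Link each heading to the heading one level above it in the tree.

    `heading_by_tree_number` maps a dot-separated tree number to a heading
    name; the parent of "A.B.C" is assumed to be "A.B".
    """
    edges = set()
    for tree_number, heading in heading_by_tree_number.items():
        if "." not in tree_number:
            continue  # top-level entry, no parent to link to
        parent_number = tree_number.rsplit(".", 1)[0]
        parent = heading_by_tree_number.get(parent_number)
        if parent is not None and parent != heading:
            edges.add((parent, heading))
    return sorted(edges)


# Illustrative toy tree (assumed values, for demonstration only).
toy_tree = {
    "C04": "Neoplasms",
    "C04.588": "Neoplasms by Site",
    "C04.588.180": "Breast Neoplasms",
    "C04.588.894": "Thoracic Neoplasms",
}
edges = tree_numbers_to_edges(toy_tree)
for parent, child in edges:
    print(parent, "->", child)
```

The resulting edge list could then be handed to any of the graph embedding algorithms named in the abstract.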
Subject(s)
Algorithms , Medical Subject Headings , Computer Simulation , Drug Delivery Systems , Genetic Predisposition to Disease , Humans , MicroRNAs/genetics , Semantics
ABSTRACT
BACKGROUND: Long non-coding RNAs (lncRNAs) play a crucial role in diverse biological processes and have been confirmed to be involved in various diseases. However, the physiological roles and functions of lncRNAs remain largely uncharacterized. MicroRNAs (miRNAs), which are usually 20-24 nucleotides long, play several critical regulatory roles in cells. A lncRNA can act as a sponge that adsorbs miRNA and thereby indirectly regulates transcription and translation. Thus, the identification of lncRNA-miRNA associations is essential and valuable. RESULTS: In this work, we present DWLMI to infer potential associations between lncRNAs and miRNAs by representing them as vectors via a lncRNA-miRNA-disease-protein-drug graph. Specifically, DeepWalk is used to learn the behavior representation of vertices, while fingerprints, k-mers and MeSH descriptors are used to learn the attribute representation of vertices. By combining these two kinds of information, unknown lncRNA-miRNA associations can be predicted by a random forest classifier. Under five-fold cross-validation, the proposed DWLMI model obtained an average prediction accuracy of 95.22%, a sensitivity of 94.35% and an AUC of 98.56%. CONCLUSIONS: The experimental results demonstrate that DWLMI can effectively predict potential lncRNA-miRNA associated pairs, and the results can provide new insight for non-coding RNA researchers working on combining biological big data with deep learning.
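As a rough illustration of the k-mer attribute representation mentioned above, the sketch below turns an RNA sequence into a fixed-length vector of normalized k-mer frequencies. The choice of k = 3, the `ACGU` alphabet, the function name `kmer_frequencies` and the example sequence are assumptions made for this sketch; the abstract does not state the settings actually used.

```python
from itertools import product


def kmer_frequencies(sequence, k=3, alphabet="ACGU"):
    """Return normalized k-mer frequencies in a fixed lexicographic order."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        if window in counts:  # skip windows containing ambiguous symbols
            counts[window] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]


# Made-up sequence, for illustration only.
features = kmer_frequencies("AUGGCUAUGG", k=3)
print(len(features))
```

With a four-letter alphabet the vector has 4^k entries, so every sequence maps to the same dimensionality regardless of its length, which is what lets the vectors be fed to a classifier.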
Subject(s)
MicroRNAs , Pharmaceutical Preparations , RNA, Long Noncoding , Computational Biology/methods , MicroRNAs/genetics , RNA, Long Noncoding/genetics
ABSTRACT
BACKGROUND: lncRNAs play a critical role in numerous biological processes and life activities, especially in diseases. Because traditional wet-lab experiments for identifying undiscovered lncRNA-disease associations are limited by time consumption and labor cost, it is imperative to construct reliable and efficient computational models to complement them. Deep learning technologies have made impressive contributions in many areas, but their feasibility in bioinformatics has not been adequately verified. RESULTS: In this paper, a machine learning-based model called LDACE is proposed to predict potential lncRNA-disease associations by combining an Extreme Learning Machine (ELM) and a Convolutional Neural Network (CNN). Specifically, the representation vectors are constructed by integrating multiple types of biological information, including functional similarity and semantic similarity. Then, the CNN is applied to mine both local and global features. Finally, the ELM carries out the prediction task to detect potential lncRNA-disease associations. The proposed method achieved an Area Under the Receiver Operating Characteristic Curve of 0.9086 in leave-one-out cross-validation and 0.8994 in five-fold cross-validation. In addition, two kinds of case studies based on lung cancer and endometrial cancer indicate the robustness and efficiency of LDACE even in a real environment. CONCLUSIONS: The results demonstrate that the proposed model can serve as an auxiliary tool to guide and assist biomedical research, and that the close integration of deep learning and biological big data will provide the life sciences with novel insights.
Subject(s)
RNA, Long Noncoding , Computational Biology/methods , Machine Learning , Neural Networks, Computer , RNA, Long Noncoding/genetics , ROC Curve
ABSTRACT
BACKGROUND: Recent evidence suggests that human microorganisms participate in important biological activities in the human body. The dysfunction of host-microbiota interactions could lead to complex human disorders. Knowledge of host-microbiota interactions can provide valuable insights into the pathological mechanisms of diseases. However, it is time-consuming and costly to identify disorder-specific microbes from the biological "haystack" merely by routine wet-lab experiments. With the developments in next-generation sequencing and omics-based trials, it is imperative to develop computational models for predicting microbe-disease associations on a large scale. RESULTS: Based on the known microbe-disease associations derived from the Human Microbe-Disease Association Database (HMDAD), the proposed model shows reliable performance, with areas under the ROC curve (AUC) of 0.9456 and 0.8866 in leave-one-out cross-validation and five-fold cross-validation, respectively. In case studies of colorectal carcinoma, 80% of the top-20 predicted microbes have been experimentally confirmed in the published literature. CONCLUSION: Based on the assumption that functionally similar microbes tend to share similar interaction patterns with human diseases, we propose a group-based computational model of Bayesian disease-oriented ranking to prioritize the microbes most likely to be associated with various human diseases. Based on gene sequence information, two computational approaches (BLAST+ and MEGA 7) are used to measure microbe-microbe similarity from different perspectives. The disease-disease similarity is calculated by capturing the hierarchy information from the Medical Subject Headings (MeSH) data. The experimental results illustrate the accuracy and effectiveness of the proposed model. This work is expected to facilitate the characterization and identification of promising microbial biomarkers.
Subject(s)
Algorithms , Bacteria/classification , Computational Biology , RNA, Ribosomal, 16S , Bayes Theorem , Computational Biology/methods , Genes, rRNA , Humans , RNA, Ribosomal, 16S/genetics
ABSTRACT
BACKGROUND: Drug repositioning, which means finding new uses for existing drugs, can accelerate new drug research and development. Various computational methods have been presented to predict novel drug-disease associations for drug repositioning based on similarity measures among drugs and diseases. However, some known associations between drugs and diseases have not been utilized by previous studies. METHODS: In this work, we develop a deep gated recurrent units model to predict potential drug-disease interactions using comprehensive similarity measures and the Gaussian interaction profile kernel. More specifically, the similarity measure is used to derive discriminative features for drugs from their chemical fingerprints. Meanwhile, the Gaussian interaction profile kernel is employed to obtain efficient features of diseases based on known disease-disease associations. Then, a deep gated recurrent units model is developed to predict potential drug-disease interactions. RESULTS: The performance of the proposed model is evaluated on two benchmark datasets under tenfold cross-validation. To further verify its predictive ability, case studies predicting new potential indications of drugs were carried out. CONCLUSION: The experimental results show that the proposed model is a useful tool for predicting new indications for drugs or new treatments for diseases, and can accelerate drug repositioning and related drug research and discovery.
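For readers unfamiliar with the Gaussian interaction profile kernel, the following sketch shows the commonly used formulation: each entity is described by its binary row of known associations, and the similarity of two entities is exp(-gamma * squared distance between their rows), with gamma normalized by the mean squared row norm. The matrix below, the function name `gip_kernel` and the default bandwidth `gamma_prime=1.0` are assumptions for illustration; the abstract does not give the exact settings.

```python
import math


def gip_kernel(profiles, gamma_prime=1.0):
    """Gaussian interaction profile kernel over rows of a binary matrix."""
    n = len(profiles)
    # Normalize the bandwidth by the average squared norm of the profiles.
    mean_sq_norm = sum(sum(v * v for v in row) for row in profiles) / n
    if mean_sq_norm == 0:
        raise ValueError("all interaction profiles are empty")
    gamma = gamma_prime / mean_sq_norm
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(profiles[i], profiles[j]))
            kernel[i][j] = math.exp(-gamma * sq_dist)
    return kernel


# Hypothetical association matrix: one row per disease,
# one column per entity it may be associated with.
associations = [
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 1],
]
similarity = gip_kernel(associations)
```

Rows that share more known associations end up with a similarity closer to 1, so each row of `similarity` can be used as a feature vector for the corresponding disease.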
Subject(s)
Deep Learning , Drug Repositioning , Algorithms , Computational Biology , Computer Simulation
ABSTRACT
BACKGROUND: As an important non-coding RNA, microRNA (miRNA) plays a significant role in a series of life processes and is closely associated with a variety of human diseases. Hence, identification of potential miRNA-disease associations can contribute greatly to the research and treatment of human diseases. However, to our knowledge, many existing computational methods use only a single type of known association information between miRNAs and diseases to predict their potential associations, without considering their interactions or associations with other types of molecules. RESULTS: In this paper, we propose a network embedding-based method for predicting miRNA-disease associations that preserves both behavior and attribute information. First, a heterogeneous network is constructed by integrating known associations among miRNAs, proteins and diseases, and the network representation method Learning Graph Representations with Global Structural Information (GraRep) is used to learn the behavior information of miRNAs and diseases in the network. Then, the behavior information of miRNAs and diseases is combined with their attribute information to represent miRNA-disease association pairs. Finally, the prediction model is built on the Random Forest algorithm. Under five-fold cross-validation, the proposed NEMPD model obtained an average prediction accuracy of 85.41%, a sensitivity of 80.96% and an AUC of 91.58%. Furthermore, the performance of NEMPD is also validated by case studies. Among the top 50 predicted disease-related miRNAs, 48 (breast neoplasms), 47 (colon neoplasms) and 47 (lung neoplasms) were confirmed by two other databases. CONCLUSIONS: The proposed NEMPD model performs well in predicting potential associations between miRNAs and diseases and has great potential in the field of miRNA-disease association prediction.
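The accuracy and sensitivity figures reported above are standard confusion-matrix quantities. As a small reference sketch (the function name `binary_metrics` and the labels below are made up, not results from the paper), they can be computed from true and predicted binary labels like this:

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity and specificity for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        return num / den if den else 0.0  # avoid division by zero

    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": ratio(tp, tp + fn),  # true positive rate
        "specificity": ratio(tn, tn + fp),  # true negative rate
    }


# Made-up labels for one cross-validation fold, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Under k-fold cross-validation these values would be computed per fold and then averaged.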
Subject(s)
Breast Neoplasms/diagnosis , Colonic Neoplasms/diagnosis , Computational Biology/methods , Lung Neoplasms/diagnosis , MicroRNAs/metabolism , Algorithms , Area Under Curve , Breast Neoplasms/genetics , Colonic Neoplasms/genetics , Female , Humans , Lung Neoplasms/genetics , MicroRNAs/genetics , ROC Curve
ABSTRACT
BACKGROUND: The interactions between non-coding RNAs (ncRNAs) and proteins play an essential role in many biological processes. Several high-throughput experimental methods have been applied to detect ncRNA-protein interactions. However, these methods are time-consuming and expensive. Accurate and efficient computational methods can assist and accelerate the study of ncRNA-protein interactions. RESULTS: In this work, we develop a stacking ensemble computational framework, RPI-SE, for effectively predicting ncRNA-protein interactions. More specifically, to fully exploit protein and RNA sequence features, a Position Weight Matrix combined with Legendre Moments is applied to obtain protein evolutionary information. Meanwhile, a k-mer sparse matrix is employed to extract efficient features of ncRNA sequences. Finally, an ensemble learning framework integrating different types of base classifiers is developed to predict ncRNA-protein interactions using these discriminative features. The accuracy and robustness of RPI-SE were evaluated on three benchmark data sets under five-fold cross-validation and compared with other state-of-the-art methods. CONCLUSIONS: The results demonstrate that RPI-SE performs the ncRNA-protein interaction prediction task with high accuracy and robustness. We anticipate that this work can provide a computational prediction tool to advance biomedical research related to ncRNA-protein interactions.
Subject(s)
RNA, Untranslated/metabolism , RNA-Binding Proteins/metabolism , Sequence Analysis, Protein/methods , Sequence Analysis, RNA/methods , Position-Specific Scoring Matrices , RNA, Untranslated/chemistry , RNA-Binding Proteins/chemistry
ABSTRACT
BACKGROUND: The key to modern drug discovery is to find, identify and prepare drug molecular targets. However, owing to limits on throughput, precision and cost, traditional experimental methods are difficult to apply widely to infer potential Drug-Target Interactions (DTIs). Therefore, it is urgent to develop effective computational methods to validate the interactions between drugs and targets. METHODS: We developed a deep learning-based model for DTI prediction. Protein evolutionary features are extracted via the Position Specific Scoring Matrix (PSSM) and Legendre Moments (LM) and combined with drug molecular substructure fingerprints to form feature vectors of drug-target pairs. Then we used Sparse Principal Component Analysis (SPCA) to compress the features of drugs and proteins into a uniform vector space. Lastly, a deep long short-term memory network (DeepLSTM) was constructed to carry out prediction. RESULTS: The experimental results show a significant improvement in DTI prediction performance, with AUCs of 0.9951, 0.9705, 0.9951 and 0.9206, respectively, on four important classes of drug-target datasets. Further experiments preliminarily show that the proposed characterization scheme has a great advantage in feature expression and recognition. We also show that the proposed method can work well with small datasets. CONCLUSION: The results demonstrate that the proposed approach has a great advantage over state-of-the-art drug-target predictors. To the best of our knowledge, this study is the first to test the potential of a deep learning method with memory and Turing completeness in DTI prediction.
Subject(s)
Deep Learning , Memory, Short-Term/drug effects , Neural Networks, Computer , Pharmaceutical Preparations , Drug Development , Humans , Principal Component Analysis , Proteins
ABSTRACT
The interactions between ncRNAs and proteins are critical for regulating various cellular processes in organisms, such as the regulation of gene expression. However, because current experimental methods for identifying ncRNA-protein interactions are limited by financial and material costs, an innovative and practical approach with convincing prediction accuracy is needed. In this study, based on protein sequences from a biological perspective, we put forward an effective deep learning method, named BGFE, to predict ncRNA-protein interactions. Protein sequences are represented by a bi-gram probability feature extraction method applied to the Position Specific Scoring Matrix (PSSM), and ncRNA sequences are represented by k-mer sparse matrices. Furthermore, to extract hidden high-level feature information, a stacked auto-encoder network is employed with a stacked ensemble integration strategy. We evaluate the performance of the proposed method on three datasets under five-fold cross-validation after classifying the features with a random forest classifier. The experimental results clearly demonstrate the effectiveness and prediction accuracy of our approach. In general, the proposed method is helpful for predicting ncRNA-protein interactions and provides useful guidance for future biological research.
Subject(s)
Computational Biology/methods , Deep Learning , RNA, Untranslated/genetics , Software , Amino Acid Sequence , Databases, Protein , Position-Specific Scoring Matrices , Protein Binding , ROC Curve
ABSTRACT
Predicting self-interacting proteins (SIPs) is an important problem in bioinformatics. In SIPs, two or more copies of the same protein, expressed from one gene, can interact with each other. This plays a major role in the evolution of protein-protein interactions (PPIs) and cellular functions. Owing to the limitations of experimental identification of self-interacting proteins, it is increasingly important to develop a useful tool for predicting SIPs from protein sequence information. Therefore, we propose a novel prediction model called RP-FFT that merges the Random Projection (RP) model and the Fast Fourier Transform (FFT) for detecting SIPs. First, each protein sequence was transformed into a Position Specific Scoring Matrix (PSSM) using Position Specific Iterated BLAST (PSI-BLAST). Second, the features of protein sequences were extracted by applying the FFT to the PSSM. Lastly, we evaluated the performance of RP-FFT and compared the RP classifier with the state-of-the-art support vector machine (SVM) classifier and other existing methods on human and yeast datasets; under five-fold cross-validation, the RP-FFT model obtained high average accuracies of 96.28% and 91.87% on the human and yeast datasets, respectively. The experimental results demonstrate that our RP-FFT prediction model is reasonable and robust.
Subject(s)
Fourier Analysis , Sequence Analysis, Protein/methods , Support Vector Machine , Animals , Binding Sites , Humans , Protein Binding , Saccharomyces cerevisiae Proteins/chemistry
ABSTRACT
Protein-protein interactions (PPIs) play a very large part in most cellular processes. Although a great deal of research has been devoted to detecting PPIs through high-throughput technologies, these methods are clearly expensive and cumbersome. Compared with the traditional experimental methods, computational methods have attracted much attention because of their good performance in detecting PPIs. In our work, a novel computational method named as PCVM-LM is proposed which combines the probabilistic classification vector machine (PCVM) model and Legendre moments (LMs) to predict PPIs from amino acid sequences. The improvement mainly comes from using the LMs to extract discriminatory information embedded in the position-specific scoring matrix (PSSM) combined with the PCVM classifier to implement prediction. The proposed method was evaluated on Yeast and Helicobacter pylori datasets with five-fold cross-validation experiments. The experimental results show that the proposed method achieves high average accuracies of 96.37% and 93.48%, respectively, which are much better than other well-known methods. To further evaluate the proposed method, we also compared the proposed method with the state-of-the-art support vector machine (SVM) classifier and other existing methods on the same datasets. The comparison results clearly show that our method is better than the SVM-based method and other existing methods. The promising experimental results show the reliability and effectiveness of the proposed method, which can be a useful decision support tool for protein research.
Subject(s)
Bacterial Proteins/genetics , Computational Biology , Protein Interaction Mapping/methods , Protein Interaction Maps/genetics , Algorithms , Databases, Protein , Helicobacter pylori , Saccharomyces cerevisiae/genetics , Saccharomyces cerevisiae Proteins , Support Vector Machine
ABSTRACT
BACKGROUND: Drug-drug interaction events influence the effectiveness of drug combinations and can lead to unexpected side effects or exacerbate underlying diseases, jeopardizing patient prognosis. Most existing methods are restricted to predicting whether two drugs interact or the type of drug-drug interactions, while very few studies endeavor to predict the specific risk levels of side effects of drug combinations. METHODS: In this study, we propose MathEagle, a novel approach to predict accurate risk levels of drug combinations based on multi-head attention and heterogeneous attribute graph learning. Initially, we model drugs and three distinct risk levels between drugs as a heterogeneous information graph. Subsequently, behavioral and chemical structure features of drugs are utilized by message passing neural networks and graph embedding algorithms, respectively. Ultimately, MathEagle employs heterogeneous graph convolution and multi-head attention mechanisms to learn efficient latent representations of drug nodes and estimates the risk levels of pairwise drugs in an end-to-end manner. RESULTS: To assess the effectiveness and robustness of the model, five-fold cross-validation, ablation experiments, and case studies were conducted. MathEagle achieved an accuracy of 85.85 % and an AUC of 0.9701 on the drug risk level prediction task and is superior to all comparative models. The MathEagle predictor is freely accessible at http://120.77.11.78/MathEagle/. CONCLUSIONS: The experimental results indicate that MathEagle can function as an effective tool for predicting accurate risk of drug combinations, aiding in guiding clinical medication, and enhancing patient outcomes.
Subject(s)
Drug Interactions , Humans , Algorithms , Neural Networks, Computer , Drug-Related Side Effects and Adverse Reactions , Machine Learning
ABSTRACT
Predicting drug-drug interactions (DDIs) is a significant concern in the field of deep learning. It can effectively reduce potential adverse consequences and improve therapeutic safety. Graph neural network (GNN)-based models have made satisfactory progress in DDI event prediction. However, most existing models overlook crucial drug structure and interaction information, which is necessary for accurate DDI event prediction. To tackle this issue, we introduce a new method called MRGCDDI. This approach employs contrastive learning, but unlike conventional methods, it does not require data augmentation, thereby avoiding additional noise. MRGCDDI maintains the semantics of the graphical data during encoder perturbation through a simple yet effective contrastive learning approach, without the need for manual trial and error, tedious searching, or expensive domain knowledge to select enhancements. The approach presented in this study effectively integrates drug features extracted from drug molecular graphs and information from multi-relational drug-drug interaction (DDI) networks. Extensive experimental results demonstrate that MRGCDDI outperforms state-of-the-art methods on both datasets. Specifically, on Deng's dataset, MRGCDDI achieves an average increase of 4.33% in accuracy, 11.57% in Macro-F1, 10.97% in Macro-Recall, and 10.64% in Macro-Precision. Similarly, on Ryu's dataset, the model shows improvements with an average increase of 2.42% in accuracy, 3.86% in Macro-F1, 3.49% in Macro-Recall, and 2.75% in Macro-Precision. All the data and codes of this work are available at https://github.com/Nokeli/MRGCDDI.
ABSTRACT
A key aim of post-genomic biomedical research is to systematically understand molecules and their interactions in human cells. Multiple biomolecules coordinate to sustain life activities, and interactions between various biomolecules are interconnected. However, existing studies usually focus only on associations between two or a very limited number of molecule types. In this study, we propose a network representation learning-based computational framework, MAN-SDNE, to predict any intermolecular associations. More specifically, we constructed a large-scale molecular association network of multiple human biomolecules by integrating associations among long non-coding RNAs, microRNAs, proteins, drugs, and diseases, containing 6,528 molecular nodes and 105,546 associations of 9 kinds. The features of each node are then represented by its network proximity and attribute features. Furthermore, these features are used to train a Random Forest classifier to predict intermolecular associations. MAN-SDNE achieves remarkable performance, with an AUC of 0.9552 and an AUPR of 0.9338 under five-fold cross-validation. To demonstrate its ability to predict specific types of interactions, a case study predicting lncRNA-protein interactions using MAN-SDNE is also carried out. Experimental results demonstrate that this work offers systematic insight into the synergistic associations between molecules and complex diseases and provides a network-based computational tool to systematically explore intermolecular interactions.
Subject(s)
Models, Biological , Systems Biology/methods , Computer Simulation , Humans , MicroRNAs/genetics , MicroRNAs/metabolism , Pharmaceutical Preparations/metabolism , RNA, Long Noncoding/genetics , RNA, Long Noncoding/metabolism
ABSTRACT
Protein-protein interaction (PPI) is the basis of the molecular mechanisms of living cells. Although traditional experiments can detect PPIs accurately, they are often costly and time-consuming. As a result, computational methods have been used to predict PPIs to avoid these problems. The graph, an important and pervasive data carrier, is considered the most suitable structure for representing biomedical entities and relationships. Although graph embedding is the most popular approach for graph representation learning, it usually suffers from high computational and space costs, especially on large-scale graphs. Therefore, developing a framework that can accelerate graph embedding and improve the accuracy of embedding results is important for large-scale PPI prediction. In this paper, we propose a multi-level model, LPPI, to improve both the quality and speed of large-scale PPI prediction. First, basic protein information is collected as attributes, including positional gene sets, motif gene sets, and immunological signatures. Second, we construct a weighted graph by using protein attributes to calculate node similarity. Then GraphZoom is used to accelerate the embedding process by reducing the size of the weighted graph. Next, graph embedding methods are used to learn graph topology features from the reconstructed graph. Finally, a linear Logistic Regression (LR) model is used to predict the probability of interaction between two proteins. LPPI achieved high accuracies of 0.99997 and 0.9979 on the PPI network dataset and the GraphSAGE-PPI dataset, respectively. Our further results show that LPPI is promising for large-scale PPI prediction in both accuracy and efficiency, which may also benefit the detection of other large-scale biomolecular interactions.
ABSTRACT
Detecting whether a pair of biomolecules is associated is of great significance in molecular biology. Hence, computational methods are urgently needed to guide experimental practice. However, most previous prediction models, influenced by reductionism, focused on isolated research objects, which gives them inherent limitations. Inspired by holism, a machine-learning-based framework called MAN-node2vec is proposed to predict multi-type relationships in the molecular associations network (MAN). Specifically, we constructed a large-scale MAN composed of 1,023 miRNAs, 1,649 proteins, 769 long non-coding RNAs (lncRNAs), 1,025 drugs, and 2,062 diseases. Then, each biomolecule in the MAN is represented as a vector by its attributes, learned by k-mer and other methods, and its behavior, learned by node2vec. Finally, a random forest classifier is applied to carry out the relationship prediction task. The proposed model achieved reliable performance, with an area under the curve (AUC) of 0.9677 and an area under the precision-recall curve (AUPR) of 0.9562 under 5-fold cross-validation. Additional experiments also showed that the proposed global model is more competitive than the traditional local method. Together, these results provide systematic insight into the synergistic interactions between various molecules and diseases. We anticipate that this work can inspire and advance related systems biology and biomedical research.
ABSTRACT
Long noncoding RNAs (lncRNAs) are ubiquitous in organisms and play crucial roles in a variety of biological processes and complex diseases. Emerging evidence suggests that lncRNAs interact with corresponding proteins to perform their regulatory functions. Therefore, identifying interacting lncRNA-protein pairs is the first step in understanding the function and mechanism of lncRNAs. Since it is time-consuming and expensive to determine lncRNA-protein interactions by high-throughput experiments, more robust and accurate computational methods need to be developed. In this study, we developed a new method based on distributed representation learning of sequences for lncRNA-Protein Interaction Prediction, named LPI-Pred, which is inspired by the similarity between natural language and biological sequences. More specifically, lncRNA and protein sequences were divided into k-mer segments, which can be regarded as "words" in natural language processing. Then, we trained the RNA2vec and Pro2vec models using word2vec on human genome-wide lncRNA and protein sequences to learn distributed representations of RNAs and proteins. Next, the dimension of the complex features is reduced by feature selection based on the Gini impurity measure. Finally, these discriminative features are used to train a Random Forest classifier to predict lncRNA-protein interactions. Five-fold cross-validation was adopted to evaluate the performance of LPI-Pred on three benchmark datasets: RPI369, RPI488 and RPI2241. The results demonstrate that LPI-Pred can be a useful tool to provide reliable guidance for biological research.
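To make the Gini-based feature selection step more concrete, the sketch below computes the Gini impurity of a label set and the impurity decrease obtained by splitting a single feature at a threshold. This is the textbook definition rather than the authors' pipeline; the function names, the threshold of 0.5 and the toy data are assumptions for illustration.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum(p_c^2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def impurity_decrease(values, labels, threshold):
    """Weighted Gini decrease from splitting one feature at `threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted = (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)
    return gini_impurity(labels) - weighted


# Toy data (made up): one feature separates the classes, the other does not.
labels = [1, 1, 0, 0]
informative = [0.9, 0.8, 0.1, 0.2]
uninformative = [0.9, 0.1, 0.8, 0.2]
gain_informative = impurity_decrease(informative, labels, 0.5)
gain_uninformative = impurity_decrease(uninformative, labels, 0.5)
```

Features whose splits yield larger impurity decreases would be ranked higher and kept, while those near zero would be dropped.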
ABSTRACT
Analysis of miRNA-target mRNA interactions (MTIs) is of crucial significance in discovering new target candidates for miRNAs. However, biological experiments for identifying MTIs have a high false positive rate and are expensive, time-consuming, and arduous. It is therefore urgent to develop effective computational approaches to advance the investigation of miRNA-target mRNA relationships. In this study, a novel method called MIPDH is developed for miRNA-mRNA interaction prediction using DeepWalk on a heterogeneous network. More specifically, MIPDH extracts two kinds of features: a biological behavior feature, learned using a network embedding algorithm on a heterogeneous network constructed from 17 kinds of associations among drugs, diseases, and 6 kinds of biomolecules, and an attribute feature, learned using the k-mer method on the sequences of miRNAs and target mRNAs. Then, a random forest classifier is trained on the combination of the biological behavior feature and the attribute feature. In a 5-fold cross-validation experiment, MIPDH achieved an average accuracy, sensitivity, specificity and AUC of 75.85%, 74.37%, 77.33%, and 0.8044, respectively. To further evaluate the performance of MIPDH, other classifiers and feature descriptors were used for comparison, and MIPDH achieved better performance. Additionally, case studies on hsa-miR-106b-5p, hsa-let-7d-5p, and hsa-let-7e-5p were carried out. As a result, 14, 9, and 9 of the top 15 targets predicted to interact with these miRNAs were verified using the experimental literature or other databases. All these prediction results indicate that MIPDH is an effective method for predicting miRNA-target mRNA interactions.
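DeepWalk learns node vectors by treating truncated random walks over the network as sentences and training a skip-gram model on them. The sketch below covers only the walk-generation step, on a tiny made-up heterogeneous graph; the node names, walk length, number of walks per node and the function name `random_walks` are all assumptions for illustration, and the skip-gram training is omitted.

```python
import random


def random_walks(adjacency, walk_length=5, walks_per_node=2, seed=0):
    """Generate uniform truncated random walks, the corpus step of DeepWalk."""
    rng = random.Random(seed)
    walks = []
    nodes = sorted(adjacency)
    for _ in range(walks_per_node):
        rng.shuffle(nodes)  # visit start nodes in a fresh random order
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adjacency[walk[-1]]
                if not neighbors:
                    break  # dead end: stop this walk early
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks


# Hypothetical undirected toy network mixing several node types.
adjacency = {
    "miR-1": ["mRNA-A", "disease-X"],
    "mRNA-A": ["miR-1"],
    "disease-X": ["miR-1", "drug-Y"],
    "drug-Y": ["disease-X"],
}
walks = random_walks(adjacency)
for walk in walks[:2]:
    print(" -> ".join(walk))
```

Each walk can be passed as a sentence of node identifiers to a word-embedding trainer, which yields the behavior feature that is later concatenated with the k-mer attribute feature.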