Results 1 - 20 of 28
1.
Brief Bioinform ; 25(3)2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38622356

ABSTRACT

Identifying disease-associated microRNAs (miRNAs) could help elucidate the underlying mechanisms of diseases and thereby promote the development of new medicines. Recently, network-based approaches have been widely proposed for inferring potential associations between miRNAs and diseases. However, these approaches ignore the importance of different relations in meta-paths when learning the embeddings of miRNAs and diseases. Besides, they pay little attention to screening out reliable negative samples, which is crucial for improving prediction accuracy. In this study, we propose a novel approach named MGCNSS that combines multi-layer graph convolution with a high-quality negative sample selection strategy. Specifically, MGCNSS first constructs a comprehensive heterogeneous network by integrating miRNA and disease similarity networks with their known association relationships. Then, we employ multi-layer graph convolution to automatically capture meta-path relations of different lengths in the heterogeneous network and learn discriminative representations of miRNAs and diseases. After that, MGCNSS establishes a highly reliable negative sample set from the unlabeled sample set with the negative distance-based sample selection strategy. Finally, we train MGCNSS in an unsupervised learning manner and predict the potential associations between miRNAs and diseases. The experimental results demonstrate that MGCNSS outperforms all baseline methods on both balanced and imbalanced datasets. More importantly, we conduct case studies on colon neoplasms and esophageal neoplasms, further confirming the ability of MGCNSS to detect potential candidate miRNAs. The source code is publicly available on GitHub: https://github.com/15136943622/MGCNSS/tree/master.


Subjects
Colonic Neoplasms; MicroRNAs; Humans; MicroRNAs/genetics; Algorithms; Computational Biology/methods; Software; Colonic Neoplasms/genetics
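The abstract of record 1 does not spell out how the distance-based negative sample selection works. As a minimal sketch only, assuming that reliable negatives are the unlabeled pair embeddings farthest (in Euclidean distance) from the centroid of the known positives, the idea could look like this. The function name, the centroid rule, and the toy vectors are all assumptions for illustration, not the MGCNSS implementation.

```python
import math

def select_reliable_negatives(positives, unlabeled, k):
    """Return the k unlabeled vectors farthest from the positive centroid.

    Assumption: distance from the centroid of known positives is used as a
    proxy for "reliably negative"; the paper's exact rule may differ.
    """
    dim = len(positives[0])
    centroid = [sum(v[i] for v in positives) / len(positives) for i in range(dim)]

    def dist(v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, centroid)))

    # Sort unlabeled samples from farthest to nearest and keep the first k.
    return sorted(unlabeled, key=dist, reverse=True)[:k]

# Hypothetical 2-D embeddings of miRNA-disease pairs.
positives = [[1.0, 1.0], [1.2, 0.8]]
unlabeled = [[1.1, 0.9], [5.0, 5.0], [-3.0, 0.0], [1.0, 1.2]]
negatives = select_reliable_negatives(positives, unlabeled, k=2)
```

Unlabeled samples close to the positives are left out of the negative set, which is the intuition behind picking negatives by distance rather than at random.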
2.
Brief Bioinform ; 23(5)2022 09 20.
Article in English | MEDLINE | ID: mdl-35524503

ABSTRACT

MOTIVATION: In recent years, a large number of biological experiments have shown that miRNAs play an important role in understanding disease pathogenesis. The discovery of miRNA-disease associations is beneficial for disease diagnosis and treatment. Since inferring these associations through biological experiments is time-consuming and expensive, researchers have sought to identify the associations using computational approaches. Graph Convolutional Networks (GCNs), which exhibit excellent performance in link prediction problems, have been successfully used in miRNA-disease association prediction. However, GCNs only consider 1st-order neighborhood information at one layer and fail to capture information from high-order neighbors when learning miRNA and disease representations through information propagation. Therefore, how to effectively and explicitly aggregate information from high-order neighborhoods is still challenging. RESULTS: To address this challenge, we propose a novel method called mixed neighborhood information for miRNA-disease association (MINIMDA), which fuses mixed high-order neighborhood information of miRNAs and diseases in multimodal networks. First, MINIMDA constructs the integrated miRNA similarity network and the integrated disease similarity network from their respective multisource information. Then, the embedding representations of miRNAs and diseases are obtained by fusing mixed high-order neighborhood information from the multimodal networks, namely the integrated miRNA similarity network, the integrated disease similarity network and the miRNA-disease association network. Finally, we concatenate the multimodal embedding representations of miRNAs and diseases and feed them into a multilayer perceptron (MLP) to predict their underlying associations. Extensive experimental results show that MINIMDA is superior to other state-of-the-art methods overall. Moreover, the outstanding performance on case studies for esophageal cancer, colon tumor and lung cancer further demonstrates the effectiveness of MINIMDA. AVAILABILITY AND IMPLEMENTATION: https://github.com/chengxu123/MINIMDA and http://120.79.173.96/.


Subjects
Colonic Neoplasms; MicroRNAs; Algorithms; Computational Biology/methods; Humans; MicroRNAs/genetics; Neural Networks, Computer
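To illustrate what "mixing high-order neighborhood information" in record 2 can mean, here is a rough sketch that propagates node features through successive powers of a row-normalized adjacency matrix and concatenates the results per node. This is a generic MixHop-style construction chosen for illustration; the helper names, the normalization, and the toy graph are assumptions, and MINIMDA's actual layers may differ.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def row_normalize(adj):
    """Divide each row by its sum so every node averages over its neighbors."""
    out = []
    for row in adj:
        s = sum(row)
        out.append([v / s if s else 0.0 for v in row])
    return out

def mixed_order_features(adj, feats, max_order):
    """Concatenate A^1 X, A^2 X, ..., A^max_order X for every node."""
    norm = row_normalize(adj)
    blocks, current = [], feats
    for _ in range(max_order):
        current = matmul(norm, current)  # one more hop of propagation
        blocks.append(current)
    return [sum((b[i] for b in blocks), []) for i in range(len(feats))]

# Toy path graph 0 - 1 - 2 with self-loops; one-hot node features.
adj = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]
feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
mixed = mixed_order_features(adj, feats, max_order=2)
```

In the toy graph, node 0 receives no signal from node 2 after one hop but does after two, which is the kind of higher-order information a single first-order layer misses.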
3.
Methods ; 212: 21-30, 2023 04.
Article in English | MEDLINE | ID: mdl-36813016

ABSTRACT

Long non-coding RNAs (lncRNAs) are a class of essential non-coding RNAs with a length of more than 200 nt. Recent studies have indicated that lncRNAs have various complex regulatory functions, which have a great impact on many fundamental biological processes. However, measuring the functional similarity between lncRNAs by traditional wet-lab experiments is time-consuming and labor-intensive, so computational approaches have become an effective way to tackle this problem. Meanwhile, most sequence-based computational methods measure the functional similarity of lncRNAs with fixed-length vector representations, which cannot capture the features of larger k-mers. Therefore, it is urgent to improve the performance of predicting the potential regulatory functions of lncRNAs. In this study, we propose a novel approach called MFSLNC to comprehensively measure the functional similarity of lncRNAs based on variable k-mer profiles of nucleotide sequences. MFSLNC employs dictionary tree storage, which can comprehensively represent lncRNAs with long k-mers. The functional similarity between lncRNAs is evaluated by the Jaccard similarity. MFSLNC verified the similarity between two lncRNAs with the same mechanism, detecting homologous sequence pairs between human and mouse. Besides, MFSLNC is also applied to lncRNA-disease associations, combined with the association prediction model WKNKN. Moreover, by comparison with classical methods on lncRNA-mRNA association data, we showed that our method calculates the similarity of lncRNAs more effectively. The resulting prediction AUC is 0.867, which is a good performance in comparison with similar models.


Subjects
RNA, Long Noncoding; Humans; Animals; Mice; RNA, Long Noncoding/genetics; Base Sequence; Computational Biology/methods; Algorithms
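The Jaccard comparison of variable k-mer profiles described in record 3 can be sketched in a few lines. This simplified version stores k-mers in a Python set rather than the dictionary tree (trie) the abstract mentions, and the range of k values is an arbitrary assumption made for the example.

```python
def kmer_set(seq, k_min, k_max):
    """Collect every k-mer of the sequence for k in [k_min, k_max]."""
    return {
        seq[i:i + k]
        for k in range(k_min, k_max + 1)
        for i in range(len(seq) - k + 1)
    }

def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Toy sequences; real lncRNAs are longer than 200 nt.
profile_1 = kmer_set("ACGT", 2, 3)
profile_2 = kmer_set("ACGA", 2, 3)
similarity = jaccard(profile_1, profile_2)
```

A trie would store the same k-mers while sharing common prefixes, which matters for long k-mers but does not change the similarity value.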
4.
BMC Bioinformatics ; 22(1): 545, 2021 Nov 09.
Article in English | MEDLINE | ID: mdl-34753427

ABSTRACT

BACKGROUND: Amyloids are insoluble fibrillar aggregates that are highly associated with complex human diseases, such as Alzheimer's disease, Parkinson's disease, and type II diabetes. Recently, many studies have reported that some specific regions of amino acid sequences may be responsible for the amyloidosis of proteins. Identifying the amyloidogenic regions has therefore become very important for elucidating the mechanism of amyloids. Accordingly, several computational methods have been put forward to discover amyloidogenic regions. The majority of these methods predict amyloidogenic regions based on the physicochemical properties of amino acids. In fact, the position, order, and correlation of amino acids may also influence the amyloidosis of proteins and should also be considered in detecting amyloidogenic regions. RESULTS: To address this problem, we propose a novel machine-learning approach for predicting amyloidogenic regions, called ReRF-Pred. Firstly, the pseudo amino acid composition (PseAAC) was exploited to characterize the physicochemical properties and correlation of amino acids. Secondly, tripeptide composition (TPC) was employed to represent the order and position of amino acids. To improve the distinguishability of TPC, all possible tripeptides were analyzed by the binomial distribution method, and only those with significantly different distributions between positive and negative samples were retained. Finally, all samples were characterized by the PseAAC and TPC of their amino acid sequences, and a random forest-based predictor of amyloidogenic regions was trained on these samples. Validation experiments showed that the feature set consisting of PseAAC and TPC is the most distinguishable one for detecting amyloidosis. Meanwhile, random forest is superior to the other classifiers considered on almost all metrics. To validate the effectiveness of our model, ReRF-Pred was compared with a series of gold-standard methods on two datasets: Pep-251 and Reg33. The results suggest that our method has the best overall performance and makes significant improvements in discovering amyloidogenic regions. CONCLUSIONS: The advantages of our method are mainly attributed to the fact that PseAAC and TPC can successfully describe the differences between amyloids and other proteins. The ReRF-Pred server can be accessed at http://106.12.83.135:8080/ReRF-Pred/.


Subjects
Amino Acids; Diabetes Mellitus, Type 2; Algorithms; Amino Acid Sequence; Computational Biology; Humans; Proteins
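As a small illustration of the tripeptide composition (TPC) feature named in record 4, the sketch below maps a peptide to an 8000-dimensional frequency vector over all tripeptides of the 20 standard amino acids. Skipping windows that contain non-standard residues is an assumption of this sketch, and the binomial-distribution filtering step from the abstract is not shown.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
TRIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=3)]
INDEX = {t: i for i, t in enumerate(TRIPEPTIDES)}

def tripeptide_composition(seq):
    """Frequency of each of the 20^3 = 8000 tripeptides in the sequence."""
    vec = [0.0] * len(TRIPEPTIDES)
    windows = [seq[i:i + 3] for i in range(len(seq) - 2)]
    # Assumption: windows with non-standard residues are simply ignored.
    valid = [w for w in windows if w in INDEX]
    for w in valid:
        vec[INDEX[w]] += 1.0 / len(valid)
    return vec

tpc = tripeptide_composition("ACDA")  # windows: ACD, CDA
```

Because the vector is indexed by ordered triples, it reflects local residue order, which plain amino acid composition cannot.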
5.
BMC Bioinformatics ; 17(1): 445, 2016 Nov 04.
Article in English | MEDLINE | ID: mdl-27814675

ABSTRACT

BACKGROUND: In recent years, many measures of gene functional similarity have been proposed and widely used in all kinds of essential research. These methods are mainly divided into two categories: pairwise approaches and group-wise approaches. However, a common problem with these methods is their time consumption, especially when measuring the gene functional similarities of a large number of gene pairs. The problem of computational efficiency for pairwise approaches is even more prominent because they are dependent on the combination of semantic similarity. Therefore, the efficient measurement of gene functional similarity remains a challenging problem. RESULTS: To speed current gene functional similarity calculation methods, a novel two-step computing strategy is proposed: (1) establish a hash table for each method to store essential information obtained from the Gene Ontology (GO) graph and (2) measure gene functional similarity based on the corresponding hash table. There is no need to traverse the GO graph repeatedly for each method with the help of the hash table. The analysis of time complexity shows that the computational efficiency of these methods is significantly improved. We also implement a novel Speeding Gene Functional Similarity Calculation tool, namely SGFSC, which is bundled with seven typical measures using our proposed strategy. Further experiments show the great advantage of SGFSC in measuring gene functional similarity on the whole genomic scale. CONCLUSIONS: The proposed strategy is successful in speeding current gene functional similarity calculation methods. SGFSC is an efficient tool that is freely available at http://nclab.hit.edu.cn/SGFSC . The source code of SGFSC can be downloaded from http://pan.baidu.com/s/1dFFmvpZ .


Subjects
Algorithms; Biomarkers/metabolism; Computational Biology/methods; Gene Ontology; Semantics; Computer Simulation; Genomics/methods; Humans; Molecular Sequence Annotation; Programming Languages
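The two-step strategy in record 5 (precompute a hash table from the GO graph, then answer similarity queries from the table) can be sketched as follows. What SGFSC actually stores per measure is not stated in the abstract; this example assumes the table holds each term's ancestor set, and the term names are made up.

```python
def build_ancestor_table(parents):
    """Step 1: map every term to the frozenset of itself plus all ancestors.

    `parents` maps a term to its direct parents in a DAG. The table is filled
    once, so later queries never traverse the graph again.
    """
    table = {}

    def ancestors(term):
        if term not in table:
            result = {term}
            for p in parents.get(term, ()):
                result |= ancestors(p)
            table[term] = frozenset(result)
        return table[term]

    for term in parents:
        ancestors(term)
    return table

def common_ancestors(table, t1, t2):
    """Step 2: answer a query with two dictionary lookups and a set intersection."""
    return table[t1] & table[t2]

# Hypothetical miniature ontology.
go_parents = {"root": [], "a": ["root"], "b": ["root"], "c": ["a", "b"]}
table = build_ancestor_table(go_parents)
shared = common_ancestors(table, "a", "c")
```

With many gene pairs the same terms are queried repeatedly, so paying the traversal cost once per term is where the speed-up comes from.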
6.
Bioinformatics ; 29(11): 1424-32, 2013 Jun 01.
Article in English | MEDLINE | ID: mdl-23572412

ABSTRACT

MOTIVATION: Compared with sequence and structure similarity, functional similarity is more informative for understanding the biological roles and functions of genes. Many important applications in computational molecular biology require functional similarity, such as gene clustering, protein function prediction, protein interaction evaluation and disease gene prioritization. Gene Ontology (GO) is now widely used as the basis for measuring gene functional similarity. Some existing methods combine semantic similarity scores of single term pairs to estimate gene functional similarity, whereas others compare terms in groups to measure it. However, these methods may make error-prone judgments about gene functional similarity. Measuring gene functional similarity reliably remains a challenge. RESULT: We propose a novel method called SORA to measure gene functional similarity in the GO context. First, SORA computes the information content (IC) of a term using semantic specificity and coverage. Second, SORA measures the IC of a term set by combining the inherited and extended IC of the terms based on the structure of GO. Finally, SORA estimates gene functional similarity using the IC overlap ratio of term sets. SORA is evaluated against five state-of-the-art methods in the file on the public platform for collaborative evaluation of GO-based semantic similarity measures. Careful comparisons show that SORA is superior to the other methods in general. Further analysis suggests that it primarily benefits from the structure of GO, which implies expressive information about gene function. SORA offers an effective and reliable way to compare gene function. AVAILABILITY: The web service of SORA is freely available at http://nclab.hit.edu.cn/SORA/


Subjects
Genes; Molecular Sequence Annotation; Vocabulary, Controlled; Algorithms; Computational Biology/methods; Proteins/genetics; Proteins/physiology; Semantics
7.
IEEE/ACM Trans Comput Biol Bioinform ; 20(2): 1053-1064, 2023.
Article in English | MEDLINE | ID: mdl-35687647

ABSTRACT

The measurement of gene functional similarity plays a critical role in numerous biological applications, such as gene clustering, the construction of gene similarity networks. However, most existing approaches still rely heavily on traditional computational strategies, which are not guaranteed to achieve satisfactory performance. In this study, we propose a novel computational approach called GOGCN to measure gene functional similarity by modeling the Gene Ontology (GO) through Graph Convolutional Network (GCN). GOGCN is a graph-based approach that performs sufficient representation learning for terms and relations in the GO graph. First, GOGCN employs the GCN-based knowledge graph embedding (KGE) model to learn vector representations (i.e., embeddings) for all entities (i.e., terms). Second, GOGCN calculates the semantic similarity between two terms based on their corresponding vector representations. Finally, GOGCN estimates gene functional similarity by making use of the pair-wise strategy. During the representation learning period, GOGCN promotes semantic interaction between terms through GCN, thereby capturing the rich structural information of the GO graph. Further experimental results on various datasets suggest that GOGCN is superior to the other state-of-the-art approaches, which shows its reliability and effectiveness.


Subjects
Gene Regulatory Networks; Knowledge Bases; Gene Ontology; Reproducibility of Results; Cluster Analysis; Gene Regulatory Networks/genetics
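The last two steps described in record 7 (term similarity from embeddings, then a pairwise strategy to score genes) might look roughly like the sketch below. It assumes cosine similarity between term vectors and a best-match average over the two genes' annotation sets; the abstract names neither choice, and the embeddings here are toy values rather than learned GCN output.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def best_match_average(terms_a, terms_b, emb):
    """Average, over both genes, of each term's best similarity to the other gene.

    Assumption: best-match average is one common pairwise strategy; the paper
    may use a different combination rule.
    """
    row_best = [max(cosine(emb[a], emb[b]) for b in terms_b) for a in terms_a]
    col_best = [max(cosine(emb[a], emb[b]) for a in terms_a) for b in terms_b]
    return (sum(row_best) + sum(col_best)) / (len(row_best) + len(col_best))

# Hypothetical 2-D term embeddings.
emb = {"t1": [1.0, 0.0], "t2": [0.0, 1.0], "t3": [1.0, 0.0]}
score = best_match_average(["t1", "t2"], ["t3"], emb)
```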
8.
Comput Biol Med ; 157: 106711, 2023 05.
Article in English | MEDLINE | ID: mdl-36924738

ABSTRACT

Long non-coding RNAs (lncRNAs) play important roles by regulating proteins in many biological processes and life activities. To uncover the molecular mechanisms of lncRNAs, it is necessary to identify the interactions of lncRNAs with proteins. Recently, some machine learning methods have been proposed to detect lncRNA-protein interactions according to the distribution of known interactions. The performance of these methods largely depends on: (1) how exactly the distribution of known interactions is characterized by the feature space; (2) how discriminative the feature space is for distinguishing lncRNA-protein interactions. Because the known interactions may follow multiple, complex models, it remains a challenge to construct a discriminative feature space for lncRNA-protein interactions. To resolve this problem, a novel method named DFRPI was developed in this paper based on a deep autoencoder and marginal Fisher analysis. Firstly, some initial features of lncRNA-protein interactions were extracted from the primary sequences and secondary structures of lncRNAs and proteins. Secondly, a deep autoencoder was exploited to learn encoding parameters of the initial features that describe the known interactions precisely. Next, marginal Fisher analysis was employed to optimize the encoding parameters of the features so as to characterize a discriminative feature space of lncRNA-protein interactions. Finally, a random forest-based predictor was trained on the discriminative feature space to detect lncRNA-protein interactions. In a series of experiments, our predictor achieved a precision of 0.920, recall of 0.916, accuracy of 0.918, MCC of 0.836, specificity of 0.920, sensitivity of 0.916 and AUC of 0.906, outperforming the compared methods for predicting lncRNA-protein interactions. These results suggest that the proposed method can generate a reasonable and effective feature space for distinguishing lncRNA-protein interactions accurately. The code and data are available at https://github.com/D0ub1e-D/DFRPI.


Subjects
RNA, Long Noncoding; RNA, Long Noncoding/genetics; Algorithms; Machine Learning; Computational Biology/methods
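Record 8 reports precision, recall, accuracy, MCC, specificity and sensitivity. For readers less familiar with how these relate, the sketch below computes them from a binary confusion matrix using the standard definitions; the counts are invented for the example and are not the paper's data. Note that recall and sensitivity are the same quantity, which is why the abstract lists the same value (0.916) for both.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # identical to sensitivity
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "sensitivity": recall,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }

# Made-up counts for illustration only.
metrics = binary_metrics(tp=90, fp=10, tn=90, fn=10)
```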
9.
Front Plant Sci ; 13: 845835, 2022.
Article in English | MEDLINE | ID: mdl-35237293

ABSTRACT

DNA N6-methyladenine (6mA) is a common epigenetic modification that plays significant roles in the growth and development of plants. Identifying 6mA sites is crucial for elucidating the functions of 6mA. In this article, a novel model named i6mA-vote is developed to predict 6mA sites in plants. Firstly, DNA sequences were encoded into six feature vectors with diverse strategies based on the density, physicochemical properties, and position of nucleotides. To find the best encoding strategy, the feature vectors were compared on several machine learning classifiers. The results suggested that the position of nucleotides has a significant positive effect on 6mA site identification. Thus, the dinucleotide one-hot strategy, which describes the position characteristics of nucleotides well, was employed to extract DNA features in our method. Secondly, DNA sequences of Rosaceae were randomly divided into a training dataset and a test dataset. Finally, i6mA-vote was constructed by combining five different base classifiers under a majority voting strategy and trained on the Rosaceae training dataset. i6mA-vote was evaluated on the task of predicting 6mA sites in the genomes of Rosaceae, rice, and Arabidopsis separately. In Rosaceae, i6mA-vote achieved 0.955 accuracy (ACC), 0.909 Matthews correlation coefficient (MCC), 0.955 sensitivity (SN), and 0.954 specificity (SP). Those indicators, in the order ACC, MCC, SN, SP, were 0.882, 0.774, 0.961, and 0.803 on rice and 0.798, 0.617, 0.666, and 0.929 on Arabidopsis. According to these indicators, our method was effective and better than the other methods compared. The results also illustrated that i6mA-vote performs well in both intraspecies and interspecies 6mA site prediction in plants. Moreover, the specificity is distinctly lower than the sensitivity in rice, while it is just the opposite in Arabidopsis. This may result from sequence similarity among Rosaceae, rice and Arabidopsis.
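The dinucleotide one-hot strategy that record 9 selects can be sketched as below: each overlapping dinucleotide becomes a 16-dimensional one-hot block, so the position of every nucleotide pair is preserved in the final vector. The ordering of the 16 dinucleotides and the all-zero block for unrecognized pairs are assumptions of this sketch.

```python
from itertools import product

# Fixed ordering of the 16 dinucleotides: AA, AC, AG, AT, CA, ...
DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]

def dinucleotide_one_hot(seq):
    """Concatenate one 16-dim one-hot block per overlapping dinucleotide.

    A sequence of length L yields a vector of length 16 * (L - 1).
    Pairs containing characters outside ACGT produce an all-zero block.
    """
    vec = []
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        vec.extend(1 if pair == d else 0 for d in DINUCLEOTIDES)
    return vec

encoded = dinucleotide_one_hot("ACG")  # dinucleotides: AC, CG
```

Unlike a composition vector, two sequences with the same dinucleotide counts but different orders get different encodings, which matches the abstract's point about position.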

10.
Front Genet ; 12: 773202, 2021.
Article in English | MEDLINE | ID: mdl-34917130

ABSTRACT

Recently, several anti-inflammatory peptides (AIPs) have been found in the process of the inflammatory response, and these peptides have been used to treat some inflammatory and autoimmune diseases. Therefore, accurately identifying AIPs from given amino acid sequences is critical for the discovery of novel and efficient anti-inflammatory peptide-based therapeutics and for accelerating their application in therapy. In this paper, a random forest-based model called iAIPs is proposed for identifying AIPs. First, the original samples are encoded with three feature extraction methods: g-gap dipeptide composition (GDC), dipeptide deviation from the expected mean (DDE), and amino acid composition (AAC). Second, the optimal feature subset is generated by a two-step feature selection method, in which the features are ranked by analysis of variance (ANOVA) and the optimal subset is then chosen by an incremental feature selection strategy. Finally, the optimal feature subset is fed into the random forest classifier to construct the identification model. Experimental results showed that iAIPs achieved an AUC of 0.822 on an independent test dataset, which indicates that the proposed model performs better than the existing methods. Furthermore, the extraction of features from peptide sequences provides a basis for evolutionary analysis. The study of peptide identification helps in understanding the diversity of species and analyzing their evolutionary history.
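The first half of the two-step feature selection in record 10, ranking features by ANOVA, can be illustrated with a one-way F statistic computed per feature. This is a plain-Python sketch on toy data using the textbook F definition; the function names are hypothetical and the incremental feature selection step (evaluating a classifier on growing prefixes of the ranking) is only indicated in the closing note.

```python
def anova_f(groups):
    """One-way ANOVA F statistic: between-group over within-group mean square."""
    n_total = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    ms_between = ss_between / (len(groups) - 1)
    ms_within = ss_within / (n_total - len(groups))
    if ms_within == 0:
        return float("inf") if ms_between > 0 else 0.0
    return ms_between / ms_within

def rank_features(X, y):
    """Return feature indices sorted by decreasing F score, plus the scores."""
    labels = sorted(set(y))
    scores = []
    for j in range(len(X[0])):
        groups = [[row[j] for row, lab in zip(X, y) if lab == c] for c in labels]
        scores.append(anova_f(groups))
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order, scores

# Toy data: feature 0 separates the classes, feature 1 does not.
X = [[1, 1], [2, 5], [3, 9], [7, 2], [8, 5], [9, 8]]
y = [1, 1, 1, 0, 0, 0]
order, scores = rank_features(X, y)
```

Incremental feature selection would then train the classifier on the top 1, top 2, ... features of `order` and keep the prefix with the best validation score.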

11.
Biomed Res Int ; 2021: 5515342, 2021.
Article in English | MEDLINE | ID: mdl-34159192

ABSTRACT

As one of the important epigenetic modifications, DNA N4-methylcytosine (4mC) plays a crucial role in controlling gene replication, expression, cell cycle, DNA replication, and differentiation. The accurate identification of 4mC sites is necessary to understand their biological functions. In this paper, we use ensemble learning to develop a model named i4mC-EL to identify 4mC sites in the mouse genome. Firstly, a multifeature encoding scheme consisting of Kmer and EIIP was adopted to describe the DNA sequences. Secondly, on the basis of the multifeature encoding scheme, we developed a stacked ensemble model in which four machine learning algorithms, namely BayesNet, NaiveBayes, LibSVM, and Voted Perceptron, serve as base classifiers whose intermediate results are the input of the metaclassifier, Logistic. The experimental results on the independent test dataset demonstrate that the overall prediction accuracy of i4mC-EL is 82.19%, which is better than that of the existing methods. The user-friendly website implementing i4mC-EL can be accessed freely at the following.


Subjects
Cytosine; DNA Methylation; Genome; Machine Learning; Algorithms; Animals; Area Under Curve; Base Sequence; Bayes Theorem; Computational Biology; DNA/metabolism; Data Mining; Epigenesis, Genetic; Genetic Techniques; Internet; Mice; ROC Curve; Reproducibility of Results
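A multifeature encoding that combines Kmer and EIIP, as named in record 11, could be assembled as in the following sketch. The EIIP (electron-ion interaction pseudopotential) values are the commonly cited nucleotide constants; how i4mC-EL actually combines the two encodings (which k, per-position or per-k-mer EIIP) is not given in the abstract, so the simple concatenation here is an assumption.

```python
from itertools import product

# Commonly used EIIP values for the four nucleotides.
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335}

def kmer_frequencies(seq, k):
    """Normalized counts of all 4^k k-mers (sequence must contain only ACGT)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = len(seq) - k + 1
    for i in range(total):
        counts[seq[i:i + k]] += 1
    return [counts[m] / total for m in kmers]

def eiip_encode(seq):
    """Replace each nucleotide with its EIIP value, preserving position."""
    return [EIIP[base] for base in seq]

def multifeature_encode(seq, k=2):
    """Assumed combination: k-mer frequencies followed by per-position EIIP."""
    return kmer_frequencies(seq, k) + eiip_encode(seq)

features = multifeature_encode("ACGT")
```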
12.
Front Genet ; 12: 665498, 2021.
Article in English | MEDLINE | ID: mdl-33833783

ABSTRACT

Enhancers are regulatory DNA sequences that can be bound by specific proteins named transcription factors (TFs). The interactions between enhancers and TFs regulate specific genes by increasing target gene expression. Therefore, enhancer identification and classification have been a critical issue in the enhancer field. Unfortunately, suitable methods to identify enhancers have so far been lacking. Previous research has mainly focused on features of the enhancer's function and interactions and has ignored sequence information. The recurrent neural network (RNN) and long short-term memory (LSTM) models are currently the most common methods for processing time series data, and LSTM is more suitable than a plain RNN for handling DNA sequences. In this paper, we take advantage of LSTM to build a method named iEnhancer-EBLSTM to identify enhancers. iEnhancer-ensembles of bidirectional LSTM (EBLSTM) consists of two steps. In the first step, we extract subsequences as features by sliding a 3-mer window along the DNA sequence. In the second step, the EBLSTM model is used to identify enhancers from the candidate input sequences. We use the dataset from the study of Quang H et al. as the benchmark. The experimental results on this dataset demonstrate the efficiency of our proposed model.
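The first step described in record 12, sliding a 3-mer window along the DNA sequence, produces a token sequence that a recurrent model can consume. The sketch below shows that tokenization plus a conversion to integer ids; the vocabulary numbering, the padding id 0, and the fixed output length are assumptions added for illustration and are not taken from the paper.

```python
from itertools import product

# Hypothetical vocabulary: ids 1..64 for the 64 possible 3-mers, 0 for padding.
VOCAB = {"".join(p): i + 1 for i, p in enumerate(product("ACGT", repeat=3))}

def three_mer_tokens(seq):
    """Slide a window of width 3 with stride 1 along the sequence."""
    return [seq[i:i + 3] for i in range(len(seq) - 2)]

def encode(seq, length):
    """Map 3-mer tokens to integer ids, truncated or zero-padded to `length`."""
    ids = [VOCAB[t] for t in three_mer_tokens(seq)][:length]
    return ids + [0] * (length - len(ids))

tokens = three_mer_tokens("ACGTA")
ids = encode("AAAA", length=4)
```

The resulting id sequences are what an embedding layer followed by a bidirectional LSTM would typically take as input.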

13.
Comput Math Methods Med ; 2021: 6691096, 2021.
Article in English | MEDLINE | ID: mdl-33680070

ABSTRACT

Preeclampsia (PE) is a maternal disease that causes maternal and child death. Treatment and preventive measures are not sound enough. The problem of PE screening has attracted much attention. The purpose of this study is to screen placental mRNA to obtain the best PE biomarkers for identifying patients with PE. We use Limma in the R language to screen out the 48 differentially expressed genes with the largest differences and used correlation-based feature selection algorithms to reduce the dimensionality and avoid attribute redundancy arising from too many mRNA samples participating in the classification. After reducing the mRNA attributes, the mRNA samples are sorted from large to small according to information gain. In this study, a classifier model is designed to identify whether samples had PE through mRNA in the placenta. To improve the accuracy of classification and avoid overfitting, three classifiers, including C4.5, AdaBoost, and multilayer perceptron, are used. We use the majority voting strategy integrated with the differentially expressed genes and the genes filtered by the best subset method as comparison methods to train the classifier. The results show that the classification accuracy rate has increased from 79% to 82.2%, and the number of mRNA features has decreased from 48 to 13. This study provides clues for the main PE biomarkers of mRNA in the placenta and provides ideas for the treatment and screening of PE.


Subjects
Machine Learning; Placenta/metabolism; Pre-Eclampsia/diagnosis; Pre-Eclampsia/genetics; RNA, Messenger/genetics; Algorithms; Biomarkers/metabolism; Computational Biology; Databases, Nucleic Acid/statistics & numerical data; Decision Trees; Diagnosis, Computer-Assisted; Female; Genetic Markers; Genetic Testing; Humans; Neural Networks, Computer; Pregnancy; RNA, Messenger/metabolism; Transcriptome
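The majority voting strategy over three classifiers mentioned in record 13 reduces to taking the most frequent label per sample. A minimal sketch, with hard-coded predictions standing in for the outputs of C4.5, AdaBoost, and a multilayer perceptron (the values are invented):

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-classifier label lists into one label per sample.

    `predictions` holds one list of labels per classifier, all of equal
    length. With three voters and two classes a tie cannot occur.
    """
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]

# Invented outputs of three classifiers on three samples (1 = PE, 0 = control).
c45_pred = [1, 0, 1]
adaboost_pred = [1, 1, 0]
mlp_pred = [0, 0, 1]
combined = majority_vote([c45_pred, adaboost_pred, mlp_pred])
```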
14.
Comput Math Methods Med ; 2020: 8926750, 2020.
Article in English | MEDLINE | ID: mdl-33133228

ABSTRACT

With the development of computer technology, many machine learning algorithms have been applied to the field of biology, forming the discipline of bioinformatics. Protein function prediction is a classic research topic in this subject area. Though many scholars have made achievements in identifying protein by different algorithms, they often extract a large number of feature types and use very complex classification methods to obtain little improvement in the classification effect, and this process is very time-consuming. In this research, we attempt to utilize as few features as possible to classify vesicular transportation proteins and to simultaneously obtain a comparative satisfactory classification result. We adopt CTDC which is a submethod of the method of composition, transition, and distribution (CTD) to extract only 39 features from each sequence, and LibSVM is used as the classification method. We use the SMOTE method to deal with the problem of dataset imbalance. There are 11619 protein sequences in our dataset. We selected 4428 sequences to train our classification model and selected other 1832 sequences from our dataset to test the classification effect and finally achieved an accuracy of 71.77%. After dimension reduction by MRMD, the accuracy is 72.16%.


Subjects
Algorithms; Support Vector Machine; Vesicular Transport Proteins/classification; Computational Biology/methods; Databases, Protein/statistics & numerical data; Humans; Mathematical Concepts; Vesicular Transport Proteins/genetics; Vesicular Transport Proteins/metabolism
15.
Comput Math Methods Med ; 2020: 8845133, 2020.
Article in English | MEDLINE | ID: mdl-33294004

ABSTRACT

Amyloid is generally an aggregate of insoluble fibrin; its abnormal deposition is the pathogenic mechanism of various diseases, such as Alzheimer's disease and type II diabetes. Therefore, accurately identifying amyloid is necessary to understand its role in pathology. We proposed a machine learning-based prediction model called PredAmyl-MLP, which consists of the following three steps: feature extraction, feature selection, and classification. In the step of feature extraction, seven feature extraction algorithms and different combinations of them are investigated, and the combination of SVMProt-188D and tripeptide composition (TPC) is selected according to the experimental results. In the step of feature selection, maximum relevant maximum distance (MRMD) and binomial distribution (BD) are, respectively, used to remove the redundant or noise features, and the appropriate features are selected according to the experimental results. In the step of classification, we employed multilayer perceptron (MLP) to train the prediction model. The 10-fold cross-validation results show that the overall accuracy of PredAmyl-MLP reached 91.59%, and the performance was better than the existing methods.


Subjects
Amyloidogenic Proteins/chemistry; Neural Networks, Computer; Algorithms; Alzheimer Disease/etiology; Alzheimer Disease/metabolism; Amino Acid Sequence; Amino Acids/chemistry; Amyloidogenic Proteins/genetics; Computational Biology; Diabetes Mellitus, Type 2/etiology; Diabetes Mellitus, Type 2/metabolism; Humans; Machine Learning; Mathematical Concepts; Support Vector Machine
16.
Front Cell Dev Biol ; 8: 591487, 2020.
Article in English | MEDLINE | ID: mdl-33195258

ABSTRACT

Excessive oxidative stress responses can threaten our health, and thus it is essential to produce antioxidant proteins to regulate the body's oxidative responses. The low number of antioxidant proteins makes it difficult to extract their representative features. Our experimental method did not use structural information but instead studied antioxidant proteins from a sequence perspective while focusing on the impact of data imbalance on sensitivity, thus greatly improving the model's sensitivity for antioxidant protein recognition. We developed a method based on the Composition of k-spaced Amino Acid Pairs (CKSAAP) and the Conjoint Triad (CT) features derived from the amino acid composition and protein-protein interactions. SMOTE and the Max-Relevance-Max-Distance algorithm (MRMD) were used to balance the training data and select the optimal feature subset, respectively. The test used 10-fold cross-validation and a random forest algorithm for classification according to the selected feature subset. The sensitivity was 0.792, the specificity was 0.808, and the average accuracy was 0.8.
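The Composition of k-spaced Amino Acid Pairs (CKSAAP) feature used in record 16 counts residue pairs separated by k positions. The sketch below follows the usual definition (400 pair frequencies for each gap k = 0 .. k_max, each normalized by the number of windows); the choice of k_max and the handling of non-standard residues are assumptions.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]

def cksaap(seq, k_max):
    """Return 400 * (k_max + 1) frequencies of k-spaced amino acid pairs.

    For gap k, the pair is (seq[i], seq[i + k + 1]) and there are
    len(seq) - k - 1 such windows. Pairs with non-standard residues are
    ignored (an assumption of this sketch).
    """
    features = []
    for k in range(k_max + 1):
        counts = dict.fromkeys(PAIRS, 0)
        n = len(seq) - k - 1
        for i in range(max(n, 0)):
            pair = seq[i] + seq[i + k + 1]
            if pair in counts:
                counts[pair] += 1
        features.extend(counts[p] / n if n > 0 else 0.0 for p in PAIRS)
    return features

vec = cksaap("ACDA", k_max=1)  # k=0: AC, CD, DA; k=1: A.D, C.A
```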

17.
Biomed Res Int ; 2020: 9235920, 2020.
Article in English | MEDLINE | ID: mdl-32596396

ABSTRACT

Enzymes are proteins that can efficiently catalyze specific biochemical reactions, and they are widely present in the human body. Developing an efficient method to identify human enzymes is vital to select enzymes from the vast number of human proteins and to investigate their functions. Nevertheless, only a limited amount of research has been conducted on the classification of human enzymes and nonenzymes. In this work, we developed a support vector machine- (SVM-) based predictor to classify human enzymes using the amino acid composition (AAC), the composition of k-spaced amino acid pairs (CKSAAP), and selected informative amino acid pairs through the use of a feature selection technique. A training dataset including 1117 human enzymes and 2099 nonenzymes and a test dataset including 684 human enzymes and 1270 nonenzymes were constructed to train and test the proposed model. The results of jackknife cross-validation showed that the overall accuracy was 76.46% for the training set and 76.21% for the test set, which are higher than the 72.6% achieved in previous research. Furthermore, various feature extraction methods and mainstream classifiers were compared in this task, and informative feature parameters of k-spaced amino acid pairs were selected and compared. The results suggest that our classifier can be used in human enzyme identification effectively and efficiently and can help to understand their functions and develop new drugs.


Subjects
Amino Acids/chemistry , Enzymes/chemistry , Proteins/chemistry , Algorithms , Computational Biology , Protein Databases , Enzymes/classification , Humans , Proteins/classification , Support Vector Machine
18.
Front Genet ; 10: 18, 2019.
Article in English | MEDLINE | ID: mdl-30774646

ABSTRACT

Non-coding RNA (ncRNA) plays important roles in many critical regulatory processes. Many ncRNAs perform their regulatory functions in the form of RNA-protein complexes. Therefore, identifying the interactions between ncRNAs and proteins is fundamental to understanding the functions of ncRNA. Given the high cost of experimental techniques, developing an accurate computational predictive model has become an indispensable way to identify ncRNA-protein interactions. A powerful model for predicting ncRNA-protein interaction needs a good feature set for characterizing the interaction. In this paper, a novel method (named CFRP) is put forward to generate complex features for characterizing ncRNA-protein interaction. To obtain a comprehensive description of ncRNA-protein interaction, complex features are generated by non-linear transformations of the traditional k-mer features of ncRNA and protein sequences. To reduce the dimensionality of the complex features, a group of discriminative features is selected by random forest. To validate the performance of the proposed method, a series of experiments is carried out on several widely used public datasets. Compared with the traditional k-mer features, the CFRP complex features boost the performance of the ncRNA-protein interaction prediction model. The CFRP-based prediction model is also compared with several state-of-the-art methods, and the results show that the proposed method achieves better performance than the others in terms of the evaluation metrics. In conclusion, the complex features generated by CFRP are beneficial for building a powerful model for predicting ncRNA-protein interaction.
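
The traditional k-mer features that this abstract takes as its starting point can be sketched as normalized substring counts. The function name `kmer_frequencies`, the default `k=3`, and the RNA alphabet are assumptions for illustration. The abstract does not specify the non-linear transformations used by CFRP, so they are not shown here.

```python
from itertools import product


def kmer_frequencies(sequence, k=3, alphabet="ACGU"):
    """Return normalized k-mer counts in lexicographic alphabet order."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if kmer in counts:  # skip windows containing unknown symbols
            counts[kmer] += 1
            total += 1
    return [counts[kmer] / total if total else 0.0 for kmer in kmers]


vector = kmer_frequencies("ACGU", k=2)
print(len(vector))  # 4**2 possible 2-mers
```

For protein sequences the same function could be called with a 20-letter alphabet, or with a reduced alphabet to keep the dimension manageable.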

19.
BMC Syst Biol ; 12(Suppl 9): 134, 2018 12 31.
Article in English | MEDLINE | ID: mdl-30598084

ABSTRACT

BACKGROUND: Drug repositioning is a promising and efficient way to discover new indications for existing drugs, and it holds great potential for precision medicine in the post-genomic era. Many network-based approaches have been proposed for drug repositioning based on similarity networks, which integrate multiple sources of information on drugs and diseases. However, these methods may simply treat all nodes as the same type and neglect the semantic meanings of different meta-paths in the heterogeneous network. Therefore, it is urgent to develop a rational method to infer new indications for approved drugs. RESULTS: In this study, we propose a novel methodology named HeteSim_DrugDisease (HSDD) for drug repositioning prediction. First, we build the drug-drug similarity network and the disease-disease similarity network by integrating information on drugs and diseases. Second, a drug-disease heterogeneous network is constructed, which combines the drug similarity network, the disease similarity network, and the known drug-disease association network. Finally, HSDD predicts novel drug-disease associations based on the HeteSim scores of different meta-paths. The experimental results show that HSDD performs significantly better than the existing state-of-the-art approaches. HSDD achieves an AUC score of 0.8994 in the leave-one-out cross-validation experiment. Moreover, case studies for selected drugs further illustrate the practical usefulness of HSDD. CONCLUSIONS: HSDD can be an effective and feasible way to infer the associations between drugs and diseases using meta-path-based semantic network analysis.
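
To give a feel for a meta-path score of the kind described in this abstract, the sketch below computes a normalized HeteSim-style relevance for a single length-2 path A -> B -> C, for example drug -> drug -> disease. It assumes the common formulation in which the score is the cosine between the distribution over the middle nodes reached from the source and the distribution reached backward from the target. The function names and the toy matrices are hypothetical, and the abstract does not say which meta-paths HSDD uses or how their scores are combined.

```python
import math


def row_normalize(matrix):
    """Scale each row to sum to 1; all-zero rows stay zero."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([v / total if total else 0.0 for v in row])
    return normalized


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def hetesim_length2(w_ab, w_bc):
    """Score every (a, c) pair along the path A -> B -> C.

    w_ab[a][b] and w_bc[b][c] are non-negative edge weights.
    """
    from_a = row_normalize(w_ab)             # a -> B transition probabilities
    from_c = row_normalize(transpose(w_bc))  # c -> B transition probabilities
    scores = []
    for pa in from_a:
        row = []
        for pc in from_c:
            dot = sum(x * y for x, y in zip(pa, pc))
            norm = math.sqrt(sum(x * x for x in pa)) * math.sqrt(sum(y * y for y in pc))
            row.append(dot / norm if norm else 0.0)
        scores.append(row)
    return scores


# Toy data (made up): 2 drugs, 2 diseases.
drug_similarity = [[1.0, 0.5], [0.5, 1.0]]
drug_disease = [[1, 0], [0, 1]]
scores = hetesim_length2(drug_similarity, drug_disease)
print(scores[0])  # relevance of drug 0 to each disease
```

In the toy data, drug 0 scores highest for the disease it is already linked to and receives a smaller, non-zero score for the other disease through its similarity to drug 1, which is the kind of signal a repositioning method would rank.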


Subjects
Computational Biology/methods , Drug Repositioning/methods , Semantics , Disease , Machine Learning
20.
J Biomed Semantics ; 8(Suppl 1): 27, 2017 09 20.
Article in English | MEDLINE | ID: mdl-29297388

ABSTRACT

BACKGROUND: In recent years, numerous computational methods have predicted protein function based on the protein-protein interaction (PPI) network. These methods assume that two proteins share the same function if they interact with each other. However, recent studies report that the functions of two interacting proteins may be merely related, which can mislead the prediction of protein function. Therefore, there is a need to investigate the functional relationship between interacting proteins. RESULTS: In this paper, the functional relationship between interacting proteins is studied, and a novel method called GoDIN is proposed to annotate the functions of interacting proteins in the Gene Ontology (GO) context. It is assumed that the functional difference between interacting proteins can be expressed by the semantic difference between a GO term and its relatives. Thus, the method uses a GO term and its relatives to annotate the interacting proteins separately according to their functional roles in the PPI network. The method is validated by a series of experiments and compared with the concerned method. The experimental results confirm the assumption and suggest that GoDIN is effective at predicting protein functions. CONCLUSIONS: This study demonstrates that: (1) interacting proteins are not equal in the PPI network, and their functions may be the same, similar, or merely related; (2) the functional difference between interacting proteins can be measured by their degrees in the PPI network; (3) the functional relationship between interacting proteins can be expressed by the relationship between a GO term and its relatives.


Subjects
Computational Biology/methods , Gene Ontology , Protein Interaction Maps , Proteins/metabolism , Protein Databases , Saccharomyces cerevisiae Proteins/metabolism