1.
Front Immunol ; 14: 1267755, 2023.
Article in English | MEDLINE | ID: mdl-38094296

ABSTRACT

N4-acetylcytidine (ac4C) is a modification of cytidine at the nitrogen-4 position, playing a significant role in the translation process of mRNA. However, the precise mechanism and details of how ac4C modifies translated mRNA remain unclear. Since identifying ac4C sites using conventional experimental methods is both labor-intensive and time-consuming, there is an urgent need for a method that can promptly recognize ac4C sites. In this paper, we propose a comprehensive ensemble learning model, the Stacking-based heterogeneous integrated ac4C model, engineered explicitly to identify ac4C sites. This innovative model integrates three distinct feature extraction methodologies: Kmer, electron-ion interaction pseudo-potential values (PseEIIP), and pseudo-K-tuple nucleotide composition (PseKNC). The model also incorporates the robust Cluster Centroids algorithm to enhance its performance in dealing with imbalanced data and alleviate underfitting issues. Our independent testing experiments indicate that our proposed model improves the Mcc by 15.61% and the ROC by 5.97% compared to existing models. To test our model's adaptability, we also utilized a balanced dataset assembled by the authors of iRNA-ac4C. Our model showed an increase in Sn of 4.1%, an increase in Acc of nearly 1%, and ROC improvement of 0.35% on this balanced dataset. The code for our model is freely accessible at https://github.com/louliliang/ST-ac4C.git, allowing users to quickly build their model without dealing with complicated mathematical equations.


Subject(s)
Cytidine , Nucleotides , RNA, Messenger/genetics , Cytidine/genetics , Algorithms
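The Kmer encoding named in the abstract above can be illustrated with a minimal sketch. This is not the code from the ST-ac4C repository; the function name `kmer_frequencies`, the choice of k = 2, and the example sequence are assumptions made for illustration.

```python
from itertools import product

def kmer_frequencies(seq, k=2, alphabet="ACGU"):
    """Return normalized k-mer frequencies in fixed lexicographic order.

    Hypothetical helper: one common way to compute Kmer features.
    """
    seq = seq.upper().replace("T", "U")  # treat DNA-style input as RNA
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows with ambiguous bases such as N
            counts[window] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]

# Illustrative sequence only; yields a 16-dimensional vector for k = 2.
features = kmer_frequencies("ACGUACGU", k=2)
```

Each sequence maps to a vector of length 4^k, so vectors from sequences of different lengths remain comparable.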
2.
Math Biosci Eng ; 20(11): 19133-19151, 2023 Oct 13.
Article in English | MEDLINE | ID: mdl-38052593

ABSTRACT

Malignancies such as bladder urothelial carcinoma, colon adenocarcinoma, liver hepatocellular carcinoma, lung adenocarcinoma and prostate adenocarcinoma significantly impact men's well-being. Accurate cancer classification is vital in determining treatment strategies and improving patient prognosis. This study introduced an innovative method that utilizes gene selection from high-dimensional datasets to enhance the performance of the male tumor classification algorithm. The method assesses the reliability of DNA methylation data to distinguish the five most prevalent types of male cancers from normal tissues by employing DNA methylation 450K data obtained from The Cancer Genome Atlas (TCGA) database. First, the chi-square test is used for dimensionality reduction and second, L1 penalized logistic regression is used for feature selection. Furthermore, the stacking ensemble learning technique was employed to integrate seven common multiclassification models. Experimental results demonstrated that the ensemble learning model utilizing multiple classification models outperformed any base classification model. The proposed ensemble model achieved an astonishing overall accuracy (ACC) of 99.2% in independent testing data. Moreover, it may present novel ideas and pathways for the early detection and treatment of future diseases.


Subject(s)
Adenocarcinoma , Carcinoma, Hepatocellular , Carcinoma, Transitional Cell , Colonic Neoplasms , Liver Neoplasms , Lung Neoplasms , Urinary Bladder Neoplasms , Humans , Male , DNA Methylation , Adenocarcinoma/genetics , Carcinoma, Transitional Cell/genetics , Reproducibility of Results , Urinary Bladder Neoplasms/genetics , Colonic Neoplasms/genetics , Carcinoma, Hepatocellular/diagnosis , Carcinoma, Hepatocellular/genetics , Lung Neoplasms/genetics , Liver Neoplasms/diagnosis , Liver Neoplasms/genetics
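The abstract above uses the chi-square test as a first filtering step but does not say how features were tabulated. As a sketch under that caveat, the Pearson chi-square statistic for a contingency table of counts can be computed as follows; the binarized "high/low methylation versus tumor/normal" table is a hypothetical example.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for an r x c contingency table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical counts: rows = methylation high/low, columns = tumor/normal.
associated = chi_square_statistic([[10, 20], [20, 10]])
independent = chi_square_statistic([[10, 10], [10, 10]])
```

Features with larger statistics are more strongly associated with the class label and would be kept before a second selection step such as L1-penalized logistic regression.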
3.
Comput Biol Med ; 166: 107529, 2023 Sep 20.
Article in English | MEDLINE | ID: mdl-37748220

ABSTRACT

Accurate identification of inter-chain contacts in the protein complex is critical to determine the corresponding 3D structures and understand the biological functions. We proposed a new deep learning method, ICCPred, to deduce the inter-chain contacts from the amino acid sequences of the protein complex. This pipeline was built on the designed deep residual network architecture, integrating the pre-trained language model with three multiple sequence alignments (MSAs) from different biological views. Experimental results on 709 non-redundant benchmarking protein complexes showed that the proposed ICCPred significantly increased inter-chain contact prediction accuracy compared to the state-of-the-art approaches. Detailed data analyses showed that the significant advantage of ICCPred lies in the utilization of pre-trained transformer language models which can effectively extract the complementary co-evolution diversity from three MSAs. Meanwhile, the designed deep residual network enhances the correlation between the co-evolution diversity and the patterns of inter-chain contacts. These results demonstrated a new avenue for high-accuracy deep-learning inter-chain contact prediction that is applicable to large-scale protein-protein interaction annotations from sequence alone.

4.
Heliyon ; 9(4): e15096, 2023 Apr.
Article in English | MEDLINE | ID: mdl-37095983

ABSTRACT

The mortality rate from cervical cancer (CESC), a malignant tumor that affects women, has increased significantly worldwide in recent years. With the advancement of bioinformatics technology, the discovery of biomarkers offers a new direction for the diagnosis of cervical cancer. Because of the high dimension and small sample size of omics data, or the use of biomarkers generated from a single type of omics data, the diagnosis of cervical cancer may be inaccurate and unreliable. The goal of this study was to search the GEO and TCGA databases for potential biomarkers for the diagnosis and prognosis of CESC. We began by downloading CESC DNA methylation data (GSE30760) from GEO, performed differential analysis on the methylation data, and screened out the differential genes. Then, using estimation algorithms, we scored immune cells and stromal cells in the tumor microenvironment and performed survival analysis on the gene expression profile data and the most recent clinical data of CESC from TCGA. Next, differential gene analysis was performed with the 'limma' package in R, overlapping genes were screened out with a Venn plot, and these overlapping genes were subjected to GO and KEGG functional enrichment analysis. The differential genes screened from the GEO methylation data and those screened from the TCGA gene expression data were intersected to obtain the common differential genes. A protein-protein interaction (PPI) network of the gene expression data was then constructed to discover key genes. The key genes from the PPI network were intersected with the previously identified common differential genes for further validation. The Kaplan-Meier curve was then used to determine the prognostic importance of the key genes. Survival analysis showed that CD3E and CD80 are important for the identification of cervical cancer and can be considered potential biomarkers for cervical cancer.
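The Kaplan-Meier step mentioned in the abstract above can be sketched with the standard product-limit estimator. The study itself does not describe its implementation; the function name and the follow-up data below are hypothetical.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: follow-up times; events: 1 if the event occurred, 0 if censored.
    """
    data = sorted(zip(durations, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all subjects that share the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed  # events and censored subjects both leave the risk set
    return curve

# Hypothetical follow-up times (months) and event indicators.
curve = kaplan_meier([5, 8, 8, 12, 15], [1, 1, 0, 1, 0])
```

Comparing such curves between high- and low-expression groups of a gene is the usual way a candidate biomarker's prognostic value is assessed.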

5.
Front Physiol ; 14: 1105891, 2023.
Article in English | MEDLINE | ID: mdl-36998990

ABSTRACT

As one of the most common diseases in pediatric surgery, an inguinal hernia is usually diagnosed by medical experts based on clinical data collected from magnetic resonance imaging (MRI), computed tomography (CT), or B-ultrasound. The parameters of blood routine examination, such as white blood cell count and platelet count, are often used as diagnostic indicators of intestinal necrosis. Based on the medical numerical data on blood routine examination parameters and liver and kidney function parameters, this paper used machine learning algorithm to assist the diagnosis of intestinal necrosis in children with inguinal hernia before operation. In the work, we used clinical data consisting of 3,807 children with inguinal hernia symptoms and 170 children with intestinal necrosis and perforation caused by the disease. Three different models were constructed according to the blood routine examination and liver and kidney function. Some missing values were replaced by using the RIN-3M (median, mean, or mode region random interpolation) method according to the actual necessity, and the ensemble learning based on the voting principle was used to deal with the imbalanced datasets. The model trained after feature selection yielded satisfactory results with an accuracy of 86.43%, sensitivity of 84.34%, specificity of 96.89%, and AUC value of 0.91. Therefore, the proposed methods may be a potential idea for auxiliary diagnosis of inguinal hernia in children.
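The abstract above states that a voting-based ensemble was used to handle class imbalance but gives no details. One common arrangement, assumed here, trains several models on balanced subsets and combines their predictions by majority vote; only the voting step is sketched, and the prediction lists are made up.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-model label lists into one label per sample by majority vote.

    predictions: list of lists, one inner list of labels per model.
    Use an odd number of models for binary labels to avoid ties.
    """
    voted = []
    for sample_votes in zip(*predictions):
        label, _ = Counter(sample_votes).most_common(1)[0]
        voted.append(label)
    return voted

# Hypothetical outputs of three models on three samples (1 = necrosis).
final = majority_vote([[1, 0, 1], [1, 1, 0], [0, 0, 1]])
```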

6.
Front Genet ; 13: 926927, 2022.
Article in English | MEDLINE | ID: mdl-35846148

ABSTRACT

The early symptoms of lung adenocarcinoma patients are inapparent, and the clinical diagnosis of lung adenocarcinoma is primarily through X-ray examination and pathological section examination, whereas the discovery of biomarkers points out another direction for the diagnosis of lung adenocarcinoma with the development of bioinformatics technology. However, it is not accurate and trustworthy to diagnose lung adenocarcinoma due to omics data with high-dimension and low-sample size (HDLSS) features or biomarkers produced by utilizing only single omics data. To address the above problems, the feature selection methods of biological analysis are used to reduce the dimension of gene expression data (GSE19188) and DNA methylation data (GSE139032, GSE49996). In addition, the Cartesian product method is used to expand the sample set and integrate gene expression data and DNA methylation data. The classification is built by using a deep neural network and is evaluated on K-fold cross validation. Moreover, gene ontology analysis and literature retrieving are used to analyze the biological relevance of selected genes, TCGA database is used for survival analysis of these potential genes through Kaplan-Meier estimates to discover the detailed molecular mechanism of lung adenocarcinoma. Survival analysis shows that COL5A2 and SERPINB5 are significant for identifying lung adenocarcinoma and are considered biomarkers of lung adenocarcinoma.
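The Cartesian product step in the abstract above expands the sample set by pairing records from two omics sources. The sketch below assumes that samples are paired only within the same class label, which the abstract does not state; the feature values are invented.

```python
from itertools import product

def cartesian_integrate(expr_samples, meth_samples):
    """Pair every expression sample with every methylation sample of the same class.

    Each sample is (feature_list, label). Same-label pairing is an assumption.
    """
    combined = []
    for (x, label_x), (m, label_m) in product(expr_samples, meth_samples):
        if label_x == label_m:
            combined.append((x + m, label_x))  # concatenate the two feature lists
    return combined

# Hypothetical toy data: 1 = tumor, 0 = normal.
expr = [([0.1, 0.2], 1), ([0.3, 0.4], 0)]
meth = [([0.9], 1), ([0.8], 1), ([0.7], 0)]
merged = cartesian_integrate(expr, meth)
```

With n expression samples and m methylation samples in a class, the product yields n × m integrated samples for that class, which is how the approach enlarges a small data set.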

7.
Front Genet ; 13: 859188, 2022.
Article in English | MEDLINE | ID: mdl-35754843

ABSTRACT

Drug-target interactions (DTIs) are regarded as an essential part of genomic drug discovery, and computational prediction of DTIs can accelerate the search for lead drugs for a target, compensating for time-consuming and expensive wet-lab techniques. Currently, many computational methods predict DTIs based on the sequence composition or physicochemical properties of the drug and target, but further efforts are needed to improve them. In this article, we propose a new sequence-based method for accurately identifying DTIs. For target proteins, we explore using pre-trained Bidirectional Encoder Representations from Transformers (BERT) to extract sequence features, which can provide unique and valuable pattern information. For drug molecules, the Discrete Wavelet Transform (DWT) is employed to generate information from drug molecular fingerprints. We then concatenate the feature vectors of the DTIs and input them into a feature extraction module consisting of a batch-norm layer, a rectified linear activation layer, and a linear layer (called a BRL block), followed by a Convolutional Neural Network module that extracts DTI features further. Subsequently, a BRL block is used as the prediction engine. After optimizing the model with contrastive loss and cross-entropy loss, it gave prediction accuracies for the target families of G protein-coupled receptors, ion channels, enzymes, and nuclear receptors of up to 90.1%, 94.7%, 94.9%, and 89%, respectively, which indicates that the proposed method can outperform existing predictors. To make it as convenient as possible for researchers, the web server for the new predictor is freely accessible at https://bioinfo.jcu.edu.cn/dtibert or http://121.36.221.79/dtibert/. The proposed method may also be a potential option for other DTIs.
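To illustrate the DWT step applied to drug fingerprints in the abstract above, here is a one-level discrete wavelet transform. The abstract does not name the wavelet, so the Haar wavelet is an assumption, and the four-bit fingerprint is a toy input.

```python
import math

def haar_dwt(signal):
    """One-level Haar DWT: returns (approximation, detail) coefficient lists."""
    values = list(signal)
    if len(values) % 2:
        values.append(values[-1])  # pad odd-length input by repeating the last value
    s = math.sqrt(2)
    approx = [(values[i] + values[i + 1]) / s for i in range(0, len(values), 2)]
    detail = [(values[i] - values[i + 1]) / s for i in range(0, len(values), 2)]
    return approx, detail

# Toy binary fingerprint; real molecular fingerprints are much longer.
approx, detail = haar_dwt([1, 1, 0, 1])
```

The approximation coefficients summarize the fingerprint at a coarser scale, and applying the transform repeatedly to them gives a multi-level decomposition.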

8.
Front Endocrinol (Lausanne) ; 13: 849549, 2022.
Article in English | MEDLINE | ID: mdl-35557849

ABSTRACT

Pupylation is an important posttranslational modification of proteins and plays a key role in the cell function of microorganisms. Accurate prediction of pupylation proteins and their specific sites is of great significance for the study of basic biological processes and the development of related drugs, since it would greatly reduce experimental costs and improve work efficiency. In this work, we first constructed a model for identifying pupylation proteins. To improve the pupylation protein prediction model, the KNN scoring matrix model based on functional domain GO annotation and the Word Embedding model were used to extract features, and Random Under-sampling (RUS) and the Synthetic Minority Over-sampling Technique (SMOTE) were applied to balance the dataset. Finally, the balanced datasets were input into Extreme Gradient Boosting (XGBoost). In 10-fold cross-validation, the accuracy (ACC), Matthews correlation coefficient (MCC), and area under the ROC curve (AUC) were 95.23%, 0.8100, and 0.9864, respectively. For the pupylation site prediction model, six feature encodings (i.e., TPC, AAI, One-hot, PseAAC, CKSAAP, and Word Embedding) were used to extract protein sequence features, and the chi-square test was employed for feature selection. Rigorous 10-fold cross-validation indicated that the accuracies are very high and exceed those of existing counterparts. Finally, for the convenience of researchers, PUP-PS-Fuse has been established at https://bioinfo.jcu.edu.cn/PUP-PS-Fuse, with http://121.36.221.79/PUP-PS-Fuse/ as a backup.


Subject(s)
Algorithms , Proteins , Amino Acid Sequence , Area Under Curve , Protein Processing, Post-Translational , Proteins/metabolism
9.
Math Biosci Eng ; 18(6): 9132-9147, 2021 Oct 25.
Article in English | MEDLINE | ID: mdl-34814339

ABSTRACT

Protein S-nitrosylation is one of the most important post-translational modifications. A well-grounded understanding of S-nitrosylation is significant because it plays a key role in a variety of biological processes. For an uncharacterized protein sequence, it is meaningful for both basic research and drug development to first identify whether it is an S-nitrosylation protein and then predict the specific S-nitrosylation site(s). This work proposes two models for identifying S-nitrosylation proteins and their PTM sites. Firstly, three kinds of features are extracted from the protein sequence: KNN scoring of functional domain annotation, PseAAC, and bag-of-words based on the physical and chemical properties of amino acids. Secondly, the synthetic minority oversampling technique is used to balance the data sets, and several state-of-the-art classifiers and feature fusion strategies are applied to the balanced data sets. In five-fold cross-validation for predicting S-nitrosylation proteins, the accuracy (ACC), Matthews correlation coefficient (MCC), and area under the ROC curve (AUC) are 81.84%, 0.5178, and 0.8635, respectively. Finally, a model for predicting S-nitrosylation sites has been constructed on the basis of tripeptide composition (TPC) and the composition of k-spaced amino acid pairs (CKSAAP). To eliminate redundant information and improve work efficiency, elastic nets are employed for feature selection. The five-fold cross-validation tests indicate promising success rates for the proposed model. For the convenience of related researchers, a web server named "RF-SNOPS" has been established at http://www.jci-bioinfo.cn/RF-SNOPS.


Subject(s)
Amino Acids , Proteins , Algorithms , Amino Acid Sequence , Area Under Curve , Computational Biology , Protein Processing, Post-Translational
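The CKSAAP encoding used for site prediction in the abstract above counts amino acid pairs separated by k residues. The sketch below follows the commonly used definition; the maximum gap and the peptide are illustrative assumptions, not values taken from the paper.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def cksaap(seq, max_gap=2):
    """Composition of k-spaced amino acid pairs for gaps 0..max_gap.

    Returns 400 normalized pair frequencies per gap, concatenated.
    """
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    features = []
    for gap in range(max_gap + 1):
        counts = dict.fromkeys(pairs, 0)
        n_windows = 0
        for i in range(len(seq) - gap - 1):
            pair = seq[i] + seq[i + gap + 1]
            if pair in counts:  # ignore non-standard residues such as X
                counts[pair] += 1
                n_windows += 1
        features.extend(counts[p] / n_windows if n_windows else 0.0 for p in pairs)
    return features

# Toy peptide; real inputs are fixed-length windows around a candidate site.
vec = cksaap("ACDACD", max_gap=1)
```

Because the vector grows by 400 dimensions per gap, a feature selection step such as the elastic net mentioned in the abstract is a natural follow-up.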
10.
Front Genet ; 12: 738274, 2021.
Article in English | MEDLINE | ID: mdl-34567088

ABSTRACT

Ion channels are the second largest drug target family. Ion channel dysfunction may lead to a number of diseases such as Alzheimer's disease, epilepsy, cephalagra, and type II diabetes. For predicting ion channel-drug interactions, computational approaches are effective and efficient compared with costly, labor-intensive, and time-consuming experimental methods. Most existing methods can only handle ion channels with known 3D structures; however, the 3D structures of most ion channels are still unknown. Many predictors based on protein sequence have been developed to address this challenge, but most of their results need to be improved, or their web servers are missing. In this paper, a sequence-based classifier called "iCDI-W2vCom" was developed to identify the interactions between ion channels and drugs. In the predictor, the drug compound was represented by SMILES-word2vec, FP2-word2vec, SMILES-node2vec, and ECFPs as a 1184D vector, the ion channel was represented by word2vec as a 64D vector, and the prediction engine was the LightGBM classifier. The accuracy and AUC achieved by iCDI-W2vCom in fivefold cross-validation were 91.95% and 0.9703, respectively, which outperformed other existing predictors in this area. A user-friendly web server for iCDI-W2vCom was established at http://www.jci-bioinfo.cn/icdiw2v. The proposed method may also be a potential method for predicting target-drug interactions.

11.
Comput Math Methods Med ; 2021: 6652288, 2021.
Article in English | MEDLINE | ID: mdl-33505514

ABSTRACT

Intestinal obstruction is a common surgical emergency in children. However, it is challenging to seek appropriate treatment for childhood ileus since many diagnostic measures suitable for adults are not applicable to children. The rapid development of machine learning has spurred much interest in its application to medical imaging problems but little in medical text mining. In this paper, a two-layer model based on text data such as routine blood count and urine tests is proposed to provide guidance on the diagnosis and assist in clinical decision-making. The samples of this study were 526 children with intestinal obstruction. Firstly, the samples were divided into two groups according to whether they had intestinal obstruction surgery, and then, the surgery group was divided into two groups according to whether the intestinal tube was necrotic. Specifically, we combined 63 physiological indexes of each child with their corresponding label and fed them into a deep learning neural network which contains multiple fully connected layers. Subsequently, the corresponding value was obtained by activation function. The 5-fold cross-validation was performed in the first layer and demonstrated a mean accuracy (Acc) of 80.04%, and the corresponding sensitivity (Se), specificity (Sp), and MCC were 67.48%, 87.46%, and 0.57, respectively. Additionally, the second layer can also reach an accuracy of 70.4%. This study shows that the proposed algorithm has direct meaning to processing of clinical text data of childhood ileus.


Subject(s)
Artificial Intelligence , Deep Learning , Intestinal Obstruction/diagnosis , Intestinal Obstruction/surgery , Algorithms , Biomarkers/blood , Child , Computational Biology , Data Mining , Databases, Factual , Diagnosis, Computer-Assisted , Humans , Ileus/diagnosis , Ileus/surgery , Intestinal Obstruction/blood , Retrospective Studies
12.
Front Genet ; 11: 515, 2020.
Article in English | MEDLINE | ID: mdl-32582278

ABSTRACT

Proteins play primary roles in important biological processes such as catalysis, physiological functions, and immune system functions. Thus, how proteins evolved has been a central question in the field of evolutionary biology. General models of protein evolution help to determine the baseline expectations for the evolution of sequences, and these models have been widely used in sequence analysis as well as in the computer simulation of artificial sequence data sets. We have developed a new method of simulating multi-domain protein evolution, including domain fusion, insertion, and deletion. Simulation tests showed that the success rates achieved by the proposed predictor are remarkably high. For the convenience of most experimental scientists, a user-friendly web server has been established at http://jci-bioinfo.cn/domainevo, by which users can easily get their desired results without having to go through the detailed mathematics. Through the simulation results of this website, users can predict the evolutionary trend of protein domain architecture.

13.
Protein Pept Lett ; 27(4): 313-320, 2020.
Article in English | MEDLINE | ID: mdl-31749418

ABSTRACT

BACKGROUND: Information on the quaternary structure attributes of proteins is very important because it is closely related to their biological functions. With the rapid development of new-generation sequencing technology, we are facing a challenge: how to automatically identify the quaternary structural attributes of new polypeptide chains from their sequence information (i.e., whether they form a monomer, a hetero-oligomer, or a homo-oligomer). OBJECTIVE: In this article, our goal is to find a new way to represent protein sequences, thereby improving the prediction rate of protein quaternary structure. METHODS: We developed a prediction system for protein quaternary structural type in which a protein sequence is represented by combining Pfam functional-domain and gene ontology information. Protein features are converted into numerical sequences, and quaternary structure is predicted with specific machine learning and verification algorithms. RESULTS: Our data set contains 5495 protein samples. With the method provided in this paper, we classify proteins as monomers, hetero-oligomers, or homo-oligomers, and the prediction rate is 74.38%, which is 3.24% higher than that of previous studies. With this new feature extraction method, we can further classify the quaternary structure of proteins, and the results are also correspondingly improved. CONCLUSION: After applying the new prediction system, we successfully improved the prediction rate compared with previous results. We have reason to believe that the feature extraction method in this paper has better practicability and can be used as a reference for other protein classification problems.


Subject(s)
Amino Acid Sequence/genetics , Computational Biology , Protein Structure, Quaternary , Proteins/ultrastructure , Algorithms , Gene Ontology , Models, Molecular , Protein Conformation , Proteins/genetics , Sequence Analysis, Protein/methods
14.
Article in English | MEDLINE | ID: mdl-31867311

ABSTRACT

Acetylation is a post-translational modification (PTM) that introduces an acetyl group into an organic compound, often through reaction with acetic acid. Correctly identifying acetylation proteins helps in understanding the mechanism of acetylation in biological systems. Although many acetylation sites have been identified by high-throughput experimental studies via mass spectrometry, many acetylation sites remain to be discovered. Computational methods have shown their power for identifying acetylation sites with informatics techniques, which usually reduce experimental cost and improve effectiveness and efficiency. An approach that can distinguish acetylated proteins from non-acetylated ones would therefore be meaningful and effective for this issue. Here, we propose a novel computational method for identifying acetylation proteins by extracting features from sequence conservation information via a gray system model and from KNN scores based on functional domain annotation and subcellular localization. We performed 5-fold cross-validation on three datasets, along with extensive feature analysis and the Relief feature selection algorithm. The obtained accuracies are all satisfactory: on average, the accuracy is 77.10%, the Matthews correlation coefficient is 0.5457, and the AUC value is 0.8389. This work might provide useful insights for related experimental validation and for further studies of other PTM processes. For the convenience of related researchers, a web server named "iAcetyP" was established and is accessible at http://www.jci-bioinfo.cn/iAcetyP.

15.
Bioinformatics ; 35(23): 4922-4929, 2019 Dec 01.
Article in English | MEDLINE | ID: mdl-31077296

ABSTRACT

MOTIVATION: Dihydrouridine (D) is a common RNA post-transcriptional modification found in eukaryotes, bacteria and a few archaea. The modification can promote the conformational flexibility of individual nucleotide bases, and its levels are increased in cancerous tissues. Therefore, it is necessary to detect D in RNA to further understand its functional roles. Since wet-lab experimental techniques for this purpose are time-consuming and laborious, computational models are urgently needed to identify D modification sites in RNA. RESULTS: We constructed a predictor, called iRNAD, for identifying D modification sites in RNA sequences. In this predictor, the RNA samples derived from five species were encoded by nucleotide chemical property and nucleotide density. A support vector machine was used to perform the classification. The final model achieved an overall accuracy of 96.18% with an area under the receiver operating characteristic curve of 0.9839 in the jackknife cross-validation test. Furthermore, we performed a series of validations from several aspects and demonstrated the robustness and reliability of the proposed model. AVAILABILITY AND IMPLEMENTATION: A user-friendly web server called iRNAD is freely accessible at http://lin-group.cn/server/iRNAD, which will provide convenience and guidance to users for further study of D modification.


Subject(s)
Support Vector Machine , Base Sequence , Computational Biology , Nucleotides , RNA , Reproducibility of Results
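The nucleotide chemical property and nucleotide density encoding named in the abstract above can be sketched as follows. The three-bit property table follows a widely used convention (ring structure, functional group, hydrogen-bond strength) and is an assumption here, since the abstract gives no table; the sequence is a toy input.

```python
# Assumed convention: (purine?, amino group?, weak hydrogen bonding?).
NCP = {"A": (1, 1, 1), "C": (0, 1, 0), "G": (1, 0, 0), "U": (0, 0, 1)}

def encode_ncp_nd(seq):
    """Encode each position as (ring, group, bond, density).

    density = occurrences of this base up to this position / position index.
    Raises KeyError for characters outside A, C, G, U (T is mapped to U).
    """
    seq = seq.upper().replace("T", "U")
    seen = dict.fromkeys(NCP, 0)
    encoded = []
    for position, base in enumerate(seq, start=1):
        seen[base] += 1
        density = seen[base] / position
        encoded.append((*NCP[base], density))
    return encoded

rows = encode_ncp_nd("AUAG")
```

Flattening the per-position tuples gives a fixed-length vector of 4 × L values for a window of length L, which can be passed to a classifier such as an SVM.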
16.
Genomics ; 111(6): 1785-1793, 2019 Dec.
Article in English | MEDLINE | ID: mdl-30529532

ABSTRACT

The promoter is a regulatory DNA region about 81-1000 base pairs long, usually located upstream of a given gene near the transcription start site (TSS). By binding proteins called transcription factors, the promoter provides the starting point for regulated gene transcription and hence plays a vitally important role in gene transcriptional regulation. With the explosive growth of DNA sequences in the post-genomic age, it has become an urgent challenge to develop computational methods for effectively identifying promoters, because the information thus obtained is very useful for both basic research and drug development. Although some prediction methods have been developed in this regard, most of them are limited to identifying whether a query DNA sequence is a promoter or not. However, based on their distinct strength levels for transcriptional activation and expression, promoters should be divided into two categories: strong and weak. Here a new two-layer predictor, called "iPSW(2L)-PseKNC", was developed by fusing the physicochemical properties of nucleotides and their nucleotide density into PseKNC (pseudo K-tuple nucleotide composition). Its first layer predicts whether a query DNA sequence sample is a promoter or not, while its second layer predicts the strength of promoters. Rigorous cross-validation showed that the first-layer sub-predictor is remarkably superior to the existing state-of-the-art predictors in distinguishing promoters from non-promoters, and that the second-layer sub-predictor can do what is beyond the reach of existing predictors. Moreover, the web server for iPSW(2L)-PseKNC has been established at http://www.jci-bioinfo.cn/iPSW(2L)-PseKNC, by which the majority of experimental scientists can easily get the results they need.


Subject(s)
Base Sequence , Promoter Regions, Genetic , Sequence Analysis, DNA , Software , Transcription Initiation Site , Transcriptional Activation
17.
Int J Biol Sci ; 14(8): 883-891, 2018.
Article in English | MEDLINE | ID: mdl-29989083

ABSTRACT

Meiotic recombination is caused by meiotic double-strand DNA breaks. In some regions the frequency of DNA recombination is relatively high, while in other regions it is lower: the former are usually called "recombination hotspots" and the latter "recombination coldspots". Information on hot and cold spots may provide important clues for understanding the mechanism of genome evolution. Therefore, it is important to predict these spots accurately. In this study, we rebuilt the benchmark dataset by unifying its samples to the same length (131 bp). On this foundation, and using an SVM (Support Vector Machine) classifier, a new predictor called "iRSpot-Pse6NC" was developed by incorporating the key hexamer features into the general PseKNC (Pseudo K-tuple Nucleotide Composition) via the binomial distribution approach. Rigorous cross-validation showed that the proposed predictor is superior to its counterparts in overall accuracy, stability, sensitivity and specificity. For the convenience of most experimental scientists, the web server for iRSpot-Pse6NC has been established at http://lin-group.cn/server/iRSpot-Pse6NC, by which users can easily obtain their desired results without the need to go through the detailed mathematical equations involved.


Subject(s)
Computational Biology/methods , Recombination, Genetic/genetics , Saccharomyces cerevisiae/genetics , Algorithms , Sequence Analysis, DNA , Software
18.
Genomics ; 110(5): 239-246, 2018 Sep.
Article in English | MEDLINE | ID: mdl-29107015

ABSTRACT

Lysine crotonylation (Kcr) is an evolution-conserved histone posttranslational modification (PTM), occurring in both human somatic and mouse male germ cell genomes. It is important for male germ cell differentiation. Information of Kcr sites in proteins is very useful for both basic research and drug development. But it is time-consuming and expensive to determine them by experiments alone. Here, we report a novel predictor called iKcr-PseEns that is established by incorporating five tiers of amino acid pairwise couplings into the general pseudo amino acid composition. It has been observed via rigorous cross-validations that the new predictor's sensitivity (Sn), specificity (Sp), accuracy (Acc), and stability (MCC) are 90.53%, 95.27%, 94.49%, and 0.826, respectively. For the convenience of most experimental scientists, a user-friendly web-server for iKcr-PseEns has been established at http://www.jci-bioinfo.cn/iKcr-PseEns, by which users can easily obtain their desired results without the need to go through the complicated mathematical equations involved.


Subject(s)
Histones/metabolism , Protein Processing, Post-Translational , Sequence Analysis, Protein/methods , Software , Crotonates/chemistry , Crotonates/metabolism , Histones/chemistry , Humans , Lysine/chemistry , Lysine/metabolism
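The four metrics reported in the abstract above (Sn, Sp, Acc, MCC) all derive from a binary confusion matrix. A minimal sketch using the standard definitions follows; the counts are invented and do not reproduce the paper's results.

```python
import math

def binary_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, accuracy and Matthews correlation coefficient."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + fn + tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}

# Hypothetical counts: 100 positive and 100 negative samples.
m = binary_metrics(tp=90, fn=10, tn=80, fp=20)
```

MCC ranges from -1 to 1 and stays informative when classes are imbalanced, which is why it is reported alongside accuracy in many of the records listed here.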
19.
Sci Rep ; 7(1): 8222, 2017 Aug 15.
Article in English | MEDLINE | ID: mdl-28811565

ABSTRACT

Gene splicing, such as RNA splicing, is one of the most significant biological processes in eukaryotic gene expression; it can cause a pre-mRNA to produce one or more mature messenger RNAs containing the coded information with multiple biological functions. Thus, identifying splicing sites in DNA/RNA sequences is significant for both biomedical research and the discovery of new drugs. However, relying only on experimental techniques is expensive and time-consuming, so new computational methods are needed. To identify splice donor sites and splice acceptor sites accurately and quickly, a deep sparse auto-encoder model with two hidden layers, called iSS-PC, was constructed based on the minimum error law, in which we incorporated twelve physical-chemical properties of the dinucleotides within DNA into PseDNC to formulate given sequence samples via a battery of cross-covariance and auto-covariance transformations. In this paper, five-fold cross-validation results on the same benchmark datasets indicated that the new predictor remarkably outperformed the existing prediction methods in this field. Furthermore, it is expected that many other related problems can also be studied by this approach. To implement classification accurately and quickly, an easy-to-use web server for identifying splicing sites has been established for free access at: http://www.jci-bioinfo.cn/iSS-PC.
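The auto-covariance and cross-covariance transformations mentioned in the abstract above turn a variable-length series of property values into fixed-length features. The sketch below uses the commonly cited definitions, which the abstract does not spell out, and the property values are hypothetical.

```python
def auto_covariance(values, lag):
    """Mean product of centered values separated by `lag` positions."""
    n = len(values)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(values) - 1")
    mean = sum(values) / n
    total = sum((values[i] - mean) * (values[i + lag] - mean) for i in range(n - lag))
    return total / (n - lag)

def cross_covariance(a, b, lag):
    """Like auto_covariance, but between two different property series."""
    n = len(a)
    if len(b) != n or not 0 < lag < n:
        raise ValueError("series must have equal length and 0 < lag < length")
    mean_a, mean_b = sum(a) / n, sum(b) / n
    total = sum((a[i] - mean_a) * (b[i + lag] - mean_b) for i in range(n - lag))
    return total / (n - lag)

# Hypothetical per-dinucleotide values of two physicochemical properties.
ac = auto_covariance([1, 2, 3, 4], lag=1)
cc = cross_covariance([1, 2, 3, 4], [4, 3, 2, 1], lag=1)
```

Computing these for every property (and property pair) over a fixed set of lags yields a feature vector whose length does not depend on the sequence length.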

20.
Med Chem ; 13(8): 734-743, 2017.
Article in English | MEDLINE | ID: mdl-28641529

ABSTRACT

OBJECTIVE: As a kind of post-transcriptional modification (PTCM) in RNA, 2'-O-methylation occurs in the processes of life development and disease formation. Accordingly, from the angles of both basic research and drug development, we are facing a challenging problem: given an uncharacterized RNA sequence formed by many nucleotides of A (adenine), C (cytosine), G (guanine), and U (uracil), which ones can undergo 2'-O-methylation, and which ones cannot? Unfortunately, so far no computational method whatsoever has been developed to address such a problem. METHOD: To fill this gap, we propose a predictor called iRNA-2methyl. It is formed by incorporating a series of sequence-coupled factors into the general PseKNC (pseudo K-tuple nucleotide composition), followed by fusing 12 basic random forest classifiers into four ensemble predictors, each aimed at identifying the cases of A, C, G, or U along the RNA sequence concerned. RESULTS: Rigorous jackknife cross-validations have indicated that the success rates are very high (>93%). For the convenience of most experimental scientists, a user-friendly web server for iRNA-2methyl has been established at http://www.jci-bioinfo.cn/iRNA-2methyl, by which users can easily obtain their desired results without the need to go through the complicated mathematical equations involved. CONCLUSION: The proposed predictor iRNA-2methyl will become a very useful bioinformatics tool for medicinal chemistry, helping to design effective drugs against the diseases related to 2'-O-methylation.


Subject(s)
Nucleotides/chemistry , RNA/chemistry , User-Computer Interface , Chemistry, Pharmaceutical , Computational Biology , Humans , Methylation , Nucleic Acid Conformation