1.
Front Genet ; 12: 669328, 2021.
Article in English | MEDLINE | ID: mdl-33959153

ABSTRACT

Antimicrobial peptides (AMPs) are considered potential substitutes for antibiotics in the design of new anti-infective drugs. Several machine learning algorithms and web servers exist for identifying AMPs and their functional activities. However, there is still room for improvement in prediction algorithms and feature extraction methods. The reduced amino acid (RAA) alphabet effectively addresses the problems of simplifying protein complexity and recognizing structurally conserved regions. This article evaluates in detail the performance of more than 5,000 reduced amino acid descriptors generated from 74 types of reduced amino acid alphabet, in a first stage and a second stage, to construct a two-stage classifier, Identification of Antimicrobial Peptides by Reduced Amino Acid Cluster (iAMP-RAAC), for identifying AMPs and their functional activities, respectively. The results show that the first-stage AMP classifier achieves accuracies of 97.21% and 97.11% on the training dataset and the independent test dataset, respectively. In the second stage, our classifier still shows good performance: at least three of the four metrics, sensitivity (SN), specificity (SP), accuracy (ACC), and Matthews correlation coefficient (MCC), exceed the results reported in the literature. Further, ANOVA with incremental feature selection (IFS) is used for feature selection, which further improves the prediction performance of each stage. Finally, a user-friendly web server, iAMP-RAAC, is established at http://bioinfor.imu.edu.cn/iampraac.
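
The four metrics named in this abstract (SN, SP, ACC, MCC) are standard confusion-matrix quantities. As a rough illustration only, and not the authors' code, they could be computed as in the sketch below. The function name `binary_metrics` and the example counts are assumptions made up for this sketch.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Return SN, SP, ACC and MCC from confusion-matrix counts.

    Hypothetical helper for illustration; counts are plain integers.
    """
    sn = tp / (tp + fn) if (tp + fn) else 0.0   # sensitivity (recall on positives)
    sp = tn / (tn + fp) if (tn + fp) else 0.0   # specificity (recall on negatives)
    total = tp + tn + fp + fn
    acc = (tp + tn) / total if total else 0.0   # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SN": sn, "SP": sp, "ACC": acc, "MCC": mcc}


# Invented counts, used only to exercise the function.
print(binary_metrics(tp=90, tn=80, fp=20, fn=10))
```

MCC is often reported alongside accuracy because it stays informative when the positive and negative classes differ in size.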

2.
Genomics ; 112(1): 853-858, 2020 01.
Article in English | MEDLINE | ID: mdl-31170440

ABSTRACT

Abnormal histone modifications (HMs) and transcription factors (TFs) can alter the expression of cancer-related genes to promote tumorigenesis. We studied the variations of 11 HMs and 2 TFs in human breast cancer cells (MCF-7) compared to human normal mammary epithelial cells (HMEC), and the effects of HMs/TFs in various regions of the genome on the expression changes of breast cancer-related genes. Based on the differences in HM and TF signals between MCF-7 and HMEC in the regions flanking TSSs, the up- and down-regulated genes in MCF-7 were predicted by Random Forest, and important HMs and regions were found. Results indicate that H3K79me2, H3K27ac, and H3K4me1 are particularly important for the changes of gene expression in MCF-7. In particular, H3K79me2 around the 60th bin flanking TSSs may be the key for regulating gene expression. Our studies reveal that H3K79me2 may be a core HM for breast cancer.


Subject(s)
Breast Neoplasms/genetics , Gene Expression Regulation, Neoplastic , Histone Code , Breast Neoplasms/metabolism , Female , Humans , MCF-7 Cells , Transcription Factors/metabolism , Transcription Initiation Site
3.
DNA Cell Biol ; 38(1): 49-62, 2019 Jan.
Article in English | MEDLINE | ID: mdl-30346835

ABSTRACT

Breast cancer has a high mortality rate for females. Aberrant DNA methylation plays a crucial role in the occurrence and progression of breast carcinoma. By comparing DNA methylation differences between breast tumor tissue and normal breast tissue, we calculate and analyze the distributions of the hyper- and hypomethylation sites in different functional regions. Results indicate that enhancer regions are often hypomethylated in breast cancer. CpG islands (CGIs) are mainly hypermethylated, while the regions flanking CGIs (shores and shelves) are more easily hypomethylated. The hypomethylation in the gene body region is related to the upregulation of gene expression, and the hypomethylation of enhancer regions is closely associated with gene expression upregulation in breast cancer. Some key hypomethylation sites in enhancer regions and key hypermethylation sites in CGIs that regulate key genes are found, such as the oncogenes ESR1 and ERBB2 and the tumor suppressor genes FBLN2, CEBPA, and FAT4. This suggests that recognizing the methylation status of these genes will be useful for the diagnosis of breast cancer.


Subject(s)
Breast Neoplasms/genetics , DNA Methylation/genetics , Gene Expression Regulation, Neoplastic/genetics , Breast Neoplasms/metabolism , CpG Islands , Female , Humans
4.
Gene ; 592(1): 227-234, 2016 Oct 30.
Article in English | MEDLINE | ID: mdl-27468948

ABSTRACT

Epigenetic factors are known to correlate with gene expression in existing studies. However, quantitative models that accurately classify highly and lowly expressed genes based on epigenetic factors are currently lacking. In this study, a new machine learning method that combines histone modifications, DNA methylation, DNA accessibility, transcription factors, and trinucleotide composition with support vector machines (SVM) is developed in the context of the human embryonic stem cell line (H1). The results indicate that the predictive accuracy is markedly improved when the epigenetic features are considered. The predictive accuracy and Matthews correlation coefficient of the best model are as high as 95.96% and 0.92 for the 10-fold cross-validation test, and 95.58% and 0.92 for the independent dataset test, respectively. Our model provides a good way to judge whether a gene is highly or lowly expressed by using genetic and epigenetic data when the expression data of the gene is lacking. A web server, GECES, for our analysis method is established at http://202.207.14.87:8032/fuwu/GECES/index.asp, so that other scientists can easily get their desired results from our web server without going through the mathematical details.


Subject(s)
Base Composition , Embryonic Stem Cells/metabolism , Epigenesis, Genetic , Machine Learning , Cell Line , Humans
5.
Gene ; 575(1): 90-100, 2016 Jan 01.
Article in English | MEDLINE | ID: mdl-26302750

ABSTRACT

It is well known that histone modifications are associated with gene expression. In order to further study this relationship, 16 kinds of ChIP-seq histone modification data and mRNA-seq data of the human embryonic stem cell H1 are chosen. The distributions of histone modifications in the regions flanking transcription start sites (TSSs) for highly expressed and lowly expressed genes are computed, respectively. Four types of distributions of histone modifications in regions flanking TSSs and the spatial patterning of the correlations between histone modifications and gene expression are detected. Our results suggest that the correlations in the regions overlapped by peaks are higher than in the non-overlapped ones for each histone modification. In addition, to obtain the effect of the cooperative action of histone modifications on gene expression, five histone modification clusters are found in highly expressed and lowly expressed genes, and a histone modification and gene expression interaction network is constructed. To further explore which region is the main target region for a specific histone modification, the human genes are divided into five functional regions. The results indicate that histone modifications are mostly located in the promoters of highly expressed genes versus the exons of lowly expressed genes, and exons have a smaller range of normalized tag counts than other gene elements in the two groups of genes. Finally, the type specificity and regional bias of histone modifications for 11 key transcription factor genes regulating stem cell renewal are analyzed.


Subject(s)
Embryonic Stem Cells/metabolism , Gene Expression Regulation/physiology , Histones/metabolism , Protein Processing, Post-Translational/physiology , Transcription Factors/metabolism , Cell Line , Embryonic Stem Cells/cytology , Humans
6.
Mol Biosyst ; 11(3): 950-7, 2015 Mar.
Article in English | MEDLINE | ID: mdl-25607774

ABSTRACT

Membrane transporters play crucial roles in the fundamental cellular processes of living organisms. Computational techniques are necessary to annotate transporter functions. In this study, a multi-class K-nearest neighbor classifier based on the increment of diversity (KNN-ID) was developed to discriminate the membrane transporter types, with the increment of diversity (ID) introduced as a novel similarity distance. Comparisons with multiple recently published methods showed that the proposed KNN-ID method outperformed the other methods, obtaining more than 20% improvement in overall accuracy. The overall prediction accuracy reached 83.1% when K was set to 2. The prediction sensitivity was 76.7%, 89.1%, and 80.1% for channels/pores, electrochemical potential-driven transporters, and primary active transporters, respectively. Discrimination and comparison between any two different classes of transporters further demonstrated that the proposed method is a potential classifier and will play a complementary role in facilitating the functional assignment of transporters.


Subject(s)
Computational Biology/methods , Membrane Transport Proteins/chemistry , Algorithms , Amino Acids/chemistry , Databases, Protein , Membrane Transport Proteins/classification , Reproducibility of Results
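
The abstract above does not state the formula for the increment of diversity. The sketch below assumes the definition commonly given in the literature: for a count vector with total N, the diversity is D = N ln N - sum(n_i ln n_i), and ID(X, Y) = D(X + Y) - D(X) - D(Y). The use of plain amino acid composition as the feature, the function names, and the toy sequences are all assumptions for illustration, not the authors' implementation.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """Amino acid counts in a fixed order (assumed feature for this sketch)."""
    counts = Counter(seq.upper())
    return [counts.get(aa, 0) for aa in AMINO_ACIDS]


def diversity(counts):
    """D = N ln N - sum(n_i ln n_i), with 0 ln 0 taken as 0."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return total * math.log(total) - sum(n * math.log(n) for n in counts if n > 0)


def increment_of_diversity(x, y):
    """ID(X, Y) = D(X + Y) - D(X) - D(Y); smaller means more similar."""
    merged = [a + b for a, b in zip(x, y)]
    return diversity(merged) - diversity(x) - diversity(y)


def knn_id_predict(query, training, k=2):
    """Majority label among the k training sequences with the smallest ID."""
    q = composition(query)
    scored = sorted(
        (increment_of_diversity(q, composition(seq)), label)
        for seq, label in training
    )
    votes = Counter(label for _, label in scored[:k])
    # On a tie, Counter.most_common returns the label seen first,
    # which here is the label of the nearer neighbour.
    return votes.most_common(1)[0][0]


# Toy data invented for this sketch.
training = [("AAAAKK", "a"), ("AAAKKK", "a"), ("WWWWYY", "b"), ("WWYYYY", "b")]
print(knn_id_predict("AAKKAA", training, k=2))
```

Under this definition, two sources with proportional counts have an ID of zero, which is why it can serve as a distance-like score inside a nearest-neighbor vote.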
7.
ScientificWorldJournal ; 2014: 864135, 2014.
Article in English | MEDLINE | ID: mdl-25110749

ABSTRACT

The chemical shift is sensitive to changes in the local environments and can report the structural changes. The structure information of a protein can be represented by the average chemical shifts (ACS) composition, which has been broadly applied for enhancing the prediction accuracy in protein subcellular locations and protein classification. However, different kinds of ACS composition can solve different problems. We established an online web server named acACS, which can convert secondary structure into average chemical shift and then compose the vector for representing a protein by using the algorithm of auto covariance. Our solution is easy to use and can meet the needs of users.


Subject(s)
Intracellular Space/metabolism , Models, Chemical , Proteins/chemistry , Proteins/metabolism , Algorithms , Models, Biological , Protein Structure, Secondary , Protein Transport
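
The abstract above names auto covariance but does not give a formula. One commonly used form, assumed here, is AC(lag) = 1/(L - lag) * sum((x_i - mean)(x_{i+lag} - mean)) over a per-residue numeric series x of length L. The sketch below applies it to an invented series of per-residue values; the function name, the lag range, and the numbers are hypothetical and are not taken from the acACS server.

```python
def auto_covariance(values, max_lag):
    """Auto covariance of a numeric series for lags 1..max_lag.

    Assumes max_lag < len(values). Returns one value per lag, so a
    protein of any length maps to a fixed-length vector.
    """
    n = len(values)
    if not 0 < max_lag < n:
        raise ValueError("max_lag must satisfy 0 < max_lag < len(values)")
    mean = sum(values) / n
    centered = [v - mean for v in values]
    return [
        sum(centered[i] * centered[i + lag] for i in range(n - lag)) / (n - lag)
        for lag in range(1, max_lag + 1)
    ]


# Invented per-residue values standing in for averaged chemical shifts.
series = [1.0, 2.0, 3.0, 4.0]
print(auto_covariance(series, max_lag=2))
```

The fixed output length is the practical appeal: sequences of different lengths become vectors of the same dimension, which most classifiers require.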
8.
Anal Biochem ; 458: 14-9, 2014 Aug 01.
Article in English | MEDLINE | ID: mdl-24802134

ABSTRACT

Peroxidases as universal enzymes are essential for the regulation of reactive oxygen species levels and play major roles in both disease prevention and human pathologies. Automated prediction of functional protein localization is rarely reported, yet it is important for designing new drugs and drug targets. In this study, we first propose a support vector machine (SVM)-based method to predict peroxidase subcellular localization. Various Chou's pseudo amino acid descriptors and gene ontology (GO)-homology patterns were selected as input features to a multiclass SVM. Prediction results showed that the smoothed PSSM encoding pattern performed better than the other approaches. The best overall prediction accuracy was 87.0% in a jackknife test using a PSSM profile pattern with width=5. We also demonstrate that the present GO annotation is far from complete or deep enough for annotating proteins with a specific function.


Subject(s)
Peroxidase/analysis , Support Vector Machine , Amino Acids/chemistry , Amino Acids/metabolism , Chloroplasts/enzymology , Chloroplasts/metabolism , Cytoplasm/enzymology , Cytoplasm/metabolism , Databases, Factual , Dipeptides/chemistry , Dipeptides/metabolism , Humans , Mitochondria/enzymology , Mitochondria/metabolism , Peroxisomes/enzymology , Peroxisomes/metabolism
9.
Genomics ; 102(4): 215-22, 2013 Oct.
Article in English | MEDLINE | ID: mdl-23891614

ABSTRACT

For a successful RNA interference (RNAi) experiment, selecting the small interfering RNA (siRNA) candidates which maximize the knockdown effect on the given gene is the critical step. Although various computational approaches have been attempted, the design of efficient siRNA candidates is still far from satisfactory. In this study, we proposed a novel feature selection algorithm combining random forest and support vector machine to predict active siRNAs. Using a publicly available dataset, we demonstrated that the predictive accuracy would be markedly improved when the context sequence features outside the target site were included. The Pearson correlation coefficient for regression is as high as 0.721, compared to 0.671, 0.668, 0.680, and 0.645 for Biopredsi, i-score, ThermoComposition21, and DSIR, respectively. It revealed that siRNA-target interaction requires appropriate sequence context not only in the target site but also in a broad region flanking the target site.


Subject(s)
Computational Biology/methods , RNA Interference , RNA, Small Interfering/genetics , Algorithms , Base Sequence , Databases, Genetic , Models, Molecular , Regression Analysis , Sequence Analysis, RNA , Support Vector Machine
10.
Bioinformatics ; 29(6): 678-85, 2013 Mar 15.
Article in English | MEDLINE | ID: mdl-23335013

ABSTRACT

MOTIVATION: Protein-DNA interactions often take part in various crucial processes, which are essential for cellular function. The identification of DNA-binding sites in proteins is important for understanding the molecular mechanisms of protein-DNA interaction. Thus, we have developed an improved method to predict DNA-binding sites by integrating structural alignment algorithm and support vector machine-based methods. RESULTS: Evaluated on a new non-redundant protein set with 224 chains, the method has 80.7% sensitivity and 82.9% specificity in the 5-fold cross-validation test. In addition, it predicts DNA-binding sites with 85.1% sensitivity and 85.3% specificity when tested on a dataset with 62 protein-DNA complexes. Compared with a recently published method, BindN+, our method predicts DNA-binding sites with a 7% better area under the receiver operating characteristic curve value when tested on the same dataset. Many important problems in cell biology require the dense non-linear interactions between functional modules be considered. Thus, our prediction method will be useful in detecting such complex interactions.


Subject(s)
Algorithms , DNA-Binding Proteins/chemistry , DNA/chemistry , Binding Sites , DNA/metabolism , DNA-Binding Proteins/metabolism , Protein Structure, Secondary , ROC Curve , Sequence Analysis, Protein , Support Vector Machine
11.
Amino Acids ; 44(2): 573-80, 2013 Feb.
Article in English | MEDLINE | ID: mdl-22851052

ABSTRACT

The successful prediction of thermophilic proteins is useful for designing stable enzymes that are functional at high temperature. We have used the increment of diversity (ID), a novel amino acid composition-based similarity distance, in a 2-class K-nearest neighbor classifier to classify thermophilic and mesophilic proteins, and the resulting KNN-ID classifier was successfully developed to predict the thermophilic proteins. Instead of extracting features from protein sequences as done previously, our approach was based on a diversity measure of symbol sequences. The similarity distance between each pair of protein sequences was first calculated to quantitatively measure the similarity level of one given sequence and the other. The class of the query protein is then determined using the K-nearest neighbor algorithm. Comparisons with multiple recently published methods showed that the KNN-ID proposed in this study outperforms the other methods. The improved predictive performance indicated that it is a simple and effective classifier for discriminating thermophilic and mesophilic proteins. Finally, the influence of protein length and protein identity on prediction accuracy was discussed further. The prediction model and dataset used in this article can be freely downloaded from http://wlxy.imu.edu.cn/college/biostation/fuwu/KNN-ID/index.htm .


Subject(s)
Proteins/chemistry , Sequence Alignment/methods , Sequence Analysis, Protein/methods , Algorithms , Databases, Protein , Sequence Homology, Amino Acid
12.
PLoS One ; 7(10): e47843, 2012.
Article in English | MEDLINE | ID: mdl-23144709

ABSTRACT

Nucleosome positioning has important roles in key cellular processes. Although intensive efforts have been made in this area, the rules defining nucleosome positioning are still elusive and debated. In this study, we carried out a systematic comparison among the profiles of twelve DNA physicochemical features between the nucleosomal and linker sequences in the Saccharomyces cerevisiae genome. We found that nucleosomal sequences have some position-specific physicochemical features, which can be used for in-depth study of nucleosomes. Meanwhile, a new predictor, called iNuc-PhysChem, was developed for identification of nucleosomal sequences by incorporating these physicochemical properties into a 1788-D (dimensional) feature vector, which was further reduced to an 884-D vector via the IFS (incremental feature selection) procedure to optimize the feature set. It was observed by a cross-validation test on a benchmark dataset that the overall success rate achieved by iNuc-PhysChem was over 96% in identifying nucleosomal or linker sequences. As a web server, iNuc-PhysChem is freely accessible to the public at http://lin.uestc.edu.cn/server/iNuc-PhysChem. For the convenience of the vast majority of experimental scientists, a step-by-step guide is provided on how to use the web server to get the desired results without the need to follow the complicated mathematics that were presented just for the integrity in developing the predictor. Meanwhile, for those who prefer to run predictions on their own computers, the predictor's code can be easily downloaded from the web server. It is anticipated that iNuc-PhysChem may become a useful high-throughput tool for both basic research and drug design.


Subject(s)
Algorithms , Computational Biology/methods , DNA, Fungal/genetics , Nucleosomes/genetics , Saccharomyces cerevisiae/genetics , Base Sequence , DNA, Fungal/chemistry , DNA, Fungal/metabolism , Histones/chemistry , Histones/genetics , Histones/metabolism , Internet , Models, Genetic , Models, Molecular , Molecular Sequence Data , Nucleosomes/chemistry , Nucleosomes/metabolism , Saccharomyces cerevisiae/metabolism , Saccharomyces cerevisiae Proteins/chemistry , Saccharomyces cerevisiae Proteins/genetics , Saccharomyces cerevisiae Proteins/metabolism
13.
Genomics ; 97(2): 112-20, 2011 Feb.
Article in English | MEDLINE | ID: mdl-21112384

ABSTRACT

Accurate identification of core promoters is important for gaining insight into eukaryotic transcription regulation. In this study, the authors focused on biologically realistic promoter prediction in plant genomes. By analyzing the correlative conservation, GC-compositional bias, and specific structural patterns of TATA and TATA-less promoters in PlantPromDB, a hybrid multi-feature approach based on support vector machine (SVM) for predicting the two types of promoters was developed by integrating local word content, GC-Skew, and DNA geometric flexibility. Compared with the TSSP-TCM program on the same test dataset, better prediction results were obtained. In particular, for the TATA-less promoter, the accuracy is 10% higher than the result of the TSSP-TCM program. The good performance of the hybrid promoters and the experimental data also indicate that our method has the ability to locate the promoter region of the plant genome.


Subject(s)
Arabidopsis/genetics , Gene Expression Regulation, Plant , Genome, Plant/genetics , Promoter Regions, Genetic , Base Composition , Base Sequence , Computational Biology , Data Mining , Sequence Analysis, DNA
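
GC-Skew, one of the features named in the abstract above, is conventionally defined as (G - C) / (G + C) over a stretch of sequence. The sketch below computes it over sliding windows. The window and step sizes, the function names, and the example sequence are assumptions for illustration; the abstract does not say which settings the authors used.

```python
def gc_skew(seq):
    """(G - C) / (G + C) for one sequence; 0.0 when it has no G or C."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0


def windowed_gc_skew(seq, window=50, step=10):
    """GC skew for each full window, moving `step` bases at a time.

    Window and step are illustrative defaults. A sequence shorter than
    `window` yields a single value computed over the whole sequence.
    """
    last_start = max(len(seq) - window + 1, 1)
    return [gc_skew(seq[i:i + window]) for i in range(0, last_start, step)]


# Invented sequence, used only to show the sign change along a strand.
print(windowed_gc_skew("GGGGCCCC", window=4, step=4))
```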
14.
Amino Acids ; 38(3): 859-67, 2010 Mar.
Article in English | MEDLINE | ID: mdl-19387791

ABSTRACT

Due to the complexity of the Plasmodium falciparum genome, predicting secretory proteins of P. falciparum is more difficult than in other species. In this study, based on the measure of diversity definition, a new K-nearest neighbor method, K-minimum increment of diversity (K-MID), is introduced to predict secretory proteins. The prediction performance of K-MID using amino acid composition as the only input vector achieves 88.89% accuracy with a Matthews correlation coefficient (MCC) of 0.78. Further, several reduced amino acid alphabets are applied to predict secretory proteins, and the results show that the prediction is improved to 90.67% accuracy with 0.83 MCC by using the 169 dipeptide compositions of the reduced amino acid alphabets obtained from the Protein Blocks method.


Subject(s)
Amino Acids/chemistry , Plasmodium falciparum/chemistry , Plasmodium falciparum/genetics , Protozoan Proteins/chemistry , Algorithms , Amino Acid Sequence , Amino Acids/classification , Artificial Intelligence , Computational Biology/methods , Databases, Protein , Dipeptides/chemistry , Genetic Variation , Malaria Vaccines/immunology , Models, Biological , Plasmodium falciparum/immunology , Protozoan Proteins/classification , Protozoan Proteins/metabolism
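
A reduced alphabet with 13 letters gives 13 x 13 = 169 possible dipeptides, which matches the feature count mentioned in the abstract above. The sketch below shows how such a composition vector could be built. The 13-cluster grouping used here is purely hypothetical and was chosen only so that the arithmetic works; it is not the Protein Blocks-derived alphabet from the paper, and the function names are likewise assumptions.

```python
from itertools import product

# Hypothetical grouping of the 20 amino acids into 13 clusters
# (NOT the alphabet used in the paper).
HYPOTHETICAL_CLUSTERS = [
    "AG", "C", "DE", "FY", "H", "ILV", "KR", "M", "NQ", "P", "S", "T", "W",
]

# Each residue maps to the first letter of its cluster.
LETTER_OF = {aa: cluster[0] for cluster in HYPOTHETICAL_CLUSTERS for aa in cluster}
REDUCED_LETTERS = [cluster[0] for cluster in HYPOTHETICAL_CLUSTERS]
DIPEPTIDES = ["".join(pair) for pair in product(REDUCED_LETTERS, repeat=2)]


def reduce_sequence(seq):
    """Rewrite a protein sequence in the reduced alphabet, skipping unknowns."""
    return "".join(LETTER_OF[aa] for aa in seq.upper() if aa in LETTER_OF)


def dipeptide_composition(seq):
    """Frequency of each of the 169 reduced dipeptides, in a fixed order."""
    reduced = reduce_sequence(seq)
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(len(reduced) - 1):
        counts[reduced[i:i + 2]] += 1
    total = max(len(reduced) - 1, 1)
    return [counts[d] / total for d in DIPEPTIDES]


print(len(DIPEPTIDES), reduce_sequence("GLVE"))
```

Compared with the 400 dipeptides of the full 20-letter alphabet, the reduced version yields a much shorter vector, which is the kind of simplification these reduced-alphabet methods rely on.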
15.
Peptides ; 30(10): 1788-93, 2009 Oct.
Article in English | MEDLINE | ID: mdl-19591890

ABSTRACT

Defensins are essentially ancient natural antibiotics with potent activity extending from lower organisms to humans. They can inhibit the growth or virulence of micro-organisms directly, or indirectly enhance the host's immune system. The successful prediction of defensin peptides will provide very useful information and insights for the basic research of defensins. In this study, by selecting the N-peptide composition of the reduced amino acid alphabet (RAAA) obtained from the structural alphabet named Protein Blocks as the feature parameters, the increment of diversity (ID) is developed for the first time to predict defensin family and subfamily. The jackknife test based on the 2-peptide composition of the RAAA with 13 reduced amino acids shows that the overall prediction accuracies are 91.36% for defensin family and 94.21% for defensin subfamily. The results indicate that ID_RAAA is a simple and efficient prediction method for defensin peptides.


Subject(s)
Amino Acids , Amino Acid Sequence , Amino Acids/chemistry , Amino Acids/genetics , Animals , Defensins/chemistry , Defensins/genetics , Models, Genetic , Molecular Sequence Data , Plant Proteins/chemistry , Plant Proteins/genetics , Sequence Homology, Amino Acid