Results 1 - 20 of 38
1.
Article in English | MEDLINE | ID: mdl-37957897

ABSTRACT

BACKGROUND: Colorectal cancer (CRC) has a very high incidence and lethality rate and is one of the most dangerous cancer types. Timely diagnosis can effectively reduce the incidence of colorectal cancer. Changes in para-cancerous tissues may serve as an early signal for tumorigenesis. Comparison of the differences in gene expression between para-cancerous and normal mucosa can help in the diagnosis of CRC and understanding the mechanisms of development. OBJECTIVES: This study aimed to identify specific genes at the level of gene expression, which are expressed in normal mucosa and may be predictive of CRC risk. METHODS: A machine learning approach was used to analyze transcriptomic data in 459 samples of normal colonic mucosal tissue from 322 CRC cases and 137 non-CRC cases, in which each sample contained 28,706 gene expression levels. The genes were ranked using four ranking methods based on importance estimation (LASSO, LightGBM, MCFS, and mRMR), and four classification algorithms (decision tree [DT], K-nearest neighbor [KNN], random forest [RF], and support vector machine [SVM]) were combined with the incremental feature selection [IFS] method to construct a prediction model with excellent performance. RESULT: The top-ranked genes, namely, HOXD12, CDH1, and S100A12, were associated with tumorigenesis based on previous studies. CONCLUSION: This study summarized four sets of quantitative classification rules based on the DT algorithm, providing clues for understanding the microenvironmental changes caused by CRC. According to the rules, the effect of CRC on normal mucosa can be determined.

2.
Biology (Basel) ; 12(7)2023 Jul 02.
Article in English | MEDLINE | ID: mdl-37508378

ABSTRACT

As COVID-19 develops, dynamic changes occur in the patient's immune system. Changes in molecular levels in different immune cells can reflect the course of COVID-19. This study aims to uncover the molecular characteristics of different immune cell subpopulations at different stages of COVID-19. We designed a machine learning workflow to analyze scRNA-seq data of three immune cell types (B, T, and myeloid cells) in four levels of COVID-19 severity/outcome. The datasets for three cell types included 403,700 B-cell, 634,595 T-cell, and 346,547 myeloid cell samples. Each cell subtype was divided into four groups (control, convalescence, progression mild/moderate, and progression severe/critical), and each immune cell contained 27,943 gene features. A feature analysis procedure was applied to the data of each cell type. Irrelevant features were first excluded according to their relevance to the target variable measured by mutual information. Then, four ranking algorithms (least absolute shrinkage and selection operator, light gradient boosting machine, Monte Carlo feature selection, and max-relevance and min-redundancy) were adopted to analyze the remaining features, resulting in four feature lists. These lists were fed into the incremental feature selection, incorporating three classification algorithms (decision tree, k-nearest neighbor, and random forest) to extract key gene features and construct classifiers with superior performance. The results confirmed that genes such as PFN1, RPS26, and FTH1 played important roles in SARS-CoV-2 infection. These findings provide a useful reference for the understanding of the ongoing effect of COVID-19 development on the immune system.
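The incremental feature selection (IFS) step named in this abstract can be sketched as follows. This is a minimal illustration and not the authors' code: the toy data, the 1-nearest-neighbor classifier, the leave-one-out scoring, and every function name are assumptions chosen to keep the example free of third-party dependencies.

```python
from typing import List, Sequence, Tuple

Sample = Sequence[float]


def loo_accuracy_1nn(X: List[Sample], y: List[int], cols: List[int]) -> float:
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier on the chosen columns."""
    correct = 0
    for i, xi in enumerate(X):
        best_dist, best_label = float("inf"), None
        for j, xj in enumerate(X):
            if i == j:
                continue  # the held-out sample may not vote for itself
            dist = sum((xi[c] - xj[c]) ** 2 for c in cols)
            if dist < best_dist:
                best_dist, best_label = dist, y[j]
        correct += best_label == y[i]
    return correct / len(X)


def incremental_feature_selection(
    X: List[Sample], y: List[int], ranked: List[int]
) -> Tuple[List[int], float]:
    """Score the top-1, top-2, ... prefixes of a ranked feature list and keep the best one."""
    best_cols: List[int] = []
    best_score = -1.0
    for k in range(1, len(ranked) + 1):
        cols = ranked[:k]
        score = loo_accuracy_1nn(X, y, cols)
        if score > best_score:  # strict comparison keeps the shortest best prefix
            best_cols, best_score = cols, score
    return best_cols, best_score


# Hypothetical toy data: feature 0 separates the classes, feature 1 is noise.
X = [(0.0, 5.0), (0.1, -3.0), (0.2, 4.0), (1.0, -4.0), (1.1, 5.0), (1.2, -3.0)]
y = [0, 0, 0, 1, 1, 1]
best_cols, best_score = incremental_feature_selection(X, y, ranked=[0, 1])
```

In practice the ranked list would come from a ranking algorithm such as those listed in the abstract, and the scoring function would be whichever classifier and validation scheme is being evaluated.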

3.
Life (Basel) ; 13(6)2023 May 31.
Article in English | MEDLINE | ID: mdl-37374086

ABSTRACT

Vaccines trigger an immunological response that includes B and T cells, with B cells producing antibodies. SARS-CoV-2 immunity weakens over time after vaccination. Discovering key changes in antigen-reactive antibodies over time after vaccination could help improve vaccine efficiency. In this study, we collected data on blood antibody levels in a cohort of healthcare workers vaccinated for COVID-19 and obtained 73 antigens in samples from four groups according to the duration after vaccination, including 104 unvaccinated healthcare workers, 534 healthcare workers within 60 days after vaccination, 594 healthcare workers between 60 and 180 days after vaccination, and 141 healthcare workers over 180 days after vaccination. Our work was a reanalysis of the data originally collected at Irvine University. This data was obtained in Orange County, California, USA, with the collection process commencing in December 2020. British variant (B.1.1.7), South African variant (B.1.351), and Brazilian/Japanese variant (P.1) were the most prevalent strains during the sampling period. An efficient machine learning based framework containing four feature selection methods (least absolute shrinkage and selection operator, light gradient boosting machine, Monte Carlo feature selection, and maximum relevance minimum redundancy) and four classification algorithms (decision tree, k-nearest neighbor, random forest, and support vector machine) was designed to select essential antibodies against specific antigens. Several efficient classifiers with a weighted F1 value around 0.75 were constructed. The antigen microarray used for identifying antibody levels in the coronavirus features ten distinct SARS-CoV-2 antigens, comprising various segments of both nucleocapsid protein (NP) and spike protein (S). 
This study revealed that S1 + S2, S1.mFcTag, S1.HisTag, S1, S2, Spike.RBD.His.Bac, Spike.RBD.rFc, and S1.RBD.mFc were most highly ranked among all features, where S1 and S2 are the subunits of Spike, and the suffixes represent the tagging information of different recombinant proteins. Meanwhile, the classification rules were obtained from the optimal decision tree to explain quantitatively the roles of antigens in the classification. This study identified antibodies associated with decreased clinical immunity based on populations with different time spans after vaccination. These antibodies have important implications for maintaining long-term immunity to SARS-CoV-2.

4.
Int J Mol Sci ; 20(9)2019 May 02.
Article in English | MEDLINE | ID: mdl-31052553

ABSTRACT

Small nucleolar RNAs (snoRNAs) are a new type of functional small RNAs involved in the chemical modifications of rRNAs, tRNAs, and small nuclear RNAs. It is reported that they play important roles in tumorigenesis via various regulatory modes. snoRNAs can both participate in the regulation of methylation and pseudouridylation and regulate the expression pattern of their host genes. This research investigated the expression pattern of snoRNAs in eight major cancer types in TCGA via several machine learning algorithms. The expression levels of snoRNAs were first analyzed by a powerful feature selection method, Monte Carlo feature selection (MCFS). A feature list and some informative features were obtained. Then, the incremental feature selection (IFS) was applied to the feature list to extract the optimal features/snoRNAs, with which the support vector machine (SVM) yields the best performance. The discriminative snoRNAs included HBII-52-14, HBII-336, SNORD123, HBII-85-29, HBII-420, U3, HBI-43, SNORD116, SNORA73B, SCARNA4, HBII-85-20, etc., on which the SVM can provide a Matthews correlation coefficient (MCC) of 0.881 for predicting these eight cancer types. On the other hand, the informative features were fed into the Johnson reducer and repeated incremental pruning to produce error reduction (RIPPER) algorithms to generate classification rules, which can clearly show different snoRNA expression patterns in different cancer types. The analysis results indicated that the extracted discriminative snoRNAs can be important for identifying cancer samples of different types and that the expression pattern of snoRNAs in different cancer types can be partly uncovered by quantitative recognition rules.


Subject(s)
Gene Expression Regulation, Neoplastic , Machine Learning , Neoplasms/genetics , RNA, Small Nucleolar/genetics , Algorithms , Humans , Monte Carlo Method , Support Vector Machine
5.
PLoS One ; 9(2): e87791, 2014.
Article in English | MEDLINE | ID: mdl-24498372

ABSTRACT

Cancer, which is a leading cause of death worldwide, places a heavy burden on the health-care system. In this study, an order-prediction model was built to predict a series of cancer drug indications based on chemical-chemical interactions. According to the confidence scores of their interactions, the order from the most likely cancer to the least likely one was obtained for each query drug. The first-order prediction accuracy on the training dataset was 55.93%, evaluated by the jackknife test, while it was 55.56% and 59.09% on a validation test dataset and an independent test dataset, respectively. The proposed method outperformed a popular method based on molecular descriptors. Moreover, it was verified that some drugs were effective against the 'wrong' predicted indications, indicating that some 'wrong' drug indications were actually correct indications. Encouraged by these promising results, the method may become a useful tool for the prediction of drug indications.
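The idea of ordering candidate indications by interaction confidence can be illustrated with a short sketch. The scoring rule used here (each indication takes the highest confidence score between the query drug and any training drug carrying that indication) is an assumption made for illustration, and the drug names, labels, and scores are hypothetical.

```python
from typing import Dict, FrozenSet, List, Set


def rank_indications(
    query: str,
    train_labels: Dict[str, Set[str]],
    confidence: Dict[FrozenSet[str], float],
) -> List[str]:
    """Order indications from most to least likely for one query drug."""
    scores: Dict[str, float] = {}
    for drug, labels in train_labels.items():
        # Interaction confidence between the query and this training drug (0 if unknown).
        c = confidence.get(frozenset((query, drug)), 0.0)
        for label in labels:
            scores[label] = max(scores.get(label, 0.0), c)
    # Highest score first; ties are broken alphabetically for reproducibility.
    return sorted(scores, key=lambda lab: (-scores[lab], lab))


# Hypothetical training annotations and interaction confidence scores.
train_labels = {
    "drugA": {"breast cancer"},
    "drugB": {"leukemia", "lung cancer"},
    "drugC": {"lung cancer"},
}
confidence = {
    frozenset(("query", "drugA")): 0.4,
    frozenset(("query", "drugB")): 0.9,
    frozenset(("query", "drugC")): 0.2,
}
ranking = rank_indications("query", train_labels, confidence)
first_order_prediction = ranking[0]
```

A first-order prediction is counted as correct when the top-ranked indication is one of the drug's true indications.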


Subject(s)
Antineoplastic Agents/pharmacology , Drug Interactions , Informatics/methods , Models, Theoretical , Neoplasms/drug therapy , Humans
6.
Mol Genet Genomics ; 289(3): 489-99, 2014 Jun.
Article in English | MEDLINE | ID: mdl-24448651

ABSTRACT

Protein-DNA interactions play important roles in many biological processes. To understand the molecular mechanisms of protein-DNA interaction, it is necessary to identify the DNA-binding sites in DNA-binding proteins. In the last decade, computational approaches have been developed to predict protein-DNA-binding sites based solely on protein sequences. In this study, we developed a novel predictor based on support vector machine algorithm coupled with the maximum relevance minimum redundancy method followed by incremental feature selection. We incorporated not only features of physicochemical/biochemical properties, sequence conservation, residual disorder, secondary structure, solvent accessibility, but also five three-dimensional (3D) structural features calculated from PDB data to predict the protein-DNA interaction sites. Feature analysis showed that 3D structural features indeed contributed to the prediction of DNA-binding site and it was demonstrated that the prediction performance was better with 3D structural features than without them. It was also shown via analysis of features from each site that the features of DNA-binding site itself contribute the most to the prediction. Our prediction method may become a useful tool for identifying the DNA-binding sites and the feature analysis described in this paper may provide useful insights for in-depth investigations into the mechanisms of protein-DNA interaction.


Subject(s)
Binding Sites , Computational Biology/methods , DNA-Binding Proteins/chemistry , DNA/chemistry , Support Vector Machine , Algorithms , DNA/metabolism , DNA-Binding Proteins/metabolism , Molecular Conformation , Protein Binding , Reproducibility of Results
7.
Biochim Biophys Acta ; 1844(1 Pt B): 207-13, 2014 Jan.
Article in English | MEDLINE | ID: mdl-23732562

ABSTRACT

Drug-target interaction is a key research topic in drug discovery since correct identification of target proteins of drug candidates can help screen out those with unacceptable toxicities, thereby saving expense. In this study, we developed a novel computational approach to predict drug target groups that may reduce the number of candidate target proteins associated with a query drug. A benchmark dataset, consisting of 3028 drugs assigned within nine categories, was constructed by collecting data from KEGG. The nine categories are (1) G protein-coupled receptors, (2) cytokine receptors, (3) nuclear receptors, (4) ion channels, (5) transporters, (6) enzymes, (7) protein kinases, (8) cellular antigens and (9) pathogens. The proposed method combines the data gleaned from chemical-chemical similarities, chemical-chemical connections and chemical-protein connections to allocate drugs to each of the nine target groups. A jackknife test applied to the training dataset that was constructed from the benchmark dataset, provided an overall correct prediction rate of 87.45%, as compared to 87.79% for the test dataset that was constructed by randomly selecting 10% of samples from the benchmark dataset. These prediction rates are much higher than the 11.11% achieved by random guesswork. These promising results suggest that the proposed method can become a useful tool in identifying drug target groups. This article is part of a Special Issue entitled: Computational Proteomics, Systems Biology & Clinical Implications. Guest Editor: Yudong Cai.


Subject(s)
Databases, Protein , Drug Design , Proteins/chemistry , Receptors, G-Protein-Coupled/chemistry , Algorithms , Drug Interactions , Humans , Ion Channels/chemistry , Molecular Targeted Therapy , Receptors, Cytoplasmic and Nuclear/chemistry
8.
Biomed Res Int ; 2013: 132724, 2013.
Article in English | MEDLINE | ID: mdl-24350241

ABSTRACT

Most drugs have beneficial as well as adverse effects and exert their biological functions by adjusting and altering the functions of their target proteins. Thus, knowledge of drug target proteins is essential for the improvement of therapeutic effects and mitigation of undesirable side effects. In this study, we proposed a novel prediction method based on drug/compound ontology information extracted from ChEBI to identify drug target groups, from which the kind of functions of a drug may be deduced. By collecting data from KEGG, a benchmark dataset consisting of 876 drugs, categorized into four target groups, was constructed. To evaluate the method more thoroughly, the benchmark dataset was divided into a training dataset and an independent test dataset. The jackknife test showed that the overall prediction accuracy on the training dataset was 83.12%, while it was 87.50% on the test dataset, indicating that the predictor generalized well. The good performance of the method indicates that the ontology information of the drugs contains rich information about their target groups, and the study may inspire solutions to problems of this sort and help bridge the gap between ChEBI ontology and drug target groups.


Subject(s)
Drug Delivery Systems/methods , Biological Ontologies , Databases, Factual , Proteins/metabolism
9.
Biomed Res Int ; 2013: 723780, 2013.
Article in English | MEDLINE | ID: mdl-24083237

ABSTRACT

Drug combinatorial therapy could be more effective in treating some complex diseases than single agents due to better efficacy and reduced side effects. Although some drug combinations are being used, their underlying molecular mechanisms are still poorly understood. Therefore, it is of great interest to deduce a novel drug combination by their molecular mechanisms in a robust and rigorous way. This paper attempts to predict effective drug combinations by a combined consideration of: (1) chemical interaction between drugs, (2) protein interactions between drugs' targets, and (3) target enrichment of KEGG pathways. A benchmark dataset was constructed, consisting of 121 confirmed effective combinations and 605 random combinations. Each drug combination was represented by 465 features derived from the aforementioned three properties. Some feature selection techniques, including Minimum Redundancy Maximum Relevance and Incremental Feature Selection, were adopted to extract the key features. Random forest model was built with its performance evaluated by 5-fold cross-validation. As a result, 55 key features providing the best prediction result were selected. These important features may help to gain insights into the mechanisms of drug combinations, and the proposed prediction model could become a useful tool for screening possible drug combinations.


Subject(s)
Computational Biology/methods , Drug Combinations , Drug Interactions , Pharmaceutical Preparations/metabolism , Proteins/metabolism , Signal Transduction , Algorithms , ROC Curve
10.
Biomed Res Int ; 2013: 485034, 2013.
Article in English | MEDLINE | ID: mdl-24078917

ABSTRACT

A drug side effect is an undesirable effect which occurs in addition to the intended therapeutic effect of the drug. The unexpected side effects that many patients suffer from are the major causes of large-scale drug withdrawal. To address the problem, it is highly demanded by pharmaceutical industries to develop computational methods for predicting the side effects of drugs. In this study, a novel computational method was developed to predict the side effects of drug compounds by hybridizing the chemical-chemical and protein-chemical interactions. Compared to most of the previous works, our method can rank the potential side effects for any query drug according to their predicted level of risk. A training dataset and test datasets were constructed from the benchmark dataset that contains 835 drug compounds to evaluate the method. By a jackknife test on the training dataset, the 1st order prediction accuracy was 86.30%, while it was 89.16% on the test dataset. It is expected that the new method may become a useful tool for drug design, and that the findings obtained by hybridizing various interactions in a network system may provide useful insights for conducting in-depth pharmacological research as well, particularly at the level of systems biomedicine.


Subject(s)
Drug Interactions , Drug-Related Side Effects and Adverse Reactions/metabolism , Pharmaceutical Preparations/metabolism , Proteins/metabolism , Databases as Topic , Humans
11.
Protein Pept Lett ; 20(3): 318-23, 2013 Mar.
Article in English | MEDLINE | ID: mdl-22591471

ABSTRACT

Ubiquitination, a reversible protein post-translational modification (PTM), occurs when an amide bond is formed between ubiquitin (a small protein) and the targeted protein. It is involved in a wide variety of cellular processes and is associated with various diseases such as Alzheimer's disease. In order to understand ubiquitination at the molecular level, it is important to identify the ubiquitination site to which ubiquitin binds. Since experimental methods to determine ubiquitination sites are both expensive and time-consuming, it is necessary to develop in-silico methods to predict ubiquitination sites based merely on the sequence information of the target protein. In this paper, we apply a new classifier called the weighted passive nearest neighbor algorithm (WPNNA) to predict ubiquitination sites. WPNNA was demonstrated to be insensitive to the varied datum densities between different classes. A hybrid of features, including PSSM conservation scores, amino acid factors and disorder scores, is employed to code the protein fragments centered on the possible ubiquitination sites. The Matthews correlation coefficient (MCC) of our predictor on a training dataset is 0.169, with sensitivity of 31.6% and specificity of 82.9%, and on an independent test dataset is 0.403, with sensitivity of 64.3% and specificity of 75.7%. We compare our predictor with that of a recently published paper which also made predictions on the same datasets. Our predictor achieves much better sensitivities on both datasets than that paper and a much better MCC on the independent test dataset, indicating that the predictor based on WPNNA is at least a good complement to the current state of the art in ubiquitination site prediction.
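For readers unfamiliar with the three metrics reported above, the following sketch computes sensitivity, specificity, and the binary MCC from confusion-matrix counts using their standard definitions. The counts in the example are hypothetical and are not taken from the paper.

```python
from math import sqrt


def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true sites that were predicted as sites."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of non-sites that were predicted as non-sites."""
    return tn / (tn + fp)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Binary Matthews correlation coefficient; 0.0 when the denominator vanishes."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical confusion-matrix counts for illustration only.
tp, fn, tn, fp = 6, 4, 8, 2
sn = sensitivity(tp, fn)   # 0.6
sp = specificity(tn, fp)   # 0.8
m = mcc(tp, tn, fp, fn)    # 1 / sqrt(6), about 0.408
```

Because MCC uses all four cells of the confusion matrix, it is less easily inflated by class imbalance than accuracy alone, which is one reason it is often reported for site-prediction tasks where negatives far outnumber positives.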


Subject(s)
Amino Acids/chemistry , Proteins , Ubiquitin , Ubiquitination , Algorithms , Animals , Binding Sites , Computational Biology/methods , Humans , Protein Processing, Post-Translational , Proteins/chemistry , Proteins/metabolism , Sequence Analysis, Protein , Ubiquitin/chemistry , Ubiquitin/metabolism
12.
Protein Pept Lett ; 20(3): 324-35, 2013 Mar.
Article in English | MEDLINE | ID: mdl-22591475

ABSTRACT

Protein disulfide bonds are formed during post-translational modification and have been implicated in various physiological and pathological processes. Proper localization of disulfide bonds also facilitates the prediction of protein three-dimensional (3D) structure. However, determining disulfide bonds with conventional experimental approaches is both time-consuming and labor-intensive, especially for large-scale data sets. Since there are also some limitations for disulfide bond prediction based on 3D structure features, it is necessary to develop convenient, fast, sequence-based computational methods for both inter- and intra-chain disulfide bond prediction. In this study, we developed a computational method for both types of disulfide bond prediction based on the maximum relevance and minimum redundancy (mRMR) method followed by incremental feature selection (IFS), with the nearest neighbor algorithm as its prediction model. Features of sequence conservation, residual disorder, and amino acid factor are used for inter-chain disulfide bond prediction. In addition to these features, the sequential distance between a pair of cysteines is also used for intra-chain disulfide bond prediction. Our approach achieves a prediction accuracy of 0.8702 for inter-chain disulfide bond prediction using 128 features and 0.9219 for intra-chain disulfide bond prediction using 261 features. Analysis of the optimal feature set indicated key features and key sites for disulfide bond formation. Interestingly, comparison of the top features between inter- and intra-chain disulfide bonds revealed the similarities and differences in the mechanisms of forming these two types of disulfide bonds, which might help in understanding the mechanisms and provide clues for further experimental studies in this research field.


Subject(s)
Amino Acids/chemistry , Cysteine/chemistry , Disulfides/chemistry , Proteins/chemistry , Algorithms , Computational Biology , Molecular Conformation , Protein Folding , Protein Processing, Post-Translational
13.
Protein Pept Lett ; 20(3): 336-45, 2013 Mar.
Article in English | MEDLINE | ID: mdl-22591478

ABSTRACT

Computational approaches can analyze protein-protein interactions (PPIs) from a different angle, complementing the experimental ones, and they are very efficient in determining whether two proteins can interact with each other. In this paper, the K-nearest neighbors (KNN) algorithm is applied to predict PPIs by coding each protein with the physical and chemical properties of its residues, predicted secondary structures and amino acid compositions. mRMR (minimum-redundancy maximum-relevance) feature selection is adopted to select a compact feature set, the features of which are considered to be important for the determination of PPI-nesses. Because the size of the negative dataset (containing non-interactive protein pairs) is much larger than that of the positive dataset (containing interactive protein pairs), the negative dataset is divided into 5 portions and each portion is combined with the positive dataset for one prediction. Thus 5 predictions are performed and the final results are obtained through voting. As a result, the prediction achieves an overall accuracy of 0.8369 with sensitivity of 0.7356. The predictor, developed by this research for the prediction of the fruit fly PPI-nesses, is available for public use at http://chemdata.shu.edu.cn/ppip.
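The strategy of splitting the oversized negative set into five portions and voting over five predictions can be sketched as below. This is an illustrative reading of the abstract rather than the published implementation: the 1-nearest-neighbor rule, the interleaved split, the feature vectors, and the function names are all assumptions.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def split_into_portions(items: List[Vector], n: int) -> List[List[Vector]]:
    """Deal the items into n roughly equal portions."""
    return [items[i::n] for i in range(n)]


def nearest_label(x: Vector, labelled: List[Tuple[Vector, int]]) -> int:
    """Label of the training vector closest to x (squared Euclidean distance)."""
    return min(labelled, key=lambda pair: sum((a - b) ** 2 for a, b in zip(x, pair[0])))[1]


def vote_predict(x: Vector, positives: List[Vector], negatives: List[Vector],
                 n_portions: int = 5) -> int:
    """Train one model per negative portion plus all positives, then take a majority vote."""
    votes = 0
    for portion in split_into_portions(negatives, n_portions):
        training = [(p, 1) for p in positives] + [(q, 0) for q in portion]
        votes += nearest_label(x, training)
    return 1 if votes * 2 > n_portions else 0


# Hypothetical feature vectors: positives cluster near (1, 1), negatives near (0, 0).
positives = [(1.0, 1.0), (0.9, 1.1)]
negatives = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.2, 0.1), (0.1, 0.2),
             (-0.1, 0.0), (0.0, -0.1), (0.2, 0.2), (-0.1, -0.1), (0.1, 0.1)]
interacting = vote_predict((0.95, 1.0), positives, negatives)
non_interacting = vote_predict((0.05, 0.05), positives, negatives)
```

Using an odd number of portions avoids tied votes, and each sub-model sees a training set that is closer to balanced than the full data.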


Subject(s)
Amino Acids/chemistry , Computational Biology/methods , Protein Binding , Proteins/chemistry , Algorithms , Protein Interaction Maps
14.
Biomed Res Int ; 2013: 523415, 2013.
Article in English | MEDLINE | ID: mdl-24455700

ABSTRACT

This study attempted to find novel age-related macular degeneration (AMD) related genes based on 36 known AMD genes. The well-known shortest path algorithm, Dijkstra's algorithm, was applied to find the shortest path connecting each pair of known AMD related genes in protein-protein interaction (PPI) network. The genes occurring in any shortest path were considered as candidate AMD related genes. As a result, 125 novel AMD genes were predicted. The further analysis based on betweenness and permutation test indicates that there are 10 genes involved in the formation or development of AMD and may be the actual AMD related genes with high probability. We hope that this contribution would promote the study of age-related macular degeneration and discovery of novel effective treatments.
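The shortest-path procedure described above can be sketched with a small weighted graph. The sketch assumes an undirected network stored as nested dictionaries; the gene names and edge weights are hypothetical, and how weights would be derived from a real PPI network is left open.

```python
import heapq
from itertools import combinations
from typing import Dict, Iterable, List, Set

Graph = Dict[str, Dict[str, float]]


def add_edge(graph: Graph, u: str, v: str, w: float) -> None:
    """Insert an undirected weighted edge."""
    graph.setdefault(u, {})[v] = w
    graph.setdefault(v, {})[u] = w


def shortest_path(graph: Graph, src: str, dst: str) -> List[str]:
    """Dijkstra's algorithm; returns the node sequence or [] if dst is unreachable."""
    dist = {src: 0.0}
    prev: Dict[str, str] = {}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    if dst not in dist:
        return []
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]


def candidate_genes(graph: Graph, seeds: Iterable[str]) -> Set[str]:
    """Genes lying on a shortest path between any pair of known disease genes."""
    seed_set = set(seeds)
    found: Set[str] = set()
    for a, b in combinations(sorted(seed_set), 2):
        found.update(shortest_path(graph, a, b)[1:-1])  # interior nodes only
    return found - seed_set


# Hypothetical network: S1, S2, S3 are known genes; X, Y, Z are other proteins.
g: Graph = {}
for u, v, w in [("S1", "X", 1), ("X", "S2", 1), ("S1", "S2", 5),
                ("S2", "Y", 1), ("Y", "S3", 1), ("S1", "Z", 4), ("Z", "S3", 4)]:
    add_edge(g, u, v, w)
candidates = candidate_genes(g, ["S1", "S2", "S3"])
```

Here Z is not selected because the route through it is longer than the route through X, S2, and Y; the abstract's follow-up filtering by betweenness and permutation test is not covered by this sketch.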


Subject(s)
Computational Biology/methods , Genetic Predisposition to Disease , Macular Degeneration/genetics , Protein Interaction Maps/genetics , Age Factors , Algorithms , Humans , Macular Degeneration/pathology , Models, Theoretical
15.
Mol Biosyst ; 9(1): 61-9, 2013 Jan 27.
Article in English | MEDLINE | ID: mdl-23117653

ABSTRACT

Identification of catalytic residues plays a key role in understanding how enzymes work. Although numerous computational methods have been developed to predict catalytic residues and active sites, the prediction accuracy remains relatively low with high false positives. In this work, we developed a novel predictor based on the Random Forest algorithm (RF) aided by the maximum relevance minimum redundancy (mRMR) method and incremental feature selection (IFS). We incorporated features of physicochemical/biochemical properties, sequence conservation, residual disorder, secondary structure and solvent accessibility to predict active sites of enzymes and achieved an overall accuracy of 0.885687 and MCC of 0.689226 on an independent test dataset. Feature analysis showed that every category of the features except disorder contributed to the identification of active sites. It was also shown via the site-specific feature analysis that the features derived from the active site itself contributed most to the active site determination. Our prediction method may become a useful tool for identifying the active sites and the key features identified by the paper may provide valuable insights into the mechanism of catalysis.


Subject(s)
Computational Biology/methods , Enzymes/chemistry , Enzymes/metabolism , Models, Chemical , Catalytic Domain , Chemical Phenomena , Conserved Sequence , Databases, Protein , Decision Trees , Protein Structure, Secondary , Sequence Analysis, Protein , Structure-Activity Relationship , Support Vector Machine
16.
PLoS One ; 7(9): e45854, 2012.
Article in English | MEDLINE | ID: mdl-23029276

ABSTRACT

Proteinases play critical roles in both intra- and extracellular processes by binding and cleaving their protein substrates. The cleavage can either be non-specific as part of degradation during protein catabolism or highly specific as part of proteolytic cascades and signal transduction events. Identification of these targets is extremely challenging. Current computational approaches for predicting cleavage sites are very limited since they mainly represent the amino acid sequences as patterns or frequency matrices. In this work, we developed a novel predictor based on the Random Forest (RF) algorithm using the maximum relevance minimum redundancy (mRMR) method followed by incremental feature selection (IFS). The features of physicochemical/biochemical properties, sequence conservation, residual disorder, amino acid occurrence frequency, secondary structure and solvent accessibility were utilized to represent the peptides concerned. Here, we compared our method with existing tools for predicting possible cleavage sites in candidate substrates. It is shown that our method makes much more reliable predictions in terms of the overall prediction accuracy. In addition, this predictor can be applied to a wide range of proteinases.


Subject(s)
Models, Molecular , Proteolysis , Algorithms , Amino Acid Motifs , Amino Acid Sequence , Conserved Sequence , Decision Trees , Molecular Sequence Data , Peptide Hydrolases/chemistry , Proteasome Endopeptidase Complex/chemistry , Sequence Analysis, Protein
17.
PLoS One ; 7(9): e45944, 2012.
Article in English | MEDLINE | ID: mdl-23029334

ABSTRACT

Metabolic pathway analysis, one of the most important fields in biochemistry, is pivotal to understanding the maintenance and modulation of the functions of an organism. Good comprehension of metabolic pathways is critical to understanding the mechanisms of some fundamental biological processes. Given a small molecule or an enzyme, how may one identify the metabolic pathways in which it may participate? Answering such a question is a first important step in understanding a metabolic pathway system. By utilizing the information provided by chemical-chemical interactions, chemical-protein interactions, and protein-protein interactions, a novel method was proposed by which to allocate small molecules and enzymes to 11 major classes of metabolic pathways. A benchmark dataset consisting of 3,348 small molecules and 654 enzymes of yeast was constructed to test the method. It was observed that the first order prediction accuracy evaluated by the jackknife test was 79.56% in identifying the small molecules and enzymes in a benchmark dataset. Our method may become a useful vehicle in predicting the metabolic pathways of small molecules and enzymes, providing a basis for some further analysis of the pathway systems.


Subject(s)
Metabolic Networks and Pathways , Saccharomyces cerevisiae Proteins/metabolism , Saccharomyces cerevisiae/metabolism , Algorithms , Databases, Factual , Metabolomics , Models, Biological , Protein Interaction Mapping
18.
PLoS One ; 7(8): e43927, 2012.
Article in English | MEDLINE | ID: mdl-22937126

ABSTRACT

Prediction of protein-protein interaction (PPI) sites is one of the most challenging problems in computational biology. Although great progress has been made by employing various machine learning approaches with numerous characteristic features, the problem is still far from being solved. In this study, we developed a novel predictor based on Random Forest (RF) algorithm with the Minimum Redundancy Maximal Relevance (mRMR) method followed by incremental feature selection (IFS). We incorporated features of physicochemical/biochemical properties, sequence conservation, residual disorder, secondary structure and solvent accessibility. We also included five 3D structural features to predict protein-protein interaction sites and achieved an overall accuracy of 0.672997 and MCC of 0.347977. Feature analysis showed that 3D structural features such as Depth Index (DPX) and surface curvature (SC) contributed most to the prediction of protein-protein interaction sites. It was also shown via site-specific feature analysis that the features of individual residues from PPI sites contribute most to the determination of protein-protein interaction sites. It is anticipated that our prediction method will become a useful tool for identifying PPI sites, and that the feature analysis described in this paper will provide useful insights into the mechanisms of interaction.
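The mRMR ranking that precedes IFS in this and several neighboring abstracts can be sketched for discrete features as follows. This sketch assumes the common "difference" form of the criterion (relevance minus mean redundancy with already selected features) and pre-discretized values; the feature names and data are hypothetical, and the published tools may differ in detail.

```python
from collections import Counter
from math import log2
from typing import Dict, List, Sequence


def mutual_information(a: Sequence[int], b: Sequence[int]) -> float:
    """Mutual information (in bits) between two discrete variables of equal length."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (c / n) * log2((c / n) / ((pa[x] / n) * (pb[y] / n)))
        for (x, y), c in pab.items()
    )


def mrmr_rank(features: Dict[str, Sequence[int]], target: Sequence[int]) -> List[str]:
    """Greedy ranking: at each step pick the feature maximizing relevance - mean redundancy."""
    remaining = list(features)
    ranked: List[str] = []

    def score(name: str) -> float:
        relevance = mutual_information(features[name], target)
        if not ranked:
            return relevance
        redundancy = sum(
            mutual_information(features[name], features[s]) for s in ranked
        ) / len(ranked)
        return relevance - redundancy

    while remaining:
        best = max(remaining, key=score)
        ranked.append(best)
        remaining.remove(best)
    return ranked


# Hypothetical discretized features for eight residues.
target = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "strong":    [0, 0, 0, 1, 1, 1, 1, 1],  # informative
    "near_copy": [0, 0, 1, 1, 1, 1, 1, 1],  # informative but redundant with "strong"
    "weak":      [0, 1, 0, 1, 0, 1, 1, 1],  # weakly informative, less redundant
}
ranking = mrmr_rank(features, target)
```

In this toy case the redundant feature is ranked last even though it is individually more relevant than the weak one, which is the behavior that distinguishes mRMR from ranking by relevance alone.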


Subject(s)
Computational Biology/methods , Proteins/metabolism , Algorithms , Protein Conformation
19.
PLoS One ; 7(8): e42517, 2012.
Article in English | MEDLINE | ID: mdl-22880014

ABSTRACT

Bacterial pathogens continue to threaten public health worldwide today. Identification of bacterial virulence factors can help to find novel drug/vaccine targets against pathogenicity. It can also help to reveal the mechanisms of the related diseases at the molecular level. With the explosive growth in protein sequences generated in the postgenomic age, it is highly desired to develop computational methods for rapidly and effectively identifying virulence factors according to their sequence information alone. In this study, based on the protein-protein interaction networks from the STRING database, a novel network-based method was proposed for identifying the virulence factors in the proteomes of UPEC 536, UPEC CFT073, P. aeruginosa PAO1, L. pneumophila Philadelphia 1, C. jejuni NCTC 11168 and M. tuberculosis H37Rv. Evaluated on the same benchmark datasets derived from the aforementioned species, the identification accuracies achieved by the network-based method were around 0.9, significantly higher than those by the sequence-based methods such as BLAST, feature selection and VirulentPred. Further analysis showed that the functional associations such as the gene neighborhood and co-occurrence were the primary associations between these virulence factors in the STRING database. The high success rates indicate that the network-based method is quite promising. The novel approach holds high potential for identifying virulence factors in many other various organisms as well because it can be easily extended to identify the virulence factors in many other bacterial species, as long as the relevant significant statistical data are available for them.


Subject(s)
Computational Biology/methods , Virulence Factors/chemistry , Algorithms , Bacteria/pathogenicity , Bacterial Proteins/chemistry , Databases, Protein , Protein Interaction Maps , ROC Curve , Sequence Alignment , Sequence Analysis, Protein
20.
PLoS One ; 7(6): e39308, 2012.
Article in English | MEDLINE | ID: mdl-22720092

ABSTRACT

Domains are the structural and functional units of proteins. With the avalanche of protein sequences generated in the postgenomic age, it is highly desirable to develop effective methods for predicting protein domains from sequence information alone, so as to facilitate the structure prediction of proteins and speed up their functional annotation. However, although many efforts have been made in this regard, prediction of protein domains from sequence information remains a challenging and elusive problem. Here, a new method was developed by combining the techniques of RF (random forest), mRMR (maximum relevance minimum redundancy), and IFS (incremental feature selection), as well as by incorporating the features of physicochemical and biochemical properties, sequence conservation, residual disorder, secondary structure, and solvent accessibility. The overall success rate achieved by the new method on an independent dataset was around 73%, which was about 28-40% higher than that of the existing method on the same benchmark dataset. Furthermore, an in-depth analysis revealed that the features of evolution, codon diversity, electrostatic charge, and disorder played more important roles than the others in predicting protein domains, quite consistent with experimental observations. It is anticipated that the new method may become a high-throughput tool for annotating protein domains, or may, at the very least, play a complementary role to the existing domain prediction methods, and that the findings about the key features with high impact on domain prediction might provide useful insights or clues for further experimental investigations in this area. Finally, it has not escaped our notice that the current approach can also be utilized to study protein signal peptides, B-cell epitopes, and HIV protease cleavage sites, among many other important topics in protein science and biomedicine.


Subject(s)
Proteins/chemistry , Protein Conformation , Solvents/chemistry