ABSTRACT
Recent advances in computing power and machine learning empower functional annotation of protein sequences and their transcript variations. Here, we present UniGOPred, an automated prediction system for GO annotations, together with a database of GO term predictions for the proteomes of several organisms in the UniProt Knowledgebase (UniProtKB). UniGOPred provides function predictions for 514 molecular function (MF), 2909 biological process (BP), and 438 cellular component (CC) GO terms for each protein sequence. UniGOPred covers nearly the whole functionality spectrum of the Gene Ontology system and can predict both generic and specific GO terms. UniGOPred was run on the CAFA2 challenge target protein sequences and ranked among the top 10 best-performing methods for the molecular function category. In addition, UniGOPred outperforms the baseline BLAST classifier in all GO categories. UniGOPred predictions are also compared with UniProtKB/TrEMBL database annotations. Furthermore, the tool's ability to predict negatively associated GO terms, which define the functions that a protein does not possess, is discussed. UniGOPred annotations were also validated by case studies, experimentally on PTEN protein variants and against the literature on CHD8 protein variants. The UniGOPred protein functional annotation system is available as an open access tool at http://cansyl.metu.edu.tr/UniGOPred.html.
Subject(s)
Machine Learning , PTEN Phosphohydrolase/metabolism , Proteomics/methods , Animals , Databases, Protein , Gene Ontology , Humans , Models, Biological , PTEN Phosphohydrolase/chemistry , PTEN Phosphohydrolase/genetics , Sequence Analysis, Protein , Transcriptome
ABSTRACT
BACKGROUND: Hepatitis B virus (HBV) is a global health problem, and infected patients, if left untreated, may develop cirrhosis and eventually hepatocellular carcinoma. This study aims to elucidate pathways associated with HBV-related liver fibrosis in order to delineate potential new therapeutic targets and biomarkers. METHODS: Tissue samples from 47 HBV-infected patients with different fibrotic stages (F1 to F6) were subjected to 2D-DIGE proteomic screening. Differentially expressed proteins were identified by mass spectrometry and verified by western blotting. Functional proteomic associations were analyzed with the EnrichNet application. RESULTS: Fibrotic stage variations were observed for apolipoprotein A1 (APOA1), pyruvate kinase PKM (KPYM), glyceraldehyde 3-phosphate dehydrogenase (GAPDH), glutamate dehydrogenase (DHE3), aldehyde dehydrogenase (ALDH2), alcohol dehydrogenase (ALDH1A1), transferrin (TRFE), peroxiredoxin 3 (PRDX3), phenazine biosynthesis-like domain-containing protein (PBLD), immunoglobulin kappa chain C region (IGKC), annexin A4 (ANXA4), and keratin 5 (KRT5). Enrichment analysis with the Reactome and KEGG databases highlighted the possible involvement of platelet release, glycolysis, and HDL-mediated lipid transport pathways. Moreover, STRING analysis revealed that HIF-1α (hypoxia-inducible factor 1-alpha), one of the interacting partners of HBx (hepatitis B X protein), may play a role in the altered glycolytic response and oxidative stress observed in liver fibrosis. CONCLUSIONS: To our knowledge, this is the first proteomic study of HBV-infected fibrotic human liver tissues to investigate alterations in protein levels and affected pathways among different fibrotic stages. The observed changes in the glycolytic pathway, caused by the presence of HBx and therefore by its interactions with HIF-1α, may offer a target pathway for novel therapies.
ABSTRACT
We report a high-quality, system-wide proteome catalogue covering 71% (3,542 proteins) of the predicted genes of fission yeast, Schizosaccharomyces pombe, presenting the largest protein dataset to date for this important model organism. We obtained this high proteome and peptide (11.4 peptides/protein) coverage by a combination of extensive sample fractionation, high-resolution Orbitrap mass spectrometry, and combined database searching using the iProphet software as part of the Trans-Proteomic Pipeline. All raw and processed data are made accessible in the S. pombe PeptideAtlas. The identified proteins showed no biases in functional properties and allowed global estimation of protein abundances. The high coverage of the PeptideAtlas allowed correlation with transcriptomic data in a system-wide manner, indicating that post-transcriptional processes control the levels of at least half of all identified proteins. Interestingly, the correlation was not equally tight for all functional categories, ranging from r(s) >0.80 for proteins involved in translation to r(s) <0.45 for signal transduction proteins. Moreover, many proteins involved in DNA damage repair could not be detected in the PeptideAtlas despite their high mRNA levels, strengthening the translation-on-demand hypothesis for members of this protein class. In summary, the extensive and publicly available S. pombe PeptideAtlas, together with the generated proteotypic peptide spectral library, will be a useful resource for future targeted, in-depth, and quantitative proteomic studies on this microorganism.
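The r(s) values quoted above are Spearman rank correlations between protein and mRNA abundance. As a minimal sketch of how such a coefficient can be computed, the following pure-Python code ranks both series (averaging ranks for ties) and takes the Pearson correlation of the ranks. The function names and the toy abundance values are illustrative assumptions, not the authors' pipeline or data.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical abundances for four proteins (made-up numbers).
mrna_level = [1.0, 2.0, 3.0, 4.0]
protein_level = [1.0, 4.0, 9.0, 16.0]  # monotone but non-linear
r_s = spearman(mrna_level, protein_level)
```

Because only ranks enter the calculation, a monotone but non-linear relationship such as the toy one above still yields a perfect rank correlation, which is one reason rank-based coefficients suit abundance data spanning several orders of magnitude.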
Subject(s)
Gene Expression Regulation, Fungal , Peptides/isolation & purification , Protein Processing, Post-Translational , Proteome/metabolism , RNA, Messenger/metabolism , Schizosaccharomyces pombe Proteins/metabolism , Schizosaccharomyces/metabolism , Databases, Protein , Mass Spectrometry , Multigene Family , Peptide Mapping , Proteome/chemistry , Proteome/genetics , RNA, Messenger/genetics , Schizosaccharomyces/chemistry , Schizosaccharomyces/genetics , Schizosaccharomyces pombe Proteins/chemistry , Schizosaccharomyces pombe Proteins/genetics , Signal Transduction
ABSTRACT
MOTIVATION: It has been recognized that the topology of molecular networks provides information about the certainty and nature of individual interactions. Thus, network motifs have been used for predicting missing links in biological networks and for removing false positives. However, various measures can be inferred from the structure of a given network, and their predictive power varies depending on the task at hand. RESULTS: Herein, we present a systematic assessment of seven different network features extracted from the topology of functional genetic networks, and we quantify their ability to classify interactions into different types of physical protein associations. Using machine learning, we combine features based on network topology with non-network features and compare their importance for the classification of interactions. We demonstrate the utility of network features based on human and budding yeast networks; we show that network features can distinguish different sub-types of physical protein associations, and we apply the framework to fission yeast, which has a much sparser known physical interactome than the other two species. Our analysis shows that network features are at least as predictive for the tasks we tested as non-network features. However, feature importance varies between species owing to different topological characteristics of the networks. The application to fission yeast shows that small maps of physical interactomes can be extended based on functional networks, which are often more readily available. AVAILABILITY AND IMPLEMENTATION: The R code for computing the network features is available from www.cellularnetworks.org
Subject(s)
Artificial Intelligence , Computational Biology/methods , Protein Interaction Mapping/methods , Proteins/chemistry , Area Under Curve , Humans , Protein Binding , ROC Curve , Saccharomyces cerevisiae , Schizosaccharomyces , Software
ABSTRACT
Information about the physical association of proteins is extensively used for studying cellular processes and disease mechanisms. However, complete experimental mapping of the human interactome will remain prohibitively difficult in the near future. Here we present a map of predicted human protein interactions that distinguishes functional association from physical binding. Our network classifies more than 5 million protein pairs, predicting 94,009 new interactions with high confidence. We experimentally tested a subset of these predictions using yeast two-hybrid analysis and affinity purification followed by quantitative mass spectrometry. We thereby identified 462 new protein-protein interactions and confirmed the predictive power of the network. These independent experiments address potential issues of circular reasoning and are a distinctive feature of this work. Analysis of the physical interactome reveals subnetworks mediating between different functional and physical subunits of the cell. Finally, we demonstrate the utility of the network for the analysis of molecular mechanisms of complex diseases by applying it to genome-wide association studies of neurodegenerative diseases. This analysis provides new evidence implicating TOMM40 as a factor involved in Alzheimer's disease. The network provides a high-quality resource for the analysis of genomic data sets and genetic association studies in particular. Our interactome is available via the hPRINT web server at: www.print-db.org.
Subject(s)
Computer Simulation , Models, Molecular , Protein Interaction Mapping/methods , Algorithms , Animals , Bayes Theorem , HeLa Cells , Humans , Mice , Neurodegenerative Diseases/genetics , Neurodegenerative Diseases/metabolism , Protein Interaction Domains and Motifs , Protein Interaction Maps , Proteome/genetics , Proteome/metabolism , ROC Curve , Recombinant Proteins/metabolism , Statistics, Nonparametric
ABSTRACT
Automated classification of proteins is indispensable for further in vivo investigation of the excessive number of unknown sequences generated by large-scale molecular biology techniques. This study describes a discriminative system based on feature space mapping, called subsequence profile map (SPMap), for functional classification of protein sequences. SPMap takes into account the information coming from the subsequences of a protein. A group of protein sequences that belong to the same level of classification is decomposed into fixed-length subsequences, which are clustered to obtain a representative feature space mapping. The mapping is defined as the distribution of the subsequences of a protein sequence over these clusters. The resulting feature space representation is used to train discriminative classifiers for functional families. The aim of this approach is to incorporate information coming from important subregions that are conserved over a family of proteins while avoiding the difficult task of explicit motif identification. The performance of the method was assessed through tests on various protein classification tasks. Our results showed that SPMap is capable of high-accuracy classification in most of these tasks. Furthermore, SPMap is fast and scalable enough to handle large datasets.
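The mapping described above (fixed-length subsequences, clustering, then a per-protein distribution over clusters) can be illustrated with a deliberately simplified sketch. Everything specific here is an assumption made for illustration and is not taken from the SPMap publication: the greedy clustering, the Hamming-distance similarity, the `max_mismatch` threshold, and the toy sequences.

```python
def subsequences(seq, length):
    """All overlapping fixed-length windows of a sequence."""
    return [seq[i:i + length] for i in range(len(seq) - length + 1)]


def hamming(a, b):
    """Number of mismatching positions between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def cluster_subsequences(family, length, max_mismatch):
    """Greedy clustering (an assumed stand-in): a window starts a new
    cluster unless it is within max_mismatch of an existing center."""
    centers = []
    for seq in family:
        for sub in subsequences(seq, length):
            if not any(hamming(sub, c) <= max_mismatch for c in centers):
                centers.append(sub)
    return centers


def feature_vector(seq, centers, length):
    """Fraction of the sequence's windows assigned to each cluster."""
    if not centers:
        return []
    counts = [0] * len(centers)
    subs = subsequences(seq, length)
    for sub in subs:
        nearest = min(range(len(centers)),
                      key=lambda k: hamming(sub, centers[k]))
        counts[nearest] += 1
    total = len(subs) or 1  # avoid division by zero for short sequences
    return [c / total for c in counts]


# Toy "family" of two short sequences (hypothetical data).
family = ["ACDEF", "ACDEG"]
centers = cluster_subsequences(family, length=3, max_mismatch=1)
vector = feature_vector("ACDEG", centers, length=3)
```

The resulting vector has one entry per cluster and a fixed length regardless of the protein's length, which is what allows it to be passed to a standard discriminative classifier.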
Subject(s)
Computational Biology/methods , Protein Interaction Mapping/methods , Proteins/chemistry , Proteins/classification , Algorithms , Cluster Analysis , Computer Simulation , Enzymes/chemistry , Enzymes/classification , Receptors, G-Protein-Coupled/chemistry , Receptors, G-Protein-Coupled/classification , Sensitivity and Specificity
ABSTRACT
Crowdsourcing has been used to address computational challenges in systems biology and assess translation of findings across species. Sub-challenge 2 of the sbv IMPROVER Systems Toxicology Challenge was designed to determine whether a common set of genes can be used to identify exposure to cigarette smoke in both human and mouse. Participating teams used a training set of human and mouse blood gene expression data to derive parsimonious models (up to 40 genes) that classify subjects into exposure groups: smokers, former smokers, and never-smokers. Teams were ranked based on two classification performance metrics evaluated on a blinded test dataset. Prediction of current exposure to cigarette smoke in human and mouse by a common prediction model was achieved by the top ranked team (Team 219) with 89% balanced accuracy (BAC), while past exposure was predicted with only 57% BAC. The prediction model of the top ranked team was a random forest classifier trained on sets of genes that appeared best for each species separately with no overlap between species. By contrast, Team 264, ranked second (tied with Team 250), selected genes that were simultaneously predictive in both species and achieved 80% and 59% BAC when predicting current and past exposure, respectively. These performance values were lower than the 96.5% and 61% BAC estimates for current and past exposure, respectively, obtained by Team 264 (top ranked in sub-challenge 1) when using only human data. Unlike past exposure, current exposure to cigarette smoke can be accurately assessed in both human and mouse with a common prediction model based on blood mRNAs. However, requiring a common gene signature to be predictive in both species resulted in a substantial decrease in balanced accuracy for prediction of current exposure to cigarette smoke (from 96.5% to 80%), suggesting species-specific responses exist.
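Balanced accuracy (BAC), the metric quoted above, is conventionally defined as the mean of the per-class recalls. A minimal sketch under that standard definition follows. The challenge's exact scoring procedure is not described in the abstract, so this is not a reproduction of it, and the class labels and predictions are made up.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for cls in sorted(set(y_true)):
        members = [i for i, label in enumerate(y_true) if label == cls]
        correct = sum(1 for i in members if y_pred[i] == cls)
        recalls.append(correct / len(members))
    return sum(recalls) / len(recalls)


# Hypothetical imbalanced example: 8 smokers (S), 2 never-smokers (NS).
y_true = ["S"] * 8 + ["NS"] * 2
y_pred = ["S"] * 10  # a classifier that always predicts "S"

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bac = balanced_accuracy(y_true, y_pred)
```

In this toy case the plain accuracy is 0.8 while the balanced accuracy is 0.5, which shows why BAC is preferred when exposure groups differ in size: a classifier that ignores the minority class gains nothing.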
ABSTRACT
Cigarette smoking entails chronic exposure to a mixture of harmful chemicals that trigger molecular changes over time, and is known to increase the risk of developing diseases. Risk assessment in the context of 21st century toxicology relies on the elucidation of mechanisms of toxicity and the identification of exposure response markers, usually from high-throughput data, using advanced computational methodologies. The sbv IMPROVER Systems Toxicology computational challenge (Fall 2015-Spring 2016) aimed to evaluate whether robust and sparse (≤40 genes) human (sub-challenge 1, SC1) and species-independent (sub-challenge 2, SC2) exposure response markers (so-called gene signatures) could be extracted from human and mouse blood transcriptomics data of current (S), former (FS) and never (NS) smoke-exposed subjects as predictors of smoking and cessation status. Best-performing computational methods were identified by scoring anonymized participants' predictions. Worldwide participation resulted in 12 (SC1) and six (SC2) final submissions qualified for scoring. The results showed that blood gene expression data were informative for predicting smoking exposure status (i.e. discriminating smokers from never or former smokers) in human and across species with a high level of accuracy. By contrast, the prediction of cessation status (i.e. distinguishing FS from NS) remained challenging, as reflected by lower classification performances. Participants successfully developed inductive predictive models and extracted human and species-independent gene signatures, including genes with high consensus across teams. Post-challenge analyses highlighted "feature selection" as a key step in the process of building a classifier and confirmed the importance of testing a gene signature in independent cohorts to ensure the generalized applicability of a predictive model at a population-based level. In conclusion, the Systems Toxicology challenge demonstrated the feasibility of extracting a consistent blood-based smoke exposure response gene signature and further stressed the importance of independent and unbiased data and method evaluations to provide confidence in systems toxicology-based scientific conclusions.
ABSTRACT
Functional protein annotation is an important matter for in vivo and in silico biology. Several computational methods have been proposed that make use of a wide range of features such as motifs, domains, homology, structure and physicochemical properties. No single method performs best in all functional classification problems, because the information obtained using any of these features depends on the function to be assigned to the protein. In this study, we present a novel approach that combines different methods to better represent protein function. First, we formulated the function annotation problem as a classification problem defined on 300 different Gene Ontology (GO) terms from the molecular function aspect. We presented a method to form positive and negative training examples while taking into account the directed acyclic graph (DAG) structure and evidence codes of GO. We applied three different methods and their combinations. The results show that combining different methods improves prediction accuracy in most cases. The proposed method, GOPred, is available as an online computational annotation tool (http://kinaz.fen.bilkent.edu.tr/gopred).
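The abstract does not spell out how positive and negative training examples are formed, so the following is only one plausible sketch of a DAG-aware and evidence-aware scheme, not GOPred's actual rule. The assumptions: positives are proteins with a trusted annotation to the target term or any of its descendants; negatives are proteins whose trusted annotations all lie outside the target's descendants and ancestors; proteins annotated only to an ancestor are left out as ambiguous. The term names, protein IDs and the trusted-code set are hypothetical. IDA, IMP and IEA are real GO evidence codes.

```python
def descendants(term, children):
    """All terms reachable from `term` by following child edges."""
    seen, stack = set(), [term]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def ancestors(term, children):
    """All terms from which `term` is reachable (via inverted edges)."""
    parents = {}
    for parent, kids in children.items():
        for kid in kids:
            parents.setdefault(kid, set()).add(parent)
    seen, stack = set(), [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def training_sets(term, children, annotations, trusted_codes):
    """Split proteins into positives and negatives for one GO term."""
    positive_terms = descendants(term, children) | {term}
    related_terms = positive_terms | ancestors(term, children)
    positives, negatives = set(), set()
    for protein, annots in annotations.items():
        # Keep only annotations backed by a trusted evidence code.
        trusted = {t for t, code in annots if code in trusted_codes}
        if trusted & positive_terms:
            positives.add(protein)
        elif trusted and not (trusted & related_terms):
            negatives.add(protein)
        # Otherwise: ancestor-only or untrusted, so left unlabeled.
    return positives, negatives


# Hypothetical toy ontology and annotations.
children = {
    "root": ["binding", "catalysis"],
    "binding": ["dna_binding"],
    "catalysis": ["kinase"],
}
annotations = {
    "P1": [("dna_binding", "IDA")],  # descendant of target: positive
    "P2": [("kinase", "IDA")],       # unrelated branch: negative
    "P3": [("root", "IDA")],         # ancestor only: ambiguous
    "P4": [("binding", "IEA")],      # electronic evidence: not trusted
}
positives, negatives = training_sets(
    "binding", children, annotations, trusted_codes={"IDA", "IMP"}
)
```

Leaving ancestor-only proteins unlabeled reflects the fact that a protein annotated to a general term may still carry the more specific function. Whether GOPred treats such cases this way is not stated in the abstract.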