1.
J Proteome Res ; 2024 May 29.
Article En | MEDLINE | ID: mdl-38810119

Phosphorylation is the most studied post-translational modification, and has multiple biological functions. In this study, we have reanalyzed publicly available mass spectrometry proteomics data sets enriched for phosphopeptides from Asian rice (Oryza sativa). In total we identified 15,565 phosphosites on serine, threonine, and tyrosine residues on rice proteins. We identified sequence motifs for phosphosites, and linked motifs to enrichment of different biological processes, indicating different downstream regulation likely caused by different kinase groups. We cross-referenced phosphosites against the rice 3,000 genomes, to identify single amino acid variations (SAAVs) within or proximal to phosphosites that could cause loss of a site in a given rice variety, and clustered the data to identify groups of sites with similar patterns across rice family groups. The data has been loaded into the UniProt Knowledge-Base, enabling researchers to visualize sites alongside other data on rice proteins (e.g., structural models from AlphaFold2), and into PeptideAtlas and the PRIDE database, enabling visualization of source evidence, including scores and supporting mass spectra.
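
The cross-referencing step in this abstract (finding SAAVs within or proximal to a phosphosite) can be illustrated with a minimal sketch. The function name, the input layout (1-based residue positions on a single protein), and the window of 5 residues are assumptions made for illustration; the abstract does not state how "proximal" was defined or how the authors implemented the step.

```python
def saavs_near_phosphosites(phosphosites, saavs, window=5):
    """Map each phosphosite position to the SAAV positions within +/- window residues.

    phosphosites, saavs: iterables of 1-based residue positions on one protein.
    window: hypothetical proximity cutoff (not taken from the abstract).
    Sites with no nearby SAAV are omitted from the result.
    """
    hits = {}
    for site in phosphosites:
        # A SAAV at the site itself (distance 0) counts as "within" the site.
        nearby = sorted(pos for pos in saavs if abs(pos - site) <= window)
        if nearby:
            hits[site] = nearby
    return hits


example = saavs_near_phosphosites([10, 50], [8, 10, 70], window=5)
print(example)
```

In this toy input the site at position 10 has two SAAVs in range (positions 8 and 10), while the site at position 50 has none and is therefore left out.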

2.
J Proteome Res ; 23(6): 1948-1959, 2024 Jun 07.
Article En | MEDLINE | ID: mdl-38717300

The availability of an increasingly large amount of public proteomics data sets presents an opportunity for performing combined analyses to generate comprehensive organism-wide protein expression maps across different organisms and biological conditions. Sus scrofa, the domestic pig, is a model organism relevant for food production and for human biomedical research. Here, we reanalyzed 14 public proteomics data sets from the PRIDE database coming from pig tissues to assess baseline (without any biological perturbation) protein abundance in 14 organs, encompassing a total of 20 healthy tissues from 128 samples. The analysis involved the quantification of protein abundance in 599 mass spectrometry runs. We compared protein expression patterns among different pig organs and examined the distribution of proteins across these organs. Then, we studied how protein abundances compared across different data sets and studied the tissue specificity of the detected proteins. Of particular interest, we conducted a comparative analysis of protein expression between pig and human tissues, revealing a high degree of correlation in protein expression among orthologs, particularly in brain, kidney, heart, and liver samples. We have integrated the protein expression results into the Expression Atlas resource for easy access and visualization of the protein expression data individually or alongside gene expression data.


Kidney , Proteomics , Animals , Proteomics/methods , Humans , Swine , Kidney/metabolism , Kidney/chemistry , Organ Specificity , Liver/metabolism , Liver/chemistry , Databases, Protein , Brain/metabolism , Myocardium/metabolism , Myocardium/chemistry , Sus scrofa/metabolism , Sus scrofa/genetics , Proteome/metabolism , Proteome/analysis , Mass Spectrometry
3.
PLoS Comput Biol ; 20(1): e1011828, 2024 Jan.
Article En | MEDLINE | ID: mdl-38252632

The cancer biomarker field has been an object of thorough investigation in the last decades. Despite this, colorectal cancer (CRC) heterogeneity makes it challenging to identify and validate effective prognostic biomarkers for patient classification according to outcome and treatment response. Although a massive amount of proteomics data has been deposited in public data repositories, this rich source of information is vastly underused. Here, we attempted to reuse public proteomics datasets with two main objectives: (i) to generate hypotheses (detection of biomarkers) for their subsequent validation, and (ii) to validate, using an orthogonal approach, a previously described biomarker panel. Twelve CRC public proteomics datasets (mostly from the PRIDE database) were re-analysed and integrated to create a landscape of protein expression. Samples from both solid and liquid biopsies were included in the reanalysis. Integrating this data with survival annotation data, we have validated in silico a six-gene signature for CRC classification at the protein level, and identified five new blood-detectable biomarkers (CD14, PPIA, MRC2, PRDX1, and TXNDC5) associated with CRC prognosis. The prognostic value of these blood-derived proteins was confirmed using additional public datasets, supporting their potential clinical value. In conclusion, this proof-of-concept study demonstrates the value of re-using public proteomics datasets as the basis to create a useful resource for biomarker discovery and validation. The protein expression data has been made available in the public resource Expression Atlas.


Colorectal Neoplasms , Proteomics , Humans , Colorectal Neoplasms/diagnosis , Colorectal Neoplasms/genetics , Colorectal Neoplasms/metabolism , Biomarkers, Tumor/metabolism , Blood Proteins , Protein Disulfide-Isomerases
4.
Nucleic Acids Res ; 52(D1): D107-D114, 2024 Jan 05.
Article En | MEDLINE | ID: mdl-37992296

Expression Atlas (www.ebi.ac.uk/gxa) and its newest counterpart the Single Cell Expression Atlas (www.ebi.ac.uk/gxa/sc) are EMBL-EBI's knowledgebases for gene and protein expression and localisation in bulk and at single cell level. These resources aim to allow users to investigate their expression in normal tissue (baseline) or in response to perturbations such as disease or changes to genotype (differential) across multiple species. Users are invited to search for genes or metadata terms across species or biological conditions in a standardised consistent interface. Alongside these data, new features in Single Cell Expression Atlas allow users to query metadata through our new cell type wheel search. At the experiment level data can be explored through two types of dimensionality reduction plots, t-distributed Stochastic Neighbor Embedding (tSNE) and Uniform Manifold Approximation and Projection (UMAP), overlaid with either clustering or metadata information to assist users' understanding. Data are also visualised as marker gene heatmaps identifying genes that help confer cluster identity. For some data, additional visualisations are available as interactive cell level anatomograms and cell type gene expression heatmaps.


Databases, Genetic , Gene Expression Profiling , Proteomics , Genotype , Metadata , Single-Cell Analysis , Internet , Humans , Animals
5.
bioRxiv ; 2023 Nov 17.
Article En | MEDLINE | ID: mdl-38014076

Phosphorylation is the most studied post-translational modification, and has multiple biological functions. In this study, we have re-analysed publicly available mass spectrometry proteomics datasets enriched for phosphopeptides from Asian rice (Oryza sativa). In total we identified 15,522 phosphosites on serine, threonine and tyrosine residues on rice proteins. We identified sequence motifs for phosphosites, and linked motifs to enrichment of different biological processes, indicating different downstream regulation likely caused by different kinase groups. We cross-referenced phosphosites against the rice 3,000 genomes, to identify single amino acid variations (SAAVs) within or proximal to phosphosites that could cause loss of a site in a given rice variety. The data was clustered to identify groups of sites with similar patterns across rice family groups, for example those highly conserved in Japonica but mostly absent in Aus type rice varieties, which are known to have different responses to drought. These resources can assist rice researchers to discover alleles with significantly different functional effects across rice varieties. The data has been loaded into the UniProt Knowledge-Base, enabling researchers to visualise sites alongside other data on rice proteins (e.g. structural models from AlphaFold2), and into PeptideAtlas and the PRIDE database, enabling visualisation of source evidence, including scores and supporting mass spectra.

6.
J Proteome Res ; 22(3): 729-742, 2023 03 03.
Article En | MEDLINE | ID: mdl-36577097

The availability of proteomics datasets in the public domain, and in the PRIDE database, in particular, has increased dramatically in recent years. This unprecedented large-scale availability of data provides an opportunity for combined analyses of datasets to get organism-wide protein abundance data in a consistent manner. We have reanalyzed 24 public proteomics datasets from healthy human individuals to assess baseline protein abundance in 31 organs. We defined tissue as a distinct functional or structural region within an organ. Overall, the aggregated dataset contains 67 healthy tissues, corresponding to 3,119 mass spectrometry runs covering 498 samples from 489 individuals. We compared protein abundances between different organs and studied the distribution of proteins across these organs. We also compared the results with data generated in analogous studies. Additionally, we performed gene ontology and pathway-enrichment analyses to identify organ-specific enriched biological processes and pathways. As a key point, we have integrated the protein abundance results into the resource Expression Atlas, where they can be accessed and visualized either individually or together with gene expression data coming from transcriptomics datasets. We believe this is a good mechanism to make proteomics data more accessible for life scientists.


Proteome , Proteomics , Humans , Proteome/analysis , Proteomics/methods , Gene Expression Profiling , Databases, Factual , Mass Spectrometry/methods , Databases, Protein
7.
PLoS Comput Biol ; 18(6): e1010174, 2022 06.
Article En | MEDLINE | ID: mdl-35714157

The increasingly large amount of proteomics data in the public domain enables, among other applications, the combined analyses of datasets to create comparative protein expression maps covering different organisms and different biological conditions. Here we have reanalysed public proteomics datasets from mouse and rat tissues (14 and 9 datasets, respectively), to assess baseline protein abundance. Overall, the aggregated dataset contained 23 individual datasets, including a total of 211 samples coming from 34 different tissues across 14 organs, comprising 9 mouse and 3 rat strains, respectively. In all cases, we studied the distribution of canonical proteins between the different organs. The number of canonical proteins per dataset ranged from 273 (tendon) to 9,715 (liver) in mouse, and from 101 (tendon) to 6,130 (kidney) in rat. Then, we studied how protein abundances compared across different datasets and organs for both species. As a key point we carried out a comparative analysis of protein expression between mouse, rat and human tissues. We observed a high level of correlation of protein expression among orthologs between all three species in brain, kidney, heart and liver samples, whereas the correlation of protein expression was generally slightly lower between organs within the same species. Protein expression results have been integrated into the resource Expression Atlas for widespread dissemination.


Proteins , Proteomics , Animals , Brain/metabolism , Mice , Proteins/metabolism , Rats
8.
Sci Data ; 9(1): 335, 2022 06 14.
Article En | MEDLINE | ID: mdl-35701420

The number of mass spectrometry (MS)-based proteomics datasets in the public domain keeps increasing, particularly those generated by Data Independent Acquisition (DIA) approaches such as SWATH-MS. Unlike Data Dependent Acquisition datasets, the re-use of DIA datasets has been rather limited to date, despite its high potential, due to the technical challenges involved. We introduce a (re-)analysis pipeline for public SWATH-MS datasets which includes a combination of metadata annotation protocols, automated workflows for MS data analysis, statistical analysis, and the integration of the results into the Expression Atlas resource. Automation is orchestrated with Nextflow, using containerised open analysis software tools, rendering the pipeline readily available and reproducible. To demonstrate its utility, we reanalysed 10 public DIA datasets from the PRIDE database, comprising 1,278 SWATH-MS runs. The robustness of the analysis was evaluated, and the results compared to those obtained in the original publications. The final expression values were integrated into Expression Atlas, making SWATH-MS experiments more widely available and combining them with expression data originating from other proteomics and transcriptomics datasets.


Proteomics , Software , Data Analysis , Databases, Protein , Datasets as Topic , Mass Spectrometry/methods , Proteomics/methods
9.
J Proteome Res ; 21(7): 1603-1615, 2022 07 01.
Article En | MEDLINE | ID: mdl-35640880

Phosphoproteomic methods are commonly employed to identify and quantify phosphorylation sites on proteins. In recent years, various tools have been developed, incorporating scores or statistics related to whether a given phosphosite has been correctly identified or to estimate the global false localization rate (FLR) within a given data set for all sites reported. These scores have generally been calibrated using synthetic datasets, and their statistical reliability on real datasets is largely unknown, potentially leading to studies reporting incorrectly localized phosphosites, due to inadequate statistical control. In this work, we develop the concept of scoring modifications on a decoy amino acid, that is, one that cannot be modified, to allow for independent estimation of global FLR. We test a variety of amino acids, on both synthetic and real data sets, demonstrating that the selection can make a substantial difference to the estimated global FLR. We conclude that while several different amino acids might be appropriate, the most reliable FLR results were achieved using alanine and leucine as decoys. We propose the use of a decoy amino acid to control false reporting in the literature and in public databases that re-distribute the data. Data are available via ProteomeXchange with identifier PXD028840.


Amino Acids , Tandem Mass Spectrometry , Databases, Protein , Reproducibility of Results , Tandem Mass Spectrometry/methods
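
The decoy amino acid idea in the abstract above can be sketched as a small calculation. This is an illustrative simplification, not the authors' published procedure: it assumes that false localizations land on decoy residues (for example alanine) and on target residues (S/T/Y) in proportion to how often each occurs in the identified peptides. All function and parameter names are hypothetical.

```python
def estimate_global_flr(n_target_sites, n_decoy_sites,
                        target_residue_count, decoy_residue_count):
    """Estimate the global false localization rate (FLR) among reported target sites.

    n_target_sites: sites reported on real phospho-acceptors (S/T/Y).
    n_decoy_sites: sites reported on the decoy residue, which cannot be modified.
    target_residue_count / decoy_residue_count: occurrences of S/T/Y and of the
        decoy residue in the identified peptides, used to scale decoy hits.
    Assumption (not from the abstract): false hits fall on residues in
    proportion to residue frequency.
    """
    if n_target_sites <= 0 or decoy_residue_count <= 0:
        raise ValueError("need at least one target site and one decoy residue")
    # Each decoy hit stands in for this many false hits on target residues.
    scale = target_residue_count / decoy_residue_count
    estimated_false_targets = n_decoy_sites * scale
    # Cap at 1.0 so the estimate stays a valid rate.
    return min(1.0, estimated_false_targets / n_target_sites)


flr = estimate_global_flr(n_target_sites=1000, n_decoy_sites=10,
                          target_residue_count=3000, decoy_residue_count=1500)
print(f"estimated global FLR: {flr:.3f}")
```

With these made-up counts, 10 decoy hits scaled by a 2:1 residue ratio imply about 20 false target sites out of 1,000, so the estimate is 0.020. The choice of decoy residue changes `decoy_residue_count`, which fits the abstract's observation that the selection can substantially shift the estimated FLR.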
10.
Nucleic Acids Res ; 50(D1): D129-D140, 2022 01 07.
Article En | MEDLINE | ID: mdl-34850121

The EMBL-EBI Expression Atlas is an added value knowledge base that enables researchers to answer the question of where (tissue, organism part, developmental stage, cell type) and under which conditions (disease, treatment, gender, etc) a gene or protein of interest is expressed. Expression Atlas brings together data from >4500 expression studies from >65 different species, across different conditions and tissues. It makes these data freely available in an easy to visualise form, after expert curation to accurately represent the intended experimental design, re-analysed via standardised pipelines that rely on open-source community developed tools. Each study's metadata are annotated using ontologies. The data are re-analyzed with the aim of reproducing the original conclusions of the underlying experiments. Expression Atlas is currently divided into Bulk Expression Atlas and Single Cell Expression Atlas. Expression Atlas contains data from differential studies (microarray and bulk RNA-Seq) and baseline studies (bulk RNA-Seq and proteomics), whereas Single Cell Expression Atlas is currently dedicated to Single Cell RNA-Sequencing (scRNA-Seq) studies. The resource has been in continuous development since 2009 and it is available at https://www.ebi.ac.uk/gxa.


Databases, Genetic , Proteins/genetics , Proteomics , Software , Computational Biology , Gene Expression Profiling , Humans , Proteins/chemistry , RNA-Seq , Sequence Analysis, RNA , Single-Cell Analysis
11.
Nucleic Acids Res ; 50(D1): D543-D552, 2022 01 07.
Article En | MEDLINE | ID: mdl-34723319

The PRoteomics IDEntifications (PRIDE) database (https://www.ebi.ac.uk/pride/) is the world's largest data repository of mass spectrometry-based proteomics data. PRIDE is one of the founding members of the global ProteomeXchange (PX) consortium and an ELIXIR core data resource. In this manuscript, we summarize the developments in PRIDE resources and related tools since the previous update manuscript was published in Nucleic Acids Research in 2019. The number of submitted datasets to PRIDE Archive (the archival component of PRIDE) has reached on average around 500 datasets per month during 2021. In addition to continuous improvements in PRIDE Archive data pipelines and infrastructure, the PRIDE Spectra Archive has been developed to provide direct access to the submitted mass spectra using Universal Spectrum Identifiers. As a key point, the file format MAGE-TAB for proteomics has been developed to enable the improvement of sample metadata annotation. Additionally, the resource PRIDE Peptidome provides access to aggregated peptide/protein evidences across PRIDE Archive. Furthermore, we will describe how PRIDE has increased its efforts to reuse and disseminate high-quality proteomics data into other added-value resources such as UniProt, Ensembl and Expression Atlas.


Databases, Protein , Metadata/statistics & numerical data , Molecular Sequence Annotation/statistics & numerical data , Peptides/chemistry , Proteins/chemistry , Software , Amino Acid Sequence , Bibliometrics , Datasets as Topic , Humans , Information Storage and Retrieval , Internet , Mass Spectrometry , Peptides/genetics , Peptides/metabolism , Proteins/genetics , Proteins/metabolism , Proteomics/instrumentation , Proteomics/methods , Sequence Alignment
12.
Cell ; 176(1-2): 391-403.e19, 2019 01 10.
Article En | MEDLINE | ID: mdl-30528433

Proteins and RNA functionally and physically intersect in multiple biological processes; however, no universal method is currently available to purify protein-RNA complexes. Here, we introduce XRNAX, a method for the generic purification of protein-crosslinked RNA, and demonstrate its versatility to study the composition and dynamics of protein-RNA interactions by various transcriptomic and proteomic approaches. We show that XRNAX captures all RNA biotypes and use this to characterize the sub-proteomes that interact with coding and non-coding RNAs (ncRNAs) and to identify hundreds of protein-RNA interfaces. Exploiting the quantitative nature of XRNAX, we observe drastic remodeling of the RNA-bound proteome during arsenite-induced stress, distinct from autophagy-related changes in the total proteome. In addition, we combine XRNAX with crosslinking immunoprecipitation sequencing (CLIP-seq) to validate the interaction of ncRNA with lamin B1 and EXOSC2. Thus, XRNAX is a resourceful approach to study structural and compositional aspects of protein-RNA interactions to address fundamental questions in RNA-biology.


High-Throughput Nucleotide Sequencing/methods , RNA-Binding Proteins/isolation & purification , RNA/isolation & purification , Binding Sites , Exosome Multienzyme Ribonuclease Complex/metabolism , Humans , Immunoprecipitation/methods , Lamin Type B/metabolism , Protein Binding/genetics , Protein Binding/physiology , Protein Biosynthesis/genetics , Protein Biosynthesis/physiology , Protein Processing, Post-Translational , Proteins/isolation & purification , Proteins/metabolism , Proteome/metabolism , Proteomics/methods , RNA/genetics , RNA/metabolism , RNA, Messenger/metabolism , RNA, Untranslated/metabolism , RNA-Binding Proteins/metabolism , Transcriptome
13.
Curr Protoc Bioinformatics ; 60: 3.15.1-3.15.23, 2017 12 08.
Article En | MEDLINE | ID: mdl-29220076

Protein sequence similarity search is one of the most commonly used bioinformatics methods for identifying evolutionarily related proteins. In general, sequences that are evolutionarily related share some degree of similarity, and sequence-search algorithms use this principle to identify homologs. The requirement for a fast and sensitive sequence search method led to the development of the HMMER software, which in the latest version (v3.1) uses a combination of sophisticated acceleration heuristics and mathematical and computational optimizations to enable the use of profile hidden Markov models (HMMs) for sequence analysis. The HMMER Web server provides a common platform by linking the HMMER algorithms to databases, thereby enabling the search for homologs, as well as providing sequence and functional annotation by linking external databases. This unit describes three basic protocols and two alternate protocols that explain how to use the HMMER Web server using various input formats and user defined parameters. © 2017 by John Wiley & Sons, Inc.


Databases, Protein , Sequence Homology, Amino Acid , Software , Algorithms , Computational Biology , Humans , Internet , Markov Chains , Proteins , Sequence Alignment
14.
RNA ; 23(10): 1479-1492, 2017 10.
Article En | MEDLINE | ID: mdl-28701522

This article describes the creation of the first expert manually curated noncoding RNA interaction networks for S. cerevisiae. The RNA-RNA and RNA-protein interaction networks have been carefully extracted from the experimental literature and made available through the IntAct database (www.ebi.ac.uk/intact). We provide an initial network analysis and compare their properties to the much larger protein-protein interaction network. We find that the proteins that bind to ncRNAs in the network contain only a small proportion of classical RNA binding domains. We also see an enrichment of WD40 domains suggesting their direct involvement in ncRNA interactions. We discuss the challenges in collecting noncoding RNA interaction data and the opportunities for worldwide collaboration to fill the unmet need for this data.


Computational Biology/methods , Gene Regulatory Networks , RNA, Untranslated/genetics , Saccharomyces cerevisiae/genetics , Gene Ontology , RNA, Fungal , RNA-Binding Proteins/genetics , RNA-Binding Proteins/metabolism , Saccharomyces cerevisiae Proteins/genetics , Saccharomyces cerevisiae Proteins/metabolism
15.
J Exp Med ; 214(4): 1111-1128, 2017 04 03.
Article En | MEDLINE | ID: mdl-28351984

The phagocyte respiratory burst is crucial for innate immunity. The transfer of electrons to oxygen is mediated by a membrane-bound heterodimer, comprising gp91phox and p22phox subunits. Deficiency of either subunit leads to severe immunodeficiency. We describe Eros (essential for reactive oxygen species), a protein encoded by the previously undefined mouse gene bc017643, and show that it is essential for host defense via the phagocyte NADPH oxidase. Eros is required for expression of the NADPH oxidase components, gp91phox and p22phox. Consequently, Eros-deficient mice quickly succumb to infection. Eros also contributes to the formation of neutrophil extracellular traps (NETs) and impacts on the immune response to melanoma metastases. Eros is an ortholog of the plant protein Ycf4, which is necessary for expression of proteins of the photosynthetic photosystem 1 complex, itself also an NADPH oxidoreductase. We thus describe the key role of the previously uncharacterized protein Eros in host defense.


Membrane Proteins/physiology , Phagocytes/physiology , Reactive Oxygen Species/metabolism , Respiratory Burst/physiology , Animals , Cytochrome b Group/analysis , Cytochrome b Group/physiology , Endoplasmic Reticulum/metabolism , HEK293 Cells , Humans , Immunity, Innate , Macrophages/immunology , Membrane Glycoproteins/analysis , Membrane Glycoproteins/physiology , Mice , Mice, Inbred C57BL , NADPH Oxidase 2 , NADPH Oxidases/analysis , NADPH Oxidases/physiology , Neutrophils/immunology , Phagocytosis
16.
Genome Biol ; 16: 88, 2015 Apr 30.
Article En | MEDLINE | ID: mdl-25924720

BACKGROUND: Protein domains display a range of structural diversity, with numerous additions and deletions of secondary structural elements between related domains. We have observed a small number of cases of surprising large-scale deletions of core elements of structural domains. We propose a new concept called domain atrophy, where protein domains lose a significant number of core structural elements. RESULTS: Here, we implement a new pipeline to systematically identify new cases of domain atrophy across all known protein sequences. The output of this pipeline was carefully checked by hand, which filtered out partial domain instances that were unlikely to represent true domain atrophy due to misannotations or un-annotated sequence fragments. We identify 75 cases of domain atrophy, of which eight cases are found in a three-dimensional protein structure and 67 cases have been inferred based on mapping to a known homologous structure. Domains with structural variations include ancient folds such as the TIM-barrel and Rossmann folds. Most of these domains are observed to show structural loss that does not affect their functional sites. CONCLUSION: Our analysis has significantly increased the known cases of domain atrophy. We discuss specific instances of domain atrophy and see that there has often been a compensatory mechanism that helps to maintain the stability of the partial domain. Our study indicates that although domain atrophy is an extremely rare phenomenon, protein domains under certain circumstances can tolerate extreme mutations giving rise to partial, but functional, domains.


Bacterial Proteins/genetics , Gene Deletion , Genes, Bacterial , Luciferases/genetics , Oxidoreductases/genetics , Bacterial Proteins/metabolism , Burkholderia cenocepacia/enzymology , Burkholderia cenocepacia/genetics , Carrier Proteins/genetics , Carrier Proteins/metabolism , Cryptococcus/enzymology , Cryptococcus/genetics , Escherichia coli/enzymology , Escherichia coli/genetics , Evolution, Molecular , Humans , Lactobacillus/enzymology , Lactobacillus/genetics , Luciferases/metabolism , Models, Molecular , Oxidoreductases/metabolism , Photobacterium/enzymology , Photobacterium/genetics , Phylogeny , Protein Structure, Tertiary , Pyrococcus furiosus/enzymology , Pyrococcus furiosus/genetics , Staphylococcus aureus/enzymology , Staphylococcus aureus/genetics
17.
PLoS One ; 6(11): e25570, 2011.
Article En | MEDLINE | ID: mdl-22073138

Vibrio cholerae, an enteropathogenic gram-negative bacterium, is one of the main causative agents of waterborne diseases such as cholera. About one third of the organism's genome is uncharacterised, with many protein-coding genes lacking structural and functional information. These proteins form a significant fraction of the genome and are crucial in understanding the organism's complete functional makeup. In this study we report the general structure and function of a family of hypothetical proteins, Domain of Unknown Function 3233 (DUF3233), which are conserved across gram-negative gammaproteobacteria (especially in Vibrio sp. and similar bacteria). Profile and HMM based sequence search methods were used to screen homologues of DUF3233. The I-TASSER fold recognition method was used to build a three-dimensional structural model of the domain. The structure resembles a transmembrane beta-barrel with an axial N-terminal helix and twelve antiparallel beta-strands. Using a combination of amphipathy and discrimination analysis we analysed the potential transmembrane beta-barrel forming properties of DUF3233. Sequence, structure and phylogenetic analysis of DUF3233 indicates that this gram-negative bacterial hypothetical protein resembles the beta-barrel translocation unit of the autotransporter Va secretory mechanism, with a gene organisation that differs from the conventional Va system.


Carrier Proteins/metabolism , Proteobacteria/metabolism , Carrier Proteins/chemistry , Models, Molecular , Protein Transport