Results 1 - 20 of 102
1.
Nat Methods ; 2024 Jul 04.
Article in English | MEDLINE | ID: mdl-38965444

ABSTRACT

The volume of public proteomics data is rapidly increasing, creating a computational challenge for large-scale reanalysis. Here, we introduce quantms (https://quantms.org/), an open-source cloud-based pipeline for massively parallel proteomics data analysis. We used quantms to reanalyze 83 public ProteomeXchange datasets, comprising 29,354 instrument files from 13,132 human samples, to quantify 16,599 proteins based on 1.03 million unique peptides. quantms is based on standard file formats, improving the reproducibility, submission and dissemination of the data to ProteomeXchange.

2.
J Proteome Res ; 23(7): 2518-2531, 2024 Jul 05.
Article in English | MEDLINE | ID: mdl-38810119

ABSTRACT

Phosphorylation is the most studied post-translational modification, and has multiple biological functions. In this study, we have reanalyzed publicly available mass spectrometry proteomics data sets enriched for phosphopeptides from Asian rice (Oryza sativa). In total we identified 15,565 phosphosites on serine, threonine, and tyrosine residues on rice proteins. We identified sequence motifs for phosphosites, and linked motifs to enrichment of different biological processes, indicating different downstream regulation likely caused by different kinase groups. We cross-referenced phosphosites against the rice 3,000 genomes, to identify single amino acid variations (SAAVs) within or proximal to phosphosites that could cause loss of a site in a given rice variety, and clustered the data to identify groups of sites with similar patterns across rice family groups. The data has been loaded into UniProt Knowledge-Base, enabling researchers to visualize sites alongside other data on rice proteins (e.g., structural models from AlphaFold2), and into PeptideAtlas and the PRIDE database, enabling visualization of source evidence, including scores and supporting mass spectra.


Subject(s)
Plant Genome , Oryza , Phosphoproteins , Plant Proteins , Proteomics , Signal Transduction , Oryza/genetics , Oryza/metabolism , Oryza/chemistry , Proteomics/methods , Phosphoproteins/metabolism , Phosphoproteins/genetics , Phosphoproteins/chemistry , Phosphoproteins/analysis , Plant Proteins/genetics , Plant Proteins/metabolism , Phosphorylation , Post-Translational Protein Processing , Phosphopeptides/metabolism , Phosphopeptides/analysis , Protein Databases , Amino Acid Motifs , Mass Spectrometry
3.
Proteomics ; : e2400005, 2024 Mar 31.
Article in English | MEDLINE | ID: mdl-38556628

ABSTRACT

We here present a chatbot assistant infrastructure (https://www.ebi.ac.uk/pride/chatbot/) that simplifies user interactions with the PRIDE database's documentation and dataset search functionality. The framework utilizes multiple Large Language Models (LLMs): llama2, chatglm, mixtral (mistral), and openhermes. It also includes a web service API (Application Programming Interface), web interface, and components for indexing and managing vector databases. An Elo-ranking system-based benchmark component is included in the framework as well, which allows for evaluating the performance of each LLM and for improving PRIDE documentation. The chatbot not only allows users to interact with PRIDE documentation but can also be used to search and find PRIDE datasets using an LLM-based recommendation system, enabling dataset discoverability. Importantly, while our infrastructure is exemplified through its application in the PRIDE database context, the modular and adaptable nature of our approach positions it as a valuable tool for improving user experiences across a spectrum of bioinformatics and proteomics tools and resources, among other domains. The integration of advanced LLMs, innovative vector-based construction, the benchmarking framework, and optimized documentation collectively form a robust and transferable chatbot assistant infrastructure. The framework is open-source (https://github.com/PRIDE-Archive/pride-chatbot).
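
As an illustration of the Elo-ranking idea mentioned in this abstract, the following is a minimal sketch of the standard Elo update applied to pairwise comparisons between two models. The K-factor of 32, the starting rating of 1000, and the function names are assumptions made for the example; they are not taken from the paper or from the pride-chatbot code.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A's answer was preferred, 0.0 if B's was, 0.5 for a tie.
    k (assumed 32 here) controls how strongly one comparison moves the ratings.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Hypothetical example: both models start at 1000 and a user prefers
# the answer produced by "mixtral" over the one produced by "llama2".
ratings = {"llama2": 1000.0, "mixtral": 1000.0}
ratings["mixtral"], ratings["llama2"] = update_elo(
    ratings["mixtral"], ratings["llama2"], score_a=1.0
)
```

Repeating this update over many user judgments yields a ranking of the models; the total rating is conserved because one model gains exactly what the other loses.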

5.
Sci Data ; 11(1): 112, 2024 Jan 23.
Article in English | MEDLINE | ID: mdl-38263211

ABSTRACT

Here we provide a curated, large-scale, label-free mass spectrometry-based proteomics data set derived from HeLa cell lines for general-purpose machine learning and analysis. Data access and filtering are tedious tasks that take up considerable amounts of researchers' time. We therefore provide machine-based metadata for easy selection and overview alongside the 7,444 raw files and MaxQuant search output. For convenience, we provide three filtered and aggregated development datasets at the protein group, peptide and precursor levels. In addition to providing easy-to-access training data, we provide an SDRF file annotating each raw file with instrument settings, allowing automated reprocessing. We encourage others to enlarge this data set with instrument runs of further HeLa samples from different machine types, and to that end we provide our workflows and analysis scripts.


Subject(s)
HeLa Cells , Machine Learning , Proteomics , Humans , Mass Spectrometry , Metadata
6.
Nucleic Acids Res ; 52(D1): D10-D17, 2024 Jan 05.
Article in English | MEDLINE | ID: mdl-38015445

ABSTRACT

The European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI) is one of the world's leading sources of public biomolecular data. Based at the Wellcome Genome Campus in Hinxton, UK, EMBL-EBI is one of six sites of the European Molecular Biology Laboratory (EMBL), Europe's only intergovernmental life sciences organisation. This overview summarises the latest developments in the services provided by EMBL-EBI data resources to scientific communities globally. These developments aim to ensure EMBL-EBI resources meet the current and future needs of these scientific communities, accelerating the impact of open biological data for all.


Subject(s)
Academies and Institutes , Computational Biology , Computational Biology/organization & administration , Computational Biology/trends , Academies and Institutes/organization & administration , Academies and Institutes/trends , Nucleic Acid Databases , Europe
7.
J Biomol Tech ; 34(3), 2023 Sep 30.
Article in English | MEDLINE | ID: mdl-37969874

ABSTRACT

Metaproteomics research using mass spectrometry data has emerged as a powerful strategy to understand the mechanisms underlying microbiome dynamics and the interaction of microbiomes with their immediate environment. Recent advances in sample preparation, data acquisition, and bioinformatics workflows have greatly contributed to progress in this field. In 2020, the Association of Biomolecular Research Facilities Proteome Informatics Research Group launched a collaborative study to assess the bioinformatics options available for metaproteomics research. The study was conducted in 2 phases. In the first phase, participants were provided with mass spectrometry data files and were asked to identify the taxonomic composition and relative taxa abundances in the samples without supplying any protein sequence databases. The most challenging question asked of the participants was to postulate the nature of any biological phenomena that may have taken place in the samples, such as interactions among taxonomic species. In the second phase, participants were provided a protein sequence database composed of the species present in the sample and were asked to answer the same set of questions as for phase 1. In this report, we summarize the data processing methods and tools used by participants, including database searching and software tools used for taxonomic and functional analysis. This study provides insights into the status of metaproteomics bioinformatics in participating laboratories and core facilities.


Subject(s)
Proteome , Proteomics , Humans , Proteomics/methods , Software , Computational Biology , Protein Databases
8.
Nat Commun ; 14(1): 6743, 2023 Oct 24.
Article in English | MEDLINE | ID: mdl-37875519

ABSTRACT

Public proteomics data often lack essential metadata, limiting their potential. To address this, we present lesSDRF, a tool to simplify the process of metadata annotation, thereby ensuring that data leave a lasting, impactful legacy well beyond their initial publication.


Subject(s)
Metadata , Proteomics
9.
Proteomics ; 23(20): e2300188, 2023 Oct.
Article in English | MEDLINE | ID: mdl-37488995

ABSTRACT

Relative and absolute intensity-based protein quantification across cell lines, tissue atlases and tumour datasets is increasingly available in public datasets. These atlases enable researchers to explore fundamental biological questions, such as protein existence, expression location, quantity and correlation with RNA expression. Most studies provide MS1 feature-based label-free quantitative (LFQ) datasets; however, growing numbers of isobaric tandem mass tags (TMT) datasets remain unexplored. Here, we compare traditional intensity-based absolute quantification (iBAQ) proteome abundance ranking to an analogous method using reporter ion proteome abundance ranking with data from an experiment where LFQ and TMT were measured on the same samples. This new TMT method substitutes reporter ion intensities for MS1 feature intensities in the iBAQ framework. Additionally, we compared LFQ-iBAQ values to TMT-iBAQ values from two independent large-scale tissue atlas datasets (one LFQ and one TMT) using robust bottom-up proteomic identification, normalisation and quantitation workflows.
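
The iBAQ framework referred to in this abstract is commonly defined as the summed intensity of a protein's identified peptides divided by the number of its theoretically observable peptides. The sketch below illustrates that calculation under stated assumptions: a simplified in-silico trypsin digest (cleave after K or R, not before P), an observable length window of 6 to 30 residues, and hypothetical function names. None of these choices are taken from the paper. The intensities passed in can be MS1 feature intensities or, as the abstract describes for the TMT variant, reporter ion intensities.

```python
import re


def tryptic_peptides(protein_sequence: str) -> list[str]:
    """Simplified trypsin digest: cleave after K/R unless followed by P."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein_sequence) if p]


def count_observable(protein_sequence: str, min_len: int = 6, max_len: int = 30) -> int:
    """Count peptides inside the assumed observable length window."""
    return sum(
        min_len <= len(p) <= max_len for p in tryptic_peptides(protein_sequence)
    )


def ibaq(peptide_intensities: list[float], protein_sequence: str) -> float:
    """Summed peptide intensity divided by the theoretical peptide count."""
    n_observable = count_observable(protein_sequence)
    if n_observable == 0:
        raise ValueError("protein has no theoretically observable peptides")
    return sum(peptide_intensities) / n_observable


# Hypothetical protein and intensities, for illustration only.
protein = "AAAAAAKGGGGGGRPLLLLLLKCC"
value = ibaq([3.0e6, 1.0e6], protein)
```

Dividing by the theoretical peptide count is what makes values comparable between proteins of different lengths, which is the property an abundance ranking relies on.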

10.
J Proteome Res ; 22(6): 2114-2123, 2023 Jun 02.
Article in English | MEDLINE | ID: mdl-37220883

ABSTRACT

Testing for significant differences in quantities at the protein level is a common goal of many LFQ-based mass spectrometry proteomics experiments. Starting from a table of protein and/or peptide quantities from a given proteomics quantification software, many tools and R packages exist to perform the final tasks of imputation, summarization, normalization, and statistical testing. To evaluate the effects of packages and settings in their substeps on the final list of significant proteins, we studied several packages on three public data sets with known expected protein fold changes. We found that the results between packages and even across different parameters of the same package can vary significantly. In addition to usability aspects and feature/compatibility lists of different packages, this paper highlights sensitivity and specificity trade-offs that come with specific packages and settings.


Subject(s)
Peptides , Software , Peptides/analysis , Proteins/analysis , Mass Spectrometry/methods , Proteomics/methods
11.
J Proteome Res ; 22(2): 287-301, 2023 Feb 03.
Article in English | MEDLINE | ID: mdl-36626722

ABSTRACT

The Human Proteome Organization (HUPO) Proteomics Standards Initiative (PSI) has been successfully developing guidelines, data formats, and controlled vocabularies (CVs) for the proteomics community and other fields supported by mass spectrometry since its inception 20 years ago. Here we describe the general operation of the PSI, including its leadership, working groups, yearly workshops, and the document process by which proposals are thoroughly and publicly reviewed in order to be ratified as PSI standards. We briefly describe the current state of the many existing PSI standards, some of which remain the same as when originally developed, some of which have undergone subsequent revisions, and some of which have become obsolete. Then the set of proposals currently being developed is described, with an open call to the community for participation in the forging of the next generation of standards. Finally, we describe some synergies and collaborations with other organizations and look to the future in how the PSI will continue to promote the open sharing of data and thus accelerate the progress of the field of proteomics.


Subject(s)
Proteome , Proteomics , Humans , Reference Standards , Controlled Vocabulary , Mass Spectrometry , Protein Databases
12.
J Proteome Res ; 22(2): 632-636, 2023 Feb 03.
Article in English | MEDLINE | ID: mdl-36693629

ABSTRACT

Data set acquisition and curation are often the most difficult and time-consuming parts of a machine learning endeavor. This is especially true for proteomics-based liquid chromatography (LC) coupled to mass spectrometry (MS) data sets, due to the high levels of data reduction that occur between raw data and machine learning-ready data. Since predictive proteomics is an emerging field, when predicting peptide behavior in LC-MS setups, each lab often uses unique and complex data processing pipelines in order to maximize performance, at the cost of accessibility and reproducibility. For this reason we introduce ProteomicsML, an online resource for proteomics-based data sets and tutorials across most of the currently explored physicochemical peptide properties. This community-driven resource makes it simple to access data in easy-to-process formats, and contains easy-to-follow tutorials that allow new users to interact with even the most advanced algorithms in the field. ProteomicsML provides data sets that are useful for comparing state-of-the-art machine learning algorithms, as well as providing introductory material for teachers and newcomers to the field alike. The platform is freely available at https://www.proteomicsml.org/, and we welcome the entire proteomics community to contribute to the project at https://github.com/ProteomicsML/ProteomicsML.


Subject(s)
Algorithms , Proteomics , Proteomics/methods , Reproducibility of Results , Peptides/analysis , Mass Spectrometry/methods , Software
13.
Nucleic Acids Res ; 51(D1): D1539-D1548, 2023 Jan 06.
Article in English | MEDLINE | ID: mdl-36370099

ABSTRACT

Mass spectrometry (MS) is by far the most used experimental approach in high-throughput proteomics. The ProteomeXchange (PX) consortium of proteomics resources (http://www.proteomexchange.org) was originally set up to standardize data submission and dissemination of public MS proteomics data. It is now 10 years since the initial data workflow was implemented. In this manuscript, we describe the main developments in PX since the previous update manuscript in Nucleic Acids Research was published in 2020. The six members of the Consortium are PRIDE, PeptideAtlas (including PASSEL), MassIVE, jPOST, iProX and Panorama Public. We report the current data submission statistics, showcasing that the number of datasets submitted to PX resources has continued to increase every year. As of June 2022, more than 34 233 datasets had been submitted to PX resources, of which 20 062 (58.6%) were submitted in the last three years alone. We also report the development of the Universal Spectrum Identifiers and the improvements in capturing the experimental metadata annotations. In parallel, we highlight that data re-use activities of public datasets continue to increase, enabling connections between PX resources and other popular bioinformatics resources, novel research and also new data resources. Finally, we summarise the current state-of-the-art in data management practices for sensitive human (clinical) proteomics data.


Subject(s)
Proteomics , Software , Humans , Protein Databases , Mass Spectrometry , Proteomics/methods , Computational Biology/methods
14.
Expert Rev Proteomics ; 19(7-12): 297-310, 2022.
Article in English | MEDLINE | ID: mdl-36529941

ABSTRACT

INTRODUCTION: The creation of ProteomeXchange data workflows in 2012 transformed the field of proteomics by standardizing data submission and dissemination and enabling the widespread reanalysis of public MS proteomics data worldwide. ProteomeXchange has triggered a growing trend toward public dissemination of proteomics data, facilitating the assessment, reuse, comparative analyses, and extraction of new findings from public datasets. As of 2022, the consortium comprises PRIDE, PeptideAtlas, MassIVE, jPOST, iProX, and Panorama Public. AREAS COVERED: Here, we review and discuss the current ecosystem of resources, guidelines, and file formats for proteomics data dissemination and reanalysis. Special attention is drawn to new exciting quantitative and post-translational modification-oriented resources. The challenges and future directions of data deposition, including the lack of metadata and the need for cloud-based and high-performance software solutions for fast and reproducible reanalysis of the available data, are discussed. EXPERT OPINION: The success of ProteomeXchange and the amount of proteomics data available in the public domain have triggered the creation and/or growth of other protein knowledgebase resources. Data reuse is a leading, active, and evolving field, supporting the creation of new formats, tools, and workflows to rediscover and reshape the public proteomics data.


Subject(s)
Ecosystem , Proteomics , Humans , Protein Databases , Software , Proteins/metabolism
15.
J Proteome Res ; 21(6): 1566-1574, 2022 Jun 03.
Article in English | MEDLINE | ID: mdl-35549218

ABSTRACT

Spectrum clustering is a powerful strategy to minimize redundant mass spectra by grouping them based on similarity, with the aim of forming groups of mass spectra from the same repeatedly measured analytes. Each such group of near-identical spectra can be represented by its so-called consensus spectrum for downstream processing. Although several algorithms for spectrum clustering have been adequately benchmarked and tested, the influence of the consensus spectrum generation step is rarely evaluated. Here, we present an implementation and benchmark of common consensus spectrum algorithms, including spectrum averaging, spectrum binning, the most similar spectrum, and the best-identified spectrum. We have analyzed diverse public data sets using two different clustering algorithms (spectra-cluster and MaRaCluster) to evaluate how the consensus spectrum generation procedure influences downstream peptide identification. The BEST and BIN methods were found to be the most reliable methods for consensus spectrum generation, including for data sets with post-translational modifications (PTM) such as phosphorylation. All source code and data of the present study are freely available on GitHub at https://github.com/statisticalbiotechnology/representative-spectra-benchmark.


Subject(s)
Proteomics , Tandem Mass Spectrometry , Algorithms , Cluster Analysis , Consensus , Protein Databases , Proteomics/methods , Software , Tandem Mass Spectrometry/methods
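
To make the spectrum-binning idea from this abstract concrete, here is a minimal sketch of one way a binned consensus spectrum could be built from a cluster of spectra. It is not the benchmarked implementation: the bin width of 0.02 m/z, the rule that a bin must be populated in a minimum fraction of spectra, and the function name are all assumptions chosen for illustration.

```python
from collections import defaultdict


def bin_consensus(spectra, bin_width=0.02, min_fraction=0.5):
    """Build a consensus spectrum by m/z binning.

    spectra: list of spectra, each a list of (mz, intensity) tuples.
    A bin is kept if it is populated in at least min_fraction of the spectra;
    its m/z and intensity are averaged over the spectra that populate it.
    """
    # bin index -> [sum of m/z, sum of intensity, number of spectra]
    accumulators = defaultdict(lambda: [0.0, 0.0, 0])
    for peaks in spectra:
        strongest = {}
        for mz, intensity in peaks:
            index = int(mz / bin_width)
            # Within one spectrum, keep only the most intense peak per bin.
            if index not in strongest or intensity > strongest[index][1]:
                strongest[index] = (mz, intensity)
        for index, (mz, intensity) in strongest.items():
            acc = accumulators[index]
            acc[0] += mz
            acc[1] += intensity
            acc[2] += 1

    consensus = []
    for index in sorted(accumulators):
        mz_sum, intensity_sum, count = accumulators[index]
        if count / len(spectra) >= min_fraction:
            consensus.append((mz_sum / count, intensity_sum / count))
    return consensus


# Hypothetical two-spectrum cluster: the peak near m/z 100 is shared,
# the peak at m/z 250.5 appears in only one spectrum and is dropped.
cluster = [
    [(100.001, 10.0), (250.5, 5.0)],
    [(100.003, 30.0)],
]
consensus = bin_consensus(cluster, min_fraction=0.6)
```

The minimum-fraction filter is what removes peaks that are not reproducible across the cluster; lowering it keeps more noise, raising it risks discarding genuine low-abundance fragments.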
16.
J Proteome Res ; 21(4): 1189-1195, 2022 Apr 01.
Article in English | MEDLINE | ID: mdl-35290070

ABSTRACT

It is important for the proteomics community to have a standardized manner to represent all possible variations of a protein or peptide primary sequence, including natural, chemically induced, and artifactual modifications. The Human Proteome Organization Proteomics Standards Initiative in collaboration with several members of the Consortium for Top-Down Proteomics (CTDP) has developed a standard notation called ProForma 2.0, which is a substantial extension of the original ProForma notation developed by the CTDP. ProForma 2.0 aims to unify the representation of proteoforms and peptidoforms. ProForma 2.0 supports use cases needed for bottom-up and middle-/top-down proteomics approaches and allows the encoding of highly modified proteins and peptides using a human- and machine-readable string. ProForma 2.0 can be used to represent protein modifications in a specified or ambiguous location, designated by mass shifts, chemical formulas, or controlled vocabulary terms, including cross-links (natural and chemical) and atomic isotopes. Notational conventions are based on public controlled vocabularies and ontologies. The most up-to-date full specification document and information about software implementations are available at http://psidev.info/proforma.


Subject(s)
Proteome , Proteomics , Humans , Post-Translational Protein Processing , Proteome/genetics , Reference Standards , Software
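
As a small illustration of the human- and machine-readable string described in this abstract, the sketch below parses only the simplest form of the notation: a peptide sequence in which a residue may be followed by one bracketed tag, such as a modification name or a mass shift. This is a deliberately limited subset written for this example. It does not handle terminal modifications, ambiguous localization, cross-links, isotopes, or other features of the full specification, and the function name is hypothetical.

```python
import re

# One residue letter, optionally followed by a single [tag].
_TOKEN = re.compile(r"([A-Z])(?:\[([^\]]+)\])?")
_WHOLE = re.compile(r"(?:[A-Z](?:\[[^\]]+\])?)+")


def parse_simple_proforma(text: str):
    """Return (plain_sequence, {1-based position: tag}) for the simple subset."""
    if not _WHOLE.fullmatch(text):
        raise ValueError(f"not in the simple subset handled here: {text!r}")
    residues = []
    modifications = {}
    for position, match in enumerate(_TOKEN.finditer(text), start=1):
        residue, tag = match.groups()
        residues.append(residue)
        if tag is not None:
            modifications[position] = tag
    return "".join(residues), modifications


# A name-based example and a mass-shift example of the same peptidoform.
sequence, modifications = parse_simple_proforma("EM[Oxidation]EVEES[Phospho]PEK")
_, mass_shifts = parse_simple_proforma("EM[+15.9949]EVEES[+79.9663]PEK")
```

For real work, an implementation listed on the specification page is the safer choice, since the full grammar is considerably richer than this subset.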
17.
Sci Data ; 9(1): 126, 2022 Mar 30.
Article in English | MEDLINE | ID: mdl-35354825

ABSTRACT

In the last decade, a revolution in liquid chromatography-mass spectrometry (LC-MS) based proteomics has unfolded with the introduction of dozens of novel instruments that incorporate additional data dimensions through innovative acquisition methodologies, in turn inspiring specialized data analysis pipelines. Simultaneously, a growing number of proteomics datasets have been made publicly available through data repositories such as ProteomeXchange, Zenodo and Skyline Panorama. However, developing algorithms to mine this data and assessing the performance on different platforms is currently hampered by the lack of a single benchmark experimental design. Therefore, we acquired a hybrid proteome mixture on different instrument platforms and in all currently available families of data acquisition. Here, we present a comprehensive Data-Dependent and Data-Independent Acquisition (DDA/DIA) dataset acquired using several of the most commonly used current-day instrumental platforms. The dataset consists of over 700 LC-MS runs, including adequate replicates allowing robust statistics, and covers nearly 10 different data formats, including scanning quadrupole and ion mobility enabled acquisitions. Datasets are available via ProteomeXchange (PXD028735).


Subject(s)
Benchmarking , Proteomics , Animals , Liquid Chromatography/methods , Humans , Mass Spectrometry/methods , Proteome
18.
Nucleic Acids Res ; 50(D1): D543-D552, 2022 Jan 07.
Article in English | MEDLINE | ID: mdl-34723319

ABSTRACT

The PRoteomics IDEntifications (PRIDE) database (https://www.ebi.ac.uk/pride/) is the world's largest data repository of mass spectrometry-based proteomics data. PRIDE is one of the founding members of the global ProteomeXchange (PX) consortium and an ELIXIR core data resource. In this manuscript, we summarize the developments in PRIDE resources and related tools since the previous update manuscript was published in Nucleic Acids Research in 2019. The number of datasets submitted to PRIDE Archive (the archival component of PRIDE) averaged around 500 per month during 2021. In addition to continuous improvements in PRIDE Archive data pipelines and infrastructure, the PRIDE Spectra Archive has been developed to provide direct access to the submitted mass spectra using Universal Spectrum Identifiers. As a key point, the file format MAGE-TAB for proteomics has been developed to enable the improvement of sample metadata annotation. Additionally, the resource PRIDE Peptidome provides access to aggregated peptide/protein evidence across PRIDE Archive. Furthermore, we describe how PRIDE has increased its efforts to reuse and disseminate high-quality proteomics data into other added-value resources such as UniProt, Ensembl and Expression Atlas.


Subject(s)
Protein Databases , Metadata/statistics & numerical data , Molecular Sequence Annotation/statistics & numerical data , Peptides/chemistry , Proteins/chemistry , Software , Amino Acid Sequence , Bibliometrics , Datasets as Topic , Humans , Information Storage and Retrieval , Internet , Mass Spectrometry , Peptides/genetics , Peptides/metabolism , Proteins/genetics , Proteins/metabolism , Proteomics/instrumentation , Proteomics/methods , Sequence Alignment
19.
Bioinformatics ; 38(5): 1470-1472, 2022 Feb 07.
Article in English | MEDLINE | ID: mdl-34904638

ABSTRACT

SUMMARY: We have implemented the pypgatk package and the pgdb workflow to create proteogenomics databases based on ENSEMBL resources. The tools allow the generation of protein sequences from novel protein-coding transcripts by performing a three-frame translation of pseudogenes, lncRNAs and other non-canonical transcripts, such as those produced by alternative splicing events. It also includes exonic out-of-frame translation from otherwise canonical protein-coding mRNAs. Moreover, the tool enables the generation of variant protein sequences from multiple sources of genomic variants including COSMIC, cBioportal, gnomAD and mutations detected from sequencing of patient samples. pypgatk and pgdb provide multiple functionalities for database handling including optimized target/decoy generation by the algorithm DecoyPyrat. Finally, we have reanalyzed six public datasets in PRIDE by generating cell-type specific databases for 65 cell lines using the pypgatk and pgdb workflow, revealing a wealth of non-canonical or cryptic peptides amounting to >5% of the total number of peptides identified. AVAILABILITY AND IMPLEMENTATION: The software is freely available. pypgatk: https://github.com/bigbio/py-pgatk/ and pgdb: https://nf-co.re/pgdb. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Proteogenomics , Humans , Peptides/genetics , Software , Algorithms , Proteins
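
The three-frame translation mentioned in this abstract can be sketched in a few lines. The example below is a standalone illustration using the standard genetic code; it is not the pypgatk API, and the function names are hypothetical. Stop codons are rendered as `*` and codons containing unknown bases as `X`, both of which are choices made for this sketch.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (third base fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nucleotides: str) -> str:
    """Translate a nucleotide string codon by codon; trailing bases are ignored."""
    sequence = nucleotides.upper().replace("U", "T")
    return "".join(
        CODON_TABLE.get(sequence[i:i + 3], "X")
        for i in range(0, len(sequence) - 2, 3)
    )


def three_frame_translation(nucleotides: str) -> list[str]:
    """Translate the forward strand in reading frames 0, 1 and 2."""
    return [translate(nucleotides[frame:]) for frame in range(3)]


frames = three_frame_translation("ATGGCCTAA")
```

In a proteogenomics setting, each translated frame would typically be split at stop codons and the resulting fragments filtered by length before being added to a search database; those steps are omitted here.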
20.
Cancers (Basel) ; 13(24), 2021 Dec 10.
Article in English | MEDLINE | ID: mdl-34944842

ABSTRACT

Plasma analysis by mass spectrometry-based proteomics remains a challenge because the dynamic range of plasma spans 10 orders of magnitude. We created a methodology for protein identification known as Wise MS Transfer (WiMT). Melanoma plasma samples from biobank archives were directly analyzed using simple sample preparation. WiMT is based on MS1 features between several MS runs together with custom protein databases for ID generation. This entails a multi-level dynamic protein database with different immunodepletion strategies by applying single-shot proteomics. The highest number of melanoma plasma proteins from undepleted and unfractionated plasma was reported, mapping >1200 proteins from >10,000 protein sequences with confirmed significance scoring. Of these, more than 660 proteins were annotated by WiMT from the resulting ~5800 protein sequences. We could verify 4000 proteins by MS1t analysis from HeLa extracts. The WiMT platform provided an output in which 12 previously well-known candidate markers were identified. We also identified low-abundance proteins with functions related to (i) cell signaling, (ii) immune system regulators, and (iii) proteins regulating folding, sorting, and degradation, as well as (iv) vesicular transport proteins. WiMT holds the potential for use in large-scale screening studies with simple sample preparation, and can lead to the discovery of novel proteins with key melanoma disease functions.
