ABSTRACT
The volume of public proteomics data is rapidly increasing, posing a computational challenge for large-scale reanalysis. Here, we introduce quantms (https://quantms.org/), an open-source cloud-based pipeline for massively parallel proteomics data analysis. We used quantms to reanalyze 83 public ProteomeXchange datasets, comprising 29,354 instrument files from 13,132 human samples, to quantify 16,599 proteins based on 1.03 million unique peptides. quantms is based on standard file formats, improving the reproducibility, submission and dissemination of the data to ProteomeXchange.
Subject(s)
Cloud Computing, Proteomics, Software, Proteomics/methods, Humans, Protein Databases, Proteome/analysis, Reproducibility of Results, Computational Biology/methods, Peptides/analysis, Peptides/chemistry
ABSTRACT
In protein-RNA cross-linking mass spectrometry, UV or chemical cross-linking introduces stable bonds between amino acids and nucleic acids in protein-RNA complexes that are then analyzed and detected in mass spectra. This analytical tool delivers valuable information about RNA-protein interactions and RNA docking sites in proteins, both in vitro and in vivo. The identification of cross-linked peptides with oligonucleotides of different length leads to a combinatorial increase in search space. We demonstrate that the peptide retention time prediction tasks can be transferred to the task of cross-linked peptide retention time prediction using a simple amino acid composition encoding, yielding improved identification rates when the prediction error is included in rescoring. For the more challenging task of including fragment intensity prediction of cross-linked peptides in the rescoring, we obtain, on average, a similar improvement. Further improvement in the encoding and fine-tuning of retention time and intensity prediction models might lead to further gains, and merit further research.
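The "simple amino acid composition encoding" mentioned above can be made concrete in a few lines. The sketch below is illustrative, not the authors' implementation; the function name and the fixed residue ordering are assumptions. Because the encoding ignores sequence order, it applies unchanged to the peptide moiety of a cross-linked species.

```python
from collections import Counter

# Canonical 20 amino acids in a fixed (assumed) order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition_vector(peptide):
    """Order-free amino acid composition encoding: one count per residue type.
    The same encoding can be applied to cross-linked peptides without change,
    which is what makes transfer of the retention time model possible."""
    counts = Counter(peptide)
    return [counts.get(aa, 0) for aa in AMINO_ACIDS]
```

Such a 20-dimensional count vector can then be fed to any regression model for retention time prediction.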
Subject(s)
Nucleic Acids, RNA, Amino Acids, Mass Spectrometry, Peptides
ABSTRACT
Relative and absolute intensity-based protein quantification across cell lines, tissue atlases and tumour datasets is increasingly available in public datasets. These atlases enable researchers to explore fundamental biological questions, such as protein existence, expression location, quantity and correlation with RNA expression. Most studies provide MS1 feature-based label-free quantitative (LFQ) datasets; however, growing numbers of isobaric tandem mass tags (TMT) datasets remain unexplored. Here, we compare traditional intensity-based absolute quantification (iBAQ) proteome abundance ranking to an analogous method using reporter ion proteome abundance ranking with data from an experiment where LFQ and TMT were measured on the same samples. This new TMT method substitutes reporter ion intensities for MS1 feature intensities in the iBAQ framework. Additionally, we compared LFQ-iBAQ values to TMT-iBAQ values from two independent large-scale tissue atlas datasets (one LFQ and one TMT) using robust bottom-up proteomic identification, normalisation and quantitation workflows.
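The iBAQ idea itself is compact: a protein's summed peptide intensity (or, in the TMT variant described above, its summed reporter ion intensity) is divided by its number of theoretically observable tryptic peptides. A minimal pure-Python sketch follows; the function names and the 6-30 residue observability window are assumptions for illustration.

```python
import re

def tryptic_peptides(sequence, min_len=6, max_len=30):
    """In-silico trypsin digest: cleave after K or R, except before P,
    and keep peptides in the (assumed) observable length window."""
    peptides = re.split(r'(?<=[KR])(?!P)', sequence)
    return [p for p in peptides if min_len <= len(p) <= max_len]

def ibaq(total_intensity, sequence):
    """iBAQ: summed intensity divided by the number of theoretically
    observable peptides. For the TMT analogue, pass the summed reporter
    ion intensity instead of the summed MS1 feature intensity."""
    n = len(tryptic_peptides(sequence))
    return total_intensity / n if n else 0.0
```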
RESUMEN
Testing for significant differences in quantities at the protein level is a common goal of many LFQ-based mass spectrometry proteomics experiments. Starting from a table of protein and/or peptide quantities from a given proteomics quantification software, many tools and R packages exist to perform the final tasks of imputation, summarization, normalization, and statistical testing. To evaluate the effects of packages and settings in their substeps on the final list of significant proteins, we studied several packages on three public data sets with known expected protein fold changes. We found that the results between packages and even across different parameters of the same package can vary significantly. In addition to usability aspects and feature/compatibility lists of different packages, this paper highlights sensitivity and specificity trade-offs that come with specific packages and settings.
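The core quantities these packages compute for each protein can be illustrated with a log2 fold change and a Welch t statistic across two conditions. This is a toy sketch only; the packages under study additionally perform imputation, summarization, normalization and multiple-testing correction, which is exactly where their results diverge.

```python
import math
from statistics import mean, variance

def log2_fold_change(group_a, group_b):
    """log2 ratio of mean protein intensities between two conditions."""
    return math.log2(mean(group_a) / mean(group_b))

def welch_t(group_a, group_b):
    """Welch's t statistic (unequal-variance two-sample comparison)."""
    na, nb = len(group_a), len(group_b)
    va, vb = variance(group_a), variance(group_b)
    return (mean(group_a) - mean(group_b)) / math.sqrt(va / na + vb / nb)
```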
Subject(s)
Peptides, Software, Peptides/analysis, Proteins/analysis, Mass Spectrometry/methods, Proteomics/methods
ABSTRACT
spectrum_utils is a Python package for mass spectrometry data processing and visualization. Since its introduction, spectrum_utils has grown into a fundamental software solution that powers various applications in proteomics and metabolomics, ranging from spectrum preprocessing prior to spectrum identification and machine learning applications to spectrum plotting from online data repositories and assisting data analysis tasks for dozens of other projects. Here, we present updates to spectrum_utils, which include new functionality to integrate mass spectrometry community data standards, enhanced mass spectral data processing, and unified mass spectral data visualization in Python. spectrum_utils is freely available as open source at https://github.com/bittremieux/spectrum_utils.
Asunto(s)
Proteómica , Programas Informáticos , Espectrometría de Masas , Proteómica/métodos , Metabolómica , Aprendizaje AutomáticoRESUMEN
SUMMARY: We have implemented the pypgatk package and the pgdb workflow to create proteogenomics databases based on ENSEMBL resources. The tools allow the generation of protein sequences from novel protein-coding transcripts by performing a three-frame translation of pseudogenes, lncRNAs and other non-canonical transcripts, such as those produced by alternative splicing events. It also includes exonic out-of-frame translation from otherwise canonical protein-coding mRNAs. Moreover, the tool enables the generation of variant protein sequences from multiple sources of genomic variants including COSMIC, cBioPortal, gnomAD and mutations detected from sequencing of patient samples. pypgatk and pgdb provide multiple functionalities for database handling including optimized target/decoy generation by the algorithm DecoyPyrat. Finally, we have reanalyzed six public datasets in PRIDE by generating cell-type-specific databases for 65 cell lines using the pypgatk and pgdb workflow, revealing a wealth of non-canonical or cryptic peptides amounting to >5% of the total number of peptides identified. AVAILABILITY AND IMPLEMENTATION: The software is freely available. pypgatk: https://github.com/bigbio/py-pgatk/ and pgdb: https://nf-co.re/pgdb. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
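Target/decoy generation in the style mentioned above can be pictured as pseudo-reversal: each tryptic peptide is reversed in place while its C-terminal K/R stays put, so decoys keep a realistic cleavage-site distribution. The sketch below is a simplification, not DecoyPyrat itself, which additionally detects and resolves decoy peptides that collide with target sequences.

```python
import re

def pseudo_reverse_decoy(sequence):
    """Pseudo-reversed decoy protein: reverse each tryptic peptide but keep
    the C-terminal K/R in place, preserving cleavage-site positions."""
    peptides = re.split(r'(?<=[KR])(?!P)', sequence)
    decoy = []
    for pep in peptides:
        if pep and pep[-1] in "KR":
            # Reverse everything except the terminal cleavage residue.
            decoy.append(pep[-2::-1] + pep[-1])
        else:
            decoy.append(pep[::-1])
    return "".join(decoy)
```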
Subject(s)
Proteogenomics, Humans, Peptides/genetics, Software, Algorithms, Proteins
ABSTRACT
Spectrum clustering is a powerful strategy to minimize redundant mass spectra by grouping them based on similarity, with the aim of forming groups of mass spectra from the same repeatedly measured analytes. Each such group of near-identical spectra can be represented by its so-called consensus spectrum for downstream processing. Although several algorithms for spectrum clustering have been adequately benchmarked and tested, the influence of the consensus spectrum generation step is rarely evaluated. Here, we present an implementation and benchmark of common consensus spectrum algorithms, including spectrum averaging, spectrum binning, the most similar spectrum, and the best-identified spectrum. We have analyzed diverse public data sets using two different clustering algorithms (spectra-cluster and MaRaCluster) to evaluate how the consensus spectrum generation procedure influences downstream peptide identification. The BEST and BIN methods were found to be the most reliable methods for consensus spectrum generation, including for data sets with post-translational modifications (PTMs) such as phosphorylation. All source code and data of the present study are freely available on GitHub at https://github.com/statisticalbiotechnology/representative-spectra-benchmark.
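The spectrum-binning (BIN) strategy, for instance, can be sketched in a few lines: peaks from all cluster members are pooled into fixed-width m/z bins and intensities are averaged. This is a toy illustration; the bin width and averaging rule are assumptions, and real implementations additionally handle peaks split across bin borders.

```python
from collections import defaultdict

def bin_consensus(spectra, bin_width=0.02):
    """BIN-style consensus spectrum: pool peaks from all cluster members
    into fixed m/z bins and average intensity over the cluster size; the
    bin centre serves as the consensus m/z."""
    bins = defaultdict(list)
    for peaks in spectra:
        for mz, intensity in peaks:
            bins[int(mz / bin_width)].append(intensity)
    return [((b + 0.5) * bin_width, sum(ints) / len(spectra))
            for b, ints in sorted(bins.items())]
```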
Subject(s)
Proteomics, Tandem Mass Spectrometry, Algorithms, Cluster Analysis, Consensus, Protein Databases, Proteomics/methods, Software, Tandem Mass Spectrometry/methods
ABSTRACT
Cross-linking MS (XL-MS) has been recognized as an effective source of information about protein structures and interactions. In contrast to regular peptide identification, XL-MS has to deal with a quadratic search space, where peptides from every protein could potentially be cross-linked to any other protein. To cope with this search space, most tools apply different heuristics for search space reduction. We introduce a new open-source XL-MS database search algorithm, OpenPepXL, which offers increased sensitivity compared with other tools. OpenPepXL searches the full search space of an XL-MS experiment without using heuristics to reduce it. Because of efficient data structures and built-in parallelization, OpenPepXL achieves excellent runtimes and can also be deployed on large compute clusters and cloud services while maintaining a slim memory footprint. We compared OpenPepXL to several other commonly used tools for identification of noncleavable labeled and label-free cross-linkers on a diverse set of XL-MS experiments. In our first comparison, we used a data set from a fraction of a cell lysate with a protein database of 128 targets and 128 decoys. At 5% FDR, OpenPepXL finds from 7% to over 50% more unique residue pairs (URPs) than other tools. On data sets with available high-resolution structures for cross-link validation OpenPepXL reports from 7% to over 40% more structurally validated URPs than other tools. Additionally, we used a synthetic peptide data set that allows objective validation of cross-links without relying on structural information and found that OpenPepXL reports at least 12% more validated URPs than other tools. It has been built as part of the OpenMS suite of tools and supports Windows, macOS, and Linux operating systems. OpenPepXL also supports the mzIdentML 1.2 format for XL-MS identification results. It is freely available under a three-clause BSD license at https://openms.org/openpepxl.
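The 5% FDR threshold used in these comparisons follows standard target-decoy logic, which is compact enough to sketch. This simplified version ranks hits by score and keeps the largest target set whose estimated FDR stays below the threshold; XL-MS tools refine this further (e.g. separate intra- and inter-protein FDRs), which the toy below does not attempt.

```python
def fdr_threshold(scores, is_decoy, alpha=0.05):
    """Target-decoy FDR control: walk down the ranked score list and return
    the size of the largest accepted target set whose estimated FDR
    (#decoys / #targets) does not exceed alpha."""
    ranked = sorted(zip(scores, is_decoy), key=lambda x: -x[0])
    targets = decoys = best = 0
    for _, decoy in ranked:
        if decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= alpha:
            best = targets
    return best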
Subject(s)
Cross-Linking Reagents/chemistry, Peptides/analysis, Software, Algorithms, Amino Acid Sequence, Protein Databases, HEK293 Cells, Humans, Mass Spectrometry, Molecular Models, Peptides/chemistry, Ribosomes/metabolism
ABSTRACT
Data-independent acquisition (DIA) is becoming a leading analysis method in biomedical mass spectrometry. The main advantages include greater reproducibility and sensitivity and a greater dynamic range compared with data-dependent acquisition (DDA). However, the data analysis is complex and often requires expert knowledge when dealing with large-scale data sets. Here we present DIAproteomics, a multifunctional, automated, high-throughput pipeline implemented in the Nextflow workflow management system that allows one to easily process proteomics and peptidomics DIA data sets on diverse compute infrastructures. The central components are well-established tools such as the OpenSwathWorkflow for the DIA spectral library search and PyProphet for the false discovery rate assessment. In addition, it provides options to generate spectral libraries from existing DDA data and to carry out the retention time and chromatogram alignment. The output includes annotated tables and diagnostic visualizations from the statistical postprocessing and computation of fold-changes across pairwise conditions, predefined in an experimental design. DIAproteomics is well documented open-source software and is available under a permissive license to the scientific community at https://www.openms.de/diaproteomics/.
Subject(s)
Data Analysis, Proteomics, Mass Spectrometry, Reproducibility of Results, Software
ABSTRACT
The PRoteomics IDEntifications (PRIDE) database (https://www.ebi.ac.uk/pride/) is the world's largest data repository of mass spectrometry-based proteomics data, and is one of the founding members of the global ProteomeXchange (PX) consortium. In this manuscript, we summarize the developments in PRIDE resources and related tools since the previous update manuscript was published in Nucleic Acids Research in 2016. In the last 3 years, public data sharing through PRIDE (as part of PX) has become the norm in the field. In parallel, re-use of public proteomics data has increased enormously, with multiple applications. We first describe the new architecture of PRIDE Archive, the archival component of PRIDE. PRIDE Archive and the related data submission framework have been further developed to support the increase in submitted data volumes and additional data types. A new scalable and fault-tolerant storage backend, Application Programming Interface and web interface have been implemented, as part of an ongoing process. Additionally, we emphasize the improved support for quantitative proteomics data through the mzTab format. Finally, we outline key statistics on the current data contents and volume of downloads, and how PRIDE data are starting to be disseminated to added-value resources including Ensembl, UniProt and Expression Atlas.
Subject(s)
Protein Databases, Mass Spectrometry, Proteomics, Peptides/chemistry, Software
ABSTRACT
The field of computational proteomics is approaching the big data age, driven both by continuous growth in the number of samples analyzed per experiment and by the growing amount of data obtained in each analytical run. In order to process these large amounts of data, it is increasingly necessary to use elastic compute resources such as Linux-based cluster environments and cloud infrastructures. Unfortunately, the vast majority of cross-platform proteomics tools are not able to operate directly on the proprietary formats generated by the diverse mass spectrometers. Here, we present ThermoRawFileParser, an open-source, cross-platform tool that converts Thermo RAW files into open file formats such as MGF and the HUPO-PSI standard file format mzML. To ensure the broadest possible availability and to increase integration capabilities with popular workflow systems such as Galaxy or Nextflow, we have also built a Conda package and a BioContainers container around ThermoRawFileParser. In addition, we implemented a user-friendly interface (ThermoRawFileParserGUI) for those users not familiar with command-line tools. Finally, we performed a benchmark of ThermoRawFileParser and msconvert to verify that the converted mzML files contain reliable quantitative results.
Subject(s)
Computational Biology/methods, Proteomics/methods, Software, Protein Databases, Saccharomyces cerevisiae Proteins/metabolism, Workflow
ABSTRACT
Accurate protein inference in the presence of shared peptides is still one of the key problems in bottom-up proteomics. Most protein inference tools employing simple heuristic inference strategies are efficient but exhibit reduced accuracy. More advanced probabilistic methods often exhibit better inference quality but tend to be too slow for large data sets. Here, we present a novel protein inference method, EPIFANY, combining a loopy belief propagation algorithm with convolution trees for efficient processing of Bayesian networks. We demonstrate that EPIFANY combines the reliable protein inference of Bayesian methods with significantly shorter runtimes. On the 2016 iPRG protein inference benchmark data, EPIFANY is the only tested method that finds all true-positive proteins at a 5% protein false discovery rate (FDR) without strict prefiltering on the peptide-spectrum match (PSM) level, yielding an increase in identification performance (+10% in the number of true positives and +14% in partial AUC) compared to previous approaches. Even very large data sets with hundreds of thousands of spectra (which are intractable with other Bayesian and some non-Bayesian tools) can be processed with EPIFANY within minutes. The increased inference quality, including the principled handling of shared peptides, leads to more reliable protein-level results and thus more robust biological hypotheses. EPIFANY is available as open-source software for all major platforms at https://OpenMS.de/epifany.
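One of the "simple heuristic inference strategies" that probabilistic methods like EPIFANY are contrasted with is greedy parsimony, essentially a set-cover heuristic over shared peptides. The sketch below shows that baseline only; it does not attempt the Bayesian network machinery (loopy belief propagation, convolution trees) that EPIFANY itself uses.

```python
def greedy_parsimony(protein_to_peptides):
    """Greedy set-cover heuristic for protein inference: repeatedly pick the
    protein that explains the most still-unexplained peptides."""
    unexplained = set().union(*protein_to_peptides.values())
    inferred = []
    while unexplained:
        best = max(protein_to_peptides,
                   key=lambda p: len(protein_to_peptides[p] & unexplained))
        covered = protein_to_peptides[best] & unexplained
        if not covered:
            break  # remaining peptides map to no protein
        inferred.append(best)
        unexplained -= covered
    return inferred
```

The weakness the abstract alludes to is visible here: a shared peptide is credited entirely to whichever protein is picked first, with no probabilistic weighting.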
Subject(s)
Algorithms, Proteomics, Bayes Theorem, Protein Databases, Proteins, Software
ABSTRACT
Technological advances in high-resolution mass spectrometry (MS) have vastly increased the number of samples that can be processed in a life science experiment, as well as the volume and complexity of the generated data. To address the bottleneck of high-throughput data processing, we present SmartPeak (https://github.com/AutoFlowResearch/SmartPeak), an application that encapsulates advanced algorithms to enable fast, accurate, and automated processing of capillary electrophoresis-, gas chromatography-, and liquid chromatography (LC)-MS(/MS) data and high-pressure LC data for targeted and semitargeted metabolomics, lipidomics, and fluxomics experiments. The application allows for an approximate 100-fold reduction in the data processing time compared to manual processing while enhancing quality and reproducibility of the results.
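At its core, the targeted peak quantification that such tools automate reduces to integrating intensity over retention time, e.g. with the trapezoidal rule. This is a minimal sketch of that single step under assumed names; SmartPeak's actual algorithms additionally perform peak picking, baseline correction and quality control checks.

```python
def peak_area(times, intensities):
    """Trapezoidal integration of a chromatographic peak:
    area = sum of 0.5 * (t[i+1] - t[i]) * (y[i] + y[i+1])."""
    return sum(
        (times[i + 1] - times[i]) * (intensities[i] + intensities[i + 1]) / 2.0
        for i in range(len(times) - 1)
    )
```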
Subject(s)
Electronic Data Processing/methods, Metabolomics/methods, Automation, Liquid Chromatography, Capillary Electrophoresis, Tandem Mass Spectrometry, Time Factors
ABSTRACT
Personalized multipeptide vaccines are currently being discussed intensively for tumor immunotherapy. In order to identify epitopes (short, immunogenic peptides) suitable for eliciting a tumor-specific immune response, human leukocyte antigen-presented peptides are isolated by immunoaffinity purification from cancer tissue samples and analyzed by liquid chromatography-coupled tandem mass spectrometry (LC-MS/MS). Here, we present MHCquant, a fully automated, portable computational pipeline able to process LC-MS/MS data automatically and generate annotated, false discovery rate-controlled lists of (neo-)epitopes with associated relative quantification information. We show that MHCquant achieves higher sensitivity than established methods. While MHCquant obtains the highest number of unique peptides, its rate of predicted MHC binders remains comparable to that of other tools. Reprocessing of the data from a previously published study resulted in the identification of several neoepitopes not detected by the previously applied methods. MHCquant integrates tailor-made pipeline components with existing open-source software into a coherent processing workflow. Container-based virtualization permits execution of this workflow without complex software installation, execution on cluster/cloud infrastructures, and full reproducibility of the results. Integration with the data analysis workbench KNIME enables easy mining of large-scale immunopeptidomics data sets. MHCquant is available as open-source software along with accompanying documentation on our website at https://www.openms.de/mhcquant/ .
Subject(s)
Computational Biology/methods, Data Analysis, Peptides/metabolism, Proteomics/methods, Liquid Chromatography/methods, HLA Antigens/immunology, Humans, Internet, Mutation, Peptides/genetics, Peptides/immunology, Reproducibility of Results, Software, Tandem Mass Spectrometry/methods
ABSTRACT
Mass spectrometry (MS) is one of the primary techniques used for large-scale analysis of small molecules in metabolomics studies. To date, there has been little data format standardization in this field, as different software packages export results in different formats represented in XML or plain text, making data sharing, database deposition, and reanalysis highly challenging. Working within the consortia of the Metabolomics Standards Initiative, Proteomics Standards Initiative, and the Metabolomics Society, we have created mzTab-M to act as a common output format from analytical approaches using MS on small molecules. The format has been developed over several years, with input from a wide range of stakeholders. mzTab-M is a simple tab-separated text format, but importantly, the structure is highly standardized through the design of a detailed specification document, tightly coupled to validation software, and a mandatory controlled vocabulary of terms to populate it. The format is able to represent final quantification values from analyses, as well as the evidence trail in terms of features measured directly from MS (e.g., LC-MS, GC-MS, DIMS, etc.) and different types of approaches used to identify molecules. mzTab-M allows for ambiguity in the identification of molecules to be communicated clearly to readers of the files (both people and software). There are several implementations of the format available, and we anticipate widespread adoption in the field.
Subject(s)
Metabolomics/methods, Software, Factual Databases, Mass Spectrometry
ABSTRACT
High-resolution mass spectrometry (MS) has become an important tool in the life sciences, contributing to the diagnosis and understanding of human diseases, elucidating biomolecular structural information and characterizing cellular signaling networks. However, the rapid growth in the volume and complexity of MS data makes transparent, accurate and reproducible analysis difficult. We present OpenMS 2.0 (http://www.openms.de), a robust, open-source, cross-platform software framework specifically designed for the flexible and reproducible analysis of high-throughput MS data. The extensible OpenMS software implements common mass spectrometric data processing tasks through a well-defined application programming interface in C++ and Python and through standardized open data formats. OpenMS additionally provides a set of 185 tools and ready-made workflows for common mass spectrometric data processing tasks, which enable users to perform complex quantitative mass spectrometric analyses with ease.
Subject(s)
Computational Biology/methods, Electronic Data Processing, Mass Spectrometry/methods, Proteomics/methods, Software, Aging/blood, Blood Proteins/chemistry, Humans, Molecular Sequence Annotation, Proteogenomics/methods, Workflow
ABSTRACT
The 2017 Dagstuhl Seminar on Computational Proteomics provided an opportunity for a broad discussion on the current state and future directions of the generation and use of peptide tandem mass spectrometry spectral libraries. Their use in proteomics is growing slowly, but there are multiple challenges in the field that must be addressed to further increase the adoption of spectral libraries and related techniques. The primary bottlenecks are the paucity of high quality and comprehensive libraries and the general difficulty of adopting spectral library searching into existing workflows. There are several existing spectral library formats, but none captures a satisfactory level of metadata; therefore, a logical next improvement is to design a more advanced, Proteomics Standards Initiative-approved spectral library format that can encode all of the desired metadata. The group discussed a series of metadata requirements organized into three designations of completeness or quality, tentatively dubbed bronze, silver, and gold. The metadata can be organized at four different levels of granularity: at the collection (library) level, at the individual entry (peptide ion) level, at the peak (fragment ion) level, and at the peak annotation level. Strategies for encoding mass modifications in a consistent manner and the requirement for encoding high-quality and commonly seen but as-yet-unidentified spectra were discussed. The group also discussed related topics, including strategies for comparing two spectra, techniques for generating representative spectra for a library, approaches for selection of optimal signature ions for targeted workflows, and issues surrounding the merging of two or more libraries into one. We present here a review of this field and the challenges that the community must address in order to accelerate the adoption of spectral libraries in routine analysis of proteomics datasets.
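A widely used strategy for comparing two spectra, one of the topics discussed at the seminar, is the normalized dot product over binned peaks. The sketch below is illustrative only; production library-search tools use more careful peak matching, tolerances and intensity weighting.

```python
import math
from collections import defaultdict

def dot_product(spec_a, spec_b, bin_width=0.01):
    """Normalized spectral dot product (cosine similarity): peaks are matched
    by m/z bin and 1.0 indicates identical normalized intensity patterns."""
    def binned(spec):
        v = defaultdict(float)
        for mz, intensity in spec:
            v[round(mz / bin_width)] += intensity
        return v
    a, b = binned(spec_a), binned(spec_b)
    num = sum(a[k] * b.get(k, 0.0) for k in a)
    norm = (math.sqrt(sum(x * x for x in a.values()))
            * math.sqrt(sum(x * x for x in b.values())))
    return num / norm if norm else 0.0
```

The same similarity function underlies both spectral library searching and the generation of representative spectra for a library entry.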
Subject(s)
Protein Databases/standards, Peptide Library, Proteomics/methods, Animals, Humans, Tandem Mass Spectrometry/methods, Workflow
ABSTRACT
MOTIVATION: BioContainers (biocontainers.pro) is an open-source and community-driven framework which provides platform-independent executable environments for bioinformatics software. BioContainers allows labs of all sizes to easily install bioinformatics software, maintain multiple versions of the same software and combine tools into powerful analysis pipelines. BioContainers is based on the popular open-source projects Docker and rkt, container frameworks that allow software to be installed and executed under an isolated and controlled environment. It also provides infrastructure and basic guidelines to create, manage and distribute bioinformatics containers with a special focus on omics technologies. These containers can be integrated into more comprehensive bioinformatics pipelines and different architectures (local desktop, cloud environments or HPC clusters). AVAILABILITY AND IMPLEMENTATION: The software is freely available at github.com/BioContainers/. CONTACT: yperez@ebi.ac.uk.
Subject(s)
Computational Biology/methods, Software, Genomics/methods, Metabolomics/methods, Proteomics/methods
ABSTRACT
Cross-linking of nucleic acids to proteins in combination with mass spectrometry permits the precise identification of interacting residues between nucleic acid-protein complexes. However, the mass spectrometric identification and characterization of cross-linked nucleic acid-protein heteroconjugates within a complex sample is challenging. Here we establish a novel enzymatic differential 16O/18O-labeling approach, which uniquely labels heteroconjugates. We have developed an automated data analysis workflow based on OpenMS for the identification of differentially isotopically labeled heteroconjugates against a complex background. We validated our method using synthetic model DNA oligonucleotide-peptide heteroconjugates, which were subjected to the labeling reaction and analyzed by high-resolution FTICR mass spectrometry.
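The discriminating signal in such a differential labeling experiment is a mass shift of roughly 2.0042 Da per exchanged oxygen between the 16O and 18O forms. Screening a mass list for such light/heavy pairs is straightforward to sketch; the function name and tolerance below are assumptions, and the published workflow is implemented on top of OpenMS rather than as standalone code like this.

```python
# Approximate mass difference between 18O and 16O in Da.
O16_O18_DELTA = 2.004246

def find_label_pairs(masses, n_labels=1, tol=0.005):
    """Pair up light/heavy species differing by n_labels * (18O - 16O).
    Such pairs flag candidate peptide-oligonucleotide heteroconjugates,
    since only heteroconjugates carry the differential label."""
    shift = n_labels * O16_O18_DELTA
    pairs = []
    for i, light in enumerate(masses):
        for heavy in masses[i + 1:]:
            if abs(abs(heavy - light) - shift) <= tol:
                pairs.append((min(light, heavy), max(light, heavy)))
    return pairs
```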