ABSTRACT
In common with other omics technologies, mass spectrometry (MS)-based proteomics produces ever-increasing amounts of raw data, making efficient analysis a principal challenge. A plethora of computational tools can process MS data to derive peptide and protein identifications and quantification. In recent years, however, computer science has made dramatic progress, including collaboration tools that have transformed research and industry. To leverage these advances, we developed AlphaPept, a Python-based open-source framework for efficient processing of large high-resolution MS data sets. Just-in-time compilation with Numba on CPU and GPU achieves hundred-fold speed improvements. AlphaPept uses the Python scientific stack of highly optimized packages, reducing the code base to domain-specific tasks while accessing the latest advances. We provide an easy on-ramp for community contributions through the concept of literate programming, implemented in Jupyter Notebooks. Large datasets can be processed rapidly, as shown by the analysis of hundreds of proteomes in minutes per file, many-fold faster than acquisition. AlphaPept can be used to build automated processing pipelines with web-serving functionality and compatibility with downstream analysis tools. It provides easy access via one-click installation, a modular Python library for advanced users, and an open GitHub repository for developers.
Subject(s)
Proteomics , Software , Proteomics/methods , Mass Spectrometry/methods , Proteome
ABSTRACT
Machine learning and, in particular, deep learning (DL) are increasingly important in mass spectrometry (MS)-based proteomics. Recent DL models can predict the retention time, ion mobility and fragment intensities of a peptide from the amino acid sequence alone with good accuracy. However, DL is a very rapidly developing field in which new neural network architectures appear frequently, and these are challenging for proteomics researchers to incorporate. Here we introduce AlphaPeptDeep, a modular Python framework built on the PyTorch DL library that learns and predicts the properties of peptides (https://github.com/MannLabs/alphapeptdeep). It features a model shop that enables non-specialists to create models in just a few lines of code. AlphaPeptDeep represents post-translational modifications in a generic manner, even if only the chemical composition is known. Extensive use of transfer learning obviates the need for large data sets to refine models for particular experimental conditions. The AlphaPeptDeep models for predicting retention time, collisional cross sections and fragment intensities are at least on par with existing tools. Additional sequence-based properties can also be predicted by AlphaPeptDeep, as demonstrated with an HLA peptide prediction model to improve HLA peptide identification for data-independent acquisition (https://github.com/MannLabs/PeptDeep-HLA).
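The idea that a modification can be described by its chemical composition alone can be illustrated with a minimal sketch. The function name `modification_mass` and the dictionary layout are assumptions made for illustration and are not taken from the AlphaPeptDeep API; the element masses are standard monoisotopic values.

```python
# Monoisotopic masses (Da) of common elements; standard reference values.
MONOISOTOPIC_MASS = {
    "C": 12.0,
    "H": 1.00782503,
    "N": 14.0030740,
    "O": 15.9949146,
    "P": 30.9737620,
    "S": 31.9720707,
}


def modification_mass(composition):
    """Return the monoisotopic mass shift of a modification.

    `composition` (hypothetical layout) maps element symbols to signed
    atom counts, so the chemical formula alone defines the modification.
    """
    return sum(MONOISOTOPIC_MASS[element] * count
               for element, count in composition.items())


# Phosphorylation adds HPO3; acetylation adds C2H2O.
phospho = modification_mass({"H": 1, "P": 1, "O": 3})
acetyl = modification_mass({"C": 2, "H": 2, "O": 1})
print(f"phospho: {phospho:.4f} Da, acetyl: {acetyl:.4f} Da")
```

A composition-based description of this kind needs no lookup table of named modifications, which is one plausible reason a generic representation is attractive for uncommon modifications.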
Subject(s)
Deep Learning , Proteomics , Proteomics/methods , Peptides/chemistry , Amino Acid Sequence , Neural Networks, Computer
ABSTRACT
Data-independent acquisition (DIA) methods have become increasingly attractive in mass spectrometry-based proteomics because they enable high data completeness and a wide dynamic range. Recently, we combined DIA with parallel accumulation-serial fragmentation (dia-PASEF) on a Bruker trapped ion mobility (IM) separated quadrupole time-of-flight mass spectrometer. This requires alignment of the IM separation with the downstream mass selective quadrupole, leading to a more complex scheme for dia-PASEF window placement compared with DIA. To achieve high data completeness and deep proteome coverage, here we employ variable isolation windows that are placed optimally depending on precursor density in the m/z and IM plane. This is implemented in the freely available py_diAID (Python package for DIA with an automated isolation design) package. In combination with in-depth project-specific proteomics libraries and the Evosep liquid chromatography system, we reproducibly identified over 7700 proteins in a human cancer cell line in 44 min with quadruplicate single-shot injections at high sensitivity. Even at a throughput of 100 samples per day (11 min liquid chromatography gradients), we consistently quantified more than 6000 proteins in mammalian cell lysates by injecting four replicates. We found that optimal dia-PASEF window placement facilitates in-depth phosphoproteomics with very high sensitivity, quantifying more than 35,000 phosphosites in a human cancer cell line stimulated with epidermal growth factor in triplicate 21 min runs. This covers a substantial part of the regulated phosphoproteome with high sensitivity, opening the way for extensive systems-biological studies.
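Density-dependent window placement can be sketched in one dimension as follows. This is a simplified illustration, not the py_diAID algorithm: it considers only m/z (ignoring the IM dimension), and it assumes that "optimal" means each window holds a similar number of precursors. The function name `variable_windows` and the precursor values are hypothetical.

```python
def variable_windows(precursor_mz, n_windows):
    """Split the m/z range into windows holding similar precursor counts."""
    if n_windows < 1 or not precursor_mz:
        raise ValueError("need at least one window and one precursor")
    ordered = sorted(precursor_mz)
    n = len(ordered)
    # Boundary k sits at the k/n_windows quantile of the precursor list.
    edges = [ordered[0]]
    for k in range(1, n_windows):
        edges.append(ordered[(k * n) // n_windows])
    edges.append(ordered[-1])
    return list(zip(edges[:-1], edges[1:]))


# Hypothetical precursors: dense between 500 and 600 m/z, sparse elsewhere.
precursors = [400 + i for i in range(0, 100, 20)]
precursors += [500 + 0.5 * i for i in range(200)]
precursors += [600 + 10 * i for i in range(1, 41)]

windows = variable_windows(precursors, 4)
for lower, upper in windows:
    print(f"{lower:7.1f} - {upper:7.1f}")
```

Under this assumption, windows become narrow where precursors are dense and wide where they are sparse, so no single window is overloaded with co-fragmenting precursors.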
Subject(s)
Proteome , Tandem Mass Spectrometry , Animals , Chromatography, Liquid/methods , Epidermal Growth Factor , Humans , Mammals/metabolism , Proteome/metabolism , Proteomics/methods , Tandem Mass Spectrometry/methods
ABSTRACT
Mass spectrometry-based bottom-up proteomics is the main method to analyze proteomes comprehensively, and the rapid evolution of instrumentation and data analysis has made the technology widely available. Data visualization is an integral part of the analysis process and is crucial for the communication of results. It is also a major challenge because of the immense complexity of MS data. In this review, we provide an overview of commonly used visualizations, starting with raw data of traditional and novel MS technologies, then basic peptide and protein level analyses, and finally visualization of highly complex datasets and networks. We specifically provide guidance on how to critically interpret and discuss the multitude of different proteomics data visualizations. Furthermore, we highlight Python-based libraries and other open science tools that can be applied for independent and transparent generation of customized visualizations. To further encourage programmatic data visualization, we provide the Python code used to generate all data figures in this review on GitHub (https://github.com/MannLabs/ProteomicsVisualization).
Subject(s)
Data Visualization , Proteomics , Mass Spectrometry , Peptides , Proteomics/methods , Software
ABSTRACT
SUMMARY: Integrating experimental information across proteomic datasets with the wealth of publicly available sequence annotations is a crucial part of many proteomic studies that currently lacks an automated analysis platform. Here, we present AlphaMap, a Python package that facilitates the visual exploration of peptide-level proteomics data. Identified peptides and post-translational modifications in proteomic datasets are mapped to their corresponding protein sequence and visualized together with prior knowledge from UniProt and with expected proteolytic cleavage sites. The functionality of AlphaMap can be accessed via an intuitive graphical user interface or, more flexibly, as a Python package that allows its integration into common analysis workflows for data visualization. AlphaMap produces publication-quality illustrations and can easily be customized to address a given research question. AVAILABILITY AND IMPLEMENTATION: AlphaMap is implemented in Python and released under an Apache license. The source code and one-click installers are freely available at https://github.com/MannLabs/alphamap. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
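Mapping peptides onto a protein sequence and marking expected cleavage sites can be sketched as below. This is an illustrative sketch and not AlphaMap code: the function names and the example sequence are hypothetical, and the cleavage rule shown is the commonly used trypsin rule (cleavage after K or R unless followed by P), chosen here as an assumption.

```python
def map_peptides(protein_sequence, peptides):
    """Return {peptide: [(start, end), ...]} with 0-based, end-exclusive spans."""
    positions = {}
    for peptide in peptides:
        spans = []
        start = protein_sequence.find(peptide)
        while start != -1:
            spans.append((start, start + len(peptide)))
            # Continue one residue further so overlapping repeats are found.
            start = protein_sequence.find(peptide, start + 1)
        positions[peptide] = spans
    return positions


def tryptic_sites(protein_sequence):
    """Expected trypsin cleavage sites: after K or R, unless followed by P."""
    return [i + 1 for i, residue in enumerate(protein_sequence[:-1])
            if residue in "KR" and protein_sequence[i + 1] != "P"]


# Hypothetical protein sequence and identified peptides.
sequence = "MKTAYIAKQRQISFVKSHFSRQ"
mapped = map_peptides(sequence, ["TAYIAK", "QISFVK", "AAAA"])
sites = tryptic_sites(sequence)
print(mapped)
print(sites)
```

Peptides that end at an expected site are consistent with the protease used, whereas spans that end elsewhere may point to other proteolytic events, which is the kind of pattern a sequence-level visualization makes visible.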
Subject(s)
Proteomics , Software , Peptides , Amino Acid Sequence , Peptide Hydrolases
ABSTRACT
High-resolution MS-based proteomics generates large amounts of data, even in the standard LC-tandem MS configuration. Adding an ion mobility dimension vastly increases the acquired data volume, challenging both analytical processing pipelines and especially data exploration by scientists. This has necessitated data aggregation, effectively discarding much of the information present in these rich datasets. Taking trapped ion mobility spectrometry (TIMS) on a quadrupole TOF (Q-TOF) platform as an example, we developed an efficient indexing scheme that represents all data points as detector arrival times on scales of minutes (LC), milliseconds (TIMS), and microseconds (TOF). In our open-source AlphaTims package, data are indexed, accessed, and visualized by a combination of tools of the scientific Python ecosystem. We interpret unprocessed data as a sparse four-dimensional matrix and use just-in-time compilation to machine code with Numba, accelerating our computational procedures by several orders of magnitude while keeping to familiar indexing and slicing notations. For samples with more than six billion detector events, a modern laptop can load and index raw data in about a minute. Loading is even faster when AlphaTims has already saved indexed data in an HDF5 file, a portable scientific standard used in extremely large-scale data acquisition. Subsequently, data access along any dimension and interactive visualization happen in milliseconds. We have found AlphaTims to be a key enabling tool to explore high-dimensional LC-TIMS-Q-TOF data and have made it freely available as an open-source Python package with a stand-alone graphical user interface at https://github.com/MannLabs/alphatims or as part of the AlphaPept 'ecosystem'.
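Treating detector events as a sparse matrix indexed by arrival time can be sketched with a compressed-row style layout. The class below is a minimal pure-Python illustration under stated assumptions, not the AlphaTims implementation: the class name, the `(frame, scan, tof_index)` tuple layout, and the fixed `scans_per_frame` are hypothetical, with frame standing in for the LC scale, scan for the TIMS scale, and TOF index for the TOF scale.

```python
from bisect import bisect_right


class SparseEventIndex:
    """CSR-style index over detector events sorted by (frame, scan)."""

    def __init__(self, events, scans_per_frame):
        # `events` is an iterable of (frame, scan, tof_index) tuples,
        # with 0 <= scan < scans_per_frame (assumed, not checked).
        self.scans_per_frame = scans_per_frame
        ordered = sorted(events)
        self.tof_indices = [tof for _, _, tof in ordered]
        pushes = [frame * scans_per_frame + scan for frame, scan, _ in ordered]
        n_pushes = (max(pushes) + 1) if pushes else 0
        # offsets[p] .. offsets[p + 1] delimit the events of push p.
        self.offsets = [0] * (n_pushes + 1)
        for push in range(n_pushes):
            self.offsets[push + 1] = bisect_right(pushes, push)

    def events_at(self, frame, scan):
        """Return all TOF indices recorded for one (frame, scan) pair."""
        push = frame * self.scans_per_frame + scan
        if push + 1 >= len(self.offsets):
            return []
        return self.tof_indices[self.offsets[push]:self.offsets[push + 1]]


# Four hypothetical detector events, given in arbitrary order.
index = SparseEventIndex(
    [(0, 1, 100), (0, 1, 250), (1, 0, 90), (0, 0, 10)],
    scans_per_frame=2,
)
print(index.events_at(0, 1))
```

Only non-empty events are stored, and a lookup needs two offset reads plus one slice, which is why access along the frame and scan dimensions stays fast regardless of how many events the run contains.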
Subject(s)
Software , Chromatography, Liquid , HeLa Cells , Humans , Ion Mobility Spectrometry , Mass Spectrometry , Peptides
ABSTRACT
The size and shape of peptide ions in the gas phase are an under-explored dimension for mass spectrometry-based proteomics. To investigate the nature and utility of the peptide collisional cross section (CCS) space, we measure more than a million data points from whole-proteome digests of five organisms with trapped ion mobility spectrometry (TIMS) and parallel accumulation-serial fragmentation (PASEF). The scale and precision (CV < 1%) of our data are sufficient to train a deep recurrent neural network that accurately predicts CCS values solely based on the peptide sequence. Cross section predictions for the synthetic ProteomeTools peptides validate the model within a 1.4% median relative error (R > 0.99). Hydrophobicity, proportion of prolines and position of histidines are the main determinants of the cross sections, in addition to sequence-specific interactions. CCS values can now be predicted for any peptide and organism, forming a basis for advanced proteomics workflows that make full use of the additional information.
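The median relative error used to validate predictions can be computed as in the sketch below. The function name and the CCS values are made up for illustration; the error is assumed here to be the absolute deviation relative to the measured value.

```python
from statistics import median


def median_relative_error(measured, predicted):
    """Median of |predicted - measured| / measured, as a percentage."""
    if len(measured) != len(predicted) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [abs(p - m) / m for m, p in zip(measured, predicted)]
    return 100.0 * median(errors)


# Made-up CCS values in square angstroms, for illustration only.
measured_ccs = [400.0, 500.0, 350.0, 450.0]
predicted_ccs = [404.0, 495.0, 353.5, 459.0]
error_percent = median_relative_error(measured_ccs, predicted_ccs)
print(f"{error_percent:.2f} %")
```

The median is less sensitive than the mean to a small number of badly predicted peptides, so it summarizes typical prediction accuracy rather than worst-case behavior.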
Subject(s)
Deep Learning , Peptides/chemistry , Proteome/analysis , Proteomics/methods , Tandem Mass Spectrometry/methods , Amino Acid Sequence , Animals , Caenorhabditis elegans , Drosophila melanogaster , Escherichia coli , HeLa Cells , Humans , Ions , Neural Networks, Computer , Saccharomyces cerevisiae , Workflow
ABSTRACT
Data-independent acquisition modes isolate and concurrently fragment populations of different precursors by cycling through segments of a predefined precursor m/z range. Although these selection windows collectively cover the entire m/z range, overall, only a few per cent of all incoming ions are isolated for mass analysis. Here, we make use of the correlation of molecular weight and ion mobility in a trapped ion mobility device (timsTOF Pro) to devise a scan mode that samples up to 100% of the peptide precursor ion current in m/z and mobility windows. We extend an established targeted data extraction workflow by inclusion of the ion mobility dimension for both signal extraction and scoring and thereby increase the specificity for precursor identification. Data acquired from whole proteome digests and mixed organism samples demonstrate deep proteome coverage and a high degree of reproducibility as well as quantitative accuracy, even from 10 ng sample amounts.
Subject(s)
Data Science/methods , High-Throughput Screening Assays/methods , Ion Channels/metabolism , Ion Transport/physiology , Proteome/metabolism , Cell Line, Tumor , HeLa Cells , Humans , Ions/chemistry , Proteomics/methods , Reproducibility of Results , Tandem Mass Spectrometry/methods
ABSTRACT
Plasma and serum are rich sources of information regarding an individual's health state, and protein tests inform medical decision making. Despite major investments, few new biomarkers have reached the clinic. Mass spectrometry (MS)-based proteomics now allows highly specific and quantitative readout of the plasma proteome. Here, we employ Plasma Proteome Profiling to define quality marker panels to assess plasma samples and the likelihood that suggested biomarkers are instead artifacts related to sample handling and processing. We acquire deep reference proteomes of erythrocytes, platelets, plasma, and whole blood of 20 individuals (> 6,000 proteins), and compare serum and plasma proteomes. Based on spike-in experiments, we determine sample quality-associated proteins, many of which have been reported as biomarker candidates, as revealed by a comprehensive literature survey. We provide sample preparation guidelines and an online resource (www.plasmaproteomeprofiling.org) to assess overall sample-related bias in clinical studies and to prevent costly misassignment of biomarker candidates.