1.
EMBO J ; 42(23): e114665, 2023 Dec 01.
Article in English | MEDLINE | ID: mdl-37916885

ABSTRACT

Substantial efforts are underway to deepen our understanding of human brain morphology, structure, and function using high-resolution imaging as well as high-content molecular profiling technologies. The current work adds to these approaches by providing a comprehensive and quantitative protein expression map of 13 anatomically distinct brain regions covering more than 11,000 proteins. This was enabled by the optimization, characterization, and implementation of a high-sensitivity and high-throughput microflow liquid chromatography timsTOF tandem mass spectrometry system (LC-MS/MS) capable of analyzing more than 2,000 consecutive samples prepared from formalin-fixed paraffin embedded (FFPE) material. Analysis of this proteomic resource highlighted brain region-enriched protein expression patterns and functional protein classes, protein localization differences between brain regions and individual markers for specific areas. To facilitate access to and ease further mining of the data by the scientific community, all data can be explored online in a purpose-built R Shiny app (https://brain-region-atlas.proteomics.ls.tum.de).


Subject(s)
Proteomics , Tandem Mass Spectrometry , Humans , Chromatography, Liquid/methods , Proteomics/methods , Paraffin Embedding/methods , Tandem Mass Spectrometry/methods , Proteins/metabolism , Brain/metabolism , Proteome/metabolism
2.
Nat Methods ; 19(7): 803-811, 2022 07.
Article in English | MEDLINE | ID: mdl-35710609

ABSTRACT

The laboratory mouse ranks among the most important experimental systems for biomedical research and molecular reference maps of such models are essential informational tools. Here, we present a quantitative draft of the mouse proteome and phosphoproteome constructed from 41 healthy tissues and several lines of analyses exemplify which insights can be gleaned from the data. For instance, tissue- and cell-type resolved profiles provide protein evidence for the expression of 17,000 genes, thousands of isoforms and 50,000 phosphorylation sites in vivo. Proteogenomic comparison of mouse, human and Arabidopsis reveals common and distinct mechanisms of gene expression regulation and, despite many similarities, numerous differentially abundant orthologs that likely serve species-specific functions. We leverage the mouse proteome by integrating phenotypic drug (n > 400) and radiation response data with the proteomes of 66 pancreatic ductal adenocarcinoma (PDAC) cell lines to reveal molecular markers for sensitivity and resistance. This unique atlas complements other molecular resources for the mouse and can be explored online via ProteomicsDB and PACiFIC.


Subject(s)
Arabidopsis , Carcinoma, Pancreatic Ductal , Pancreatic Neoplasms , Animals , Arabidopsis/genetics , Carcinoma, Pancreatic Ductal/metabolism , Mass Spectrometry , Mice , Pancreatic Neoplasms/genetics , Proteome/analysis
3.
Mol Syst Biol ; 20(1): 28-55, 2024 Jan.
Article in English | MEDLINE | ID: mdl-38177929

ABSTRACT

Kinase inhibitors (KIs) are important cancer drugs but often feature polypharmacology that is molecularly not understood. This disconnect is particularly apparent in cancer entities such as sarcomas for which the oncogenic drivers are often not clear. To investigate more systematically how the cellular proteotypes of sarcoma cells shape their response to molecularly targeted drugs, we profiled the proteomes and phosphoproteomes of 17 sarcoma cell lines and screened the same against 150 cancer drugs. The resulting 2550 phenotypic profiles revealed distinct drug responses and the cellular activity landscapes derived from deep (phospho)proteomes (9-10,000 proteins and 10-27,000 phosphorylation sites per cell line) enabled several lines of analysis. For instance, connecting the (phospho)proteomic data with drug responses revealed known and novel mechanisms of action (MoAs) of KIs and identified markers of drug sensitivity or resistance. All data is publicly accessible via an interactive web application that enables exploration of this rich molecular resource for a better understanding of active signalling pathways in sarcoma cells, identifying treatment response predictors and revealing novel MoA of clinical KIs.


Subject(s)
Antineoplastic Agents , Sarcoma , Humans , Proteomics/methods , Proteome , Protein Kinase Inhibitors/pharmacology , Protein Kinase Inhibitors/therapeutic use , Sarcoma/drug therapy , Antineoplastic Agents/pharmacology , Antineoplastic Agents/therapeutic use , Cell Line, Tumor
4.
Proteomics ; 24(8): e2300112, 2024 Apr.
Article in English | MEDLINE | ID: mdl-37672792

ABSTRACT

Machine learning (ML) and deep learning (DL) models for peptide property prediction such as Prosit have enabled the creation of high quality in silico reference libraries. These libraries are used in various applications, ranging from data-independent acquisition (DIA) data analysis to data-driven rescoring of search engine results. Here, we present Oktoberfest, an open source Python package of our spectral library generation and rescoring pipeline originally only available online via ProteomicsDB. Oktoberfest is largely search engine agnostic and provides access to online peptide property predictions, promoting the adoption of state-of-the-art ML/DL models in proteomics analysis pipelines. We demonstrate its ability to reproduce and even improve our results from previously published rescoring analyses on two distinct use cases. Oktoberfest is freely available on GitHub (https://github.com/wilhelm-lab/oktoberfest) and can easily be installed locally through the cross-platform PyPI Python package.


Subject(s)
Proteomics , Software , Proteomics/methods , Peptides , Algorithms
5.
Mol Cell Proteomics ; 21(12): 100437, 2022 12.
Article in English | MEDLINE | ID: mdl-36328188

ABSTRACT

Estimating false discovery rates (FDRs) of protein identification continues to be an important topic in mass spectrometry-based proteomics, particularly when analyzing very large datasets. One performant method for this purpose is the Picked Protein FDR approach which is based on a target-decoy competition strategy on the protein level that ensures that FDRs scale to large datasets. Here, we present an extension to this method that can also deal with protein groups, that is, proteins that share common peptides such as protein isoforms of the same gene. To obtain well-calibrated FDR estimates that preserve protein identification sensitivity, we introduce two novel ideas. First, the picked group target-decoy and second, the rescued subset grouping strategies. Using entrapment searches and simulated data for validation, we demonstrate that the new Picked Protein Group FDR method produces accurate protein group-level FDR estimates regardless of the size of the data set. The validation analysis also uncovered that applying the commonly used Occam's razor principle leads to anticonservative FDR estimates for large datasets. This is not the case for the Picked Protein Group FDR method. Reanalysis of deep proteomes of 29 human tissues showed that the new method identified up to 4% more protein groups than MaxQuant. Applying the method to the reanalysis of the entire human section of ProteomicsDB led to the identification of 18,000 protein groups at 1% protein group-level FDR. The analysis also showed that about 1250 genes were represented by ≥2 identified protein groups. To make the method accessible to the proteomics community, we provide a software tool including a graphical user interface that enables merging results from multiple MaxQuant searches into a single list of identified and quantified protein groups.
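
The target-decoy competition at the protein level mentioned in this abstract can be illustrated with a minimal sketch. It assumes one score per protein and a decoy stored under the same key as the target it was derived from; the function name `picked_protein_fdr` and the scoring inputs are hypothetical, and the protein grouping and rescued subset steps of the published method are not covered.

```python
from typing import Dict, List, Tuple


def picked_protein_fdr(
    target_scores: Dict[str, float],
    decoy_scores: Dict[str, float],
) -> List[Tuple[str, float, bool, float]]:
    """Return (protein, score, is_decoy, q_value) rows, best score first.

    Assumption: a decoy shares its key with the target it was derived from.
    """
    survivors = []
    for protein in set(target_scores) | set(decoy_scores):
        t = target_scores.get(protein, float("-inf"))
        d = decoy_scores.get(protein, float("-inf"))
        # Pairwise competition: only the better-scoring member of a pair survives.
        if t >= d:
            survivors.append((protein, t, False))
        else:
            survivors.append((protein, d, True))

    # Sort by descending score; the name breaks ties deterministically.
    survivors.sort(key=lambda row: (-row[1], row[0]))

    # Running FDR estimate: decoys accepted so far / targets accepted so far.
    rows = []
    targets = decoys = 0
    for protein, score, is_decoy in survivors:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        rows.append([protein, score, is_decoy, decoys / max(targets, 1)])

    # q-value: the lowest FDR at which this row would still be accepted.
    best = float("inf")
    for row in reversed(rows):
        best = min(best, row[3])
        row[3] = best
    return [tuple(row) for row in rows]
```

Because each target competes with its own decoy before the list is ranked, the number of surviving decoys is meant to track the number of false targets rather than the size of the database, which is the intuition behind the scaling behaviour described above.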


Subject(s)
Peptides , Tandem Mass Spectrometry , Humans , Tandem Mass Spectrometry/methods , Databases, Protein , Software , Proteome , Algorithms
6.
Mol Cell Proteomics ; 21(8): 100238, 2022 08.
Article in English | MEDLINE | ID: mdl-35462064

ABSTRACT

Isobaric stable isotope labeling techniques such as tandem mass tags (TMTs) have become popular in proteomics because they enable the relative quantification of proteins with high precision from up to 18 samples in a single experiment. While missing values in peptide quantification are rare in a single TMT experiment, they rapidly increase when combining multiple TMT experiments. As the field moves toward analyzing ever higher numbers of samples, tools that reduce missing values also become more important for analyzing TMT datasets. To this end, we developed SIMSI-Transfer (Similarity-based Isobaric Mass Spectra 2 [MS2] Identification Transfer), a software tool that extends our previously developed software MaRaCluster (© Matthew The) by clustering similar tandem MS2 from multiple TMT experiments. SIMSI-Transfer is based on the assumption that similarity-clustered MS2 spectra represent the same peptide. Therefore, peptide identifications made by database searching in one TMT batch can be transferred to another TMT batch in which the same peptide was fragmented but not identified. To assess the validity of this approach, we tested SIMSI-Transfer on masked search engine identification results and recovered >80% of the masked identifications while controlling errors in the transfer procedure to below 1% false discovery rate. Applying SIMSI-Transfer to six published full proteome and phosphoproteome datasets from the Clinical Proteomic Tumor Analysis Consortium led to an increase of 26 to 45% of identified MS2 spectra with TMT quantifications. This significantly decreased the number of missing values across batches and, in turn, increased the number of peptides and proteins identified in all TMT batches by 43 to 56% and 13 to 16%, respectively.
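
The transfer idea described in this abstract can be sketched as follows, assuming cluster assignments are already available (for example from a spectrum clustering tool). The function name `transfer_identifications` and the rule of transferring only when a cluster contains exactly one distinct peptide are illustrative assumptions, not the tool's actual implementation, and no error control on the transfers is attempted here.

```python
from collections import defaultdict
from typing import Dict, Optional


def transfer_identifications(
    cluster_of: Dict[str, int],
    peptide_of: Dict[str, Optional[str]],
) -> Dict[str, Optional[str]]:
    """Copy a peptide identification to unidentified spectra in the same cluster.

    cluster_of maps spectrum id -> cluster id.
    peptide_of maps spectrum id -> peptide sequence, or None if unidentified.
    """
    members = defaultdict(list)
    for spectrum, cluster in cluster_of.items():
        members[cluster].append(spectrum)

    result = dict(peptide_of)
    for spectra in members.values():
        peptides = {peptide_of.get(s) for s in spectra} - {None}
        # Skip clusters with no identification or with conflicting ones.
        if len(peptides) != 1:
            continue
        (peptide,) = peptides
        for s in spectra:
            if result.get(s) is None:
                result[s] = peptide
    return result
```

Skipping clusters with conflicting identifications is a conservative choice made for this sketch; it trades some recovered identifications for fewer wrong transfers.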


Subject(s)
Proteomics , Tandem Mass Spectrometry , Cluster Analysis , Isotope Labeling , Peptides , Proteome , Software
7.
Nucleic Acids Res ; 50(D1): D1541-D1552, 2022 01 07.
Article in English | MEDLINE | ID: mdl-34791421

ABSTRACT

ProteomicsDB (https://www.ProteomicsDB.org) is a multi-omics and multi-organism resource for life science research. In this update, we present our efforts to continuously develop and expand ProteomicsDB. The major focus over the last two years was improving the findability, accessibility, interoperability and reusability (FAIR) of the data as well as its implementation. For this purpose, we release a new application programming interface (API) that provides systematic access to essentially all data in ProteomicsDB. Second, we release a new open-source user interface (UI) and show the advantages the scientific community gains from such software. With the new interface, two new visualizations of protein primary, secondary and tertiary structure as well as an updated spectrum viewer were added. Furthermore, we integrated ProteomicsDB with our deep-neural-network Prosit that can predict the fragmentation characteristics and retention time of peptides. The result is an automatic processing pipeline that can be used to reevaluate database search engine results stored in ProteomicsDB. In addition, we extended the data content with experiments investigating different human biology as well as a newly supported organism.


Subject(s)
Databases, Protein , Proteins/classification , Proteomics/classification , Software , Biological Science Disciplines , Humans , Neural Networks, Computer , Proteins/chemistry
8.
J Proteome Res ; 22(4): 1359-1366, 2023 04 07.
Article in English | MEDLINE | ID: mdl-36988210

ABSTRACT

A frequent goal, or subgoal, when processing data from a quantitative shotgun proteomics experiment is a list of proteins that are differentially abundant under the examined experimental conditions. Unfortunately, obtaining such a list is a challenging process, as the mass spectrometer analyzes the proteolytic peptides of a protein rather than the proteins themselves. We have previously designed a Bayesian hierarchical probabilistic model, Triqler, for combining peptide identification and quantification errors into probabilities of proteins being differentially abundant. However, the model was developed for data from data-dependent acquisition. Here, we show that Triqler is also compatible with data-independent acquisition data after applying minor alterations for the missing value distribution. Furthermore, we find that it has better performance than a set of compared state-of-the-art protein summarization tools when evaluated on data-independent acquisition data.


Subject(s)
Peptides , Proteins , Bayes Theorem , Proteins/analysis , Peptides/analysis , Mass Spectrometry/methods , Proteomics/methods
9.
Anal Chem ; 94(20): 7181-7190, 2022 05 24.
Article in English | MEDLINE | ID: mdl-35549156

ABSTRACT

The prediction of fragment ion intensities and retention time of peptides has gained significant attention over the past few years. However, the progress shown in the accurate prediction of such properties focused primarily on unlabeled peptides. Tandem mass tags (TMT) are chemical peptide labels that are coupled to free amine groups usually after protein digestion to enable the multiplexed analysis of multiple samples in bottom-up mass spectrometry. It is a standard workflow in proteomics ranging from single-cell to high-throughput proteomics. Particularly for TMT, increasing the number of confidently identified spectra is highly desirable as it provides identification and quantification information with every spectrum. Here, we report on the generation of an extensive resource of synthetic TMT-labeled peptides as part of the ProteomeTools project and present the extension of the deep learning model Prosit to predict the retention time and fragment ion intensities of TMT-labeled peptides with high accuracy. Prosit-TMT supports CID and HCD fragmentation and ion trap and Orbitrap mass analyzers in a single model. Reanalysis of published TMT data sets shows that this single model extracts substantial additional information. Applying Prosit-TMT, we discovered that the expression of many proteins in human breast milk follows a distinct daily cycle which may prime the newborn for nutritional or environmental cues.


Subject(s)
Deep Learning , Tandem Mass Spectrometry , Humans , Infant, Newborn , Peptides/chemistry , Proteolysis , Proteomics/methods , Tandem Mass Spectrometry/methods
10.
J Proteome Res ; 20(4): 2062-2068, 2021 04 02.
Article in English | MEDLINE | ID: mdl-33661646

ABSTRACT

Error estimation for differential protein quantification by label-free shotgun proteomics is challenging due to the multitude of error sources, each contributing uncertainty to the final results. We have previously designed a Bayesian model, Triqler, to combine such error terms into one combined quantification error. Here we present an interface for Triqler that takes MaxQuant results as input, allowing quick reanalysis of already processed data. We demonstrate that Triqler outperforms the original processing for a large set of both engineered and clinical/biological relevant data sets. Triqler and its interface to MaxQuant are available as a Python module under an Apache 2.0 license from https://pypi.org/project/triqler/.


Subject(s)
Proteomics , Software , Bayes Theorem , Proteins
11.
J Proteome Res ; 20(12): 5402-5411, 2021 12 03.
Article in English | MEDLINE | ID: mdl-34735149

ABSTRACT

Proteomic biomarker discovery using formalin-fixed paraffin-embedded (FFPE) tissue requires robust workflows to support the analysis of large cohorts of patient samples. It also requires finding a reasonable balance between achieving a high proteomic depth and limiting the overall analysis time. To this end, we evaluated the merits of online coupling of single-use disposable trap column nanoflow liquid chromatography, high-field asymmetric-waveform ion-mobility spectrometry (FAIMS), and tandem mass spectrometry (nLC-FAIMS-MS/MS). The data show that ≤600 ng of peptide digest should be loaded onto the chromatographic part of the system. Careful characterization of the FAIMS settings enabled the choice of optimal combinations of compensation voltages (CVs) as a function of the employed LC gradient time. We found nLC-FAIMS-MS/MS to be on par with StageTip-based off-line basic pH reversed-phase fractionation in terms of proteomic depth and reproducibility of protein quantification (coefficient of variation ≤15% for 90% of all proteins) but requiring 50% less sample and substantially reducing sample handling. Using FFPE materials from the lymph node, lung, and prostate tissue as examples, we show that nLC-FAIMS-MS/MS can identify 5000-6000 proteins from the respective tissue within a total of 3 h of analysis time.
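
The reproducibility criterion quoted in this abstract (coefficient of variation at or below 15% for 90% of proteins) can be checked with a short sketch. It assumes linear-scale replicate intensities per protein; the function names and the input layout are hypothetical.

```python
from statistics import mean, stdev
from typing import Dict, List


def coefficient_of_variation(values: List[float]) -> float:
    """Sample standard deviation divided by the mean (linear-scale values)."""
    return stdev(values) / mean(values)


def fraction_below(replicates: Dict[str, List[float]], threshold: float = 0.15) -> float:
    """Fraction of proteins whose CV across replicates is at or below threshold."""
    cvs = [coefficient_of_variation(v) for v in replicates.values()]
    return sum(cv <= threshold for cv in cvs) / len(cvs)
```

A value of `fraction_below(...)` of 0.9 or higher would correspond to the kind of statement made above. Each protein needs at least two replicate values with a non-zero mean for the CV to be defined.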


Subject(s)
Proteomics , Tandem Mass Spectrometry , Apoptosis Regulatory Proteins , Chromatography, Liquid/methods , Humans , Ion Mobility Spectrometry/methods , Male , Proteomics/methods , Reproducibility of Results , Tandem Mass Spectrometry/methods
12.
Anal Chem ; 93(25): 8687-8692, 2021 06 29.
Article in English | MEDLINE | ID: mdl-34124897

ABSTRACT

A current trend in proteomics is to acquire data in a "single-shot" by LC-MS/MS because it simplifies workflows and promises better throughput and quantitative accuracy than schemes that involve extensive sample fractionation. However, single-shot approaches can suffer from limited proteome coverage when performed by data dependent acquisition (ssDDA) on nanoflow LC systems. For applications where sample quantities are not scarce, this study shows that high proteome coverage can be obtained using a microflow LC-MS/MS system operating a 1 mm i.d. × 150 mm column, at a flow-rate of 50 µL/min and coupled to an Orbitrap HF-X mass spectrometer. The results demonstrate the identification of ∼9 000 proteins from 50 µg of protein digest from Arabidopsis roots, 7 500 from mouse thymus, and 7 300 from human breast cancer cells in 3 h of analysis time in a single run. The dynamic range of protein quantification measured by the iBAQ approach spanned 5 orders of magnitude and replicate analysis showed that the median coefficient of variation was below 20%. Together, this study shows that ssDDA by µLC-MS/MS is a robust method for comprehensive and large-scale proteome analysis that may be further extended to more rapid chromatography and data independent acquisition approaches in the future.


Subject(s)
Chromatography, Liquid , Proteomics , Tandem Mass Spectrometry , Animals , Arabidopsis , Cell Line , Humans , Mice , Proteome
13.
Mol Cell Proteomics ; 18(3): 561-570, 2019 03.
Article in English | MEDLINE | ID: mdl-30482846

ABSTRACT

Protein quantification by label-free shotgun proteomics experiments is plagued by a multitude of error sources. Typical pipelines for identifying differential proteins use intermediate filters to control the error rate. However, they often ignore certain error sources and, moreover, regard filtered lists as completely correct in subsequent steps. These two indiscretions can easily lead to a loss of control of the false discovery rate (FDR). We propose a probabilistic graphical model, Triqler, that propagates error information through all steps, employing distributions in favor of point estimates, most notably for missing value imputation. The model outputs posterior probabilities for fold changes between treatment groups, highlighting uncertainty rather than hiding it. We analyzed 3 engineered data sets and achieved FDR control and high sensitivity, even for truly absent proteins. In a bladder cancer clinical data set we discovered 35 proteins at 5% FDR, whereas the original study discovered 1 and MaxQuant/Perseus 4 proteins at this threshold. Compellingly, these 35 proteins showed enrichment for functional annotation terms, whereas the top ranked proteins reported by MaxQuant/Perseus showed no enrichment. The model executes in minutes and is freely available at https://pypi.org/project/triqler/.


Subject(s)
Proteomics/methods , Urinary Bladder Neoplasms/metabolism , Algorithms , Bayes Theorem , Databases, Protein , Humans , Models, Theoretical , Tandem Mass Spectrometry
14.
J Proteome Res ; 18(9): 3353-3359, 2019 09 06.
Article in English | MEDLINE | ID: mdl-31407580

ABSTRACT

The processing of peptide tandem mass spectrometry data involves matching observed spectra against a sequence database. The ranking and calibration of these peptide-spectrum matches can be improved substantially using a machine learning postprocessor. Here, we describe our efforts to speed up one widely used postprocessor, Percolator. The improved software is dramatically faster than the previous version of Percolator, even when using relatively few processors. We tested the new version of Percolator on a data set containing over 215 million spectra and recorded an overall reduction to 23% of the running time as compared to the unoptimized code. We also show that the memory footprint required by these speedups is modest relative to that of the original version of Percolator.


Subject(s)
Peptides/genetics , Proteomics/methods , Software , Algorithms , Databases, Protein , Machine Learning , Peptides/classification , Peptides/isolation & purification , Tandem Mass Spectrometry/methods
15.
J Proteome Res ; 17(5): 1993-1996, 2018 05 04.
Article in English | MEDLINE | ID: mdl-29682973

ABSTRACT

In the recent benchmarking article entitled "Comparison and Evaluation of Clustering Algorithms for Tandem Mass Spectra", Rieder et al. compared several different approaches to cluster MS/MS spectra. While we certainly recognize the value of the manuscript, here, we report some shortcomings detected in the original analyses. For most analyses, the authors clustered only single MS/MS runs. In one of the reported analyses, three MS/MS runs were processed together, which already led to computational performance issues in many of the tested approaches. This fact highlights the difficulties of using many of the tested algorithms on the nowadays produced average proteomics data sets. Second, the authors only processed identified spectra when merging MS runs. Thereby, all unidentified spectra that are of lower quality were already removed from the data set and could not influence the clustering results. Next, we found that the authors did not analyze the effect of chimeric spectra on the clustering results. In our analysis, we found that 3% of the spectra in the used data sets were chimeric, and this had marked effects on the behavior of the different clustering algorithms tested. Finally, the authors' choice to evaluate the MS-Cluster and spectra-cluster algorithms using a precursor tolerance of 5 Da for high-resolution Orbitrap data only was, in our opinion, not adequate to assess the performance of MS/MS clustering approaches.


Subject(s)
Algorithms , Tandem Mass Spectrometry , Benchmarking , Cluster Analysis , Proteomics
16.
J Proteome Res ; 17(5): 1879-1886, 2018 05 04.
Article in English | MEDLINE | ID: mdl-29631402

ABSTRACT

A natural way to benchmark the performance of an analytical experimental setup is to use samples of known composition and see to what degree one can correctly infer the content of such a sample from the data. For shotgun proteomics, one of the inherent problems of interpreting data is that the measured analytes are peptides and not the actual proteins themselves. As some proteins share proteolytic peptides, there might be more than one possible causative set of proteins resulting in a given set of peptides and there is a need for mechanisms that infer proteins from lists of detected peptides. A weakness of commercially available samples of known content is that they consist of proteins that are deliberately selected for producing tryptic peptides that are unique to a single protein. Unfortunately, such samples do not expose any complications in protein inference. Hence, for a realistic benchmark of protein inference procedures, there is a need for samples of known content where the present proteins share peptides with known absent proteins. Here, we present such a standard, that is based on E. coli expressed human protein fragments. To illustrate the application of this standard, we benchmark a set of different protein inference procedures on the data. We observe that inference procedures excluding shared peptides provide more accurate estimates of errors compared to methods that include information from shared peptides, while still giving a reasonable performance in terms of the number of identified proteins. We also demonstrate that using a sample of known protein content without proteins with shared tryptic peptides can give a false sense of accuracy for many protein inference methods.


Subject(s)
Algorithms , Benchmarking/methods , Proteomics/methods , Sequence Homology, Amino Acid , Benchmarking/standards , Escherichia coli/metabolism , Humans , Peptide Fragments/analysis , Peptides/analysis , Proteins/analysis , Proteins/metabolism , Trypsin/metabolism
17.
Bioinformatics ; 33(4): 508-513, 2017 02 15.
Article in English | MEDLINE | ID: mdl-27797755

ABSTRACT

Motivation: Liquid chromatography is frequently used as a means to reduce the complexity of peptide mixtures in shotgun proteomics. For such systems, the time when a peptide is released from a chromatography column and registered in the mass spectrometer is referred to as the peptide's retention time. Using heuristics or machine learning techniques, previous studies have demonstrated that it is possible to predict the retention time of a peptide from its amino acid sequence. In this paper, we apply Gaussian Process Regression to the feature representation of a previously described predictor, Elude. Using this framework, we demonstrate that it is possible to estimate the uncertainty of the prediction made by the model. Here we show how this uncertainty relates to the actual error of the prediction. Results: In our experiments, we observe a strong correlation between the estimated uncertainty provided by Gaussian Process Regression and the actual prediction error. This relation provides us with new means for assessment of the predictions. We demonstrate how a subset of the peptides can be selected with lower prediction error compared to the whole set. We also demonstrate how such predicted standard deviations can be used for designing adaptive windowing strategies. Contact: lukas.kall@scilifelab.se. Availability and Implementation: Our software and the data used in our experiments is publicly available and can be downloaded from https://github.com/statisticalbiotechnology/GPTime.
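
The two uses of predicted standard deviations named in this abstract, selecting a lower-error subset and adaptive windowing, can be sketched under simple assumptions. The sketch takes predictions and their standard deviations as given (it does not implement Gaussian Process Regression), and the function names, the tuple layout, and the window half-width of `k` standard deviations are hypothetical choices.

```python
import math
from typing import List, Tuple

# (peptide, predicted retention time, predicted standard deviation)
Prediction = Tuple[str, float, float]


def most_certain(predictions: List[Prediction], fraction: float) -> List[Prediction]:
    """Keep the given fraction of peptides with the smallest predicted std."""
    keep = math.ceil(len(predictions) * fraction)
    return sorted(predictions, key=lambda p: p[2])[:keep]


def adaptive_window(predicted_rt: float, std: float, k: float = 2.0) -> Tuple[float, float]:
    """Retention time window that widens with the predicted uncertainty."""
    return (predicted_rt - k * std, predicted_rt + k * std)
```

With a fixed window, every peptide gets the same tolerance; here a confidently predicted peptide gets a narrow window and an uncertain one a wide window, which is the idea an adaptive strategy relies on.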


Subject(s)
Models, Theoretical , Peptides/chemistry , Proteomics/methods , Software , Uncertainty , Amino Acid Sequence , Chromatography, Liquid/methods , Mass Spectrometry/methods
18.
Proteomics ; 16(18): 2461-9, 2016 09.
Article in English | MEDLINE | ID: mdl-27503675

ABSTRACT

A frequently sought output from a shotgun proteomics experiment is a list of proteins that we believe to have been present in the analyzed sample before proteolytic digestion. The standard technique to control for errors in such lists is to enforce a preset threshold for the false discovery rate (FDR). Many consider protein-level FDRs a difficult and vague concept, as the measurement entities, spectra, are manifestations of peptides and not proteins. Here, we argue that this confusion is unnecessary and provide a framework on how to think about protein-level FDRs, starting from its basic principle: the null hypothesis. Specifically, we point out that two competing null hypotheses are used concurrently in today's protein inference methods, which has gone unnoticed by many. Using simulations of a shotgun proteomics experiment, we show how confusing one null hypothesis for the other can lead to serious discrepancies in the FDR. Furthermore, we demonstrate how the same simulations can be used to verify FDR estimates of protein inference methods. In particular, we show that, for a simple protein inference method, decoy models can be used to accurately estimate protein-level FDRs for both competing null hypotheses.


Subject(s)
Algorithms , Proteins/analysis , Proteomics/methods , Databases, Protein , High-Throughput Screening Assays , Proteins/metabolism
19.
J Proteome Res ; 15(3): 713-20, 2016 Mar 04.
Article in English | MEDLINE | ID: mdl-26653874

ABSTRACT

Shotgun proteomics experiments generate large amounts of fragment spectra as primary data, normally with high redundancy between and within experiments. Here, we have devised a clustering technique to identify fragment spectra stemming from the same species of peptide. This is a powerful alternative method to traditional search engines for analyzing spectra, specifically useful for larger scale mass spectrometry studies. As an aid in this process, we propose a distance calculation relying on the rarity of experimental fragment peaks, following the intuition that peaks shared by only a few spectra offer more evidence than peaks shared by a large number of spectra. We used this distance calculation and a complete-linkage scheme to cluster data from a recent large-scale mass spectrometry-based study. The clusterings produced by our method have up to 40% more identified peptides for their consensus spectra compared to those produced by the previous state-of-the-art method. We see that our method would advance the construction of spectral libraries as well as serve as a tool for mining large sets of fragment spectra. The source code and Ubuntu binary packages are available at https://github.com/statisticalbiotechnology/maracluster (under an Apache 2.0 license).
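
The intuition stated in this abstract, that peaks shared by only a few spectra carry more evidence than peaks shared by many, can be illustrated with a toy weighting. The sketch assumes spectra are already reduced to sets of integer m/z bins and weights each shared peak by the negative log of its frequency; this weighting and the function names are illustrative assumptions, not the distance actually used by the published method.

```python
import math
from typing import Dict, List, Set


def peak_weights(spectra: List[Set[int]]) -> Dict[int, float]:
    """Weight each peak bin by -log(fraction of spectra that contain it)."""
    n = len(spectra)
    counts: Dict[int, int] = {}
    for peaks in spectra:
        for p in peaks:
            counts[p] = counts.get(p, 0) + 1
    return {p: -math.log(c / n) for p, c in counts.items()}


def rarity_similarity(a: Set[int], b: Set[int], weights: Dict[int, float]) -> float:
    """Sum of the weights of shared peaks; rare shared peaks count more."""
    return sum(weights[p] for p in a & b)
```

A peak present in every spectrum gets weight zero and therefore contributes nothing, so two spectra that only share ubiquitous peaks are not pulled together. A pairwise score of this kind could then feed a complete-linkage clustering step such as the one the abstract mentions.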


Subject(s)
Cluster Analysis , Mass Spectrometry/methods , Peptides/analysis , Proteomics/methods , Animals , Data Mining , Humans , Search Engine , Software
20.
Methods Mol Biol ; 2758: 457-483, 2024.
Article in English | MEDLINE | ID: mdl-38549030

ABSTRACT

Liquid chromatography-coupled mass spectrometry (LC-MS/MS) is the primary method to obtain direct evidence for the presentation of disease- or patient-specific human leukocyte antigen (HLA). However, compared to the analysis of tryptic peptides in proteomics, the analysis of HLA peptides still poses computational and statistical challenges. Recently, fragment ion intensity-based matching scores assessing the similarity between predicted and observed spectra were shown to substantially increase the number of confidently identified peptides, particularly in use cases where non-tryptic peptides are analyzed. In this chapter, we describe in detail three procedures on how to benefit from state-of-the-art deep learning models to analyze and validate single spectra, single measurements, and multiple measurements in mass spectrometry-based immunopeptidomics. For this, we explain how to use the Universal Spectrum Explorer (USE), online Oktoberfest, and offline Oktoberfest. For intensity-based scoring, Oktoberfest uses fragment ion intensity and retention time predictions from the deep learning framework Prosit, a deep neural network trained on a very large number of synthetic peptides and tandem mass spectra generated within the ProteomeTools project. The examples shown highlight how deep learning-assisted analysis can increase the number of identified HLA peptides, facilitate the discovery of confidently identified neo-epitopes, or provide assistance in the assessment of the presence of cryptic peptides, such as spliced peptides.


Subject(s)
Deep Learning , Humans , Chromatography, Liquid , Tandem Mass Spectrometry/methods , Peptides/analysis , Histocompatibility Antigens Class I , HLA Antigens