Results 1 - 20 of 47
1.
Mol Cell Proteomics ; 23(4): 100743, 2024 Apr.
Article in English | MEDLINE | ID: mdl-38403075

ABSTRACT

Discovering noncanonical peptides has been a common application of proteogenomics. Recent studies suggest that certain noncanonical peptides, known as noncanonical major histocompatibility complex-I (MHC-I)-associated peptides (ncMAPs), that bind to MHC-I may make good immunotherapeutic targets. De novo peptide sequencing is a great way to find ncMAPs since it can detect peptide sequences from their tandem mass spectra without using any sequence databases. However, this strategy has not been widely applied for ncMAP identification because there is not a good way to estimate its false-positive rates. In order to completely and accurately identify immunopeptides using de novo peptide sequencing, we describe a unique pipeline called proteomics X genomics. In contrast to current pipelines, it makes use of genomic data, RNA-Seq abundance and sequencing quality, in addition to proteomic features to increase the sensitivity and specificity of peptide identification. We show that the peptide-spectrum match quality and genetic traits have a clear relationship, showing that they can be utilized to evaluate peptide-spectrum matches. From 10 samples, we found 24,449 canonical MHC-I-associated peptides and 956 ncMAPs by using a target-decoy competition. Three hundred eighty-seven ncMAPs and 1611 canonical MHC-I-associated peptides were new identifications that had not yet been published. We discovered 11 ncMAPs produced from a squirrel monkey retrovirus in human cell lines in addition to the two ncMAPs originating from a complementarity determining region 3 in an antibody thanks to the unrestricted search space assumed by de novo sequencing. These entirely new identifications show that proteomics X genomics can make the most of de novo peptide sequencing's advantages and its potential use in the search for new immunotherapeutic targets.
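
The target-decoy competition mentioned in this abstract can be illustrated with a minimal sketch. It assumes each spectrum contributes one best-scoring match labeled as target or decoy, and it estimates the FDR as decoys divided by targets above a score cutoff. The function name `tdc_accept` and the tuple layout are hypothetical and are not taken from the published pipeline.

```python
def tdc_accept(psms, fdr_threshold=0.01):
    """Return target PSMs accepted at the given estimated FDR.

    psms: list of (score, is_decoy) tuples, one best match per spectrum;
    higher scores are assumed to be better.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    best_cut = 0  # number of top-ranked PSMs to keep
    for i, (_, is_decoy) in enumerate(ranked, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Estimated FDR among the targets scoring at or above this PSM.
        est_fdr = decoys / max(targets, 1)
        if est_fdr <= fdr_threshold:
            best_cut = i  # keep the deepest cutoff that still satisfies the threshold
    return [p for p in ranked[:best_cut] if not p[1]]


example = [(10, False), (9, False), (8, True), (7, False), (6, False)]
strict = tdc_accept(example, fdr_threshold=0.01)
loose = tdc_accept(example, fdr_threshold=0.25)
```

With the toy input, the strict threshold keeps only the two targets that outscore the first decoy, while the looser threshold keeps all four targets.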


Subject(s)
Histocompatibility Antigens Class I , Peptides , Peptides/metabolism , Peptides/chemistry , Histocompatibility Antigens Class I/genetics , Histocompatibility Antigens Class I/metabolism , Humans , Proteomics/methods , RNA-Seq/methods , Animals
2.
Bioinformatics ; 39(12), 2023 12 01.
Article in English | MEDLINE | ID: mdl-37995286

ABSTRACT

MOTIVATION: Predicting protein structures with high accuracy is a critical challenge for the broad community of life sciences and industry. Despite progress made by deep neural networks like AlphaFold2, there is a need for further improvements in the quality of detailed structures, such as side-chains, along with protein backbone structures. RESULTS: Building upon the successes of AlphaFold2, the modifications we made include changing the losses of side-chain torsion angles and frame aligned point error, adding loss functions for side chain confidence and secondary structure prediction, and replacing template feature generation with a new alignment method based on conditional random fields. We also performed re-optimization by conformational space annealing using a molecular mechanics energy function which integrates the potential energies obtained from distogram and side-chain prediction. In the CASP15 blind test for single protein and domain modeling (109 domains), DeepFold ranked fourth among 132 groups with improvements in the details of the structure in terms of backbone, side-chain, and Molprobity. In terms of protein backbone accuracy, DeepFold achieved a median GDT-TS score of 88.64 compared with 85.88 of AlphaFold2. For TBM-easy/hard targets, DeepFold ranked at the top based on Z-scores for GDT-TS. This shows its practical value to the structural biology community, which demands highly accurate structures. In addition, a thorough analysis of 55 domains from 39 targets with publicly available structures indicates that DeepFold shows superior side-chain accuracy and Molprobity scores among the top-performing groups. AVAILABILITY AND IMPLEMENTATION: DeepFold tools are open-source software available at https://github.com/newtonjoo/deepfold.
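
For readers unfamiliar with the GDT-TS score quoted above, the following sketch shows the arithmetic behind it: the average, over distance cutoffs of 1, 2, 4 and 8 Å, of the percentage of residues whose C-alpha atoms lie within the cutoff of the reference structure. It assumes the per-residue distances come from a single fixed superposition; the full measure searches over superpositions for each cutoff, which is not reproduced here. The function name `gdt_ts` is illustrative.

```python
def gdt_ts(distances, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Approximate GDT-TS (0-100) from per-residue C-alpha distances in angstroms.

    Assumption: the distances were measured after one fixed superposition of
    model and reference, so this is a simplification of the full measure.
    """
    if not distances:
        raise ValueError("at least one residue distance is required")
    n = len(distances)
    # Fraction of residues within each cutoff.
    fractions = [sum(d <= c for d in distances) / n for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)


score = gdt_ts([0.5, 1.5, 3.0, 9.0])
```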


Subject(s)
Proteins , Software , Protein Conformation , Proteins/chemistry , Protein Structure, Secondary , Protein Folding
3.
Anal Chem ; 95(30): 11193-11200, 2023 08 01.
Article in English | MEDLINE | ID: mdl-37459568

ABSTRACT

Predicting peptide detectability is useful in a variety of mass spectrometry (MS)-based proteomics applications, particularly targeted proteomics. However, most machine learning-based computational methods have relied solely on information from the peptide itself, such as its amino acid sequences or physicochemical properties, despite the fact that peptides detected by MS are dependent on many factors, including protein sample preparation, digestion, separation, ionization, and precursor selection during MS experiments. DbyDeep (Detectability by Deep learning) is an innovative end-to-end LSTM network model for peptide detectability prediction that incorporates sequence contexts of peptides and their cleavage sites (by protease). Utilizing the cleavage site contexts could improve the performance of prediction, and DbyDeep outperformed existing methods in predicting peptides recognizable from multiple MS/MS data sets with diverse species and MS instruments. We argue for the necessity of a learning model that encompasses several contexts associated with peptide detection, as opposed to depending just on peptide sequences. There is a Python implementation of DbyDeep at https://github.com/BISCodeRepo/DbyDeep.
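
The idea of pairing each peptide with the sequence context of its cleavage sites can be sketched as follows. The code performs an in-silico tryptic digestion (cleaving after K or R unless followed by P, a commonly used rule) and returns, for every peptide, a window of residues around its N-terminal and C-terminal cleavage sites. The window size, the padding character and the function name are assumptions of this sketch, not details of the DbyDeep implementation.

```python
def tryptic_peptides_with_context(protein, flank=3, pad="-"):
    """Digest a protein and return (peptide, n_context, c_context) tuples.

    Each context is a window of `flank` residues on either side of the
    cleavage site, padded with `pad` at the protein termini.
    """
    if not protein:
        return []
    # A cut at index i lies between protein[i-1] and protein[i].
    cuts = [0]
    for i in range(1, len(protein)):
        if protein[i - 1] in "KR" and protein[i] != "P":
            cuts.append(i)
    cuts.append(len(protein))

    # In the padded string, protein index j sits at position j + flank.
    padded = pad * flank + protein + pad * flank
    results = []
    for start, end in zip(cuts, cuts[1:]):
        peptide = protein[start:end]
        n_context = padded[start:start + 2 * flank]
        c_context = padded[end:end + 2 * flank]
        results.append((peptide, n_context, c_context))
    return results


digest = tryptic_peptides_with_context("MAKDERPLK", flank=2)
```

In the toy protein, the R followed by P is not cleaved, so the digestion yields two peptides whose shared cleavage site has the context `AKDE`.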


Subject(s)
Deep Learning , Tandem Mass Spectrometry , Peptides/chemistry , Proteins , Amino Acid Sequence
4.
Anal Chem ; 95(46): 16918-16926, 2023 11 21.
Article in English | MEDLINE | ID: mdl-37946317

ABSTRACT

To gain a better understanding of the complex human immune system, it is necessary to measure and interpret numerous cellular protein expressions at the single cell level. Mass cytometry is a relatively new technology that offers unprecedented information about the protein expression of a single cell. Conversely, the analysis of high-dimensional and multiparametric mass cytometric data sets presents a new computational challenge. For instance, conventional "manual gating" analysis was inefficient and unreliable for multiparametric phenotyping of the heterogeneous immune cellular system; consequently, automated methods have been developed to address the high dimensionality of mass cytometry data and enhance the reproducibility of the analysis. Here, we present CyGate, a semiautomated method for classifying single cells into their respective cell types. CyGate learns a gating strategy from a reference data set, trains a model for cell classification, and then automatically analyzes additional data sets using the trained model. CyGate also supports the machine learning framework for the classification of "ungated" cells, which are typically disregarded by automated methods. CyGate's utility was demonstrated by its high performance in cell type classification and the lowest generalization error on various public data sets when compared to the state-of-the-art semiautomated methods. Notably, CyGate had the shortest execution time, allowing it to scale with a growing number of samples. CyGate is available at https://github.com/seungjinna/cygate.


Subject(s)
Computational Biology , Machine Learning , Humans , Flow Cytometry/methods , Reproducibility of Results , Computational Biology/methods , Algorithms
5.
Bioinformatics ; 38(11): 2980-2987, 2022 05 26.
Article in English | MEDLINE | ID: mdl-35441674

ABSTRACT

MOTIVATION: Tandem mass tag (TMT)-based tandem mass spectrometry (MS/MS) has become the method of choice for the quantification of post-translational modifications in complex mixtures. Many cancer proteogenomic studies have highlighted the importance of large-scale phosphopeptide quantification coupled with TMT labeling. Herein, we propose a predicted Spectral DataBase (pSDB) search strategy called Deephos that can improve both sensitivity and specificity in identifying MS/MS spectra of TMT-labeled phosphopeptides. RESULTS: With deep learning-based fragment ion prediction, we compiled a pSDB of TMT-labeled phosphopeptides generated from ∼8000 human phosphoproteins annotated in UniProt. Deep learning could successfully recognize the fragmentation patterns altered by both TMT labeling and phosphorylation. In addition, we discuss the decoy spectra for false discovery rate (FDR) estimation in the pSDB search. We show that FDR could be inaccurately estimated by the existing decoy spectra generation methods and propose an innovative method to generate decoy spectra for more accurate FDR estimation. The utilities of Deephos were demonstrated in multi-stage analyses (coupled with database searches) of glioblastoma, acute myeloid leukemia and breast cancer phosphoproteomes. AVAILABILITY AND IMPLEMENTATION: Deephos pSDB and the search software are available at https://github.com/seungjinna/deephos.


Subject(s)
Phosphopeptides , Tandem Mass Spectrometry , Humans , Phosphopeptides/analysis , Tandem Mass Spectrometry/methods , Algorithms , Databases, Factual , Software , Databases, Protein
6.
BMC Bioinformatics ; 23(1): 109, 2022 Mar 30.
Article in English | MEDLINE | ID: mdl-35354356

ABSTRACT

BACKGROUND: In shotgun proteomics, database search engines have been developed to assign peptides to tandem mass (MS/MS) spectra and at the same time post-processing (or rescoring) approaches over the search results have been proposed to increase the number of confident peptide identifications. The most popular post-processing approaches such as Percolator and PeptideProphet have improved rates of peptide identifications by combining multiple scores from database search engines while applying machine learning techniques. Existing post-processing approaches, however, are limited when dealing with results from new search engines because their features for machine learning must be optimized specifically for each search engine. RESULTS: We propose a universal post-processing tool, called TIDD, which supports confident peptide identifications regardless of the search engine adopted. TIDD can work for any (including newly developed) search engines because it calculates universal features that assess peptide-spectrum match quality while it allows additional features provided by search engines (or users) as well. Even though it relies on universal features independent of search tools, TIDD showed similar or better performance than Percolator in terms of peptide identification. TIDD identified 10.23-38.95% more PSMs than target-decoy estimation for MSFragger, which is not supported by Percolator. TIDD offers an easy-to-use simple graphical user interface for user convenience. CONCLUSIONS: TIDD successfully eliminated the requirement for an optimal feature engineering per database search tool, and thus, can be applied directly to any database search results including newly developed ones.


Subject(s)
Algorithms , Tandem Mass Spectrometry , Databases, Protein , Machine Learning , Peptides , Tandem Mass Spectrometry/methods
7.
Bioinformatics ; 36(Suppl_1): i203-i209, 2020 07 01.
Article in English | MEDLINE | ID: mdl-32657416

ABSTRACT

MOTIVATION: Proteogenomics has proven its utility by integrating genomics and proteomics. Typical approaches use data from next-generation sequencing to infer proteins expressed. A sample-specific protein sequence database is often adopted to identify novel peptides from matched mass spectrometry-based proteomics; nevertheless, there is no software that can practically identify all possible forms of mutated peptides suggested by various genomic information sources. RESULTS: We propose MutCombinator, which enables us to practically identify mutated peptides from tandem mass spectra allowing combinatorial mutations during the database search. It uses an upgraded version of a variant graph, keeping track of frame information. The variant graph is indexed by nine nucleotides for fast access. Using MutCombinator, we could identify more mutated peptides than previous methods, because combinations of point mutations are considered and also because it can be practically applied together with a large mutation database such as COSMIC. Furthermore, MutCombinator supports in-frame search for coding regions and three-frame search for non-coding regions. AVAILABILITY AND IMPLEMENTATION: https://prix.hanyang.ac.kr/download/mutcombinator.jsp. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
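
To make the notion of combinatorial mutations concrete, the sketch below enumerates every peptide variant obtainable by applying up to a fixed number of point substitutions at once. This brute-force enumeration is only an illustration of the search space; the abstract describes an indexed variant graph, which is not reproduced here. The function name and the input format (0-based position mapped to alternative residues) are hypothetical.

```python
from itertools import combinations, product


def mutated_variants(peptide, substitutions, max_mutations=2):
    """Enumerate peptide variants carrying 1..max_mutations substitutions.

    substitutions: dict mapping a 0-based position to a list of
    alternative residues observed at that position.
    """
    variants = set()
    positions = sorted(substitutions)
    for k in range(1, max_mutations + 1):
        for combo in combinations(positions, k):
            # Try every choice of alternative residue at the chosen positions.
            for alts in product(*(substitutions[p] for p in combo)):
                residues = list(peptide)
                for pos, alt in zip(combo, alts):
                    residues[pos] = alt
                variants.add("".join(residues))
    variants.discard(peptide)  # drop "mutations" that restore the reference
    return sorted(variants)


variants = mutated_variants("ACDK", {1: ["S"], 2: ["E", "N"]})
```

The number of variants grows quickly with the number of mutable positions, which is one reason a graph representation is attractive for large mutation databases.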


Subject(s)
Nucleotides , Peptides , Databases, Protein , Mutation , Peptides/genetics , Proteomics , Software
8.
J Proteome Res ; 19(1): 212-220, 2020 01 03.
Article in English | MEDLINE | ID: mdl-31714086

ABSTRACT

Recent sequencing technologies have highlighted translation of untranslated regions (UTRs) in genomes, although it remains unknown whether the translated products persist in a cell. Here, we propose a proteogenomic approach to UTR identification at the proteome level, which has been challenging due to the lack of corresponding sequences required for peptide spectrum matching. We address the challenge by constructing a translated UTR (tUTR) database, consisting of all hypothetical sequences that can be translated from UTRs by assuming non-AUG initiation at near-cognate start codons and stop codon readthrough. In the analysis of the H1299 cell line mass spectrometry (MS/MS) dataset, the tUTR DB-based proteogenomic approach enabled the detection of 52 5'-UTR and 9 3'-UTR peptides from 45 and 9 genes, respectively. The identified UTR peptides were validated via high spectral similarity with their synthetic peptides. The 5'-UTR peptides pointed to alternative initiation sites with non-AUG start codons, which exactly conformed to Kozak contexts of annotated initiation sites. It is also worth noting that our approach can detect translated amino acid sequences as well as provide evidence for UTR translation, while ribosome profiling provides only the translation evidence. For previously reported stop codon readthrough in the MDH1 gene, we could confirm the amino acid inserted during the readthrough. Data are available via ProteomeXchange with identifier PXD016207.
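
A minimal sketch of how hypothetical translations from near-cognate start codons might be generated is given below. It treats a near-cognate codon as one that differs from ATG at exactly one nucleotide, scans a UTR sequence for such codons, and translates from each until the first stop codon or the end of the sequence. The first codon is decoded literally here; whether to substitute methionine at the initiation site is a modeling choice this sketch does not settle. Stop codon readthrough and continuation into the coding sequence are omitted, and all function names are illustrative.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built from the conventional TCAG ordering.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def is_near_cognate(codon):
    """True if the codon differs from ATG at exactly one position."""
    return len(codon) == 3 and sum(x != y for x, y in zip(codon, "ATG")) == 1


def translate_from_near_cognate_starts(utr):
    """Return (start_index, peptide) for every near-cognate start in the UTR."""
    utr = utr.upper().replace("U", "T")
    products = []
    for i in range(len(utr) - 2):
        if not is_near_cognate(utr[i:i + 3]):
            continue
        residues = []
        for j in range(i, len(utr) - 2, 3):
            residue = CODON_TABLE.get(utr[j:j + 3], "X")  # X for ambiguous bases
            if residue == "*":
                break
            residues.append(residue)
        products.append((i, "".join(residues)))
    return products


products = translate_from_near_cognate_starts("CTGGCATAA")
```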


Subject(s)
Proteogenomics , Codon, Initiator , Peptides/genetics , Tandem Mass Spectrometry , Untranslated Regions
9.
Int J Mol Sci ; 21(18), 2020 Sep 05.
Article in English | MEDLINE | ID: mdl-32899552

ABSTRACT

β/γ-Crystallins, the main structural proteins in human lenses, have a highly stable structure that keeps the lens transparent. Their mutations have been linked to cataracts. In this study, we identified 10 new mutations of β/γ-crystallins in a lens proteomic dataset of cataract patients using bioinformatics tools. Of these, two double mutants, S175G/H181Q of βB2-crystallin and P24S/S31G of γD-crystallin, carry mutations in the largest loop linking the distant β-sheets in the Greek key motif. We selected these double mutants to characterize the properties of these mutations, employing biochemical assays, identification of protein modifications with nanoUPLC-ESI-TOF tandem MS, and examination of their structural dynamics with hydrogen/deuterium exchange-mass spectrometry (HDX-MS). We found that both double mutations decrease protein stability and induce the aggregation of β/γ-crystallin, possibly causing cataracts. This finding suggests that both double mutants can serve as biomarkers of cataracts.


Subject(s)
Cataract/genetics , beta-Crystallin B Chain/genetics , gamma-Crystallins/genetics , Adolescent , Adult , Aged , Child, Preschool , Humans , Infant, Newborn , Lens, Crystalline/metabolism , Mutation/genetics , Protein Aggregates/genetics , Protein Stability , Proteomics/methods , beta-Crystallin B Chain/metabolism , gamma-Crystallins/metabolism
10.
J Proteome Res ; 18(10): 3800-3806, 2019 10 04.
Article in English | MEDLINE | ID: mdl-31475827

ABSTRACT

We propose to use cRFP (common Repository of FBS Proteins) in the MS (mass spectrometry) raw data search of cell secretomes. cRFP is a small supplementary sequence list of highly abundant fetal bovine serum proteins added to the reference database in use. The aim of using cRFP is to prevent the contaminant FBS proteins from being misidentified as other proteins in the reference database, just as we would use cRAP (common Repository of Adventitious Proteins) to prevent contaminant proteins present either by accident or through unavoidable contacts from being misidentified as other proteins. We expect it to be widely used in experiments where the proteins are obtained from serum-free media after thorough washing of the cells, or from complex media such as SILAC, or from extracellular vesicles directly.


Subject(s)
Cells, Cultured/metabolism , Proteome/analysis , Proteomics/methods , Serum/chemistry , Animals , Cattle , Culture Media/chemistry , Databases, Protein , Humans , Mass Spectrometry
11.
Anal Chem ; 91(17): 11324-11333, 2019 09 03.
Article in English | MEDLINE | ID: mdl-31365238

ABSTRACT

Post-translational modifications regulate various cellular processes and are of great biological interest. Unrestrictive searches of mass spectrometry data enable the detection of any type of modification. Here we propose MODplus, which makes practical unrestrictive searches possible by allowing (1) hundreds of modifications, (2) multiple modifications per peptide, (3) the whole proteome database, and (4) any tolerance values in search parameters. The utility of MODplus was demonstrated in large human data sets of HEK293 cells and TMT-labeled phosphorylation enrichment. Notably, MODplus supports identifying different modification types at multiple sites and reports real chemical and biological modifications, a task that has been very labor intensive when linking unrestrictive search results to real modifications. We also confirmed the presence of Missing Precursor (MP) spectra that were not identifiable using targeted precursor masses. The MP spectra mostly resulted in identifications of wrong modifications and negatively affected the overall performance, often by as much as 10%. MODplus can rapidly recognize MP spectra and correct their identifications, resulting in an increased identification rate, up to 70% in the HEK293 data set, as well as improved reliability.


Subject(s)
Mass Spectrometry/methods , Protein Processing, Post-Translational , Software , Databases, Protein , Datasets as Topic/standards , HEK293 Cells , Humans , Proteomics/methods , Reproducibility of Results , Scientific Experimental Error
12.
J Proteome Res ; 17(10): 3593-3598, 2018 10 05.
Article in English | MEDLINE | ID: mdl-30033731

ABSTRACT

Most database search tools for proteomics have their own scoring parameter sets depending on experimental conditions such as fragmentation methods, instruments, digestion enzymes, and so on. These scoring parameter sets are usually predefined by tool developers and cannot be modified by users. The number of different experimental conditions grows as the technology develops, and the given set of scoring parameters could be suboptimal for tandem mass spectrometry data acquired using new sample preparation or fragmentation methods. Here we introduce a new approach to optimize scoring parameters in a data-dependent manner using a spectrum quality filter. The new approach conducts a preliminary search for the spectra selected by the spectrum quality filter. Search results from the preliminary search are used to generate data-dependent scoring parameters; then, the full search over the entire input spectra is conducted using the learned scoring parameters. We show that the new approach yields more and better peptide-spectrum matches than the conventional search using built-in scoring parameters when compared at the same 1% false discovery rate.


Subject(s)
Algorithms , Peptides/metabolism , Proteins/metabolism , Proteomics/methods , Tandem Mass Spectrometry/methods , Data Accuracy , Databases, Protein , Humans , Reproducibility of Results , Search Engine/methods , Software
13.
Bioinformatics ; 33(8): 1218-1220, 2017 04 15.
Article in English | MEDLINE | ID: mdl-28031186

ABSTRACT

Summary: In many proteogenomic applications, mapping peptide sequences onto genome sequences can be very useful, because it allows us to understand the origins of the gene products. Existing software tools either take the genomic position of a peptide start site as an input or assume that the peptide sequence exactly matches the coding sequence of a given gene model. In the case of novel peptides resulting from genomic variations, especially structural variations such as alternative splicing, these existing tools cannot be directly applied unless users supply information about the variant, either its genomic position or its transcription model. Mapping potentially novel peptides to genome sequences, while allowing certain genomic variations, requires introducing novel gene models when aligning peptide sequences to gene structures. We have developed a new tool called ACTG (Amino aCids To Genome), which maps peptides to a genome, assuming all possible single exon skipping, junction variation allowing three edit distances from the original splice sites, exon extension and frame shift. In addition, it can also consider SNVs (single nucleotide variations) during the mapping phase if a user provides a VCF (variant call format) file as an input. Availability and Implementation: Available at http://prix.hanyang.ac.kr/ACTG/search.jsp. Contact: eunokpaek@hanyang.ac.kr. Supplementary information: Supplementary data are available at Bioinformatics online.


Subject(s)
Genomics/methods , Peptide Mapping/methods , Peptides/genetics , Software , Alternative Splicing , Exons , Humans , Models, Genetic , RNA Splice Sites
14.
J Proteome Res ; 16(6): 2231-2239, 2017 06 02.
Article in English | MEDLINE | ID: mdl-28452485

ABSTRACT

Proteogenomic searches are useful for novel peptide identification from tandem mass spectra. Usually, separate and multistage approaches are adopted to accurately control the false discovery rate (FDR) for proteogenomic search. Their performance on novel peptide identification has not been thoroughly evaluated, however, mainly due to the difficulty in confirming the existence of identified novel peptides. We simulated a proteogenomic search using a controlled, spike-in proteomic data set. After confirming that the results of the simulated proteogenomic search were similar to those of a real proteogenomic search using a human cell line data set, we evaluated the performance of six FDR control methods (global, separate, and multistage FDR estimation, each coupled to a target-decoy search and to a mixture model-based method) on novel peptide identification. The multistage approach showed the highest accuracy for FDR estimation. However, global and separate FDR estimation with the mixture model-based method showed higher sensitivities than others at the same true FDR. Furthermore, the mixture model-based method performed equally well when applied without or with a reduced set of decoy sequences. Considering different prior probabilities for novel and known protein identification, we recommend using mixture model-based methods with separate FDR estimation for sensitive and reliable identification of novel peptides from proteogenomic searches.


Subject(s)
Peptides/analysis , Proteogenomics/methods , Cell Line , Computer Simulation , False Positive Reactions , Humans , Methods , Models, Theoretical , Tandem Mass Spectrometry
15.
BMC Genomics ; 17(Suppl 13): 1031, 2016 12 22.
Article in English | MEDLINE | ID: mdl-28155652

ABSTRACT

BACKGROUND: Proteogenomics is a promising approach for various tasks ranging from gene annotation to cancer research. Databases for proteogenomic searches are often constructed by adding peptide sequences inferred from genomic or transcriptomic evidence to reference protein sequences. Such inflation of databases has the potential to identify novel peptides. However, it also raises concerns about sensitive and reliable peptide identification. Spurious peptides included in target databases may result in an underestimated false discovery rate (FDR). On the other hand, inflation of decoy databases could decrease the sensitivity of peptide identification due to the increased number of high-scoring random hits. Although several studies have addressed these issues, widely applicable guidelines for sensitive and reliable proteogenomic search have rarely been available. RESULTS: To systematically evaluate the effect of database inflation in proteogenomic searches, we constructed a variety of real and simulated proteogenomic databases for yeast and human tandem mass spectrometry (MS/MS) data, respectively. Against these databases, we tested two popular database search tools with various approaches to search result validation: the target-decoy search strategy (with and without a refined scoring metric) and a mixture model-based method. The effect of separate filtering of known and novel peptides was also examined. The results from real and simulated proteogenomic searches confirmed that separate filtering increases the sensitivity and reliability in proteogenomic search. However, no one method consistently identified the largest (or the smallest) number of novel peptides from real proteogenomic searches. CONCLUSIONS: We propose to use a set of search result validation methods with separate filtering, for sensitive and reliable identification of peptides in proteogenomic search.
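
Separate filtering of known and novel peptides can be sketched as running the target-decoy threshold independently within each class rather than once over the pooled matches. The sketch assumes every match, decoys included, carries a class label (for example, according to the part of the database it came from) and that the FDR is estimated as decoys divided by targets. The function names and the tuple layout are hypothetical.

```python
def accept_at_fdr(psms, threshold):
    """Keep target PSMs above the deepest score cutoff with decoys/targets <= threshold."""
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = best_cut = 0
    for i, (_, is_decoy) in enumerate(ranked, start=1):
        decoys += is_decoy
        targets += not is_decoy
        if decoys / max(targets, 1) <= threshold:
            best_cut = i
    return [p for p in ranked[:best_cut] if not p[1]]


def separate_filtering(psms, threshold=0.01):
    """Apply the FDR threshold within each peptide class independently.

    psms: list of (score, is_decoy, peptide_class) tuples.
    """
    by_class = {}
    for score, is_decoy, cls in psms:
        by_class.setdefault(cls, []).append((score, is_decoy))
    return {cls: accept_at_fdr(group, threshold) for cls, group in by_class.items()}


psms = [
    (10, False, "known"), (9, False, "known"), (8, False, "known"),
    (7, False, "known"), (3, True, "known"),
    (6, False, "novel"), (5, True, "novel"), (4, False, "novel"),
]
separate = separate_filtering(psms, threshold=0.25)
pooled = accept_at_fdr([(s, d) for s, d, _ in psms], threshold=0.25)
```

In this toy example the pooled filter accepts both novel targets because the many high-scoring known peptides dilute the decoy count, whereas the separate filter accepts only the novel target that outscores the novel decoy.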


Subject(s)
Databases, Genetic , Peptides/metabolism , Proteogenomics/methods , Humans , Peptides/chemistry , Reproducibility of Results , Search Engine , Sensitivity and Specificity , Tandem Mass Spectrometry , Yeasts/metabolism
16.
Bioinformatics ; 31(24): 4026-8, 2015 Dec 15.
Article in English | MEDLINE | ID: mdl-26315908

ABSTRACT

Peptide identification is an important problem in proteomics. One of the most popular scoring schemes for peptide identification is XCorr (cross-correlation). Since calculating XCorr is computationally intensive, considerable effort has been made to develop fast XCorr engines. However, the existing XCorr engines are not suitable for high-resolution tandem mass spectrometry (MS/MS) because they are either slow or require a specific type of CPU. We present a portable high-speed XCorr engine for high-resolution tandem mass spectrometry by developing a novel algorithm for calculating XCorr. The algorithm enables XCorr calculation 1.25-49 times faster than previous algorithms for 0.01 Da fragment tolerance. Furthermore, our engine is easily portable to any machine with different types of CPU because it is developed in C. Hence, our XCorr engine will expedite peptide identification by high-resolution tandem mass spectrometry. AVAILABILITY AND IMPLEMENTATION: Available at http://isa.hanyang.ac.kr/HiXCorr/HiXCorr.html.
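
As background for why XCorr is costly and how it is commonly accelerated, the sketch below computes one common formulation of the score on binned spectra: the dot product at zero offset minus the mean dot product over offsets from -75 to +75 bins (excluding zero). It also shows the well-known rearrangement in which the background is subtracted from the observed spectrum once, so that each candidate peptide needs only a single dot product. This is not the algorithm of the engine described above; the offset convention, the bin representation and the function names are assumptions of this sketch.

```python
def xcorr_direct(observed, theoretical, max_offset=75):
    """XCorr as zero-offset correlation minus the mean correlation at nonzero offsets."""
    if len(observed) != len(theoretical):
        raise ValueError("spectra must be binned to the same length")
    n = len(observed)

    def dot_at(offset):
        # Correlation of the theoretical spectrum with the shifted observed spectrum.
        return sum(
            t * observed[i + offset]
            for i, t in enumerate(theoretical)
            if 0 <= i + offset < n
        )

    offsets = [o for o in range(-max_offset, max_offset + 1) if o != 0]
    background = sum(dot_at(o) for o in offsets) / (2 * max_offset)
    return dot_at(0) - background


def preprocess(observed, max_offset=75):
    """Subtract the local background from each bin once per spectrum."""
    n = len(observed)
    processed = []
    for i in range(n):
        lo, hi = max(0, i - max_offset), min(n, i + max_offset + 1)
        neighbors = sum(observed[lo:hi]) - observed[i]
        processed.append(observed[i] - neighbors / (2 * max_offset))
    return processed


def xcorr_fast(processed, theoretical):
    """Single dot product against a preprocessed observed spectrum."""
    return sum(t * p for t, p in zip(theoretical, processed))


observed = [0, 1, 0, 3, 0, 2]
theoretical = [0, 1, 0, 1, 0, 0]
direct = xcorr_direct(observed, theoretical, max_offset=2)
fast = xcorr_fast(preprocess(observed, max_offset=2), theoretical)
```

Both routes give the same value; the preprocessing route pays the cost of the offset sum once per spectrum instead of once per candidate peptide. With a 0.01 Da fragment tolerance the number of bins grows sharply, which is the regime the abstract addresses.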


Subject(s)
Peptides/chemistry , Software , Tandem Mass Spectrometry/methods , Algorithms , Proteomics/methods
17.
Mass Spectrom Rev ; 34(2): 133-47, 2015.
Article in English | MEDLINE | ID: mdl-24889695

ABSTRACT

Post-translational modifications (PTMs) are critical to almost all aspects of complex processes of the cell. Identification of PTMs is one of the biggest challenges for proteomics, and there have been many computational studies for the analysis of PTMs from tandem mass spectrometry (MS/MS). Most early PTM identification studies have been performed by matching MS/MS data to protein databases, using database search tools, but they are prohibitively slow when a large number of PTMs is given as a search parameter. In this article, we present recent developments to search for more types of PTMs and to speed up the search, and discuss many computational issues and solutions in terms of identifying multiply modified peptides or searching for all possible modifications at once in unrestrictive mode. Apart from the most common type of PTMs involving covalent addition of functional groups to proteins, PTMs such as disulfide linkage require dedicated software for the analysis because they may involve cross-linking between two different parts of proteins. Finally, methods for identification of protein disulfide bonds are presented.


Subject(s)
Disulfides/analysis , Peptide Fragments/analysis , Protein Processing, Post-Translational , Proteins/metabolism , Software , Algorithms , Amino Acid Sequence , Databases, Protein , Disulfides/chemistry , Molecular Sequence Data , Oxidation-Reduction , Proteins/chemistry , Proteomics/instrumentation , Proteomics/methods , Tandem Mass Spectrometry
18.
J Proteome Res ; 14(7): 2784-91, 2015 Jul 02.
Article in English | MEDLINE | ID: mdl-26004133

ABSTRACT

Proteogenomics research has been using six-frame translation of the whole genome or amino acid exon graphs to overcome the limitations of reference protein sequence databases; however, six-frame translation is not suitable for annotating genes that span multiple exons, and amino acid exon graphs are not convenient for representing novel splice variants and exon skipping events between exons of incompatible reading frames. We propose a proteogenomic pipeline NextSearch (Nucleotide EXon-graph Transcriptome Search) that is based on a nucleotide exon graph. This pipeline consists of constructing a compact nucleotide exon graph that systematically incorporates novel splice variations and a search tool that identifies peptides by directly searching the nucleotide exon graph against tandem mass spectra. Because our exon graph stores nucleotide sequences, it can easily represent novel splice variations and exon skipping events between incompatible reading frame exons. Searching for peptide identification is performed against this nucleotide exon graph, without converting it into a protein sequence in FASTA format, achieving an order of magnitude reduction in the size of the sequence database storage. NextSearch outputs the proteome-genome/transcriptome mapping results in a general feature format (GFF) file, which can be visualized by public tools such as the UCSC Genome Browser.


Subject(s)
Exons , Information Storage and Retrieval , Mass Spectrometry/methods , Nucleotides/genetics , Base Sequence , Cell Line , Humans , Molecular Sequence Data , Nucleotides/chemistry
19.
Proteomics ; 14(23-24): 2742-9, 2014 Dec.
Article in English | MEDLINE | ID: mdl-25316439

ABSTRACT

In proteogenomic analysis, construction of a compact, customized database from mRNA-seq data and a sensitive search of both reference and customized databases are essential to accurately determine protein abundances and structural variations at the protein level. However, these tasks have not been systematically explored, but rather performed in an ad hoc fashion. Here, we present an effective method for constructing a compact database containing comprehensive sequences of sample-specific variants (single nucleotide variants, insertions/deletions, and stop-codon mutations) derived from Exome-seq and RNA-seq data. Despite this coverage, the database occupies less space because it stores variant peptides, not variant proteins. We also present an efficient search method for both customized and reference databases. Separate searches of the two databases increase the search time, and a unified search is less sensitive in identifying variant peptides due to the smaller size of the customized database, compared to the reference database, in the target-decoy setting. Our method searches the unified database once, but performs target-decoy validations separately. Experimental results show that our approach is as fast as the unified search and as sensitive as the separate searches. Our customized database includes mutation information in the headers of variant peptides, thereby facilitating the inspection of peptide-spectrum matches.


Subject(s)
Peptides/metabolism , Proteins/metabolism , Proteomics/methods , Databases, Protein , Mutation , Peptides/genetics , Proteins/genetics , Stomach Neoplasms/metabolism
20.
J Proteome Res ; 13(7): 3488-97, 2014 Jul 03.
Article in English | MEDLINE | ID: mdl-24918111

ABSTRACT

Isobaric tag-based quantification such as iTRAQ and TMT is a promising approach to mass spectrometry-based quantification in proteomics as it provides wide proteome coverage with greatly increased experimental throughput. However, it is known to suffer from inaccurate quantification and identification of a target peptide due to cofragmentation of multiple peptides, which likely leads to under-estimation of differentially expressed peptides (DEPs). A simple method of filtering out cofragmented spectra with less than 100% precursor isolation purity (PIP) would decrease the coverage of iTRAQ/TMT experiments. In order to estimate the impact of cofragmentation on quantification and identification of iTRAQ-labeled peptide samples, we generated multiplexed spectra with varying degrees of PIP by mixing the two MS/MS spectra of 100% PIP obtained in global proteome profiling experiments on gastric tumor-normal tissue pair proteomes labeled by 4-plex iTRAQ. Despite cofragmentation, the simulation experiments showed that more than 99% of multiplexed spectra with PIP greater than 80% were correctly identified by three different database search engines: MODa, MS-GF+, and Proteome Discoverer. Using the multiplexed spectra that have been correctly identified, we estimated the effect of cofragmentation on peptide quantification. In 74% of the multiplexed spectra, however, the cancer-to-normal expression ratio was compressed, and a fair number of spectra showed the "ratio inflation" phenomenon. On the basis of the estimated distribution of distortions on quantification, we were able to calculate cutoff values for DEP detection from cofragmented spectra, which were corrected according to a specific PIP and probability of type I (or type II) error. When we applied these corrected cutoff values to real cofragmented spectra with PIP larger than or equal to 70%, we were able to identify reliable DEPs by removing about 25% of DEPs, which are highly likely to be false positives. Our experimental results provide useful insight into the effect of cofragmentation on isobaric tag-based quantification methods. The simulation procedure as well as the corrected cutoff calculation method could be adopted for quantifying the effect of cofragmentation and reducing false positives (or false negatives) in the DEP identification with general quantification experiments based on isobaric labeling techniques.
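
The ratio compression described above can be illustrated with simple arithmetic. The sketch assumes a linear mixing model in which the reporter ion intensities of a cofragmented spectrum are a PIP-weighted average of the target peptide's reporters and a contaminating peptide's reporters. This model and the function names are assumptions made for illustration; they are not the simulation procedure used in the study.

```python
def mix_reporters(target, contaminant, pip):
    """Reporter intensities of a cofragmented spectrum under linear mixing.

    pip: precursor isolation purity as a fraction between 0 and 1, taken here
    to be the share of reporter signal contributed by the target peptide.
    """
    if not 0.0 <= pip <= 1.0:
        raise ValueError("pip must be between 0 and 1")
    if len(target) != len(contaminant):
        raise ValueError("reporter channel counts must match")
    return [pip * t + (1.0 - pip) * c for t, c in zip(target, contaminant)]


def channel_ratio(reporters, numerator=0, denominator=1):
    """Expression ratio between two reporter channels."""
    return reporters[numerator] / reporters[denominator]


target = [400.0, 100.0]        # true cancer-to-normal ratio of 4
contaminant = [100.0, 100.0]   # unchanged background peptide, ratio of 1
mixed = mix_reporters(target, contaminant, pip=0.8)
observed_ratio = channel_ratio(mixed)
```

At 80% purity the observed ratio drops from 4 to roughly 3.4, pulled toward the contaminant's ratio of 1. If the contaminant's ratio were more extreme than the target's, the same model would produce the "ratio inflation" the abstract mentions.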


Subject(s)
Peptide Fragments/chemistry , Proteome/chemistry , Amino Acid Sequence , Computer Simulation , Humans , Molecular Sequence Data , Peptide Mapping , Proteolysis , Proteome/metabolism , Proteomics , Stomach Neoplasms/metabolism , Tandem Mass Spectrometry