ABSTRACT
MOTIVATION: Confident deconvolution of proteomic spectra is critical for several applications such as de novo sequencing, cross-linking mass spectrometry and handling chimeric mass spectra. RESULTS: In general, all deconvolution algorithms may eventually report mass peaks that are not compatible with the chemical formula of any peptide. We show how to remove these artifacts by considering their mass defects. We introduce Y.A.D.A. 3.0, a fast deconvolution algorithm that can remove peaks with unacceptable mass defects. Our approach is effective for polypeptides smaller than 10 kDa, and its essence can be easily incorporated into any deconvolution algorithm. AVAILABILITY AND IMPLEMENTATION: Y.A.D.A. 3.0 is freely available for academic use at http://patternlabforproteomics.org/yada3. SUPPLEMENTARY INFORMATION: Supplementary information is available at Bioinformatics online.
Subject(s)
Algorithms , Proteomics , Peptides , Mass Spectrometry/methods , Software
ABSTRACT
MOTIVATION: We present the first tool for unbiased quality control of top-down proteomics datasets. Our tool can select high-quality top-down proteomics spectra, serve as a gateway for building top-down spectral libraries and, ultimately, improve identification rates. RESULTS: We demonstrate that a twofold increase in identification rate may be achievable for two E. coli top-down proteomics datasets. AVAILABILITY AND IMPLEMENTATION: http://patternlabforproteomics.org/tdgc, freely available for academic use. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Proteomics , Escherichia coli , Software , Tandem Mass Spectrometry
ABSTRACT
BACKGROUND: Worldwide, breast cancer is the main cause of cancer mortality in women. Most cases originate in mammary ductal cells that produce the nipple aspirate fluid (NAF). In cancer patients, this secretome contains proteins associated with the tumor microenvironment. NAF studies are challenging because of inter-individual variability. We introduced a paired-proteomic shotgun strategy that relies on NAF analysis from both breasts of patients with unilateral breast cancer and extended PatternLab for Proteomics software to take advantage of this setup. METHODS: The software is based on a peptide-centric approach and uses the binomial distribution to attribute to each peptide a probability of being linked to the disease; these probabilities are propagated to a final protein p-value according to Stouffer's Z-score method. RESULTS: A total of 1227 proteins were identified and quantified, of which 87 were differentially abundant, being mainly involved in glycolysis (Warburg effect) and immune system activation (activated stroma). Additionally, in the estrogen receptor-positive subgroup, proteins related to the regulation of insulin-like growth factor transport and platelet degranulation displayed higher abundance, confirming the presence of a proliferative microenvironment. CONCLUSIONS: We debuted a differential bioinformatics workflow for the proteomic analysis of NAF, validating this secretome as a treasure-trove for studying a paired-organ cancer type.
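The peptide-to-protein propagation described in METHODS can be illustrated with a minimal sketch. It is not the PatternLab implementation: the function names, the null probability of 0.5 (a peptide is equally likely to be more abundant in either breast) and the equal weights are assumptions made for illustration only.

```python
from math import comb, sqrt
from statistics import NormalDist


def binomial_tail(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p).

    Hypothetical use: a peptide was more abundant in the affected breast
    in k of n patients; p = 0.5 is the assumed null probability.
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def stouffer(p_values, weights=None):
    """Combine one-sided p-values into one p-value (Stouffer's Z-score method)."""
    norm = NormalDist()
    if weights is None:
        weights = [1.0] * len(p_values)  # assumption: all peptides weigh the same
    eps = 1e-12  # clamp so that inv_cdf never receives exactly 0 or 1
    z_scores = [norm.inv_cdf(1 - min(max(p, eps), 1 - eps)) for p in p_values]
    combined = sum(w * z for w, z in zip(weights, z_scores))
    combined /= sqrt(sum(w * w for w in weights))
    return 1 - norm.cdf(combined)


# Hypothetical protein with three peptides, each given as (k, n).
peptide_p = [binomial_tail(k, n) for k, n in [(9, 10), (8, 10), (7, 10)]]
protein_p = stouffer(peptide_p)
```

Each peptide contributes one Z-score, so a protein supported by several moderately significant peptides can reach a smaller combined p-value than any single peptide alone.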
Subject(s)
Biomarkers, Tumor/metabolism , Breast Neoplasms/metabolism , Breast Neoplasms/pathology , Nipple Aspirate Fluid/metabolism , Proteome/analysis , Proteomics/methods , Tumor Microenvironment , Aged , Aged, 80 and over , Case-Control Studies , Female , Follow-Up Studies , Humans , Middle Aged , Prognosis , Workflow
ABSTRACT
MOTIVATION: Around 75% of all mass spectra remain unidentified by widely adopted proteomic strategies. We present DiagnoProt, an integrated computational environment that can efficiently cluster millions of spectra and use machine learning to shortlist high-quality unidentified mass spectra that are discriminative of different biological conditions. RESULTS: We exemplify the use of DiagnoProt by shortlisting 4366 high-quality unidentified tandem mass spectra that are discriminative of different types of the Aspergillus fungus. AVAILABILITY AND IMPLEMENTATION: DiagnoProt, a demonstration video and a user tutorial are available at http://patternlabforproteomics.org/diagnoprot . CONTACT: andrerfsilva@gmail.com or paulo@pcarvalho.com. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Machine Learning , Proteomics/methods , Sequence Analysis, Protein/methods , Software , Tandem Mass Spectrometry/methods , Aspergillus/metabolism , Fungal Proteins/analysis
ABSTRACT
Analyzing the information content of DNA, though holding the promise to help quantify how the processes of evolution have led to information gain throughout the ages, has remained an elusive goal. Paradoxically, one of the main reasons for this has been precisely the great diversity of life on the planet: if on the one hand this diversity is a rich source of data for information-content analysis, on the other hand there is so much variation as to make the task unmanageable. During the past decade or so, however, succinct fragments of the COI mitochondrial gene, which is present in all animal phyla and in a few others, have been shown to be useful for species identification through DNA barcoding. A few million such fragments are now publicly available through the BOLD systems initiative, thus providing an unprecedented opportunity for relatively comprehensive information-theoretic analyses of DNA to be attempted. Here we show how a generalized form of total correlation can yield distinctive information-theoretic descriptors of the phyla represented in those fragments. In order to illustrate the potential of this analysis to provide new insight into the evolution of species, we performed principal component analysis on standardized versions of the said descriptors for 23 phyla. Surprisingly, we found that, though based solely on the species represented in the data, the first principal component correlates strongly with the natural logarithm of the number of all known living species for those phyla. The new descriptors thus constitute clear information-theoretic signatures of the processes whereby evolution has given rise to current biodiversity, which suggests their potential usefulness in further related studies.
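For orientation, the standard (non-generalized) total correlation of a set of aligned sequences is the sum of the per-position entropies minus the joint entropy of the whole sequences. The sketch below computes that plain quantity only; the generalized form used in the paper is not reproduced here, and the function names and toy sequences are illustrative assumptions.

```python
from collections import Counter
from math import log2


def entropy(items):
    """Shannon entropy, in bits, of the empirical distribution of items."""
    counts = Counter(items)
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())


def total_correlation(sequences):
    """Sum of per-position entropies minus joint entropy, in bits.

    sequences: equal-length strings, e.g. aligned barcode fragments.
    """
    sequences = list(sequences)
    columns = list(zip(*sequences))  # one tuple of symbols per position
    marginal = sum(entropy(column) for column in columns)
    joint = entropy(sequences)  # each whole sequence is one joint outcome
    return marginal - joint


# Toy data: two positions that always agree share exactly one bit.
print(total_correlation(["AA", "CC"]))
```

A value of zero means the positions vary independently of one another in the sample; larger values indicate stronger dependence among positions.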
Subject(s)
Biodiversity , DNA Barcoding, Taxonomic/methods , Animals , Biological Evolution , DNA, Mitochondrial/genetics , Electron Transport Complex IV/genetics , Phylogeny , Principal Component Analysis
ABSTRACT
Peptide spectrum matching is the current gold standard for protein identification via mass-spectrometry-based proteomics. Peptide spectrum matching compares experimental mass spectra against theoretical spectra generated from a protein sequence database to perform identification, but protein sequences not present in a database cannot be identified unless their sequences are in part conserved. The alternative approach, de novo sequencing, can make it possible to infer a peptide sequence directly from a mass spectrum, but interpreting long lists of peptide sequences resulting from large-scale experiments is not trivial. With this as motivation, PepExplorer was developed to use rigorous pattern recognition to assemble a list of homologue proteins using de novo sequencing data coupled to sequence alignment to allow biological interpretation of the data. PepExplorer can read the output of various widely adopted de novo sequencing tools and converge to a list of proteins with a global false-discovery rate. To this end, it employs a radial basis function neural network that considers precursor charge states, de novo sequencing scores, peptide lengths, and alignment scores to select similar protein candidates, from a target-decoy database, usually obtained from phylogenetically related species. Alignments are performed using a modified Smith-Waterman algorithm tailored for the task at hand. We verified the effectiveness of our approach using a reference set of identifications generated by ProLuCID when searching for Pyrococcus furiosus mass spectra on the corresponding NCBI RefSeq database. We then modified the sequence database by swapping amino acids until ProLuCID was no longer capable of identifying any proteins. By searching the mass spectra using PepExplorer on the modified database, we were able to recover most of the identifications at a 1% false-discovery rate. 
Finally, we employed PepExplorer to disclose a comprehensive proteomic assessment of the Bothrops jararaca plasma, a known biological source of natural inhibitors of snake toxins. PepExplorer is integrated into the PatternLab for Proteomics environment, which makes available various tools for downstream data analysis, including resources for quantitative and differential proteomics.
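As a point of reference for the alignment step, the textbook Smith-Waterman local alignment can be sketched as follows. This is the unmodified algorithm with a linear gap penalty, not the tailored variant used by PepExplorer; the scoring values (match 2, mismatch -1, gap -2) are arbitrary assumptions rather than a substitution matrix.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    def sub(i, j):
        # Substitution score for a[i-1] against b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the matrix; scores never drop below zero (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub(i, j),
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# A de novo sequence tag located inside a longer, hypothetical database sequence.
print(smith_waterman("PEPTIDE", "XXPEPTIDEXX"))  # (14, 'PEPTIDE', 'PEPTIDE')
```

Because the score is local, flanking residues that do not match carry no penalty, which is why this family of algorithms suits short de novo tags aligned against full protein sequences.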
Subject(s)
Algorithms , Databases, Protein , Sequence Analysis, Protein/methods , Amino Acid Sequence , Animals , Archaeal Proteins/metabolism , Bothrops/metabolism , Mass Spectrometry , Plasma/metabolism , Proteomics , Pyrococcus furiosus/metabolism , Sequence Alignment
ABSTRACT
Accessing localized proteomic profiles has emerged as a fundamental strategy to understand the biology of diseases, as recently demonstrated, for example, in the context of determining cancer resection margins with improved precision. Here, we analyze a gastric cancer biopsy sectioned into 10 parts, each one subjected to MudPIT analysis. We introduce a software tool, named Shotgun Imaging Analyzer and inspired by MALDI imaging, to enable the overlaying of a protein's expression heat map on a tissue picture. The software is tightly integrated with the NeXtProt database, so it enables the browsing of identified proteins according to chromosomes, quickly listing human proteins never identified by mass spectrometry (i.e., the so-called missing proteins), and the automatic search for proteins that are more expressed over a specific region of interest on the biopsy, all of which constitute goals that are clearly well-aligned with those of the C-HPP. Our software has been able to highlight an intense expression of proteins previously known to be correlated with cancers (e.g., glutathione S-transferase Mu 3), and in particular, we draw attention to Gastrokine-2, a "missing protein" identified in this work, whose expression allowed us to clearly delineate the tumoral region from the "healthy" one with our approach. Data are available via ProteomeXchange with identifier PXD000584.
Subject(s)
Neoplasm Proteins/metabolism , Proteomics , Stomach Neoplasms/metabolism , Biopsy , Chromatography, Liquid , Humans , Spectrometry, Mass, Matrix-Assisted Laser Desorption-Ionization , Stomach Neoplasms/pathology , Tandem Mass Spectrometry
ABSTRACT
SUMMARY: Protein identification by mass spectrometry is commonly accomplished using a peptide sequence matching search algorithm, whose sensitivity varies inversely with the size of the sequence database and the number of post-translational modifications considered. We present the Spectrum Identification Machine, a peptide sequence matching tool that capitalizes on the high-intensity b1-fragment ion of tandem mass spectra of peptides coupled in solution with phenylisothiocyanate to confidently sequence the first amino acid and ultimately reduce the search space. We demonstrate that in complex search spaces, a gain of some 120% in sensitivity can be achieved. AVAILABILITY: All data generated and the software are freely available for academic use at http://proteomics.fiocruz.br/software/sim. CONTACT: paulo@pcarvalho.com SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Algorithms , Escherichia coli Proteins/analysis , Escherichia coli/chemistry , Peptides/analysis , Proteomics/methods , Amino Acid Sequence , Escherichia coli Proteins/chemistry , Mass Spectrometry , Peptides/chemistry , Protein Processing, Post-Translational , Software
ABSTRACT
UNLABELLED: We present an updated version of the TFold software for pinpointing differentially expressed proteins in shotgun proteomics experiments. Given an FDR bound, the updated approach uses a theoretical FDR estimator to maximize the number of identifications that satisfy both a fold-change cutoff that varies with the t-test P-value as a power law and a stringency criterion that aims to detect lowly abundant proteins. The new version has yielded significant improvements in sensitivity over the previous one. AVAILABILITY: Freely available for academic use at http://pcarvalho.com/patternlab.
Subject(s)
Proteins/analysis , Proteomics/methods , Software , Algorithms , Cell Line, Tumor , Computational Biology/methods , Data Interpretation, Statistical , Humans , Sequence Analysis, Protein , User-Computer Interface
ABSTRACT
In cluster analysis, a common first step is to scale the data aiming to better partition them into clusters. Even though many different techniques have throughout many years been introduced to this end, it is probably fair to say that the workhorse in this preprocessing phase has been to divide the data by the standard deviation along each dimension. Like division by the standard deviation, the great majority of scaling techniques can be said to have roots in some sort of statistical take on the data. Here we explore the use of multidimensional shapes of data, aiming to obtain scaling factors for use prior to clustering by some method, like k-means, that makes explicit use of distances between samples. We borrow from the field of cosmology and related areas the recently introduced notion of shape complexity, which in the variant we use is a relatively simple, data-dependent nonlinear function that we show can be used to help with the determination of appropriate scaling factors. Focusing on what might be called "midrange" distances, we formulate a constrained nonlinear programming problem and use it to produce candidate scaling-factor sets that can be sifted on the basis of further considerations of the data, say via expert knowledge. We give results on some iconic data sets, highlighting the strengths and potential weaknesses of the new approach. These results are generally positive across all the data sets used.
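The baseline preprocessing mentioned above, division by the standard deviation along each dimension, can be sketched in a few lines. This shows only that conventional baseline, not the shape-complexity method the abstract proposes; the function name and the handling of constant dimensions are assumptions.

```python
from statistics import pstdev


def scale_by_std(samples):
    """Divide every dimension by its (population) standard deviation.

    samples: list of equal-length numeric rows, one row per sample.
    Constant dimensions are left unchanged to avoid division by zero
    (an assumption made for this sketch).
    """
    columns = list(zip(*samples))
    stds = [pstdev(column) for column in columns]
    return [
        [x / s if s > 0 else x for x, s in zip(row, stds)]
        for row in samples
    ]


# Two dimensions on very different scales become comparable after scaling,
# so neither dominates the Euclidean distances used by k-means.
data = [[1.0, 100.0], [3.0, 300.0]]
scaled = scale_by_std(data)
```

Any distance-based method, such as k-means, would then be run on `scaled` instead of `data`.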
Subject(s)
Algorithms , Cluster Analysis
ABSTRACT
Complex protein mixtures typically generate many tandem mass spectra produced by different peptides coisolated in the gas phase. Widely adopted proteomic data analysis environments usually fail to identify most of these spectra, succeeding at best in identifying only one of the multiple cofragmenting peptides. We present PatternLab V (PLV), an updated version of PatternLab that integrates the YADA 3 deconvolution algorithm to handle such cases efficiently. In general, we expect an increase of 10% in spectral identifications when dealing with complex proteomic samples. PLV is freely available at http://patternlabforproteomics.org.
Subject(s)
Peptides , Proteomics , Peptides/analysis , Proteins/analysis , Algorithms , Tandem Mass Spectrometry , Databases, Protein , Software
ABSTRACT
MOTIVATION: There are several well-established paradigms for identifying and pinpointing discriminative peptides/proteins using shotgun proteomic data; examples are peptide-spectrum matching, de novo sequencing, open searches, and even hybrid approaches. Such an arsenal of complementary paradigms can provide deep data coverage, albeit some unidentified discriminative peptides remain. RESULTS: We present DiagnoMass, a software tool that groups similar spectra into spectral clusters and then shortlists those clusters that are discriminative for biological conditions. DiagnoMass then communicates with proteomic tools to attempt the identification of such clusters. We demonstrate the effectiveness of DiagnoMass by analyzing proteomic data from Escherichia coli, Salmonella, and Shigella, listing many high-quality discriminative spectral clusters that had thus far remained unidentified by widely adopted proteomic tools. DiagnoMass can also classify proteomic profiles. We anticipate the use of DiagnoMass as a vital tool for pinpointing biomarkers. AVAILABILITY: DiagnoMass and related documentation, including a usage protocol, are available at http://www.diagnomass.com.
Subject(s)
Proteomics , Software , Proteomics/methods , Proteins/chemistry , Peptides/chemistry , Escherichia coli , Algorithms , Databases, Protein
ABSTRACT
The search engine processor (SEPro) is a tool for filtering, organizing, sharing, and displaying peptide spectrum matches. It employs a novel three-tier Bayesian approach that uses layers of spectrum, peptide, and protein logic to lead the data to converge to a single list of reliable protein identifications. SEPro is integrated into the PatternLab for proteomics environment, where an arsenal of tools for analyzing shotgun proteomic data is provided. By using the semi-labeled decoy approach for benchmarking, we show that SEPro significantly outperforms a commercially available competitor.
Subject(s)
Algorithms , Databases, Protein , Peptide Fragments/chemistry , Proteomics/methods , Software , Animals , Bayes Theorem , Chromatography, Liquid , Database Management Systems , Mice , Proteins/chemistry , Proteins/classification , Tandem Mass Spectrometry , User-Computer Interface
ABSTRACT
A strategy for treating cancer is to surgically remove the tumor together with a portion of apparently healthy tissue surrounding it, the so-called "resection margin", to minimize recurrence. Here, we investigate whether the proteomic profiles from biopsies of gastric cancer resection margins are indeed more similar to those from healthy tissue than to those from cancer biopsies. To this end, we analyzed biopsies using an offline MudPIT shotgun proteomic approach and performed label-free quantitation through a distributed normalized spectral abundance factor approach adapted for extracted ion chromatograms (XICs). A multidimensional scaling analysis revealed that these tissue types are very distinct from one another. The resection margin presented several proteins previously correlated with cancer, but also other overexpressed proteins that may be related to tumor nourishment and metastasis, such as collagen alpha-1, ceruloplasmin, calpastatin, and E-cadherin. We argue that the resection margin plays a key role in Paget's "soil to seed" hypothesis, that is, that cancer cells require a special microenvironment to nourish and that understanding it could ultimately lead to more effective treatments.
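To make the quantitation step concrete, the plain normalized spectral abundance factor (NSAF) divides each protein's signal by its length and then normalizes across all proteins in the run. The sketch below shows only this basic form; the distributed variant and its adaptation to XICs used in the study are not reproduced, and the function name and input layout are assumptions.

```python
def nsaf(signal, lengths):
    """Plain NSAF: (signal / length) for each protein, normalized to sum to 1.

    signal:  dict mapping protein id to spectral count (or, by assumption
             here, a summed XIC intensity).
    lengths: dict mapping protein id to sequence length in residues.
    """
    saf = {protein: signal[protein] / lengths[protein] for protein in signal}
    total = sum(saf.values())
    if total == 0:
        raise ValueError("no signal to normalize")
    return {protein: value / total for protein, value in saf.items()}


# Hypothetical example: equal signal, but protein B is twice as long,
# so it receives half the abundance factor of protein A.
values = nsaf({"A": 10, "B": 10}, {"A": 100, "B": 200})
```

Normalizing by length compensates for longer proteins yielding more peptides, and normalizing by the run total makes values comparable between biopsies.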
Subject(s)
Biomarkers, Tumor/analysis , Proteome/analysis , Software , Stomach Neoplasms/metabolism , Biomarkers, Tumor/metabolism , Biopsy , Cadherins/metabolism , Case-Control Studies , Ceruloplasmin/metabolism , Chromatography, Ion Exchange/methods , Collagen Type XI/metabolism , Databases, Protein , Female , Humans , Male , Neoplasm Metastasis/diagnosis , Neoplasm Proteins/metabolism , Prognosis , Proteomics/methods , Pyloric Antrum/metabolism , Pyloric Antrum/pathology , Stomach Neoplasms/diagnosis , Stomach Neoplasms/pathology
ABSTRACT
SUMMARY: We present an approach to statistically pinpoint differentially expressed proteins that have quantitation values near the quantitation threshold and are not identified in all replicates (marginal cases). Our method uses a Bayesian strategy to combine parametric statistics with an empirical distribution built from the reproducibility quality of the technical replicates. AVAILABILITY: The software is freely available for academic use at http://pcarvalho.com/patternlab.
Subject(s)
Proteins/metabolism , Proteomics/methods , Bayes Theorem , Software
ABSTRACT
A quasispecies is a set of interrelated genotypes that have reached a stationary state while evolving according to the usual Darwinian principles of selection and mutation. Quasispecies studies invariably assume that it is possible for any genotype to mutate into any other, but recent findings indicate that this assumption is not necessarily true. Here we revisit the traditional quasispecies theory by adopting a network structure to constrain the occurrence of mutations. Such structure is governed by a random-graph model, whose single parameter (a probability p) controls both the graph's density and the dynamics of mutation. We contribute two further modifications to the theory, one to account for the fact that different loci in a genotype may be differently susceptible to the occurrence of mutations, the other to allow for a more plausible description of the transition from adaptation to degeneracy of the quasispecies as p is increased. We give analytical and simulation results for the usual case of binary genotypes, assuming the fitness landscape in which a genotype's fitness decays exponentially with its Hamming distance to the wild type. These results support the theory's assertions regarding the adaptation of the quasispecies to the fitness landscape and also its possible demise as a function of p.
Subject(s)
Evolution, Molecular , Models, Biological
ABSTRACT
Shotgun proteomics aims to identify and quantify the thousands of proteins in complex mixtures such as cell and tissue lysates and biological fluids. This approach uses liquid chromatography coupled with tandem mass spectrometry and typically generates hundreds of thousands of mass spectra that require specialized computational environments for data analysis. PatternLab for proteomics is a unified computational environment for analyzing shotgun proteomic data. PatternLab V (PLV) is the most comprehensive and crucial update so far, the result of intensive interaction with the proteomics community over several years. All PLV modules have been optimized and its graphical user interface has been completely updated for improved user experience. Major improvements were made to all aspects of the software, ranging from boosting the number of protein identifications to faster extraction of ion chromatograms. PLV provides modules for preparing sequence databases, protein identification, statistical filtering and in-depth result browsing for both labeled and label-free quantitation. The PepExplorer module can even pinpoint de novo sequenced peptides not already present in the database. PLV is of broad applicability and therefore suitable for challenging experimental setups, such as time-course experiments and data handling from unsequenced organisms. PLV interfaces with widely adopted software and community initiatives, e.g., Comet, Skyline, PEAKS and PRIDE. It is freely available at http://www.patternlabforproteomics.org .
Subject(s)
Proteomics , Software , Databases, Protein , Proteins/chemistry , Proteomics/methods , Tandem Mass Spectrometry
ABSTRACT
The decoy-database approach is currently the gold standard for assessing the confidence of identifications in shotgun proteomic experiments. Here, we demonstrate that what might appear to be a good result under the decoy-database approach for a given false-discovery rate could be, in fact, the product of overfitting. This problem has been overlooked until now and could lead to obtaining boosted identification numbers whose reliability does not correspond to the expected false-discovery rate. To overcome this, we are introducing a modified version of the method, termed a semi-labeled decoy approach, which enables the statistical determination of an overfitted result.
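For readers unfamiliar with the baseline being criticized, a conventional decoy-database FDR estimate can be sketched as below. It uses one common convention (decoy hits divided by target hits above a score threshold); the semi-labeled decoy approach itself is not shown, and the function name, input layout and scores are illustrative assumptions.

```python
def decoy_fdr(psms, threshold):
    """Estimate the FDR among matches scoring at or above threshold.

    psms: list of (score, is_decoy) pairs, where is_decoy marks a match
          to a decoy (e.g. reversed) sequence.
    Convention assumed here: FDR = decoy hits / target hits.
    """
    accepted = [(score, is_decoy) for score, is_decoy in psms if score >= threshold]
    decoys = sum(1 for _, is_decoy in accepted if is_decoy)
    targets = len(accepted) - decoys
    return decoys / targets if targets else 0.0


# Hypothetical scored matches.
psms = [(10, False), (9, False), (8, True), (7, False), (6, True)]
print(decoy_fdr(psms, 9))  # 0.0
```

Tuning a classifier or threshold against these same decoys until the estimate looks good is the kind of procedure that can overfit, which is the concern the abstract raises.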
Subject(s)
Computational Biology , Proteomics/standards , Drug Discovery/standards
ABSTRACT
SUMMARY: XDIA is a computational strategy for analyzing multiplexed spectra acquired using electron transfer dissociation and collision-activated dissociation; it significantly increases identified spectra (approximately 250%) and unique peptides (approximately 30%) when compared with the data-dependent ETCaD analysis on middle-down, single-phase shotgun proteomic analysis. Increasing identified spectra and peptides improves quantitation statistics confidence and protein coverage, respectively. AVAILABILITY: The software and data produced in this work are freely available for academic use at http://fields.scripps.edu/XDIA CONTACT: paulo@pcarvalho.com SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Proteomics/methods , Software , Algorithms , Databases, Factual
ABSTRACT
Bacterial quorum sensing is the communication that takes place between bacteria as they secrete certain molecules into the intercellular medium that later get absorbed by the secreting cells themselves and by others. Depending on cell density, this uptake has the potential to alter gene expression and thereby affect global properties of the community. We consider the case of multiple bacterial species coexisting, referring to each one of them as a genotype and adopting the usual denomination of the molecules they collectively secrete as public goods. A crucial problem in this setting is characterizing the coevolution of genotypes as some of them secrete public goods (and pay the associated metabolic costs) while others do not but may nevertheless benefit from the available public goods. We introduce a network model to describe genotype interaction and evolution when genotype fitness depends on the production and uptake of public goods. The model comprises a random graph to summarize the possible evolutionary pathways the genotypes may take as they interact genetically with one another, and a system of coupled differential equations to characterize the behavior of genotype abundance in time. We study some simple variations of the model analytically and more complex variations computationally. Our results point to a simple trade-off affecting the long-term survival of those genotypes that do produce public goods. This trade-off involves, on the producer side, the impact of producing and that of absorbing the public good. On the nonproducer side, it involves the impact of absorbing the public good as well, now compounded by the molecular compatibility between the producer and the nonproducer. Depending on how these factors turn out, producers may or may not survive.