1.
J Am Soc Mass Spectrom ; 35(6): 1138-1155, 2024 Jun 05.
Article in English | MEDLINE | ID: mdl-38740383

ABSTRACT

Having fast, accurate, and broad spectrum methods for the identification of microorganisms is of paramount importance to public health, research, and safety. Bottom-up mass spectrometer-based proteomics has emerged as an effective tool for the accurate identification of microorganisms from microbial isolates. However, one major hurdle that limits the deployment of this tool for routine clinical diagnosis, and other areas of research such as culturomics, is the instrument time required for the mass spectrometer to analyze a single sample, which can take ∼1 h per sample, when using mass spectrometers that are presently used in most institutes. To address this issue, in this study, we employed, for the first time, tandem mass tags (TMTs) in multiplex identifications of microorganisms from multiple TMT-labeled samples in one MS/MS experiment. A difficulty encountered when using TMT labeling is the presence of interference in the measured intensities of TMT reporter ions. To correct for interference, we employed in the proposed method a modified version of the expectation maximization (EM) algorithm that redistributes the signal from ion interference back to the correct TMT-labeled samples. We have evaluated the sensitivity and specificity of the proposed method using 94 MS/MS experiments (covering a broad range of protein concentration ratios across TMT-labeled channels and experimental parameters), containing a total of 1931 true positive TMT-labeled channels and 317 true negative TMT-labeled channels. The results of the evaluation show that the proposed method has an identification sensitivity of 93-97% and a specificity of 100% at the species level. 
Furthermore, as a proof of concept, using an in-house-generated data set composed of some of the most common urinary tract pathogens, we demonstrated that by using the proposed method the mass spectrometer time required per sample, using a 1 h LC-MS/MS run, can be reduced to 10 and 6 min when samples are labeled with TMT-6 and TMT-10, respectively. The proposed method can also be used along with Orbitrap mass spectrometers that have faster MS/MS acquisition rates, like the recently released Orbitrap Astral mass spectrometer, to further reduce the mass spectrometer time required per sample.
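
The per-sample times quoted in this abstract follow from dividing one LC-MS/MS run across the multiplexed channels. A minimal arithmetic sketch, assuming every TMT channel carries exactly one sample and ignoring labeling and sample-preparation time (the function name is illustrative and not part of the published method):

```python
def minutes_per_sample(run_minutes: float, channels: int) -> float:
    """Instrument time attributable to each sample when `channels`
    TMT-labeled samples share a single LC-MS/MS run."""
    if channels < 1:
        raise ValueError("channels must be at least 1")
    return run_minutes / channels


# A 1 h run used for an unlabeled sample, TMT-6, and TMT-10.
for plex in (1, 6, 10):
    print(plex, minutes_per_sample(60, plex))
```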


Subject(s)
Algorithms , Proteomics , Tandem Mass Spectrometry , Tandem Mass Spectrometry/methods , Proteomics/methods , Humans , Bacteria/isolation & purification , Bacteria/chemistry , Bacterial Proteins/analysis , Bacterial Proteins/chemistry , Bacterial Proteins/isolation & purification
2.
J Comput Biol ; 31(2): 175-178, 2024 02.
Article in English | MEDLINE | ID: mdl-38301204

ABSTRACT

Although many user-friendly workflows exist for identifications of peptides and proteins in mass-spectrometry-based proteomics, there is a need for easy-to-use, fast, and accurate workflows for identifications of microorganisms, antimicrobial-resistant proteins, and biomass estimation. Identification of microorganisms is a computationally demanding task that requires querying thousands of MS/MS spectra in a database containing thousands to tens of thousands of microorganisms. Existing software cannot handle such a task in a time-efficient manner, taking hours to process a single MS/MS experiment. Another paramount factor to consider is the necessity of accurate statistical significance to properly control the proportion of false discoveries among the identified microorganisms and antimicrobial-resistant proteins, and to provide robust biomass estimation. Recently, we have developed the Microorganism Classification and Identification (MiCId) workflow, which assigns accurate statistical significance to identified microorganisms, antimicrobial-resistant proteins, and biomass estimation. MiCId's workflow is also computationally efficient, taking about 6-17 minutes to process a tandem mass-spectrometry (MS/MS) experiment using computer resources that are available in most laptop and desktop computers, making it a portable workflow. To make data analysis accessible to a broader range of users, beyond users familiar with the Linux environment, we have developed a graphical user interface (GUI) for MiCId's workflow. The GUI brings to users all the functionality of MiCId's workflow in a friendly interface along with tools for data analysis, visualization, and exporting results.


Subject(s)
Anti-Infective Agents , Tandem Mass Spectrometry , Tandem Mass Spectrometry/methods , Workflow , Software , Proteins
3.
J Am Soc Mass Spectrom ; 33(6): 917-931, 2022 Jun 01.
Article in English | MEDLINE | ID: mdl-35500907

ABSTRACT

Fast and accurate identifications of pathogenic bacteria along with their associated antibiotic resistance proteins are of paramount importance for patient treatments and public health. To meet this goal from the mass spectrometry aspect, we have augmented the previously published Microorganism Classification and Identification (MiCId) workflow for this capability. To evaluate the performance of this augmented workflow, we have used MS/MS datafiles from samples of 10 antibiotic resistance bacterial strains belonging to three different species: Escherichia coli, Klebsiella pneumoniae, and Pseudomonas aeruginosa. The evaluation shows that MiCId's workflow has a sensitivity value around 85% (with a lower bound at about 72%) and a precision greater than 95% in identifying antibiotic resistance proteins. In addition to having high sensitivity and precision, MiCId's workflow is fast and portable, making it a valuable tool for rapid identifications of bacteria as well as detection of their antibiotic resistance proteins. It performs microorganismal identifications, protein identifications, sample biomass estimates, and antibiotic resistance protein identifications in 6-17 min per MS/MS sample using computing resources that are available in most desktop and laptop computers. We have also demonstrated other use of MiCId's workflow. Using MS/MS data sets from samples of two bacterial clonal isolates, one being antibiotic-sensitive while the other being multidrug-resistant, we applied MiCId's workflow to investigate possible mechanisms of antibiotic resistance in these pathogenic bacteria; the results showed that MiCId's conclusions agree with the published study. The new version of MiCId (v.07.01.2021) is freely available for download at https://www.ncbi.nlm.nih.gov/CBBresearch/Yu/downloads.html.


Subject(s)
Proteomics , Tandem Mass Spectrometry , Anti-Bacterial Agents/pharmacology , Bacteria/chemistry , Drug Resistance, Bacterial , Drug Resistance, Microbial , Escherichia coli , Humans , Proteomics/methods , Pseudomonas aeruginosa , Tandem Mass Spectrometry/methods , Workflow
4.
Front Cell Infect Microbiol ; 11: 634215, 2021.
Article in English | MEDLINE | ID: mdl-34381737

ABSTRACT

Bloodstream infections (BSIs), the presence of microorganisms in blood, are potentially serious conditions that can quickly develop into sepsis and life-threatening situations. When assessing proper treatment, rapid diagnosis is the key; besides clinical judgement performed by attending physicians, supporting microbiological tests typically are performed, often requiring microbial isolation and culturing steps, which increases the time required for confirming positive cases of BSI. The additional waiting time forces physicians to prescribe broad-spectrum antibiotics and empirically based treatments, before determining the precise cause of the disease. Thus, alternative and more rapid cultivation-independent methods are needed to improve clinical diagnostics, supporting prompt and accurate treatment and reducing the development of antibiotic resistance. In this study, a culture-independent workflow for pathogen detection and identification in blood samples was developed, using peptide biomarkers and applying bottom-up proteomics analyses, i.e., so-called "proteotyping". To demonstrate the feasibility of detection of blood infectious pathogens, using proteotyping, Escherichia coli and Staphylococcus aureus were included in the study, as the most prominent bacterial causes of bacteremia and sepsis, as well as Candida albicans, one of the most prominent causes of fungemia. Model systems including spiked negative blood samples, as well as positive blood cultures, without further culturing steps, were investigated. Furthermore, an experiment designed to determine the incubation time needed for correct identification of the infectious pathogens in blood cultures was performed. The results for the spiked negative blood samples showed that proteotyping was 100- to 1,000-fold more sensitive, in comparison with the MALDI-TOF MS-based approach. Furthermore, in the analyses of ten positive blood cultures each of E. coli and S. aureus, both the MALDI-TOF MS-based and proteotyping approaches were successful in the identification of E. coli, although only proteotyping could identify S. aureus correctly in all samples. Compared with the MALDI-TOF MS-based approaches, shotgun proteotyping demonstrated higher sensitivity and accuracy, and required significantly shorter incubation time before detection and identification of the correct pathogen could be accomplished.


Subject(s)
Bacteremia , Staphylococcal Infections , Bacteremia/diagnosis , Candida albicans , Escherichia coli , Humans , Spectrometry, Mass, Matrix-Assisted Laser Desorption-Ionization , Staphylococcal Infections/diagnosis , Staphylococcus aureus
5.
J Proteome Res ; 20(3): 1476-1487, 2021 03 05.
Article in English | MEDLINE | ID: mdl-33573382

ABSTRACT

Simple light isotope metabolic labeling (SLIM labeling) is an innovative method to quantify variations in the proteome based on an original in vivo labeling strategy. Heterotrophic cells grown in U-[12C] as the sole source of carbon synthesize U-[12C]-amino acids, which are incorporated into proteins, giving rise to U-[12C]-proteins. This results in a large increase in the intensity of the monoisotope ion of peptides and proteins, thus allowing higher identification scores and protein sequence coverage in mass spectrometry experiments. This method, initially developed for signal processing and quantification of the incorporation rate of 12C into peptides, was based on a multistep process that was difficult to implement for many laboratories. To overcome these limitations, we developed a new theoretical background to analyze bottom-up proteomics data using SLIM-labeling (bSLIM) and established simple procedures based on open-source software, using dedicated OpenMS modules, and embedded R scripts to process the bSLIM experimental data. These new tools allow computation of both the 12C abundance in peptides to follow the kinetics of protein labeling and the molar fraction of unlabeled and 12C-labeled peptides in multiplexing experiments to determine the relative abundance of proteins extracted under different biological conditions. They also make it possible to consider incomplete 12C labeling, such as that observed in cells with nutritional requirements for nonlabeled amino acids. These tools were validated on an experimental dataset produced using various yeast strains of Saccharomyces cerevisiae and growth conditions. The workflows are built on the implementation of appropriate calculation modules in a KNIME working environment. These new integrated tools provide a convenient framework for the wider use of the SLIM-labeling strategy.


Subject(s)
Proteome , Proteomics , Amino Acid Sequence , Isotope Labeling , Mass Spectrometry
6.
J Am Soc Mass Spectrom ; 31(1): 85-102, 2020 01 02.
Article in English | MEDLINE | ID: mdl-32881514

ABSTRACT

Rapid and accurate identification of microorganisms and estimation of their biomasses are of extreme importance to public health. Mass spectrometry has become an important technique for these purposes. Previously we published a workflow named Microorganism Classification and Identification (MiCId v.12.26.2017) that was shown to perform no worse than other workflows. This manuscript presents MiCId v.12.13.2018 that, in comparison with the earlier version v.12.26.2017, allows for biomass estimates, provides more accurate microorganism identifications (better controls the number of false positives), and is robust against database size increase. This significant advance is made possible by several new ingredients introduced: first, we apply a modified expectation-maximization method to compute for each taxon considered a prior probability, which can be used for biomass estimate; second, we introduce a new concept called ownership, through which the participation ratio is computed and use it as the number of taxa to be kept within a cluster of closely related taxa; third, based on confidently identified peptides, we calculate for each taxon its degree of independence from the rest of taxa considered to determine whether or not to split this taxon off the cluster. Using 270 data files, each containing a large number of MS/MS spectra, we show that, in comparison with v.12.26.2017, version v.12.13.2018 yields superior retrieval results. We also show that MiCId v.12.13.2018 can estimate species biomass reasonably well. The new MiCId v.12.13.2018, designed to run in Linux environment, is freely available for download at https://www.ncbi.nlm.nih.gov/CBBresearch/Yu/downloads.html.


Subject(s)
Microbiological Techniques/methods , Software , Tandem Mass Spectrometry/methods , Bacteria/chemistry , Bacteria/classification , Bacteria/isolation & purification , Biomass , Computational Biology/methods , Databases, Factual , Databases, Protein , Peptides/chemistry , Workflow
7.
Proteomics ; 19(14): e1800367, 2019 07.
Article in English | MEDLINE | ID: mdl-30908818

ABSTRACT

Mass spectrometry-based proteomics starts with identifications of peptides and proteins, which provide the bases for forming the next-level hypotheses whose "validations" are often employed for forming even higher level hypotheses and so forth. Scientifically meaningful conclusions are thus attainable only if the number of falsely identified peptides/proteins is accurately controlled. For this reason, RAId continued to be developed in the past decade. RAId employs rigorous statistics for peptides/proteins identification, hence assigning accurate P-values/E-values that can be used confidently to control the number of falsely identified peptides and proteins. The RAId web service is a versatile tool built to identify peptides and proteins from tandem mass spectrometry data. Not only recognizing various spectra file formats, the web service also allows four peptide scoring functions and choice of three statistical methods for assigning P-values/E-values to identified peptides. Users may upload their own protein database or use one of the available knowledge integrated organismal databases that contain annotated information such as single amino acid polymorphisms, post-translational modifications, and their disease associations. The web service also provides a friendly interface to display, sort using different criteria, and download the identified peptides and proteins. RAId web service is freely available at https://www.ncbi.nlm.nih.gov/CBBresearch/Yu/raid.


Subject(s)
Databases, Protein , Mass Spectrometry/methods , Proteomics/methods , Computational Biology
8.
J Am Soc Mass Spectrom ; 29(8): 1721-1737, 2018 Aug.
Article in English | MEDLINE | ID: mdl-29873019

ABSTRACT

Rapid and accurate identification and classification of microorganisms is of paramount importance to public health and safety. With the advance of mass spectrometry (MS) technology, the speed of identification can be greatly improved. However, the increasing number of microbes sequenced is complicating correct microbial identification even in a simple sample due to the large number of candidates present. To properly untwine candidate microbes in samples containing one or more microbes, one needs to go beyond apparent morphology or simple "fingerprinting"; to correctly prioritize the candidate microbes, one needs to have accurate statistical significance in microbial identification. We meet these challenges by using peptide-centric representations of microbes to better separate them and by augmenting our earlier analysis method that yields accurate statistical significance. Here, we present an updated analysis workflow that uses tandem MS (MS/MS) spectra for microbial identification or classification. We have demonstrated, using 226 MS/MS publicly available data files (each containing from 2500 to nearly 100,000 MS/MS spectra) and 4000 additional MS/MS data files, that the updated workflow can correctly identify multiple microbes at the genus and often the species level for samples containing more than one microbe. We have also shown that the proposed workflow computes accurate statistical significances, i.e., E values for identified peptides and unified E values for identified microbes. Our updated analysis workflow MiCId, a freely available software for Microorganism Classification and Identification, is available for download at https://www.ncbi.nlm.nih.gov/CBBresearch/Yu/downloads.html.

9.
BMC Res Notes ; 11(1): 182, 2018 Mar 15.
Article in English | MEDLINE | ID: mdl-29544540

ABSTRACT

OBJECTIVE: RAId is a software package that has been actively developed for the past 10 years for computationally and visually analyzing MS/MS data. Founded on rigorous statistical methods, RAId's core program computes accurate E-values for peptides and proteins identified during database searches. Making this robust tool readily accessible for the proteomics community by developing a graphical user interface (GUI) is our main goal here. RESULTS: We have constructed a graphical user interface to facilitate the use of RAId on users' local machines. Written in Java, RAId_GUI not only makes it easy to run RAId but also provides tools for data/spectra visualization, MS-product analysis, molecular isotopic distribution analysis, and graphing the retrieval versus the proportion of false discoveries. The results viewer displays the analysis results and allows users to download them. Both the knowledge-integrated organismal databases and the code package (containing source code, the graphical user interface, and a user manual) are available for download at https://www.ncbi.nlm.nih.gov/CBBresearch/Yu/downloads/raid.html.


Subject(s)
Computational Biology/methods , Proteome/analysis , Proteomics/methods , Software , User-Computer Interface , Databases, Protein , Humans , Internet , Tandem Mass Spectrometry/methods
10.
Bioinformatics ; 32(17): 2642-9, 2016 09 01.
Article in English | MEDLINE | ID: mdl-27153659

ABSTRACT

MOTIVATION: There is a growing trend for biomedical researchers to extract evidence and draw conclusions from mass spectrometry based proteomics experiments, the cornerstone of which is peptide identification. Inaccurate assignments of peptide identification confidence thus may have far-reaching and adverse consequences. Although some peptide identification methods report accurate statistics, they have been limited to certain types of scoring function. The extreme value statistics based method, while more general in the scoring functions it allows, demands accurate parameter estimates and requires, at least in its original design, excessive computational resources. Improving the parameter estimate accuracy and reducing the computational cost for this method has two advantages: it provides another feasible route to accurate significance assessment, and it could provide reliable statistics for scoring functions yet to be developed. RESULTS: We have formulated and implemented an efficient algorithm for calculating the extreme value statistics for peptide identification applicable to various scoring functions, bypassing the need for searching large random databases. AVAILABILITY AND IMPLEMENTATION: The source code, implemented in C++ on a Linux system, is available for download at ftp://ftp.ncbi.nlm.nih.gov/pub/qmbp/qmbp_ms/RAId/RAId_Linux_64Bit. CONTACT: yyu@ncbi.nlm.nih.gov. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Algorithms , Mass Spectrometry , Peptides , Proteomics , Databases, Protein , Humans , Tandem Mass Spectrometry
11.
J Am Soc Mass Spectrom ; 27(2): 194-210, 2016 Feb.
Article in English | MEDLINE | ID: mdl-26510657

ABSTRACT

Correct and rapid identification of microorganisms is the key to the success of many important applications in health and safety, including, but not limited to, infection treatment, food safety, and biodefense. With the advance of mass spectrometry (MS) technology, the speed of identification can be greatly improved. However, the increasing number of microbes sequenced is challenging correct microbial identification because of the large number of choices present. To properly disentangle candidate microbes, one needs to go beyond apparent morphology or simple 'fingerprinting'; to correctly prioritize the candidate microbes, one needs to have accurate statistical significance in microbial identification. We meet these challenges by using peptidome profiles of microbes to better separate them and by designing an analysis method that yields accurate statistical significance. Here, we present an analysis pipeline that uses tandem MS (MS/MS) spectra for microbial identification or classification. We have demonstrated, using MS/MS data of 81 samples, each composed of a single known microorganism, that the proposed pipeline can correctly identify microorganisms at least at the genus and species levels. We have also shown that the proposed pipeline computes accurate statistical significances, i.e., E-values for identified peptides and unified E-values for identified microorganisms. The proposed analysis pipeline has been implemented in MiCId, a freely available software for Microorganism Classification and Identification. MiCId is available for download at http://www.ncbi.nlm.nih.gov/CBBresearch/Yu/downloads.html.


Subject(s)
Bacteria/classification , Tandem Mass Spectrometry/methods , Tandem Mass Spectrometry/statistics & numerical data , Bacteria/chemistry , Databases, Factual , Escherichia coli/classification , Peptides/analysis , Peptides/chemistry , Pseudomonas aeruginosa/classification , Software
12.
Bioinformatics ; 31(5): 699-706, 2015 Mar 01.
Article in English | MEDLINE | ID: mdl-25362092

ABSTRACT

MOTIVATION: Assigning statistical significance accurately has become increasingly important as metadata of many types, often assembled in hierarchies, are constructed and combined for further biological analyses. Statistical inaccuracy of metadata at any level may propagate to downstream analyses, undermining the validity of scientific conclusions thus drawn. From the perspective of mass spectrometry-based proteomics, even though accurate statistics for peptide identification can now be achieved, accurate protein level statistics remain challenging. RESULTS: We have constructed a protein ID method that combines peptide evidences of a candidate protein based on a rigorous formula derived earlier; in this formula the database P-value of every peptide is weighted, prior to the final combination, according to the number of proteins it maps to. We have also shown that this protein ID method provides accurate protein level E-value, eliminating the need of using empirical post-processing methods for type-I error control. Using a known protein mixture, we find that this protein ID method, when combined with the Soric formula, yields accurate values for the proportion of false discoveries. In terms of retrieval efficacy, the results from our method are comparable with other methods tested. AVAILABILITY AND IMPLEMENTATION: The source code, implemented in C++ on a linux system, is available for download at ftp://ftp.ncbi.nlm.nih.gov/pub/qmbp/qmbp_ms/RAId/RAId_Linux_64Bit.


Subject(s)
Algorithms , Databases, Protein , Mass Spectrometry/methods , Models, Statistical , Peptide Fragments/analysis , Proteins/analysis , Proteomics/methods , Humans , Proteins/metabolism
13.
PLoS One ; 9(3): e91225, 2014.
Article in English | MEDLINE | ID: mdl-24663491

ABSTRACT

Meta-analysis methods that combine P-values into a single unified P-value are frequently employed to improve confidence in hypothesis testing. An assumption made by most meta-analysis methods is that the P-values to be combined are independent, which may not always be true. To investigate the accuracy of the unified P-value from combining correlated P-values, we have evaluated a family of statistical methods that combine: independent, weighted independent, correlated, and weighted correlated P-values. Statistical accuracy evaluation by combining simulated correlated P-values showed that correlation among P-values can have a significant effect on the accuracy of the combined P-value obtained. Among the statistical methods evaluated those that weight P-values compute more accurate combined P-values than those that do not. Also, statistical methods that utilize the correlation information have the best performance, producing significantly more accurate combined P-values. In our study we have demonstrated that statistical methods that combine P-values based on the assumption of independence can produce inaccurate P-values when combining correlated P-values, even when the P-values are only weakly correlated. Therefore, to prevent from drawing false conclusions during hypothesis testing, our study advises caution be used when interpreting the P-value obtained from combining P-values of unknown correlation. However, when the correlation information is available, the weighting-capable statistical method, first introduced by Brown and recently modified by Hou, seems to perform the best amongst the methods investigated.
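
The effect described in this abstract can be reproduced with a small simulation. The sketch below is not the authors' evaluation protocol. It assumes two one-sided P-values derived from bivariate normal z-scores with correlation `rho`, combines them with Fisher's method (which presumes independence), and measures how often the combined P-value falls at or below 0.05 when the null hypothesis is true. All function names and parameter choices are illustrative.

```python
import math
import random
from statistics import NormalDist


def fisher_combined_p(p_values):
    """Fisher's method; exact only when the P-values are independent."""
    n = len(p_values)
    half_x = -sum(math.log(p) for p in p_values)  # x/2 with x = -2*sum(ln p)
    # Closed-form chi-square survival function for 2n degrees of freedom.
    term, total = 1.0, 1.0
    for k in range(1, n):
        term *= half_x / k
        total += term
    return math.exp(-half_x) * total


def false_positive_rate(rho, trials=20000, alpha=0.05, seed=1):
    """Fraction of null trials with combined P <= alpha for correlation rho."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    hits = 0
    for _ in range(trials):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        # Upper-tail P-values; cdf(-z) avoids rounding 1 - cdf(z) to zero.
        p1 = std_normal.cdf(-z1)
        p2 = std_normal.cdf(-z2)
        if fisher_combined_p([p1, p2]) <= alpha:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    for rho in (0.0, 0.5, 0.9):
        print(rho, false_positive_rate(rho))
```

With independent inputs the observed rate should sit near the nominal 0.05, while positive correlation pushes it higher, which is the kind of inaccuracy the abstract warns about.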


Subject(s)
Meta-Analysis as Topic , Statistics as Topic/methods , Data Interpretation, Statistical
14.
J Am Soc Mass Spectrom ; 25(1): 57-70, 2014 Jan.
Article in English | MEDLINE | ID: mdl-24254576

ABSTRACT

In this paper, we present Molecular Isotopic Distribution Analysis (MIDAs), a new software tool designed to compute molecular isotopic distributions with adjustable accuracies. MIDAs offers two algorithms, one polynomial-based and one Fourier-transform-based, both of which compute molecular isotopic distributions accurately and efficiently. The polynomial-based algorithm contains few novel aspects, whereas the Fourier-transform-based algorithm consists mainly of improvements to other existing Fourier-transform-based algorithms. We have benchmarked the performance of the two algorithms implemented in MIDAs with that of eight software packages (BRAIN, Emass, Mercury, Mercury5, NeutronCluster, Qmass, JFC, IC) using a consensus set of benchmark molecules. Under the proposed evaluation criteria, MIDAs's algorithms, JFC, and Emass compute with comparable accuracy the coarse-grained (low-resolution) isotopic distributions and are more accurate than the other software packages. For fine-grained isotopic distributions, we compared IC, MIDAs's polynomial algorithm, and MIDAs's Fourier transform algorithm. Among the three, IC and MIDAs's polynomial algorithm compute isotopic distributions that better resemble their corresponding exact fine-grained (high-resolution) isotopic distributions. MIDAs can be accessed freely through a user-friendly web-interface at http://www.ncbi.nlm.nih.gov/CBBresearch/Yu/midas/index.html.
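
As a rough illustration of the polynomial view of coarse-grained isotopic distributions, each atom contributes a small polynomial whose coefficients are its isotope abundances, and the molecular distribution is the product of those polynomials. The sketch below is a naive expansion by repeated convolution, not the MIDAs algorithm itself. The abundance values are approximate natural abundances, and the function names are illustrative.

```python
# Approximate natural isotope abundances, indexed by the number of extra
# neutrons relative to the lightest isotope (0, +1, +2).
ISOTOPE_ABUNDANCE = {
    "C": [0.9893, 0.0107],
    "H": [0.999885, 0.000115],
    "N": [0.99636, 0.00364],
    "O": [0.99757, 0.00038, 0.00205],
}


def convolve(a, b):
    """Multiply two polynomials given as coefficient lists."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def coarse_isotopic_distribution(formula, max_peaks=6):
    """Probabilities of the first `max_peaks` nominal-mass peaks.

    `formula` maps an element symbol to its atom count, e.g. glucose is
    {"C": 6, "H": 12, "O": 6}. Truncating after every multiplication is
    safe because low-order coefficients never depend on higher-order ones.
    """
    dist = [1.0]
    for element, count in formula.items():
        for _ in range(count):
            dist = convolve(dist, ISOTOPE_ABUNDANCE[element])[:max_peaks]
    return dist


if __name__ == "__main__":
    for shift, prob in enumerate(coarse_isotopic_distribution({"C": 6, "H": 12, "O": 6})):
        print(f"M+{shift}: {prob:.6f}")
```

Repeated convolution costs one polynomial multiplication per atom, so it is only practical for small molecules. Faster schemes, such as exponentiation by squaring or Fourier-transform methods, follow the same idea.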


Subject(s)
Isotopes/chemistry , Mass Spectrometry/methods , Software , Algorithms , Internet , Molecular Weight , Proteomics
15.
J Proteome Res ; 12(6): 2571-81, 2013 Jun 07.
Article in English | MEDLINE | ID: mdl-23668635

ABSTRACT

Because of its high specificity, trypsin is the enzyme of choice in shotgun proteomics. Nonetheless, several publications do report the identification of semitryptic and nontryptic peptides. Many of these peptides are thought to be signaling peptides or to have formed during sample preparation. It is known that only a small fraction of tandem mass spectra from a trypsin-digested protein mixture can be confidently matched to tryptic peptides. If other possibilities such as post-translational modifications and single-amino acid polymorphisms are ignored, this suggests that many unidentified spectra originate from semitryptic and nontryptic peptides. To include them in database searches, however, may not improve overall peptide identification because of the possible sensitivity reduction from search space expansion. To circumvent this issue for E-value-based search methods, we have designed a scheme that categorizes qualified peptides (i.e., peptides whose differences in molecular weight from the parent ion are within a specified error tolerance) into three tiers: tryptic, semitryptic, and nontryptic. This classification allows peptides that belong to different tiers to have different Bonferroni correction factors. Our results show that this scheme can significantly improve retrieval performance compared to those of search strategies that assign equal Bonferroni correction factors to all qualified peptides.
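
To make the tiering idea concrete, the sketch below contrasts a single Bonferroni factor with tier-specific factors. It assumes, purely for illustration, that the correction factor for a candidate is the number of qualified peptides in its own tier. The exact factors used in the paper may be defined differently, and the peptide counts are hypothetical.

```python
def e_value_uniform(p_value, tier_counts):
    """E-value with one Bonferroni factor: all qualified peptides."""
    return p_value * sum(tier_counts.values())


def e_value_tiered(p_value, tier, tier_counts):
    """E-value with a tier-specific factor (assumed to be the number of
    qualified peptides in the candidate's own tier)."""
    return p_value * tier_counts[tier]


# Hypothetical numbers of qualified peptides for one spectrum.
counts = {"tryptic": 50, "semitryptic": 2000, "nontryptic": 50000}
p = 1e-4

print(e_value_uniform(p, counts))
for tier in counts:
    print(tier, e_value_tiered(p, tier, counts))
```

Under these assumptions, a tryptic candidate is no longer penalized for the much larger nontryptic search space, which is one way the sensitivity loss from search space expansion could be avoided.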


Subject(s)
Algorithms , Models, Statistical , Molecular Sequence Annotation/statistics & numerical data , Peptide Fragments/isolation & purification , Sequence Analysis, Protein/statistics & numerical data , Animals , Humans , Proteolysis , Proteomics , Sensitivity and Specificity , Tandem Mass Spectrometry , Trypsin/chemistry
16.
PLoS One ; 6(8): e22647, 2011.
Article in English | MEDLINE | ID: mdl-21912585

ABSTRACT

Given the expanding availability of scientific data and tools to analyze them, combining different assessments of the same piece of information has become increasingly important for social, biological, and even physical sciences. This task demands, to begin with, a method-independent standard, such as the P-value, that can be used to assess the reliability of a piece of information. Good's formula and Fisher's method combine independent P-values with respectively unequal and equal weights. Both approaches may be regarded as limiting instances of a general case of combining P-values from m groups; P-values within each group are weighted equally, while weight varies by group. When some of the weights become nearly degenerate, as cautioned by Good, numeric instability occurs in computation of the combined P-values. We deal explicitly with this difficulty by deriving a controlled expansion, in powers of differences in inverse weights, that provides both accurate statistics and stable numerics. We illustrate the utility of this systematic approach with a few examples. In addition, we also provide here an alternative derivation for the probability distribution function of the general case and show how the analytic formula obtained reduces to both Good's and Fisher's methods as special cases. A C++ program, which computes the combined P-values with equal numerical stability regardless of whether weights are (nearly) degenerate or not, is available for download at our group website http://www.ncbi.nlm.nih.gov/CBBresearch/Yu/downloads/CoinedPValues.html.
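
The two classical formulas named in this abstract are short enough to write out. The sketch below is not the authors' C++ program and does not implement their controlled expansion. It shows Fisher's method for equal weights and Good's formula for pairwise distinct weights, both for independent P-values. Good's formula divides by differences of weights, so it fails for equal weights and loses precision through cancellation when weights are nearly equal, which is the instability the abstract addresses.

```python
import math


def fisher_combined_p(p_values):
    """Fisher's method: combined P-value for independent, equally weighted
    P-values, via the closed-form chi-square tail with 2n degrees of freedom."""
    n = len(p_values)
    half_x = -sum(math.log(p) for p in p_values)
    term, total = 1.0, 1.0
    for k in range(1, n):
        term *= half_x / k
        total += term
    return math.exp(-half_x) * total


def good_combined_p(p_values, weights):
    """Good's formula for the statistic prod(p_i ** w_i); it requires the
    weights to be positive and pairwise distinct."""
    n = len(weights)
    tau = math.exp(sum(w * math.log(p) for p, w in zip(p_values, weights)))
    total = 0.0
    for i, wi in enumerate(weights):
        denom = 1.0
        for j, wj in enumerate(weights):
            if j != i:
                denom *= wi - wj  # zero for equal weights: formula breaks down
        total += wi ** (n - 1) / denom * tau ** (1.0 / wi)
    return total


if __name__ == "__main__":
    p = [0.05, 0.05]
    print(fisher_combined_p(p))
    print(good_combined_p(p, [1.0, 1.001]))  # close to the Fisher result
    print(good_combined_p(p, [1.0, 2.0]))
```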


Subject(s)
Data Interpretation, Statistical , Probability
17.
J Proteomics ; 74(2): 199-211, 2011 Feb 01.
Article in English | MEDLINE | ID: mdl-21055489

ABSTRACT

Querying MS/MS spectra against a database containing only proteotypic peptides reduces data analysis time due to reduction of database size. Despite the speed advantage, this search strategy is challenged by issues of statistical significance and coverage. The former requires separating systematically significant identifications from less confident identifications, while the latter arises when the underlying peptide is not present, due to single amino acid polymorphisms (SAPs) or post-translational modifications (PTMs), in the proteotypic peptide libraries searched. To address both issues simultaneously, we have extended RAId's knowledge database to include proteotypic information, utilized RAId's statistical strategy to assign statistical significance to proteotypic peptides, and modified RAId's programs to allow for consideration of proteotypic information during database searches. The extended database alleviates the coverage problem since all annotated modifications, even those that occurred within proteotypic peptides, may be considered. Taking into account the likelihoods of observation, the statistical strategy of RAId provides accurate E-value assignments regardless whether a candidate peptide is proteotypic or not. The advantage of including proteotypic information is evidenced by its superior retrieval performance when compared to regular database searches.


Subject(s)
Data Interpretation, Statistical , Databases, Protein , Peptides/analysis , Protein Hydrolysates/analysis , Proteomics/methods , Information Storage and Retrieval , Peptides/chemistry , Protein Hydrolysates/chemistry , Protein Hydrolysates/metabolism , Trypsin/metabolism
18.
PLoS One ; 5(11): e15438, 2010 Nov 16.
Article in English | MEDLINE | ID: mdl-21103371

ABSTRACT

Statistically meaningful comparison/combination of peptide identification results from various search methods is impeded by the lack of a universal statistical standard. Providing an E-value calibration protocol, we demonstrated earlier the feasibility of translating either the score or heuristic E-value reported by any method into the textbook-defined E-value, which may serve as the universal statistical standard. This protocol, although robust, may lose spectrum-specific statistics and might require a new calibration when changes in experimental setup occur. To mitigate these issues, we developed a new MS/MS search tool, RAId_aPS, that is able to provide spectrum-specific E-values for additive scoring functions. Given a selection of scoring functions out of RAId score, K-score, Hyperscore and XCorr, RAId_aPS generates the corresponding score histograms of all possible peptides using dynamic programming. Using these score histograms to assign E-values enables a calibration-free protocol for accurate significance assignment for each scoring function. RAId_aPS features four different modes: (i) compute the total number of possible peptides for a given molecular mass range, (ii) generate the score histogram given an MS/MS spectrum and a scoring function, (iii) reassign E-values for a list of candidate peptides given an MS/MS spectrum and the scoring functions chosen, and (iv) perform database searches using selected scoring functions. In modes (iii) and (iv), RAId_aPS is also capable of combining results from different scoring functions using spectrum-specific statistics. The web link is http://www.ncbi.nlm.nih.gov/CBBresearch/Yu/raid_aps/index.html. Relevant binaries for Linux, Windows, and Mac OS X are available from the same page.
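
To make mode (i) more concrete, here is a minimal Python sketch of how dynamic programming can count all possible peptides in a mass range. It is an illustration under stated assumptions, not RAId_aPS's implementation: the function names are hypothetical, residue masses are rounded to nominal integer values, the mass of water and terminal groups is ignored, and modifications are not considered.

```python
# Nominal (integer) residue masses in Da. This coarse rounding is an
# assumption made only to keep the illustration small.
RESIDUE_MASS = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "I": 113, "N": 114, "D": 115, "Q": 128, "K": 128, "E": 129,
    "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}

def count_peptides_by_mass(max_mass):
    """Return counts[m] = number of residue sequences with total mass m.

    Recurrence: counts[m] = sum over residues r of counts[m - mass(r)],
    with counts[0] = 1 for the empty sequence. Order matters, so "GA"
    and "AG" are counted as two distinct peptides.
    """
    counts = [0] * (max_mass + 1)
    counts[0] = 1
    for m in range(1, max_mass + 1):
        counts[m] = sum(counts[m - r] for r in RESIDUE_MASS.values() if r <= m)
    return counts

def count_in_range(low, high):
    """Total number of residue sequences with low <= mass <= high."""
    counts = count_peptides_by_mass(high)
    return sum(counts[low:high + 1])
```

The same recurrence extends to score histograms by carrying a second index for the accumulated score, which is the sense in which additive scoring functions make the histogram of all possible peptides tractable.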


Subject(s)
Algorithms , Mass Spectrometry/methods , Peptides/analysis , Computational Biology/methods , Databases, Protein , Molecular Weight , Peptide Library , Peptides/chemistry , Proteomics/methods , Reproducibility of Results , Software
19.
BMC Genomics ; 9: 505, 2008 Oct 27.
Article in English | MEDLINE | ID: mdl-18954448

ABSTRACT

BACKGROUND: Existing scientific literature is a rich source of biological information such as disease markers. Integration of this information with data analysis may help researchers to identify possible controversies and to form useful hypotheses for further validations. In the context of proteomics studies, the era of individualized proteomics may be approached through consideration of amino acid substitutions/modifications as well as information from disease studies. Integration of such information with peptide searches facilitates speedy, dynamic information retrieval that may significantly benefit clinical laboratory studies. DESCRIPTION: We have integrated from various sources annotated single amino acid polymorphisms, post-translational modifications, and their documented disease associations (if they exist) into one enhanced database per organism. We have also augmented our peptide identification software RAId_DbS to take into account this information while analyzing a tandem mass spectrum. In principle, one may choose to respect or ignore the correlation of amino acid polymorphisms/modifications within each protein. The former leads to targeted searches and avoids scoring of unnecessary polymorphism/modification combinations; the latter explores possible polymorphisms in a controlled fashion. To facilitate new discoveries, RAId_DbS also allows users to conduct searches permitting novel polymorphisms as well as to search a knowledge database created by the users. CONCLUSION: We have finished constructing enhanced databases for 17 organisms. The web link to RAId_DbS and the enhanced databases is http://www.ncbi.nlm.nih.gov/CBBResearch/qmbp/RAId_DbS/index.html. The relevant databases and binaries of RAId_DbS for Linux, Windows, and Mac OS X are available for download from the same web page.


Subject(s)
Databases, Nucleic Acid/organization & administration , Internet , Mass Spectrometry/methods , Peptides/analysis , Animals , Computational Biology/methods , Humans , National Library of Medicine (U.S.) , Proteomics/methods , Software , United States
20.
Biol Direct ; 3: 27, 2008 Jul 02.
Article in English | MEDLINE | ID: mdl-18597684

ABSTRACT

BACKGROUND: Current experimental techniques, especially those applying liquid chromatography mass spectrometry, have made high-throughput proteomic studies possible. The increase in throughput, however, also raises concerns about the accuracy of identification or quantification. Most experimental procedures select in a given MS scan only the few most intense parent ions, each to be fragmented (MS2) separately; most other minor co-eluted peptides that have similar chromatographic retention times are ignored and their information is lost. RESULTS: We have computationally investigated the possibility of enhancing the information retrieval during a given LC/MS experiment by selecting the two or three most intense parent ions for simultaneous fragmentation. A set of spectra is created by superimposing a number of MS2 spectra, each of which can be identified with high confidence by all search methods tested, to mimic the spectra of co-eluted peptides. The generated convoluted spectra were used to evaluate the capability of several database search methods - SEQUEST, Mascot, X!Tandem, OMSSA, and RAId_DbS - in identifying true peptides from superimposed spectra of co-eluted peptides. We show that, using these simulated spectra, all the database search methods will eventually gain in the number of true peptides identified by using the compound spectra of co-eluted peptides. OPEN PEER REVIEW: Reviewed by Vlad Petyuk (nominated by Arcady Mushegian), King Jordan and Shamil Sunyaev. For the full reviews, please go to the Reviewers' comments section.
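
The superposition step described under RESULTS can be sketched in Python as follows. This is only an assumed, simplified version: the abstract does not specify how peaks were merged, so the fixed-width m/z binning, the `bin_width` parameter, and the function name are all hypothetical choices made for illustration.

```python
from collections import defaultdict

def superimpose_spectra(spectra, bin_width=1.0):
    """Merge several MS2 peak lists into one simulated compound spectrum.

    Each spectrum is a list of (mz, intensity) pairs. Peaks falling into
    the same m/z bin have their intensities summed. The result is a list
    of (bin_lower_edge, summed_intensity) pairs sorted by m/z.
    """
    merged = defaultdict(float)
    for peaks in spectra:
        for mz, intensity in peaks:
            merged[int(mz // bin_width)] += intensity
    return sorted((b * bin_width, total) for b, total in merged.items())
```

Calling `superimpose_spectra([spectrum_a, spectrum_b])` on two confidently identified spectra would yield one peak list that can be passed to a search engine in place of either original, mimicking simultaneous fragmentation of two co-eluted parent ions.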


Subject(s)
Databases, Protein , Information Storage and Retrieval/methods , Peptides/chemistry , Peptides/isolation & purification , Amino Acid Sequence , Chromatography, High Pressure Liquid , Chromatography, Liquid , Molecular Sequence Data , Peptides/classification , Software , Spectrometry, Mass, Matrix-Assisted Laser Desorption-Ionization , Tandem Mass Spectrometry
...