Search | BVS CLAP/SMR-OPS/OMS

Speeding Up Percolator.

Halloran, John T; Zhang, Hantian; Kara, Kaan; Renggli, Cédric; The, Matthew; Zhang, Ce; Rocke, David M; Käll, Lukas; Noble, William Stafford.

J Proteome Res ; 18(9): 3353-3359, 2019 09 06.

Article in English | MEDLINE | ID: mdl-31407580

ABSTRACT

The processing of peptide tandem mass spectrometry data involves matching observed spectra against a sequence database. The ranking and calibration of these peptide-spectrum matches can be improved substantially using a machine learning postprocessor. Here, we describe our efforts to speed up one widely used postprocessor, Percolator. The improved software is dramatically faster than the previous version of Percolator, even when using relatively few processors. We tested the new version of Percolator on a data set containing over 215 million spectra and recorded an overall reduction to 23% of the running time as compared to the unoptimized code. We also show that the memory footprint required by these speedups is modest relative to that of the original version of Percolator.

Subject(s)

Peptides/genetics , Proteomics/methods , Software , Algorithms , Protein Databases , Machine Learning , Peptides/classification , Peptides/isolation & purification , Tandem Mass Spectrometry/methods
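
Note on the figure quoted in this abstract: a reduction to 23% of the original running time corresponds to a speedup of roughly 4.3-fold. A minimal sketch of that arithmetic follows; the function name is ours and does not come from the paper.

```python
def speedup_from_fraction(fraction_of_original: float) -> float:
    """Convert 'runs in this fraction of the original time' into a speedup factor."""
    if not 0.0 < fraction_of_original <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return 1.0 / fraction_of_original


# 23% of the original running time, as reported in the abstract.
print(round(speedup_from_fraction(0.23), 2))  # 4.35
```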

A Matter of Time: Faster Percolator Analysis via Efficient SVM Learning for Large-Scale Proteomics.

Halloran, John T; Rocke, David M.

J Proteome Res ; 17(5): 1978-1982, 2018 05 04.

Article in English | MEDLINE | ID: mdl-29607643

ABSTRACT

Percolator is an important tool for greatly improving the results of a database search and subsequent downstream analysis. Using support vector machines (SVMs), Percolator recalibrates peptide-spectrum matches based on the learned decision boundary between targets and decoys. To improve analysis time for large-scale data sets, we update Percolator's SVM learning engine through software and algorithmic optimizations rather than heuristic approaches that necessitate the careful study of their impact on learned parameters across different search settings and data sets. We show that by optimizing Percolator's original learning algorithm, l2-SVM-MFN, large-scale SVM learning requires only about a third of the original runtime. Furthermore, we show that by employing the widely used Trust Region Newton (TRON) algorithm instead of l2-SVM-MFN, large-scale Percolator SVM learning is reduced to only about a fifth of the original runtime. Importantly, these speedups only affect the speed at which Percolator converges to a global solution and do not alter recalibration performance. The upgraded versions of both l2-SVM-MFN and TRON are optimized within the Percolator codebase for multithreaded and single-thread use and are available under Apache license at bitbucket.org/jthalloran/percolator_upgrade.

Subject(s)

Machine Learning , Proteomics/methods , Software , Algorithms , Protein Databases , Support Vector Machine , Time Factors
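
Note on the learning problem in this abstract: an L2-SVM separates targets from decoys by minimizing a regularized squared hinge loss. The sketch below evaluates that objective and its gradient on toy data and minimizes it with plain gradient descent. It is not l2-SVM-MFN or TRON, which are Newton-type solvers; the data, step size, regularization weight, and function names are all illustrative assumptions.

```python
def l2svm_objective_and_grad(w, X, y, lam):
    """Return (objective, gradient) for 0.5*lam*||w||^2 + 0.5*sum(max(0, 1 - y*w.x)^2)."""
    obj = 0.5 * lam * sum(wj * wj for wj in w)
    grad = [lam * wj for wj in w]
    for xi, yi in zip(X, y):
        margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
        slack = 1.0 - margin
        if slack > 0:  # only margin violators contribute to loss and gradient
            obj += 0.5 * slack * slack
            for j, xj in enumerate(xi):
                grad[j] -= slack * yi * xj
    return obj, grad


def train(X, y, lam=0.1, step=0.05, iters=2000):
    """Plain gradient descent (a stand-in for the Newton-type solvers in the paper)."""
    w = [0.0] * len(X[0])
    for _ in range(iters):
        _, g = l2svm_objective_and_grad(w, X, y, lam)
        w = [wj - step * gj for wj, gj in zip(w, g)]
    return w


# Toy data: targets are labelled +1, decoys -1; the last feature is a bias term.
X = [[2.0, 1.0, 1.0], [1.5, 2.0, 1.0], [-1.0, -1.5, 1.0], [-2.0, -0.5, 1.0]]
y = [1, 1, -1, -1]
w = train(X, y)
scores = [sum(wj * xj for wj, xj in zip(w, xi)) for xi in X]
print(scores)
```

Because the objective is convex, any solver that converges reaches the same global solution, which is consistent with the abstract's statement that the speedups do not alter recalibration performance.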

Faster and more accurate graphical model identification of tandem mass spectra using trellises.

Wang, Shengjie; Halloran, John T; Bilmes, Jeff A; Noble, William S.

Bioinformatics ; 32(12): i322-i331, 2016 06 15.

Article in English | MEDLINE | ID: mdl-27307634

ABSTRACT

Tandem mass spectrometry (MS/MS) is the dominant high-throughput technology for identifying and quantifying proteins in complex biological samples. Analysis of the tens of thousands of fragmentation spectra produced by an MS/MS experiment begins by assigning to each observed spectrum the peptide that is hypothesized to be responsible for generating the spectrum. This assignment is typically done by searching each spectrum against a database of peptides. To our knowledge, all existing MS/MS search engines compute scores individually between a given observed spectrum and each possible candidate peptide from the database. In this work, we use a trellis, a data structure capable of jointly representing a large set of candidate peptides, to avoid redundantly recomputing common sub-computations among different candidates. We show how trellises may be used to significantly speed up existing scoring algorithms, and we theoretically quantify the expected speedup afforded by trellises. Furthermore, we demonstrate that compact trellis representations of whole sets of peptides enable efficient discriminative learning of a dynamic Bayesian network for spectrum identification, leading to greatly improved spectrum identification accuracy. CONTACT: bilmes@uw.edu or william-noble@uw.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Tandem Mass Spectrometry , Algorithms , Bayes Theorem , Protein Databases , Peptides , Proteins , Proteomics
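
Note on the idea of sharing sub-computations among candidates: the sketch below uses a prefix cache, a much simpler relative of the trellis described in this abstract, to show how candidates with common prefixes can reuse cumulative-mass computations instead of recomputing them per peptide. The peptides, the integer (nominal) residue masses, and the function names are illustrative assumptions; the paper's trellis construction and scoring functions are not reproduced here.

```python
# Nominal (integer) residue masses, used purely for illustration.
TOY_MASS = {"A": 71, "G": 57, "K": 128, "L": 113, "S": 87}


def prefix_masses_naive(peptides):
    """Compute cumulative prefix masses separately for every candidate."""
    additions = 0
    result = {}
    for pep in peptides:
        total, masses = 0, []
        for aa in pep:
            total += TOY_MASS[aa]
            additions += 1
            masses.append(total)
        result[pep] = masses
    return result, additions


def prefix_masses_shared(peptides):
    """Compute each distinct prefix mass once and reuse it across candidates."""
    cache = {"": 0}
    additions = 0
    result = {}
    for pep in peptides:
        masses = []
        for i in range(1, len(pep) + 1):
            prefix = pep[:i]
            if prefix not in cache:
                cache[prefix] = cache[prefix[:-1]] + TOY_MASS[pep[i - 1]]
                additions += 1
            masses.append(cache[prefix])
        result[pep] = masses
    return result, additions


candidates = ["GASK", "GASL", "GALK"]
naive, n_naive = prefix_masses_naive(candidates)
shared, n_shared = prefix_masses_shared(candidates)
print(n_naive, n_shared)  # 12 7
```

Both functions return identical masses; only the number of additions differs, and the gap widens as more candidates share prefixes.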

Dynamic Bayesian Network for Accurate Detection of Peptides from Tandem Mass Spectra.

Halloran, John T; Bilmes, Jeff A; Noble, William S.

J Proteome Res ; 15(8): 2749-59, 2016 08 05.

Article in English | MEDLINE | ID: mdl-27397138

ABSTRACT

A central problem in mass spectrometry analysis involves identifying, for each observed tandem mass spectrum, the corresponding generating peptide. We present a dynamic Bayesian network (DBN) toolkit that addresses this problem by using a machine learning approach. At the heart of this toolkit is a DBN for Rapid Identification (DRIP), which can be trained from collections of high-confidence peptide-spectrum matches (PSMs). DRIP's score function considers fragment ion matches using Gaussians rather than fixed fragment-ion tolerances and also finds the optimal alignment between the theoretical and observed spectrum by considering all possible alignments, up to a threshold that is controlled using a beam-pruning algorithm. This function not only yields state-of-the-art database search accuracy but also can be used to generate features that significantly boost the performance of the Percolator postprocessor. The DRIP software is built upon a general purpose DBN toolkit (GMTK), thereby allowing a wide variety of options for user-specific inference tasks as well as facilitating easy modifications to the DRIP model in future work. DRIP is implemented in Python and C++ and is available under Apache license at http://melodi-lab.github.io/dripToolkit.

Subject(s)

Machine Learning , Peptides/analysis , Proteomics/methods , Bayes Theorem , Protein Databases , Software , Tandem Mass Spectrometry
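
Note on Gaussian versus fixed-tolerance matching: the sketch below contrasts a hard fragment-ion tolerance window with a Gaussian weighting of the m/z difference. It only illustrates the general distinction drawn in this abstract and is not DRIP's score function; the tolerance, the standard deviation, and the function names are assumed values chosen for the example.

```python
import math


def fixed_tolerance_match(observed_mz, theoretical_mz, tol=0.5):
    """Hard window: a peak either matches (1.0) or does not (0.0)."""
    return 1.0 if abs(observed_mz - theoretical_mz) <= tol else 0.0


def gaussian_match(observed_mz, theoretical_mz, sigma=0.25):
    """Soft match: an unnormalized Gaussian that decays smoothly with distance."""
    d = observed_mz - theoretical_mz
    return math.exp(-0.5 * (d / sigma) ** 2)


theoretical = 500.0
for observed in (500.0, 500.3, 500.6):
    hard = fixed_tolerance_match(observed, theoretical)
    soft = gaussian_match(observed, theoretical)
    print(observed, hard, round(soft, 4))
```

The hard window treats a peak 0.3 m/z away exactly like a perfect match and discards one 0.6 m/z away entirely, whereas the Gaussian grades both by distance.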

Analyzing Tandem Mass Spectra Using the DRIP Toolkit: Training, Searching, and Post-Processing.

Halloran, John T.

Methods Mol Biol ; 1807: 163-180, 2018.

Article in English | MEDLINE | ID: mdl-30030810

ABSTRACT

Tandem mass spectrometry (MS/MS) is a high-throughput technology used to identify the proteins present in a complex biological sample. Critical to MS/MS is the ability to accurately identify the peptide responsible for producing each observed spectrum. Recently, a dynamic Bayesian network (DBN) approach was shown to achieve state-of-the-art accuracy for this peptide identification problem. Modeling the stochastic process by which a peptide produces an MS/MS spectrum, this DBN for Rapid Identification of Peptides (DRIP) uses probabilistic inference to efficiently determine the most probable alignment between a peptide and an observed spectrum. DRIP's dynamic alignment strategy improves upon standard "static" alignment strategies, which rely on fixed quantization of the temporal axis of MS/MS data, in several significant ways. In particular, DRIP allows learning non-linear shifts of the temporal axis and, owing to the generative nature of the model, accurate feature extraction for substantially improved discriminative analysis (i.e., Percolator post-processing), all of which are supported in the DRIP Toolkit (DTK). Herein we describe how DTK may be used to significantly improve MS/MS identification accuracy, as well as DTK's interactive features for fine-grained analysis, including on-the-fly inference and plotting attributes.

Subject(s)

Algorithms , Peptides/analysis , Tandem Mass Spectrometry/methods , Amino Acid Sequence , Bayes Theorem , Peptides/chemistry

Learning Concave Conditional Likelihood Models for Improved Analysis of Tandem Mass Spectra.

Halloran, John T; Rocke, David M.

Adv Neural Inf Process Syst ; 31: 5420-5430, 2018 Dec.

Article in English | MEDLINE | ID: mdl-31745383

ABSTRACT

The most widely used technology to identify the proteins present in a complex biological sample is tandem mass spectrometry, which quickly produces a large collection of spectra representative of the peptides (i.e., protein subsequences) present in the original sample. In this work, we greatly expand the parameter learning capabilities of a dynamic Bayesian network (DBN) peptide-scoring algorithm, Didea [25], by deriving emission distributions for which its conditional log-likelihood scoring function remains concave. We show that this class of emission distributions, called Convex Virtual Emissions (CVEs), naturally generalizes the log-sum-exp function while rendering both maximum likelihood estimation and conditional maximum likelihood estimation concave for a wide range of Bayesian networks. Utilizing CVEs in Didea allows efficient learning of a large number of parameters while ensuring global convergence, in stark contrast to Didea's previous parameter learning framework (which could only learn a single parameter using a costly grid search) and other trainable models [12, 13, 14] (which only ensure convergence to local optima). The newly trained scoring function substantially outperforms the state-of-the-art in both scoring function accuracy and downstream Fisher kernel analysis. Furthermore, we significantly improve Didea's runtime performance through successive optimizations to its message passing schedule and derive explicit connections between Didea's new concave score and related MS/MS scoring functions.
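
Note on the log-sum-exp function mentioned in this abstract: it maps a list of values to the logarithm of the sum of their exponentials and is convex in its inputs. The sketch below is a standard numerically stable implementation; it illustrates only the function that the Convex Virtual Emissions are said to generalize and is not Didea's scoring function.

```python
import math


def log_sum_exp(values):
    """Compute log(sum(exp(v))) stably by factoring out the maximum value."""
    m = max(values)
    if m == -math.inf:  # every term is exp(-inf) = 0
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


print(log_sum_exp([0.0, 0.0]))        # log(2)
print(log_sum_exp([1000.0, 1000.0]))  # finite, whereas a naive exp(1000.0) overflows
```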

Gradients of Generative Models for Improved Discriminative Analysis of Tandem Mass Spectra.

Halloran, John T; Rocke, David M.

Adv Neural Inf Process Syst ; 30: 5724-5733, 2017 Dec.

Article in English | MEDLINE | ID: mdl-31745382

ABSTRACT

Tandem mass spectrometry (MS/MS) is a high-throughput technology used to identify the proteins in a complex biological sample, such as a drop of blood. A collection of spectra is generated at the output of the process, each spectrum of which is representative of a peptide (protein subsequence) present in the original complex sample. In this work, we leverage the log-likelihood gradients of generative models to improve the identification of such spectra. In particular, we show that the gradient of a recently proposed dynamic Bayesian network (DBN) [7] may be naturally employed by a kernel-based discriminative classifier. The resulting Fisher kernel substantially improves upon recent attempts to combine generative and discriminative models for post-processing analysis, outperforming all other methods on the evaluated datasets. We extend the improved accuracy offered by the Fisher kernel framework to other search algorithms by introducing Theseus, a DBN representing a large number of widely used MS/MS scoring functions. Furthermore, with gradient ascent and max-product inference at hand, we use Theseus to learn model parameters without any supervision.
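
Note on using log-likelihood gradients as features: in a Fisher kernel, each data point is represented by the gradient of a generative model's log-likelihood with respect to the model parameters, and a kernel is computed between those gradient vectors. The sketch below substitutes a one-dimensional Gaussian for the paper's DBN and omits the Fisher information normalization; both are simplifying assumptions, and the function names are ours.

```python
def fisher_score(x, mu, sigma):
    """Gradient of log N(x | mu, sigma^2) with respect to (mu, sigma)."""
    d_mu = (x - mu) / sigma ** 2
    d_sigma = -1.0 / sigma + (x - mu) ** 2 / sigma ** 3
    return (d_mu, d_sigma)


def fisher_kernel(x1, x2, mu, sigma):
    """Inner product of Fisher scores (identity used in place of the Fisher information)."""
    g1 = fisher_score(x1, mu, sigma)
    g2 = fisher_score(x2, mu, sigma)
    return sum(a * b for a, b in zip(g1, g2))


# Gradient features for two observations under a standard normal model.
print(fisher_score(0.0, 0.0, 1.0))
print(fisher_score(2.0, 0.0, 1.0))
print(fisher_kernel(0.0, 2.0, 0.0, 1.0))
```

The gradient vectors can be passed to any kernel-based classifier, which is the general route by which a generative model feeds a discriminative one.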

Comprehensive statistical inference of the clonal structure of cancer from multiple biopsies.

Liu, Jie; Halloran, John T; Bilmes, Jeffrey A; Daza, Riza M; Lee, Choli; Mahen, Elisabeth M; Prunkard, Donna; Song, Chaozhong; Blau, Sibel; Dorschner, Michael O; Gadi, Vijayakrishna K; Shendure, Jay; Blau, C Anthony; Noble, William S.

Sci Rep ; 7(1): 16943, 2017 12 05.

Article in English | MEDLINE | ID: mdl-29208983

ABSTRACT

A comprehensive characterization of tumor genetic heterogeneity is critical for understanding how cancers evolve and escape treatment. Although many algorithms have been developed for capturing tumor heterogeneity, they are designed for analyzing either a single type of genomic aberration or individual biopsies. Here we present THEMIS (Tumor Heterogeneity Extensible Modeling via an Integrative System), which allows for the joint analysis of different types of genomic aberrations from multiple biopsies taken from the same patient, using a dynamic graphical model. Simulation experiments demonstrate higher accuracy of THEMIS over its ancestor, TITAN. The heterogeneity analysis results from THEMIS are validated with single cell DNA sequencing from a clinical tumor biopsy. When THEMIS is used to analyze tumor heterogeneity among multiple biopsies from the same patient, it helps to reveal the mutation accumulation history, track cancer progression, and identify the mutations related to treatment resistance. We implement our model via an extensible modeling platform, which makes our approach open, reproducible, and easy for others to extend.

Subject(s)

Biopsy/methods , Biological Models , Neoplasms/pathology , Triple Negative Breast Neoplasms/drug therapy , Triple Negative Breast Neoplasms/genetics , Algorithms , Bayes Theorem , Clonal Evolution , Computational Biology/methods , DNA Copy Number Variations , Female , Humans , Mutation , Neoplasms/genetics , Reproducibility of Results , DNA Sequence Analysis , Single-Cell Analysis , Transcriptome , Triple Negative Breast Neoplasms/pathology

Learning Peptide-Spectrum Alignment Models for Tandem Mass Spectrometry.

Halloran, John T; Bilmes, Jeff A; Noble, William S.

Uncertain Artif Intell ; 30: 320-329, 2014.

Article in English | MEDLINE | ID: mdl-25298752

ABSTRACT

We present a peptide-spectrum alignment strategy that employs a dynamic Bayesian network (DBN) for the identification of spectra produced by tandem mass spectrometry (MS/MS). Our method is fundamentally generative in that it models peptide fragmentation in MS/MS as a physical process. The model traverses an observed MS/MS spectrum and a peptide-based theoretical spectrum to calculate the best alignment between the two spectra. Unlike all existing state-of-the-art methods for spectrum identification that we are aware of, our method can learn alignment probabilities given a dataset of high-quality peptide-spectrum pairs. The method, moreover, accounts for noise peaks and absent theoretical peaks in the observed spectrum. We demonstrate that our method outperforms, on a majority of datasets, several widely used, state-of-the-art database search tools for spectrum identification. Furthermore, the proposed approach provides an extensible framework for MS/MS analysis and provides useful information that is not produced by other methods, thanks to its generative structure.
