ABSTRACT
MOTIVATION: Tandem mass spectrometry data acquired using data independent acquisition (DIA) is challenging to interpret because the data exhibits complex structure along both the mass-to-charge (m/z) and time axes. The most common approach to analyzing this type of data makes use of a library of previously observed DIA data patterns (a 'spectral library'), but this approach is expensive because the libraries do not typically generalize well across laboratories. RESULTS: Here, we propose DIAmeter, a search engine that detects peptides in DIA data using only a peptide sequence database. Although existing library-free DIA analysis methods variously (i) support data generated using both wide and narrow isolation windows, (ii) detect peptides containing post-translational modifications, (iii) analyze data from a variety of instrument platforms and (iv) are capable of detecting peptides even in the absence of detectable signal in the survey (MS1) scan, DIAmeter is the only method that offers all four capabilities in a single tool. AVAILABILITY AND IMPLEMENTATION: The open-source, Apache-licensed source code is available as part of the Crux mass spectrometry analysis toolkit (http://crux.ms). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subjects
Peptides, Tandem Mass Spectrometry, Post-Translational Protein Processing, Software
ABSTRACT
In data independent acquisition (DIA) mass spectrometry, precursor scans are interleaved with wide-window fragmentation scans, resulting in complex fragmentation spectra containing multiple coeluting peptide species. In this setting, detecting the isotope distribution profiles of intact peptides in the precursor scans can be a critical initial step in accurate peptide detection and quantification. This peak detection step is particularly challenging when the isotope peaks associated with two different peptide species overlap, or interfere, with one another. We propose a regression model, called Siren, that detects isotopic peaks in precursor DIA data and can explicitly account for interference. We validate Siren's peak-calling performance on a variety of data sets by counting how many of the peaks Siren identifies are associated with confidently detected peptides. In particular, we demonstrate that substituting the Siren regression model in place of the existing peak-calling step in DIA-Umpire leads to improved overall rates of peptide detection.
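To make the idea of interference-aware regression concrete, here is a minimal sketch (not Siren's actual model; the m/z grid, Gaussian peak shapes, and relative isotope abundances below are assumptions) in which each candidate precursor contributes a non-negative coefficient on an isotope-envelope template, and all candidates are fit jointly against the observed precursor signal:

```python
import numpy as np
from scipy.optimize import nnls

# Hypothetical shared m/z grid for one precursor isolation region.
mz_grid = np.arange(400.0, 410.0, 0.01)

def envelope(mono_mz, charge, rel_abund, width=0.02):
    """Gaussian-smoothed isotope-envelope template on the shared m/z grid."""
    template = np.zeros_like(mz_grid)
    for k, abundance in enumerate(rel_abund):
        peak_mz = mono_mz + k * 1.003355 / charge  # 13C-12C spacing per charge
        template += abundance * np.exp(-0.5 * ((mz_grid - peak_mz) / width) ** 2)
    return template

# Two overlapping (interfering) candidate precursors with assumed abundances.
A = np.column_stack([
    envelope(402.70, 2, [1.0, 0.8, 0.4]),
    envelope(403.20, 2, [1.0, 0.7, 0.3]),
])

# Simulate an observed MS1 trace as a noisy mixture, then fit all candidates jointly.
rng = np.random.default_rng(0)
observed = A @ np.array([5.0, 2.0]) + 0.05 * rng.random(mz_grid.size)
coefficients, _ = nnls(A, observed)   # non-negative joint regression
print(coefficients)                   # per-candidate intensities despite the overlap
```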
Subjects
Mass Spectrometry/methods, Peptides/analysis, Proteomics/methods, Algorithms, Data Analysis, Isotopes/analysis, Regression Analysis
ABSTRACT
UNLABELLED: Tandem mass spectrometry (MS/MS) is the dominant high-throughput technology for identifying and quantifying proteins in complex biological samples. Analysis of the tens of thousands of fragmentation spectra produced by an MS/MS experiment begins by assigning to each observed spectrum the peptide that is hypothesized to be responsible for generating the spectrum. This assignment is typically done by searching each spectrum against a database of peptides. To our knowledge, all existing MS/MS search engines compute scores individually between a given observed spectrum and each possible candidate peptide from the database. In this work, we use a trellis, a data structure capable of jointly representing a large set of candidate peptides, to avoid redundantly recomputing common sub-computations among different candidates. We show how trellises may be used to significantly speed up existing scoring algorithms, and we theoretically quantify the expected speedup afforded by trellises. Furthermore, we demonstrate that compact trellis representations of whole sets of peptides enable efficient discriminative learning of a dynamic Bayesian network for spectrum identification, leading to greatly improved spectrum identification accuracy. CONTACT: bilmes@uw.edu or william-noble@uw.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
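To convey why a joint representation of many candidates saves work, here is a toy sketch (assumed amino-acid masses and peptides; the paper's trellis, and its use inside DBN scoring, are considerably more general): a trie computes each shared prefix mass once, whereas independent per-peptide scoring recomputes it for every candidate.

```python
# Toy illustration: share prefix-mass computations across candidate peptides
# using a trie, instead of recomputing them independently for each peptide.
AA_MASS = {"A": 71.03711, "G": 57.02146, "L": 113.08406,
           "P": 97.05276, "S": 87.03203, "K": 128.09496}

def prefix_masses_trie(peptides):
    """Return {peptide: [prefix masses]}, computing each shared prefix only once."""
    root = {}
    results = {}
    for pep in peptides:
        node, mass, prefixes = root, 0.0, []
        for aa in pep:
            if aa not in node:
                node[aa] = {"mass": mass + AA_MASS[aa], "children": {}}
            mass = node[aa]["mass"]   # reused by every peptide sharing this prefix
            prefixes.append(mass)
            node = node[aa]["children"]
        results[pep] = prefixes
    return results

print(prefix_masses_trie(["GASP", "GALK", "GAL"]))
# The prefix "GA" is computed once but contributes to all three candidates.
```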
Subjects
Tandem Mass Spectrometry, Algorithms, Bayes Theorem, Protein Databases, Peptides, Proteins, Proteomics
ABSTRACT
A central problem in mass spectrometry analysis involves identifying, for each observed tandem mass spectrum, the corresponding generating peptide. We present a dynamic Bayesian network (DBN) toolkit that addresses this problem by using a machine learning approach. At the heart of this toolkit is a DBN for Rapid Identification (DRIP), which can be trained from collections of high-confidence peptide-spectrum matches (PSMs). DRIP's score function considers fragment ion matches using Gaussians rather than fixed fragment-ion tolerances and also finds the optimal alignment between the theoretical and observed spectrum by considering all possible alignments, up to a threshold that is controlled using a beam-pruning algorithm. This function not only yields state-of-the-art database search accuracy but also can be used to generate features that significantly boost the performance of the Percolator postprocessor. The DRIP software is built upon a general-purpose DBN toolkit (GMTK), thereby allowing a wide variety of options for user-specific inference tasks as well as facilitating easy modifications to the DRIP model in future work. DRIP is implemented in Python and C++ and is available under the Apache license at http://melodi-lab.github.io/dripToolkit.
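The contrast between a hard tolerance window and Gaussian-weighted matching can be illustrated in a few lines (a simplified sketch with assumed m/z values and standard deviation; DRIP's DBN additionally models spurious observed peaks, absent theoretical peaks, and the alignment itself):

```python
import math

def gaussian_match_score(theoretical_mz, observed_mz, sigma=0.05):
    """Score a peptide by summing log-Gaussian agreement between each theoretical
    fragment m/z and its closest observed peak, instead of applying a hard
    fragment-ion tolerance window."""
    score = 0.0
    for t in theoretical_mz:
        nearest = min(observed_mz, key=lambda o: abs(o - t))
        score += -0.5 * ((nearest - t) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))
    return score

# Hypothetical b/y-ion m/z values and observed peaks (assumed numbers for illustration).
theoretical = [175.119, 262.151, 363.198]
observed = [175.121, 262.149, 300.000, 363.210]
print(gaussian_match_score(theoretical, observed))
```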
Subjects
Machine Learning, Peptides/analysis, Proteomics/methods, Bayes Theorem, Protein Databases, Software, Tandem Mass Spectrometry
ABSTRACT
We trained Segway, a dynamic Bayesian network method, simultaneously on chromatin data from multiple experiments, including positions of histone modifications, transcription-factor binding and open chromatin, all derived from a human chronic myeloid leukemia cell line. In an unsupervised fashion, we identified patterns associated with transcription start sites, gene ends, enhancers, transcriptional regulator CTCF-binding regions and repressed regions. Software and genome browser tracks are at http://noble.gs.washington.edu/proj/segway/.
Subjects
Chromatin/physiology, Human Genome, Histones/physiology, Transcription Initiation Site, Bayes Theorem, Chromatin/genetics, Histones/genetics, Humans, K562 Cells, Molecular Sequence Data, Genetic Promoter Regions, Nucleic Acid Regulatory Sequences, Transcription Factors/genetics, Transcription Factors/physiology
ABSTRACT
The identification of cell-type-specific 3D chromatin interactions between regulatory elements can help to decipher gene regulation and to interpret the function of disease-associated non-coding variants. However, current chromosome conformation capture (3C) technologies are unable to resolve interactions at this resolution when only small numbers of cells are available as input. We therefore present ChromaFold, a deep learning model that predicts 3D contact maps and regulatory interactions from single-cell ATAC sequencing (scATAC-seq) data alone. ChromaFold uses pseudobulk chromatin accessibility, co-accessibility profiles across metacells, and predicted CTCF motif tracks as input features and employs a lightweight architecture to enable training on standard GPUs. Once trained on paired scATAC-seq and Hi-C data in human cell lines and tissues, ChromaFold can accurately predict both the 3D contact map and peak-level interactions across diverse human and mouse test cell types. In benchmarking against a recent deep learning method that uses bulk ATAC-seq, DNA sequence, and CTCF ChIP-seq to make cell-type-specific predictions, ChromaFold yields superior prediction performance when including CTCF ChIP-seq data as an input and comparable performance without. Finally, fine-tuning ChromaFold on paired scATAC-seq and Hi-C in a complex tissue enables deconvolution of chromatin interactions across cell subpopulations. ChromaFold thus achieves state-of-the-art prediction of 3D contact maps and regulatory interactions using scATAC-seq alone as input data, enabling accurate inference of cell-type-specific interactions in settings where 3C-based assays are infeasible.
ABSTRACT
BACKGROUND: Protein secondary structure prediction provides insight into protein function and is a valuable preliminary step for predicting the 3D structure of a protein. Dynamic Bayesian networks (DBNs) and support vector machines (SVMs) have been shown to provide state-of-the-art performance in secondary structure prediction. As the size of the protein database grows, it becomes feasible to use a richer model in an effort to capture subtle correlations among the amino acids and the predicted labels. In this context, it is beneficial to derive sparse models that discourage over-fitting and provide biological insight. RESULTS: In this paper, we first show that we are able to obtain accurate secondary structure predictions. Our per-residue accuracy on a well established and difficult benchmark (CB513) is 80.3%, which is comparable to the state-of-the-art evaluated on this dataset. We then introduce an algorithm for sparsifying the parameters of a DBN. Using this algorithm, we can automatically remove up to 70-95% of the parameters of a DBN while maintaining the same level of predictive accuracy on the SD576 set. At 90% sparsity, we are able to compute predictions three times faster than a fully dense model evaluated on the SD576 set. We also demonstrate, using simulated data, that the algorithm is able to recover true sparse structures with high accuracy, and using real data, that the sparse model identifies known correlation structure (local and non-local) related to different classes of secondary structure elements. CONCLUSIONS: We present a secondary structure prediction method that employs dynamic Bayesian networks and support vector machines. We also introduce an algorithm for sparsifying the parameters of the dynamic Bayesian network. The sparsification approach yields a significant speed-up in generating predictions, and we demonstrate that the amino acid correlations identified by the algorithm correspond to several known features of protein secondary structure. Datasets and source code used in this study are available at http://noble.gs.washington.edu/proj/pssp.
Subjects
Algorithms, Statistical Models, Protein Secondary Structure, Proteins/chemistry, Amino Acids/chemistry, Bayes Theorem, Protein Databases
ABSTRACT
MOTIVATION: A global map of transcription factor binding sites (TFBSs) is critical to understanding gene regulation and genome function. DNaseI digestion of chromatin coupled with massively parallel sequencing (digital genomic footprinting) enables the identification of protein-binding footprints with high resolution on a genome-wide scale. However, accurately inferring the locations of these footprints remains a challenging computational problem. RESULTS: We present a dynamic Bayesian network-based approach for the identification and assignment of statistical confidence estimates to protein-binding footprints from digital genomic footprinting data. The method, DBFP, allows footprints to be identified in a probabilistic framework and outperforms our previously described algorithm in terms of precision at a fixed recall. Applied to a digital footprinting data set from Saccharomyces cerevisiae, DBFP identifies 4679 statistically significant footprints within intergenic regions. These footprints are mainly located near transcription start sites and are strongly enriched for known TFBSs. Footprints containing no known motif are preferentially located proximal to other footprints, consistent with cooperative binding of these footprints. DBFP also identifies a set of statistically significant footprints in the yeast coding regions. Many of these footprints coincide with the boundaries of antisense transcripts, and the most significant footprints are enriched for binding sites of the chromatin-associated factors Abf1 and Rap1. SUPPLEMENTARY INFORMATION: Supplementary material is available at Bioinformatics online.
Subjects
Protein Footprinting/methods, Algorithms, Bayes Theorem, Binding Sites, Genome, Molecular Sequence Data, Protein Interaction Mapping/methods, Saccharomyces cerevisiae/genetics, Transcription Factors/chemistry, Transcription Initiation Site
ABSTRACT
DNA in eukaryotes is packaged into a chromatin complex, the most basic element of which is the nucleosome. The precise positioning of the nucleosome cores allows for selective access to the DNA, and the mechanisms that control this positioning are important pieces of the gene expression puzzle. We describe a large-scale nucleosome pattern that jointly characterizes the nucleosome core and the adjacent linkers and is predominantly characterized by long-range oscillations in the mono-, di- and tri-nucleotide content of the DNA sequence, and we show that this pattern can be used to predict nucleosome positions in both Homo sapiens and Saccharomyces cerevisiae more accurately than previously published methods. Surprisingly, in both H. sapiens and S. cerevisiae, the most informative individual features are the mono-nucleotide patterns, although the inclusion of di- and tri-nucleotide features results in improved performance. Our approach combines a much longer pattern than has previously been used to predict nucleosome positioning from sequence (301 base pairs, centered at the position to be scored) with a novel discriminative classification approach that selectively weights the contributions from each of the input features. The resulting scores are relatively insensitive to local AT-content and can be used to accurately discriminate putative dyad positions from adjacent linker regions without requiring an additional dynamic programming step and without the attendant edge effects and assumptions about linker length modeling and overall nucleosome density. Our approach produces the best dyad-linker classification results published to date in H. sapiens, and outperforms two recently published models on a large set of S. cerevisiae nucleosome positions. Our results suggest that in both genomes, a comparable and relatively small fraction of nucleosomes are well-positioned and that these positions are predictable based on sequence alone. We believe that the bulk of the remaining nucleosomes follow a statistical positioning model.
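As a rough sketch of this kind of sequence-based discriminative classifier (not the paper's actual model; the 301-bp window comes from the abstract, but the k-mer featurization, random training data, and logistic-regression learner below are assumptions), one could featurize each window by its mono-, di- and tri-nucleotide counts and train a linear classifier to separate dyads from linkers:

```python
from itertools import product
import numpy as np
from sklearn.linear_model import LogisticRegression

def kmer_features(seq, ks=(1, 2, 3)):
    """Mono-, di- and tri-nucleotide count features for one 301-bp window."""
    feats = []
    for k in ks:
        counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in counts:
                counts[kmer] += 1
        feats.extend(counts[key] for key in sorted(counts))
    return np.array(feats, dtype=float)

# Hypothetical training windows labeled dyad (1) or linker (0); random here purely
# to make the sketch runnable, whereas real labels would come from mapped dyads.
rng = np.random.default_rng(0)
windows = ["".join(rng.choice(list("ACGT"), size=301)) for _ in range(40)]
labels = rng.integers(0, 2, size=40)

X = np.vstack([kmer_features(w) for w in windows])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
scores = clf.decision_function(X)   # dyad-vs-linker score for each window
```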
Subjects
DNA/chemistry, Nucleic Acid Conformation, Nucleosomes/genetics, DNA Sequence Analysis, Alu Elements/genetics, Base Composition/genetics, Base Sequence/genetics, CCCTC-Binding Factor, Fungal DNA/chemistry, Humans, ROC Curve, Repressor Proteins/genetics, Reproducibility of Results, Saccharomyces cerevisiae/genetics, Sequence Alignment
ABSTRACT
MOTIVATION: Tandem mass spectrometry (MS/MS) is an indispensable technology for identification of proteins from complex mixtures. Proteins are digested to peptides that are then identified by their fragmentation patterns in the mass spectrometer. Thus, at its core, MS/MS protein identification relies on the relative predictability of peptide fragmentation. Unfortunately, peptide fragmentation is complex and not fully understood, and what is understood is not always exploited by peptide identification algorithms. RESULTS: We use a hybrid dynamic Bayesian network (DBN)/support vector machine (SVM) approach to address these two problems. We train a set of DBNs on high-confidence peptide-spectrum matches. These DBNs, known collectively as Riptide, comprise a probabilistic model of peptide fragmentation chemistry. Examination of the distributions learned by Riptide allows identification of new trends, such as prevalent a-ion fragmentation at peptide cleavage sites C-term to hydrophobic residues. In addition, Riptide can be used to produce likelihood scores that indicate whether a given peptide-spectrum match is correct. A vector of such scores is evaluated by an SVM, which produces a final score to be used in peptide identification. Using Riptide in this way yields improved discrimination when compared to other state-of-the-art MS/MS identification algorithms, increasing the number of positive identifications by as much as 12% at a 1% false discovery rate. AVAILABILITY: Python and C source code are available upon request from the authors. The curated training sets are available at http://noble.gs.washington.edu/proj/intense/. The Graphical Model Tool Kit (GMTK) is freely available at http://ssli.ee.washington.edu/bilmes/gmtk.
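The final "vector of likelihood scores evaluated by an SVM" step can be sketched as follows (the feature matrix, labels, and linear kernel are assumptions made purely for illustration; Riptide's actual DBN scores and SVM configuration differ):

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical per-PSM feature vectors: each column stands in for a likelihood
# score produced by one of the fragmentation DBNs (assumed values for illustration).
rng = np.random.default_rng(0)
psm_scores = rng.normal(size=(200, 5))       # 200 PSMs x 5 DBN-derived scores
is_correct = rng.integers(0, 2, size=200)    # known labels for the training PSMs

svm = SVC(kernel="linear").fit(psm_scores, is_correct)
final_scores = svm.decision_function(psm_scores)   # one score per PSM for ranking
```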
Subjects
Algorithms, Artificial Intelligence, Mass Spectrometry/methods, Automated Pattern Recognition/methods, Peptide Mapping/methods, Protein Sequence Analysis/methods, Amino Acid Sequence, Bayes Theorem, Molecular Sequence Data
ABSTRACT
Hidden Markov models (HMMs) have been successfully applied to the tasks of transmembrane protein topology prediction and signal peptide prediction. In this paper we expand upon this work by making use of the more powerful class of dynamic Bayesian networks (DBNs). Our model, Philius, is inspired by a previously published HMM, Phobius, and combines a signal peptide submodel with a transmembrane submodel. We introduce a two-stage DBN decoder that combines the power of posterior decoding with the grammar constraints of Viterbi-style decoding. Philius also provides protein type, segment, and topology confidence metrics to aid in the interpretation of the predictions. We report a relative improvement of 13% over Phobius in full-topology prediction accuracy on transmembrane proteins, and a sensitivity and specificity of 0.96 in detecting signal peptides. We also show that our confidence metrics correlate well with the observed precision. In addition, we have made predictions on all 6.3 million proteins in the Yeast Resource Center (YRC) database. This large-scale study provides an overall picture of the relative numbers of proteins that include a signal-peptide and/or one or more transmembrane segments as well as a valuable resource for the scientific community. All DBNs are implemented using the Graphical Models Toolkit. Source code for the models described here is available at http://noble.gs.washington.edu/proj/philius. A Philius Web server is available at http://www.yeastrc.org/philius, and the predictions on the YRC database are available at http://www.yeastrc.org/pdr.
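The two-stage decoding idea can be sketched generically (a rough illustration over assumed HMM-style inputs; Philius itself is a DBN implemented with GMTK, and its state space and grammar are richer than this toy): first compute per-position posterior state probabilities by forward-backward, then find the best path through those posteriors while only allowing transitions the grammar permits.

```python
import numpy as np
from scipy.special import logsumexp

def posterior_viterbi(log_init, log_trans, log_emit, allowed):
    """Two-stage decoding sketch: (1) forward-backward posteriors,
    (2) Viterbi-style pass over log-posteriors restricted to transitions
    permitted by a boolean S x S grammar mask `allowed`."""
    T, S = log_emit.shape
    # Forward pass.
    la = np.full((T, S), -np.inf)
    la[0] = log_init + log_emit[0]
    for t in range(1, T):
        la[t] = log_emit[t] + logsumexp(la[t - 1][:, None] + log_trans, axis=0)
    # Backward pass.
    lb = np.zeros((T, S))
    for t in range(T - 2, -1, -1):
        lb[t] = logsumexp(log_trans + log_emit[t + 1] + lb[t + 1], axis=1)
    log_post = la + lb - logsumexp(la[-1])
    # Grammar-constrained best path through the posteriors.
    grammar = np.where(allowed, 0.0, -np.inf)
    v = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    v[0] = log_post[0]
    for t in range(1, T):
        cand = v[t - 1][:, None] + grammar
        back[t] = np.argmax(cand, axis=0)
        v[t] = log_post[t] + np.max(cand, axis=0)
    path = [int(np.argmax(v[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```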
Subjects
Bayes Theorem, Computational Biology/methods, Membrane Proteins/ultrastructure, Molecular Models, Protein Sorting Signals/physiology, Artificial Intelligence, Fungal Proteins/ultrastructure, Markov Chains, Computer Neural Networks, Protein Conformation, Reproducibility of Results, Yeasts/ultrastructure
ABSTRACT
We study the problem of maximizing deep submodular functions (DSFs) [13, 3] subject to a matroid constraint. DSFs are an expressive class of submodular functions that include, as strict subfamilies, facility location functions, weighted coverage functions, and sums of concave functions composed with modular functions. We use a strategy similar to the continuous greedy approach [6], but we show that the multilinear extension of any DSF has a natural and computationally attainable concave relaxation that we can optimize using gradient ascent. Our results show a guarantee of $\max_{0 < \delta < 1}\bigl(1 - \epsilon - \delta - e^{-\delta^2 \Omega(k)}\bigr)$ with a running time of $O(n^2/\epsilon^2)$, plus time for pipage rounding [6] to recover a discrete solution, where $k$ is the rank of the matroid constraint. This bound is often better than the standard $1 - 1/e$ guarantee of the continuous greedy algorithm, and the algorithm runs much faster. Our bound also holds for fully curved ($c = 1$) functions, for which the curvature-based guarantee of $1 - c/e$ degenerates to $1 - 1/e$, where $c$ is the curvature of $f$ [37]. We perform computational experiments that support our theoretical results.
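For reference, the multilinear extension mentioned above is the standard one used in continuous greedy analyses (a textbook definition, not notation taken from this paper): the expected value of $f$ when each element $i$ is included independently with probability $x_i$.

```latex
% Multilinear extension of a set function f : 2^V -> R (standard definition).
F(\mathbf{x}) \;=\; \mathbb{E}_{S \sim \mathbf{x}}\bigl[f(S)\bigr]
\;=\; \sum_{S \subseteq V} f(S) \prod_{i \in S} x_i \prod_{j \notin S} (1 - x_j),
\qquad \mathbf{x} \in [0,1]^{V}.
```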
ABSTRACT
The Encyclopedia of DNA Elements (ENCODE) and the Roadmap Epigenomics Project seek to characterize the epigenome in diverse cell types using assays that identify, for example, genomic regions with modified histones or accessible chromatin. These efforts have produced thousands of datasets but cannot possibly measure each epigenomic factor in all cell types. To address this, we present a method, PaRallel Epigenomics Data Imputation with Cloud-based Tensor Decomposition (PREDICTD), to computationally impute missing experiments. PREDICTD leverages an elegant model called "tensor decomposition" to impute many experiments simultaneously. Compared with the current state-of-the-art method, ChromImpute, PREDICTD produces lower overall mean squared error, and combining the two methods yields further improvement. We show that PREDICTD data captures enhancer activity at noncoding human accelerated regions. PREDICTD provides reference imputed data and open-source software for investigating new cell types, and demonstrates the utility of tensor decomposition and cloud computing, both promising technologies for bioinformatics.
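A minimal sketch of imputation by tensor decomposition (not PREDICTD's actual model or training procedure; the rank, learning rate, iteration count, and synthetic tensor below are assumptions): fit a rank-R CP/PARAFAC factorization to only the observed entries of a cell type x assay x genomic position tensor by masked gradient descent, then read off the reconstructed tensor at the missing entries.

```python
import numpy as np

rng = np.random.default_rng(0)
C, A, G, R = 10, 8, 200, 4                 # cell types, assays, genomic bins, CP rank

tensor = rng.random((C, A, G))             # hypothetical signal tensor
mask = rng.random((C, A, G)) > 0.3         # True where the experiment was performed

# One factor matrix per tensor mode (CP/PARAFAC parameterization).
U = rng.normal(scale=0.1, size=(C, R))
V = rng.normal(scale=0.1, size=(A, R))
W = rng.normal(scale=0.1, size=(G, R))

lr, n_obs = 5.0, mask.sum()
for step in range(2000):
    pred = np.einsum("cr,ar,gr->cag", U, V, W)
    err = np.where(mask, pred - tensor, 0.0)        # squared loss on observed entries only
    U -= lr * np.einsum("cag,ar,gr->cr", err, V, W) / n_obs
    V -= lr * np.einsum("cag,cr,gr->ar", err, U, W) / n_obs
    W -= lr * np.einsum("cag,cr,ar->gr", err, U, V) / n_obs

imputed = np.einsum("cr,ar,gr->cag", U, V, W)       # predictions for unmeasured entries
```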
Subjects
Cloud Computing/statistics & numerical data, Genetic Epigenesis, Human Genome, Histones/genetics, Software, Chromatin/chemistry, Chromatin/metabolism, Datasets as Topic, Epigenomics/statistics & numerical data, Histones/metabolism, Humans
ABSTRACT
We present a peptide-spectrum alignment strategy that employs a dynamic Bayesian network (DBN) for the identification of spectra produced by tandem mass spectrometry (MS/MS). Our method is fundamentally generative in that it models peptide fragmentation in MS/MS as a physical process. The model traverses an observed MS/MS spectrum and a peptide-based theoretical spectrum to calculate the best alignment between the two spectra. Unlike all existing state-of-the-art methods for spectrum identification that we are aware of, our method can learn alignment probabilities given a dataset of high-quality peptide-spectrum pairs. The method, moreover, accounts for noise peaks and absent theoretical peaks in the observed spectrum. We demonstrate that our method outperforms, on a majority of datasets, several widely used, state-of-the-art database search tools for spectrum identification. Furthermore, the proposed approach provides an extensible framework for MS/MS analysis and provides useful information that is not produced by other methods, thanks to its generative structure.
ABSTRACT
Shotgun proteomics is a high-throughput technology used to identify unknown proteins in a complex mixture. At the heart of this process is a prediction task, the spectrum identification problem, in which each fragmentation spectrum produced by a shotgun proteomics experiment must be mapped to the peptide (protein subsequence) which generated the spectrum. We propose a new algorithm for spectrum identification, based on dynamic Bayesian networks, which significantly outperforms the de facto standard tools for this task: SEQUEST and Mascot.
ABSTRACT
PURPOSE: Mouse control has become a crucial aspect of many modern-day computer interactions. This poses a challenge for individuals with motor impairments or those whose use of hands is restricted due to situational constraints. We present a system called the Vocal Joystick, which allows the user to continuously control the mouse cursor by varying vocal parameters such as vowel quality, loudness and pitch. METHOD: Evaluations were conducted to characterize expert performance capability of the Vocal Joystick, and to compare novice user performance and preference for the Vocal Joystick and two other existing speech-based cursor control methods. RESULTS: Our results show that Fitts' law, a widely adopted model of human motor performance for movement tasks, is a good predictor of the speed-accuracy tradeoff for the Vocal Joystick, and suggest that the optimal performance of the Vocal Joystick may be comparable to that of a conventional hand-operated joystick. Novice user evaluations show that the Vocal Joystick can be used by people without extensive training, and that it presents a viable alternative to existing speech-based cursor control methods. CONCLUSIONS: The Vocal Joystick, with its ease of use, minimal setup requirement, and controllability, offers promise for providing an efficient method for cursor control and other forms of continuous input for individuals with motor impairments.
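For reference, Fitts' law relates movement time to target distance and width; the Shannon formulation shown here is the form most commonly fit in pointing studies (the paper may use a variant):

```latex
% Fitts' law (Shannon formulation): MT is the movement time to acquire a target
% of width W at distance D; a and b are empirically fitted constants.
MT = a + b \,\log_2\!\left(\frac{D}{W} + 1\right),
\qquad
ID = \log_2\!\left(\frac{D}{W} + 1\right) \ \text{(index of difficulty, in bits)},
\qquad
TP = \frac{ID}{MT} \ \text{(throughput, bits/s)}.
```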