ABSTRACT
Cardinal v.3 is open-source software for reproducible analysis of mass spectrometry imaging experiments. A major update from its previous versions, Cardinal v.3 supports most mass spectrometry imaging workflows. Its analytical capabilities include advanced data processing such as mass recalibration, advanced statistical analyses such as single-ion segmentation and rough annotation-based classification, and memory-efficient analyses of large-scale multitissue experiments.
Subject(s)
Image Processing, Computer-Assisted , Software , Mass Spectrometry/methods
ABSTRACT
Protein complexes are responsible for the enactment of most cellular functions. For the protein complex to form and function, its subunits often need to be present at defined quantitative ratios. Typically, global changes in protein complex composition are assessed with experimental approaches that tend to be time consuming. Here, we have developed a computational algorithm for the detection of altered protein complexes based on the systematic assessment of subunit ratios from quantitative proteomic measurements. We applied it to measurements from breast cancer cell lines and patient biopsies and were able to identify strong remodeling of HDAC2 epigenetic complexes in more aggressive forms of cancer. The presented algorithm is available as an R package and enables the inference of changes in protein complex states by extracting functionally relevant information from bottom-up proteomic datasets.
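The idea of assessing subunit ratios can be illustrated with a minimal sketch. It assumes log2 subunit abundances are already available per condition, and it simply reports how each pairwise log2 ratio changes between two conditions. The function name, subunit labels, and numbers are hypothetical; this is not the published R package's algorithm or API.

```python
from itertools import combinations


def subunit_ratio_shifts(cond_a, cond_b):
    """Change in pairwise log2 subunit ratios between two conditions.

    cond_a, cond_b: dicts mapping subunit name -> log2 abundance
    (hypothetical inputs, e.g. protein-level summaries).
    """
    shared = sorted(set(cond_a) & set(cond_b))
    shifts = {}
    for s1, s2 in combinations(shared, 2):
        ratio_a = cond_a[s1] - cond_a[s2]  # log2(s1 / s2) in condition A
        ratio_b = cond_b[s1] - cond_b[s2]  # log2(s1 / s2) in condition B
        shifts[(s1, s2)] = ratio_b - ratio_a
    return shifts


# Illustrative numbers only, not measured data.
a = {"A": 10.0, "B": 10.0, "C": 9.0}
b = {"A": 11.0, "B": 11.0, "C": 8.0}
shifts = subunit_ratio_shifts(a, b)
for pair, shift in shifts.items():
    print(pair, shift)
```

In this toy case subunits A and B change together, so their ratio is stable, while the ratios involving C shift, which is the kind of pattern that would point to an altered complex composition.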
Subject(s)
Proteome , Proteomics , Humans , Proteome/metabolism , Algorithms , MCF-7 Cells , Computational Biology
ABSTRACT
SUMMARY: We introduce Eliater, a Python package for estimating the effect of perturbation of an upstream molecule on a downstream molecule in a biomolecular network. The estimation takes as input a biomolecular network, observational biomolecular data, and a perturbation of interest, and outputs an estimated quantitative effect of the perturbation. We showcase the functionalities of Eliater in a case study of the Escherichia coli transcriptional regulatory network. AVAILABILITY AND IMPLEMENTATION: The code, the documentation, and several case studies are available open source at https://github.com/y0-causal-inference/eliater.
Subject(s)
Escherichia coli , Gene Regulatory Networks , Software , Escherichia coli/genetics , Escherichia coli/metabolism , Computational Biology/methods
ABSTRACT
SUMMARY: Joint analysis of mass spectrometry images (MS images) and microscopy images of hematoxylin and eosin (H&E) stained tissues assists pathologists in characterizing the morphological structure of the tissues, and in performing diagnosis. Unfortunately, the analysis is undermined by substantial differences between these modalities in terms of aspect ratios, spatial resolution, number of channels in each image, as well as by large global or small local elastic spatial deformations of one image with respect to the other. Therefore, accurate coregistration of the images is a critical prerequisite for their joint interpretation. We introduce MSIreg, an open-source R package for coregistration of MSI and H&E images. MSIreg is designed for high-dimensional MSI experiments where each spatial location is represented by thousands of mass features. Unlike most existing coregistration methods, MSIreg implements a landmark-free workflow, and quantitative metrics for performance evaluation. We evaluate the performance of MSIreg on six case studies, including coregistration of contiguous tissues with large deformations, as well as simultaneous coregistration of 29 tissue microarray cores. AVAILABILITY AND IMPLEMENTATION: The R package, installation instructions, and fully reproducible vignettes describing methods and case studies are available open-source under the GPL-3.0 license at https://github.com/sslakkimsetty/msireg/.
Subject(s)
Mass Spectrometry , Software , Mass Spectrometry/methods , Humans , Image Processing, Computer-Assisted/methods
ABSTRACT
Liquid chromatography coupled with bottom-up mass spectrometry (LC-MS/MS)-based proteomics is increasingly used to detect changes in posttranslational modifications (PTMs) in samples from different conditions. Analysis of data from such experiments faces numerous statistical challenges. These include the low abundance of modified proteoforms, the small number of observed peptides that span modification sites, and confounding between changes in the abundance of PTM and the overall changes in the protein abundance. Therefore, statistical approaches for detecting differential PTM abundance must integrate all the available information pertaining to a PTM site and consider all the relevant sources of confounding and variation. In this manuscript, we propose such a statistical framework, which is versatile, accurate, and leads to reproducible results. The framework requires an experimental design, which quantifies, for each sample, both peptides with PTMs and peptides from the same proteins with no modification sites. The proposed framework supports both label-free and tandem mass tag-based LC-MS/MS acquisitions. The statistical methodology separately summarizes the abundances of peptides with and without the modification sites, by fitting separate linear mixed effects models appropriate for the experimental design. Next, model-based inferences regarding the PTM and the protein-level abundances are combined to account for the confounding between these two sources. Evaluations on computer simulations, a spike-in experiment with known ground truth, and three biological experiments with different organisms, modification types, and data acquisition types demonstrate the improved fold change estimation and detection of differential PTM abundance, as compared to currently used approaches. The proposed framework is implemented in the free and open-source R/Bioconductor package MSstatsPTM.
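One simple way to picture the step that combines PTM-level and protein-level inferences is to subtract the protein log fold change from the PTM log fold change and to add the variances of the two estimates. The sketch below assumes the two estimates are independent and already expressed as log2 fold changes with standard errors. The function and its arguments are hypothetical and are not the MSstatsPTM API.

```python
import math


def adjust_ptm_for_protein(ptm_log2fc, ptm_se, prot_log2fc, prot_se):
    """Protein-adjusted PTM log2 fold change and its standard error.

    Assumes the PTM-level and protein-level estimates are independent,
    so the variance of their difference is the sum of their variances.
    """
    adjusted_log2fc = ptm_log2fc - prot_log2fc
    adjusted_se = math.sqrt(ptm_se ** 2 + prot_se ** 2)
    return adjusted_log2fc, adjusted_se


# Illustrative numbers only: the modified peptides change by 1.5 on the
# log2 scale, but the unmodified protein itself changes by 1.0.
adj_fc, adj_se = adjust_ptm_for_protein(1.5, 0.3, 1.0, 0.4)
print(adj_fc, adj_se)
```

With these numbers most of the apparent PTM change is explained by the change in overall protein abundance, and the adjusted estimate carries the uncertainty of both models.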
Subject(s)
Proteomics , Tandem Mass Spectrometry , Proteomics/methods , Chromatography, Liquid , Protein Processing, Post-Translational , Proteins , Peptides/chemistry
ABSTRACT
MOTIVATION: Mass Spectrometry Imaging (MSI) analyzes complex biological samples such as tissues. It simultaneously characterizes the ions present in the tissue in the form of mass spectra, and the spatial distribution of the ions across the tissue in the form of ion images. Unsupervised clustering of ion images facilitates the interpretation in the spectral domain, by identifying groups of ions with similar spatial distributions. Unfortunately, many current methods for clustering ion images ignore the spatial features of the images, and are therefore unable to learn these features for clustering purposes. Alternative methods extract spatial features using deep neural networks pre-trained on natural image tasks; however, this is often inadequate since ion images are substantially noisier than natural images. RESULTS: We contribute a deep clustering approach for ion images that accounts for both spatial contextual features and noise. In evaluations on a simulated dataset and on four experimental datasets of different tissue types, the proposed method grouped ions from the same source into the same cluster more frequently than existing methods. We further demonstrated that using ion image clustering as a pre-processing step facilitated the interpretation of a subsequent spatial segmentation as compared to using either all the ions or one ion at a time. As a result, the proposed approach facilitated the interpretability of MSI data in both the spectral domain and the spatial domain. AVAILABILITY AND IMPLEMENTATION: The data and code are available at https://github.com/DanGuo1223/mzClustering. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Neural Networks, Computer , Mass Spectrometry/methods , Cluster Analysis , Ions/analysis
ABSTRACT
Causal query estimation in biomolecular networks commonly selects a 'valid adjustment set', i.e. a subset of network variables that eliminates the bias of the estimator. The same query may have multiple valid adjustment sets, each with a different variance. When networks are partially observed, current methods use graph-based criteria to find an adjustment set that minimizes asymptotic variance. Unfortunately, many models that share the same graph topology, and therefore the same functional dependencies, may differ in the processes that generate the observational data. In these cases, the topology-based criteria fail to distinguish the variances of the adjustment sets. This deficiency can lead to sub-optimal adjustment sets, and to mischaracterization of the effect of the intervention. We propose an approach for deriving 'optimal adjustment sets' that takes into account the nature of the data, bias and finite-sample variance of the estimator, and cost. It empirically learns the data generating processes from historical experimental data, and characterizes the properties of the estimators by simulation. We demonstrate the utility of the proposed approach in four biomolecular case studies with different topologies and different data generation processes. The implementation and reproducible case studies are at https://github.com/srtaheri/OptimalAdjustmentSet.
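To make the notion of a valid adjustment set concrete, here is a minimal sketch on a toy network with one confounder Z that affects both an upstream molecule X and a downstream molecule Y, all binary. It computes the interventional effect with the standard backdoor adjustment formula, P(y | do(x)) = sum over z of P(y | x, z) P(z), and contrasts it with the naive conditional difference. The probabilities are invented for illustration; this is not the optimal-adjustment-set procedure proposed in the abstract, only the adjustment step it builds on.

```python
from itertools import product

# Hypothetical model: Z -> X, Z -> Y, X -> Y (all variables binary).
p_z = {0: 0.5, 1: 0.5}
p_x1_given_z = {0: 0.2, 1: 0.8}
p_y1_given_xz = {(0, 0): 0.1, (1, 0): 0.3, (0, 1): 0.5, (1, 1): 0.7}

# Build the joint distribution P(z, x, y) implied by the model.
joint = {}
for z, x, y in product((0, 1), repeat=3):
    px = p_x1_given_z[z] if x == 1 else 1 - p_x1_given_z[z]
    py = p_y1_given_xz[(x, z)] if y == 1 else 1 - p_y1_given_xz[(x, z)]
    joint[(z, x, y)] = p_z[z] * px * py


def prob(pred):
    """Probability of the event described by pred(z, x, y)."""
    return sum(p for key, p in joint.items() if pred(*key))


def naive(x0):
    """P(Y=1 | X=x0), which ignores the confounder Z."""
    return (prob(lambda z, x, y: x == x0 and y == 1)
            / prob(lambda z, x, y: x == x0))


def adjusted(x0):
    """P(Y=1 | do(X=x0)) using the adjustment set {Z}."""
    total = 0.0
    for z0 in (0, 1):
        p_y_given_xz = (prob(lambda z, x, y: z == z0 and x == x0 and y == 1)
                        / prob(lambda z, x, y: z == z0 and x == x0))
        total += p_y_given_xz * prob(lambda z, x, y: z == z0)
    return total


naive_effect = naive(1) - naive(0)
causal_effect = adjusted(1) - adjusted(0)
print(naive_effect, causal_effect)
```

Here the naive contrast overstates the effect because Z drives both X and Y, while adjusting for Z recovers the effect that was built into the toy model.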
Subject(s)
Computational Biology , Computer Simulation
ABSTRACT
Liquid chromatography coupled with bottom-up mass spectrometry (LC-MS/MS)-based proteomics is a versatile technology for identifying and quantifying proteins in complex biological mixtures. Postidentification, analysis of changes in protein abundances between conditions requires increasingly complex and specialized statistical methods. Many of these methods, in particular the family of open-source Bioconductor packages MSstats, are implemented in a coding language such as R. To make the methods in MSstats accessible to users with limited programming and statistical background, we have created MSstatsShiny, an R-Shiny graphical user interface (GUI) integrated with MSstats, MSstatsTMT, and MSstatsPTM. The GUI provides a point-and-click analysis pipeline applicable to a wide variety of proteomics experimental types, including label-free data-dependent acquisitions (DDAs) or data-independent acquisitions (DIAs), or tandem mass tag (TMT)-based TMT-DDAs, answering questions such as relative changes in the abundance of peptides, proteins, or post-translational modifications (PTMs). To support reproducible research, the application saves the user's selections and builds an R script that programmatically recreates the analysis. MSstatsShiny can be installed locally via GitHub and Bioconductor, or utilized on the cloud at www.msstatsshiny.com. We illustrate the utility of the platform using two experimental data sets (MassIVE IDs MSV000086623 and MSV000085565).
Subject(s)
Proteomics , Software , Proteomics/methods , Chromatography, Liquid/methods , Tandem Mass Spectrometry/methods , Proteins/analysis
ABSTRACT
Repeated measures experimental designs, which quantify proteins in biological subjects repeatedly over multiple experimental conditions or times, are commonly used in mass spectrometry-based proteomics. Such designs distinguish the biological variation within and between the subjects and increase the statistical power of detecting within-subject changes in protein abundance. Meanwhile, proteomics experiments increasingly incorporate tandem mass tag (TMT) labeling, a multiplexing strategy that gains both relative protein quantification accuracy and sample throughput. However, combining repeated measures and TMT multiplexing in a large-scale investigation presents statistical challenges due to unique interplays of between-mixture, within-mixture, between-subject, and within-subject variation. This manuscript proposes a family of linear mixed-effects models for differential analysis of proteomics experiments with repeated measures and TMT multiplexing. These models decompose the variation in the data into the contributions from its sources as appropriate for the specifics of each experiment, enable statistical inference of differential protein abundance, and recognize a difference in the uncertainty of between-subject versus within-subject comparisons. The proposed family of models is implemented in the R/Bioconductor package MSstatsTMT v2.2.0. Evaluations of four simulated datasets and four investigations answering diverse biological questions demonstrated the value of this approach as compared to the existing general-purpose approaches and implementations.
Subject(s)
Research Design , Tandem Mass Spectrometry , Humans , Proteome/analysis
ABSTRACT
The MSstats R-Bioconductor family of packages is widely used for statistical analyses of quantitative bottom-up mass spectrometry-based proteomic experiments to detect differentially abundant proteins. It is applicable to a variety of experimental designs and data acquisition strategies and is compatible with many data processing tools used to identify and quantify spectral features. In the face of ever-increasing complexities of experiments and data processing strategies, the core package of the family, with the same name MSstats, has undergone a series of substantial updates. Its new version MSstats v4.0 improves the usability, versatility, and accuracy of statistical methodology, and the usage of computational resources. New converters integrate the output of upstream processing tools directly with MSstats, requiring less manual work by the user. The package's statistical models have been updated to a more robust workflow. Finally, MSstats' code has been substantially refactored to improve memory use and computation speed. Here we detail these updates, highlighting methodological differences between the new and old versions. An empirical comparison of MSstats v4.0 to its previous implementations, as well as to the packages MSqRob and DEqMS, on controlled mixtures and biological experiments demonstrated a stronger performance and better usability of MSstats v4.0 as compared to existing methods.
Subject(s)
Proteomics , Research Design , Proteomics/methods , Software , Mass Spectrometry/methods , Chromatography, Liquid/methods
ABSTRACT
MassIVE.quant is a repository infrastructure and data resource for reproducible quantitative mass spectrometry-based proteomics, which is compatible with all mass spectrometry data acquisition types and computational analysis tools. A branch structure enables MassIVE.quant to systematically store raw experimental data, metadata of the experimental design, scripts of the quantitative analysis workflow, intermediate input and output files, as well as alternative reanalyses of the same dataset.
Subject(s)
Databases, Protein , Mass Spectrometry , Proteomics , Algorithms , Fungal Proteins/chemistry , Reproducibility of Results , Saccharomyces cerevisiae/metabolism , Software
ABSTRACT
MOTIVATION: Estimating causal queries, such as changes in protein abundance in response to a perturbation, is a fundamental task in the analysis of biomolecular pathways. The estimation requires experimental measurements on the pathway components. However, in practice many pathway components are left unobserved (latent) because they are either unknown, or difficult to measure. Latent variable models (LVMs) are well-suited for such estimation. Unfortunately, LVM-based estimation of causal queries can be inaccurate when parameters of the latent variables are not uniquely identified, or when the number of latent variables is misspecified. This has limited the use of LVMs for causal inference in biomolecular pathways. RESULTS: In this article, we propose a general and practical approach for LVM-based estimation of causal queries. We prove that, despite the challenges above, LVM-based estimators of causal queries are accurate if the queries are identifiable according to Pearl's do-calculus and describe an algorithm for its estimation. We illustrate the breadth and the practical utility of this approach for estimating causal queries in four synthetic and two experimental case studies, where structures of biomolecular pathways challenge the existing methods for causal query estimation. AVAILABILITY AND IMPLEMENTATION: The code and the data documenting all the case studies are available at https://github.com/srtaheri/LVMwithDoCalculus. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Algorithms , Calculi , Humans , Models, Theoretical , Proteins
ABSTRACT
Skyline Batch is a newly developed Windows forms application that enables the easy and consistent reprocessing of data with Skyline. Skyline has made previous advances in this direction; however, none enable seamless automated reprocessing of local and remote files. Skyline keeps a log of all of the steps that were taken in the document; however, reproducing these steps takes time and allows room for human error. Skyline also has a command-line interface, enabling it to be run from a batch script, but using the program in this way requires expertise in editing these scripts. By formalizing the workflow of a highly used set of batch scripts into an intuitive and powerful user interface, Skyline Batch can reprocess data stored in remote repositories just by opening and running a Skyline Batch configuration file. When run, a Skyline Batch configuration downloads all necessary remote files and then runs a four-step Skyline workflow. By condensing the steps needed to reprocess the data into one file, Skyline Batch gives researchers the opportunity to publish their processing along with their data and other analysis files. These easily run configuration files will greatly increase the transparency and reproducibility of published work. Skyline Batch is freely available at https://skyline.ms/batch.url.
Subject(s)
Software , User-Computer Interface , Humans , Reproducibility of Results , Workflow
Subject(s)
Proteomics , Proteomics/methods , Humans , Artificial Intelligence , Databases, Protein
ABSTRACT
BACKGROUND: Mass spectrometry imaging (MSI) derives spatial molecular distribution maps directly from clinical tissue specimens and thus bears great potential for assisting pathologists with diagnostic decisions or personalized treatments. Unfortunately, progress in translational MSI is often hindered by insufficient quality control and lack of reproducible data analysis. Raw data and analysis scripts are rarely publicly shared. Here, we demonstrate the application of the Galaxy MSI tool set for the reproducible analysis of a urothelial carcinoma dataset. METHODS: Tryptic peptides were imaged in a cohort of 39 formalin-fixed, paraffin-embedded human urothelial cancer tissue cores with a MALDI-TOF/TOF device. The complete data analysis was performed in a fully transparent and reproducible manner on the European Galaxy Server. Annotations of tumor and stroma were performed by a pathologist and transferred to the MSI data to allow for supervised classifications of tumor vs. stroma tissue areas as well as for muscle-infiltrating and non-muscle infiltrating urothelial carcinomas. For putative peptide identifications, m/z features were matched to the MSiMass list. RESULTS: Rigorous quality control in combination with careful pre-processing enabled reduction of m/z shifts and intensity batch effects. High classification accuracy was found for both tumor vs. stroma and muscle-infiltrating vs. non-muscle infiltrating urothelial tumors. Some of the most discriminative m/z features for each condition could be assigned a putative identity: stromal tissue was characterized by collagen peptides and tumor tissue by histone peptides. Immunohistochemistry confirmed an increased histone H2A abundance in the tumor compared to the stroma tissues. The muscle-infiltration status was distinguished via MSI by peptides from intermediate filaments such as cytokeratin 7 in non-muscle infiltrating carcinomas and vimentin in muscle-infiltrating urothelial carcinomas, which was confirmed by immunohistochemistry. To make the study fully reproducible and to advocate the criteria of FAIR (findability, accessibility, interoperability, and reusability) research data, we share the raw data, spectra annotations as well as all Galaxy histories and workflows. Data are available via ProteomeXchange with identifier PXD026459 and Galaxy results via https://github.com/foellmelanie/Bladder_MSI_Manuscript_Galaxy_links. CONCLUSION: Here, we show that translational MSI data analysis in a fully transparent and reproducible manner is possible and we would like to encourage the community to join our efforts.
ABSTRACT
In bottom-up, label-free discovery proteomics, biological samples are acquired in a data-dependent (DDA) or data-independent (DIA) manner, with peptide signals recorded in an intact (MS1) and fragmented (MS2) form. While DDA has only the MS1 space for quantification, DIA contains both MS1 and MS2 at high quantitative quality. DIA profiles of complex biological matrices such as tissues or cells can contain quantitative interferences, and the interferences at the MS1 and the MS2 signals are often independent. When comparing biological conditions, the interferences can compromise the detection of differential peptide or protein abundance and lead to false positive or false negative conclusions. We hypothesized that the combined use of MS1 and MS2 quantitative signals could improve our ability to detect differentially abundant proteins. Therefore, we developed a statistical procedure incorporating both MS1 and MS2 quantitative information of DIA. We benchmarked the performance of the MS1-MS2-combined method to the individual use of MS1 or MS2 in DIA using four previously published controlled mixtures, as well as in two previously unpublished controlled mixtures. In the majority of the comparisons, the combined method outperformed the individual use of MS1 or MS2. This was particularly true for comparisons with low fold changes, few replicates, and situations where MS1 and MS2 were of similar quality. When applied to a previously unpublished investigation of lung cancer, the MS1-MS2-combined method increased the coverage of known activated pathways. Since recent technological developments continue to increase the quality of MS1 signals (e.g. using the BoxCar scan mode for Orbitrap instruments), the combination of the MS1 and MS2 information has a high potential for future statistical analysis of DIA data.
Subject(s)
Proteomics/methods , Animals , Caenorhabditis elegans , Cerebellum/metabolism , Data Interpretation, Statistical , HeLa Cells , Humans , Lung/metabolism , Lung Neoplasms/metabolism , Mass Spectrometry , Mice , Saccharomyces cerevisiae
ABSTRACT
Tandem mass tag (TMT) is a multiplexing technology widely used in proteomic research. It enables relative quantification of proteins from multiple biological samples in a single MS run with high efficiency and high throughput. However, experiments often require more biological replicates or conditions than can be accommodated by a single run, and involve multiple TMT mixtures and multiple runs. Such larger-scale experiments combine sources of biological and technical variation in patterns that are complex, unique to TMT-based workflows, and challenging for the downstream statistical analysis. These patterns cannot be adequately characterized by statistical methods designed for other technologies, such as label-free proteomics or transcriptomics. This manuscript proposes a general statistical approach for relative protein quantification in MS-based experiments with TMT labeling. It is applicable to experiments with multiple conditions, multiple biological replicate runs and multiple technical replicate runs, and unbalanced designs. It is based on a flexible family of linear mixed-effects models that handle complex patterns of technical artifacts and missing values. The approach is implemented in MSstatsTMT, a freely available open-source R/Bioconductor package compatible with data processing tools such as Proteome Discoverer, MaxQuant, OpenMS, and SpectroMine. Evaluation on a controlled mixture, simulated datasets, and three biological investigations with diverse designs demonstrated that MSstatsTMT balanced the sensitivity and the specificity of detecting differentially abundant proteins, in large-scale experiments with multiple biological mixtures.
Subject(s)
Isotope Labeling , Proteome/metabolism , Statistics as Topic , Tandem Mass Spectrometry , Humans , Proteomics
ABSTRACT
In bottom-up mass spectrometry-based proteomics, relative protein quantification is often achieved with data-dependent acquisition (DDA), data-independent acquisition (DIA), or selected reaction monitoring (SRM). These workflows quantify proteins by summarizing the abundances of all the spectral features of the protein (e.g. precursor ions, transitions or fragments) in a single value per protein per run. When abundances of some features are inconsistent with the overall protein profile (for technological reasons such as interferences, or for biological reasons such as post-translational modifications), the protein-level summaries and the downstream conclusions are undermined. We propose a statistical approach that automatically detects spectral features with such inconsistent patterns. The detected features can be separately investigated, and if necessary, removed from the data set. We evaluated the proposed approach on a series of benchmark-controlled mixtures and biological investigations with DDA, DIA and SRM data acquisitions. The results demonstrated that it could facilitate and complement manual curation of the data. Moreover, it can improve the estimation accuracy, sensitivity and specificity of detecting differentially abundant proteins, and reproducibility of conclusions across different data processing tools. The approach is implemented as an option in the open-source R-based software MSstats.
Subject(s)
Mass Spectrometry/methods , Proteins/analysis , Proteomics/methods , Databases, Protein , Protein Processing, Post-Translational , Reproducibility of Results , Sensitivity and Specificity , Software
ABSTRACT
MOTIVATION: Mass spectrometry imaging (MSI) characterizes the molecular composition of tissues at spatial resolution, and has a strong potential for distinguishing tissue types, or disease states. This can be achieved by supervised classification, which takes as input MSI spectra, and assigns class labels to subtissue locations. Unfortunately, developing such classifiers is hindered by the limited availability of training sets with subtissue labels as the ground truth. Subtissue labeling is prohibitively expensive, and only rough annotations of the entire tissues are typically available. Classifiers trained on data with approximate labels have sub-optimal performance. RESULTS: To alleviate this challenge, we contribute a semi-supervised approach mi-CNN. mi-CNN implements multiple instance learning with a convolutional neural network (CNN). The multiple instance aspect enables weak supervision from tissue-level annotations when classifying subtissue locations. The convolutional architecture of the CNN captures contextual dependencies between the spectral features. Evaluations on simulated and experimental datasets demonstrated that mi-CNN improved the subtissue classification as compared to traditional classifiers. We propose mi-CNN as an important step toward accurate subtissue classification in MSI, enabling rapid distinction between tissue types and disease states. AVAILABILITY AND IMPLEMENTATION: The data and code are available at https://github.com/Vitek-Lab/mi-CNN_MSI.
Subject(s)
Neural Networks, Computer , Mass Spectrometry
ABSTRACT
MOTIVATION: Accurate estimation of false discovery rate (FDR) of spectral identification is a central problem in mass spectrometry-based proteomics. Over the past two decades, target-decoy approaches (TDAs) and decoy-free approaches (DFAs) have been widely used to estimate FDR. TDAs use a database of decoy species to faithfully model score distributions of incorrect peptide-spectrum matches (PSMs). DFAs, on the other hand, fit two-component mixture models to learn the parameters of correct and incorrect PSM score distributions. While conceptually straightforward, both approaches lead to problems in practice, particularly in experiments that push instrumentation to the limit and generate low fragmentation-efficiency and low signal-to-noise-ratio spectra. RESULTS: We introduce a new decoy-free framework for FDR estimation that generalizes present DFAs while exploiting more search data in a manner similar to TDAs. Our approach relies on multi-component mixtures, in which score distributions corresponding to the correct PSMs, best incorrect PSMs and second-best incorrect PSMs are modeled by the skew normal family. We derive EM algorithms to estimate parameters of these distributions from the scores of best and second-best PSMs associated with each experimental spectrum. We evaluate our models on multiple proteomics datasets and a HeLa cell digest case study consisting of more than a million spectra in total. We provide evidence of improved performance over existing DFAs and improved stability and speed over TDAs without any performance degradation. We propose that the new strategy has the potential to extend beyond peptide identification and reduce the need for TDA on all analytical platforms. AVAILABILITY AND IMPLEMENTATION: https://github.com/shawn-peng/FDR-estimation. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
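For readers unfamiliar with the target-decoy baseline that the abstract contrasts with decoy-free approaches, the following minimal sketch shows the basic estimate: at a given score threshold, the number of decoy matches passing the threshold approximates the number of incorrect target matches. It assumes separate target and decoy searches and that higher scores are better. The function name and scores are hypothetical, and this is not the mixture-model method proposed in the abstract.

```python
def target_decoy_fdr(target_scores, decoy_scores, threshold):
    """Estimate FDR among target PSMs scoring at or above the threshold.

    Assumes decoy PSMs passing the threshold are about as numerous as
    incorrect target PSMs passing it (separate target/decoy searches).
    """
    n_target = sum(score >= threshold for score in target_scores)
    n_decoy = sum(score >= threshold for score in decoy_scores)
    if n_target == 0:
        return 0.0  # nothing accepted, so no false discoveries to report
    return min(1.0, n_decoy / n_target)


# Illustrative scores only, not real search results.
targets = [9.1, 8.4, 7.7, 6.5, 5.2, 4.8, 3.9, 3.1]
decoys = [5.0, 4.1, 3.5, 2.2, 1.9, 1.4, 1.1, 0.8]
fdr = target_decoy_fdr(targets, decoys, threshold=4.0)
print(round(fdr, 3))
```

Raising the threshold lowers the estimate at the cost of accepting fewer target matches, which is the trade-off any FDR-controlled identification list has to make.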