ABSTRACT
The Bioconductor project (Nat. Methods 2015, 12 (2), 115-121) has shown that the R statistical environment is a highly valuable tool for genomics data analysis, but with respect to proteomics, low-level infrastructure enabling performant and robust analysis workflows in R is still missing. Libraries that provide raw data access are fundamentally important. Our R package rawDiag (J. Proteome Res. 2018, 17 (8), 2908-2914) provided a proof of principle of how access to mass spectrometry raw files can be realized by wrapping a vendor-provided application programming interface (API) for the purpose of metadata analysis and visualization. Our novel package rawrr now provides complete, OS-independent access to all spectral data logged in Thermo Fisher Scientific raw files. In this technical note, we present implementation details and describe the main functionalities provided by the rawrr package. In addition, we report two use cases inspired by real-world research tasks that demonstrate the application of the package. The raw data used for demonstration purposes were deposited as MassIVE data set MSV000086542. Availability: https://github.com/fgcz/rawrr.
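To make the described functionality concrete, the following is a minimal sketch of what spectral data access with rawrr might look like in an R session. The function names follow the package documentation (readFileHeader, readIndex, readSpectrum, readChromatogram); the m/z value is an arbitrary example, and exact signatures should be checked against the installed version.

```r
# Minimal sketch, assuming rawrr and its .NET/Mono assembly
# dependencies are installed as described in the GitHub README.
library(rawrr)

# Example raw file shipped with the package.
rawfile <- rawrr::sampleFilePath()

# Instrument and acquisition metadata from the file header.
header <- rawrr::readFileHeader(rawfile)

# Scan index: one row per scan (scan type, retention time, ...).
index <- rawrr::readIndex(rawfile)

# Random access to individual spectra by scan number.
spectra <- rawrr::readSpectrum(rawfile, scan = 1:3)
plot(spectra[[1]])  # rawrr ships a plot method for spectrum objects

# Extracted ion chromatogram (XIC) for an arbitrary example m/z,
# with a 10 ppm tolerance.
xic <- rawrr::readChromatogram(rawfile, mass = 445.1181, tol = 10,
                               type = "xic")
plot(xic)
```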
Subjects
Genomics, Software, Mass Spectrometry, Proteomics
ABSTRACT
Gas chromatography-high-resolution mass spectrometry (GC-HRMS) is a powerful nontargeted screening technique that promises to accelerate the identification of environmental pollutants. Currently, most GC-HRMS instruments are equipped with electron ionization (EI), but atmospheric pressure ionization (API) ion sources have attracted renewed interest because: (i) collisional cooling at atmospheric pressure minimizes fragmentation, resulting in an increased yield of molecular ions for elemental composition determination and improved detection limits; (ii) a wide range of sophisticated tandem (ion mobility) mass spectrometers can be easily adapted for operation with GC-API; and (iii) the conditions of an atmospheric pressure ion source can promote structure diagnostic ion-molecule reactions that are otherwise difficult to perform using conventional GC-MS instrumentation. This literature review addresses the merits of GC-API for nontargeted screening while summarizing recent applications using various GC-API techniques. One perceived drawback of GC-API is the paucity of spectral libraries that can be used to guide structure elucidation. Herein, novel data acquisition, deconvolution and spectral prediction tools will be reviewed. With continued development, it is anticipated that API may eventually supplant EI as the de facto GC-MS ion source used to identify unknowns.
ABSTRACT
Mass spectrometry is an important tool used by many scientists throughout the world. Nonetheless, feedback on the strengths and limitations of current software is often restricted to anecdote rather than formal inquiry. Over the course of 100 interviews on the state of mass spectrometry software, surprising patterns coalesced around several topics: perception of the frontier, perception of software quality, and differences between commercial and nonprofit environments. Most notably, the interviews suggested a substantial schism between user satisfaction with current software and developer perceptions of software quality. Scientists' anonymized responses are presented and summarized into their suggestions for improving the state of the art.
Subjects
Mass Spectrometry/methods, Software/standards, Humans, Interviews as Topic, Medical Laboratory Personnel
ABSTRACT
Computational tools are pivotal in proteomics because they are crucial for the identification, quantification, and statistical assessment of data. Journal articles are frequently the gateway to finding the best tool or approach for a particular problem, yet the overwhelming variety of options often makes it hard to decide on the best solution, particularly for nonexperts in bioinformatics. The maturity, reliability, and performance of tools can vary widely because publications may appear at different stages of development. A novel idea might merit early publication despite only offering a proof of principle, while it may take years before a tool can be considered mature, and by that time a new publication might struggle to be accepted because of a perceived lack of novelty. After discussions with members of the computational mass spectrometry community, we describe here proposed recommendations for organizing informatics manuscripts as a way to set the expectations of readers (and reviewers) through three different manuscript types that are based on existing journal designations. Brief Communications are short reports describing novel computational approaches where the implementation is not necessarily production-ready. Research Articles present both a novel idea and a mature implementation that has been suitably benchmarked. Application Notes focus on a mature and tested tool or concept and need not be novel but should offer advancement through improved quality, ease of use, and/or implementation. Organizing computational proteomics contributions into these three manuscript types will facilitate the review process and will also enable readers to identify the maturity and applicability of a tool for their own workflows.
Subjects
Bibliographies as Topic, Peer Review, Research, Proteomics/methods, Computational Biology, Humans, Mass Spectrometry/instrumentation, Mass Spectrometry/methods
ABSTRACT
The 2023 European Bioinformatics Community for Mass Spectrometry (EuBIC-MS) Developers Meeting was held from January 15th to January 20th, 2023, at the Congressi Stefano Franscini at Monte Verità in Ticino, Switzerland. The participants were scientists and developers working in computational mass spectrometry (MS), metabolomics, and proteomics. The 5-day program was split between introductory keynote lectures and parallel hackathon sessions focusing on "Artificial Intelligence in proteomics" to stimulate future directions in the MS-driven omics areas. During the latter, the participants developed bioinformatics tools and resources addressing outstanding needs in the community. The hackathons allowed less experienced participants to learn from more advanced computational MS experts and to actively contribute to highly relevant research projects. We successfully produced several new tools applicable to the proteomics community that improve data analysis and facilitate future research.
Subjects
Mass Spectrometry, Proteomics, Proteomics/methods, Humans, Mass Spectrometry/methods, Computational Biology/methods, Metabolomics/methods, Artificial Intelligence
ABSTRACT
Lipids exhibit functional bioactivities based on their polar and acyl chain properties; humans obtain lipids through dietary intake of plant products. Therefore, the identification of different molecular species facilitates the evaluation of biological functions and nutritional value, as well as the discovery of new phenotype-modulating lipid structures. As a rapid screening strategy, we performed untargeted lipidomics on 155 agricultural products from 58 species in 23 plant families, revealing product-specific lipid diversity using computational mass spectrometry. We characterized 716 lipid species whose profiles recapitulated the organismal classification established by the National Center for Biotechnology Information and revealed unique plant tissue metabotypes. Moreover, we annotated subclasses previously unreported in plant lipidology; e.g., triacylglycerol estolide (TG-EST) was detected in rice seeds (Oryza sativa) and several other plant species. TG-EST is known as the precursor molecule of fatty acid esters of hydroxy fatty acids, which lower ambient glycemia and improve glucose tolerance. Hence, our method can identify agricultural plant products containing valuable lipid ingredients.
Subjects
Lipidomics, Oryza, Fatty Acids, Humans, Lipids, Mass Spectrometry
ABSTRACT
In any analytical discipline, data analysis reproducibility is closely interlinked with data quality. In this book chapter, focused on mass spectrometry-based proteomics approaches, we introduce how data analysis reproducibility and data quality can influence each other and how data quality and data analysis design can be used to increase robustness and improve reproducibility. We first introduce methods and concepts for designing and maintaining robust data analysis pipelines such that reproducibility is increased in parallel. The technical aspects related to data analysis reproducibility are challenging, and current ways to increase overall robustness are multifaceted; software containerization and cloud infrastructures play an important part. We also show how quality control (QC) and quality assessment (QA) approaches can be used to spot analytical issues, reduce experimental variability, and increase confidence in the analytical results of (clinical) proteomics studies, since experimental variability plays a substantial role in analysis reproducibility. Therefore, we give an overview of existing solutions for QC/QA, including different quality metrics and methods for longitudinal monitoring. The efficient use of both types of approaches undoubtedly provides a way to improve the experimental reliability, reproducibility, and level of consistency of proteomics analytical measurements.
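As an illustration of the longitudinal monitoring idea discussed above, here is a minimal, self-contained R sketch of a Shewhart-style control chart over a single QC metric. The data frame and its values are simulated stand-ins, not the output of any particular QC tool.

```r
# Minimal sketch of longitudinal QC monitoring as a control chart;
# qc_runs is a hypothetical data frame, one row per instrument run,
# with a date and a single QC metric (e.g., the number of peptides
# identified in a standard sample).
set.seed(1)
qc_runs <- data.frame(
  date   = as.Date("2024-01-01") + 0:19,
  metric = c(rnorm(15, mean = 20000, sd = 800),
             rnorm(5,  mean = 16500, sd = 800))  # simulated drift
)

# Control limits estimated from an in-control baseline window.
baseline <- qc_runs$metric[1:15]
center   <- mean(baseline)
limits   <- center + c(-3, 3) * sd(baseline)

# Flag runs outside the +/- 3 SD band for follow-up.
qc_runs$flagged <- qc_runs$metric < limits[1] | qc_runs$metric > limits[2]

plot(qc_runs$date, qc_runs$metric, type = "b",
     xlab = "Acquisition date", ylab = "QC metric")
abline(h = c(center, limits), lty = c(1, 2, 2))
points(qc_runs$date[qc_runs$flagged], qc_runs$metric[qc_runs$flagged],
       pch = 19, col = "red")
```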
Subjects
Cloud Computing, Data Analysis, Proteomics/methods, Quality Control, Data Accuracy, Humans, Mass Spectrometry, Reproducibility of Results, Software
ABSTRACT
Ribosomally synthesized and post-translationally modified peptides (RiPPs) are an important class of natural products that include antibiotics and a variety of other bioactive compounds. Existing methods for discovering RiPPs by combining genome mining and computational mass spectrometry are limited to specific classes of RiPPs in small datasets, and they fail to handle unknown post-translational modifications. Here, we present MetaMiner, a software tool that addresses these challenges and is compatible with large-scale screening platforms for natural product discovery. After searching millions of spectra in the Global Natural Products Social (GNPS) molecular networking infrastructure against just eight genomic and metagenomic datasets, MetaMiner discovered 31 known and seven unknown RiPPs from diverse microbial communities, including the human and lichen microbiomes and microorganisms isolated from the International Space Station.
Subjects
Computational Biology/methods, Microbiota/genetics, Protein Processing, Post-Translational/genetics, Genomics/methods, Humans, Peptides/chemistry, Ribosomes/genetics, Software
ABSTRACT
In biological mass spectrometry, crude instrumental data need to be converted into meaningful theoretical models, and several data processing and data evaluation steps are required to arrive at the final results. These operations are often difficult to reproduce because they depend on highly specific computing platforms. This effect, known as 'workflow decay', can be diminished by using a standardized informatic infrastructure. Thus, we compiled an integrated platform that contains ready-to-use tools and workflows for mass spectrometry data analysis. Apart from general unit operations, such as peak picking and identification of proteins and metabolites, we put a strong emphasis on the statistical validation of results and Data Mining. MASSyPup64 includes, e.g., the OpenMS/TOPPAS framework, the Trans-Proteomic Pipeline programs, the ProteoWizard tools, X!Tandem, Comet and SpiderMass. The statistical computing language R is installed with packages for MS data analyses, such as XCMS/metaXCMS and MetabR. The R package Rattle provides user-friendly access to multiple Data Mining methods. Further, we added the non-conventional spreadsheet program teapot for editing large data sets and a command line tool for transposing large matrices. Individual programs, console commands and modules can be integrated using the Workflow Management System (WMS) Taverna. We illustrate the useful combination of these tools with practical examples: (1) a workflow for protein identification and validation, with subsequent Association Analysis of peptides, (2) cluster analysis and Data Mining in targeted Metabolomics, and (3) raw data processing, Data Mining and identification of metabolites in untargeted Metabolomics. Association Analyses reveal relationships between variables across different sample sets. We present their application for finding co-occurring peptides, which can be used for targeted proteomics, the discovery of alternative biomarkers, and the study of protein-protein interactions. Data Mining-derived models displayed higher robustness and accuracy for classifying sample groups in targeted Metabolomics than cluster analyses. Random Forest models not only provide predictive models that can be deployed for new data sets but also report variable importance. We demonstrate that the latter is especially useful for tracking down significant signals and affected pathways in untargeted Metabolomics; Random Forest modeling thus supports the unbiased search for relevant biological features. Our results clearly demonstrate the importance of Data Mining methods for disclosing non-obvious information in biological mass spectrometry. The application of a Workflow Management System and the integration of all required programs and data in a consistent platform make the presented data analysis strategies reproducible for non-expert users. The simple remastering process and the Open Source licenses of MASSyPup64 (http://www.bioprocess.org/massypup/) enable the continuous improvement of the system.
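The Random Forest variable importance described above can be illustrated with a short R sketch using the randomForest package (one of the Data Mining methods accessible through Rattle). The feature matrix below is simulated toy data standing in for an untargeted metabolomics feature table, not data produced by the MASSyPup64 workflows.

```r
# Minimal sketch, assuming the randomForest package is installed.
library(randomForest)

set.seed(42)
# Simulated stand-in for a feature table (samples x m/z features),
# as would be produced by, e.g., XCMS peak picking.
X <- matrix(rnorm(60 * 100), nrow = 60,
            dimnames = list(NULL, paste0("mz_", 1:100)))
group <- factor(rep(c("control", "treated"), each = 30))
X[group == "treated", 1:5] <- X[group == "treated", 1:5] + 2  # true signals

rf <- randomForest(x = X, y = group, ntree = 500, importance = TRUE)
print(rf)  # out-of-bag error estimate doubles as internal validation

# Rank features by mean decrease in accuracy to shortlist candidate
# signals for identification and pathway analysis.
imp <- importance(rf, type = 1)
head(imp[order(imp, decreasing = TRUE), , drop = FALSE])
```

In this sketch, the five planted signal features should dominate the importance ranking, mirroring how variable importance can be used to track down relevant biological features.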
ABSTRACT
The second Critical Assessment of Small Molecule Identification (CASMI) contest took place in 2013. A joint team from the Swiss Federal Institute of Aquatic Science and Technology (Eawag) and the Leibniz Institute of Plant Biochemistry (IPB) participated in CASMI 2013 with an automatic, workflow-style entry. MOLGEN-MS/MS was used for Category 1, molecular formula calculation, restricted by the information given for each challenge. MetFrag and MetFusion were used for Category 2, structure identification, retrieving candidates from the compound databases KEGG, PubChem and ChemSpider and merging these lists before submission. The results from Category 1 were used to decide whether formula or exact mass searches were performed for Category 2. The Category 2 results were impressive considering the database sizes and the automated regime used, although they could not compete with the manual approach of the contest winner. The Category 1 results were affected by large m/z and ppm deviations in the challenge data, where other participants' strategies that went beyond pure enumeration were more successful. However, the combination used for the CASMI 2013 entries was extremely useful for developing decision-making criteria for automatic, high-throughput general unknown (non-target) identification and for future contests.
ABSTRACT
Protein inference is an often neglected though crucial step in most proteomic experiments. In the bottom-up proteomic approach, the actual molecules of interest, the proteins, are digested into peptides before measurement on a mass spectrometer. This approach introduces a loss of information: the actual proteins must be inferred from the identified peptides. While this might seem trivial, there are certain problems, one of the biggest being the presence of peptides that are shared among proteins. Depending on the database used for identification, such amino acid sequences can belong to more than one protein. If such peptides are identified in a sample, it cannot be determined which proteins were actually present; only an estimate of the most probable proteins or protein groups can be given, based on a predefined inference strategy. Here we describe the effect of the database chosen for peptide identification on the number of shared peptides. Afterward, the most commonly used protein inference methods are sketched, and the necessity of stringent false discovery rate control at the peptide and protein levels is discussed. Finally, we explain how the tool PIA ("Protein Inference Algorithms") can be used together with the workflow environment KNIME and OpenMS to perform protein inference in a common proteomic experiment.
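To make the shared-peptide problem tangible, here is a minimal R sketch that groups proteins by their peptide evidence. The peptide-to-protein map is hypothetical toy data, and the naive grouping shown only illustrates the concept; it is not the algorithm implemented in PIA.

```r
# Minimal sketch of the shared-peptide problem; pepmap is a
# hypothetical map from identified peptide to matching proteins.
pepmap <- list(
  PEPTIDEA = c("ProtA", "ProtB"),  # shared between two proteins
  PEPTIDEB = c("ProtA"),           # unique to ProtA
  PEPTIDEC = c("ProtC", "ProtD"),
  PEPTIDED = c("ProtC", "ProtD")   # ProtC/ProtD are indistinguishable
)

# Invert to protein -> identified peptide set.
proteins <- unique(unlist(pepmap))
protsets <- lapply(proteins, function(p)
  sort(names(pepmap)[vapply(pepmap, function(x) p %in% x, logical(1))]))
names(protsets) <- proteins

# Proteins with identical peptide evidence collapse into one group.
groups <- split(proteins, vapply(protsets, paste, "", collapse = "+"))
print(groups)
```

In this toy example, ProtC and ProtD share exactly the same peptide evidence and collapse into one indistinguishable protein group, while ProtA has unique evidence (PEPTIDEB) and would be reported by a parsimony strategy, with ProtB supported only by a shared peptide.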