ABSTRACT
Monoclonal antibodies are biotechnologically produced proteins with various applications in research, therapeutics and diagnostics. Their ability to recognize and bind to specific molecule structures makes them essential research tools and therapeutic agents. Sequence information of antibodies is helpful for understanding antibody-antigen interactions and ensuring their affinity and specificity. De novo protein sequencing based on mass spectrometry is a valuable method to obtain the amino acid sequence of peptides and proteins without a priori knowledge. In this study, we evaluated six recently developed de novo peptide sequencing algorithms (Novor, pNovo 3, DeepNovo, SMSNet, PointNovo and Casanovo), which were not specifically designed for antibody data. We validated their ability to identify and assemble antibody sequences on three multi-enzymatic data sets. The deep learning-based tools Casanovo and PointNovo showed an increased peptide recall across different enzymes and data sets compared with spectrum-graph-based approaches. We evaluated different error types of de novo peptide sequencing tools and their performance for different numbers of missing cleavage sites, noisy spectra and peptides of various lengths. We achieved a sequence coverage of 97.69-99.53% on the light chains of three different antibody data sets using the de Bruijn assembler ALPS and the predictions from Casanovo. However, low sequence coverage and accuracy on the heavy chains demonstrate that complete de novo protein sequencing remains a challenging issue in proteomics that requires improved de novo error correction, alternative digestion strategies and hybrid approaches such as homology search to achieve high accuracy on long protein sequences.
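The assembly step mentioned above can be illustrated with a toy de Bruijn graph. This is only a minimal sketch under simplifying assumptions: it is not the ALPS implementation, the function names `build_de_bruijn` and `extend_contig` are hypothetical, the peptides are made-up error-free fragments, and the greedy extension simply follows the most frequent edge.

```python
from collections import defaultdict


def build_de_bruijn(peptides, k):
    """Map each (k-1)-mer to the (k-1)-mers that follow it, with edge counts."""
    graph = defaultdict(lambda: defaultdict(int))
    for pep in peptides:
        for i in range(len(pep) - k + 1):
            kmer = pep[i:i + k]
            # Edge from the k-mer's prefix to its suffix; count repeated support.
            graph[kmer[:-1]][kmer[1:]] += 1
    return graph


def extend_contig(graph, seed):
    """Greedily extend a contig from `seed` along the best-supported edges.

    Stops at a dead end or when a node would be visited twice (a cycle).
    """
    contig = seed
    node = seed
    visited = {seed}
    while node in graph:
        next_node = max(graph[node].items(), key=lambda kv: kv[1])[0]
        if next_node in visited:
            break
        contig += next_node[-1]
        visited.add(next_node)
        node = next_node
    return contig


# Hypothetical overlapping peptide predictions (illustrative only).
peptides = ["DIQMTQSP", "TQSPSSLS", "SSLSASVG"]
graph = build_de_bruijn(peptides, k=4)
print(extend_contig(graph, "DIQ"))
```

Real de novo predictions contain errors, so a practical assembler would also weight k-mers by prediction confidence and resolve branches, which this sketch deliberately omits.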
Subjects
Monoclonal Antibodies , Peptides , Amino Acid Sequence , Monoclonal Antibodies/genetics , Peptides/genetics , Peptides/chemistry , Algorithms , Protein Sequence Analysis/methods
ABSTRACT
MOTIVATION: Inferring taxonomy in mass spectrometry-based shotgun proteomics is a complex task. In multi-species or viral samples of unknown taxonomic origin, the presence of proteins and corresponding taxa must be inferred from a list of identified peptides, which is often complicated by protein homology: many proteins share peptides not only within a taxon but also between taxa. However, correct taxonomic inference is crucial when identifying different viral strains with high sequence homology, considering, e.g., the different epidemiological characteristics of the various strains of severe acute respiratory syndrome-related coronavirus-2. Additionally, many viruses mutate frequently, further complicating the correct identification of viral proteomic samples. RESULTS: We present PepGM, a probabilistic graphical model for the taxonomic assignment of virus proteomic samples with strain-level resolution and associated confidence scores. PepGM combines the results of a standard proteomic database search algorithm with belief propagation to calculate the marginal distributions, and thus confidence scores, for potential taxonomic assignments. We demonstrate the performance of PepGM using several publicly available virus proteomic datasets, showing its strain-level resolution performance. In two out of eight cases, the taxonomic assignments were only correct on the species level, which PepGM clearly indicates by lower confidence scores. AVAILABILITY AND IMPLEMENTATION: PepGM is written in Python and embedded into a Snakemake workflow. It is available at https://github.com/BAMeScience/PepGM.
Subjects
COVID-19 , Viruses , Humans , Proteome , Proteomics/methods , Algorithms , Viruses/genetics , Peptides
ABSTRACT
MOTIVATION: Deep learning has moved to the forefront of tandem mass spectrometry-driven proteomics and authentic prediction for peptide fragmentation is more feasible than ever. Still, at this point spectral prediction is mainly used to validate database search results or for confined search spaces. Fully predicted spectral libraries have not yet been efficiently adapted to large search space problems that often occur in metaproteomics or proteogenomics. RESULTS: In this study, we showcase a workflow that uses Prosit for spectral library predictions on two common metaproteomes and implement an indexing and search algorithm, Mistle, to efficiently identify experimental mass spectra within the library. Hence, the workflow emulates a classic protein sequence database search with protein digestion but builds a searchable index from spectral predictions as an in-between step. We compare Mistle to popular search engines, both on a spectral and database search level, and provide evidence that this approach is more accurate than a database search using MSFragger. Mistle outperforms other spectral library search engines in terms of run time and proves to be extremely memory efficient with a 4- to 22-fold decrease in RAM usage. This makes Mistle universally applicable to large search spaces, e.g. covering comprehensive sequence databases of diverse microbiomes. AVAILABILITY AND IMPLEMENTATION: Mistle is freely available on GitHub at https://github.com/BAMeScience/Mistle.
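To make the idea of matching an experimental spectrum against a library of predicted spectra concrete, here is a minimal sketch. It is not Mistle's indexing or scoring scheme: the brute-force scan, the 1 Da binning, the cosine score, the precursor tolerance and all names (`bin_spectrum`, `cosine`, `search`) are assumptions chosen for illustration, and the library entries are invented.

```python
import math


def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins."""
    binned = {}
    for mz, intensity in peaks:
        index = int(mz / bin_width)
        binned[index] = binned.get(index, 0.0) + intensity
    return binned


def cosine(a, b):
    """Cosine similarity between two binned spectra (0.0 if either is empty)."""
    dot = sum(value * b.get(index, 0.0) for index, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query_peaks, query_precursor, library, tolerance=0.5):
    """Return (name, score) of the best library match within the precursor tolerance."""
    query = bin_spectrum(query_peaks)
    best = None
    for name, (precursor, peaks) in library.items():
        if abs(precursor - query_precursor) > tolerance:
            continue  # precursor m/z filter restricts the candidates
        score = cosine(query, bin_spectrum(peaks))
        if best is None or score > best[1]:
            best = (name, score)
    return best


# Invented library entries: name -> (precursor m/z, [(m/z, intensity), ...]).
library = {
    "PEPTIDEK": (500.3, [(147.1, 50.0), (276.2, 100.0), (389.3, 80.0)]),
    "SAMPLER": (500.4, [(175.1, 100.0), (262.1, 40.0), (391.2, 60.0)]),
    "OTHERK": (650.0, [(147.1, 50.0), (276.2, 100.0), (389.3, 80.0)]),
}
query = [(147.1, 45.0), (276.2, 110.0), (389.3, 75.0)]
print(search(query, 500.3, library))
```

A linear scan like this does not scale to metaproteomic search spaces, which is the motivation for building a dedicated index over the predicted library.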
Subjects
Peptides , Software , Peptides/metabolism , Search Engine/methods , Proteomics/methods , Algorithms , Tandem Mass Spectrometry/methods , Protein Databases , Peptide Library
ABSTRACT
Non-targeted screenings (NTS) are essential tools in different fields, such as forensics, health and environmental sciences. NTSs often employ mass spectrometry (MS) methods due to their high throughput and sensitivity in comparison to, for example, nuclear magnetic resonance-based methods. Because the identification of mass spectral signals, called annotation, is labour intensive, supporting tools based on machine learning (ML) have been developed for it. However, both the diversity of mass spectral signals and the sheer quantity of different ML tools developed for compound annotation present a challenge for researchers in maintaining a comprehensive overview of the field. In this work, we illustrate which ML-based methods are available for compound annotation in non-targeted MS experiments and provide a nuanced comparison of the ML models used in MS data analysis, unravelling their unique features and performance metrics. Through this overview we support researchers in judiciously applying these tools in their daily research. This review also offers a detailed exploration of methods and datasets to show gaps in current methods, and promising target areas, offering a starting point for developers intending to improve existing methodologies.
Subjects
Machine Learning , Mass Spectrometry , Mass Spectrometry/methods , Computer Simulation , Humans
ABSTRACT
Adaptations of animal cells to growth in suspension culture concern in particular viral vaccine production, where very specific aspects of virus-host cell interaction need to be taken into account to achieve high cell specific yields and overall process productivity. So far, the complexity of alterations on the metabolism, enzyme, and proteome level required for adaptation is only poorly understood. In this study, for the first time, we combined several complex analytical approaches with the aim to track cellular changes on different levels and to unravel interconnections and correlations. Therefore, a Madin-Darby canine kidney (MDCK) suspension cell line, adapted earlier to growth in suspension, was cultivated in a 1-L bioreactor. Cell concentrations and cell volumes, extracellular metabolite concentrations, and intracellular enzyme activities were determined. The experimental data set was used as the input for a segregated growth model that was already applied to describe the growth dynamics of the parental adherent cell line. In addition, the cellular proteome was analyzed by liquid chromatography coupled to tandem mass spectrometry using a label-free protein quantification method to unravel altered cellular processes for the suspension and the adherent cell line. Four regulatory mechanisms were identified as a response of the adaptation of adherent MDCK cells to growth in suspension. These regulatory mechanisms were linked to the proteins caveolin, cadherin-1, and pirin. Combining cell, metabolite, enzyme, and protein measurements with mathematical modeling generated a more holistic view on cellular processes involved in the adaptation of an adherent cell line to suspension growth. KEY POINTS: • Less and more efficient glucose utilization for suspension cell growth • Concerted alteration of metabolic enzyme activity and protein expression • Protein candidates to interfere with glycolytic activity in MDCK cells.
Subjects
Proteome , Virus Cultivation , Animals , Cell Line , Cell Proliferation , Dogs , Madin Darby Canine Kidney Cells
ABSTRACT
One of the most widely used methods to detect an acute viral infection in clinical specimens is diagnostic real-time polymerase chain reaction. However, because of the COVID-19 pandemic, mass-spectrometry-based proteomics is currently being discussed as a potential diagnostic method for viral infections. Because proteomics is not yet applied in routine virus diagnostics, here we discuss its potential to detect viral infections. Apart from theoretical considerations, the current status and technical limitations are considered. Finally, the challenges that have to be overcome to establish proteomics in routine virus diagnostics are highlighted.
Subjects
Coronavirus Infections/diagnosis , Mass Spectrometry/methods , Viral Pneumonia/diagnosis , Proteomics/methods , Virology/methods , Betacoronavirus/chemistry , COVID-19 , COVID-19 Testing , Clinical Laboratory Techniques , Coronavirus Infections/virology , Humans , Pandemics , Viral Pneumonia/virology , Real-Time Polymerase Chain Reaction , SARS-CoV-2 , Virus Diseases/diagnosis , Virus Diseases/virology
ABSTRACT
Untargeted accurate strain-level classification of a priori unidentified organisms using tandem mass spectrometry is a challenging task. Reference databases often lack taxonomic depth, limiting peptide assignments to the species level. However, the extension with detailed strain information increases runtime and decreases statistical power. In addition, larger databases contain a higher number of similar proteomes. We present TaxIt, an iterative workflow to address the increasing search space required for MS/MS-based strain-level classification of samples with unknown taxonomic origin. TaxIt first applies reference sequence data for initial identification of species candidates, followed by automated acquisition of relevant strain sequences for low level classification. Furthermore, proteome similarities resulting in ambiguous taxonomic assignments are addressed with an abundance weighting strategy to increase the confidence in candidate taxa. For benchmarking the performance of our method, we apply our iterative workflow on several samples of bacterial and viral origin. In comparison to noniterative approaches using unique peptides or advanced abundance correction, TaxIt identifies microbial strains correctly in all examples presented (with one tie), thereby demonstrating the potential for untargeted and deeper taxonomic classification. TaxIt makes extensive use of public, unrestricted, and continuously growing sequence resources such as the NCBI databases and is available under open-source BSD license at https://gitlab.com/rki_bioinformatics/TaxIt.
Subjects
Proteomics , Tandem Mass Spectrometry , Protein Databases , Peptides , Proteome , Software
ABSTRACT
Although metaproteomics, the study of the collective proteome of microbial communities, has become increasingly powerful and popular over the past few years, the field has lagged behind on the availability of user-friendly, end-to-end pipelines for data analysis. We therefore describe the connection from two commonly used metaproteomics data processing tools in the field, MetaProteomeAnalyzer and PeptideShaker, to Unipept for downstream analysis. Through these connections, direct end-to-end pipelines are built from database searching to taxonomic and functional annotation.
Subjects
Data Analysis , Microbiota , Proteome , Proteomics , Software
ABSTRACT
While peptide identifications in mass spectrometry (MS)-based shotgun proteomics are mostly obtained using database search methods, high-resolution spectrum data from modern MS instruments nowadays offer the prospect of improving the performance of computational de novo peptide sequencing. The major benefit of de novo sequencing is that it does not require a reference database to deduce full-length or partial tag-based peptide sequences directly from experimental tandem mass spectrometry spectra. Although various algorithms have been developed for automated de novo sequencing, the prediction accuracy of proposed solutions has been rarely evaluated in independent benchmarking studies. The main objective of this work is to provide a detailed evaluation on the performance of de novo sequencing algorithms on high-resolution data. For this purpose, we processed four experimental data sets acquired from different instrument types from collision-induced dissociation and higher energy collisional dissociation (HCD) fragmentation mode using the software packages Novor, PEAKS and PepNovo. Moreover, the accuracy of these algorithms is also tested on ground truth data based on simulated spectra generated from peak intensity prediction software. We found that Novor shows the overall best performance compared with PEAKS and PepNovo with respect to the accuracy of correct full peptide, tag-based and single-residue predictions. In addition, the same tool outpaced the commercial competitor PEAKS in terms of running time speedup by factors of around 12-17. Despite around 35% prediction accuracy for complete peptide sequences on HCD data sets, taken as a whole, the evaluated algorithms perform moderately on experimental data but show a significantly better performance on simulated data (up to 84% accuracy). Further, we describe the most frequently occurring de novo sequencing errors and evaluate the influence of missing fragment ion peaks and spectral noise on the accuracy. Finally, we discuss the potential of de novo sequencing to become more widely used in the field.
Subjects
Algorithms , Proteomics/methods , Protein Sequence Analysis/methods , Amino Acid Sequence , Animals , Computational Biology/methods , Computer Simulation , Protein Databases/statistics & numerical data , Humans , Mice , Peptides/chemistry , Proteomics/statistics & numerical data , Pyrococcus furiosus/genetics , Saccharomyces cerevisiae/genetics , Protein Sequence Analysis/statistics & numerical data , Sequence Tagged Sites , Software , Tandem Mass Spectrometry/methods , Tandem Mass Spectrometry/statistics & numerical data
ABSTRACT
INTRODUCTION: The study of microbial communities based on the combined analysis of genomic and proteomic data - called metaproteogenomics - has gained increased research attention in recent years. This relatively young field aims to elucidate the functional and taxonomic interplay of proteins in microbiomes and its implications on human health and the environment. Areas covered: This article reviews bioinformatics methods and software tools dedicated to the analysis of data from metaproteomics and metaproteogenomics experiments. In particular, it focuses on the creation of tailored protein sequence databases, on the optimal use of database search algorithms including methods of error rate estimation, and finally on taxonomic and functional annotation of peptide and protein identifications. Expert opinion: Recently, various promising strategies and software tools have been proposed for handling typical data analysis issues in metaproteomics. However, severe challenges remain that are highlighted and discussed in this article; these include: (i) robust false-positive assessment of peptide and protein identifications, (ii) complex protein inference against a background of highly redundant data, (iii) taxonomic and functional post-processing of identification data, and finally, (iv) the assessment and provision of metrics and tools for quantitative analysis.
Subjects
Data Analysis , Metagenomics , Proteomics , Protein Databases , Humans , Proteome/metabolism , Search Engine
ABSTRACT
In shotgun proteomics, peptide and protein identification is most commonly conducted using database search engines, the method of choice when reference protein sequences are available. Despite its widespread use the database-driven approach is limited, mainly because of its static search space. In contrast, de novo sequencing derives peptide sequence information in an unbiased manner, using only the fragment ion information from the tandem mass spectra. In recent years, with the improvements in MS instrumentation, various new methods have been proposed for de novo sequencing. This review article provides an overview of existing de novo sequencing algorithms and software tools ranging from peptide sequencing to sequence-to-protein mapping. Various use cases are described for which de novo sequencing was successfully applied. Finally, limitations of current methods are highlighted and new directions are discussed for a wider acceptance of de novo sequencing in the community.
Subjects
Proteomics/methods , Protein Sequence Analysis/methods , Software , Algorithms , Animals , Computational Biology/methods , Humans , Proteins/analysis , Proteins/metabolism
ABSTRACT
Metaproteomics, the mass spectrometry-based analysis of proteins from multispecies samples, faces severe challenges concerning data analysis and results interpretation. To overcome these shortcomings, we here introduce the MetaProteomeAnalyzer (MPA) Portable software. In contrast to the original server-based MPA application, this newly developed tool no longer requires computational expertise for installation and is now independent of any relational database system. In addition, MPA Portable now supports state-of-the-art database search engines and a convenient command line interface for high-performance data processing tasks. While search engine results can easily be combined to increase the protein identification yield, an additional two-step workflow is implemented to provide sufficient analysis resolution for further postprocessing steps, such as protein grouping as well as taxonomic and functional annotation. Our new application has been developed with a focus on intuitive usability, adherence to data standards, and adaptation to Web-based workflow platforms. The open source software package can be found at https://github.com/compomics/meta-proteome-analyzer.
Subjects
Proteome/analysis , Proteomics/methods , Software , Algorithms , Mass Spectrometry/statistics & numerical data
ABSTRACT
INTRODUCTION: Within the last decade, the study of microbial communities has gained increasing research interest also driven by the recognition of the important role of these consortia in human health and disease. Metaproteomics, the analysis of the entire set of proteins from all microorganisms present in one ecosystem, has become a prominent technique for studying the relation between taxonomic diversity and functional profile of microbial communities. AREAS COVERED: The aim of this review is to address opportunities and challenges of metaproteomics from a computational perspective. Appealing to an audience of microbial ecologists and proteomic researchers alike, we provide an overview on state-of-the-art software and databases by which metaproteome data can be readily analyzed. Expert commentary: While tailored protein databases, combined search algorithms and iterative workflows are means to improve the identification yield, software tools for taxonomic and functional analysis are challenged by the vast amount of unannotated sequences in metaproteomics.
Subjects
Bacteria/genetics , Metagenomics , Proteome/genetics , Algorithms , Ecosystem , Bacterial Genome/genetics , Humans , Software
ABSTRACT
The biological and clinical relevance of glycosylation is becoming increasingly recognized, leading to a growing interest in large-scale clinical and population-based studies. In the past few years, several methods for high-throughput analysis of glycans have been developed, but thorough validation and standardization of these methods is required before significant resources are invested in large-scale studies. In this study, we compared liquid chromatography, capillary gel electrophoresis, and two MS methods for quantitative profiling of N-glycosylation of IgG in the same data set of 1201 individuals. To evaluate the accuracy of the four methods we then performed analysis of association with genetic polymorphisms and age. Chromatographic methods with either fluorescent or MS-detection yielded slightly stronger associations than MS-only and multiplexed capillary gel electrophoresis, but at the expense of lower levels of throughput. Advantages and disadvantages of each method were identified, which should inform the selection of the most appropriate method in future studies.
Subjects
High-Throughput Screening Assays/methods , Immunoglobulin G/genetics , Mass Spectrometry/methods , Polysaccharides/genetics , Adult , Liquid Chromatography , Capillary Electrophoresis , Glycosylation , Humans , Hydrophobic and Hydrophilic Interactions , Genetic Polymorphism , Polysaccharides/isolation & purification
ABSTRACT
Protein identification via database searches has become the gold standard in mass spectrometry based shotgun proteomics. However, as the quality of tandem mass spectra improves, direct mass spectrum sequencing gains interest as a database-independent alternative. In this chapter, the general principle of this so-called de novo sequencing is introduced along with pitfalls and challenges of the technique. The main tools available are presented with a focus on user friendly open source software which can be directly applied in everyday proteomic workflows.
Subjects
Computational Biology/methods , Data Mining/methods , Protein Databases , Proteins/analysis , Proteome , Proteomics/methods , Search Engine , Tandem Mass Spectrometry/methods , Algorithms , Animals , High-Throughput Screening Assays , Humans , Software , Workflow
ABSTRACT
Proteomics has become one of the main approaches for analyzing and understanding biological systems. Yet similar to other high-throughput analysis methods, the presentation of the large amounts of obtained data in easily interpretable ways remains challenging. In this review, we present an overview of the different ways in which proteomics software supports the visualization and interpretation of proteomics data. The unique challenges and current solutions for visualizing the different aspects of proteomics data, from acquired spectra via protein identification and quantification to pathway analysis, are discussed, and examples of the most useful visualization approaches are highlighted. Finally, we offer our ideas about future directions for proteomics data visualization.
Subjects
Proteomics/methods , Computer Graphics , Statistical Data Interpretation , Protein Databases , Datasets as Topic , Humans , Molecular Sequence Annotation , Organ Specificity , Protein Interaction Maps , Proteome/chemistry , Proteome/metabolism , Software
ABSTRACT
Metaproteomic research involves various computational challenges during the identification of fragmentation spectra acquired from the proteome of a complex microbiome. These issues are manifold and range from the construction of customized sequence databases, the optimal setting of search parameters to limitations in the identification search algorithms themselves. In order to assess the importance of these individual factors, we studied the effect of strategies to combine different search algorithms, explored the influence of chosen database search settings, and investigated the impact of the size of the protein sequence database used for identification. Furthermore, we applied de novo sequencing as a complementary approach to classic database searching. All evaluations were performed on a human intestinal metaproteome dataset. Pyrococcus furiosus proteome data were used to contrast database searching of metaproteomic data to a classic proteomic experiment. Searching against subsets of metaproteome databases and the use of multiple search engines increased the number of identifications. The integration of P. furiosus sequences in a metaproteomic sequence database showcased the limitation of the target-decoy-controlled false discovery rate approach in combination with large sequence databases. The selection of varying search engine parameters and the application of de novo sequencing represented useful methods to increase the reliability of the results. Based on our findings, we provide recommendations for the data analysis that help researchers to establish or improve analysis workflows in metaproteomics.
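The target-decoy approach referred to above can be sketched in a few lines. This is a generic illustration rather than the procedure used in the study: the estimate FDR = decoys / targets, the function name `accepted_targets`, and the example scores are assumptions.

```python
def accepted_targets(psms, max_fdr=0.01):
    """Count target PSMs accepted at the largest score cutoff meeting `max_fdr`.

    `psms` is a list of (score, is_decoy) pairs, where higher scores are better.
    The FDR at each cutoff is estimated as (#decoys) / (#targets) above it.
    """
    ranked = sorted(psms, key=lambda psm: psm[0], reverse=True)
    targets = 0
    decoys = 0
    best = 0
    for _score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Remember the deepest point in the ranking that still satisfies the FDR.
        if targets > 0 and decoys / targets <= max_fdr:
            best = targets
    return best


# Invented peptide-spectrum matches: (score, is_decoy).
psms = [(10, False), (9, False), (8, False), (7, True), (6, False), (5, False)]
print(accepted_targets(psms, max_fdr=0.25))
```

With a very large sequence database, decoy and false target matches compete with true matches at similar scores, so fewer identifications survive the same FDR cutoff; that is the limitation the abstract points to.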
Subjects
Metagenome/genetics , Proteome/genetics , Proteomics , Algorithms , Amino Acid Sequence/genetics , Humans , Pyrococcus furiosus/genetics , Software , Tandem Mass Spectrometry
ABSTRACT
Obesity is associated with the intestinal microbiota in humans but the underlying mechanisms are yet to be fully understood. Our previous phylogenetic study showed that the faecal microbiota profiles of nonobese versus obese and morbidly obese individuals differed. Here, we have extended this analysis with a characterization of the faecal metaproteome, in order to detect differences at a functional level. Proteins were extracted from crude faecal samples of 29 subjects, separated by 1D gel electrophoresis and characterized using RP LC-MS/MS. The peptide data were analyzed in database searches with two complementary algorithms, OMSSA and X!Tandem, to increase the number of identifications. Evolutionary genealogy of genes: nonsupervised orthologous groups (EggNOG) database searches resulted in the functional annotation of over 90% of the identified microbial and human proteins. Based on both bacterial and human proteins, a clear clustering of obese and nonobese samples was obtained that exceeded the phylogenetic separation in dimension. Moreover, integration of the metaproteomics and phylogenetic datasets revealed notably that the phylum Bacteroidetes was metabolically more active in the obese than nonobese subjects. Finally, significant correlations between clinical measurements and bacterial gene functions were identified. This study emphasizes the importance of integrating data of the host and microbiota to understand their interactions.
Subjects
Gastrointestinal Tract/microbiology , Microbiota/genetics , Morbid Obesity/genetics , Proteome/genetics , Adult , Bacteroides/genetics , Bacteroides/isolation & purification , Feces/microbiology , Female , Gastrointestinal Tract/pathology , Humans , Male , Morbid Obesity/microbiology , Morbid Obesity/pathology , Phylogeny , Prevotella/genetics , Prevotella/isolation & purification , Tandem Mass Spectrometry
ABSTRACT
The enormous challenges of mass spectrometry-based metaproteomics are primarily related to the analysis and interpretation of the acquired data. This includes reliable identification of mass spectra and the meaningful integration of taxonomic and functional meta-information from samples containing hundreds of unknown species. To ease these difficulties, we developed a dedicated software suite, the MetaProteomeAnalyzer, an intuitive open-source tool for metaproteomics data analysis and interpretation, which includes multiple search engines and the feature to decrease data redundancy by grouping protein hits to so-called meta-proteins. We also designed a graph database back-end for the MetaProteomeAnalyzer to allow seamless analysis of results. The functionality of the MetaProteomeAnalyzer is demonstrated using a sample of a microbial community taken from a biogas plant.
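Grouping redundant protein hits into meta-proteins can be pictured as a clustering over shared peptides. The sketch below assumes one simple rule, namely that proteins sharing at least one identified peptide end up in the same group; it is not taken from the MetaProteomeAnalyzer code base, and the function name `group_meta_proteins` and the example identifications are hypothetical.

```python
def group_meta_proteins(protein_peptides):
    """Group proteins that are connected through at least one shared peptide.

    `protein_peptides` maps a protein accession to its set of identified peptides.
    Returns a sorted list of sorted accession lists, one per meta-protein.
    """
    parent = {protein: protein for protein in protein_peptides}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_owner = {}  # peptide -> first protein seen with that peptide
    for protein, peptides in protein_peptides.items():
        for peptide in peptides:
            if peptide in first_owner:
                parent[find(protein)] = find(first_owner[peptide])
            else:
                first_owner[peptide] = protein

    groups = {}
    for protein in protein_peptides:
        groups.setdefault(find(protein), set()).add(protein)
    return sorted(sorted(group) for group in groups.values())


# Invented identifications (accession -> peptides).
hits = {
    "P1": {"AAK", "CCR"},
    "P2": {"CCR", "DDK"},
    "P3": {"EEK"},
    "P4": {"DDK"},
}
print(group_meta_proteins(hits))
```

Stricter rules, for example requiring a shared peptide set or a common taxonomy, would produce smaller groups; the choice trades redundancy reduction against resolution.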
Subjects
Proteome , Software , Computer Graphics , Mass Spectrometry
ABSTRACT
De novo sequencing is a popular technique in proteomics for identifying peptides from tandem mass spectra without having to rely on a protein sequence database. Despite the strong potential of de novo sequencing algorithms, their adoption threshold remains quite high. We here present a user-friendly and lightweight graphical user interface called DeNovoGUI for running parallelized versions of the freely available de novo sequencing software PepNovo+, greatly simplifying the use of de novo sequencing in proteomics. Our platform-independent software is freely available under the permissive Apache2 open source license. Source code, binaries, and additional documentation are available at http://denovogui.googlecode.com.