ABSTRACT
Spectrum clustering is a powerful strategy to minimize redundant mass spectra by grouping them based on similarity, with the aim of forming groups of mass spectra from the same repeatedly measured analytes. Each such group of near-identical spectra can be represented by its so-called consensus spectrum for downstream processing. Although several algorithms for spectrum clustering have been adequately benchmarked and tested, the influence of the consensus spectrum generation step is rarely evaluated. Here, we present an implementation and benchmark of common consensus spectrum algorithms, including spectrum averaging, spectrum binning, the most similar spectrum, and the best-identified spectrum. We have analyzed diverse public data sets using two different clustering algorithms (spectra-cluster and MaRaCluster) to evaluate how the consensus spectrum generation procedure influences downstream peptide identification. The BEST and BIN methods were found to be the most reliable methods for consensus spectrum generation, including for data sets with post-translational modifications (PTMs) such as phosphorylation. All source code and data of the present study are freely available on GitHub at https://github.com/statisticalbiotechnology/representative-spectra-benchmark.
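As a rough illustration of the binning idea behind a consensus spectrum, the sketch below merges the peak lists of one cluster by assigning peaks to fixed-width m/z bins and keeping only bins that are populated in a sufficient share of the spectra. The function name `bin_consensus`, the parameters `bin_width` and `min_fraction`, and the representation of a spectrum as a list of `(m/z, intensity)` pairs are assumptions made for this example; it is not the implementation benchmarked in the study.

```python
from collections import defaultdict


def bin_consensus(spectra, bin_width=0.02, min_fraction=0.5):
    """Merge the spectra of one cluster into a consensus peak list.

    spectra: list of peak lists, each a list of (mz, intensity) tuples.
    bin_width and min_fraction are illustrative defaults, not published values.
    """
    n_spectra = len(spectra)
    bins = defaultdict(lambda: {"mz_weighted": 0.0, "intensity": 0.0, "members": set()})

    for index, peaks in enumerate(spectra):
        for mz, intensity in peaks:
            entry = bins[int(mz // bin_width)]
            entry["mz_weighted"] += mz * intensity
            entry["intensity"] += intensity
            entry["members"].add(index)

    consensus = []
    for entry in bins.values():
        # Drop bins seen in too few spectra: they are more likely noise.
        if len(entry["members"]) / n_spectra < min_fraction:
            continue
        if entry["intensity"] <= 0:
            continue
        consensus_mz = entry["mz_weighted"] / entry["intensity"]  # intensity-weighted m/z
        consensus_intensity = entry["intensity"] / n_spectra      # mean over the cluster
        consensus.append((consensus_mz, consensus_intensity))

    return sorted(consensus)
```

Calling `bin_consensus(cluster, bin_width=0.5)` on a list of three peak lists returns one peak per retained bin, sorted by m/z. A fixed bin grid can split a peak that sits near a bin edge, which is one reason real tools may use tolerance-based merging instead.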
Subjects
Proteomics , Tandem Mass Spectrometry , Algorithms , Cluster Analysis , Consensus , Databases, Protein , Proteomics/methods , Software , Tandem Mass Spectrometry/methods
ABSTRACT
In drug development, late stage toxicity issues of a compound are the main cause of failure in clinical trials. In silico methods are therefore of high importance to guide the early design process to reduce time, costs and animal testing. Technical advances and the ever growing amount of available toxicity data enabled machine learning, especially neural networks, to impact the field of predictive toxicology. In this study, cytotoxicity prediction, one of the earliest handles in drug discovery, is investigated using a deep learning approach trained on a highly consistent in-house data set of over 34,000 compounds with a share of less than 5% of cytotoxic molecules. The model reached a balanced accuracy of over 70%, similar to previously reported studies using Random Forest. Albeit yielding good results, neural networks are often described as a black box lacking deeper mechanistic understanding of the underlying model. To overcome this absence of interpretability, a Deep Taylor Decomposition method is investigated to identify substructures that may be responsible for the cytotoxic effects, the so-called toxicophores. Furthermore, this study introduces cytotoxicity maps which provide a visual structural interpretation of the relevance of these substructures. Using this approach could be helpful in drug development to predict the potential toxicity of a compound as well as to generate new insights into the toxic mechanism. Moreover, it could also help to de-risk and optimize compounds.
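Balanced accuracy is the natural metric when fewer than 5% of compounds are cytotoxic, because plain accuracy rewards a model that never predicts toxicity. The sketch below shows the difference on a made-up label vector; the function names and the 5-in-100 example are assumptions for illustration and do not reproduce the study's data or model.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = cytotoxic)."""
    positives = sum(1 for t in y_true if t == 1)
    negatives = len(y_true) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present in y_true")
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    true_neg = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    sensitivity = true_pos / positives
    specificity = true_neg / negatives
    return 0.5 * (sensitivity + specificity)


# Hypothetical data set: 5 cytotoxic compounds among 100.
y_true = [1] * 5 + [0] * 95
# A trivial model that labels every compound as non-toxic.
y_pred = [0] * 100

print(accuracy(y_true, y_pred))           # 0.95
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

The trivial model looks strong under plain accuracy but scores no better than chance under balanced accuracy, which is why a reported balanced accuracy above 70% is informative on such skewed data.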
Subjects
Cytotoxins/chemistry , Cytotoxins/toxicity , Deep Learning , Drug Discovery/methods , Cell Survival/drug effects , Computer-Aided Design , Drug Design , Drug Discovery/statistics & numerical data , HEK293 Cells , Hep G2 Cells , Humans , Models, Biological , Neural Networks, Computer , Small Molecule Libraries , Software , Toxicology/statistics & numerical data
ABSTRACT
Here we provide a curated, large-scale, label-free mass spectrometry-based proteomics data set derived from HeLa cell lines for general-purpose machine learning and analysis. Data access and filtering are tedious tasks that take up a considerable amount of researchers' time. Therefore, we provide machine-based metadata for easy selection and overview alongside the 7,444 raw files and MaxQuant search output. For convenience, we provide three filtered and aggregated development datasets at the protein group, peptide, and precursor levels. In addition to providing easy-to-access training data, we provide an SDRF file annotating each raw file with instrument settings, allowing automated reprocessing. By providing our workflows and analysis scripts, we encourage others to enlarge this data set with instrument runs of further HeLa samples from different machine types.
Subjects
HeLa Cells , Machine Learning , Proteomics , Humans , Mass Spectrometry , Metadata
ABSTRACT
Imputation techniques provide a means to replace missing measurements with a value and are used in almost all downstream analyses of mass spectrometry (MS)-based proteomics data using label-free quantification (LFQ). Here we demonstrate how collaborative filtering, denoising autoencoders, and variational autoencoders can impute missing values in the context of LFQ at different levels. We applied our method, proteomics imputation modeling mass spectrometry (PIMMS), to an alcohol-related liver disease (ALD) cohort with blood plasma proteomics data available for 358 individuals. After removing 20 percent of the intensities, we were able to recover 15 out of 17 significantly abundant protein groups using PIMMS-VAE imputations. When analyzing the full dataset, we identified 30 additional proteins (+13.2%) that were significantly differentially abundant across disease stages compared to no imputation and found that some of these were predictive of ALD progression in machine learning models. We therefore suggest the use of deep learning approaches for imputing missing values in MS-based proteomics on larger datasets and provide workflows for these.
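The evaluation idea of removing a share of observed intensities and checking how well they are recovered can be sketched as follows. This is a minimal example under stated assumptions: the helper names are hypothetical, missing values are encoded as `None`, and a per-feature median stands in for the imputation model; it is not the PIMMS method or workflow.

```python
import random
import statistics


def mask_observed(matrix, fraction=0.2, seed=0):
    """Hide a random share of observed values; return masked copy and the hidden truth."""
    rng = random.Random(seed)
    observed = [(i, j) for i, row in enumerate(matrix)
                for j, value in enumerate(row) if value is not None]
    held_out = rng.sample(observed, int(len(observed) * fraction))
    masked = [list(row) for row in matrix]
    truth = {}
    for i, j in held_out:
        truth[(i, j)] = masked[i][j]
        masked[i][j] = None
    return masked, truth


def impute_column_median(matrix):
    """Baseline stand-in for a model: fill each gap with the feature (column) median."""
    medians = []
    for j in range(len(matrix[0])):
        column = [row[j] for row in matrix if row[j] is not None]
        medians.append(statistics.median(column) if column else None)
    return [[medians[j] if value is None else value for j, value in enumerate(row)]
            for row in matrix]


def mean_absolute_error(imputed, truth):
    """Average absolute deviation between imputed and held-out true values."""
    errors = [abs(imputed[i][j] - value) for (i, j), value in truth.items()
              if imputed[i][j] is not None]
    if not errors:
        raise ValueError("no held-out values could be compared")
    return sum(errors) / len(errors)
```

A typical use is `masked, truth = mask_observed(intensities)` followed by `mean_absolute_error(impute_column_median(masked), truth)`; swapping the baseline for another imputation method then gives a like-for-like comparison on the same held-out values.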
Subjects
Deep Learning , Mass Spectrometry , Proteomics , Proteomics/methods , Humans , Mass Spectrometry/methods , Supervised Machine Learning , Male
ABSTRACT
The 2023 European Bioinformatics Community for Mass Spectrometry (EuBIC-MS) Developers Meeting was held from January 15th to January 20th, 2023, in Congressi Stefano Franscin at Monte Verità in Ticino, Switzerland. The participants were scientists and developers working in computational mass spectrometry (MS), metabolomics, and proteomics. The 5-day program was split between introductory keynote lectures and parallel hackathon sessions focusing on "Artificial Intelligence in proteomics" to stimulate future directions in the MS-driven omics areas. During the latter, the participants developed bioinformatics tools and resources addressing outstanding needs in the community. The hackathons allowed less experienced participants to learn from more advanced computational MS experts and actively contribute to highly relevant research projects. We successfully produced several new tools applicable to the proteomics community by improving data analysis and facilitating future research.
Subjects
Mass Spectrometry , Proteomics , Proteomics/methods , Humans , Mass Spectrometry/methods , Computational Biology/methods , Metabolomics/methods , Artificial Intelligence
ABSTRACT
Most investigations of geographical within-species differences are limited to a single species. Here, we investigate global differences for multiple bacterial species using a dataset of 757 metagenomics sewage samples from 101 countries worldwide. The within-species variations were determined by performing genome reconstructions, and the analyses were expanded by gene-focused approaches. Applying these methods, we recovered 3353 near-complete (NC) metagenome-assembled genomes (MAGs) encompassing 1439 different MAG species and found that within-species genomic variation was consistent with regional separation in 36% of the investigated species (12/33). Additionally, we found that variation of organelle genes correlated less with geography compared to metabolic and membrane genes, suggesting that the global differences of these species are caused by regional environmental selection rather than dissemination limitations. By combining the large and globally distributed dataset with in-depth analysis, we present a wide investigation of global within-species phylogeny of sewage bacteria. The global differences found here emphasize the need for worldwide data sets when making global conclusions.
Subjects
Bacteria , Sewage , Phylogeny , Sewage/microbiology , Bacteria/genetics , Cluster Analysis , Geography
ABSTRACT
The inflammatory activity in cirrhosis is often pronounced and related to episodes of decompensation. Systemic markers of inflammation may contain prognostic information, and we investigated their possible correlation with admissions and mortality among patients with newly diagnosed liver cirrhosis. We collected plasma samples from 149 patients with newly diagnosed (within the past 6 months) cirrhosis, and registered deaths and hospital admissions within 180 days. Ninety-two inflammatory markers were quantified and correlated with clinical variables, mortality, and admissions. Prediction models were calculated by logistic regression. We compared the disease courses of our cohort with a validation cohort of 86 patients with cirrhosis. Twenty of 92 markers of inflammation correlated significantly with mortality within 180 days (q-values of 0.00-0.044), whereas we found no significant correlations with liver-related admissions. The logistic regression models yielded AUROCs of 0.73 to 0.79 for mortality and 0.61 to 0.73 for liver-related admissions, based on a variety of modalities (clinical variables, inflammatory markers, clinical scores, or combinations thereof). The models performed moderately well in the validation cohort and were better able to predict mortality than liver-related admissions. In conclusion, markers of inflammation can be used to predict 180-day mortality in patients with newly diagnosed cirrhosis. Prediction models for newly diagnosed cirrhotic patients need further validation before implementation in clinical practice. Trial registration: NCT04422223 (and NCT03443934 for the validation cohort), and Scientific Ethics Committee No.: H-19024348.
Subjects
Hospitalization , Liver Cirrhosis , Humans , Liver Cirrhosis/diagnosis , Prospective Studies , Prognosis , Inflammation , Severity of Illness Index
ABSTRACT
The application of multiple omics technologies in biomedical cohorts has the potential to reveal patient-level disease characteristics and individualized response to treatment. However, the scale and heterogeneous nature of multi-modal data make integration and inference a non-trivial task. We developed a deep-learning-based framework, multi-omics variational autoencoders (MOVE), to integrate such data and applied it to a cohort of 789 people with newly diagnosed type 2 diabetes with deep multi-omics phenotyping from the DIRECT consortium. Using in silico perturbations, we identified drug-omics associations across the multi-modal datasets for the 20 most prevalent drugs given to people with type 2 diabetes with substantially higher sensitivity than univariate statistical tests. From these we identified, among others, novel associations between metformin and the gut microbiota as well as opposite molecular responses for the two statins, simvastatin and atorvastatin. We used the associations to quantify drug-drug similarities and assess the degree of polypharmacy, and we conclude that drug effects are distributed across the multi-omics modalities.
Subjects
Deep Learning , Diabetes Mellitus, Type 2 , Humans , Algorithms , Diabetes Mellitus, Type 2/drug therapy , Diabetes Mellitus, Type 2/genetics
ABSTRACT
Alcohol-related liver disease (ALD) is a major cause of liver-related death worldwide, yet understanding of the three key pathological features of the disease (fibrosis, inflammation and steatosis) remains incomplete. Here, we present a paired liver-plasma proteomics approach to infer molecular pathophysiology and to explore the diagnostic and prognostic capability of plasma proteomics in 596 individuals (137 controls and 459 individuals with ALD), 360 of whom had biopsy-based histological assessment. We analyzed all plasma samples and 79 liver biopsies using a mass spectrometry (MS)-based proteomics workflow with short gradient times and an enhanced, data-independent acquisition scheme in only 3 weeks of measurement time. In plasma and liver biopsy tissues, metabolic functions were downregulated whereas fibrosis-associated signaling and immune responses were upregulated. Machine learning models identified proteomics biomarker panels that detected significant fibrosis (receiver operating characteristic-area under the curve (ROC-AUC), 0.92; accuracy, 0.82) and mild inflammation (ROC-AUC, 0.87; accuracy, 0.79) more accurately than existing clinical assays (DeLong's test, P < 0.05). These biomarker panels were found to be accurate in prediction of future liver-related events and all-cause mortality, with a Harrell's C-index of 0.90 and 0.79, respectively. An independent validation cohort reproduced the diagnostic model performance, laying the foundation for routine MS-based liver disease testing.
Subjects
Liver Diseases , Proteomics , Biomarkers/metabolism , Biopsy , Humans , Inflammation/pathology , Liver/metabolism , Liver Cirrhosis/diagnosis , Liver Cirrhosis/pathology , Liver Diseases/metabolism
ABSTRACT
The amount of public proteomics data is rapidly increasing but there is no standardized format to describe the sample metadata and their relationship with the dataset files in a way that fully supports their understanding or reanalysis. Here we propose to develop the transcriptomics data format MAGE-TAB into a standard representation for proteomics sample metadata. We implement MAGE-TAB-Proteomics in a crowdsourcing project to manually curate over 200 public datasets. We also describe tools and libraries to validate and submit sample metadata-related information to the PRIDE repository. We expect that these developments will improve the reproducibility and facilitate the reanalysis and integration of public proteomics datasets.