ABSTRACT
MOTIVATION: Enzymatic digestion of proteins before mass spectrometry analysis is a key process in metaproteomic workflows. Canonical metaproteomic data processing pipelines typically involve matching spectra produced by the mass spectrometer to a database of theoretical spectra, followed by matching the identified peptides back to their parent proteins. However, enzymatic digestion produces peptides that can be found in multiple proteins due to conservation or chance, which complicates protein and functional assignment. RESULTS: To address this challenge, we developed pepFunk, a peptide-centric metaproteomic workflow focused on the analysis of human gut microbiome samples. Our workflow includes a curated peptide database annotated with Kyoto Encyclopedia of Genes and Genomes (KEGG) terms and a gene set variation analysis-inspired pathway enrichment adapted for peptide-level data. Analysis using our peptide-centric workflow is fast and highly correlated with a protein-centric analysis, and can identify more enriched KEGG pathways than analysis using protein-level data. Our workflow is open source and available as a web application or as source code to be run locally. AVAILABILITY AND IMPLEMENTATION: pepFunk is available online as a web application at https://shiny.imetalab.ca/pepFunk/ with open-source code available from https://github.com/northomics/pepFunk. CONTACT: dfigeys@uottawa.ca. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
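The shared-peptide problem described in this abstract can be illustrated with a minimal sketch. It is not pepFunk code: the two protein sequences are hypothetical toy strings, the cleavage rule is a simplified trypsin rule (cut after K or R unless the next residue is P), and the names `trypsin_digest` and `map_peptides_to_proteins` are invented for illustration.

```python
import re
from collections import defaultdict

def trypsin_digest(sequence, min_length=5):
    """Simplified in silico trypsin digestion.

    Cuts after K or R unless the next residue is P, then drops
    fragments shorter than min_length (including any empty tail).
    """
    pieces = re.split(r"(?<=[KR])(?!P)", sequence)
    return [p for p in pieces if len(p) >= min_length]

def map_peptides_to_proteins(proteins):
    """Return a dict mapping each peptide to the set of proteins containing it."""
    peptide_to_proteins = defaultdict(set)
    for name, sequence in proteins.items():
        for peptide in trypsin_digest(sequence):
            peptide_to_proteins[peptide].add(name)
    return peptide_to_proteins

# Hypothetical toy proteins that share one conserved stretch.
toy_proteins = {
    "protA": "MSTAYIAKLGSDEFVKAAPLMNR",
    "protB": "MQDVLTRLGSDEFVKGHWEPSK",
}

mapping = map_peptides_to_proteins(toy_proteins)

# Peptides found in more than one protein cannot be assigned to a
# single parent protein, which is the ambiguity the abstract describes.
shared = {pep: sorted(prots) for pep, prots in mapping.items() if len(prots) > 1}
print(shared)
```

In a peptide-centric analysis, a functional annotation would be attached to the peptide itself rather than resolved through one of its candidate parent proteins.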
Subjects
Gastrointestinal Microbiome , Computational Biology , Humans , Peptides , Proteins , Software
ABSTRACT
BACKGROUND: The severity and frequency of drought have increased around the globe, creating challenges in ensuring food security for a growing world population. As a consequence, improving water use efficiency by crops has become an important objective for crop improvement. Some wild crop relatives have adapted to extreme osmotic stresses and can provide valuable insights into traits and genetic signatures that can guide efforts to improve crop tolerance to water deficits. Eutrema salsugineum, a close relative of many cruciferous crops, is a halophytic plant and extremophyte model for abiotic stress research. RESULTS: Using comparative transcriptomics, we show that two E. salsugineum ecotypes display significantly different transcriptional responses to a two-stage drought treatment. Even before visible wilting, water deficit led to the differential expression of almost 1,100 genes for an ecotype from the semi-arid, sub-arctic Yukon, Canada, but only 63 genes for an ecotype from semi-tropical, monsoonal Shandong, China. After recovery and a second drought treatment, about 5,000 differentially expressed genes were detected in Shandong plants versus 1,900 genes in Yukon plants. Only 13 genes displayed similar drought-responsive patterns in both ecotypes. We detected 1,007 long non-protein coding RNAs (lncRNAs), of which 8% were expressed only in stress-treated plants, a surprising outcome given the documented association between lncRNA expression and stress. Co-expression network analysis of the transcriptomes identified eight gene clusters where at least half of the genes in each cluster were differentially expressed. While many gene clusters were correlated with drought treatments, only a single cluster significantly correlated with drought exposure in both ecotypes. CONCLUSION: Extensive, ecotype-specific transcriptional reprogramming with drought was unexpected given that both ecotypes are adapted to saline habitats providing persistent exposure to osmotic stress. This ecotype-specific response would have escaped notice had we used a single exposure to water deficit. Finally, the apparent capacity to improve tolerance and growth after a drought episode represents an important adaptive trait for a plant that thrives under semi-arid Yukon conditions, and may be similarly advantageous for crop species experiencing stresses attributed to climate change.
Subjects
Brassicaceae/growth & development , Gene Expression Profiling/methods , Long Noncoding RNA/genetics , Messenger RNA/genetics , Brassicaceae/genetics , Canada , Dehydration , Ecotype , Plant Gene Expression Regulation , Gene Regulatory Networks , Plant Leaves/genetics , Plant Leaves/growth & development , Plant RNA/genetics , Salt-Tolerant Plants/genetics , Salt-Tolerant Plants/growth & development , RNA Sequence Analysis , Physiological Stress
ABSTRACT
BACKGROUND: In plants, long non-protein coding RNAs are believed to have essential roles in development and stress responses. However, relative to advances in discerning biological roles for long non-protein coding RNAs in animal systems, this RNA class is largely understudied in plants. With comparatively few validated plant long non-coding RNAs, research on this potentially critical class of RNA is hindered by a lack of appropriate prediction tools and databases. Supervised learning models trained on data sets of mostly non-validated, non-coding transcripts have previously been used to identify this enigmatic RNA class, with applications largely focused on animal systems. Our approach uses a training set composed only of empirically validated long non-protein coding RNAs from plant, animal, and viral sources to predict and rank candidate long non-protein coding gene products for future functional validation. RESULTS: Individual stochastic gradient boosting and random forest classifiers trained only on empirically validated long non-protein coding RNAs were constructed. To exploit the strengths of multiple classifiers, we combined multiple models into a single stacking meta-learner. This ensemble approach benefits from the diversity of several learners to effectively identify putative plant long non-coding RNAs from transcript sequence features. When the predicted genes identified by the ensemble classifier were compared to those listed in GreeNC, an established plant long non-coding RNA database, overlap for predicted genes from Arabidopsis thaliana, Oryza sativa and Eutrema salsugineum ranged from 51 to 83%, with the highest agreement in Eutrema salsugineum. Most of the highest ranking predictions from Arabidopsis thaliana were annotated as potential natural antisense genes, pseudogenes, transposable elements, or simply computationally predicted hypothetical proteins. Due to the nature of this tool, the model can be updated as new long non-protein coding transcripts are identified and functionally verified. CONCLUSIONS: This ensemble classifier is an accurate tool that can be used to rank long non-protein coding RNA predictions for use in conjunction with gene expression studies. Selection of plant transcripts with a high potential for regulatory roles as long non-protein coding RNAs will advance research in the elucidation of long non-protein coding RNA function.
Subjects
Computational Biology/methods , Machine Learning , Long Noncoding RNA/genetics , Open Reading Frames/genetics , Plant RNA/genetics , Stochastic Processes
ABSTRACT
Functional redundancy is a key ecosystem property representing the fact that different taxa contribute to an ecosystem in similar ways through the expression of redundant functions. The redundancy of potential functions (genome-level functional redundancy) of human microbiomes has recently been quantified using metagenomics data. Yet, the redundancy of expressed functions in the human microbiome has never been quantitatively explored. Here, we present an approach to quantify proteome-level functional redundancy in the human gut microbiome using metaproteomics. Ultra-deep metaproteomics reveals high proteome-level functional redundancy and high nestedness in the human gut proteomic content networks (i.e., the bipartite graphs connecting taxa to functions). We find that the nested topology of proteomic content networks and relatively small functional distances between proteomes of certain pairs of taxa together contribute to high proteome-level functional redundancy in the human gut microbiome. As a metric that incorporates the presence or absence of each function, the protein abundance of each function, and the biomass of each taxon, proteome-level functional redundancy outperforms diversity indices in detecting significant microbiome responses to environmental factors, including individuality, biogeography, xenobiotics, and disease. We show that gut inflammation and exposure to specific xenobiotics can significantly diminish proteome-level functional redundancy with no significant change in taxonomic diversity.
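The abstract does not state how the redundancy metric is computed. As a hedged sketch only, one common formulation sums, over all pairs of distinct taxa i and j, the product of their relative biomasses p_i and p_j weighted by their functional similarity (1 - d_ij). Both that formula and the choice of a weighted Jaccard distance for d_ij are assumptions here, and the function names and toy numbers are invented for illustration.

```python
def weighted_jaccard_distance(profile_a, profile_b):
    """Weighted Jaccard distance between two function-abundance profiles.

    Assumed distance measure; returns 0.0 when both profiles are all zeros.
    """
    numerator = sum(min(a, b) for a, b in zip(profile_a, profile_b))
    denominator = sum(max(a, b) for a, b in zip(profile_a, profile_b))
    if denominator == 0:
        return 0.0
    return 1.0 - numerator / denominator

def functional_redundancy(biomass, profiles):
    """Assumed formulation: sum over i != j of (1 - d_ij) * p_i * p_j.

    biomass  -- per-taxon biomass (any positive scale; normalized below)
    profiles -- per-taxon list of protein abundances, one entry per function
    """
    total = sum(biomass)
    p = [b / total for b in biomass]
    redundancy = 0.0
    for i in range(len(p)):
        for j in range(len(p)):
            if i != j:
                d_ij = weighted_jaccard_distance(profiles[i], profiles[j])
                redundancy += (1.0 - d_ij) * p[i] * p[j]
    return redundancy

# Two equally abundant taxa with identical functional profiles.
print(functional_redundancy([1, 1], [[1, 2], [1, 2]]))
# Two equally abundant taxa with no functions in common.
print(functional_redundancy([1, 1], [[1, 0], [0, 1]]))
```

Under this formulation the value rises when abundant taxa express similar functions and falls to zero when no two taxa share any function, which matches the qualitative behaviour the abstract attributes to small functional distances between taxa.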
Subjects
Gastrointestinal Microbiome , Microbiota , Humans , Gastrointestinal Microbiome/physiology , Proteome , Proteomics , Xenobiotics , Feces
ABSTRACT
Constant improvements in mass spectrometry technologies and laboratory workflows have enabled the proteomic investigation of biological samples of growing complexity. Microbiomes are one such complex sample type, for which metaproteomics analyses are becoming increasingly popular. Metaproteomics experimental procedures create large amounts of data from which biologically relevant signal must be efficiently extracted to draw meaningful conclusions. Such data processing requires appropriate bioinformatics tools specifically developed for, or capable of handling, metaproteomics data. In this chapter, we outline current and novel tools that can perform the most commonly used steps in the analysis of cutting-edge metaproteomics data, such as peptide and protein identification and quantification, as well as data normalization, imputation, mining, and visualization. We also provide details about the experimental setups in which these tools should be used.
Subjects
Gastrointestinal Microbiome , Microbiota , Computational Biology/methods , Proteomics/methods , Software
ABSTRACT
Metaproteomics is a recently thriving technique that studies the collection of proteins in complex microbiomes of humans, animals, plants, and the environment. The bioinformatics workflow required for metaproteomics research, from database search and protein quantification to downstream functional and taxonomic analysis, has been challenging, which limits the accessibility of metaproteomics to microbiome researchers. To overcome these challenges, we have developed a set of tools named iMetaLab Suite. iMetaLab Suite includes the following components: (1) MetaLab Desktop, an automated database search software that facilitates protein identification and quantitation from microbiomes; (2) the automated iMetaReport, which allows users to quickly access database search results and data set profiles; and (3) an interactive online toolset, iMetaShiny, covering the most frequently used functional, taxonomic, and statistical analyses in metaproteomics. iMetaLab Suite is a free, easily accessible, and actively updated toolset that helps researchers explore metaproteomic data.
ABSTRACT
Metaproteomics is used to explore the functional dynamics of microbial communities. However, acquiring metaproteomic data by tandem mass spectrometry (MS/MS) is time-consuming and resource-intensive, and there is a demand for computational methods that can reduce these resource requirements. We present MetaProClust-MS1, a computational framework for microbiome feature screening developed to prioritize samples for follow-up MS/MS. In this proof-of-concept study, we compared MetaProClust-MS1 results on fecal gut microbiome data acquired using short 15-min MS1-only chromatographic gradients, and on MS1 spectra from longer 60-min gradients, with results from MS/MS-acquired data. We found that MetaProClust-MS1 identified robust gut microbiome responses caused by xenobiotics, with significantly correlated cluster topologies across comparable data sets. We also used MetaProClust-MS1 to reanalyze data from both a clinical MS/MS diagnostic study of pediatric patients with inflammatory bowel disease and an experiment evaluating the therapeutic effects of a small molecule on the brain tissue of Alzheimer's disease mouse models. MetaProClust-MS1 clusters could distinguish between inflammatory bowel disease diagnoses (ulcerative colitis and Crohn's disease) using mucosal luminal interface samples and identified hippocampal proteome shifts of Alzheimer's disease mouse models after small-molecule treatment. Therefore, we demonstrate that MetaProClust-MS1 can screen both microbiomes and single-species proteomes using only MS1 profiles, and our results suggest that this approach may be generalizable to any proteomics experiment. MetaProClust-MS1 may be especially useful in large-scale metaproteomic screening, where it can prioritize samples for further metaproteomic characterization using, for instance, MS/MS, and it is a promising novel approach for clinical diagnostic screening.
IMPORTANCE: Growing evidence suggests that human gut microbiome composition and function are highly associated with health and disease. As such, high-throughput metaproteomic studies are becoming more common in gut microbiome research. However, using a conventional long liquid chromatography (LC)-MS/MS gradient metaproteomics approach as an initial screen in large-scale microbiome experiments can be slow and expensive. To address this challenge, we introduce MetaProClust-MS1, a computational framework for microbiome screening using MS1-only profiles. In this proof-of-concept study, we show that MetaProClust-MS1 identifies clusters of gut microbiome treatments using MS1-only profiles similar to those identified using MS/MS. Our approach allows researchers to prioritize samples and treatments of interest for further metaproteomic analyses and may be generally applicable to any proteomic analysis. This approach may be especially useful for large-scale metaproteomic screening or in clinical settings where rapid diagnostic evidence is required.
Subjects
Alzheimer Disease , Inflammatory Bowel Diseases , Microbiota , Animals , Mice , Humans , Child , Proteomics/methods , Tandem Mass Spectrometry , Proteome
ABSTRACT
Long non-coding RNAs (lncRNAs) represent a diverse class of regulatory loci with roles in development and stress responses throughout all kingdoms of life. LncRNAs, however, remain under-studied in plants compared to animal systems. To address this deficiency, we applied a machine learning prediction tool, Classifying RNA by Ensemble Machine learning Algorithm (CREMA), to analyze RNAseq data from 11 plant species chosen to represent a wide range of evolutionary histories. Transcript sequences of all expressed and/or annotated loci from plants grown in unstressed (control) conditions were assembled and input into CREMA for comparative analyses. On average, 6.4% of the plant transcripts were identified by CREMA as encoding lncRNAs. Gene annotation associated with the transcripts showed that up to 99% of all predicted lncRNAs for Solanum tuberosum and Amborella trichopoda were missing from their reference annotations, whereas the reference annotation for the genetic model plant Arabidopsis thaliana contains 96% of all predicted lncRNAs for this species. Thus, lncRNA research that relies on reference annotations can be impeded in less well-studied plants by the near absence of annotations for these regulatory transcripts. Moreover, our phylogenetic signal analyses suggest that the molecular traits of plant lncRNAs display evolutionary patterns different from those of all other plant transcripts and do not follow a classic evolutionary pattern. Specifically, GC content was the only tested trait of lncRNAs with consistently significant and high phylogenetic signal, in contrast to the high signal in all tested molecular traits for the other transcripts in our tested plant species.
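GC content, the one lncRNA trait reported above as carrying consistent phylogenetic signal, is the fraction of G and C bases in a transcript. The sketch below is illustrative only and is not taken from CREMA or the study's pipeline; the function name `gc_content` and the decision to ignore ambiguous bases such as N are assumptions.

```python
def gc_content(sequence):
    """Fraction of G and C among unambiguous bases (A, C, G, T, U).

    Case-insensitive; ambiguous symbols such as N are ignored
    (an assumption of this sketch). Returns 0.0 if nothing is countable.
    """
    counted = [base for base in sequence.upper() if base in "ACGTU"]
    if not counted:
        return 0.0
    gc = sum(1 for base in counted if base in "GC")
    return gc / len(counted)

# Toy sequences, not real transcripts.
print(gc_content("GGCCAATT"))
print(gc_content("acgu"))
```

A per-transcript value like this would then be summarized per species before any phylogenetic signal test is applied; that downstream step is not sketched here.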