1 - 20 of 22
1.
Nat Food ; 4(1): 51-60, 2023 01.
Article En | MEDLINE | ID: mdl-37118575

Achieving food security requires resilient agricultural systems with improved nutrient-use efficiency, optimized water and nutrient storage in soils, and reduced gaseous emissions. Success relies on understanding coupled nitrogen and carbon metabolism in soils, their associated influences on soil structure and the processes controlling nitrogen transformations at scales relevant to microbial activity. Here we show that the influence of organic matter on arable soil nitrogen transformations can be decoded by integrating metagenomic data with soil structural parameters. Our approach provides a mechanistic explanation of why organic matter is effective in reducing nitrous oxide losses while supporting system resilience. The relationship between organic carbon, soil-connected porosity and flow rates at scales relevant to microbes suggests that important increases in nutrient-use efficiency could be achieved at lower organic carbon stocks than currently envisaged.


Nitrogen , Soil , Soil/chemistry , Nitrogen/analysis , Agriculture , Carbon/chemistry , Nitrous Oxide/analysis
2.
G3 (Bethesda) ; 12(10)2022 09 30.
Article En | MEDLINE | ID: mdl-35980174

The assembly of divergent haplotypes using noisy long-read data presents a challenge to the reconstruction of haploid genome assemblies, due to overlapping distributions of technical sequencing error, intralocus genetic variation, and interlocus similarity within these data. Here, we present a comparative analysis of assembly algorithms representing overlap-layout-consensus, repeat graph, and de Bruijn graph methods. We examine how postprocessing strategies attempting to reduce redundant heterozygosity interact with the choice of initial assembly algorithm and ultimately produce a series of chromosome-level assemblies for an agricultural pest, the diamondback moth, Plutella xylostella (L.). We compare evaluation methods and show that BUSCO analyses may overestimate haplotig removal processing in long-read draft genomes, in comparison to a k-mer method. We discuss the trade-offs inherent in assembly algorithm and curation choices and suggest that "best practice" is research question dependent. We demonstrate a link between allelic divergence and allele-derived contig redundancy in final genome assemblies and document the patterns of coding and noncoding diversity between redundant sequences. We also document a link between an excess of nonsynonymous polymorphism and haplotigs that are unresolved by assembly or postassembly algorithms. Finally, we discuss how this phenomenon may have relevance for the usage of noisy long-read genome assemblies in comparative genomics.


Moths , Alleles , Animals , Genomics/methods , Haplotypes , Moths/genetics , Sequence Analysis, DNA
3.
Comput Struct Biotechnol J ; 20: 1914-1924, 2022.
Article En | MEDLINE | ID: mdl-35521547

We present a proof-of-concept implementation of the in-memory computing paradigm that we use to facilitate the analysis of metagenomic sequencing reads. In doing so we compare the performance of POSIX™ file systems and key-value storage for omics data, and we show the potential for integrating high-performance computing (HPC) and cloud native technologies. We show that in-memory key-value storage offers possibilities for improved handling of omics data through more flexible and faster data processing. We envision fully containerized workflows and their deployment in portable micro-pipelines with multiple instances working concurrently with the same distributed in-memory storage. To highlight the potential usage of this technology for event-driven and real-time data processing, we use a biological case study focused on the growing threat of antimicrobial resistance (AMR). We develop a workflow encompassing bioinformatics and explainable machine learning (ML) to predict life expectancy of a population based on the microbiome of its sewage while providing a description of AMR contribution to the prediction. We propose that in the future, performing such analyses in 'real-time' would allow us to assess the potential risk to the population based on changes in the AMR profile of the community.

4.
Plants (Basel) ; 10(12)2021 Dec 09.
Article En | MEDLINE | ID: mdl-34961177

In a changing climate where future food security is a growing concern, researchers are exploring new methods and technologies in the effort to meet ambitious crop yield targets. The application of Artificial Intelligence (AI) including Machine Learning (ML) methods in this area has been proposed as a potential mechanism to support this. This review explores current research in the area to convey the state of the art as to how AI/ML have been used to advance research, gain insights, and generally enable progress in this area. We address the question: can AI improve crops and plant health? We further discriminate the bluster from the lustre by identifying the key challenges that AI has been shown to address, balanced with the potential issues with its usage, and the key requisites for its success. Overall, we hope to raise awareness and, as a result, promote usage of AI-related approaches where they can have appropriate impact to improve practices in agricultural and plant sciences.

5.
Proc Natl Acad Sci U S A ; 118(32)2021 08 10.
Article En | MEDLINE | ID: mdl-34353905

The circadian clock is an important adaptation to life on Earth. Here, we use machine learning to predict complex, temporal, and circadian gene expression patterns in Arabidopsis. Most significantly, we classify circadian genes using DNA sequence features generated de novo from public, genomic resources, facilitating downstream application of our methods with no experimental work or prior knowledge needed. We use local model explanation that is transcript specific to rank DNA sequence features, providing a detailed profile of the potential circadian regulatory mechanisms for each transcript. Furthermore, we can discriminate the temporal phase of transcript expression using the local, explanation-derived, and ranked DNA sequence features, revealing hidden subclasses within the circadian class. Model interpretation/explanation provides the backbone of our methodological advances, giving insight into biological processes and experimental design. Next, we use model interpretation to optimize sampling strategies when we predict circadian transcripts using reduced numbers of transcriptomic timepoints. Finally, we predict the circadian time from a single, transcriptomic timepoint, deriving marker transcripts that are most impactful for accurate prediction; this could facilitate the identification of altered clock function from existing datasets.


Arabidopsis Proteins/genetics , Circadian Clocks/genetics , Circadian Rhythm/physiology , Machine Learning , Models, Biological , Apoproteins/genetics , Arabidopsis/genetics , Arabidopsis/physiology , Circadian Clocks/physiology , Circadian Rhythm/genetics , Ecotype , Gene Expression Profiling , Gene Expression Regulation, Plant , Phytochrome/genetics , Phytochrome A/genetics , Regulatory Sequences, Nucleic Acid
6.
Microbiome ; 9(1): 4, 2021 01 09.
Article En | MEDLINE | ID: mdl-33422152

BACKGROUND: Widespread bioinformatic resource development generates a constantly evolving and abundant landscape of workflows and software. For analysis of the microbiome, workflows typically begin with taxonomic classification of the microorganisms that are present in a given environment. Additional investigation is then required to uncover the functionality of the microbial community, in order to characterize its currently or potentially active biological processes. Such functional analysis of metagenomic data can be computationally demanding for high-throughput sequencing experiments. Instead, we can directly compare sequencing reads to a functionally annotated database. However, since reads frequently match multiple sequences equally well, analyses benefit from a hierarchical annotation tree, e.g. for taxonomic classification where reads are assigned to the lowest taxonomic unit. RESULTS: To facilitate functional microbiome analysis, we re-purpose well-known taxonomic classification tools to allow us to perform direct functional sequencing read classification with the added benefit of a functional hierarchy. To enable this, we develop and present a tree-shaped functional hierarchy representing the molecular function subset of the Gene Ontology annotation structure. We use this functional hierarchy to replace the standard phylogenetic taxonomy used by the classification tools and assign query sequences accurately to the lowest possible molecular function in the tree. We demonstrate this with simulated and experimental datasets, where we reveal new biological insights. CONCLUSIONS: We demonstrate that improved functional classification of metagenomic sequencing reads is possible by re-purposing a range of taxonomic classification tools that are already well-established, in conjunction with either protein or nucleotide reference databases. 
We leverage the advances in speed, accuracy and efficiency that have been made for taxonomic classification and translate these benefits for the rapid functional classification of microbiomes. While we focus on a specific set of commonly used methods, the functional annotation approach has broad applicability across other sequence classification tools. We hope that re-purposing becomes a routine consideration during bioinformatic resource development.


Classification/methods , Computational Biology/methods , High-Throughput Nucleotide Sequencing , Metagenome/genetics , Metagenomics/methods , Microbiota/genetics , Software , Phylogeny
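
The idea in this abstract of assigning a read that matches several reference sequences to the lowest node they share in a hierarchy can be illustrated with a small sketch. This is not the tools' actual implementation; the `PARENT` table, its labels, and both function names are hypothetical and chosen only for illustration.

```python
def path_to_root(node, parent):
    """Return the nodes from `node` up to the root, inclusive."""
    path = [node]
    while parent[node] is not None:
        node = parent[node]
        path.append(node)
    return path


def lowest_common_node(hits, parent):
    """Return the deepest node that is an ancestor of (or equal to) every hit."""
    # Reverse each path so that it starts at the root.
    paths = [path_to_root(hit, parent)[::-1] for hit in hits]
    common = None
    for level in zip(*paths):
        if len(set(level)) != 1:
            break  # the paths diverge below the previous level
        common = level[0]
    return common


# Toy functional hierarchy (hypothetical labels, not real Gene Ontology terms).
PARENT = {
    "molecular_function": None,
    "binding": "molecular_function",
    "catalytic_activity": "molecular_function",
    "hydrolase": "catalytic_activity",
    "peptidase": "hydrolase",
    "lipase": "hydrolase",
}

if __name__ == "__main__":
    # A read matching both a peptidase and a lipase equally well.
    print(lowest_common_node(["peptidase", "lipase"], PARENT))
```

A read with a single unambiguous match keeps its most specific label, while conflicting matches push the assignment up the tree until the labels agree.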
7.
Genome ; 64(4): 467-475, 2021 Apr.
Article En | MEDLINE | ID: mdl-33216660

Genomics is both a data- and compute-intensive discipline. The success of genomics depends on an adequate informatics infrastructure that can address growing data demands and enable a diverse range of resource-intensive computational activities. Designing a suitable infrastructure is a challenging task, and its success largely depends on its adoption by users. In this article, we take a user-centric view of genomics, where users are bioinformaticians, computational biologists, and data scientists. We try to take their point of view on how traditional computational activities for genomics are expanding due to data growth, as well as the introduction of big data and cloud technologies. The changing landscape of computational activities and new user requirements will influence the design of future genomics infrastructures.


Computational Biology/methods , Genomics/methods , Base Sequence , Humans , Software
8.
Sci Rep ; 10(1): 9522, 2020 06 12.
Article En | MEDLINE | ID: mdl-32533004

During the development of new drugs or compounds there is a requirement for preclinical trials, commonly involving animal tests, to ascertain the safety of the compound prior to human trials. Machine learning techniques could provide an in silico alternative to animal models for assessing drug toxicity, thus reducing expensive and invasive animal testing during clinical trials, for drugs that are most likely to fail safety tests. Here we present a machine learning model to predict kidney dysfunction, as a proxy for drug-induced renal toxicity, in rats. To achieve this, we use inexpensive transcriptomic profiles derived from human cell lines after chemical compound treatment to train our models combined with compound chemical structure information. Genomics data, due to its sparse, high-dimensional and noisy nature, presents significant challenges in building trustworthy and transparent machine learning models. Here we address these issues by judiciously building feature sets from heterogeneous sources and coupling them with measures of model uncertainty achieved through Gaussian Process-based Bayesian models. We combine the use of insight into the feature-wise contributions to our predictions with the use of predictive uncertainties recovered from the Gaussian Process to improve the transparency and trustworthiness of the model.


Drug-Related Side Effects and Adverse Reactions/genetics , Gene Expression Profiling , Machine Learning , Models, Theoretical , Animals , Humans , Quality Control , Uncertainty
9.
iScience ; 23(4): 100988, 2020 Apr 24.
Article En | MEDLINE | ID: mdl-32248063

Increasingly available microbial reference data allow interpreting the composition and function of previously uncharacterized microbial communities in detail, via high-throughput sequencing analysis. However, efficient methods for read classification are required when the best database matches for short sequence reads are often shared among multiple reference sequences. Here, we take advantage of the fact that microbial sequences can be annotated relative to established tree structures, and we develop a highly scalable read classifier, PRROMenade, by enhancing the generalized Burrows-Wheeler transform with a labeling step to directly assign reads to the corresponding lowest taxonomic unit in an annotation tree. PRROMenade solves the multi-matching problem while allowing fast variable-size sequence classification for phylogenetic or functional annotation. Our simulations with 5% added differences from reference indicated only 1.5% error rate for PRROMenade functional classification. On metatranscriptomic data PRROMenade highlighted biologically relevant functional pathways related to diet-induced changes in the human gut microbiome.
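
For readers unfamiliar with the Burrows-Wheeler transform that this abstract builds on, a naive sketch of the plain, single-string transform and its inverse follows. It is not PRROMenade's generalized, labeled variant, and it uses quadratic-time sorting of rotations purely for clarity; the function names and the `$` sentinel are assumptions made for this example.

```python
def bwt(text, sentinel="$"):
    """Naive Burrows-Wheeler transform: last column of the sorted rotations."""
    if sentinel in text:
        raise ValueError("text must not contain the sentinel character")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column, sentinel="$"):
    """Invert the transform by repeatedly prepending the last column and sorting."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(char + row for char, row in zip(last_column, table))
    # The original string is the rotation that ends with the sentinel.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


if __name__ == "__main__":
    transformed = bwt("banana")
    print(transformed)
    print(inverse_bwt(transformed))
```

Production read classifiers rely on suffix-array or FM-index constructions rather than explicit rotations; the sketch only shows what the transform computes.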

10.
Microb Genom ; 6(1)2020 01.
Article En | MEDLINE | ID: mdl-31922467

The majority of bacterial genomes have high coding efficiencies, but there are some genomes of intracellular bacteria that have low gene density. The genome of the endosymbiont Sodalis glossinidius contains almost 50 % pseudogenes containing mutations that putatively silence them at the genomic level. We have applied multiple 'omic' strategies, combining Illumina and Pacific Biosciences Single-Molecule Real-Time DNA sequencing and annotation, stranded RNA sequencing and proteome analysis to better understand the transcriptional and translational landscape of Sodalis pseudogenes, and potential mechanisms for their control. Between 53 and 74 % of the Sodalis transcriptome remains active in cell-free culture. The mean sense transcription from coding domain sequences (CDSs) is four times greater than that from pseudogenes. Comparative genomic analysis of six Illumina-sequenced Sodalis isolates from different host Glossina species shows pseudogenes make up ~40 % of the 2729 genes in the core genome, suggesting that they are stable and/or that Sodalis is a recent introduction across the genus Glossina as a facultative symbiont. These data shed further light on the importance of transcriptional and translational control in deciphering host-microbe interactions. The combination of genomics, transcriptomics and proteomics gives a multidimensional perspective for studying prokaryotic genomes with a view to elucidating evolutionary adaptation to novel environmental niches.


Enterobacteriaceae/genetics , Genes, Bacterial , Pseudogenes , Animals , Bacterial Proteins/genetics , Proteome , Sequence Analysis, DNA , Sequence Analysis, RNA , Symbiosis , Transcriptome , Tsetse Flies/microbiology
11.
Parasit Vectors ; 11(1): 549, 2018 Oct 20.
Article En | MEDLINE | ID: mdl-30342535

BACKGROUND: Aedes aegypti is the principal vector of several important arboviruses. Among the methods of vector control to limit transmission of disease are genetic strategies that involve the release of sterile or genetically modified non-biting males, which has generated interest in manipulating mosquito sex ratios. Sex determination in Ae. aegypti is controlled by a non-recombining Y chromosome-like region called the M locus, yet characterisation of this locus has been thwarted by the repetitive nature of the genome. In 2015, an M locus gene named Nix was identified that displays the qualities of a sex determination switch. RESULTS: With the use of a whole-genome bacterial artificial chromosome (BAC) library, we amplified and sequenced a ~200 kb region containing the male-determining gene Nix. In this study, we show that Nix is comprised of two exons separated by a 99 kb intron primarily composed of repetitive DNA, especially transposable elements. CONCLUSIONS: Nix, an unusually large and highly repetitive gene, exhibits features in common with Y chromosome genes in other organisms. We speculate that the lack of recombination at the M locus has allowed the expansion of repeats in a manner characteristic of a sex-limited chromosome, in accordance with proposed models of sex chromosome evolution in insects.


Aedes/genetics , Genome, Insect/genetics , Aedes/physiology , Animals , Base Sequence , Chromosomes, Artificial, Bacterial , Female , Gene Library , Genes, Insect , Genetic Loci , Male , Sex Chromosomes , Sex Determination Processes
12.
J Chem Phys ; 148(24): 241744, 2018 Jun 28.
Article En | MEDLINE | ID: mdl-29960328

Simulation and data analysis have evolved into powerful methods for discovering and understanding molecular modes of action and designing new compounds to exploit these modes. The combination provides a strong impetus to create and exploit new tools and techniques at the interfaces between physics, biology, and data science as a pathway to new scientific insight and accelerated discovery. In this context, we explore the rational design of novel antimicrobial peptides (short protein sequences exhibiting broad activity against multiple species of bacteria). We show how datasets can be harvested to reveal features which inform new design concepts. We introduce new analysis and visualization tools: a graphical representation of the k-mer spectrum as a fundamental property encoded in antimicrobial peptide databases and a data-driven representation to illustrate membrane binding and permeation of helical peptides.


Anti-Bacterial Agents/chemistry , Antimicrobial Cationic Peptides/chemistry , Data Mining , Databases, Protein , Membranes/chemistry , Natural Science Disciplines , Bacteria/metabolism , Drug Discovery , Membranes/metabolism
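
As a minimal sketch of the k-mer spectrum mentioned in this abstract, taken here to mean the table of how often each length-k substring occurs across a set of peptide sequences, the counting step could look like the following. The peptide strings are made up for illustration and the function name is hypothetical; the authors' graphical representation is not reproduced.

```python
from collections import Counter


def kmer_spectrum(sequences, k):
    """Count every overlapping substring of length k across all sequences."""
    counts = Counter()
    for seq in sequences:
        for start in range(len(seq) - k + 1):
            counts[seq[start:start + k]] += 1
    return counts


if __name__ == "__main__":
    # Made-up peptide-like strings, not entries from a real database.
    peptides = ["KKLLKK", "KLLKLL"]
    for kmer, count in kmer_spectrum(peptides, 2).most_common():
        print(kmer, count)
```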
13.
Mol Cell Proteomics ; 15(8): 2554-75, 2016 08.
Article En | MEDLINE | ID: mdl-27226403

Despite 40 years of control efforts, onchocerciasis (river blindness) remains one of the most important neglected tropical diseases, with 17 million people affected. The etiological agent, Onchocerca volvulus, is a filarial nematode with a complex lifecycle involving several distinct stages in the definitive host and blackfly vector. The challenges of obtaining sufficient material have prevented high-throughput studies and the development of novel strategies for disease control and diagnosis. Here, we utilize the closest relative of O. volvulus, the bovine parasite Onchocerca ochengi, to compare stage-specific proteomes and host-parasite interactions within the secretome. We identified a total of 4260 unique O. ochengi proteins from adult males and females, infective larvae, intrauterine microfilariae, and fluid from intradermal nodules. In addition, 135 proteins were detected from the obligate Wolbachia symbiont. Observed protein families that were enriched in all whole body extracts relative to the complete search database included immunoglobulin-domain proteins, whereas redox and detoxification enzymes and proteins involved in intracellular transport displayed stage-specific overrepresentation. Unexpectedly, the larval stages exhibited enrichment for several mitochondrial-related protein families, including members of peptidase family M16 and proteins which mediate mitochondrial fission and fusion. Quantification of proteins across the lifecycle using the Hi-3 approach supported these qualitative analyses. In nodule fluid, we identified 94 O. ochengi secreted proteins, including homologs of transforming growth factor-β and a second member of a novel 6-ShK toxin domain family, which was originally described from a model filarial nematode (Litomosoides sigmodontis). Strikingly, the 498 bovine proteins identified in nodule fluid were strongly dominated by antimicrobial proteins, especially cathelicidins. This first high-throughput analysis of an Onchocerca spp. proteome across the lifecycle highlights its profound complexity and emphasizes the extremely close relationship between O. ochengi and O. volvulus. The insights presented here provide new candidates for vaccine development, drug targeting and diagnostic biomarkers.


Onchocerca/physiology , Onchocerciasis/parasitology , Proteomics/methods , Protozoan Proteins/metabolism , Animals , Cattle , Disease Models, Animal , Female , Gene Expression Regulation, Developmental , Host-Parasite Interactions , Humans , Male , Onchocerca/metabolism , Onchocerciasis/veterinary , Phylogeny , Protein Interaction Maps
14.
Proteomics ; 15(15): 2618-28, 2015 Aug.
Article En | MEDLINE | ID: mdl-25867681

Proteomics data can supplement genome annotation efforts, for example being used to confirm gene models or correct gene annotation errors. Here, we present a large-scale proteogenomics study of two important apicomplexan pathogens: Toxoplasma gondii and Neospora caninum. We queried proteomics data against a panel of official and alternate gene models generated directly from RNA-Seq data, using several newly generated and some previously published MS datasets for this meta-analysis. We identified a total of 201 996 and 39 953 peptide-spectrum matches for T. gondii and N. caninum, respectively, at a 1% peptide FDR threshold. This equated to the identification of 30 494 distinct peptide sequences and 2921 proteins (matches to official gene models) for T. gondii, and 8911 peptides/1273 proteins for N. caninum following stringent protein-level thresholding. We have also identified 289 and 140 loci for T. gondii and N. caninum, respectively, which mapped to RNA-Seq-derived gene models used in our analysis and were apparently absent from the official annotation (release 10 from EuPathDB) of these species. We present several examples in our study where the RNA-Seq evidence can help to correct the current gene model and to discover potential new genes. The findings of this study have been integrated into the EuPathDB. The data have been deposited to the ProteomeXchange with identifiers PXD000297 and PXD000298.


Genomics/methods , Neospora/genetics , Neospora/metabolism , Proteomics/methods , Toxoplasma/genetics , Toxoplasma/metabolism , Amino Acid Sequence , Apicomplexa/genetics , Apicomplexa/metabolism , Databases, Genetic , Genes, Protozoan/genetics , Molecular Sequence Annotation/methods , Molecular Sequence Data , Peptides/genetics , Peptides/metabolism , Proteome/genetics , Proteome/metabolism , Protozoan Proteins/genetics , Protozoan Proteins/metabolism , Sequence Analysis, RNA/methods , Sequence Homology, Amino Acid , Tandem Mass Spectrometry/methods
15.
Proteomics ; 14(23-24): 2731-41, 2014 Dec.
Article En | MEDLINE | ID: mdl-25297486

The recent massive increase in capability for sequencing genomes is producing enormous advances in our understanding of biological systems. However, there is a bottleneck in genome annotation--determining the structure of all transcribed genes. Experimental data from MS studies can play a major role in confirming and correcting gene structure--proteogenomics. However, there are some technical and practical challenges to overcome, since proteogenomics requires pipelines comprising a complex set of interconnected modules as well as bespoke routines, for example in protein inference and statistics. We are introducing a complete, open source pipeline for proteogenomics, called ProteoAnnotator, which incorporates a graphical user interface and implements the Proteomics Standards Initiative mzIdentML standard for each analysis stage. All steps are included as standalone modules with the mzIdentML library, allowing other groups to re-use the whole pipeline or constituent parts within other tools. We have developed new modules for pre-processing and combining multiple search databases, for performing peptide-level statistics on mzIdentML files, for scoring grouped protein identifications matched to a given genomic locus to validate that updates to the official gene models are statistically sound and for mapping end results back onto the genome. ProteoAnnotator is available from http://www.proteoannotator.org/. All MS data have been deposited in the ProteomeXchange with identifiers PXD001042 and PXD001390 (http://proteomecentral.proteomexchange.org/dataset/PXD001042; http://proteomecentral.proteomexchange.org/dataset/PXD001390).


Genomics/methods , Proteins/metabolism , Proteomics/methods , Software
16.
Proteomics ; 14(6): 685-8, 2014 Mar.
Article En | MEDLINE | ID: mdl-24453188

The mzQuantML standard from the HUPO Proteomics Standards Initiative has recently been released, capturing quantitative data about peptides and proteins, following analysis of MS data. We present a Java application programming interface (API) for mzQuantML called jmzQuantML. The API provides robust bridges between Java classes and elements in mzQuantML files and allows random access to any part of the file. The API provides read and write capabilities, and is designed to be embedded in other software packages, enabling mzQuantML support to be added to proteomics software tools (http://code.google.com/p/jmzquantml/). The mzQuantML standard is designed around a multilevel validation system to ensure that files are structurally and semantically correct for different proteomics quantitative techniques. In this article, we also describe a Java software tool (http://code.google.com/p/mzquantml-validator/) for validating mzQuantML files, which is a formal part of the data standard.


Proteins/chemistry , Proteomics/methods , Software , Databases, Protein , Mass Spectrometry/methods , Peptides/chemistry , Programming Languages
17.
Mol Cell Proteomics ; 12(11): 3026-35, 2013 Nov.
Article En | MEDLINE | ID: mdl-23813117

The Proteomics Standards Initiative has recently released the mzIdentML data standard for representing peptide and protein identification results, for example, created by a search engine. When a new standard format is produced, it is important that software tools are available that make it straightforward for laboratory scientists to use it routinely and for bioinformaticians to embed support in their own tools. Here we report the release of several open-source Java-based software packages based on mzIdentML: ProteoIDViewer, mzidLibrary, and mzidValidator. The ProteoIDViewer is a desktop application allowing users to visualize mzIdentML-formatted results originating from any appropriate identification software; it supports visualization of all the features of the mzIdentML format. The mzidLibrary is a software library containing routines for importing data from external search engines, post-processing identification data (such as false discovery rate calculations), combining results from multiple search engines, performing protein inference, setting identification thresholds, and exporting results from mzIdentML to plain text files. The mzidValidator is able to process files and report warnings or errors if files are not correctly formatted or contain some semantic error. We anticipate that these developments will simplify adoption of the new standard in proteomics laboratories and the integration of mzIdentML into other software tools. All three tools are freely available in the public domain.


Peptides/chemistry , Proteins/chemistry , Proteomics/statistics & numerical data , Software , Proteomics/standards , Search Engine
18.
Proteomics ; 13(3-4): 480-92, 2013 Feb.
Article En | MEDLINE | ID: mdl-23319203

The development of the HUPO-Proteomics Standards Initiative standard data formats and Minimum Information About a Proteomics Experiment guidelines facilitates coordination within the scientific community. The data standards provide a framework to exchange and share data regardless of the source instrument or software. Nevertheless, there remains a view that Proteomics Standards Initiative standards are challenging to use and integrate into routine laboratory pipelines. In this article, we review the tools available for integrating the different data standards and building compliant software. These tools are focused on a range of different data types and support different scenarios, intended for software developers or end users, allowing the standards to be used in a straightforward manner.


Electronic Data Processing/standards , Proteomics/standards , Guidelines as Topic , Information Dissemination , Information Management , Reference Standards , User-Computer Interface , Workflow
19.
Proteomics ; 12(6): 790-4, 2012 Mar.
Article En | MEDLINE | ID: mdl-22539429

We present a Java application programming interface (API), jmzIdentML, for the Human Proteome Organisation (HUPO) Proteomics Standards Initiative (PSI) mzIdentML standard for peptide and protein identification data. The API combines the power of Java Architecture for XML Binding (JAXB) and an XPath-based random-access indexer to allow a fast and efficient mapping of extensible markup language (XML) elements to Java objects. The internal references in the mzIdentML files are resolved in an on-demand manner, where the whole file is accessed as a random-access swap file, and only the relevant piece of XML is selected for mapping to its corresponding Java object. The API is highly efficient in its memory usage and can handle files of arbitrary sizes. The API follows the official release of the mzIdentML (version 1.1) specifications and is available in the public domain under a permissive licence at http://www.code.google.com/p/jmzidentml/.


Proteins/chemistry , Proteomics/methods , Proteomics/standards , Software , Amino Acid Sequence , Databases, Protein , Humans , Molecular Sequence Data , Peptides/chemistry , Proteome/chemistry , Software/standards
20.
J Proteome Res ; 10(4): 2088-94, 2011 Apr 01.
Article En | MEDLINE | ID: mdl-21222473

Confident identification of peptides via tandem mass spectrometry underpins modern high-throughput proteomics. This has motivated considerable recent interest in the postprocessing of search engine results to increase confidence and calculate robust statistical measures, for example through the use of decoy databases to calculate false discovery rates (FDR). FDR-based analyses allow for multiple testing and can assign a single confidence value for both sets and individual peptide spectrum matches (PSMs). We recently developed an algorithm for combining the results from multiple search engines, integrating FDRs for sets of PSMs made by different search engine combinations. Here we describe a web-server and a downloadable application that makes this routinely available to the proteomics community. The web server offers a range of outputs including informative graphics to assess the confidence of the PSMs and any potential biases. The underlying pipeline also provides a basic protein inference step, integrating PSMs into protein ambiguity groups where peptides can be matched to more than one protein. Importantly, we have also implemented full support for the mzIdentML data standard, recently released by the Proteomics Standards Initiative, providing users with the ability to convert native formats to mzIdentML files, which are available to download.


Algorithms , Peptides/analysis , Search Engine , Tandem Mass Spectrometry/instrumentation , Tandem Mass Spectrometry/methods , Databases, Protein , Humans , Internet , Proteomics/instrumentation , Proteomics/methods , User-Computer Interface
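
The decoy-database approach to false discovery rates mentioned in this abstract can be sketched with the simplest common estimator: at a given score threshold, divide the number of accepted decoy matches by the number of accepted target matches. This is an illustrative assumption, not the authors' multi-search-engine algorithm; the function name, the `(score, is_decoy)` tuple layout, and the toy scores are hypothetical.

```python
def fdr_at_threshold(psms, threshold):
    """Estimate FDR as decoys / targets among PSMs scoring at or above threshold.

    `psms` is an iterable of (score, is_decoy) pairs; higher scores are better.
    Returns 0.0 when no target PSM passes (a convention chosen for this sketch).
    """
    targets = 0
    decoys = 0
    for score, is_decoy in psms:
        if score >= threshold:
            if is_decoy:
                decoys += 1
            else:
                targets += 1
    if targets == 0:
        return 0.0
    return decoys / targets


# Toy peptide-spectrum matches (made-up scores, not real search-engine output).
PSMS = [
    (95, False), (90, False), (85, True), (80, False),
    (70, False), (60, True), (50, False),
]

if __name__ == "__main__":
    for cutoff in (90, 80, 50):
        print(cutoff, round(fdr_at_threshold(PSMS, cutoff), 3))
```

Lowering the threshold admits more matches and typically more decoys, which is why the estimate rises as the cutoff falls in the toy data.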
...