ABSTRACT
Antimicrobial peptides (AMPs) are a heterogeneous group of short polypeptides that target not only microorganisms but also viruses and cancer cells. Due to their lower selection for resistance compared with traditional antibiotics, AMPs have been attracting ever-growing attention from researchers, including bioinformaticians. Machine learning represents the most cost-effective method for novel AMP discovery, and consequently many computational tools for AMP prediction have recently been developed. In this article, we investigate the impact of negative data sampling on model performance and benchmarking. We generated 660 predictive models using 12 machine learning architectures, a single positive data set and 11 negative data sampling methods; the architectures and methods were defined on the basis of published AMP prediction software. Our results clearly indicate that similar training and benchmark data sets, i.e. those produced by the same or a similar negative data sampling method, positively affect model performance. Consequently, all the benchmark analyses that have been performed for AMP prediction models are significantly biased and, moreover, we do not know which model is the most accurate. To provide researchers with reliable information about the performance of AMP predictors, we also created a web server, AMPBenchmark, for fair model benchmarking. AMPBenchmark is available at http://BioGenies.info/AMPBenchmark.
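To make the idea of negative data sampling concrete, the following is a minimal Python sketch of two ways a negative (non-AMP) set could be drawn for the same positive set. The abstract does not describe the 11 sampling methods, so both strategies here, the function names, and the toy sequences are illustrative assumptions and are not taken from the article or from any published AMP predictor.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def length_matched_negatives(positives, background, seed=0):
    """Hypothetical strategy A: for each positive peptide, cut a window of
    the same length from a randomly chosen background protein."""
    rng = random.Random(seed)
    negatives = []
    for peptide in positives:
        n = len(peptide)
        # Keep only background proteins long enough to hold an n-residue window.
        candidates = [p for p in background if len(p) >= n]
        if not candidates:
            raise ValueError(f"no background protein of length >= {n}")
        protein = rng.choice(candidates)
        start = rng.randrange(len(protein) - n + 1)
        negatives.append(protein[start:start + n])
    return negatives


def whole_protein_negatives(positives, background, seed=0):
    """Hypothetical strategy B: draw whole background proteins, one per
    positive, ignoring the length distribution of the positives."""
    rng = random.Random(seed)
    return rng.sample(background, k=len(positives))


# Toy data invented for illustration only (not real AMPs or real proteins).
toy_rng = random.Random(1)
background = [
    "".join(toy_rng.choice(AMINO_ACIDS) for _ in range(n)) for n in (40, 60, 80)
]
positives = ["KKLLKKLLKKLL", "GLFKKIGKAVGKHL"]

matched = length_matched_negatives(positives, background)
whole = whole_protein_negatives(positives, background)
```

With strategy A the negatives share the length distribution of the positives; with strategy B they do not, so a model could separate the classes on length alone. A model trained on one kind of negative set and benchmarked on the other therefore faces a different task from one trained and benchmarked on the same kind, which is the sort of mismatch the abstract reports as a source of bias.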
Subjects
Antimicrobial Peptides, Benchmarking, Anti-Bacterial Agents, Peptides/chemistry
ABSTRACT
SUMMARY: Antimicrobial peptides (AMPs) are the key components of the innate immune system that protect against pathogens, regulate the microbiome and are promising targets for pharmaceutical research. Computational tools based on machine learning have the potential to aid discovery of genes encoding novel AMPs but existing approaches are not designed for genome-wide scans. To facilitate such genome-wide discovery of AMPs we developed a fast and accurate AMP classification framework, ampir. ampir is designed for high throughput, integrates well with existing bioinformatics pipelines, and has much higher classification accuracy than existing methods when applied to whole genome data. AVAILABILITY AND IMPLEMENTATION: ampir is implemented primarily in R with core feature calculation methods written in C++. Release versions are available via CRAN and work on all major operating systems. The development version is maintained at https://github.com/legana/ampir. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subjects
Genome, Software, Machine Learning, Pore Forming Cytotoxic Proteins
ABSTRACT
The salivary apparatus of the common octopus (Octopus vulgaris) has been the subject of biochemical study for over a century. A combination of bioassays, behavioral studies and molecular analysis on O. vulgaris and related species suggests that its proteome should contain a mixture of highly potent neurotoxins and degradative proteins. However, a lack of genomic and transcriptomic data has meant that the amino acid sequences of these proteins remain almost entirely unknown. To address this, we assembled the posterior salivary gland transcriptome of O. vulgaris and combined it with high-resolution mass spectrometry data from the posterior and anterior salivary glands of two adults, the posterior salivary glands of six paralarvae and the saliva from a single adult. We identified a total of 2810 protein groups from across this range of salivary tissues and age classes, including 84 with homology to known venom protein families. Additionally, we found 21 short, secreted, cysteine-rich protein groups, of which 12 were specific to cephalopods. By combining protein expression data with phylogenetic analysis, we demonstrate that serine proteases expanded dramatically within the cephalopod lineage and that cephalopod-specific proteins are strongly associated with the salivary apparatus.