ABSTRACT
Supervised machine learning algorithms are used by life scientists for a variety of objectives. Expert-curated public gene and protein databases are major resources for gathering data to train these algorithms. Although these resources are continuously updated, the updates are generally not incorporated into published machine learning algorithms, which can therefore become outdated soon after their introduction. In this paper, we propose a new model of operation for supervised machine learning algorithms that learn from genomic data. By defining these algorithms in a pipeline in which the training data gathering procedure and the learning process are automated, one can create a system that generates a classifier or predictor using information available from public resources. The proposed model is explained using three case studies on SignalP, MemLoci, and ApicoAP in which existing machine learning models are utilized in pipelines. Given that the vast majority of the procedures described for gathering training data can easily be automated, it is possible to transform valuable machine learning algorithms into self-evolving learners that benefit from the ever-changing data available for gene products and to develop new machine learning algorithms that are similarly capable.
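The pipeline idea in this abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the in-memory `database` stands in for a public resource, and `train_threshold` is a toy learner based on hydrophobic residue fraction, not the method used by SignalP, MemLoci, or ApicoAP. The point is only that rerunning `rebuild()` after the data changes yields an updated classifier without manual intervention.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Record = Tuple[str, int]           # (amino-acid sequence, label)
Classifier = Callable[[str], int]

HYDROPHOBIC = set("AILMFVW")       # illustrative residue set (assumption)


def hydrophobic_fraction(seq: str) -> float:
    """Fraction of residues in seq that belong to HYDROPHOBIC."""
    return sum(r in HYDROPHOBIC for r in seq) / len(seq)


def train_threshold(records: Sequence[Record]) -> Classifier:
    """Toy learner: cut at the midpoint between the two class means."""
    pos = [hydrophobic_fraction(s) for s, y in records if y == 1]
    neg = [hydrophobic_fraction(s) for s, y in records if y == 0]
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda seq: int(hydrophobic_fraction(seq) >= cut)


@dataclass
class Pipeline:
    """Couples automated data gathering with automated training."""
    gather: Callable[[], Sequence[Record]]   # hypothetical data-gathering step
    train: Callable[[Sequence[Record]], Classifier]

    def rebuild(self) -> Classifier:
        # Pull the current data and train a fresh classifier from it.
        return self.train(self.gather())


# Stand-in for a curated public database (made-up records).
database: List[Record] = [("AILMFVWA", 1), ("DEKRSTNQ", 0)]
pipeline = Pipeline(gather=lambda: list(database), train=train_threshold)

clf_v1 = pipeline.rebuild()
database.append(("AALKDDEE", 1))   # simulated curation update
clf_v2 = pipeline.rebuild()        # same pipeline, newer data
```

In a real setting, `gather` would query the public resource and apply the filtering steps that the original method describes; scheduling `rebuild()` periodically is one way to keep the classifier current.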
Subjects
Genomics/methods , Automated Pattern Recognition/methods , Algorithms , Artificial Intelligence , Genetic Databases , Theoretical Models , Software
ABSTRACT
BACKGROUND: Computational identification of apicoplast-targeted proteins is important in drug target determination for diseases such as malaria. While there are established methods for identifying proteins with a bipartite signal in multiple species of Apicomplexa, not all apicoplast-targeted proteins possess this bipartite signature. The publication of recent experimental findings of apicoplast membrane proteins, called transmembrane proteins, that do not possess a bipartite signal has made it feasible to devise a machine learning approach for identifying this new class of apicoplast-targeted proteins computationally. METHODOLOGY/PRINCIPAL FINDINGS: In this work, we develop a method for predicting apicoplast-targeted transmembrane proteins for multiple species of Apicomplexa, whereby several classifiers trained on different feature sets and based on different algorithms are evaluated and combined in an ensemble classification model to obtain the best expected performance. The feature sets considered are the hydrophobicity and composition characteristics of amino acids over transmembrane domains, the existence of short sequence motifs over cytosolically disposed regions, and Gene Ontology (GO) terms associated with given proteins. Our model, ApicoAMP, is an ensemble classification model that combines decisions of classifiers following the majority vote principle. ApicoAMP is trained on a set of proteins from 11 apicomplexan species and achieves 91% overall expected accuracy. CONCLUSIONS/SIGNIFICANCE: ApicoAMP is the first computational model capable of identifying apicoplast-targeted transmembrane proteins in Apicomplexa. The ApicoAMP prediction software is available at http://code.google.com/p/apicoamp/ and http://bcb.eecs.wsu.edu.
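The majority vote principle mentioned above can be sketched as follows. This is a hedged illustration, not the ApicoAMP implementation: the three base classifiers, the field names (`tm_hydrophobicity`, `cytosolic_region`, `go_terms`), the motif `"YG"`, and the identifier `"GO:0000000"` are all hypothetical placeholders that only mirror the three feature families named in the abstract.

```python
from collections import Counter
from typing import Callable, Dict, Sequence

Protein = Dict[str, object]
Classifier = Callable[[Protein], int]


def by_hydrophobicity(p: Protein) -> int:
    # Placeholder rule on a precomputed transmembrane hydrophobicity score.
    return int(p["tm_hydrophobicity"] > 0.5)


def by_motif(p: Protein) -> int:
    # Placeholder rule: presence of a made-up short motif.
    return int("YG" in p["cytosolic_region"])


def by_go_term(p: Protein) -> int:
    # Placeholder rule: presence of a made-up GO identifier.
    return int("GO:0000000" in p["go_terms"])


def majority_vote(classifiers: Sequence[Classifier], p: Protein) -> int:
    """Return the label predicted by the largest number of classifiers."""
    votes = Counter(clf(p) for clf in classifiers)
    return votes.most_common(1)[0][0]


ensemble = [by_hydrophobicity, by_motif, by_go_term]

candidate = {
    "tm_hydrophobicity": 0.7,
    "cytosolic_region": "MKYGLL",
    "go_terms": set(),
}
label = majority_vote(ensemble, candidate)   # two of three classifiers vote 1
```

Using an odd number of binary classifiers avoids ties; with an even number, a tie-breaking rule would have to be chosen explicitly.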
Subjects
Apicomplexa/genetics , Apicoplasts/genetics , Computational Biology/methods , Membrane Proteins/genetics , Protozoan Proteins/genetics , Amino Acid Motifs , Amino Acids/analysis , Amino Acids/genetics , Apicomplexa/chemistry , Apicoplasts/chemistry , Hydrophobic and Hydrophilic Interactions , Membrane Proteins/chemistry , Protein Transport , Protozoan Proteins/chemistry
ABSTRACT
BACKGROUND: Most of the parasites of the phylum Apicomplexa contain a relict prokaryotic-derived plastid called the apicoplast. This organelle is not only important for the survival of the parasite; its unique properties also make it an ideal drug target. The majority of apicoplast-associated proteins are nuclear encoded and targeted post-translationally to the organellar lumen via a bipartite signaling mechanism that requires an N-terminal signal and transit peptide (TP). Attempts to define a consensus motif that universally identifies apicoplast TPs have failed. METHODOLOGY/PRINCIPAL FINDINGS: In this study, we propose a generalized rule-based classification model to identify apicoplast-targeted proteins (ApicoTPs) that use a bipartite signaling mechanism. Given a training set specific to an organism, this model, called ApicoAP, incorporates a procedure based on a genetic algorithm to tailor a discriminating rule that exploits the known characteristics of ApicoTPs. Performance of ApicoAP is evaluated for four labeled datasets of Plasmodium falciparum, Plasmodium yoelii, Babesia bovis, and Toxoplasma gondii proteins. ApicoAP improves the classification accuracy on the published P. falciparum dataset to 94%, up from 90% with PlasmoAP. CONCLUSIONS/SIGNIFICANCE: We present a parametric model for ApicoTPs and a procedure to optimize the model parameters for a given training set. A major asset of this model is that it is customizable to different parasite genomes. The ApicoAP prediction software is available at http://code.google.com/p/apicoap/ and http://bcb.eecs.wsu.edu.
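The idea of using a genetic algorithm to tune the parameters of a discriminating rule for a given training set can be sketched in a few lines. This is a toy under stated assumptions, not the ApicoAP rule or its optimizer: the rule form (fraction of K/R/N residues in a leading window must reach a threshold), the residue set, the mutation and crossover operators, and the training sequences are all invented for illustration.

```python
import random
from typing import List, Sequence, Tuple

Record = Tuple[str, int]    # (sequence, label); made-up data below
Rule = Tuple[int, float]    # (window length, minimum enriched fraction)

ENRICHED = set("KRN")       # illustrative residue set (assumption)


def predict(rule: Rule, seq: str) -> int:
    window, min_frac = rule
    head = seq[:window]
    frac = sum(r in ENRICHED for r in head) / len(head)
    return int(frac >= min_frac)


def fitness(rule: Rule, data: Sequence[Record]) -> float:
    """Training accuracy of the rule."""
    return sum(predict(rule, s) == y for s, y in data) / len(data)


def mutate(rule: Rule, rng: random.Random) -> Rule:
    window, min_frac = rule
    window = max(1, window + rng.choice([-2, -1, 0, 1, 2]))
    min_frac = min(1.0, max(0.0, min_frac + rng.uniform(-0.1, 0.1)))
    return window, min_frac


def crossover(a: Rule, b: Rule) -> Rule:
    # Take the window from one parent and the threshold from the other.
    return a[0], b[1]


def evolve(data: Sequence[Record], generations: int = 30,
           pop_size: int = 20, seed: int = 0) -> Rule:
    rng = random.Random(seed)
    population: List[Rule] = [(rng.randint(1, 20), rng.random())
                              for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda r: fitness(r, data), reverse=True)
        parents = population[: pop_size // 2]   # elitism: keep the top half
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents)), rng)
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children
    return max(population, key=lambda r: fitness(r, data))


training: List[Record] = [
    ("KKNRKNKLLA", 1), ("NKRKNNKAAL", 1),
    ("DEDELLAGGS", 0), ("AGLLEDSSTD", 0),
]
best_rule = evolve(training)
```

Because the parents are carried over unchanged each generation, the best training accuracy never decreases; supplying a different organism-specific training set to `evolve` yields a rule tailored to that organism, which is the customization property the abstract highlights.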