Negative example selection for protein function prediction: the NoGO database.

Youngs, Noah; Penfold-Brown, Duncan; Bonneau, Richard; Shasha, Dennis

Affiliation

Youngs N; Department of Computer Science, New York University, New York, New York, United States of America.
Penfold-Brown D; Social Media and Political Participation Lab, New York University, New York, New York, United States of America.
Bonneau R; Department of Computer Science, New York University, New York, New York, United States of America; Department of Biology, New York University, New York, New York, United States of America; Center for Genomics and Systems Biology, Department of Biology, New York University, New York, New York, United
Shasha D; Department of Computer Science, New York University, New York, New York, United States of America; Center for Genomics and Systems Biology, Department of Biology, New York University, New York, New York, United States of America.

PLoS Comput Biol; 10(6): e1003644, 2014 Jun.

Article in En | MEDLINE | ID: mdl-24922051

ABSTRACT

Negative examples - genes that are known not to carry out a given protein function - are rarely recorded in genome and proteome annotation databases, such as the Gene Ontology database. Negative examples are required, however, for several of the most powerful machine learning methods for integrative protein function prediction. Most protein function prediction efforts have relied on a variety of heuristics for the choice of negative examples. Determining the accuracy of methods for negative example prediction is itself a non-trivial task, given that the Open World Assumption as applied to gene annotations rules out many traditional validation metrics. We present a rigorous comparison of these heuristics, utilizing a temporal holdout, and a novel evaluation strategy for negative examples. We add to this comparison several algorithms adapted from Positive-Unlabeled learning scenarios in text-classification, which are the current state of the art methods for generating negative examples in low-density annotation contexts. Lastly, we present two novel algorithms of our own construction, one based on empirical conditional probability, and the other using topic modeling applied to genes and annotations. We demonstrate that our algorithms achieve significantly fewer incorrect negative example predictions than the current state of the art, using multiple benchmarks covering multiple organisms. Our methods may be applied to generate negative examples for any type of method that deals with protein function, and to this end we provide a database of negative examples in several well-studied organisms, for general use (The NoGO database, available at bonneaulab.bio.nyu.edu/nogo.html).
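The abstract names an approach based on empirical conditional probability but does not spell out its details here. The following is only a minimal sketch of the general idea, not the authors' algorithm: it assumes a hypothetical `annotations` mapping from gene identifiers to sets of annotation terms, estimates p(target | term) from co-occurrence counts, and ranks unannotated genes by the mean of those probabilities, so the lowest-scoring genes become candidate negatives. The function names, the mean-based score, and the toy data are all illustrative assumptions. The last helper mirrors the temporal-holdout idea by counting predicted negatives that later acquire the target annotation.

```python
from collections import defaultdict


def conditional_probs(annotations, target):
    """Estimate p(target | term) from co-occurrence counts over genes."""
    term_count = defaultdict(int)   # genes carrying each non-target term
    joint_count = defaultdict(int)  # genes carrying both the term and the target
    for terms in annotations.values():
        has_target = target in terms
        for term in terms:
            if term == target:
                continue
            term_count[term] += 1
            if has_target:
                joint_count[term] += 1
    return {t: joint_count[t] / term_count[t] for t in term_count}


def rank_negatives(annotations, target, n):
    """Return the n genes least likely to carry `target` (assumed scoring rule)."""
    probs = conditional_probs(annotations, target)
    scores = {}
    for gene, terms in annotations.items():
        # Skip known positives and genes with no annotations to score from.
        if target in terms or not terms:
            continue
        scores[gene] = sum(probs[t] for t in terms) / len(terms)
    # Lowest score first; ties are broken by gene name for determinism.
    return sorted(scores, key=lambda g: (scores[g], g))[:n]


def false_negative_count(predicted, later_annotations, target):
    """Count predicted negatives that gained `target` in a later snapshot."""
    return sum(1 for g in predicted if target in later_annotations.get(g, set()))


# Toy data, invented purely for illustration.
annotations = {
    "g1": {"A", "T"},
    "g2": {"A", "T"},
    "g3": {"A", "B"},
    "g4": {"B", "C"},
    "g5": {"C"},
}
negatives = rank_negatives(annotations, "T", 2)
later = {"g5": {"C", "T"}}  # hypothetical later snapshot
errors = false_negative_count(negatives, later, "T")
```

In this toy case the term "A" co-occurs with the target in two of three genes, so "g3" scores higher than "g4" and "g5" and is left out of the negative set. Under the Open World Assumption, a later snapshot can only reveal negatives that were wrong, never confirm that one was right, which is why the sketch counts errors rather than computing accuracy.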

Subjects

Algorithms; Databases, Genetic; Gene Ontology; Proteins/genetics; Proteins/physiology; Animals; Arabidopsis Proteins/genetics; Arabidopsis Proteins/physiology; Artificial Intelligence; Computational Biology; Genome; Humans; Mice; Molecular Sequence Annotation; Proteome; Saccharomyces cerevisiae Proteins/genetics; Saccharomyces cerevisiae Proteins/physiology

Full text: 1 | Collections: 01-international | Database: MEDLINE | Main subject: Algorithms / Proteins / Databases, Genetic / Gene Ontology | Type of study: Prognostic_studies / Risk_factors_studies | Limits: Animals / Humans | Language: En | Journal: PLoS Comput Biol | Journal subject: BIOLOGY / MEDICAL INFORMATICS | Year of publication: 2014 | Document type: Article | Affiliation country: United States
