Problems with the nested granularity of feature domains in bioinformatics: the eXtasy case.

Popovic, Dusan; Sifrim, Alejandro; Davis, Jesse; Moreau, Yves; De Moor, Bart

BMC Bioinformatics ; 16 Suppl 4: S2, 2015.

Article in English | MEDLINE | ID: mdl-25734591

ABSTRACT

BACKGROUND: Data from biomedical domains often have an inherent hierarchical structure. As this structure is usually implicit, its existence can be overlooked by practitioners interested in constructing and evaluating predictive models from such data. Ignoring this structure leads to a potentially problematic and routinely unrecognized bias in the models and results. In this work, we discuss this bias in detail and propose a simple, sampling-based solution for it. Next, we explore its sources and extent on synthetic data. Finally, we demonstrate how the state-of-the-art variant prioritization framework, eXtasy, benefits from using the described approach in its Random forest-based core classification model.

RESULTS AND CONCLUSIONS: The conducted simulations clearly indicate that the heterogeneous granularity of feature domains poses significant problems for both the standard Random forest classifier and a modification that relies on stratified bootstrapping. Conversely, using the proposed sampling scheme when training the classifier mitigates the described bias. Furthermore, when applied to the eXtasy data under a realistic class distribution scenario, a Random forest learned using the proposed sampling scheme displays much better precision than its standard version, without degrading recall. Moreover, the largest performance gains are achieved in the most important part of the operating range: the top of the prioritized gene list.
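
The abstract does not spell out the proposed sampling scheme, so the following is only a rough illustration of the general idea of sampling at the coarser level of a hierarchy, not the authors' method. It assumes each row (for example, a variant) carries a hypothetical coarse-level key (for example, a gene identifier) and draws a bootstrap sample by resampling whole groups, so that rows sharing group-level feature values never end up on both the in-bag and out-of-bag side. The function name `group_bootstrap` and the example keys are invented for this sketch.

```python
import random
from collections import defaultdict


def group_bootstrap(groups, rng):
    """Bootstrap row indices by resampling whole groups (illustrative only).

    groups[i] is the hypothetical coarse-level key (e.g. a gene ID) of row i.
    Returns (in_bag, out_of_bag) lists of row indices.
    """
    rows_by_group = defaultdict(list)
    for idx, key in enumerate(groups):
        rows_by_group[key].append(idx)

    keys = sorted(rows_by_group)
    # Sample groups with replacement, as many draws as there are groups.
    drawn = [rng.choice(keys) for _ in keys]

    # Every row of a drawn group goes in-bag (repeated if drawn repeatedly).
    in_bag = [i for k in drawn for i in rows_by_group[k]]
    # Rows of groups that were never drawn form the out-of-bag set.
    drawn_set = set(drawn)
    out_of_bag = [i for k in keys if k not in drawn_set for i in rows_by_group[k]]
    return in_bag, out_of_bag


if __name__ == "__main__":
    genes = ["A", "A", "A", "B", "B", "C", "D", "D"]  # made-up keys
    in_bag, oob = group_bootstrap(genes, random.Random(0))
    print("in-bag rows:", in_bag)
    print("out-of-bag rows:", oob)
```

A row-level bootstrap on the same data could place one variant of gene "A" in-bag and another out-of-bag; because both share identical gene-level feature values, out-of-bag error estimates would then look better than they should.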

Subjects

Algorithms; Computational Biology/methods; Models, Theoretical; Proteins/analysis; Computer Simulation; Databases, Factual; Humans; Mutation/genetics; Proteins/genetics

Full text: 1 Database: MEDLINE Main subject: Algorithms / Proteins / Computational Biology / Models, Theoretical Type of study: Prognostic_studies Limits: Humans Language: En Publication year: 2015 Document type: Article
