Detecting biased validation of predictive models in the positive-unlabeled setting: disease gene prioritization case study.

Molotkov, Ivan; Artomov, Mykyta

Affiliation

Molotkov I; The Steve and Cindy Rasmussen Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, OH, United States.
Artomov M; Department of Pediatrics, The Ohio State University, Columbus, OH, United States.

Bioinform Adv ; 3(1): vbad128, 2023.

Article in English | MEDLINE | ID: mdl-37745001

ABSTRACT

Motivation: Positive-unlabeled data consists of points with either positive or unknown labels. It is widespread in medical, genetic, and biological settings, creating a high demand for predictive positive-unlabeled models. The performance of such models is usually estimated using validation sets, assumed to be selected completely at random (SCAR) from known positive examples. For certain metrics, this assumption enables unbiased performance estimation when treating positive-unlabeled data as positive/negative. However, the SCAR assumption is often adopted without proper justification, simply for the sake of convenience.

Results: We provide an algorithm that, under the weak assumption of a lower bound on the number of positive examples, can test for violation of the SCAR assumption. Applying it to the problem of gene prioritization for complex genetic traits, we illustrate that the SCAR assumption is often violated there, causing inflation of performance estimates, which we refer to as validation bias. We estimate the potential impact of validation bias on performance estimation. Our analysis reveals that validation bias is widespread in gene prioritization data and can lead to significant overestimation of model performance. This finding helps explain the discrepancy between the reported good performance of models and their limited practical applications.

Availability and implementation: Python code with examples of application of the validation bias detection algorithm is available at github.com/ArtomovLab/ValidationBias.
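
To make the idea of validation bias concrete, the following toy simulation compares a recall estimate obtained from SCAR-labeled positives with one obtained from positives whose chance of being labeled grows with the model score. It is only an illustrative sketch and not the detection algorithm from the paper (that code is at github.com/ArtomovLab/ValidationBias). The function name `simulate_validation_bias`, the Gaussian score distributions, the top-10% cutoff, and the logistic labeling rule are all assumptions chosen for illustration.

```python
import math
import random


def simulate_validation_bias(n_pos=2000, n_neg=18000, seed=0):
    """Toy comparison of recall estimates under SCAR and biased labeling."""
    rng = random.Random(seed)

    # Hypothetical model scores: true positives tend to score higher.
    pos_scores = [rng.gauss(1.0, 1.0) for _ in range(n_pos)]
    neg_scores = [rng.gauss(0.0, 1.0) for _ in range(n_neg)]

    # Treat the top 10% of all scored items as "predicted positive".
    ranked = sorted(pos_scores + neg_scores, reverse=True)
    threshold = ranked[len(ranked) // 10]

    def recall(scores):
        return sum(s > threshold for s in scores) / len(scores)

    # SCAR labeling: every true positive is equally likely to be known.
    scar_labeled = rng.sample(pos_scores, n_pos // 3)

    # Biased labeling (assumed rule): high-scoring positives are more
    # likely to be among the known positives used for validation.
    biased_labeled = [
        s for s in pos_scores
        if rng.random() < 1.0 / (1.0 + math.exp(-2.0 * (s - 1.0)))
    ]

    return {
        "true_recall": recall(pos_scores),
        "scar_estimate": recall(scar_labeled),
        "biased_estimate": recall(biased_labeled),
    }


result = simulate_validation_bias()
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

In this setup the SCAR-based estimate should stay close to the recall over all true positives, while the estimate from the biased validation set should be noticeably higher, which is the kind of inflation the abstract calls validation bias.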

Full text: 1 | Collection: 01-internacional | Database: MEDLINE | Study type: Prognostic_studies / Risk_factors_studies | Language: En | Journal: Bioinform Adv | Year: 2023 | Document type: Article | Country of affiliation: United States