A cautionary tale about properly vetting datasets used in supervised learning predicting metabolic pathway involvement.

Huckvale, Erik D; Moseley, Hunter N B

Affiliation

Huckvale ED; Markey Cancer Center, University of Kentucky, Lexington, Kentucky, United States of America.
Moseley HNB; Markey Cancer Center, University of Kentucky, Lexington, Kentucky, United States of America.

PLoS One ; 19(5): e0299583, 2024.

Article in English | MEDLINE | ID: mdl-38696410

ABSTRACT

The mapping of metabolite-specific data to pathways within cellular metabolism is a major data analysis step needed for biochemical interpretation. A variety of machine learning approaches, particularly deep learning approaches, have been used to predict these metabolite-to-pathway mappings, utilizing a training dataset of known metabolite-to-pathway mappings. A few such training datasets have been derived from the Kyoto Encyclopedia of Genes and Genomes (KEGG). However, several previously published machine learning approaches utilized an erroneous KEGG-derived training dataset that used SMILES molecular representation strings (the KEGG-SMILES dataset) and contained a sizable proportion (~26%) of duplicate entries. The presence of so many duplicates taints the training and testing sets generated from k-fold cross-validation of the KEGG-SMILES dataset: a duplicated entry can appear in both the training set and the testing set of the same fold. Therefore, the k-fold cross-validation performance of the resulting machine learning models was grossly inflated by the erroneous presence of these duplicate entries. Here we describe and evaluate the KEGG-SMILES dataset so that others may avoid using it. We also identify the prior publications that utilized this erroneous KEGG-SMILES dataset so their machine learning results can be properly and critically evaluated. In addition, we demonstrate the reduction of model k-fold cross-validation (CV) performance after de-duplicating the KEGG-SMILES dataset. This is a cautionary tale about properly vetting previously published benchmark datasets before using them in machine learning approaches. We hope others will avoid similar mistakes.
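The leakage mechanism described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code and does not use the KEGG-SMILES dataset: the entries are hypothetical placeholder strings, the 26% duplication rate is built in only to mirror the figure quoted above, and the helper names (`kfold_indices`, `leaked_fraction`, `deduplicate`) are assumptions made for illustration. Duplicates are detected by exact string match; real SMILES data would likely need canonicalization first, since one molecule can be written as several different SMILES strings.

```python
import random


def kfold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def leaked_fraction(entries, k=5, seed=0):
    """Fraction of entries that, when held out for testing, also occur
    verbatim in the corresponding training folds."""
    folds = kfold_indices(len(entries), k, seed)
    leaked = 0
    for test in folds:
        test_set = set(test)
        train_values = {
            entries[i] for i in range(len(entries)) if i not in test_set
        }
        leaked += sum(entries[i] in train_values for i in test)
    return leaked / len(entries)


def deduplicate(entries):
    """Drop exact duplicates while keeping first-seen order."""
    return list(dict.fromkeys(entries))


# Hypothetical toy data: 74 unique placeholder entries plus 26 repeats,
# so 26% of the 100 entries are duplicates (mirroring the abstract's figure).
unique = [f"mol_{i}" for i in range(74)]
raw = unique + unique[:26]
clean = deduplicate(raw)

print("entries before / after de-duplication:", len(raw), len(clean))
print("leaked test fraction (raw):  ", leaked_fraction(raw))
print("leaked test fraction (clean):", leaked_fraction(clean))
```

With duplicates present, some held-out entries have an identical copy in the training folds, so a model can score well on them by memorization alone. After de-duplication no held-out entry has an exact copy in training, which is the condition under which the cross-validation estimate becomes meaningful.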

Subject(s)

Metabolic Networks and Pathways; Supervised Machine Learning; Humans; Datasets as Topic

Full text: 1 | Collection: 01-international | Database: MEDLINE | Main subject: Metabolic Networks and Pathways / Supervised Machine Learning | Limits: Humans | Language: English | Journal: PLoS One | Journal subject: SCIENCE / MEDICINE | Year: 2024 | Document type: Article | Country of affiliation: United States