A cautionary tale about properly vetting datasets used in supervised learning predicting metabolic pathway involvement.

Huckvale, Erik D; Moseley, Hunter N B

Affiliation

Huckvale ED; Markey Cancer Center, University of Kentucky, Lexington, Kentucky, United States of America.
Moseley HNB; Markey Cancer Center, University of Kentucky, Lexington, Kentucky, United States of America.

PLoS One ; 19(5): e0299583, 2024.

Article in English | MEDLINE | ID: mdl-38696410

ABSTRACT

The mapping of metabolite-specific data to pathways within cellular metabolism is a major data analysis step needed for biochemical interpretation. A variety of machine learning approaches, particularly deep learning approaches, have been used to predict these metabolite-to-pathway mappings, utilizing a training dataset of known metabolite-to-pathway mappings. A few such training datasets have been derived from the Kyoto Encyclopedia of Genes and Genomes (KEGG). However, several previously published machine learning approaches utilized an erroneous KEGG-derived training dataset that used SMILES molecular representation strings (the KEGG-SMILES dataset) and contained a sizable proportion (~26%) of duplicate entries. The presence of so many duplicates taints the training and testing sets generated from k-fold cross-validation of the KEGG-SMILES dataset. Therefore, the k-fold cross-validation performance of the resulting machine learning models was grossly inflated by the erroneous presence of these duplicate entries. Here we describe and evaluate the KEGG-SMILES dataset so that others may avoid using it. We also identify the prior publications that utilized this erroneous KEGG-SMILES dataset so their machine learning results can be properly and critically evaluated. In addition, we demonstrate the reduction of model k-fold cross-validation (CV) performance after de-duplicating the KEGG-SMILES dataset. This is a cautionary tale about properly vetting previously published benchmark datasets before using them in machine learning approaches. We hope others will avoid similar mistakes.
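The leakage mechanism described in the abstract can be illustrated with a small sketch. This is not the authors' code: the helper names (`deduplicate`, `k_fold_indices`, `leaked_test_fraction`), the toy SMILES entries, and the placeholder class labels are all assumptions made for illustration, and duplicates are detected by exact string match only, which would miss the same molecule written as two different SMILES strings.

```python
import random

def deduplicate(records):
    """Keep the first occurrence of each (smiles, label) pair (exact string match)."""
    seen = set()
    unique = []
    for record in records:
        if record not in seen:
            seen.add(record)
            unique.append(record)
    return unique

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def leaked_test_fraction(records, k=5, seed=0):
    """Mean fraction of test entries that also appear verbatim in the training split."""
    fractions = []
    for test_idx in k_fold_indices(len(records), k, seed):
        held_out = set(test_idx)
        train_entries = {r for i, r in enumerate(records) if i not in held_out}
        leaked = sum(records[i] in train_entries for i in test_idx)
        fractions.append(leaked / len(test_idx))
    return sum(fractions) / k

# Hypothetical toy data: 8 distinct entries, with the first one repeated 3 extra times.
# The labels are placeholders, not real KEGG pathway annotations.
unique_entries = [
    ("CCO", "class_a"), ("CC(=O)O", "class_a"), ("NCC(=O)O", "class_b"),
    ("OCC(O)CO", "class_a"), ("CC(O)C(=O)O", "class_b"), ("c1ccccc1", "class_b"),
    ("CC(=O)C(=O)O", "class_a"), ("NCCO", "class_b"),
]
records = unique_entries + [unique_entries[0]] * 3

print("leakage with duplicates:", leaked_test_fraction(records))
print("leakage after de-duplication:", leaked_test_fraction(deduplicate(records)))
```

With duplicates present, some held-out entries are identical to training entries, so a model can score well on them by memorization alone. De-duplicating before the folds are built removes that overlap.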

Subjects

Metabolic Networks and Pathways; Supervised Machine Learning; Humans; Datasets as Topic

Full text: 1 Database: MEDLINE Main subject: Metabolic Networks and Pathways / Supervised Machine Learning Limit: Humans Language: En Year of publication: 2024 Document type: Article
