Node-degree aware edge sampling mitigates inflated classification performance in biomedical random walk-based graph representation learning.
Cappelletti, Luca; Rekerle, Lauren; Fontana, Tommaso; Hansen, Peter; Casiraghi, Elena; Ravanmehr, Vida; Mungall, Christopher J; Yang, Jeremy J; Spranger, Leonard; Karlebach, Guy; Caufield, J Harry; Carmody, Leigh; Coleman, Ben; Oprea, Tudor I; Reese, Justin; Valentini, Giorgio; Robinson, Peter N.
Affiliation
  • Cappelletti L; AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, Milano 20133, Italy.
  • Rekerle L; The Jackson Laboratory for Genomic Medicine, CT 06032, United States.
  • Fontana T; AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, Milano 20133, Italy.
  • Hansen P; The Jackson Laboratory for Genomic Medicine, CT 06032, United States.
  • Casiraghi E; AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, Milano 20133, Italy.
  • Ravanmehr V; Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94710, United States.
  • Mungall CJ; The Jackson Laboratory for Genomic Medicine, CT 06032, United States.
  • Yang JJ; Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94710, United States.
  • Spranger L; Department of Internal Medicine and UNM Comprehensive Cancer Center, UNM School of Medicine, Albuquerque, NM 87102, United States.
  • Karlebach G; Institute of Bioinformatics, Freie Universität Berlin, Berlin, 14195, Germany.
  • Caufield JH; The Jackson Laboratory for Genomic Medicine, CT 06032, United States.
  • Carmody L; Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94710, United States.
  • Coleman B; The Jackson Laboratory for Genomic Medicine, CT 06032, United States.
  • Oprea TI; The Jackson Laboratory for Genomic Medicine, CT 06032, United States.
  • Reese J; Institute for Systems Genomics, University of Connecticut, Farmington, CT 06032, United States.
  • Valentini G; Department of Internal Medicine and UNM Comprehensive Cancer Center, UNM School of Medicine, Albuquerque, NM 87102, United States.
  • Robinson PN; Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94710, United States.
Bioinform Adv ; 4(1): vbae036, 2024.
Article in English | MEDLINE | ID: mdl-38577542
ABSTRACT
Motivation:

Graph representation learning is a family of related approaches that learn low-dimensional vector representations of nodes and other graph elements called embeddings. Embeddings approximate characteristics of the graph and can be used for a variety of machine-learning tasks such as novel edge prediction. For many biomedical applications, partial knowledge exists about positive edges that represent relationships between pairs of entities, but little to no knowledge is available about negative edges that represent the explicit lack of a relationship between two nodes. For this reason, classification procedures are forced to assume that the vast majority of unlabeled edges are negative. Existing approaches to sampling negative edges for training and evaluating classifiers do so by uniformly sampling pairs of nodes.
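The uniform sampling strategy described above can be illustrated with a minimal sketch. It is not taken from the authors' code: the function names (`sample_uniform_negatives`, `mean_endpoint_degree`) and the star-shaped toy graph are assumptions chosen only to make the degree imbalance easy to see.

```python
import random
from collections import Counter


def sample_uniform_negatives(nodes, positive_edges, n_samples, seed=0):
    """Draw distinct node pairs uniformly at random, skipping known positive edges."""
    rng = random.Random(seed)
    positives = {frozenset(edge) for edge in positive_edges}
    max_pairs = len(nodes) * (len(nodes) - 1) // 2
    if n_samples > max_pairs - len(positives):
        raise ValueError("not enough unlabeled pairs to sample from")
    negatives = set()
    while len(negatives) < n_samples:
        u, v = rng.sample(nodes, 2)  # two distinct nodes, each equally likely
        pair = frozenset((u, v))
        if pair not in positives:
            negatives.add(pair)
    return [tuple(sorted(pair)) for pair in negatives]


def mean_endpoint_degree(edges, degree):
    """Average degree over all endpoints of the given edges."""
    return sum(degree[u] + degree[v] for u, v in edges) / (2 * len(edges))


# Hypothetical toy graph: one hub (node 0) linked to 30 leaves.
nodes = list(range(31))
positive_edges = [(0, leaf) for leaf in range(1, 31)]

degree = Counter()
for u, v in positive_edges:
    degree[u] += 1
    degree[v] += 1

negative_edges = sample_uniform_negatives(nodes, positive_edges, n_samples=30)

pos_mean = mean_endpoint_degree(positive_edges, degree)  # 15.5
neg_mean = mean_endpoint_degree(negative_edges, degree)  # 1.0
print(pos_mean, neg_mean)
```

In this deliberately extreme graph every positive edge touches the hub, so every sampled negative pair joins two degree-1 leaves. A classifier could then separate the two classes using node degree alone.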

Results:

We show here that this sampling strategy typically leads to sets of positive and negative examples with imbalanced node degree distributions. Using a representative heterogeneous biomedical knowledge graph and random walk-based graph machine learning, we show that this strategy substantially impacts classification performance. If users of graph machine-learning models apply the models to prioritize examples that are drawn from approximately the same distribution as the positive examples, then the performance of models as estimated in the validation phase may be artificially inflated. We present a degree-aware node sampling approach that mitigates this effect and is simple to implement.

Availability and implementation:

Our code and data are publicly available at https://github.com/monarch-initiative/negativeExampleSelection.
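One plausible reading of degree-aware sampling is to draw each endpoint of a candidate negative edge with probability proportional to its degree, so that negative examples resemble positive ones in node degree. The sketch below follows that assumption; it is not the authors' implementation (which is in the repository linked above), and `sample_negatives`, the `weights` parameter, and the five-hub toy graph are all hypothetical.

```python
import random
from collections import Counter


def sample_negatives(nodes, positive_edges, n_samples, weights=None, seed=0):
    """Sample distinct non-positive node pairs.

    weights=None draws endpoints uniformly; passing node degrees as weights
    draws each endpoint with probability proportional to its degree.
    """
    rng = random.Random(seed)
    positives = {frozenset(edge) for edge in positive_edges}
    negatives = set()
    for _ in range(100_000):  # bounded retries; may return fewer pairs if exhausted
        if len(negatives) == n_samples:
            break
        u, v = rng.choices(nodes, weights=weights, k=2)
        pair = frozenset((u, v))
        if u != v and pair not in positives:
            negatives.add(pair)
    return [tuple(sorted(pair)) for pair in negatives]


def mean_endpoint_degree(edges, degree):
    """Average degree over all endpoints of the given edges."""
    return sum(degree[u] + degree[v] for u, v in edges) / (2 * len(edges))


# Hypothetical toy graph: 5 hubs (nodes 0-4), each linked to 10 private leaves.
nodes = list(range(55))
positive_edges = [(hub, 5 + 10 * hub + i) for hub in range(5) for i in range(10)]

degree = Counter()
for u, v in positive_edges:
    degree[u] += 1
    degree[v] += 1
node_weights = [degree[n] for n in nodes]

uniform = sample_negatives(nodes, positive_edges, 50)
aware = sample_negatives(nodes, positive_edges, 50, weights=node_weights)

positive_mean = mean_endpoint_degree(positive_edges, degree)
uniform_mean = mean_endpoint_degree(uniform, degree)
aware_mean = mean_endpoint_degree(aware, degree)
print(positive_mean, uniform_mean, aware_mean)
```

Under these assumptions the degree-weighted negatives should have a mean endpoint degree much closer to that of the positive edges than the uniformly sampled ones do. Matching the full degree distribution, rather than only its mean, would need a more careful scheme.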

Full text: 1 Database: MEDLINE Language: En Publication year: 2024 Document type: Article
