Identifying genes targeted by disease-associated non-coding SNPs with a protein knowledge graph.

Vlietstra, Wytze J; Vos, Rein; van Mulligen, Erik M; Jenster, Guido W; Kors, Jan A

Affiliation

Vlietstra WJ; Department of Medical Informatics, Erasmus MC, University Medical Center Rotterdam, Rotterdam, the Netherlands.
Vos R; Data Science, Life Science Operations Department, Elsevier B.V., Amsterdam, the Netherlands.
van Mulligen EM; Department of Medical Informatics, Erasmus MC, University Medical Center Rotterdam, Rotterdam, the Netherlands.
Jenster GW; Department of Methodology & Statistics, Maastricht University, Maastricht, the Netherlands.
Kors JA; Department of Medical Informatics, Erasmus MC, University Medical Center Rotterdam, Rotterdam, the Netherlands.

PLoS One ; 17(7): e0271395, 2022.

Article in English | MEDLINE | ID: mdl-35830458

ABSTRACT

Genome-wide association studies (GWAS) have identified many single nucleotide polymorphisms (SNPs) that play important roles in the genetic heritability of traits and diseases. Because most of these SNPs are located in the non-coding part of the genome, it is currently assumed that they influence the expression of nearby genes. However, identifying which genes are targeted by these disease-associated SNPs remains challenging. In the past, protein knowledge graphs have often been used to identify genes that are associated with disease, also referred to as "disease genes". Here, we explore whether protein knowledge graphs can be used to identify genes that are targeted by disease-associated non-coding SNPs by testing and comparing the performance of six existing methods for a protein knowledge graph, four of which were developed for disease gene identification. We compare their performance against two baselines: (1) an existing state-of-the-art method that is based on guilt-by-association, and (2) the leading assumption that SNPs target the nearest gene on the genome. We test these methods with four reference sets, three of which were obtained by different means. Furthermore, we combine methods to investigate whether their combination improves performance. We find that protein knowledge graph methods that include predicate information perform comparably to the current state of the art, achieving an area under the receiver operating characteristic curve (AUC) of 79.6% on average across all four reference sets. Protein knowledge graph methods that lack predicate information perform comparably to our other baseline (genetic distance), which achieved an AUC of 75.7% across all four reference sets. Combining multiple methods improved performance to an AUC of 84.9%. We conclude that methods for a protein knowledge graph can be used to identify which genes are targeted by disease-associated non-coding SNPs.
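
As a rough illustration of how a nearest-gene baseline could be scored with an AUC, the following minimal Python sketch ranks candidate genes around one SNP by genomic distance and computes the AUC as the fraction of (target, non-target) pairs ranked correctly. The function names, gene names, coordinates, and labels are all hypothetical and chosen for illustration only; this is not the authors' implementation or data.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the probability that a randomly chosen positive scores higher
    than a randomly chosen negative (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def distance_score(snp_pos: int, gene_start: int, gene_end: int) -> float:
    """Negative distance in base pairs from the SNP to the gene body,
    so that closer genes receive higher scores (0 if the SNP is inside)."""
    if gene_start <= snp_pos <= gene_end:
        return 0.0
    return -float(min(abs(snp_pos - gene_start), abs(snp_pos - gene_end)))


# Hypothetical toy example: one SNP and four candidate genes.
# Tuple layout: (name, start, end, label), label 1 = assumed true target.
snp_pos = 1_000_000
candidates = [
    ("GENE_A", 990_000, 995_000, 0),      # nearest gene, but not the target
    ("GENE_B", 1_020_000, 1_030_000, 1),  # assumed target, second nearest
    ("GENE_C", 1_200_000, 1_210_000, 0),
    ("GENE_D", 700_000, 720_000, 0),
]

scores = [distance_score(snp_pos, start, end) for _, start, end, _ in candidates]
labels = [label for _, _, _, label in candidates]
auc = roc_auc(scores, labels)
print(f"Distance-baseline AUC on the toy example: {auc:.3f}")
```

In this toy case the assumed target is the second-nearest gene, so it outranks two of the three non-targets. A knowledge graph method would be evaluated the same way, with its own scores in place of `distance_score`.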

Subject(s)

Genome-Wide Association Study; Polymorphism, Single Nucleotide; Genome-Wide Association Study/methods; Pattern Recognition, Automated; Phenotype

Full text: 1 | Collection: 01-international | Database: MEDLINE | Main subject: Polymorphism, Single Nucleotide / Genome-Wide Association Study | Type of study: Prognostic_studies / Risk_factors_studies | Language: En | Journal: PLoS One | Journal subject: SCIENCE / MEDICINE | Year: 2022 | Document type: Article | Affiliation country: Netherlands
