CoRTEx: contrastive learning for representing terms via explanations with applications on constructing biomedical knowledge graphs.

Ying, Huaiyuan; Zhao, Zhengyun; Zhao, Yang; Zeng, Sihang; Yu, Sheng

Affiliation

Ying H; Center for Statistical Science, Department of Industrial Engineering, Tsinghua University, Beijing, 100084, China.
Zhao Z; Center for Statistical Science, Department of Industrial Engineering, Tsinghua University, Beijing, 100084, China.
Zhao Y; Weiyang College, Tsinghua University, Beijing, 100084, China.
Zeng S; Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, WA 98195, United States.
Yu S; Center for Statistical Science, Department of Industrial Engineering, Tsinghua University, Beijing, 100084, China.

J Am Med Inform Assoc ; 2024 May 23.

Article in English | MEDLINE | ID: mdl-38777805

ABSTRACT

OBJECTIVES:

Biomedical knowledge graphs play a pivotal role in various biomedical research domains. Term clustering, which aims to identify synonymous terms, is a crucial step in constructing these knowledge graphs. Because they lack background knowledge, previous contrastive learning models trained with Unified Medical Language System (UMLS) synonyms struggle to cluster difficult terms and do not generalize well beyond UMLS terms. In this work, we leverage the world knowledge of large language models (LLMs) and propose Contrastive Learning for Representing Terms via Explanations (CoRTEx) to enhance term representation and significantly improve term clustering.

MATERIALS AND METHODS:

The model training involves generating explanations for a cleaned subset of UMLS terms using ChatGPT. We employ contrastive learning, considering term and explanation embeddings simultaneously, and progressively introduce hard negative samples. Additionally, a ChatGPT-assisted BIRCH algorithm is designed for efficient clustering of a new ontology.
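
The abstract does not give the training objective, so the following is only a minimal sketch of one common way to set up such a loss, assuming an InfoNCE-style objective over cosine similarities. The function names (`info_nce`, `term_explanation_loss`), the temperature value, and the choice to average a term-to-term loss with a term-to-explanation loss are illustrative assumptions, not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.05):
    """InfoNCE loss for one anchor: negative log-softmax of the positive pair."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[0]


def term_explanation_loss(term, synonym, synonym_explanation, negatives,
                          temperature=0.05):
    """Hypothetical combined loss (an assumption, not the paper's formula):
    pull the term toward both a synonym's term embedding and that synonym's
    explanation embedding, and push it away from the negatives."""
    term_loss = info_nce(term, synonym, negatives, temperature)
    expl_loss = info_nce(term, synonym_explanation, negatives, temperature)
    return 0.5 * (term_loss + expl_loss)


# Toy 2-D embeddings, made up purely for illustration.
term = [1.0, 0.0]
synonym = [0.9, 0.1]
synonym_explanation = [0.95, 0.05]
easy_negative = [-1.0, 0.0]   # far from the term
hard_negative = [0.8, 0.6]    # close to the term but not a synonym

loss_easy = term_explanation_loss(term, synonym, synonym_explanation, [easy_negative])
loss_hard = term_explanation_loss(term, synonym, synonym_explanation, [hard_negative])
```

In this toy setup the hard negative yields a larger loss than the easy one, which illustrates why introducing hard negatives progressively gives the model a stronger training signal once the easy cases are already separated.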

RESULTS:

We established a clustering test set and a hard negative test set, where our model consistently achieves the highest F1 score. With CoRTEx embeddings and the modified BIRCH algorithm, we grouped 35 580 932 terms from the Biomedical Informatics Ontology System (BIOS) into 22 104 559 clusters with O(N) queries to ChatGPT. Case studies highlight the model's efficacy in handling challenging samples, aided by information from explanations.
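
The abstract does not describe how ChatGPT is integrated into BIRCH. As a rough illustration of why the number of LLM queries can stay linear in the number of terms, the sketch below uses a simplified single-pass threshold clustering (not BIRCH's CF-tree) in which each incoming term triggers at most one call to an oracle. The `oracle` callback stands in for an LLM synonymy check, and the `accept` and `reject` thresholds are hypothetical values chosen for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cluster_terms(embeddings, oracle, accept=0.9, reject=0.5):
    """Single-pass clustering that calls `oracle` at most once per item.

    Each item is compared with the nearest existing cluster:
      similarity >= accept           -> join without asking the oracle
      reject < similarity < accept   -> ask the oracle once
      similarity <= reject           -> start a new cluster
    Returns (list of member-index lists, number of oracle queries).
    """
    clusters = []  # each cluster: {"sum": vector sum, "members": [indices]}
    queries = 0
    for idx, vec in enumerate(embeddings):
        best, best_sim = None, -1.0
        for cluster in clusters:
            # Cosine with the vector sum equals cosine with the centroid.
            sim = cosine(vec, cluster["sum"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        join = False
        if best is not None:
            if best_sim >= accept:
                join = True
            elif best_sim > reject:
                queries += 1
                join = oracle(idx, best["members"][0])
        if join:
            best["sum"] = [a + b for a, b in zip(best["sum"], vec)]
            best["members"].append(idx)
        else:
            clusters.append({"sum": list(vec), "members": [idx]})
    return [c["members"] for c in clusters], queries


# Toy embeddings; the last one falls in the ambiguous band for this example.
toy = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.7, 0.7]]
clusters_no, queries_no = cluster_terms(toy, oracle=lambda i, j: False)
clusters_yes, queries_yes = cluster_terms(toy, oracle=lambda i, j: True)
```

Because the oracle is consulted at most once per incoming term, the query count is bounded by N regardless of how many clusters exist. The linear scan over clusters used here is for brevity only; BIRCH proper descends a tree of clustering features to locate the nearest cluster without visiting all of them.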

CONCLUSION:

By aligning terms to their explanations, CoRTEx demonstrates superior accuracy over benchmark models and robustness beyond its training set, and it is suitable for clustering terms for large-scale biomedical ontologies.

Keywords

contrastive learning; knowledge injection; large language models; term clustering

Full text: 1 Collections: 01-international Database: MEDLINE Language: En Journal: J Am Med Inform Assoc Journal subject: MEDICAL INFORMATICS Publication year: 2024 Document type: Article Affiliation country: China
