A Bi-level representation learning model for medical visual question answering.
Li, Yong; Long, Shaopei; Yang, Zhenguo; Weng, Heng; Zeng, Kun; Huang, Zhenhua; Lee Wang, Fu; Hao, Tianyong.
Affiliation
  • Li Y; School of Computer Science, South China Normal University, Guangzhou, China. Electronic address: lycutter@m.scnu.edu.cn.
  • Long S; School of Computer Science, South China Normal University, Guangzhou, China. Electronic address: 1030829350@qq.com.
  • Yang Z; School of Computer Science, Guangdong University of Technology, Guangzhou, China. Electronic address: yzg@gdut.edu.cn.
  • Weng H; State Key Laboratory of Dampness Syndrome of Chinese Medicine, The Second Affiliated Hospital of Guangzhou University of Chinese Medicine, Guangzhou, China. Electronic address: wengh@gzucm.edu.cn.
  • Zeng K; School of Data and Computer Science, Sun Yat-sen University, Guangzhou, China. Electronic address: zengkun@gmail.com.
  • Huang Z; School of Computer Science, South China Normal University, Guangzhou, China. Electronic address: huangzhenhua@m.scnu.edu.cn.
  • Lee Wang F; School of Science and Technology, Hong Kong Metropolitan University, Hong Kong SAR, China. Electronic address: pwang@hkmu.edu.hk.
  • Hao T; School of Computer Science, South China Normal University, Guangzhou, China. Electronic address: haoty@m.scnu.edu.cn.
J Biomed Inform; 134: 104183, 2022 Oct.
Article in English | MEDLINE | ID: mdl-36038063
ABSTRACT
Medical Visual Question Answering (VQA) aims to answer questions about given medical images and holds tremendous potential for healthcare services. However, research on medical VQA still faces challenges, particularly in learning fine-grained multimodal semantic representations for answer prediction from relatively small data resources. Moreover, the long-tailed label distributions of medical VQA datasets frequently degrade model performance. To this end, we propose a novel bi-level representation learning model with two reasoning modules that learn bi-level representations for the medical VQA task. One is sentence-level reasoning, which learns sentence-level semantic representations from the multimodal input. The other is token-level reasoning, which employs an attention mechanism to generate a multimodal contextual vector by fusing image features and word embeddings. The contextual vector is used to filter irrelevant semantic representations from sentence-level reasoning, yielding a fine-grained multimodal representation. Furthermore, a label-distribution-smooth margin loss is proposed to minimize the generalization error bound on long-tailed datasets by adjusting the margin bounds of different labels in the training set. On the standard VQA-RAD and PathVQA datasets, the proposed model achieves accuracies of 0.7605 and 0.5434 and F1-scores of 0.7741 and 0.5288, respectively, outperforming a set of state-of-the-art baseline models.
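The abstract's two core mechanisms can be illustrated with a short sketch. The PyTorch code below is a minimal, hypothetical rendering, not the authors' released implementation: the names (TokenLevelReasoning, BiLevelFusion, label_smooth_margin_loss), the sigmoid-gate form of the "filtering" step, and the LDAM-style n_c**-0.25 margin schedule are all assumptions, since the abstract states only that attention fuses image features with word embeddings, that the resulting contextual vector filters the sentence-level representation, and that per-label margins are adjusted for the long-tailed distribution.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TokenLevelReasoning(nn.Module):
        """Attention over question tokens conditioned on image features,
        producing a multimodal contextual vector (hypothetical layout)."""

        def __init__(self, img_dim, word_dim, hidden_dim):
            super().__init__()
            self.img_proj = nn.Linear(img_dim, hidden_dim)
            self.word_proj = nn.Linear(word_dim, hidden_dim)
            self.att = nn.Linear(hidden_dim, 1)

        def forward(self, img_feats, word_embs):
            # img_feats: (B, img_dim) pooled image features
            # word_embs: (B, T, word_dim) question token embeddings
            w = self.word_proj(word_embs)                           # (B, T, H)
            h = torch.tanh(w + self.img_proj(img_feats).unsqueeze(1))
            weights = F.softmax(self.att(h).squeeze(-1), dim=-1)    # (B, T) token relevance
            return torch.bmm(weights.unsqueeze(1), w).squeeze(1)    # (B, H) contextual vector

    class BiLevelFusion(nn.Module):
        """The token-level contextual vector gates the sentence-level
        representation, filtering irrelevant dimensions."""

        def __init__(self, img_dim, word_dim, sent_dim, hidden_dim):
            super().__init__()
            self.token_reasoning = TokenLevelReasoning(img_dim, word_dim, hidden_dim)
            self.sent_proj = nn.Linear(sent_dim, hidden_dim)
            self.gate = nn.Linear(hidden_dim, hidden_dim)

        def forward(self, img_feats, word_embs, sent_repr):
            context = self.token_reasoning(img_feats, word_embs)
            gate = torch.sigmoid(self.gate(context))    # element-wise relevance filter
            return gate * self.sent_proj(sent_repr)     # fine-grained multimodal repr.

    def label_smooth_margin_loss(logits, targets, class_counts,
                                 scale=30.0, max_margin=0.5):
        # Larger margins for rarer labels (LDAM-style n_c**-0.25 schedule);
        # the paper's exact margin rule is not given in the abstract.
        margins = class_counts.float() ** -0.25
        margins = margins * (max_margin / margins.max())  # rarest label gets max_margin
        true_margins = margins[targets]                   # (B,) margin of each gold label
        adjusted = logits.clone()
        adjusted[torch.arange(logits.size(0)), targets] -= true_margins
        return F.cross_entropy(scale * adjusted, targets)

In this sketch the sigmoid gate plays the role of the "filter": dimensions of the sentence-level representation that the token-level context deems irrelevant are scaled toward zero, while the margin loss pushes the decision boundary further from rare labels, a standard remedy for long-tailed classification.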

Full text: 1 Database: MEDLINE Main subject: Semantics / Machine Learning Study type: Prognostic_studies Language: English Journal: J Biomed Inform Journal subject: MEDICAL INFORMATICS Year of publication: 2022 Document type: Article