Contrastive pre-training and linear interaction attention-based transformer for universal medical reports generation.
Lin, Zhihong; Zhang, Donghao; Shi, Danli; Xu, Renjing; Tao, Qingyi; Wu, Lin; He, Mingguang; Ge, Zongyuan.
Affiliation
  • Lin Z; Faculty of Engineering, Monash University, Clayton, VIC, 3800, Australia. Electronic address: zhihong.lin@monash.edu.
  • Zhang D; Monash eResearch Center, Monash University, Clayton, VIC, 3800, Australia. Electronic address: donghao.zhang@monash.edu.
  • Shi D; State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-Sen University, Guangzhou, 510060, China. Electronic address: shidli@mail2.sysu.edu.cn.
  • Xu R; Microelectronics Thrust, The Hong Kong University of Science and Technology (Guangzhou), Nansha, Guangzhou, Guangdong, 511400, China. Electronic address: renjingxu@ust.hk.
  • Tao Q; NVIDIA AI Technology Center, 038988, Singapore. Electronic address: qtao002@e.ntu.edu.sg.
  • Wu L; School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, 230000, China. Electronic address: jolin.lwu@gmail.com.
  • He M; Centre for Eye Research Australia, Royal Victorian Eye and Ear Hospital, East Melbourne, VIC, 3002, Australia. Electronic address: mingguang.he@unimelb.edu.au.
  • Ge Z; Monash eResearch Center, Monash University, Clayton, VIC, 3800, Australia. Electronic address: zongyuan.ge@monash.edu.
J Biomed Inform; 138: 104281, 2023 Feb.
Article in English | MEDLINE | ID: mdl-36638935
Interpreting medical images such as chest X-ray images and retina images is an essential step in diagnosing and treating the relevant diseases. Automatic, reliable medical report generation systems can reduce this time-consuming workload, improve the efficiency of clinical workflows, and decrease variation in practice between clinical professionals. Many recent approaches built on an image-encoder and language-decoder structure have been proposed for this task. However, some technical challenges remain unsolved, including how effectively language and visual cues are fused and the difficulty of obtaining an image feature extractor pre-trained for medical-specific tasks. In this work, we propose a weighted query-key interacting attention module that includes both second-order and first-order interactions. Compared with conventional scaled dot-product attention, this design yields a stronger fusion mechanism between language and visual signals. In addition, we propose a contrastive pre-training step to reduce the domain gap between the image encoder and the target dataset. To test the generalizability of our learning scheme, we collected and validated our model on the world's first multi-modality retina report generation dataset, referred to as Retina ImBank, and another large-scale Chinese retina report dataset, referred to as Retina Chinese. These two datasets will be made publicly available and serve as benchmarks to encourage further research in this field. Our experimental results demonstrate that the proposed method outperforms multiple state-of-the-art image captioning and medical report generation methods on the IU X-RAY, MIMIC-CXR, Retina ImBank, and Retina Chinese datasets.
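The abstract contrasts the proposed weighted query-key interacting attention with conventional scaled dot-product attention. As a minimal sketch of the general idea (the paper's exact formulation is not given in the abstract), the snippet below combines the standard second-order (bilinear) query-key product with additive first-order (linear) terms, mixed by learnable weights; the names `q_score`, `k_score`, `w1`, and `w2` are hypothetical, introduced here only for illustration.

```python
import math
import torch
import torch.nn as nn


class QueryKeyInteractionAttention(nn.Module):
    """Illustrative attention that mixes second-order (bilinear) and
    first-order (linear) query-key interactions. This is a plausible
    reading of the abstract, not the paper's verified implementation."""

    def __init__(self, d_model: int):
        super().__init__()
        self.d_model = d_model
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # First-order terms: score each query/key token on its own
        # (hypothetical components, named for this sketch only).
        self.q_score = nn.Linear(d_model, 1)
        self.k_score = nn.Linear(d_model, 1)
        # Learnable weights mixing the two interaction orders (hypothetical).
        self.w2 = nn.Parameter(torch.ones(1))
        self.w1 = nn.Parameter(torch.ones(1))

    def forward(self, query, key, value):
        # query: (B, Lq, d); key, value: (B, Lk, d)
        q, k, v = self.q_proj(query), self.k_proj(key), self.v_proj(value)
        # Second-order interaction: the usual scaled dot product.
        second = q @ k.transpose(-2, -1) / math.sqrt(self.d_model)  # (B, Lq, Lk)
        # First-order interaction: additive per-token scores, broadcast
        # to the full (B, Lq, Lk) score matrix.
        first = self.q_score(q) + self.k_score(k).transpose(-2, -1)
        attn = torch.softmax(self.w2 * second + self.w1 * first, dim=-1)
        return attn @ v
```

In a report-generation decoder, `query` would come from the language tokens and `key`/`value` from the visual features, so the first-order terms let individual words or image regions raise or lower attention scores independently of their pairwise similarity.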
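The abstract also describes a contrastive pre-training step to close the domain gap between a generic image encoder and the target medical dataset, without specifying the objective. A common choice for this kind of encoder adaptation is a symmetric InfoNCE loss over paired embeddings (for example, two augmented views of the same image); the sketch below shows that standard formulation, which may differ from the paper's actual objective.

```python
import torch
import torch.nn.functional as F


def info_nce_loss(z1: torch.Tensor, z2: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over two batches of paired embeddings
    (B, d). A generic contrastive objective, shown for illustration."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    # Cosine-similarity logits between every pair in the batch.
    logits = z1 @ z2.t() / temperature          # (B, B)
    targets = torch.arange(z1.size(0), device=z1.device)
    # The matching pair (i, i) is the positive; all others are negatives.
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2
```

Pre-training the image encoder with such an objective on in-domain images (or image-report pairs) before attaching the language decoder is one standard way to reduce the domain gap the abstract refers to.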

Full text: 1 | Collections: 01-international | Database: MEDLINE | Main subject: Benchmarking / Language | Study type: Prognostic_studies | Language: En | Publication year: 2023 | Document type: Article