Eye Gaze Guided Cross-Modal Alignment Network for Radiology Report Generation.

Peng, Peixi; Fan, Wanshu; Shen, Yue; Liu, Wenfei; Yang, Xin; Zhang, Qiang; Wei, Xiaopeng; Zhou, Dongsheng

ABSTRACT

Automatic radiology report generation offers significant potential benefits, such as reducing misdiagnosis rates and improving the efficiency of clinical diagnosis. However, existing data-driven methods lack essential medical prior knowledge, which hampers their performance. Moreover, establishing global correspondences between radiology images and their reports, while also achieving local alignment between text and the images correlated with prior knowledge, remains challenging. To address these shortcomings, we introduce a novel Eye Gaze Guided Cross-modal Alignment Network (EGGCA-Net) for generating accurate medical reports. Our approach incorporates prior knowledge from radiologists' Eye Gaze Region (EGR) to improve the fidelity and comprehensibility of the generated reports. Specifically, we design a Dual Fine-Grained Branch (DFGB) and a Multi-Task Branch (MTB) that jointly align visual and textual semantics across multiple levels. To establish fine-grained alignment between EGR-related images and sentences, we introduce the Sentence Fine-grained Prototype Module (SFPM) within DFGB to capture cross-modal information at different levels. Additionally, to learn the alignment of EGR-related image topics, we introduce the Multi-task Feature Fusion Module (MFFM) within MTB to refine the encoder output. Finally, a dedicated label matching mechanism ensures that the generated reports are consistent with the anticipated disease states. Experimental results show that the proposed method outperforms previous advanced methods on two widely used benchmark datasets, Open-i and MIMIC-CXR.