SDA-CLIP: surgical visual domain adaptation using video and text labels.
Li, Yuchong; Jia, Shuangfu; Song, Guangbi; Wang, Ping; Jia, Fucang.
Affiliation
  • Li Y; Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China.
  • Jia S; Shenzhen College of Advanced Technology, University of Chinese Academy of Sciences, Shenzhen, China.
  • Song G; Department of Operating Room, Hejian People's Hospital, Hejian, China.
  • Wang P; Medical Imaging Center, Luoping County People's Hospital, Qujing, China.
  • Jia F; Department of Hepatobiliary Surgery, The First Affiliated Hospital of Guangzhou Medical University, Guangzhou, China.
Quant Imaging Med Surg ; 13(10): 6989-7001, 2023 Oct 01.
Article in En | MEDLINE | ID: mdl-37869278
ABSTRACT

Background:

Surgical action recognition is an essential technology for context-aware autonomous surgery, but its accuracy is limited by the scale of clinical datasets. Leveraging surgical videos from virtual reality (VR) simulations to develop algorithms for clinical application, a strategy known as domain adaptation, can effectively reduce the cost of data acquisition and annotation and protect patient privacy.

Methods:

We introduced a surgical domain adaptation method based on the contrastive language-image pretraining model (SDA-CLIP) to recognize cross-domain surgical actions. Specifically, we utilized a Vision Transformer (ViT) and a Transformer to extract video and text embeddings, respectively. The text embedding served as a bridge between the VR and clinical domains. Inter- and intra-modality loss functions were employed to enhance the consistency of embeddings of the same class. We evaluated our method on the MICCAI 2020 EndoVis Challenge SurgVisDom dataset.
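As a rough illustration of the loss design described above, the following is a minimal PyTorch sketch combining a CLIP-style inter-modality alignment loss with a supervised-contrastive intra-modality term. The function name, the temperature value, and the exact form of the intra-modality loss are assumptions made for illustration; the authors' actual formulation is available in their repository.

import torch
import torch.nn.functional as F

def sda_clip_style_losses(video_emb, text_emb, labels, temperature=0.07):
    # Hypothetical sketch: video_emb (B, D) from a ViT video encoder,
    # text_emb (B, D) from a Transformer text encoder, labels (B,) are
    # integer action classes. Names and formulation are assumptions,
    # not the paper's exact definitions.

    # L2-normalize so dot products become cosine similarities
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)

    # Inter-modality: CLIP-style symmetric cross-entropy over the
    # video-to-text similarity matrix
    logits = v @ t.T / temperature
    targets = torch.arange(v.size(0), device=v.device)
    inter_loss = (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.T, targets)) / 2

    # Intra-modality: pull video embeddings of the same class together
    # (supervised-contrastive style), excluding self-similarity
    sim = v @ v.T / temperature
    eye = torch.eye(v.size(0), dtype=torch.bool, device=v.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~eye
    log_prob = sim - torch.logsumexp(
        sim.masked_fill(eye, float('-inf')), dim=1, keepdim=True)
    intra_loss = -(log_prob * pos_mask).sum(1) / pos_mask.sum(1).clamp(min=1)
    intra_loss = intra_loss.mean()

    return inter_loss + intra_loss

In training, video_emb would come from the ViT applied to video clips and text_emb from the Transformer applied to the class-label text prompts, so that the text embeddings can act as the cross-domain bridge described above.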

Results:

Our SDA-CLIP achieved a weighted F1-score of 65.9% (+18.9%) on the hard domain adaptation task (trained only with VR data) and 84.4% (+4.4%) on the soft domain adaptation task (trained with VR and clinical-like data), outperforming the first-place team of the challenge by a significant margin.

Conclusions:

The proposed SDA-CLIP model can effectively extract video scene information and textual semantic information, which greatly improves the performance of cross-domain surgical action recognition. The code is available at https://github.com/Lycus99/SDA-CLIP.

Full text: 1 | Database: MEDLINE | Language: En | Publication year: 2023 | Document type: Article
