EndoViT: pretraining vision transformers on a large collection of endoscopic images.

Batic, Dominik; Holm, Felix; Özsoy, Ege; Czempiel, Tobias; Navab, Nassir

Affiliation

Batic D; Chair for Computer Aided Medical Procedures, Technical University Munich, Munich, Germany.
Holm F; Chair for Computer Aided Medical Procedures, Technical University Munich, Munich, Germany. felix.holm@tum.de.
Özsoy E; Carl Zeiss AG, Munich, Germany. felix.holm@tum.de.
Czempiel T; Chair for Computer Aided Medical Procedures, Technical University Munich, Munich, Germany.
Navab N; Chair for Computer Aided Medical Procedures, Technical University Munich, Munich, Germany.

Int J Comput Assist Radiol Surg ; 19(6): 1085-1091, 2024 Jun.

Article in En | MEDLINE | ID: mdl-38570373

ABSTRACT

PURPOSE:

Automated endoscopy video analysis is essential for assisting surgeons during medical procedures, but it faces challenges due to complex surgical scenes and limited annotated data. In recent years, large-scale pretraining has been highly successful in the natural language processing and computer vision communities. These approaches reduce the need for annotated data, which is of great interest in the medical domain. In this work, we investigate endoscopy domain-specific self-supervised pretraining on large collections of data.

METHODS:

To this end, we first collect Endo700k, the largest publicly available corpus of endoscopic images, extracted from nine public Minimally Invasive Surgery (MIS) datasets. Endo700k comprises more than 700,000 images. Next, we introduce EndoViT, an endoscopy-pretrained Vision Transformer (ViT), and evaluate it on a diverse set of surgical downstream tasks.

RESULTS:

Our findings indicate that domain-specific pretraining with EndoViT yields notable advantages in complex downstream tasks. In the case of action triplet recognition, our approach outperforms ImageNet pretraining. In semantic segmentation, we surpass the state-of-the-art (SOTA) performance. These results demonstrate the effectiveness of our domain-specific pretraining approach in addressing the challenges of automated endoscopy video analysis.

CONCLUSION:

Our study contributes to the field of medical computer vision by showcasing the benefits of domain-specific large-scale self-supervised pretraining for vision transformers. We release both our code and pretrained models to facilitate further research in this direction: https://github.com/DominikBatic/EndoViT.

Subjects

Endoscopy; Humans; Endoscopy/methods; Endoscopy/education; Image Processing, Computer-Assisted/methods; Video Recording; Minimally Invasive Surgical Procedures/education; Minimally Invasive Surgical Procedures/methods

Keywords

Endoscopy video analysis; Pretraining; Vision transformer

Full text: 1 | Collections: 01-international | Database: MEDLINE | Main subject: Endoscopy | Limit: Humans | Language: En | Journal: Int J Comput Assist Radiol Surg | Year of publication: 2024 | Document type: Article
