Improving Laryngoscopy Image Analysis Through Integration of Global Information and Local Features in VoFoCD Dataset.
Dao, Thao Thi Phuong; Huynh, Tuan-Luc; Pham, Minh-Khoi; Le, Trung-Nghia; Nguyen, Tan-Cong; Nguyen, Quang-Thuc; Tran, Bich Anh; Van, Boi Ngoc; Ha, Chanh Cong; Tran, Minh-Triet.
Affiliation
  • Dao TTP; University of Science, Ho Chi Minh City, Vietnam.
  • Huynh TL; John von Neumann Institute, Ho Chi Minh City, Vietnam.
  • Pham MK; Vietnam National University, Ho Chi Minh City, Vietnam.
  • Le TN; Department of Otolaryngology, Thong Nhat Hospital, Tan Binh District, Ho Chi Minh City, Vietnam.
  • Nguyen TC; University of Science, Ho Chi Minh City, Vietnam.
  • Nguyen QT; Vietnam National University, Ho Chi Minh City, Vietnam.
  • Tran BA; Dublin City University, Dublin, Ireland.
  • Van BN; University of Science, Ho Chi Minh City, Vietnam.
  • Ha CC; Vietnam National University, Ho Chi Minh City, Vietnam.
  • Tran MT; University of Science, Ho Chi Minh City, Vietnam.
J Imaging Inform Med; 2024 May 29.
Article in En | MEDLINE | ID: mdl-38809338
ABSTRACT
The diagnosis and treatment of vocal fold disorders rely heavily on laryngoscopy. A comprehensive vocal fold diagnosis requires accurate identification of crucial anatomical structures and potential lesions during laryngoscopy observation. However, existing approaches have yet to explore joint optimization of the decision-making process, i.e., performing object detection and image classification simultaneously. In this study, we provide a new dataset, VoFoCD, with 1724 laryngology images designed explicitly for object detection and image classification in laryngoscopy images. Images in the VoFoCD dataset are categorized into four classes and comprise six glottic object types. Moreover, we propose a novel Multitask Efficient trAnsformer network for Laryngoscopy (MEAL) to classify vocal fold images and detect glottic landmarks and lesions. To further facilitate interpretability for clinicians, MEAL provides attention maps that visualize important learned regions, yielding explainable artificial intelligence results that support clinical decision-making. We also analyze our model's effectiveness in simulated clinical scenarios in which camera shake occurs during laryngoscopy. The proposed model demonstrates outstanding performance on our VoFoCD dataset: the accuracy for image classification and the mean average precision at an intersection-over-union threshold of 0.5 (mAP50) for object detection are 0.951 and 0.874, respectively. Our MEAL method integrates global knowledge, encompassing general laryngoscopy image classification, into local features, which refer to distinct anatomical regions of the vocal fold, particularly abnormal regions, including benign and malignant lesions. Our contribution can effectively aid laryngologists in identifying benign or malignant lesions of the vocal folds and in visually classifying images during laryngeal endoscopy.
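To make the multitask setup described in the abstract concrete, the following is a minimal sketch, assuming a shared backbone feeding a 4-class image-classification head and a simple fixed-size detection head for the 6 glottic object types. It is not the authors' MEAL architecture (which uses an efficient transformer and attention maps); the backbone, head shapes, box parameterization, and the choice of PyTorch are illustrative assumptions only.

```python
# Minimal multitask sketch (NOT the authors' MEAL model): a shared backbone
# with (a) a global image-classification head over 4 classes and (b) a simple
# per-image detection head predicting a fixed number of boxes over 6 glottic
# object types plus background. Class counts follow the abstract; everything
# else is assumed for illustration.
import torch
import torch.nn as nn

class MultitaskLaryngoscopyNet(nn.Module):
    def __init__(self, num_image_classes=4, num_object_types=6, max_boxes=10):
        super().__init__()
        # Shared convolutional backbone (stand-in for the transformer encoder).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Global head: whole-image classification (4 classes in VoFoCD).
        self.cls_head = nn.Linear(64, num_image_classes)
        # Local head: max_boxes proposals, each with 4 box coordinates plus
        # scores over the 6 object types and a background class.
        self.det_head = nn.Linear(64, max_boxes * (4 + num_object_types + 1))
        self.max_boxes = max_boxes
        self.num_object_types = num_object_types

    def forward(self, x):
        feats = self.backbone(x)
        cls_logits = self.cls_head(feats)
        det = self.det_head(feats).view(
            x.size(0), self.max_boxes, 4 + self.num_object_types + 1)
        boxes, obj_logits = det[..., :4], det[..., 4:]
        return cls_logits, boxes, obj_logits

# Joint optimization would sum a classification loss and a detection loss so
# both tasks update the shared backbone, mirroring the joint decision-making
# idea in the abstract.
model = MultitaskLaryngoscopyNet()
images = torch.randn(2, 3, 224, 224)
cls_logits, boxes, obj_logits = model(images)
print(cls_logits.shape, boxes.shape, obj_logits.shape)
```

The reported mAP50 figure corresponds to mean average precision computed with detections counted as correct when their intersection-over-union with a ground-truth box is at least 0.5.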
Full text: 1 | Database: MEDLINE | Language: En | Journal: J Imaging Inform Med | Year of publication: 2024 | Document type: Article | Country of affiliation: Vietnam