ABSTRACT
Online surgical phase recognition plays a significant role in building contextual tools that can quantify performance and oversee the execution of surgical workflows. Current approaches are limited in two ways: they train spatial feature extractors with frame-level supervision, which can lead to incorrect predictions when similar frames appear in different phases, and they fuse local and global features poorly because of computational constraints, which hampers the analysis of the long videos common in surgical interventions. In this paper, we present a two-stage method, called Long Video Transformer (LoViT), emphasizing the development of a temporally-rich spatial feature extractor and a phase transition map. The temporally-rich spatial feature extractor is designed to capture critical temporal information within the surgical video frames. The phase transition map provides essential insights into the dynamic transitions between surgical phases. LoViT combines these innovations with a multi-scale temporal aggregator consisting of two cascaded L-Trans modules based on self-attention, followed by a G-Informer module based on ProbSparse self-attention for processing global temporal information. The multi-scale temporal head then leverages the temporally-rich spatial features and the phase transition map to classify surgical phases using phase transition-aware supervision. Our approach consistently outperforms state-of-the-art methods on the Cholec80 and AutoLaparo datasets. Compared to Trans-SVNet, LoViT achieves a 2.4 pp (percentage point) improvement in video-level accuracy on Cholec80 and a 3.1 pp improvement on AutoLaparo. These results demonstrate the effectiveness of our approach in achieving state-of-the-art surgical phase recognition on two datasets with different surgical procedures and temporal sequencing characteristics. The project page is available at https://github.com/MRUIL/LoViT.
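For readers who want a concrete picture of the local-then-global aggregation LoViT describes, the sketch below stacks two windowed self-attention blocks (standing in for the L-Trans modules) before one full-attention block approximating the G-Informer stage. Plain attention replaces ProbSparse attention, and all dimensions, window sizes, and module names are illustrative assumptions rather than the authors' implementation; causal masking for online use is also omitted.

```python
# Hedged sketch of LoViT's local/global temporal aggregation idea (assumptions
# noted above); not the authors' code.
import torch
import torch.nn as nn

class LocalGlobalAggregator(nn.Module):
    def __init__(self, dim=256, heads=8, window=32):
        super().__init__()
        self.local1 = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.local2 = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.global_ = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.window = window

    def forward(self, feats):                        # feats: (B, T, dim)
        b, t, d = feats.shape
        pad = (-t) % self.window                     # pad T to a window multiple
        x = nn.functional.pad(feats, (0, 0, 0, pad))
        # Two cascaded local ("L-Trans"-style) blocks: attention inside windows.
        w = x.view(b, -1, self.window, d).flatten(0, 1)
        w = self.local2(self.local1(w))
        x = w.view(b, -1, d)[:, :t]
        # Global stage: plain full attention stands in for ProbSparse attention.
        return self.global_(x)

frames = torch.randn(2, 100, 256)                    # 100 frame embeddings
print(LocalGlobalAggregator()(frames).shape)         # torch.Size([2, 100, 256])
```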
ABSTRACT
BACKGROUND: Laparoscopic pancreatoduodenectomy (LPD) is one of the most challenging operations and has a long learning curve. Artificial intelligence (AI) automated surgical phase recognition in intraoperative videos has many potential applications in surgical education, helping shorten the learning curve, but no study has made this breakthrough in LPD. Herein, we aimed to build AI models to recognize the surgical phase in LPD and explore the performance characteristics of these models. METHODS: Among 69 LPD videos from a single surgical team, we used 42 in the building group to establish the models and the remaining 27 videos in the analysis group to assess the models' performance characteristics. We annotated 13 surgical phases of LPD, including 4 key phases and 9 necessary phases. Two minimally invasive pancreatic surgeons annotated all the videos. We built two AI models, based on convolutional neural networks, for key phase and necessary phase recognition. The overall performance of the AI models was determined mainly by mean average precision (mAP). RESULTS: Overall mAPs of the AI models in the test set of the building group were 89.7% and 84.7% for key phases and necessary phases, respectively. In the 27-video analysis group, overall mAPs were 86.8% and 71.2%, with maximum mAPs of 98.1% and 93.9%. We found that errors in model recognition often coincided with disagreements between the surgeons' annotations, and the AI model performed poorly in cases with anatomic variation or lesion involvement of adjacent organs. CONCLUSIONS: AI automated surgical phase recognition can be achieved in LPD, with outstanding performance in selected cases. This breakthrough may be the first step toward AI- and video-based surgical education in more complex surgeries.
Subjects
Artificial Intelligence , Laparoscopy , Pancreaticoduodenectomy , Video Recording , Pancreaticoduodenectomy/methods , Pancreaticoduodenectomy/education , Humans , Laparoscopy/methods , Laparoscopy/education , Learning Curve
ABSTRACT
PURPOSE: Automatic surgical phase recognition is crucial for video-based assessment systems in surgical education. Utilizing temporal information is essential for surgical phase recognition; hence, various recent approaches extract frame-level features to conduct full-video temporal modeling. METHODS: For better temporal modeling, we propose the SlowFast temporal modeling network (SF-TMN) for offline surgical phase recognition, which can achieve not only frame-level but also segment-level full-video temporal modeling. We employ a feature extraction network, pretrained on the target dataset, to extract features from video frames as the training data for SF-TMN. The Slow Path in SF-TMN utilizes all frame features for frame-level temporal modeling. The Fast Path in SF-TMN utilizes segment-level features summarized from frame features for segment-level temporal modeling. The proposed paradigm is flexible regarding the choice of temporal modeling networks. RESULTS: We explore MS-TCN and ASFormer as temporal modeling networks and experiment with multiple combination strategies for the Slow and Fast Paths. We evaluate SF-TMN on the Cholec80 and Cataract-101 surgical phase recognition tasks and demonstrate that SF-TMN achieves state-of-the-art results on all considered metrics. SF-TMN with an ASFormer backbone outperforms the state-of-the-art Swin BiGRU by approximately 1% in accuracy and 1.5% in recall on Cholec80. We also evaluate SF-TMN on action segmentation datasets, including 50Salads, GTEA, and Breakfast, and achieve state-of-the-art results. CONCLUSION: The improvement in results shows that combining temporal information from both the frame level and the segment level, and refining outputs with temporal refinement stages, is beneficial for the temporal modeling of surgical phases.
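A minimal sketch of the Slow/Fast two-path idea over pre-extracted frame features follows. The pooling length, fusion by addition, and the plain Conv1d temporal models are assumptions for illustration; the paper plugs MS-TCN or ASFormer into each path and adds temporal refinement stages.

```python
# Hedged sketch of frame-level (Slow) plus segment-level (Fast) temporal
# modeling; sizes and the Conv1d stand-ins are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlowFastTemporal(nn.Module):
    def __init__(self, dim=2048, n_classes=7, seg_len=16):
        super().__init__()
        self.seg_len = seg_len
        self.slow = nn.Conv1d(dim, 64, kernel_size=3, padding=1)  # frame path
        self.fast = nn.Conv1d(dim, 64, kernel_size=3, padding=1)  # segment path
        self.head = nn.Conv1d(64, n_classes, kernel_size=1)

    def forward(self, feats):                       # feats: (B, dim, T)
        t = feats.shape[-1]
        slow = self.slow(feats)                     # per-frame temporal model
        segs = F.avg_pool1d(feats, self.seg_len, ceil_mode=True)  # summarize
        fast = self.fast(segs)                      # per-segment temporal model
        fast = F.interpolate(fast, size=t, mode="nearest")  # back to T frames
        return self.head(slow + fast)               # (B, n_classes, T)

x = torch.randn(1, 2048, 500)                       # 500 frames of features
print(SlowFastTemporal()(x).shape)                  # torch.Size([1, 7, 500])
```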
Subjects
Video Recording , Humans , Neural Networks, Computer , Cataract Extraction/methods , Surgery, Computer-Assisted/methods
ABSTRACT
BACKGROUND: Gastric surgery involves numerous surgical phases; however, its steps can be clearly defined. Deep learning-based surgical phase recognition can promote the standardization of gastric surgery, with applications in automatic surgical skill assessment. This study aimed to develop a deep learning-based surgical phase recognition model using multicenter videos of laparoscopic distal gastrectomy and to examine the feasibility of automatic surgical skill assessment using the developed model. METHODS: Surgical videos from 20 hospitals were used. Laparoscopic distal gastrectomy was defined and annotated into nine phases, and a deep learning-based image classification model was developed for phase recognition. We examined whether the developed model's output, including the number of frames in each phase and the adequacy of surgical field development during the phase of supra-pancreatic lymphadenectomy, correlated with the manually assigned skill assessment score. RESULTS: The overall accuracy of phase recognition was 88.8%. In the skill assessment based on the number of frames during the phases of lymphadenectomy of the left greater curvature and reconstruction, the high-score group required significantly fewer frames than the low-score group (829 vs. 1,152, P < 0.01; 1,208 vs. 1,586, P = 0.01, respectively). The output score for the adequacy of surgical field development was significantly higher in the high-score group than in the low-score group (0.975 vs. 0.970, P = 0.04). CONCLUSION: The developed model had high accuracy in phase recognition tasks and has potential for application in automatic surgical skill assessment systems.
Subjects
Laparoscopy , Stomach Neoplasms , Humans , Stomach Neoplasms/surgery , Laparoscopy/methods , Gastroenterostomy , Gastrectomy/methods
ABSTRACT
PURPOSE: Advances in surgical phase recognition are generally led by training deeper networks. Rather than going further with a more complex solution, we believe that current models can be exploited better. We propose a self-knowledge distillation framework that can be integrated into current state-of-the-art (SOTA) models without adding any complexity to the models or requiring extra annotations. METHODS: Knowledge distillation is a framework for network regularization in which knowledge is distilled from a teacher network to a student network. In self-knowledge distillation, the student model becomes the teacher, so the network learns from itself. Most phase recognition models follow an encoder-decoder framework. Our framework utilizes self-knowledge distillation in both stages: the teacher model guides the training process of the student model to extract enhanced feature representations from the encoder and to build a more robust temporal decoder that tackles the over-segmentation problem. RESULTS: We validate our proposed framework on the public Cholec80 dataset. Our framework is embedded on top of four popular SOTA approaches and consistently improves their performance. Specifically, our best GRU model boosts performance by [Formula: see text] accuracy and [Formula: see text] F1-score over the same baseline model. CONCLUSION: We embed a self-knowledge distillation framework for the first time in the surgical phase recognition training pipeline. Experimental results demonstrate that our simple yet powerful framework can improve the performance of existing phase recognition models. Moreover, our extensive experiments show that even with 75% of the training set, we still achieve performance on par with the same baseline model trained on the full set.
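The following sketch illustrates one common way to realize such a self-distillation objective: cross-entropy on ground truth plus a temperature-softened KL term against a teacher copy of the same network. Using an EMA (mean-teacher) snapshot, T = 2, and equal loss weighting are assumptions; the paper's exact teacher construction and weighting may differ.

```python
# Hedged sketch of a self-knowledge distillation objective with an assumed
# EMA teacher; not the paper's exact recipe.
import copy
import torch
import torch.nn.functional as F

def self_distill_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    ce = F.cross_entropy(student_logits, labels)             # supervised term
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=1),  # distilled term
                  F.softmax(teacher_logits / T, dim=1),
                  reduction="batchmean") * (T * T)
    return (1 - alpha) * ce + alpha * kd

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    # Teacher weights trail the student as an exponential moving average.
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(momentum).add_(ps, alpha=1 - momentum)

student = torch.nn.Linear(128, 7)            # stand-in for a phase classifier
teacher = copy.deepcopy(student)
x, y = torch.randn(4, 128), torch.randint(0, 7, (4,))
with torch.no_grad():
    t_logits = teacher(x)                    # teacher provides soft targets only
loss = self_distill_loss(student(x), t_logits, y)
loss.backward()
ema_update(teacher, student)
print(float(loss))
```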
Subjects
Learning , Students , Humans
ABSTRACT
Introduction: With the increasing use of robotic surgical adjuncts, artificial intelligence, and augmented reality in neurosurgery, the automated analysis of digital images and videos acquired during various procedures has become a subject of increased interest. While several computer vision (CV) methods have been developed and implemented for analyzing surgical scenes, few studies have been dedicated to neurosurgery. Research question: In this work, we present a systematic literature review focusing on CV methodologies specifically applied to the analysis of neurosurgical procedures based on intra-operative images and videos. Additionally, we provide recommendations for the future development of CV models in neurosurgery. Material and methods: We conducted a systematic literature search in multiple databases until January 17, 2023, including Web of Science, PubMed, IEEE Xplore, Embase, and SpringerLink. Results: We identified 17 studies employing CV algorithms on neurosurgical videos/images. The most common applications of CV were tool and neuroanatomical structure detection or characterization and, to a lesser extent, surgical workflow analysis. Convolutional neural networks (CNN) were the most frequently utilized architecture for CV models (65%), demonstrating superior performance in tool detection and segmentation. In particular, Mask R-CNN manifested the most robust performance across different modalities. Discussion and conclusion: Our systematic review demonstrates that reported CV models can effectively detect and differentiate tools, surgical phases, neuroanatomical structures, and critical events in complex neurosurgical scenes with accuracies above 95%. Automated tool recognition contributes to the objective characterization and assessment of surgical performance, with potential applications in neurosurgical training and intra-operative safety management.
ABSTRACT
Surgical workflow analysis is essential to help optimize surgery by encouraging efficient communication and use of resources. However, phase recognition performance is limited when it relies only on information about the presence of surgical instruments. To address this problem, we propose visual-modality-based multimodal fusion for surgical phase recognition, which overcomes the limited diversity of information such as instrument presence. Using the proposed methods, we extract a visual kinematics-based index (VKI) related to instrument use, such as movement and the interrelations between instruments during surgery. In addition, we improve recognition performance with an effective convolutional neural network (CNN)-based method for fusing visual features with the VKI. The VKI improves the understanding of a surgical procedure because its information relates to instrument interaction. Furthermore, these indices can be extracted in any environment, such as laparoscopic surgery, and provide complementary information that can compensate for errors in system kinematics logs. The proposed methodology was applied to two multimodal datasets, a virtual reality (VR) simulator-based dataset (PETRAW) and a private distal gastrectomy surgery dataset, to verify that it can help improve recognition performance in clinical environments. We also explored the influence of the VKI on recognizing each surgical workflow through instrument presence and instrument trajectory. Through experimental results on the distal gastrectomy video dataset, we validated the effectiveness of our proposed fusion approach for surgical phase recognition. The relatively simple yet index-incorporated fusion we propose can yield significant performance improvements over CNN-only training and trains effectively compared with Transformer-based fusion, which requires a large amount of pre-training data.
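As a concrete illustration of the fusion step, the sketch below treats the VKI as a small per-frame vector of motion statistics and fuses it with CNN frame features by concatenation and 1x1 convolutions; the VKI dimensionality and the late-fusion design are assumptions, not the paper's exact architecture.

```python
# Hedged sketch of fusing frame features with a visual kinematics-based index
# (VKI); the VKI content and fusion design are illustrative assumptions.
import torch
import torch.nn as nn

class VKIFusion(nn.Module):
    def __init__(self, vis_dim=512, vki_dim=8, n_classes=7):
        super().__init__()
        self.vki_proj = nn.Conv1d(vki_dim, 64, kernel_size=1)
        self.fuse = nn.Sequential(
            nn.Conv1d(vis_dim + 64, 256, kernel_size=1), nn.ReLU(),
            nn.Conv1d(256, n_classes, kernel_size=1))

    def forward(self, vis, vki):            # vis: (B, 512, T), vki: (B, 8, T)
        vki = self.vki_proj(vki)            # embed the kinematic index
        return self.fuse(torch.cat([vis, vki], dim=1))  # (B, n_classes, T)

vis = torch.randn(1, 512, 300)              # CNN features per frame
vki = torch.randn(1, 8, 300)                # e.g. tool speed, path length, ...
print(VKIFusion()(vis, vki).shape)          # torch.Size([1, 7, 300])
```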
ABSTRACT
Video-recorded robotic-assisted surgeries allow the use of automated computer vision and artificial intelligence/deep learning (DL) methods for quality assessment and workflow analysis in surgical phase recognition. We considered a dataset of 209 videos of robotic-assisted laparoscopic inguinal hernia repair (RALIHR) collected from 8 surgeons, defined rigorous ground-truth annotation rules, then pre-processed and annotated the videos. We deployed seven DL models to establish the baseline accuracy for surgical phase recognition and explored four advanced architectures. For rapid execution of the studies, we initially engaged three dozen MS-level engineering students in a competitive classroom setting, followed by focused research. We unified the data processing pipeline in a confirmatory study and explored a number of scenarios that differ in how the DL networks were trained and evaluated. For the scenario with 21 validation videos from all surgeons, the Video Swin Transformer model achieved ~0.85 validation accuracy, and the Perceiver IO model achieved ~0.84. Our studies affirm the necessity of close collaborative research between medical experts and engineers for developing automated surgical phase recognition models deployable in clinical settings.
ABSTRACT
Adapting intelligent context-aware systems (CAS) to future operating rooms (OR) aims to improve situational awareness and provide surgical decision support to medical teams. CAS analyzes data streams from available devices during surgery and communicates real-time knowledge to clinicians. Indeed, recent advances in computer vision and machine learning, particularly deep learning, have paved the way for extensive research on CAS. In this work, a deep learning approach was proposed for surgical phase recognition, tool classification, and weakly-supervised tool localization in laparoscopic videos. The ResNet-50 convolutional neural network (CNN) architecture was adapted by adding attention modules and fusing features from multiple stages to generate better-focused, generalized, and well-representative features. Then, a multi-map convolutional layer followed by tool-wise and spatial pooling operations was utilized to perform tool localization and generate tool presence confidences. Finally, a long short-term memory (LSTM) network was employed to model temporal information and perform tool classification and phase recognition. The proposed approach was evaluated on the Cholec80 dataset. The experimental results (i.e., 88.5% and 89.0% mean precision and recall for phase recognition, respectively; 95.6% mean average precision for tool presence detection; and a 70.1% F1-score for tool localization) demonstrated the ability of the model to learn discriminative features for all tasks. The performance revealed the importance of integrating attention modules and multi-stage feature fusion for more robust and precise detection of surgical phases and tools.
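The sketch below shows the weak-localization recipe the abstract describes: a multi-map 1x1 convolution yields one activation map per tool, spatial max-pooling turns maps into presence logits, and an LSTM models temporal context for phase recognition. The backbone features, sizes, and pooling choice are illustrative; the attention modules and multi-stage fusion are omitted.

```python
# Hedged sketch of multi-map tool localization + LSTM phase recognition;
# sizes and pooling choices are assumptions, not the paper's exact model.
import torch
import torch.nn as nn

class ToolPhaseNet(nn.Module):
    def __init__(self, feat_ch=2048, n_tools=7, n_phases=7):
        super().__init__()
        self.tool_maps = nn.Conv2d(feat_ch, n_tools, kernel_size=1)
        self.lstm = nn.LSTM(feat_ch, 128, batch_first=True)
        self.phase_head = nn.Linear(128, n_phases)

    def forward(self, feats):               # feats: (B, T, C, H, W) from a CNN
        b, t, c, h, w = feats.shape
        maps = self.tool_maps(feats.flatten(0, 1))          # (B*T, tools, H, W)
        tool_logits = maps.amax(dim=(2, 3)).view(b, t, -1)  # spatial max-pool
        frame_vec = feats.mean(dim=(3, 4))                  # (B, T, C) pooled
        hidden, _ = self.lstm(frame_vec)
        return tool_logits, self.phase_head(hidden), maps

feats = torch.randn(1, 16, 2048, 7, 7)     # e.g. ResNet-50 stage-5 features
tools, phases, maps = ToolPhaseNet()(feats)
print(tools.shape, phases.shape)           # (1, 16, 7) (1, 16, 7)
```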
Subjects
Awareness , Laparoscopy , Operating Rooms , Attention
ABSTRACT
BACKGROUND: Surgical video phase recognition is an essential technique in computer-assisted surgical systems for monitoring surgical procedures; it can assist surgeons in standardizing procedures and enhancing postsurgical assessment and indexing. However, the high similarity between phases and the temporal variations of cataract videos still pose the greatest challenge for video phase recognition. METHODS: In this paper, we introduce a global-local multi-stage temporal convolutional network (GL-MSTCN) to explore the subtle differences between highly similar surgical phases and mitigate the temporal variations of surgical videos. The presented work consists of a triple-stream network (i.e., pupil stream, instrument stream, and video frame stream) and a multi-stage temporal convolutional network. The triple-stream network first detects the pupil and surgical instrument regions in each frame separately and then obtains fine-grained semantic features of the video frames. The proposed multi-stage temporal convolutional network improves surgical phase recognition performance by capturing longer time-series features through dilated convolutional layers with varying receptive fields. RESULTS: Our method is thoroughly validated on the CSVideo dataset with 32 cataract surgery videos and the public Cataract101 dataset with 101 cataract surgery videos, outperforming state-of-the-art approaches with 95.8% and 96.5% accuracy, respectively. CONCLUSIONS: The experimental results show that the use of global and local feature information can effectively help the model explore fine-grained features and mitigate temporal and spatial variations, thus improving the surgical phase recognition performance of the proposed GL-MSTCN.
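To make the dilated-convolution mechanism concrete, here is a minimal single stage in the MS-TCN style, with the receptive field doubling at every layer; the layer count and channel width are illustrative assumptions, and GL-MSTCN stacks several such stages over its triple-stream features.

```python
# Hedged sketch of one dilated temporal convolution stage (MS-TCN style);
# depth and width are illustrative assumptions.
import torch
import torch.nn as nn

class DilatedStage(nn.Module):
    def __init__(self, in_ch, n_classes, width=64, layers=10):
        super().__init__()
        self.inp = nn.Conv1d(in_ch, width, 1)
        self.blocks = nn.ModuleList(
            nn.Conv1d(width, width, 3, padding=2 ** i, dilation=2 ** i)
            for i in range(layers))          # receptive field doubles per layer
        self.out = nn.Conv1d(width, n_classes, 1)

    def forward(self, x):                    # x: (B, in_ch, T)
        x = self.inp(x)
        for conv in self.blocks:
            x = x + torch.relu(conv(x))      # residual dilated block
        return self.out(x)                   # frame-wise class logits

x = torch.randn(1, 2048, 1000)
print(DilatedStage(2048, 10)(x).shape)       # torch.Size([1, 10, 1000])
```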
Subjects
Cataract , Humans , Semantics , Computer Systems , Time Factors
ABSTRACT
BACKGROUND: Because of the complexity of the intra-abdominal anatomy in the posterior approach, a longer learning curve has been observed in laparoscopic transabdominal preperitoneal (TAPP) inguinal hernia repair. Consequently, automatic tools using artificial intelligence (AI) to monitor TAPP procedures and assess learning curves are required. The primary objective of this study was to establish a deep learning-based automated surgical phase recognition system for TAPP. A secondary objective was to investigate the relationship between surgical skills and phase duration. METHODS: This study enrolled 119 patients who underwent the TAPP procedure. The surgical videos were annotated (delineated in time) and split into eight surgical phases (preparation, peritoneal flap incision, peritoneal flap dissection, hernia dissection, mesh deployment, mesh fixation, peritoneal flap closure, and additional closure). An AI model was trained to automatically recognize surgical phases from videos. The relationship between phase duration and surgical skills was also evaluated. RESULTS: A fourfold cross-validation was used to assess the performance of the AI model. The accuracy was 88.81% and 85.82% in unilateral and bilateral cases, respectively. In unilateral hernia cases, the durations of peritoneal incision (p = 0.003) and hernia dissection (p = 0.014) detected via AI were significantly shorter for experts than for trainees. CONCLUSION: An automated surgical phase recognition system was established for TAPP using deep learning with high accuracy. Our AI-based system can be useful for automatic monitoring of surgical progress, improving OR efficiency, evaluating surgical skills, and video-based surgical education. Specific phase durations detected via the AI model were significantly associated with the surgeons' learning curve.
Subjects
Hernia, Inguinal , Laparoscopy , Humans , Hernia, Inguinal/surgery , Herniorrhaphy/methods , Surgical Mesh , Artificial Intelligence , Laparoscopy/methods
ABSTRACT
PURPOSE: We tackle the problem of online surgical phase recognition in laparoscopic procedures, which is key to developing context-aware supporting systems. We propose a novel approach that takes the temporal context of surgical videos into account by precisely modeling temporal neighborhoods. METHODS: We propose a two-stage model to perform phase recognition. A CNN model is used as a feature extractor to project RGB frames into a high-dimensional feature space. We introduce a novel paradigm for surgical phase recognition that utilizes graph neural networks to incorporate temporal information. Unlike recurrent neural networks and temporal convolution networks, our graph-based approach offers a more generic and flexible way of modeling temporal relationships. Each frame is a node in the graph, and the edges of the graph define temporal connections among the nodes. The flexible configuration of the temporal neighborhood comes at the price of losing temporal order. To mitigate this, our approach takes temporal order into account by encoding frame positions, which is important for reliably predicting surgical phases. RESULTS: Experiments are carried out on the public Cholec80 dataset, which contains 80 annotated videos. The experimental results highlight the superior performance of the proposed approach compared to the state-of-the-art models on this dataset. CONCLUSION: A novel approach for formulating video-based surgical phase recognition is presented. The results indicate that temporal information can be incorporated using graph-based models and that positional encoding is important to efficiently utilize temporal information. Graph networks open possibilities to use evidence theory for uncertainty analysis in surgical phase recognition.
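A minimal sketch of this graph formulation: frames become nodes, edges connect temporal neighbours within a fixed radius, and sinusoidal positional encodings restore the frame order the graph discards. The single mean-aggregation message-passing step below stands in for whichever GNN layer the paper actually uses.

```python
# Hedged sketch of a temporal graph over frames with positional encoding;
# the aggregation rule and radius are illustrative assumptions.
import torch

def positional_encoding(t, dim):
    pos = torch.arange(t, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, dim, 2).float()
                    * (-torch.log(torch.tensor(10000.0)) / dim))
    pe = torch.zeros(t, dim)
    pe[:, 0::2], pe[:, 1::2] = torch.sin(pos * div), torch.cos(pos * div)
    return pe

def temporal_graph_layer(node_feats, radius=2):
    t, d = node_feats.shape
    # Adjacency: frame i is linked to frames within `radius` steps (and itself).
    idx = torch.arange(t)
    adj = ((idx[:, None] - idx[None, :]).abs() <= radius).float()
    adj = adj / adj.sum(dim=1, keepdim=True)        # mean aggregation
    return adj @ node_feats                          # aggregate neighbours

frames = torch.randn(100, 256)                       # 100 frame embeddings
frames = frames + positional_encoding(100, 256)      # keep temporal order
print(temporal_graph_layer(frames).shape)            # torch.Size([100, 256])
```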
Subjects
Laparoscopy , Neural Networks, Computer , Humans , Laparoscopy/methods , Workflow
ABSTRACT
Recent years have witnessed artificial intelligence (AI) make meteoric leaps in both medicine and surgery, bridging the gap between the capabilities of humans and machines. The digitization of operating rooms and the creation of massive quantities of data have paved the way for machine learning and computer vision applications in surgery. Surgical phase recognition (SPR) is a newly emerging technology that uses data derived from operative videos to train machine and deep learning algorithms to identify the phases of surgery. Advancement of this technology will be key to establishing context-aware surgical systems in the future. By automatically recognizing and evaluating the current surgical scenario, these intelligent systems are able to provide intraoperative decision support, improve operating room efficiency, assess surgical skills, and aid in surgical training and education. Still in its infancy, SPR has been studied mainly in laparoscopic surgeries, in stark contrast to the near-total lack of research within neurosurgery. Given the high-tech and rapidly advancing nature of neurosurgery, we believe SPR has tremendous untapped potential in this field. Herein, we present an overview of SPR technology, its potential applications in neurosurgery, and the challenges that lie ahead.
Subjects
Deep Learning , Neurosurgery , Artificial Intelligence , Humans , Machine Learning , Neurosurgical Procedures
ABSTRACT
In recent years, many studies on surgical video analysis have been conducted due to its growing importance in many medical applications. In particular, it is very important to be able to recognize the current surgical phase, because the phase information can be utilized in various ways both during and after surgery. This paper proposes an efficient phase recognition network, called MomentNet, for cholecystectomy endoscopic videos. Unlike LSTM-based networks, MomentNet is based on a multi-stage temporal convolutional network. In addition, to improve phase prediction accuracy, the proposed method adopts a new loss function to supplement the general cross-entropy loss function. The new loss function significantly improves the performance of the phase recognition network by constraining undesirable phase transitions and preventing over-segmentation. MomentNet also effectively applies positional encoding techniques, which are commonly used in transformer architectures, to the multi-stage temporal convolutional network. By using positional encoding, MomentNet can provide important temporal context, resulting in higher phase prediction accuracy. Furthermore, MomentNet applies label smoothing to suppress overfitting, and replaces the backbone network for feature extraction to further improve performance. As a result, MomentNet achieves 92.31% accuracy in the phase recognition task on the Cholec80 dataset, which is 4.55% higher than that of the baseline architecture.
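The sketch below illustrates the two auxiliary-loss ideas the abstract credits: a truncated MSE on adjacent-frame log-probabilities (the smoothing term popularized by MS-TCN, which fights over-segmentation) and a soft penalty on transitions marked as undesirable in a user-supplied mask. Combining them this way, and the weights used, are assumptions rather than MomentNet's published loss.

```python
# Hedged sketch of a smoothing loss plus a phase-transition penalty;
# weights and the mask construction are illustrative assumptions.
import torch
import torch.nn.functional as F

def smoothing_loss(logits, tau=4.0):            # logits: (B, C, T)
    logp = F.log_softmax(logits, dim=1)
    diff = (logp[:, :, 1:] - logp[:, :, :-1].detach()).clamp(-tau, tau)
    return (diff ** 2).mean()

def transition_penalty(logits, allowed):        # allowed: (C, C) 0/1 mask
    p = F.softmax(logits, dim=1)                # probability per frame
    # Soft expected count of (prev, next) phase pairs over consecutive frames.
    joint = torch.einsum("bct,bdt->bcd", p[:, :, :-1], p[:, :, 1:])
    return (joint * (1 - allowed)).sum() / logits.shape[-1]

allowed = torch.ones(7, 7)                      # e.g. forbid phase 3 -> 1
allowed[3, 1] = 0
logits = torch.randn(1, 7, 200, requires_grad=True)
loss = smoothing_loss(logits) + 0.1 * transition_penalty(logits, allowed)
loss.backward()
print(float(loss))
```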
ABSTRACT
BACKGROUND: Artificial intelligence and computer vision have revolutionized laparoscopic surgical video analysis. However, there is no multi-center study focused on deep learning-based recognition of laparoscopic cholecystectomy phases. This work aims to apply artificial intelligence to recognizing and analyzing phases in laparoscopic cholecystectomy videos from multiple centers. METHODS: This observational cohort study included 163 laparoscopic cholecystectomy videos collected from four medical centers. Videos were labeled by surgeons, and a deep learning model was developed based on 90 videos. Thereafter, the performance of the model was tested on an additional ten videos by comparing it with the annotated ground truth of the surgeon. Deep learning models were trained to identify laparoscopic cholecystectomy phases. The performance of the models was measured using precision, recall, F1 score, and overall accuracy. Given the high overall accuracy of the model, an additional 63 videos, serving as an analysis set, were analyzed by the model to identify the different phases. RESULTS: The mean concordance correlation coefficient for the surgeons' annotations across all operative phases was 92.38%. The overall phase recognition accuracy of laparoscopic cholecystectomy by the model was 91.05%. In the analysis set, the average surgery time was 2195 ± 896 s, with large individual variance across surgical phases. Notably, laparoscopic cholecystectomy in acute cholecystitis cases had prolonged overall durations, and surgeons spent more time in the mobilizing-the-hepatocystic-triangle phase. CONCLUSION: A deep learning model based on multi-center data can identify the phases of laparoscopic cholecystectomy with a high degree of accuracy. With continued refinement, artificial intelligence could be utilized in large-scale surgical data analysis to achieve clinically relevant future applications.
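For reference, the concordance correlation coefficient reported above for inter-surgeon agreement is Lin's CCC, sketched below; how the paper maps phase annotations to the numeric sequences being compared is an assumption here.

```python
# Hedged sketch of Lin's concordance correlation coefficient; the example
# annotation sequences are illustrative, not from the study.
import numpy as np

def concordance_ccc(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()                 # population variances
    cov = ((x - mx) * (y - my)).mean()
    return 2 * cov / (vx + vy + (mx - my) ** 2)

# Two surgeons' per-frame phase annotations of the same video (illustrative).
a = np.array([0, 0, 1, 1, 2, 2, 3, 3])
b = np.array([0, 0, 1, 2, 2, 2, 3, 3])
print(round(concordance_ccc(a, b), 4))
```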
Subjects
Artificial Intelligence , Cholecystectomy, Laparoscopic , Humans
ABSTRACT
Computer-assisted interventions (CAI) aim to increase the effectiveness, precision and repeatability of procedures to improve surgical outcomes. The presence and motion of surgical tools are key information inputs for CAI surgical phase recognition algorithms. Vision-based tool detection and recognition approaches are an attractive solution and can be designed to take advantage of the powerful deep learning paradigm that is rapidly advancing image recognition and classification. The challenge for such algorithms is the availability and quality of labelled data used for training. In this Letter, surgical simulation is used to train tool detection and segmentation based on deep convolutional neural networks and generative adversarial networks. The authors experiment with two network architectures for segmenting tool classes commonly encountered during cataract surgery. A commercially available simulator is used to create a simulated cataract dataset for training models prior to performing transfer learning on real surgical data. To the best of the authors' knowledge, this is the first attempt to train deep learning models for surgical instrument detection on simulated data while demonstrating promising generalisation to real data. Results indicate that simulated data has some potential for training advanced classification methods for CAI systems.
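A minimal sketch of the sim-to-real transfer step described above: pretrain a segmentation network on simulator frames, then fine-tune on real frames with the encoder frozen and a smaller learning rate. The tiny encoder/decoder and synthetic tensors are stand-ins, not the authors' GAN-based pipeline.

```python
# Hedged sketch of pretrain-on-simulated, fine-tune-on-real transfer learning;
# the architecture and data loaders are illustrative stand-ins.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
decoder = nn.Conv2d(16, 2, 1)                 # 2 classes: tool vs background
model = nn.Sequential(encoder, decoder)
criterion = nn.CrossEntropyLoss()

def run_epoch(loader, optimizer):
    for imgs, masks in loader:                # masks: (B, H, W) int labels
        optimizer.zero_grad()
        loss = criterion(model(imgs), masks)
        loss.backward()
        optimizer.step()

fake = [(torch.randn(2, 3, 64, 64), torch.randint(0, 2, (2, 64, 64)))]
# Stage 1: pretrain on (simulated) data.
run_epoch(fake, torch.optim.Adam(model.parameters(), lr=1e-3))
# Stage 2: transfer to (real) data -- freeze encoder, fine-tune decoder only.
for p in encoder.parameters():
    p.requires_grad_(False)
run_epoch(fake, torch.optim.Adam(decoder.parameters(), lr=1e-4))
print("done")
```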