ABSTRACT
OBJECTIVES: We aimed to study classical, publicly available 3D convolutional neural networks (3D-CNNs) using a combination of several cine-MR orientation planes for the estimation of left ventricular ejection fraction (LVEF) without contour tracing. METHODS: Cine-MR examinations carried out on 1082 patients from our institution were analysed by comparing the LVEF provided by the CVI42 software (V5.9.3) with the estimates produced by different 3D-CNN models and various combinations of long-axis and short-axis orientation planes. RESULTS: The 3D-ResNet18 architecture appeared to be the most favourable, and the results improved gradually and significantly as more long-axis and short-axis planes were combined. Simply pasting multiple orientation views into composite frames increased performance. Optimal results were obtained by pasting two long-axis views and six short-axis views. The best configuration provided an R² of 0.83, a mean absolute error (MAE) of 4.97, and a root mean square error (RMSE) of 6.29; the area under the ROC curve (AUC) was 0.99 for the classification of LVEF < 40% and 0.97 for the classification of LVEF > 60%. Internal validation performed on 149 additional patients after model training provided very similar results (MAE 4.98). External validation carried out on 62 patients from another institution showed an MAE of 6.59. Our results are among the most promising obtained to date using CNNs with cardiac magnetic resonance. CONCLUSION: (1) The use of traditional 3D-CNNs with a combination of multiple orientation planes can estimate LVEF from cine-MRI data without segmenting ventricular contours, with a reliability similar to that of traditional methods. (2) Performance improves significantly as the number of orientation planes increases, providing a more complete view of the left ventricle.
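As a minimal sketch of the approach just described (assuming a PyTorch/torchvision environment): several orientation planes are pasted into one composite frame per time step, and the resulting clip is fed to a 3D-ResNet18 with a single regression output. The tile size, grid layout and helper names are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models.video import r3d_18

def make_composite_clip(views, tile=112):
    """views: 8 tensors of shape (T, H, W), e.g. 2 long-axis + 6 short-axis cine planes,
    pasted into a 2x4 grid per time step -> (3, T, 2*tile, 4*tile)."""
    views = [F.interpolate(v.unsqueeze(1), size=(tile, tile), mode="bilinear",
                           align_corners=False).squeeze(1) for v in views]
    rows = [torch.cat(views[i:i + 4], dim=-1) for i in (0, 4)]   # two rows of four tiles
    grid = torch.cat(rows, dim=-2)                               # (T, 2*tile, 4*tile)
    return grid.unsqueeze(0).repeat(3, 1, 1, 1)                  # grey -> 3 channels

model = r3d_18(weights=None)                      # plain, publicly available 3D-CNN
model.fc = nn.Linear(model.fc.in_features, 1)     # single LVEF regression output

clip = make_composite_clip([torch.rand(25, 192, 192) for _ in range(8)])
lvef_pred = model(clip.unsqueeze(0))              # (1, 1); trained with an MAE/MSE loss
```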
ABSTRACT
Artificial intelligence (AI) holds significant potential for enhancing the quality of gastrointestinal (GI) endoscopy, but the adoption of AI in clinical practice is hampered by the lack of rigorous standardisation and development methodology ensuring generalisability. The aim of the Quality Assessment of pre-clinical AI studies in Diagnostic Endoscopy (QUAIDE) Explanation and Checklist was to develop recommendations for the standardised design and reporting of preclinical AI studies in GI endoscopy. The recommendations were developed based on a formal consensus approach with an international multidisciplinary panel of 32 experts among endoscopists and computer scientists. The Delphi methodology was employed to achieve consensus on statements, with a predetermined threshold of 80% agreement. A maximum of three rounds of voting were permitted. Consensus was reached on 18 key recommendations, covering 6 key domains: data acquisition and annotation (6 statements), outcome reporting (3 statements), experimental setup and algorithm architecture (4 statements) and result presentation and interpretation (5 statements). QUAIDE provides recommendations on how to properly design (1. Methods, statements 1-14), present results (2. Results, statements 15-16) and integrate and interpret the obtained results (3. Discussion, statements 17-18). The QUAIDE framework offers practical guidance for authors, readers, editors and reviewers involved in preclinical AI studies in GI endoscopy, aiming to improve design and reporting and thereby promote research standardisation and accelerate the translation of AI innovations into clinical practice.
ABSTRACT
Surgical technique is essential to ensure safe minimally invasive adrenalectomy. Owing to the relative rarity of adrenal surgery, it is challenging to ensure adequate exposure during surgical training. Surgical video analysis supports self-evaluation and expert assessment, and could be a target for automation. The developed ontology was validated by a European expert consensus and is applicable across the surgical techniques encountered in all participating centres, with an exemplary demonstration in bi-centric recordings. Standardization of adrenalectomy video analysis may foster surgical training and enable machine learning training for automated safety alerts.
Subjects
Adrenalectomy, Delphi Technique, Laparoscopy, Machine Learning, Humans, Adrenalectomy/education, Adrenalectomy/methods, Laparoscopy/education, Laparoscopy/methods, Pilot Projects, Video Recording
ABSTRACT
Artificial intelligence (AI) will power many of the tools in the armamentarium of digital surgeons. AI methods and surgical proofs of concept are flourishing, but we have yet to witness clinical translation and value. Here we exemplify the potential of AI in the care pathway of colorectal cancer patients and discuss clinical, technical, and governance considerations of major importance for the safe translation of surgical AI for the benefit of our patients and practices.
Subjects
Artificial Intelligence, Colorectal Neoplasms, Humans, Colorectal Neoplasms/surgery
ABSTRACT
PURPOSE: The modern operating room is becoming increasingly complex, requiring innovative intra-operative support systems. While the focus of surgical data science has largely been on video analysis, integrating surgical computer vision with natural language capabilities is emerging as a necessity. Our work aims to advance visual question answering (VQA) in the surgical context with scene graph knowledge, addressing two main challenges in current surgical VQA systems: removing question-condition bias in the surgical VQA dataset and incorporating scene-aware reasoning in the surgical VQA model design. METHODS: First, we propose a surgical scene graph-based dataset, SSG-VQA, generated by employing segmentation and detection models on publicly available datasets. We build surgical scene graphs using spatial and action information of instruments and anatomies. These graphs are fed into a question engine, generating diverse QA pairs. We then propose SSG-VQA-Net, a novel surgical VQA model incorporating a lightweight Scene-embedded Interaction Module, which integrates geometric scene knowledge into the VQA model design by employing cross-attention between the textual and the scene features. RESULTS: Our comprehensive analysis shows that our SSG-VQA dataset provides a more complex, diverse, geometrically grounded, unbiased and surgical action-oriented dataset compared to existing surgical VQA datasets, and that SSG-VQA-Net outperforms existing methods across different question types and complexities. We highlight that the primary limitation in current surgical VQA systems is the lack of scene knowledge needed to answer complex queries. CONCLUSION: We present a novel surgical VQA dataset and model and show that results can be significantly improved by incorporating geometric scene features in the VQA model design. We point out that the bottleneck of current surgical visual question-answering models lies in learning the encoded representation rather than decoding the sequence. Our SSG-VQA dataset provides a diagnostic benchmark to test the scene understanding and reasoning capabilities of the model. The source code and the dataset will be made publicly available at: https://github.com/CAMMA-public/SSG-VQA .
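To illustrate the kind of scene-aware fusion described above, the sketch below shows a generic cross-attention block in which question-token features attend over scene-graph node features. It is a schematic stand-in, not the published Scene-embedded Interaction Module; all dimensions, layer choices and names are assumptions.

```python
import torch
import torch.nn as nn

class SceneTextCrossAttention(nn.Module):
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens, scene_nodes):
        # text_tokens: (B, Lt, dim) question embedding; scene_nodes: (B, Ln, dim)
        # geometric/semantic node features from the surgical scene graph.
        fused, _ = self.attn(query=text_tokens, key=scene_nodes, value=scene_nodes)
        return self.norm(text_tokens + fused)          # residual fusion

module = SceneTextCrossAttention()
q = torch.rand(2, 12, 256)        # 12 question tokens
g = torch.rand(2, 20, 256)        # 20 scene-graph nodes (instruments/anatomy)
out = module(q, g)                # (2, 12, 256), fed to an answer classifier
```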
Subjects
Operating Rooms, Humans, Computer-Assisted Surgery/methods, Natural Language Processing, Video Recording
ABSTRACT
PURPOSE: Most studies on surgical activity recognition utilizing artificial intelligence (AI) have focused mainly on recognizing one type of activity from small and mono-centric surgical video datasets. It remains speculative whether those models would generalize to other centers. METHODS: In this work, we introduce a large multi-centric multi-activity dataset consisting of 140 surgical videos (MultiBypass140) of laparoscopic Roux-en-Y gastric bypass (LRYGB) surgeries performed at two medical centers, i.e., the University Hospital of Strasbourg, France (StrasBypass70) and Inselspital, Bern University Hospital, Switzerland (BernBypass70). The dataset has been fully annotated with phases and steps by two board-certified surgeons. Furthermore, we assess the generalizability of, and benchmark, different deep learning models for the task of phase and step recognition in 7 experimental studies: (1) training and evaluation on BernBypass70; (2) training and evaluation on StrasBypass70; (3) training and evaluation on the joint MultiBypass140 dataset; (4) training on BernBypass70, evaluation on StrasBypass70; (5) training on StrasBypass70, evaluation on BernBypass70; (6) training on MultiBypass140, evaluation on BernBypass70; and (7) training on MultiBypass140, evaluation on StrasBypass70. RESULTS: The models' performance is markedly influenced by the training data. The worst results were obtained in experiments (4) and (5), confirming the limited generalization capabilities of models trained on mono-centric data. The use of multi-centric training data, experiments (6) and (7), improves the generalization capabilities of the models, bringing them beyond the level of independent mono-centric training and validation (experiments (1) and (2)). CONCLUSION: MultiBypass140 shows considerable variation in surgical technique and workflow of LRYGB procedures between centers. Therefore, the generalization experiments demonstrate a remarkable difference in model performance. These results highlight the importance of multi-centric datasets for AI model generalization to account for variance in surgical techniques and workflows. The dataset and code are publicly available at https://github.com/CAMMA-public/MultiBypass140.
ABSTRACT
PURPOSE: Advances in deep learning have resulted in effective models for surgical video analysis; however, these models often fail to generalize across medical centers due to domain shift caused by variations in surgical workflow, camera setups, and patient demographics. Recently, object-centric learning has emerged as a promising approach for improved surgical scene understanding, capturing and disentangling visual and semantic properties of surgical tools and anatomy to improve downstream task performance. In this work, we conduct a multicentric performance benchmark of object-centric approaches, focusing on critical view of safety assessment in laparoscopic cholecystectomy, then propose an improved approach for unseen domain generalization. METHODS: We evaluate four object-centric approaches for domain generalization, establishing baseline performance. Next, leveraging the disentangled nature of object-centric representations, we dissect one of these methods through a series of ablations (e.g., ignoring either visual or semantic features for downstream classification). Finally, based on the results of these ablations, we develop an optimized method specifically tailored for domain generalization, LG-DG, that includes a novel disentanglement loss function. RESULTS: Our optimized approach, LG-DG, achieves an improvement of 9.28% over the best baseline approach. More broadly, we show that object-centric approaches are highly effective for domain generalization thanks to their modular approach to representation learning. CONCLUSION: We investigate the use of object-centric methods for unseen domain generalization, identify method-agnostic factors critical for performance, and present an optimized approach that substantially outperforms existing methods.
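As a purely illustrative sketch of the disentanglement idea (not the loss actually defined in LG-DG), one generic option is to penalise the cross-correlation between the semantic and visual blocks of each object-centric embedding; the function below, including its name and feature sizes, is an assumption.

```python
import torch

def decorrelation_loss(sem, vis, eps=1e-6):
    """sem, vis: (N, D) semantic and visual feature blocks for N objects."""
    sem = (sem - sem.mean(0)) / (sem.std(0) + eps)     # standardise each dimension
    vis = (vis - vis.mean(0)) / (vis.std(0) + eps)
    cross_corr = sem.T @ vis / sem.shape[0]            # (D, D) cross-correlation matrix
    return cross_corr.pow(2).mean()                    # push cross-correlation towards zero

sem = torch.rand(32, 64)
vis = torch.rand(32, 64)
loss = decorrelation_loss(sem, vis)                    # added to the downstream task loss
```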
Subjects
Laparoscopic Cholecystectomy, Humans, Laparoscopic Cholecystectomy/methods, Video Recording, Deep Learning
ABSTRACT
PURPOSE: In medical research, deep learning models rely on high-quality annotated data, a process that is often laborious and time-consuming. This is particularly true for detection tasks, where bounding box annotations are required. The need to adjust two corners makes the process inherently frame-by-frame. Given the scarcity of experts' time, efficient annotation methods suitable for clinicians are needed. METHODS: We propose an on-the-fly method for live video annotation to enhance annotation efficiency. In this approach, a continuous single-point annotation is maintained by keeping the cursor on the object in a live video, mitigating the need for the tedious pausing and repetitive navigation inherent in traditional annotation methods. This novel annotation paradigm inherits the point annotation's ability to generate pseudo-labels using a point-to-box teacher model. We empirically evaluate this approach by developing a dataset and comparing on-the-fly annotation time against the traditional annotation method. RESULTS: Using our method, annotation was 3.2 × faster than with the traditional annotation technique. We achieved a mean improvement of 6.51 ± 0.98 AP@50 over the conventional method at equivalent annotation budgets on the developed dataset. CONCLUSION: Without bells and whistles, our approach offers a significant speed-up in annotation tasks. It can be easily implemented on any annotation platform to accelerate the integration of deep learning in video-based medical research.
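A minimal sketch of the idea, under stated assumptions: the cursor position recorded while the video plays yields one point per frame, and a point-to-box teacher converts each point into a pseudo bounding box. The teacher here is replaced by a trivial fixed-size box so the example stays self-contained; names, image size and box size are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class PseudoBox:
    frame: int
    x1: float
    y1: float
    x2: float
    y2: float

def points_to_pseudo_boxes(cursor_track, box_size=80, img_w=1920, img_h=1080):
    """cursor_track: list of (frame_idx, x, y) sampled while the video plays live."""
    half = box_size / 2
    boxes = []
    for frame, x, y in cursor_track:
        boxes.append(PseudoBox(
            frame,
            max(0, x - half), max(0, y - half),
            min(img_w, x + half), min(img_h, y + half),
        ))
    return boxes

track = [(0, 640, 360), (1, 652, 364), (2, 661, 371)]   # cursor kept on the instrument tip
print(points_to_pseudo_boxes(track))
```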
Subjects
Deep Learning, Video Recording, Video Recording/methods, Humans, Data Curation/methods
ABSTRACT
Assessing the critical view of safety (CVS) in laparoscopic cholecystectomy requires accurate identification and localization of key anatomical structures, reasoning about their geometric relationships to one another, and determining the quality of their exposure. Prior works have approached this task by including semantic segmentation as an intermediate step, using predicted segmentation masks to then predict the CVS. While these methods are effective, they rely on extremely expensive ground-truth segmentation annotations and tend to fail when the predicted segmentation is incorrect, limiting generalization. In this work, we propose a method for CVS prediction in which we first represent a surgical image using a disentangled latent scene graph, then process this representation with a graph neural network. Our graph representations explicitly encode semantic information (object location, class information, geometric relations) to improve anatomy-driven reasoning, as well as visual features to retain differentiability and thereby provide robustness to semantic errors. Finally, to address annotation cost, we propose to train our method using only bounding box annotations, incorporating an auxiliary image reconstruction objective to learn fine-grained object boundaries. We show that our method not only outperforms several baseline methods when trained with bounding box annotations, but also scales effectively when trained with segmentation masks, maintaining state-of-the-art performance.
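A schematic sketch of the reasoning step (not the published architecture): node features combining visual embeddings with box geometry and class scores are processed by a few rounds of message passing, then pooled to predict the three CVS criteria. All sizes and names below are assumptions.

```python
import torch
import torch.nn as nn

class TinySceneGraphNet(nn.Module):
    def __init__(self, node_dim=128, num_criteria=3):
        super().__init__()
        self.msg = nn.Linear(node_dim, node_dim)
        self.upd = nn.GRUCell(node_dim, node_dim)
        self.head = nn.Linear(node_dim, num_criteria)

    def forward(self, nodes, adj):
        # nodes: (N, node_dim) per-structure features; adj: (N, N) adjacency from geometric relations.
        for _ in range(2):                              # two rounds of message passing
            messages = adj @ self.msg(nodes)            # aggregate neighbour messages
            nodes = self.upd(messages, nodes)
        return torch.sigmoid(self.head(nodes.mean(0)))  # one probability per CVS criterion

net = TinySceneGraphNet()
nodes = torch.rand(5, 128)                              # 5 detected anatomical structures / tools
adj = (torch.rand(5, 5) > 0.5).float()
print(net(nodes, adj))                                  # e.g. [two structures, cystic plate, dissection]
```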
Subjects
Computer-Assisted Image Processing, Neural Networks (Computer), Semantics
ABSTRACT
The growing availability of surgical digital data and developments in analytics such as artificial intelligence (AI) are being harnessed to improve surgical care. However, technical and cultural barriers to real-time intraoperative AI assistance exist. This early-stage clinical evaluation shows the technical feasibility of concurrently deploying several AIs in operating rooms for real-time assistance during procedures. In addition, potentially relevant clinical applications of these AI models are explored with a multidisciplinary cohort of key stakeholders.
Subjects
Laparoscopic Cholecystectomy, Humans, Artificial Intelligence
ABSTRACT
BACKGROUND: Technical skill assessment in surgery relies on expert opinion. Therefore, it is time-consuming, costly, and often lacks objectivity. Analysis of intraoperative data by artificial intelligence (AI) has the potential for automated technical skill assessment. The aim of this systematic review was to analyze the performance, external validity, and generalizability of AI models for technical skill assessment in minimally invasive surgery. METHODS: A systematic search of Medline, Embase, Web of Science, and IEEE Xplore was performed to identify original articles reporting the use of AI in the assessment of technical skill in minimally invasive surgery. Risk of bias (RoB) and quality of the included studies were analyzed according to the Quality Assessment of Diagnostic Accuracy Studies criteria and the modified Joanna Briggs Institute checklists, respectively. Findings were reported according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses statement. RESULTS: In total, 1958 articles were identified; 50 articles met the eligibility criteria and were analyzed. Motion data extracted from surgical videos (n = 25) or kinematic data from robotic systems or sensors (n = 22) were the most frequent input data for AI. Most studies used deep learning (n = 34) and predicted technical skills using an ordinal assessment scale (n = 36) with good accuracies in simulated settings. However, all proposed models were at the development stage, only 4 studies were externally validated, and 8 showed a low RoB. CONCLUSION: AI showed good performance in technical skill assessment in minimally invasive surgery. However, the models often lacked external validity and generalizability. Therefore, models should be benchmarked using predefined performance metrics and tested in clinical implementation studies.
Subjects
Artificial Intelligence, Minimally Invasive Surgical Procedures, Humans, Academies and Institutes, Benchmarking, Checklist
ABSTRACT
Formalizing surgical activities as triplets of the used instruments, actions performed, and target anatomies is becoming a gold-standard approach for surgical activity modeling. The benefit is that this formalization helps to obtain a more detailed understanding of tool-tissue interaction, which can be used to develop better artificial intelligence assistance for image-guided surgery. Earlier efforts and the CholecTriplet challenge introduced in 2021 have put together techniques aimed at recognizing these triplets from surgical footage. Estimating also the spatial locations of the triplets would offer more precise intraoperative context-aware decision support for computer-assisted intervention. This paper presents the CholecTriplet2022 challenge, which extends surgical action triplet modeling from recognition to detection. It includes weakly supervised bounding box localization of every visible surgical instrument (or tool), as the key actor, and the modeling of each tool-activity in the form of a triplet. The paper describes a baseline method and 10 new deep learning algorithms presented at the challenge to solve the task. It also provides thorough methodological comparisons of the methods and an in-depth analysis of the obtained results across multiple metrics and across visual and procedural challenges, along with their significance and useful insights for future research directions and applications in surgery.
Subjects
Artificial Intelligence, Computer-Assisted Surgery, Humans, Endoscopy, Algorithms, Computer-Assisted Surgery/methods, Surgical Instruments
ABSTRACT
BACKGROUND: Surgery generates a vast amount of data from each procedure. Video data in particular provide significant value for surgical research, clinical outcome assessment, quality control, and education. The data lifecycle is influenced by various factors, including data structure, acquisition, storage, and sharing; data use and exploration; and finally data governance, which encompasses all ethical and legal regulations associated with the data. There is a universal need among stakeholders in surgical data science to establish standardized frameworks that address all aspects of this lifecycle to ensure data quality and purpose. METHODS: Working groups were formed among 48 representatives from academia and industry, including clinicians, computer scientists and industry representatives. These working groups focused on Data Use, Data Structure, Data Exploration, and Data Governance. After working group and panel discussions, a modified Delphi process was conducted. RESULTS: The resulting Delphi consensus provides conceptualized and structured recommendations for each domain related to surgical video data. We identified the key stakeholders within the data lifecycle and formulated comprehensive, easily understandable, and widely applicable guidelines for data utilization. Standardization of data structure should encompass format and quality, data sources, documentation, and metadata, and should account for biases within the data. To foster scientific data exploration, datasets should reflect diversity and remain adaptable to future applications. Data governance must be transparent to all stakeholders, addressing the legal and ethical considerations surrounding the data. CONCLUSION: This consensus presents essential recommendations around the generation of standardized and diverse surgical video databanks, accounting for the multiple stakeholders involved in data generation and use throughout its lifecycle. Following the SAGES annotation framework, we lay the foundation for the standardization of data use, structure, and exploration. A detailed exploration of requirements for adequate data governance will follow.
Subjects
Artificial Intelligence, Quality Improvement, Humans, Consensus, Data Collection
ABSTRACT
Objective. Patient dose estimation in x-ray-guided interventions is essential to prevent radiation-induced biological side effects. Current dose monitoring systems estimate the skin dose based on dose metrics such as the reference air kerma. However, these approximations do not take into account the exact patient morphology and organ composition. Furthermore, accurate organ dose estimation has not been proposed for these procedures. Monte Carlo simulation can accurately estimate the dose by recreating the irradiation process generated during x-ray imaging, but at a high computation time, limiting intra-operative application. This work presents a fast deep convolutional neural network trained with Monte Carlo simulations for patient dose estimation during x-ray-guided interventions. Approach. We introduced a modified 3D U-Net that utilizes a patient's CT scan and the numerical values of the imaging settings as input to produce a Monte Carlo dose map. To create a dataset of dose maps, we simulated the x-ray irradiation process for the abdominal region using a publicly available dataset of 82 patient CT scans. The simulation involved varying the angulation, position, and tube voltage of the x-ray source for each scan. We additionally conducted a clinical study during endovascular abdominal aortic repairs to validate the reliability of our Monte Carlo simulation dose maps. Dose measurements were taken at four specific anatomical points on the skin and compared to the corresponding simulated doses. The proposed network was trained using a 4-fold cross-validation approach with 65 patients, and performance was evaluated on the remaining 17 patients during testing. Main results. The clinical validation demonstrated an average error of 5.1% at the anatomical points. The network yielded test errors of 11.5 ± 4.6% and 6.2 ± 1.5% for peak and average skin doses, respectively. Furthermore, the mean errors for the abdominal region and pancreas doses were 5.0 ± 1.4% and 13.1 ± 2.7%, respectively. Significance. Our network can accurately predict a personalized 3D dose map considering the current imaging settings. A short computation time was achieved, making our approach a potential solution for commercial dose monitoring and reporting systems.
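A simplified sketch of the conditioning idea (assuming PyTorch; not the exact modified 3D U-Net): the scalar imaging settings are broadcast as extra input channels next to the CT volume, and a small encoder-decoder predicts a voxel-wise dose map. Layer sizes and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TinyDoseNet(nn.Module):
    def __init__(self, n_params=4, width=16):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv3d(1 + n_params, width, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(width, width, 3, padding=1), nn.ReLU(),
        )
        self.dec = nn.Sequential(
            nn.ConvTranspose3d(width, width, 2, stride=2), nn.ReLU(),
            nn.Conv3d(width, 1, 3, padding=1),              # voxel-wise dose map
        )

    def forward(self, ct, params):
        # ct: (B, 1, D, H, W) CT volume; params: (B, n_params) normalised imaging settings
        # (e.g. tube voltage, source angulation/position), broadcast to full volumes.
        maps = params[:, :, None, None, None].expand(-1, -1, *ct.shape[2:])
        return self.dec(self.enc(torch.cat([ct, maps], dim=1)))

net = TinyDoseNet()
dose = net(torch.rand(1, 1, 32, 64, 64), torch.rand(1, 4))   # -> (1, 1, 32, 64, 64)
```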
Subjects
Deep Learning, Humans, Radiation Doses, X-Rays, Reproducibility of Results, Imaging Phantoms, Monte Carlo Method
ABSTRACT
The field of surgical computer vision has undergone considerable breakthroughs in recent years with the rising popularity of deep neural network-based methods. However, standard fully supervised approaches for training such models require vast amounts of annotated data, imposing a prohibitively high cost, especially in the clinical domain. Self-Supervised Learning (SSL) methods, which have begun to gain traction in the general computer vision community, represent a potential solution to these annotation costs, allowing useful representations to be learned from unlabeled data alone. Still, the effectiveness of SSL methods in more complex and impactful domains, such as medicine and surgery, remains limited and underexplored. In this work, we address this critical need by investigating four state-of-the-art SSL methods (MoCo v2, SimCLR, DINO, SwAV) in the context of surgical computer vision. We present an extensive analysis of the performance of these methods on the Cholec80 dataset for two fundamental and popular tasks in surgical context understanding, phase recognition and tool presence detection. We examine their parameterization, then their behavior with respect to training data quantities in semi-supervised settings. Correct transfer of these methods to surgery, as described and conducted in this work, leads to substantial performance gains over generic uses of SSL (up to 7.4% on phase recognition and 20% on tool presence detection) and surpasses state-of-the-art semi-supervised phase recognition approaches by up to 14%. Further results obtained on a highly diverse selection of surgical datasets exhibit strong generalization properties. The code is available at https://github.com/CAMMA-public/SelfSupSurg.
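A hedged sketch of the downstream use of such pretraining: a ResNet-50 backbone initialised with SSL weights (e.g. MoCo v2) and a linear head for the seven Cholec80 phases. The checkpoint path is a placeholder; see the linked repository for the released code and weights.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

backbone = resnet50(weights=None)
backbone.fc = nn.Identity()                      # keep the 2048-d features

# Placeholder: load an SSL-pretrained checkpoint (path and key layout are assumptions).
# state = torch.load("ssl_pretrained_resnet50.pth", map_location="cpu")
# backbone.load_state_dict(state, strict=False)

phase_head = nn.Linear(2048, 7)                  # 7 surgical phases in Cholec80

frames = torch.rand(4, 3, 224, 224)              # a mini-batch of video frames
logits = phase_head(backbone(frames))            # (4, 7), trained with cross-entropy
```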
Subjects
Computers, Neural Networks (Computer), Humans, Supervised Machine Learning
ABSTRACT
Surgical video analysis facilitates education and research. However, video recordings of endoscopic surgeries can contain privacy-sensitive information, especially if the endoscopic camera is moved out of the patient's body and out-of-body scenes are recorded. Therefore, the identification of out-of-body scenes in endoscopic videos is of major importance to preserve the privacy of patients and operating room staff. This study developed and validated a deep learning model for the identification of out-of-body images in endoscopic videos. The model was trained and evaluated on an internal dataset of 12 different types of laparoscopic and robotic surgeries and was externally validated on two independent multicentric test datasets of laparoscopic gastric bypass and cholecystectomy surgeries. Model performance was evaluated against human ground-truth annotations using the area under the receiver operating characteristic curve (ROC AUC). The internal dataset, consisting of 356,267 images from 48 videos, and the two multicentric test datasets, consisting of 54,385 and 58,349 images from 10 and 20 videos, respectively, were annotated. The model identified out-of-body images with a 99.97% ROC AUC on the internal test dataset. The mean ± standard deviation ROC AUC was 99.94 ± 0.07% on the multicentric gastric bypass dataset and 99.71 ± 0.40% on the multicentric cholecystectomy dataset. The model can reliably identify out-of-body images in endoscopic videos and is publicly shared. This facilitates privacy preservation in surgical video analysis.
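A toy illustration of the evaluation described above (not the released model): frame-level out-of-body probabilities are compared with binary ground-truth annotations using the ROC AUC. The scores below are synthetic stand-ins for model outputs.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)                               # 1 = out-of-body frame
scores = np.clip(labels * 0.8 + rng.normal(0.1, 0.2, 1000), 0, 1)    # fake model probabilities

print(f"ROC AUC: {roc_auc_score(labels, scores):.4f}")
```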
Subjects
Deep Learning, Laparoscopy, Humans, Privacy, Video Recording, Cholecystectomy
ABSTRACT
Searching through large volumes of medical data to retrieve relevant information is a challenging yet crucial task for clinical care. However, the most common and most primitive approach to retrieval, based on text keywords, is severely limited when dealing with complex media formats. Content-based retrieval offers a way to overcome this limitation by using rich media as the query itself. Surgical video-to-video retrieval in particular is a new and largely unexplored research problem with high clinical value, especially in the real-time case: with real-time video hashing, search can be performed directly inside the operating room. Indeed, hashing converts large data entries into compact binary arrays, or hashes, enabling large-scale search operations at a very fast rate. However, due to fluctuations over the course of a video, not all bits in a given hash are equally reliable. In this work, we propose a method capable of mitigating this uncertainty while maintaining a light computational footprint. We present superior retrieval results (by 3-4% in top-10 mean average precision) on a multi-task evaluation protocol for surgery, using cholecystectomy phases, bypass phases, and, from an entirely new dataset introduced here, surgical events across six different surgery types. Success on this multi-task benchmark shows the generalizability of our approach for surgical video retrieval.
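A minimal sketch of plain hashing-based retrieval as a baseline for the setting described above (it does not include the paper's uncertainty-mitigation step): clip embeddings are binarised by sign and the database is ranked by Hamming distance to the query hash.

```python
import numpy as np

def to_hash(embeddings):
    return (embeddings > 0).astype(np.uint8)          # (N, bits) binary codes

def hamming_search(query_hash, db_hashes, topk=10):
    dists = (db_hashes != query_hash).sum(axis=1)     # Hamming distance to every entry
    return np.argsort(dists)[:topk], np.sort(dists)[:topk]

rng = np.random.default_rng(0)
db = to_hash(rng.normal(size=(5000, 64)))             # 5000 database clips, 64-bit hashes
query = to_hash(rng.normal(size=(1, 64)))             # one query clip
idx, dist = hamming_search(query, db)
print(idx, dist)
```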
Subjects
Algorithms, Laparoscopy, Humans, Cholecystectomy, Uncertainty
ABSTRACT
PURPOSE: One of the recent advances in surgical AI is the recognition of surgical activities as triplets of ⟨instrument, verb, target⟩. Albeit providing detailed information for computer-assisted intervention, current triplet recognition approaches rely only on single-frame features. Exploiting the temporal cues from earlier frames would improve the recognition of surgical action triplets from videos. METHODS: In this paper, we propose Rendezvous in Time (RiT), a deep learning model that extends the state-of-the-art model, Rendezvous, with temporal modeling. Focusing more on the verbs, our RiT explores the connectedness of current and past frames to learn temporal attention-based features for enhanced triplet recognition. RESULTS: We validate our proposal on the challenging surgical triplet dataset, CholecT45, demonstrating improved recognition of the verb and triplet, along with other interactions involving the verb such as ⟨instrument, verb⟩. Qualitative results show that RiT produces smoother predictions than the state-of-the-art methods for most triplet instances. CONCLUSION: We present a novel attention-based approach that leverages the temporal fusion of video frames to model the evolution of surgical actions and exploit their benefits for surgical triplet recognition.
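An illustrative sketch of the temporal modeling idea (not the exact RiT architecture): the current frame feature attends over a short memory of past frame features, and the fused feature drives triplet classification. Dimensions, the class count and names are assumptions.

```python
import torch
import torch.nn as nn

class TemporalAttentionHead(nn.Module):
    def __init__(self, dim=512, num_heads=8, num_classes=100):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cls = nn.Linear(dim, num_classes)          # e.g. 100 triplet classes in CholecT45

    def forward(self, current, memory):
        # current: (B, 1, dim) feature of the present frame;
        # memory:  (B, T, dim) features of the T preceding frames.
        fused, _ = self.attn(query=current, key=memory, value=memory)
        return self.cls((current + fused).squeeze(1))   # (B, num_classes) triplet logits

head = TemporalAttentionHead()
logits = head(torch.rand(2, 1, 512), torch.rand(2, 8, 512))   # (2, 100)
```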
ABSTRACT
Automatic recognition of fine-grained surgical activities, called steps, is a challenging but crucial task for intelligent intra-operative computer assistance. The development of current vision-based activity recognition methods relies heavily on a high volume of manually annotated data. This data is difficult and time-consuming to generate and requires domain-specific knowledge. In this work, we propose to use coarser and easier-to-annotate activity labels, namely phases, as weak supervision to learn step recognition with fewer step annotated videos. We introduce a step-phase dependency loss to exploit the weak supervision signal. We then employ a Single-Stage Temporal Convolutional Network (SS-TCN) with a ResNet-50 backbone, trained in an end-to-end fashion from weakly annotated videos, for temporal activity segmentation and recognition. We extensively evaluate and show the effectiveness of the proposed method on a large video dataset consisting of 40 laparoscopic gastric bypass procedures and the public benchmark CATARACTS containing 50 cataract surgeries.
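A hedged sketch of one way a step-to-phase dependency could be used as weak supervision (the published step-phase dependency loss may differ): step probabilities are marginalised into phase probabilities via a known step-to-phase mapping and penalised against the phase label. The mapping and sizes below are arbitrary examples.

```python
import torch
import torch.nn.functional as F

def phase_loss_from_steps(step_logits, phase_labels, step_to_phase):
    # step_logits: (B, n_steps); phase_labels: (B,); step_to_phase: (n_steps,) index map.
    step_probs = F.softmax(step_logits, dim=1)
    n_phases = int(step_to_phase.max().item()) + 1
    phase_probs = torch.zeros(step_logits.shape[0], n_phases)
    phase_probs.index_add_(1, step_to_phase, step_probs)        # marginalise steps into phases
    return F.nll_loss(torch.log(phase_probs + 1e-8), phase_labels)

step_to_phase = torch.tensor([0, 0, 1, 1, 1, 2])                # 6 steps grouped into 3 phases
loss = phase_loss_from_steps(torch.randn(4, 6), torch.randint(0, 3, (4,)), step_to_phase)
```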
Subjects
Neural Networks (Computer), Computer-Assisted Surgery
ABSTRACT
PURPOSE: Surgical workflow and skill analysis are key technologies for the next generation of cognitive surgical assistance systems. These systems could increase the safety of the operation through context-sensitive warnings and semi-autonomous robotic assistance, or improve the training of surgeons via data-driven feedback. In surgical workflow analysis, up to 91% average precision has been reported for phase recognition on an open-data, single-center video dataset. In this work we investigated the generalizability of phase recognition algorithms in a multicenter setting, including more difficult recognition tasks such as surgical action and surgical skill. METHODS: To achieve this goal, a dataset with 33 laparoscopic cholecystectomy videos from three surgical centers with a total operation time of 22 h was created. Labels included framewise annotation of seven surgical phases with 250 phase transitions, 5514 occurrences of four surgical actions, 6980 occurrences of 21 surgical instruments from seven instrument categories, and 495 skill classifications in five skill dimensions. The dataset was used in the 2019 international Endoscopic Vision challenge, sub-challenge for surgical workflow and skill analysis. Here, 12 research teams trained and submitted their machine learning algorithms for recognition of phase, action, instrument and/or skill assessment. RESULTS: F1-scores for phase recognition ranged between 23.9% and 67.7% (n = 9 teams), and for instrument presence detection between 38.5% and 63.8% (n = 8 teams), but for action recognition only between 21.8% and 23.3% (n = 5 teams). The average absolute error for skill assessment was 0.78 (n = 1 team). CONCLUSION: Surgical workflow and skill analysis are promising technologies to support the surgical team, but there is still room for improvement, as shown by our comparison of machine learning algorithms. This novel HeiChole benchmark can be used for comparable evaluation and validation of future work. In future studies, it is of utmost importance to create more open, high-quality datasets in order to allow the development of artificial intelligence and cognitive robotics in surgery.
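For reference, a small worked example of the frame-wise macro F1-score used to compare phase recognition methods; the predictions and labels below are synthetic placeholders, not challenge data.

```python
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
gt = rng.integers(0, 7, size=5000)                                     # 7 surgical phases, per frame
pred = np.where(rng.random(5000) < 0.7, gt, rng.integers(0, 7, size=5000))  # noisy predictions

print(f"macro F1: {f1_score(gt, pred, average='macro'):.3f}")
```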