ABSTRACT
Weakly supervised temporal action localization (WTAL) aims to classify actions and localize their temporal boundaries in videos, given only video-level category labels in the training data. Due to the lack of boundary information during training, existing approaches formulate WTAL as a classification problem, i.e., generating a temporal class activation map (T-CAM) for localization. However, with only a classification loss, the model is sub-optimized: the action-related scenes alone are enough to distinguish different class labels. Regarding other actions in an action-related scene (i.e., a scene shared with the positive actions) as co-scene actions, such a sub-optimized model misclassifies the co-scene actions as positive actions. To address this misclassification, we propose a simple yet efficient method, named bidirectional semantic consistency constraint (Bi-SCC), to discriminate positive actions from co-scene actions. The proposed Bi-SCC first adopts a temporal context augmentation to generate an augmented video that breaks the correlation between positive actions and their co-scene actions across videos. Then, a semantic consistency constraint (SCC) is used to enforce consistency between the predictions of the original video and the augmented video, hence suppressing co-scene actions. However, we find that this augmented video destroys the original temporal context, so simply applying the consistency constraint affects the completeness of the localized positive actions. Hence, we boost the SCC in a bidirectional way, cross-supervising the original and augmented videos, to suppress co-scene actions while ensuring the integrity of positive actions. Finally, the proposed Bi-SCC can be applied to current WTAL approaches and improve their performance. Experimental results show that our approach outperforms state-of-the-art methods on THUMOS14 and ActivityNet. The code is available at https://github.com/lgzlIlIlI/BiSCC.
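To make the cross-supervision concrete, below is a minimal PyTorch sketch of a bidirectional consistency term between the T-CAMs of the two views; the alignment via an index set and the use of an MSE penalty are illustrative assumptions, not the paper's exact augmentation or loss.

import torch
import torch.nn.functional as F

def bi_scc_loss(cam_orig, cam_aug, keep_idx):
    """Hedged sketch of a bidirectional semantic consistency constraint.

    cam_orig: T-CAM of the original video, shape (T, C).
    cam_aug:  T-CAM of the augmented video, shape (T', C); the augmentation
              is assumed to keep only the snippets in `keep_idx`, so the two
              CAMs can be aligned by indexing (an assumption, not the paper's
              exact temporal context augmentation).
    """
    aligned = cam_orig[keep_idx]                      # (T', C) snippets shared by both views
    # Original -> augmented: suppress co-scene responses in the original view.
    loss_o2a = F.mse_loss(aligned, cam_aug.detach())
    # Augmented -> original: keep positive actions complete in the augmented view.
    loss_a2o = F.mse_loss(cam_aug, aligned.detach())
    return loss_o2a + loss_a2o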
ABSTRACT
Left-ventricular ejection fraction (LVEF) is an important indicator of heart failure. Existing video-based methods for LVEF estimation require large amounts of annotated data to achieve high performance, e.g., using 10,030 labeled echocardiogram videos to achieve a mean absolute error (MAE) of 4.10. However, labeling these videos is time-consuming and limits potential downstream applications to other heart diseases. This paper presents the first semi-supervised approach for LVEF prediction. Unlike general video prediction tasks, LVEF prediction is specifically related to changes of the left ventricle (LV) in echocardiogram videos. By incorporating knowledge learned from predicting LV segmentations into LVEF regression, we can provide additional context to the model for better predictions. To this end, we propose a novel Cyclical Self-Supervision (CSS) method for learning video-based LV segmentation, motivated by the observation that the heartbeat is a cyclical process with temporal repetition. Prediction masks from our segmentation model can then be used as an additional input to the LVEF regression model to provide spatial context for the LV region. We also introduce teacher-student distillation to distill the information from LV segmentation masks into an end-to-end LVEF regression model that requires only video inputs. Results show that our method outperforms alternative semi-supervised methods and achieves an MAE of 4.17, competitive with state-of-the-art supervised performance, while using half the number of labels. Validation on an external dataset also shows improved generalization from using our method.
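A rough sketch of the distillation step described above follows, assuming the teacher sees videos plus LV masks and the student sees videos only; the L1 terms and the mixing weight alpha are illustrative choices rather than the paper's exact objective.

import torch
import torch.nn.functional as F

def distillation_loss(student_ef, teacher_ef, true_ef=None, alpha=0.5):
    """Hedged sketch of teacher-student distillation for LVEF regression.

    student_ef: EF predicted by the video-only student, shape (B,).
    teacher_ef: EF predicted by the teacher that also sees LV masks, shape (B,).
    true_ef:    ground-truth EF for the labeled subset, or None for unlabeled
                videos (the semi-supervised setting).
    """
    # Match the teacher on every video, labeled or not.
    loss = F.l1_loss(student_ef, teacher_ef.detach())
    if true_ef is not None:
        # On labeled videos, also regress toward the ground truth.
        loss = alpha * loss + (1 - alpha) * F.l1_loss(student_ef, true_ef)
    return loss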
Subjects
Heart Diseases; Ventricular Function, Left; Humans; Stroke Volume; Echocardiography/methods; Heart Ventricles/diagnostic imaging
ABSTRACT
Surgical phase recognition is a fundamental task in computer-assisted surgery systems. Most existing works rely on expensive and time-consuming full annotations, which require surgeons to repeatedly watch videos to find the precise start and end time of each surgical phase. In this paper, we introduce timestamp supervision for surgical phase recognition, training the models with timestamp annotations, where surgeons are asked to identify only a single timestamp within the temporal boundary of a phase. This annotation can significantly reduce the manual annotation cost compared to full annotation. To make full use of such timestamp supervision, we propose a novel method called uncertainty-aware temporal diffusion (UATD) to generate trustworthy pseudo labels for training. The proposed UATD is motivated by a property of surgical videos, i.e., phases are long events consisting of consecutive frames. Specifically, UATD iteratively diffuses the single labelled timestamp to its high-confidence (i.e., low-uncertainty) neighbouring frames. Our study uncovers unique insights into surgical phase recognition with timestamp supervision: 1) timestamp annotation reduces annotation time by 74% compared with full annotation, and surgeons tend to annotate timestamps near the middle of phases; 2) extensive experiments demonstrate that our method achieves results competitive with fully supervised methods while reducing manual annotation costs; 3) less is more in surgical phase recognition, i.e., fewer but discriminative pseudo labels outperform full labels containing ambiguous frames; 4) the proposed UATD can be used as a plug-and-play method to clean ambiguous labels near the boundaries between phases and improve the performance of current surgical phase recognition methods. Code and annotations obtained from surgeons are available at https://github.com/xmed-lab/TimeStamp-Surgical.
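The diffusion step can be pictured with the following sketch, which expands pseudo labels outward from the annotated timestamp until confidence drops; how confidence/uncertainty is estimated, and the threshold tau, are assumptions of this illustration rather than the exact UATD procedure.

def uatd_pseudo_labels(confidence, t_star, label, tau=0.8):
    """Hedged sketch of uncertainty-aware temporal diffusion (UATD).

    confidence: per-frame confidence for the annotated phase `label`,
                indexable sequence of length T; how confidence is obtained
                (and updated across iterations) is assumed here.
    t_star:     index of the single annotated timestamp.
    Returns a dict {frame_index: label} of trustworthy pseudo labels.
    """
    T = len(confidence)
    pseudo = {t_star: label}
    # Diffuse left and right from the timestamp, stopping at the first
    # low-confidence (high-uncertainty) frame in each direction.
    for step in (-1, 1):
        t = t_star + step
        while 0 <= t < T and confidence[t] >= tau:
            pseudo[t] = label
            t += step
    return pseudo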
Subjects
Surgery, Computer-Assisted
ABSTRACT
The understanding of language plays a key role in video grounding, where a target moment is localized according to a text query. From a biological point of view, language is naturally hierarchical, with the main clause (predicate phrase) providing coarse semantics and modifiers providing detailed descriptions. In video grounding, moments described by the main clause may exist in multiple clips of a long video, including both the ground-truth and background clips. Therefore, in order to discriminate the ground-truth clip from the background ones, the model tends to neglect the main clause and concentrate on the modifiers, which provide discriminative information for distinguishing the target proposal from the others. We first demonstrate this phenomenon empirically, and then propose a Hierarchical Language Network (HLN) that exploits the language hierarchy, together with a new learning approach called Multi-Instance Positive-Unlabelled Learning (MI-PUL), to alleviate the above problem. Specifically, in HLN, localization is performed on various layers of the language hierarchy, so that attention can be paid to different parts of the sentence, rather than only the discriminative ones. Furthermore, MI-PUL allows the model to localize background clips that can possibly be described by the main clause, even without manual annotations. The combination of the two proposed components therefore enhances the learning of the main clause, which is of critical importance in video grounding. Finally, we show that the proposed HLN can be plugged into current methods and improve their performance. Extensive experiments on challenging datasets show that HLN significantly improves state-of-the-art methods, achieving a 6.15% gain in terms of [Formula: see text] on the TACoS dataset.
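As a loose illustration of grounding on several hierarchy levels at once, the sketch below scores clip proposals against one embedding per level (e.g., main clause vs. full sentence); how the hierarchy is parsed and embedded, and how the per-level scores are combined into a loss, are left as assumptions.

import torch
import torch.nn.functional as F

def hierarchical_grounding_scores(video_feats, level_embs):
    """Hedged sketch of grounding on multiple levels of a language hierarchy.

    video_feats: clip proposal features, shape (N, D).
    level_embs:  one embedding per hierarchy level, e.g. [main_clause_emb,
                 full_sentence_emb], each of shape (D,).
    Returns per-level clip scores, shape (L, N).
    """
    scores = []
    for emb in level_embs:
        # Cosine similarity between every clip and this hierarchy level, so
        # coarse (main clause) and fine (modifier) semantics each produce
        # their own localization map instead of one entangled score.
        scores.append(F.cosine_similarity(video_feats, emb.unsqueeze(0), dim=1))
    return torch.stack(scores)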
Subjects
Language; Semantics; Attention
ABSTRACT
Automatic surgical phase recognition plays a vital role in robot-assisted surgeries. Existing methods ignore a pivotal problem: surgical phases should be classified by learning segment-level semantics rather than relying solely on frame-wise information. This paper presents a segment-attentive hierarchical consistency network (SAHC) for surgical phase recognition from videos. The key idea is to extract hierarchical, semantically consistent high-level segments and use them to refine erroneous predictions caused by ambiguous frames. To achieve this, we design a temporal hierarchical network to generate high-level segments. We then introduce a hierarchical segment-frame attention module to capture relations between low-level frames and high-level segments. By regularizing the predictions of frames and their corresponding segments via a consistency loss, the network can generate semantically consistent segments and rectify misclassified predictions caused by ambiguous low-level frames. We validate SAHC on two public surgical video datasets, i.e., the M2CAI16 challenge dataset and the Cholec80 dataset. Experimental results show that our method outperforms previous state-of-the-art methods, and ablation studies prove the effectiveness of the proposed modules. Our code has been released at: https://github.com/xmed-lab/SAHC.
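One way to picture the consistency loss is the sketch below, which pools frame logits into fixed-length segments and penalizes frame-segment disagreement with a KL term; SAHC builds its segments hierarchically with attention, so the fixed window and the KL penalty here are simplifying assumptions.

import torch
import torch.nn.functional as F

def segment_frame_consistency(frame_logits, seg_len=16):
    """Hedged sketch of a frame-segment consistency regularizer.

    frame_logits: per-frame phase logits, shape (T, C).
    seg_len:      frames pooled into one high-level segment (a fixed window
                  here, unlike SAHC's learned hierarchical segments).
    Penalizes frames whose predictions disagree with their segment's pooled
    prediction, pulling ambiguous frames toward segment-level semantics.
    """
    T, C = frame_logits.shape
    T_trim = (T // seg_len) * seg_len
    frames = frame_logits[:T_trim].view(-1, seg_len, C)   # (S, seg_len, C)
    seg_logits = frames.mean(dim=1, keepdim=True)         # (S, 1, C) pooled segments
    seg_prob = F.softmax(seg_logits, dim=-1)
    frame_logp = F.log_softmax(frames, dim=-1)
    return F.kl_div(frame_logp, seg_prob.expand_as(frames), reduction='batchmean')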
Subjects
Robotic Surgical Procedures; Semantics
ABSTRACT
Weakly Supervised Temporal Action Localization (WTAL) aims to localize action segments in untrimmed videos with only video-level category labels available in the training phase. In WTAL, an action generally consists of a series of sub-actions, and different categories of actions may share common sub-actions. However, to distinguish different categories of actions with only video-level class labels, current WTAL models tend to focus on the discriminative sub-actions of an action while ignoring the common sub-actions shared across categories. This negligence of common sub-actions leads to incomplete localized action segments, i.e., segments containing only the discriminative sub-actions. Rather than designing complex network architectures to explore more complete actions, as current approaches do, in this paper we introduce a novel supervision method named multi-hierarchical category supervision (MHCS) to find more sub-actions than only the discriminative ones. Specifically, action categories sharing similar sub-actions are grouped into super-classes through hierarchical clustering. Training with the newly generated super-classes then encourages the model to pay more attention to the common sub-actions, which are ignored when training with the original classes. Furthermore, the proposed MHCS is model-agnostic and non-intrusive: it can be directly applied to existing methods without changing their structures. Through extensive experiments, we verify that our supervision method improves the performance of four state-of-the-art WTAL methods on three public datasets: THUMOS14, ActivityNet1.2, and ActivityNet1.3.
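A minimal sketch of how super-classes might be formed is shown below, assuming one prototype feature per action category; the clustering settings (average linkage, cosine distance) and the prototype construction are assumptions of this illustration, not necessarily the paper's configuration.

from scipy.cluster.hierarchy import fcluster, linkage

def build_super_classes(class_prototypes, n_super):
    """Hedged sketch of forming super-classes for MHCS-style supervision.

    class_prototypes: one feature vector per action category, shape (C, D);
                      e.g., averaged snippet features per class (assumed).
    n_super:          number of super-classes at this hierarchy level.
    Returns an array mapping each original class to a super-class id, which
    can be used to relabel videos and add a super-class loss term.
    """
    # Agglomerative clustering over category prototypes, so categories with
    # similar sub-actions end up in the same super-class.
    Z = linkage(class_prototypes, method='average', metric='cosine')
    return fcluster(Z, t=n_super, criterion='maxclust') - 1  # 0-based ids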
ABSTRACT
In temporal action localization (TAL), semi-supervised learning is a promising technique to mitigate the cost of precise boundary annotations. Semi-supervised approaches employing consistency regularization (CR), which encourages models to be robust to perturbed inputs, have achieved great success in image classification. The success of CR largely depends on the perturbations: instances are perturbed to train a robust model without altering their semantic information. However, the perturbations used for image or video classification do not fit TAL, since videos in TAL are too long to train the model on raw videos in an end-to-end manner. In this paper, we devise a method named K-farthest crossover to construct perturbations based on video features and apply it to TAL. Motivated by the observation that features in the same action instance become more and more similar during training, while those in different action instances or backgrounds become more and more divergent, we add perturbations to each feature along the temporal axis and adopt CR to encourage the model to retain this property. Specifically, for each feature, we first find the top-k most dissimilar features and average them to form a perturbation. Then, analogous to chromosomal crossover, we recombine a large part of the feature with a small part of the perturbation to obtain a perturbed feature, which preserves the feature's semantics while introducing enough discrepancy.
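A small sketch of this procedure follows, under the assumption that the crossover splices a random subset of feature channels; the exact recombination scheme and the default k and mixing ratio are illustrative, not the paper's settings.

import torch
import torch.nn.functional as F

def k_farthest_crossover(feats, k=5, mix=0.1):
    """Hedged sketch of K-farthest crossover on snippet features.

    feats: temporal feature sequence, shape (T, D).
    k:     number of most-dissimilar features averaged into a perturbation.
    mix:   fraction of channels taken from the perturbation (assumed).
    """
    f = F.normalize(feats, dim=1)
    sim = f @ f.t()                                  # (T, T) cosine similarities
    _, far_idx = sim.topk(k, dim=1, largest=False)   # k most dissimilar per feature
    perturb = feats[far_idx].mean(dim=1)             # (T, D) averaged perturbations
    # Chromosomal-crossover-style recombination: keep most of the original
    # feature's channels, splice in a small subset from the perturbation.
    mask = torch.rand(feats.shape[1]) < mix
    out = feats.clone()
    out[:, mask] = perturb[:, mask]
    return out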
ABSTRACT
Facial expression recognition is of significant importance in criminal investigation and digital entertainment. Under unconstrained conditions, existing expression datasets are highly class-imbalanced, and the similarity between expressions is high. Previous methods tend to improve the performance of facial expression recognition through deeper or wider network structures, resulting in increased storage and computing costs. In this paper, we propose a new adaptive supervised objective named AdaReg loss, which re-weights category importance coefficients to address this class imbalance and to increase the discriminative power of expression representations. Inspired by the human cognitive mode, an innovative coarse-fine (C-F) labels strategy is designed to guide the model from easy to difficult in classifying highly similar representations. On this basis, we propose a novel training framework named the emotional education mechanism (EEM) to transfer knowledge, composed of a knowledgeable teacher network (KTN) and a self-taught student network (STSN). Specifically, KTN integrates the outputs of the coarse and fine streams, learning expression representations from easy to difficult. Under the supervision of the pre-trained KTN and existing learning experience, STSN can maximize its potential performance and compress the original KTN. Extensive experiments on public benchmarks demonstrate that the proposed method achieves superior performance compared to current state-of-the-art frameworks, with 88.07% on RAF-DB, 63.97% on AffectNet, and 90.49% on FERPlus.
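The re-weighting idea can be sketched as below, with per-class coefficients derived from inverse class frequency; the actual AdaReg loss adapts its coefficients during training, so this static weighting is only an approximation of the spirit of the method.

import torch
import torch.nn.functional as F

def adareg_style_loss(logits, targets, class_counts, gamma=0.5):
    """Hedged sketch of a re-weighted objective in the spirit of AdaReg.

    logits:       (B, C) predictions; targets: (B,) class indices.
    class_counts: (C,) tensor of training samples per class, used to derive
                  importance coefficients; AdaReg's adaptive coefficient
                  update is not reproduced here.
    Rare classes receive larger weights to counter class imbalance.
    """
    weights = (1.0 / class_counts.float()) ** gamma
    weights = weights / weights.sum() * len(class_counts)  # normalize around 1
    return F.cross_entropy(logits, targets, weight=weights)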