1 - 20 of 57
1.
Article En | MEDLINE | ID: mdl-38814778

Semi-supervised learning (SSL) suffers from severe performance degradation when labeled and unlabeled data come from inconsistent and imbalanced distributions. Nonetheless, there is a lack of theoretical guidance regarding a remedy for this issue. To bridge the gap between theoretical insights and practical solutions, we embark on an analysis of the generalization bound of classic SSL algorithms. This analysis reveals that distribution inconsistency between unlabeled and labeled data can cause a significant generalization error bound. Motivated by this theoretical insight, we present a Triplet Adaptation Framework (TAF) to reduce the distribution divergence and improve the generalization of SSL models. TAF comprises three adapters: the Balanced Residual Adapter, which maps the class distributions of labeled and unlabeled data to a uniform distribution to reduce class distribution divergence; the Representation Adapter, which maps the representation distribution of unlabeled data to that of labeled data to reduce representation distribution divergence; and the Pseudo-Label Adapter, which aligns the predicted pseudo-labels with the class distribution of unlabeled data, thereby preventing erroneous pseudo-labels from exacerbating representation divergence. These three adapters collaborate synergistically to reduce the generalization bound, ultimately yielding a more robust and generalizable SSL model. Extensive experiments across various robust SSL scenarios validate the efficacy of our method.
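As an illustration of the Pseudo-Label Adapter's goal, the sketch below shows one generic way to align predicted pseudo-labels with an estimated class distribution; the rescaling scheme and tensor shapes are assumptions, not the paper's exact formulation.

```python
# Sketch of pseudo-label distribution alignment (assumed generic scheme):
# rescale each prediction by (target class prior / running mean of predictions)
# and renormalize, so pseudo-labels follow the estimated unlabeled class
# distribution rather than the model's biased output distribution.
import torch

def align_pseudo_labels(probs: torch.Tensor,
                        running_mean: torch.Tensor,
                        target_prior: torch.Tensor,
                        eps: float = 1e-6) -> torch.Tensor:
    """probs: (N, C) softmax outputs on unlabeled data.
    running_mean: (C,) moving average of model predictions.
    target_prior: (C,) estimated class distribution of unlabeled data."""
    aligned = probs * (target_prior + eps) / (running_mean + eps)
    return aligned / aligned.sum(dim=1, keepdim=True)

# Hard pseudo-labels are then taken as aligned.argmax(dim=1).
```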

2.
BMC Pediatr ; 24(1): 361, 2024 May 24.
Article En | MEDLINE | ID: mdl-38783283

BACKGROUND: Noonan syndrome (NS) is a rare genetic disease, and patients who suffer from it exhibit a facial morphology characterized by a high forehead, hypertelorism, ptosis, inner epicanthal folds, down-slanting palpebral fissures, a highly arched palate, a round nasal tip, and posteriorly rotated ears. Facial analysis technology has recently been applied to identify many genetic syndromes (GSs). However, few studies have investigated the identification of NS based on the facial features of the subjects. OBJECTIVES: This study develops advanced models to enhance the accuracy of NS diagnosis. METHODS: A total of 1,892 people were enrolled in this study, including 233 patients with NS, 863 patients with other GSs, and 796 healthy children. We took one to ten frontal photos of each subject to build a dataset, and then applied the multi-task convolutional neural network (MTCNN) for data pre-processing to generate standardized outputs with five crucial facial landmarks. The ImageNet dataset was used to pre-train the networks so that they could capture generalizable features and minimize data wastage. We subsequently constructed seven models for facial identification based on the VGG16, VGG19, VGG16-BN, VGG19-BN, ResNet50, MobileNet-V2, and squeeze-and-excitation network (SENet) architectures. The identification performance of the seven models was evaluated and compared with that of six physicians. RESULTS: All models exhibited high accuracy, precision, and specificity in recognizing NS patients. The VGG19-BN model delivered the best overall performance, with an accuracy of 93.76%, precision of 91.40%, specificity of 98.73%, and F1 score of 78.34%. The VGG16-BN model achieved the highest AUC value of 0.9787, and the models based on VGG architectures were superior to the others on the whole. The highest scores of the six physicians in terms of accuracy, precision, specificity, and F1 score were 74.00%, 75.00%, 88.33%, and 61.76%, respectively. The performance of each facial recognition model was superior to that of the best physician on all metrics. CONCLUSION: Computer-assisted facial recognition models can improve the rate of diagnosis of NS. The models based on VGG19-BN and VGG16-BN can play an important role in diagnosing NS in clinical practice.
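For readers unfamiliar with the transfer-learning setup described in METHODS, the sketch below fine-tunes an ImageNet-pretrained VGG16-BN for a two-class NS vs. non-NS decision; the two-class head, input size, and optimizer settings are assumptions, and the paper's MTCNN alignment step is omitted.

```python
# Minimal sketch: ImageNet-pretrained VGG16-BN fine-tuned for a binary
# NS vs. non-NS decision. Head size and hyperparameters are assumptions.
import torch
import torch.nn as nn
from torchvision import models

model = models.vgg16_bn(weights=models.VGG16_BN_Weights.IMAGENET1K_V1)
model.classifier[6] = nn.Linear(model.classifier[6].in_features, 2)  # replace 1000-way head

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

dummy_faces = torch.randn(4, 3, 224, 224)    # aligned face crops
dummy_labels = torch.tensor([0, 1, 0, 1])    # 1 = NS, 0 = non-NS
loss = criterion(model(dummy_faces), dummy_labels)
loss.backward()
optimizer.step()
```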


Noonan Syndrome , Humans , Noonan Syndrome/diagnosis , Child , Female , Male , Child, Preschool , Neural Networks, Computer , Infant , Adolescent , Automated Facial Recognition/methods , Diagnosis, Computer-Assisted/methods , Sensitivity and Specificity , Case-Control Studies
3.
Cortex ; 174: 241-255, 2024 05.
Article En | MEDLINE | ID: mdl-38582629

Shape is a property that can be perceived by both vision and touch and is classically considered supramodal. While there is mounting evidence for a shared cognitive and neural representation space between visual and tactile shape, previous research has tended to rely on dissimilarity structures between objects and has not examined the detailed properties of shape representation in the absence of vision. To address this gap, we conducted three explicit object shape knowledge production experiments with congenitally blind and sighted participants, who were asked to produce verbal features, 3D clay models, and 2D drawings of familiar objects with varying levels of tactile exposure, including tools, large nonmanipulable objects, and animals. We found that the absence of visual experience (i.e., in the blind group) led to stronger differences for animals than for tools and large objects, suggesting that direct tactile experience of objects is essential for shape representation when vision is unavailable. For tools with rich tactile/manipulation experience, the blind group produced overall good shapes comparable to those of the sighted, yet also showed intriguing differences. The blind group had more variation and a systematic bias in the geometric properties of tools (making them stubbier than the sighted did), indicating that visual experience contributes to aligning internal representations and calibrating overall object configurations, at least for tools. Taken together, object shape representation reflects the intricate orchestration of vision, touch, and language.


Blindness , Touch Perception , Humans , Blindness/psychology , Vision, Ocular , Touch
4.
IEEE Trans Image Process ; 33: 1588-1599, 2024.
Article En | MEDLINE | ID: mdl-38358875

Owing to the development of deep networks and abundant data, automatic face recognition (FR) has quickly reached human-level capacity in the past few years. However, the FR problem is not perfectly solved in the case of large poses and uncontrolled occlusions. In this paper, we propose a novel bypass enhanced representation learning (BERL) method to improve face recognition under unconstrained scenarios. The proposed method integrates self-supervised learning and supervised learning by attaching two auxiliary bypasses, a 3D reconstruction bypass and a blind inpainting bypass, to assist robust feature learning for face recognition. Among them, the 3D reconstruction bypass enforces the face recognition network to encode pose-independent 3D facial information, which enhances the robustness to various poses. The blind inpainting bypass enforces the face recognition network to capture more facial context information for face inpainting, which enhances the robustness to occlusions. The whole framework is trained in an end-to-end manner with the two self-supervised tasks above and the classic supervised face identification task. During inference, the two auxiliary bypasses can be detached from the face recognition network, avoiding any additional computational overhead. Extensive experimental results on various face recognition benchmarks show that, without any cost of extra annotations and computations, our method outperforms state-of-the-art methods. Moreover, the learnt representations also generalize well to other face-related downstream tasks such as facial attribute recognition with limited labeled data.
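A minimal sketch of the detachable-bypass idea follows: auxiliary heads add losses during training and are simply skipped at inference, so they incur no test-time cost. The module shapes, loss weights, and the single stand-in reconstruction head are illustrative assumptions rather than the BERL architecture.

```python
# Sketch: auxiliary heads contribute losses in training only; at inference
# the backbone features are returned directly. All dimensions are assumptions.
import torch
import torch.nn as nn

class FaceNetWithBypasses(nn.Module):
    def __init__(self, feat_dim=256, num_ids=1000):
        super().__init__()
        self.backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 112 * 112, feat_dim))
        self.id_head = nn.Linear(feat_dim, num_ids)            # supervised identification
        self.recon_head = nn.Linear(feat_dim, 3 * 112 * 112)   # stand-in for the auxiliary bypasses

    def forward(self, x, train=True):
        feat = self.backbone(x)
        if not train:                      # bypasses detached at inference
            return feat
        return self.id_head(feat), self.recon_head(feat)

net = FaceNetWithBypasses()
x = torch.randn(2, 3, 112, 112)
logits, recon = net(x, train=True)
loss = nn.functional.cross_entropy(logits, torch.tensor([3, 7])) \
     + 0.1 * nn.functional.mse_loss(recon, x.flatten(1))
emb = net(x, train=False)                  # inference: features only, no extra cost
```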


Biometric Identification , Facial Recognition , Humans , Biometric Identification/methods , Face/diagnostic imaging , Face/anatomy & histology , Databases, Factual , Benchmarking
5.
Article En | MEDLINE | ID: mdl-38376968

In recent years, the security of deep learning models has attracted increasing attention with the rapid development of neural networks, which are vulnerable to adversarial examples. Almost all existing gradient-based attack methods use the sign function during generation to meet the perturbation budget under the L∞ norm. However, we find that the sign function may be improper for generating adversarial examples since it modifies the exact gradient direction. Instead of using the sign function, we propose to directly utilize the exact gradient direction with a scaling factor for generating adversarial perturbations, which improves the attack success rates of adversarial examples even with smaller perturbations. At the same time, we theoretically prove that this method can achieve better black-box transferability. Moreover, considering that the best scaling factor varies across different images, we propose an adaptive scaling factor generator to seek an appropriate scaling factor for each image, which avoids the computational cost of manually searching for the scaling factor. Our method can be integrated with almost all existing gradient-based attack methods to further improve their attack success rates. Extensive experiments on the CIFAR10 and ImageNet datasets show that our method exhibits higher transferability and outperforms the state-of-the-art methods.
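The sketch below contrasts the standard sign-based step with a step along the exact gradient direction rescaled to the same L∞ budget; the fixed per-image normalization is an assumption, since the paper learns the scaling factor adaptively.

```python
# Sketch: sign-based FGSM step vs. a step along the exact gradient direction,
# scaled so its largest component equals epsilon (L-infinity budget).
import torch
import torch.nn as nn

def gradient_step(model, x, y, epsilon=8/255, use_sign=True):
    x = x.clone().detach().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(x), y)
    grad, = torch.autograd.grad(loss, x)
    if use_sign:
        delta = epsilon * grad.sign()                       # standard sign-based step
    else:
        # keep the exact direction, rescale so max|delta| = epsilon per image
        scale = grad.abs().flatten(1).max(dim=1).values.view(-1, 1, 1, 1)
        delta = epsilon * grad / (scale + 1e-12)
    return (x + delta).clamp(0, 1).detach()
```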

6.
IEEE Trans Image Process ; 32: 4921-4934, 2023.
Article En | MEDLINE | ID: mdl-37603487

Scribble-supervised semantic segmentation is an appealing weakly supervised technique with low labeling cost. Existing approaches mainly consider diffusing the labeled scribble region via low-level feature similarity to narrow the supervision gap between scribble labels and mask labels. In this study, we observe an annotation bias between scribbles and object masks, i.e., label workers tend to scribble on spacious regions instead of corners. This label preference makes the model learn well on frequently labeled regions but poorly on rarely labeled pixels. Therefore, we propose BLPSeg to balance the label preference for complete segmentation. Specifically, BLPSeg first predicts an annotation probability map to evaluate the rarity of labels on each image, then utilizes a novel BLP loss to balance the model training by up-weighting those rare annotations. Additionally, to further alleviate the impact of label preference, we design a local aggregation module (LAM) to propagate supervision from labeled to unlabeled regions during gradient backpropagation. We conduct extensive experiments to illustrate the effectiveness of our BLPSeg. Our single-stage method even outperforms other advanced multi-stage methods and achieves state-of-the-art performance.
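A minimal sketch of a BLP-style loss follows, assuming an inverse-probability weighting of the per-pixel cross-entropy; the exact weighting function used by BLPSeg may differ.

```python
# Sketch of a BLP-style loss: per-pixel cross-entropy re-weighted by the
# inverse of an annotation-probability map, so rarely scribbled regions
# contribute more to the gradient. The inverse weighting is an assumption.
import torch
import torch.nn.functional as F

def blp_loss(logits, labels, annot_prob, ignore_index=255, eps=1e-6):
    """logits: (N, C, H, W); labels: (N, H, W) scribble labels;
    annot_prob: (N, H, W) predicted probability that a pixel gets annotated."""
    ce = F.cross_entropy(logits, labels, ignore_index=ignore_index, reduction='none')
    weight = 1.0 / (annot_prob + eps)              # up-weight rarely annotated pixels
    mask = (labels != ignore_index).float()
    return (weight * ce * mask).sum() / mask.sum().clamp(min=1)
```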

7.
IEEE Trans Image Process ; 32: 3212-3225, 2023.
Article En | MEDLINE | ID: mdl-37256805

Facial action unit (AU) detection, which aims to classify the AUs present in a facial image, has long suffered from insufficient AU annotations. In this paper, we aim to mitigate this data scarcity issue by learning AU representations from a large number of unlabelled facial videos in a contrastive learning paradigm. We formulate the self-supervised AU representation learning signals as two-fold: 1) the AU representation should be frame-wisely discriminative within a short video clip; 2) facial frames sampled from different identities but showing analogous facial AUs should have consistent AU representations. To achieve these goals, we propose to contrastively learn the AU representation within a video clip and devise a cross-identity reconstruction mechanism to learn person-independent representations. Specifically, we adopt a margin-based temporal contrastive learning paradigm to perceive the temporal AU coherence and evolution characteristics within a clip that consists of consecutive input facial frames. Moreover, the cross-identity reconstruction mechanism facilitates pushing faces from different identities but showing analogous AUs close together in the latent embedding space. Experimental results on three public AU datasets demonstrate that the learned AU representation is discriminative for AU detection. Our method outperforms other contrastive learning methods and significantly closes the performance gap between self-supervised and supervised AU detection approaches.
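As a rough illustration of margin-based temporal contrastive learning, the sketch below treats adjacent frames as positives and farther frames in the same clip as negatives; the sampling scheme and margin are assumptions and not the paper's exact objective.

```python
# Sketch: each frame embedding should be closer to its temporal neighbor than
# to a more distant frame in the same clip by at least a margin.
import torch
import torch.nn.functional as F

def temporal_margin_loss(clip_emb: torch.Tensor, margin: float = 0.2) -> torch.Tensor:
    """clip_emb: (T, D) per-frame AU embeddings of one clip, T >= 3."""
    emb = F.normalize(clip_emb, dim=1)
    anchor, positive, negative = emb[:-2], emb[1:-1], emb[2:]  # neighbor vs. farther frame
    d_pos = (anchor - positive).pow(2).sum(dim=1)
    d_neg = (anchor - negative).pow(2).sum(dim=1)
    return F.relu(d_pos - d_neg + margin).mean()

loss = temporal_margin_loss(torch.randn(8, 128))
```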

8.
IEEE Trans Pattern Anal Mach Intell ; 45(10): 11733-11752, 2023 Oct.
Article En | MEDLINE | ID: mdl-37171920

Learning generalizable representations and classifiers for class-imbalanced data is challenging for data-driven deep models. Most studies attempt to re-balance the data distribution, which is prone to overfitting on tail classes and underfitting on head classes. In this work, we propose Dual Compensation Residual Networks to better fit both tail and head classes. First, we propose a dual Feature Compensation Module (FCM) and Logit Compensation Module (LCM) to alleviate the overfitting issue. The design of these two modules is based on the observation that an important factor causing overfitting is severe feature drift between training and test data on tail classes. Specifically, the test features of a tail category tend to drift towards the feature clouds of multiple similar head categories. FCM therefore estimates a multi-mode feature drift direction for each tail category and compensates for it. Furthermore, LCM translates the deterministic feature drift vector estimated by FCM along intra-class variations, so as to cover a larger effective compensation space and thereby better fit the test features. Second, we propose a Residual Balanced Multi-Proxies Classifier (RBMC) to alleviate the underfitting issue. Motivated by the observation that the re-balancing strategy hinders the classifier from learning sufficient head knowledge and eventually causes underfitting, RBMC utilizes uniform learning with a residual path to facilitate classifier learning. Comprehensive experiments on long-tailed and class-incremental benchmarks validate the efficacy of our method.

9.
IEEE Trans Pattern Anal Mach Intell ; 45(5): 5561-5578, 2023 May.
Article En | MEDLINE | ID: mdl-36173773

Alternately reasoning over visual facts and commonsense is fundamental for an advanced visual question answering (VQA) system. This ability requires models to go beyond a literal understanding of commonsense. The system should not just treat objects as the entrance to query background knowledge, but fully ground commonsense in the visual world and imagine the possible relationships between objects, e.g., "fork, can lift, food". To comprehensively evaluate such abilities, we propose a VQA benchmark, Compositional Reasoning on vIsion and Commonsense (CRIC), which introduces new types of questions requiring compositional reasoning over vision and commonsense, and an evaluation metric integrating the correctness of answering and commonsense grounding. To collect such questions and rich additional annotations to support the metric, we also propose an automatic algorithm to generate question samples from the scene graph associated with the images and the relevant knowledge graph. We further analyze several representative types of VQA models on the CRIC dataset. Experimental results show that grounding the commonsense to the image region and jointly reasoning on vision and commonsense are still challenging for current approaches. The dataset is available at https://cricvqa.github.io.

10.
Article En | MEDLINE | ID: mdl-35895650

Data in the visual world often present long-tailed distributions. However, learning high-quality representations and classifiers for imbalanced data remains challenging for data-driven deep learning models. In this work, we aim at improving the feature extractor and classifier for long-tailed recognition via contrastive pretraining and feature normalization, respectively. First, we carefully study the influence of contrastive pretraining under different conditions, showing that current self-supervised pretraining for long-tailed learning is still suboptimal in both performance and speed. We thus propose a new balanced contrastive loss and a fast contrastive initialization scheme to improve previous long-tailed pretraining. Second, based on a motivating analysis of classifier normalization, we propose a novel generalized normalization classifier that consists of generalized normalization and grouped learnable scaling. It outperforms the traditional inner-product classifier as well as the cosine classifier. Both proposed components improve recognition ability on tail classes without sacrificing head classes. We finally build a unified framework that achieves competitive performance compared with the state of the art on several long-tailed recognition benchmarks and maintains high efficiency.
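For context on the classifiers being compared, the sketch below implements the baseline cosine classifier with a learnable scale; the paper's generalized normalization and grouped learnable scaling are not reproduced.

```python
# Baseline cosine classifier with a learnable scale, shown for contrast with
# the plain inner-product classifier mentioned in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineClassifier(nn.Module):
    def __init__(self, feat_dim: int, num_classes: int, init_scale: float = 16.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim) * 0.01)
        self.scale = nn.Parameter(torch.tensor(init_scale))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # logits are scaled cosine similarities, so head-class weight norms
        # cannot dominate tail classes the way raw inner products can
        return self.scale * F.linear(F.normalize(feats, dim=1),
                                     F.normalize(self.weight, dim=1))

logits = CosineClassifier(512, 1000)(torch.randn(8, 512))
```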

11.
IEEE Trans Image Process ; 31: 3961-3972, 2022.
Article En | MEDLINE | ID: mdl-35648877

Cartoon face recognition is challenging because cartoon faces typically have smooth color regions and emphasized edges; the key to recognizing them is to precisely perceive their sparse and critical shape patterns. However, it is quite difficult to learn a shape-oriented representation for cartoon face recognition with convolutional neural networks (CNNs). To mitigate this issue, we propose GraphJigsaw, which constructs jigsaw puzzles at various stages in the classification network and solves the puzzles with a graph convolutional network (GCN) in a progressive manner. Solving the puzzles requires the model to spot the shape patterns of the cartoon faces, as the texture information is quite limited. The key idea of GraphJigsaw is to construct a jigsaw puzzle by randomly shuffling the intermediate convolutional feature maps in the spatial dimension and exploiting the GCN to reason about and recover the correct layout of the jigsaw fragments in a self-supervised manner. The proposed GraphJigsaw avoids training the classification model with deconstructed images that would introduce noisy patterns and be harmful to the final classification. Specifically, GraphJigsaw can be incorporated at various stages in a top-down manner within the classification model, which facilitates propagating the learned shape patterns gradually. GraphJigsaw does not rely on any extra manual annotation during the training process and incorporates no extra computational burden at inference time. Both quantitative and qualitative experimental results have verified the feasibility of our proposed GraphJigsaw, which consistently outperforms other face recognition or jigsaw-based methods on two popular cartoon face datasets with considerable improvements.
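The sketch below shows only the puzzle-construction step, randomly permuting the spatial positions of a feature map and keeping the permutation as the recovery target; the GCN-based reasoning that solves the puzzle is omitted.

```python
# Sketch: build a jigsaw puzzle from an intermediate feature map by shuffling
# its spatial positions; the permutation is the self-supervised recovery target.
import torch

def shuffle_feature_map(feat: torch.Tensor):
    """feat: (N, C, H, W) intermediate convolutional feature map."""
    n, c, h, w = feat.shape
    perm = torch.randperm(h * w)                          # layout to be recovered
    shuffled = feat.flatten(2)[:, :, perm].view(n, c, h, w)
    return shuffled, perm

fragments, target_perm = shuffle_feature_map(torch.randn(2, 64, 7, 7))
```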


Facial Recognition , Neural Networks, Computer
12.
IEEE Trans Image Process ; 31: 3908-3919, 2022.
Article En | MEDLINE | ID: mdl-35622788

Most video-based person re-identification (re-id) methods focus only on appearance features and neglect motion features. In fact, motion features can help to distinguish target persons that are hard to identify by appearance features alone. However, most existing temporal information modeling methods cannot extract motion features effectively or efficiently for video-based re-id. In this paper, we propose a more efficient Motion Feature Aggregation (MFA) method to model and aggregate motion information at the feature map level for video-based re-id. The proposed MFA consists of (i) a coarse-grained motion learning module, which extracts coarse-grained motion features based on the position changes of body parts over time, and (ii) a fine-grained motion learning module, which extracts fine-grained motion features based on the appearance changes of body parts over time. These two modules model motion information at different granularities and are complementary to each other. It is easy to combine the proposed method with existing network architectures for end-to-end training. Extensive experiments on four widely used datasets demonstrate that the motion features extracted by MFA are crucial complements to appearance features for video-based re-id, especially in scenarios with large appearance changes. Besides, the results on LS-VID, the currently largest publicly available video-based re-id dataset, surpass the state-of-the-art methods by a large margin. The code is available at: https://github.com/guxinqian/Simple-ReID.


Neural Networks, Computer , Pattern Recognition, Automated , Humans , Motion , Pattern Recognition, Automated/methods , Video Recording/methods
13.
IEEE Trans Image Process ; 31: 3684-3696, 2022.
Article En | MEDLINE | ID: mdl-35580106

In object detection, enhancing feature representation using localization information has been revealed as a crucial procedure to improve detection performance. However, the localization information (i.e., regression features and regression offsets) captured by the regression branch is still not well utilized. In this paper, we propose a simple but effective method called Interactive Regression and Classification (IRC) to better utilize localization information. Specifically, we propose a Feature Aggregation Module (FAM) and a Localization Attention Module (LAM) to convey localization information to the classification branch during forward propagation. Furthermore, the classifier also guides the learning of the regression branch during backward propagation, which guarantees that the localization information is beneficial to both regression and classification. Thus, the regression and classification branches are learned in an interactive manner. Our method can be easily integrated into anchor-based and anchor-free object detectors without increasing computational cost. With our method, performance is significantly improved on many popular dense object detectors, including RetinaNet, FCOS, ATSS, PAA, GFL, GFLV2, OTA, GA-RetinaNet, RepPoints, BorderDet and VFNet. Based on a ResNet-101 backbone, IRC achieves 47.2% AP on COCO test-dev, surpassing the previous state-of-the-art PAA (44.8% AP) and GFL (45.0% AP), without sacrificing efficiency in either training or inference. Moreover, our best model (Res2Net-101-DCN) achieves a single-model single-scale AP of 51.4%.

14.
Article En | MEDLINE | ID: mdl-37015478

Cross-modality face image synthesis such as sketch-to-photo, NIR-to-RGB, and RGB-to-depth has wide applications in face recognition, face animation, and digital entertainment. Conventional cross-modality synthesis methods usually require paired training data, i.e., each subject has images of both modalities. However, paired data can be difficult to acquire, while unpaired data commonly exist. In this paper, we propose a novel semi-supervised cross-modality synthesis method (namely CMOS-GAN), which can leverage both paired and unpaired face images to learn a robust cross-modality synthesis model. Specifically, CMOS-GAN uses a generator with an encoder-decoder architecture for new-modality synthesis. We leverage a pixel-wise loss, an adversarial loss, a classification loss, and a face feature loss to exploit the information from both paired multi-modality face images and unpaired face images for model learning. In addition, since we expect the synthesized modality to also help improve face recognition accuracy, we further use a modified triplet loss to retain the discriminative features of the subject in the synthesized modality. Experiments on three cross-modality face synthesis tasks (NIR-to-VIS, RGB-to-depth, and sketch-to-photo) show the effectiveness of the proposed approach compared with the state of the art. In addition, we also collect a large-scale RGB-D dataset (VIPL-MumoFace-3K) for the RGB-to-depth synthesis task. We plan to open-source our code and the VIPL-MumoFace-3K dataset to the community.
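A sketch of how the listed losses might be combined into a single training objective is given below; the loss weights, margin, and argument shapes are assumptions for illustration only.

```python
# Sketch: weighted sum of the losses the abstract lists (pixel-wise,
# adversarial, classification, face-feature, and a triplet term).
import torch
import torch.nn.functional as F

def cmos_gan_style_loss(fake, real, disc_fake_logits, id_logits, id_labels,
                        feat_fake, feat_real, anchor, pos, neg,
                        w=(1.0, 0.1, 0.5, 0.5, 0.5)):
    l_pix = F.l1_loss(fake, real)                                    # paired data only
    l_adv = F.binary_cross_entropy_with_logits(
        disc_fake_logits, torch.ones_like(disc_fake_logits))         # fool the discriminator
    l_cls = F.cross_entropy(id_logits, id_labels)
    l_feat = F.mse_loss(feat_fake, feat_real)                        # face feature loss
    l_tri = F.triplet_margin_loss(anchor, pos, neg, margin=0.3)      # keep identity discriminative
    return sum(wi * li for wi, li in zip(w, (l_pix, l_adv, l_cls, l_feat, l_tri)))
```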

15.
IEEE Trans Pattern Anal Mach Intell ; 44(1): 302-317, 2022 01.
Article En | MEDLINE | ID: mdl-32750828

Facial actions are usually encoded as anatomy-based action units (AUs), the labelling of which demands expertise and is thus time-consuming and expensive. To alleviate the labelling demand, we leverage the large number of unlabelled videos by proposing a twin-cycle autoencoder (TAE) to learn discriminative representations for facial actions. TAE is inspired by the fact that facial actions are embedded in the pixel-wise displacements between two sequential face images (hereinafter, source and target) in a video. Therefore, learning the representations of facial actions can be achieved by learning the representations of the displacements. However, the displacements induced by facial actions are entangled with those induced by head motions. TAE is thus trained to disentangle the two kinds of movements by evaluating the quality of the synthesized images when either the facial actions or the head pose is changed, aiming to reconstruct the target image. Experiments on AU detection show that TAE can achieve accuracy comparable to other existing AU detection methods, including some supervised methods, thus validating the discriminative capacity of the representations learned by TAE. TAE's ability to decouple action-induced and pose-induced movements is also validated by visualizing the generated images and analyzing the facial image retrieval results qualitatively and quantitatively.


Algorithms , Face , Movement
16.
IEEE Trans Pattern Anal Mach Intell ; 44(9): 4894-4912, 2022 09.
Article En | MEDLINE | ID: mdl-33983879

Person re-identification (reID) plays an important role in computer vision. However, existing methods suffer from performance degradation in occluded scenes. In this work, we propose an occlusion-robust block, Region Feature Completion (RFC), for occluded reID. Different from most previous works that discard the occluded regions, the RFC block can recover the semantics of occluded regions in feature space. First, a Spatial RFC (SRFC) module is developed. SRFC exploits the long-range spatial contexts from non-occluded regions to predict the features of occluded regions. The unit-wise prediction task leads to an encoder/decoder architecture, where the region encoder models the correlation between non-occluded and occluded regions, and the region decoder utilizes the spatial correlation to recover occluded region features. Second, we introduce a Temporal RFC (TRFC) module that captures long-term temporal contexts to refine the prediction of SRFC. The RFC block is lightweight, end-to-end trainable and can be easily plugged into existing CNNs to form RFCnet. Extensive experiments are conducted on occluded and commonly used holistic reID benchmarks. Our method significantly outperforms existing methods on the occlusion datasets, while maintaining top or even superior performance on holistic datasets. The source code is available at https://github.com/blue-blue272/OccludedReID-RFCnet.


Algorithms , Software , Humans
17.
IEEE Trans Cybern ; 52(10): 11014-11026, 2022 Oct.
Article En | MEDLINE | ID: mdl-34473639

In this article, we propose a novel method to simultaneously solve the data problems of dirty quality and poor quantity for person re-identification (ReID). Dirty quality refers to wrong labels in image annotations. Poor quantity means that some identities have very few images (FewIDs). Training with these mislabeled data or FewIDs with a triplet loss leads to low generalization performance. To solve the label error problem, we propose a weighted label correction based on cross-entropy (wLCCE) strategy. Specifically, according to the influence range of the wrong labels, we first classify the mislabeled images into point label errors and set label errors. Then, we propose a weighted triplet loss (WTL) to correct the two types of label error, respectively. To alleviate the poor quantity issue, we propose a feature simulation based on autoencoder (FSAE) method to generate virtual samples for FewIDs. To ensure the authenticity of the simulated features, we transfer the difference pattern of identities with multiple images (MultIDs) to FewIDs by training an autoencoder (AE)-based simulator. In this way, the FewIDs obtain richer representations to distinguish them from other identities. By dealing with the dirty and poor data problems, we can learn more robust ReID models using the triplet loss. We conduct extensive experiments on two public person ReID datasets, Market-1501 and DukeMTMC-reID, to verify the effectiveness of our approach.

18.
IEEE Trans Image Process ; 31: 788-798, 2022.
Article En | MEDLINE | ID: mdl-34890329

Face recognition remains a challenging task in unconstrained scenarios, especially when faces are partially occluded. To improve the robustness against occlusion, augmenting the training images with artificial occlusions has proved to be a useful approach. However, these artificial occlusions are commonly generated by adding a black rectangle or several object templates such as sunglasses, scarves and phones, which cannot well simulate realistic occlusions. In this paper, based on the argument that occlusion essentially damages a group of neurons, we propose a novel and elegant occlusion-simulation method that drops the activations of a group of neurons in elaborately selected channels. Specifically, we first employ a spatial regularization to encourage each feature channel to respond to local and different face regions. Then, the locality-aware channel-wise dropout (LCD) is designed to simulate occlusions by dropping out a few feature channels. The proposed LCD encourages its succeeding layers to minimize the intra-class feature variance caused by occlusions, thus leading to improved robustness against occlusion. In addition, we design an auxiliary spatial attention module that learns a channel-wise attention vector to reweight the feature channels, which improves the contributions of non-occluded regions. Extensive experiments on various benchmarks show that the proposed method outperforms state-of-the-art methods with a remarkable improvement.
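The core occlusion-simulation operation can be sketched as channel-wise dropout on a feature map, as below; the locality-aware selection of channels via spatial regularization is not reproduced, and uniform random dropping is used as a stand-in.

```python
# Sketch: zero out whole feature channels during training to simulate occlusion.
import torch
import torch.nn as nn

class ChannelDropout(nn.Module):
    def __init__(self, drop_prob: float = 0.1):
        super().__init__()
        self.drop_prob = drop_prob

    def forward(self, feat: torch.Tensor) -> torch.Tensor:   # feat: (N, C, H, W)
        if not self.training or self.drop_prob == 0:
            return feat
        keep = (torch.rand(feat.size(0), feat.size(1), 1, 1,
                           device=feat.device) > self.drop_prob).float()
        return feat * keep / (1.0 - self.drop_prob)           # rescale kept channels

out = ChannelDropout(0.1).train()(torch.randn(2, 256, 7, 7))
```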


Facial Recognition , Face/diagnostic imaging
20.
IEEE Trans Image Process ; 30: 7649-7662, 2021.
Article En | MEDLINE | ID: mdl-34469295

Location is important distinguishing information for instance segmentation. In this paper, we propose a novel model, called Location Sensitive Network (LSNet), for human instance segmentation. LSNet integrates instance-specific location information into a one-stage segmentation framework. Specifically, in the segmentation branch, a Pose Attention Module (PAM) encodes the location information into attention regions through coordinate encoding. Based on the location information provided by PAM, the segmentation branch is able to effectively distinguish instances at the feature level. Moreover, we propose a combination operation named Keypoints Sensitive Combination (KSCom) to utilize the location information from multiple sampling points. These sampling points construct a point-based representation of instances via human keypoints and random points. Human keypoints provide the spatial locations and semantic information of the instances, and random points expand the receptive fields. Based on the point representation of each instance, KSCom effectively reduces misclassified pixels. Our method is validated by experiments on public datasets. LSNet-5 achieves 56.2 mAP at 18.5 FPS on COCOPersons. Besides, the proposed method is significantly superior to its peers in cases of severe occlusion.
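As a simple illustration of coordinate encoding, the sketch below appends normalized x/y coordinate channels to a feature map; PAM's pose-attention-region encoding is not reproduced.

```python
# Sketch: append normalized coordinate channels so later layers can reason
# about spatial position of each instance.
import torch

def add_coord_channels(feat: torch.Tensor) -> torch.Tensor:
    """feat: (N, C, H, W) -> (N, C + 2, H, W) with x/y coords in [-1, 1]."""
    n, _, h, w = feat.shape
    ys = torch.linspace(-1, 1, h, device=feat.device).view(1, 1, h, 1).expand(n, 1, h, w)
    xs = torch.linspace(-1, 1, w, device=feat.device).view(1, 1, 1, w).expand(n, 1, h, w)
    return torch.cat([feat, xs, ys], dim=1)

out = add_coord_channels(torch.randn(2, 64, 28, 28))  # (2, 66, 28, 28)
```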


Algorithms , Image Processing, Computer-Assisted , Attention , Humans , Semantics
...