ABSTRACT
Recent advances in the deep learning paradigm and in 3D imaging systems have raised the need for more complete datasets that allow the exploitation of facial attributes such as pose, gender, or age. In this work, we propose a new facial dataset collected with an innovative RGB-D multi-camera setup whose optimization is presented and validated. 3DWF includes raw and registered 3D data for 92 persons, captured with devices ranging from low-cost RGB-D sensors to highly accurate commercial scanners. 3DWF provides a complete dataset with relevant and accurate visual information for different tasks related to facial properties, such as face tracking or 3D face reconstruction, by means of annotated, density-normalized 2K point clouds and RGB-D streams. In addition, we validate the reliability of our proposal with an original data augmentation method that generates 2D facial landmark training data from a massive set of face meshes, and with head pose classification through common machine learning techniques aimed at proving the alignment of the collected data.
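The abstract's augmentation idea, generating 2D landmark annotations from 3D face meshes at new poses, can be sketched as rotating 3D landmarks and projecting them to the image plane. This is a minimal illustration, not the paper's actual pipeline: the orthographic projection and the yaw-only rotation are simplifying assumptions.

```python
import numpy as np

def augment_pose(points3d, yaw_deg):
    """Rotate 3D face landmarks about the vertical (y) axis and project
    them orthographically to 2D. A minimal sketch of generating 2D
    landmark annotations from a 3D mesh at a new pose; the rotation axis
    and projection model are illustrative assumptions."""
    t = np.radians(yaw_deg)
    R = np.array([[np.cos(t), 0.0, np.sin(t)],
                  [0.0,       1.0, 0.0],
                  [-np.sin(t), 0.0, np.cos(t)]])
    rotated = points3d @ R.T
    return rotated[:, :2]  # drop depth: orthographic projection
```

Sampling many yaw angles per mesh would yield the kind of massive 2D training set the abstract describes.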
ABSTRACT
[This corrects the article DOI: 10.3389/fpsyt.2022.882957.].
ABSTRACT
Background: Interventions aimed at easing negative moral (social) emotions and restoring social bonds, such as amend-making and forgiving, have a prominent role in the treatment of moral injury. As real-life contact between persons involved in prior morally injurious situations is not always possible or desirable, virtual reality may offer opportunities for such interventions in a safe and focused way. Objective: To explore the effects of the use of deepfake technology in the treatment of patients suffering from PTSD and moral injury as a result of being forced by persons in authority to undergo and commit sexual violence (so-called betrayal trauma). Methods: Two women who had experienced sexual violence underwent one session of confrontation with the perpetrator using deepfake technology. The women could talk via Zoom with the perpetrator, whose picture was converted into moving images using deepfake technology. A therapist answered the women's questions in the role of the perpetrator. Outcome measures were positive and negative emotions, dominance in relation to the perpetrator, self-blame, self-forgiveness, and PTSD symptom severity. Results: Both participants were positive about the intervention. Although they knew it was fake, the deepfaked perpetrator seemed very real to them. After the intervention, both reported more positive emotions, greater dominance in relation to the perpetrator, and greater self-forgiveness, as well as fewer negative emotions, less self-blame, and fewer PTSD symptoms. Conclusion: Victim-perpetrator confrontation using deepfake technology is a promising intervention to influence moral injury-related symptoms in victims of sexual violence. Deepfake technology may also show promise in simulating other interactions between persons involved in morally injurious events.
ABSTRACT
Light source position (LSP) estimation is a difficult yet important problem in computer vision. A common approach for estimating the LSP assumes Lambert's law. However, in real-world scenes, Lambert's law does not hold for all types of surfaces. Instead of assuming that all surfaces follow Lambert's law, our approach classifies image surface segments based on their photometric and geometric attributes (i.e., glossy, matte, curved, and so on) and assigns weights to image surface segments based on their suitability for LSP estimation. In addition, we propose the use of the estimated camera pose to globally constrain the LSP over RGB-D video sequences. Experiments on the Boom data set and a newly collected RGB-D video data set show that the proposed method outperforms the state-of-the-art. The results demonstrate that weighting image surface segments based on their attributes outperforms state-of-the-art methods in which all image surface segments are considered to contribute equally. In particular, by using the proposed surface weighting, the angular error for LSP estimation is reduced from 12.6° to 8.2° and from 24.6° to 4.8° for the Boom and RGB-D video data sets, respectively. Moreover, using the camera pose to globally constrain the LSP provides higher accuracy (4.8°) compared with using single frames (8.5°).
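The core idea above, weighting per-segment light-direction estimates by their suitability rather than averaging them equally, can be sketched as a weighted combination of unit vectors. This is a hedged illustration, not the paper's method: the weights would come from the attribute classifier (matte and curved segments scoring higher), which is not reproduced here.

```python
import numpy as np

def weighted_lsp_estimate(directions, weights):
    """Combine per-segment light-direction estimates into one estimate.

    directions: (N, 3) unit vectors, one crude Lambertian estimate per
                image surface segment.
    weights:    (N,) suitability scores; in the paper these would come
                from photometric/geometric attributes (assumption here).
    """
    directions = np.asarray(directions, dtype=float)
    w = np.asarray(weights, dtype=float)
    combined = (w[:, None] * directions).sum(axis=0)
    return combined / np.linalg.norm(combined)

def angular_error_deg(est, truth):
    """Angle in degrees between estimated and ground-truth directions,
    the error metric quoted in the abstract."""
    cos = np.clip(np.dot(est, truth), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)))
```

With equal weights this degenerates to the uniform-contribution baseline the abstract compares against.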
ABSTRACT
This paper focuses on fine-grained object classification using recognized scene text in natural images. While the state-of-the-art relies on visual cues only, this paper is the first work to propose combining textual and visual cues. Another novelty is the textual cue extraction: unlike state-of-the-art text detection methods, we focus more on the background than on the text regions. Once text regions are detected, they are further processed by two methods to perform text recognition, i.e., the ABBYY commercial OCR engine and a state-of-the-art character recognition algorithm. Then, to perform textual cue encoding, bi- and trigrams are formed between the recognized characters by considering the proposed spatial pairwise constraints. Finally, the extracted visual and textual cues are combined for fine-grained classification. The proposed method is validated on four publicly available data sets: ICDAR03, ICDAR13, Con-Text, and Flickr-logo. We improve the state-of-the-art in end-to-end character recognition by a large margin of 15% on ICDAR03. We show that textual cues are useful in addition to visual cues for fine-grained classification, and that textual cues are also useful for logo retrieval. Adding textual cues outperforms visual-only and textual-only approaches in fine-grained classification (70.7% vs. 60.3%) and logo retrieval (57.4% vs. 54.8%).
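The bigram-formation step above, pairing recognized characters only when a spatial pairwise constraint holds, can be sketched as follows. The concrete constraints (a small horizontal gap relative to character width and sufficient vertical overlap) and their thresholds are assumptions for illustration; the paper's exact constraints may differ.

```python
def form_bigrams(chars, max_gap=1.0, min_overlap=0.5):
    """Form character bigrams under a hypothetical spatial pairwise
    constraint.

    chars: list of (char, x, y, w, h) recognized-character detections.
    A pair (c1, c2) becomes a bigram only if c2 starts within
    max_gap * w1 to the right of c1 and the two boxes overlap
    vertically by at least min_overlap of the smaller height.
    """
    bigrams = []
    for c1 in chars:
        for c2 in chars:
            if c1 is c2:
                continue
            ch1, x1, y1, w1, h1 = c1
            ch2, x2, y2, w2, h2 = c2
            gap = x2 - (x1 + w1)  # horizontal gap between the boxes
            if not (0 <= gap <= max_gap * w1):
                continue
            overlap = min(y1 + h1, y2 + h2) - max(y1, y2)
            if overlap >= min_overlap * min(h1, h2):
                bigrams.append(ch1 + ch2)
    return bigrams
```

Trigrams would chain two such constrained pairs; the resulting n-gram counts can then be encoded as a textual feature vector.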
ABSTRACT
Object detection is an important research area in the field of computer vision, and many detection algorithms have been proposed. However, each object detector relies on specific assumptions about object appearance and imaging conditions. As a consequence, no algorithm can be considered universal. Given the large variety of object detectors, the subsequent question is how to select and combine them. In this paper, we propose a framework that learns how to combine object detectors. The proposed method uses (single) detectors such as Deformable Part Models (DPM), Color Names (CN), and Ensembles of Exemplar-SVMs (EES), and exploits their correlation through high-level contextual features to yield a combined detection list. Experiments on the PASCAL VOC07 and VOC10 data sets show that the proposed method significantly outperforms the single object detectors: DPM (8.4%), CN (6.8%), and EES (17.0%) on VOC07, and DPM (6.5%), CN (5.5%), and EES (16.2%) on VOC10. An additional experiment shows that the framework places no constraints on the type of detector: the proposed method outperforms the state-of-the-art R-CNN detector by 2.4% on VOC07 when Regions with Convolutional Neural Networks is combined with the other detectors used in this paper.
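The combination idea above, rescoring each detection using context from the other detectors' outputs, can be sketched with a simple agreement term. This is a stand-in for the paper's learned high-level contextual features: the cross-detector max-IoU agreement and its weight are assumptions made for illustration.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def combine_detections(lists, agreement_weight=0.5):
    """lists: one list of (box, score) detections per detector.

    Each detection keeps its own score plus a contextual term: how
    strongly the *other* detectors agree with it (max IoU with any of
    their boxes). A hand-crafted stand-in for the learned contextual
    combination described in the abstract.
    """
    combined = []
    for i, dets in enumerate(lists):
        others = [d for j, l in enumerate(lists) if j != i for d in l]
        for box, score in dets:
            agree = max((iou(box, b) for b, _ in others), default=0.0)
            combined.append((box, score + agreement_weight * agree))
    return sorted(combined, key=lambda d: -d[1])
```

In the paper the combination is learned rather than fixed, which is what lets heterogeneous detectors such as R-CNN be dropped in without changes.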