Results 1 - 13 of 13
1.
J Imaging ; 9(6)2023 Jun 19.
Article in English | MEDLINE | ID: mdl-37367470

ABSTRACT

The widespread use of deep learning techniques for creating realistic synthetic media, commonly known as deepfakes, poses a significant threat to individuals, organizations, and society. As the malicious use of such data can lead to serious consequences, it is becoming crucial to distinguish between authentic and fake media. Nonetheless, though deepfake generation systems can create convincing images and audio, they may struggle to maintain consistency across data modalities, for example when producing a realistic video sequence in which both the visual frames and the speech are fake yet consistent with each other. Moreover, these systems may not accurately reproduce semantic and temporal aspects of the content. All these elements can be exploited to perform robust detection of fake content. In this paper, we propose a novel approach for detecting deepfake video sequences by leveraging data multimodality. Our method extracts audio-visual features from the input video over time and analyzes them using time-aware neural networks. We exploit both the video and audio modalities to leverage the inconsistencies between and within them, enhancing the final detection performance. The peculiarity of the proposed method is that we never train on multimodal deepfake data, but on disjoint monomodal datasets containing visual-only or audio-only deepfakes. This frees us from requiring multimodal datasets during training, which is desirable given their scarcity in the literature. Moreover, at test time, it allows us to evaluate the robustness of the proposed detector on unseen multimodal deepfakes. We test different fusion techniques between data modalities and investigate which one leads to more robust predictions by the developed detectors. Our results indicate that a multimodal approach is more effective than a monomodal one, even when trained on disjoint monomodal datasets.
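
To make the fusion step concrete, here is a minimal Python sketch of score-level fusion between two monomodal detectors. The detector outputs, score ranges, and the fusion rules shown are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

def fuse_scores(video_scores, audio_scores, mode="mean"):
    """Score-level fusion of per-segment fakeness scores from two
    monomodal detectors (hypothetical outputs in [0, 1])."""
    v = np.asarray(video_scores, dtype=float)
    a = np.asarray(audio_scores, dtype=float)
    if mode == "mean":
        fused = (v + a) / 2.0
    elif mode == "max":                    # flag if either modality looks fake
        fused = np.maximum(v, a)
    elif mode == "inconsistency":          # large |v - a| hints at cross-modal mismatch
        fused = np.abs(v - a)
    else:
        raise ValueError(mode)
    return float(fused.mean())             # aggregate over time

# toy usage: the visual detector is confident, the audio detector is not
print(fuse_scores([0.9, 0.8, 0.85], [0.2, 0.3, 0.25], mode="inconsistency"))
```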

2.
Sci Rep ; 12(1): 18306, 2022 10 31.
Article in English | MEDLINE | ID: mdl-36316363

ABSTRACT

Many of the images found in scientific publications are retouched, reused, or composed to enhance the quality of the presentation. In most instances, these edits are benign and help the reader better understand the material in a paper. However, some edits constitute scientific misconduct and undermine the integrity of the presented research. Determining the legitimacy of edits made to scientific images is an open problem that no current technology can solve satisfactorily in a fully automated fashion. It thus remains up to human experts to inspect images as part of the peer-review process. Nonetheless, image analysis technologies promise to help experts perform this essential yet arduous task. Therefore, we introduce SILA, a system that makes image analysis tools available to reviewers and editors in a principled way. Further, SILA is the first human-in-the-loop end-to-end system that starts by processing article PDF files, performs image manipulation detection on the automatically extracted figures, and ends with image provenance graphs expressing the relationships between the images in question, to explain potential problems. To assess its efficacy, we introduce a dataset of scientific papers from around the globe containing annotated image manipulations and inadvertent reuse, which can serve as a benchmark for the problem at hand. Qualitative and quantitative results of the system on this dataset are reported.


Subject(s)
Image Processing, Computer-Assisted , Scientific Misconduct , Humans , Publications
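As a rough illustration of the provenance-graph idea, the sketch below builds a graph from pairwise figure similarities. The `similarity` callable, the node attributes, and the size-based direction heuristic are hypothetical placeholders, not SILA's actual pipeline.

```python
import itertools
import networkx as nx  # graph library used here to hold the provenance graph

def provenance_graph(figures, similarity, threshold=0.8):
    """Toy provenance-graph construction: nodes are extracted figures,
    a directed edge i -> j means j plausibly derives from i.
    `similarity` returns any pairwise score in [0, 1]; the direction rule
    (larger image assumed to be the source) is only a placeholder."""
    g = nx.DiGraph()
    g.add_nodes_from(f["id"] for f in figures)
    for a, b in itertools.combinations(figures, 2):
        s = similarity(a, b)
        if s >= threshold:
            src, dst = (a, b) if a["pixels"] >= b["pixels"] else (b, a)
            g.add_edge(src["id"], dst["id"], weight=s)
    return g

figs = [{"id": "fig1", "pixels": 640 * 480}, {"id": "fig2", "pixels": 320 * 240}]
g = provenance_graph(figs, lambda a, b: 0.9)
print(list(g.edges))   # [('fig1', 'fig2')]
```
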
3.
Sensors (Basel) ; 21(22)2021 Nov 19.
Article in English | MEDLINE | ID: mdl-34833788

ABSTRACT

Attention and awareness towards musculoskeletal disorders (MSDs) in the dental profession have increased considerably in the last few years. According to recent literature reviews, the prevalence of MSDs among dentists ranges between 64% and 93%. In our clinical trial, we assessed the dentist's posture during the extraction of 90 lower third molars, depending on whether the operator performed the intervention using an operating microscope, surgical loupes, or the naked eye. In particular, we analyzed the evolution of body posture during the different interventions, evaluating the impact of visual aids with respect to naked-eye interventions. The presented posture assessment approach is based on 3D acquisitions of the upper body using planar markers, which allows us to discriminate spatial displacements as small as 2 mm in translation and 1 degree in rotation. We found a significant reduction of neck bending in interventions using visual aids, in particular those performed with the microscope. We further investigated the impact of different postures on MSD risk using a widely adopted evaluation tool for ergonomic investigation of workplaces, the Rapid Upper Limb Assessment (RULA). The analysis performed in this clinical trial is based on a 3D marker tracker able to follow the surgeon's upper limbs during interventions. The method highlighted the pros and cons of the different approaches.


Subject(s)
Musculoskeletal Diseases , Occupational Diseases , Audiovisual Aids , Dentistry , Ergonomics , Humans , Musculoskeletal Diseases/diagnosis , Posture
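As a sketch of how a marker-based neck-bending score might be derived, the snippet below computes the angle between a C7-to-head marker segment and the vertical axis and maps it to a coarse RULA-style neck band. The marker names and coordinates are illustrative assumptions.

```python
import numpy as np

def neck_flexion_deg(c7, head_top, vertical=(0.0, 0.0, 1.0)):
    """Angle between the C7->head marker segment and the vertical axis,
    a rough proxy for the neck-bending input of RULA-style score sheets."""
    v = np.asarray(head_top, float) - np.asarray(c7, float)
    cosang = np.dot(v, vertical) / (np.linalg.norm(v) * np.linalg.norm(vertical))
    return np.degrees(np.arccos(np.clip(cosang, -1.0, 1.0)))

angle = neck_flexion_deg(c7=[0, 0, 1.40], head_top=[0.08, 0, 1.60])
risk = 1 if angle < 10 else 2 if angle < 20 else 3   # coarse RULA-like bands
print(f"{angle:.1f} deg -> neck score {risk}")
```
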
4.
J Imaging ; 7(8)2021 Aug 05.
Article in English | MEDLINE | ID: mdl-34460771

ABSTRACT

Identifying the source camera of images and videos has gained significant importance in multimedia forensics. It allows tracing data back to their creator, thus helping to solve copyright infringement cases and to expose the authors of heinous crimes. In this paper, we focus on the problem of camera model identification for video sequences, that is, given a video under analysis, detecting the camera model used for its acquisition. To this purpose, we develop two different CNN-based camera model identification methods, working in a novel multi-modal scenario. Unlike mono-modal methods, which use only the visual or audio information from the investigated video to tackle the identification task, the proposed multi-modal methods jointly exploit audio and visual information. We test the proposed methodologies on the well-known Vision dataset, which collects almost 2000 video sequences belonging to different devices. Experiments consider both native videos directly acquired by their acquisition devices and videos uploaded to social media platforms, such as YouTube and WhatsApp. The achieved results show that the proposed multi-modal approaches significantly outperform their mono-modal counterparts, representing a valuable strategy for the tackled problem and opening future research to even more challenging scenarios.
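
A minimal sketch of the multi-modal idea, assuming a late-fusion design in which per-model logits from a visual branch and an audio branch are combined; the paper's networks may fuse the two modalities differently.

```python
import numpy as np

def late_fusion_logits(visual_logits, audio_logits, alpha=0.5):
    """Late fusion for camera-model identification: convex combination of
    per-model softmax scores from a visual and an audio branch.
    Inputs are (num_models,) arrays of logits; alpha is a tunable weight."""
    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()
    p = (alpha * softmax(np.asarray(visual_logits, float))
         + (1 - alpha) * softmax(np.asarray(audio_logits, float)))
    return int(np.argmax(p)), p

model_idx, probs = late_fusion_logits([2.0, 0.5, -1.0], [1.2, 1.5, -0.3])
print(model_idx, probs.round(3))   # both branches agree on model 0
```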

5.
Sensors (Basel) ; 20(24)2020 Dec 14.
Article in English | MEDLINE | ID: mdl-33327431

ABSTRACT

Internet of Things (IoT) applications play a relevant role in today's industry, both in sharing diagnostic data with off-site service teams and in enabling reliable predictive maintenance systems. Several intervention scenarios, however, require the physical presence of a human operator: Augmented Reality (AR), together with a broad-band connection, represents a major opportunity to integrate diagnostic data with real-time in-situ acquisitions. Diagnostic information can be shared with remote specialists, who can monitor and guide maintenance operations from a control room as if they were on site. Furthermore, integrating heterogeneous sensors with AR visualization displays can largely improve operators' safety in complex and dangerous industrial plants. In this paper, we present a complete setup for remote assistive maintenance interventions based on 5G networking, tested at a Vodafone Base Transceiver Station (BTS) within the Vodafone 5G Program. Technicians' safety was improved by means of a lightweight AR Head-Mounted Display (HMD) equipped with a thermal camera and a depth sensor to foresee possible collisions with hot surfaces and dangerous objects, leveraging the processing power of remote computing paired with the low latency of the 5G connection. Field testing confirmed that the proposed approach can be a viable solution for egocentric environment understanding and enables an immersive integration of the obtained augmented data within the real scene.
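
A toy example of the safety logic, assuming per-pixel registered thermal and depth frames: warn wherever a surface is both hot and close. The thresholds and data layout are illustrative, not those of the deployed system.

```python
import numpy as np

def hazard_mask(thermal_c, depth_m, hot_c=60.0, near_m=0.5):
    """Per-pixel hazard test for an AR overlay: flag pixels where a surface
    is both hot (thermal camera, degrees Celsius) and close (depth, metres)."""
    thermal = np.asarray(thermal_c, float)
    depth = np.asarray(depth_m, float)
    return (thermal > hot_c) & (depth < near_m)

# toy 2x2 frames: only the top-left pixel is both hot and near
t = np.array([[80.0, 30.0], [55.0, 90.0]])
d = np.array([[0.3, 0.2], [0.4, 1.5]])
print(hazard_mask(t, d))   # [[ True False] [False False]]
```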

6.
Sensors (Basel) ; 20(21)2020 Oct 27.
Article in English | MEDLINE | ID: mdl-33120975

ABSTRACT

The problem of performing remote biomedical measurements using just a video stream of a subject's face is called remote photoplethysmography (rPPG). The aim of this work is to propose a novel method able to perform rPPG using single-photon avalanche diode (SPAD) cameras. These are extremely sensitive cameras, able to detect even a single photon, that are already used in many other applications. Moreover, a novel method that mixes deep learning and traditional signal analysis is proposed in order to extract and study the pulse signal. Experimental results show that this system achieves accurate estimates of biomedical information such as heart rate, respiration rate, and tachogram. Lastly, thanks to the adoption of a deep learning segmentation method and dependability checks, this method can be adopted in non-ideal working conditions, for example in the presence of partial facial occlusions.


Subject(s)
Biometry , Deep Learning , Photoplethysmography , Signal Processing, Computer-Assisted , Algorithms , Face , Heart Rate , Humans , Respiratory Rate
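For reference, a minimal sketch of the classic rPPG processing that this line of work builds on: average a facial region over time, detrend, and pick the spectral peak in the pulse band. The SPAD-specific acquisition and the paper's deep-learning stages are not reproduced here.

```python
import numpy as np

def heart_rate_bpm(roi_means, fs):
    """Estimate heart rate from a facial-ROI intensity trace: remove the
    mean, then pick the spectral peak inside the 0.7-4 Hz pulse band."""
    x = np.asarray(roi_means, float)
    x = x - x.mean()
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    band = (freqs >= 0.7) & (freqs <= 4.0)
    return 60.0 * freqs[band][np.argmax(spectrum[band])]

fs = 30.0                                     # frames per second
t = np.arange(0, 20, 1.0 / fs)
rng = np.random.default_rng(1)
signal = np.sin(2 * np.pi * 1.2 * t) + 0.3 * rng.standard_normal(t.size)
print(round(heart_rate_bpm(signal, fs)))      # ~72 bpm (1.2 Hz pulse)
```
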
7.
IEEE Trans Image Process ; 25(5): 2298-310, 2016 May.
Article in English | MEDLINE | ID: mdl-26992023

ABSTRACT

Video content is routinely acquired and distributed in a digital compressed format. In many cases, the same video content is encoded multiple times. This is the typical scenario that arises when a video, originally encoded directly by the acquisition device, is re-encoded, either after an editing operation or when uploaded to a sharing website. The analysis of the bitstream reveals details of the last compression step (i.e., the codec adopted and the corresponding encoding parameters), while masking the previous compression history. Therefore, in this paper, we consider a processing chain of two coding steps, and we propose a method that exploits coding-based footprints to identify both the codec and the group of pictures (GOP) size used in the first coding step. This sort of analysis is useful in video forensics, when the analyst is interested in determining the characteristics of the originating source device, and in video quality assessment, since quality is determined by the whole compression history. The proposed method relies on the fact that lossy coding is an (almost) idempotent operation. That is, re-encoding a video sequence with the same codec and coding parameters produces a sequence that is similar to the former. As a consequence, if the second codec in the chain does not significantly alter the sequence, it is possible to analyze this sort of similarity to identify the first codec and the adopted GOP size. The method was extensively validated on a very large dataset of video sequences generated by encoding content with a diversity of codecs (MPEG-2, MPEG-4, H.264/AVC, and DIRAC) and different encoding parameters. In addition, a proof of concept showing that the proposed method can also be used on videos downloaded from YouTube is reported.
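
A greatly simplified sketch of the idempotency idea, assuming a `reencode(frames, gop)` helper (hypothetical, e.g., an encoder round trip): the candidate GOP size whose re-encoding perturbs the decoded frames the least is returned. The paper's codec-footprint analysis is considerably richer than this.

```python
import numpy as np

def estimate_first_gop(frames, reencode, candidate_gops=(4, 8, 12, 16, 32)):
    """Idempotency-based first-GOP estimation (simplified): re-encode the
    decoded sequence with each candidate GOP size and keep the one whose
    re-encoding changes the frames the least."""
    best_gop, best_err = None, np.inf
    for gop in candidate_gops:
        err = np.mean((np.asarray(frames, float)
                       - np.asarray(reencode(frames, gop), float)) ** 2)
        if err < best_err:
            best_gop, best_err = gop, err
    return best_gop
```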

8.
IEEE Trans Vis Comput Graph ; 22(10): 2262-2274, 2016 10.
Article in English | MEDLINE | ID: mdl-26761820

ABSTRACT

We present a method for accelerating the computation of specular reflections in complex 3D enclosures, based on acoustic beam tracing. Our method constructs the beam tree on the fly through an iterative lookup of a precomputed data structure that collects information on the exact mutual visibility among all reflectors in the environment (region-to-region visibility). This information is encoded in the form of visibility regions that are conveniently represented in the space of acoustic rays using Plücker coordinates. During the beam tracing phase, the visibility of the environment from the source position (the beam tree) is evaluated by traversing the precomputed visibility data structure and testing the presence of beams inside the visibility regions. The Plücker parameterization simplifies this procedure and reduces its computational burden, as it turns out to be an iterative intersection of linear subspaces. Similarly, during the path determination phase, acoustic paths are found by testing their presence within the nodes of the beam tree data structure. The simulations show that, with an average computation time per beam on the order of a dozen microseconds, the proposed method can compute a large number of beams at rates suitable for interactive applications with moving sources and receivers.
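
To illustrate the ray-space machinery, the snippet below computes Plücker coordinates for lines through point pairs and evaluates the permuted inner product used for incidence and side tests; it is a minimal sketch, not the paper's beam-tree traversal.

```python
import numpy as np

def plucker(p, q):
    """Plücker coordinates (direction, moment) of the line through p and q."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.concatenate([q - p, np.cross(p, q)])

def side(l1, l2):
    """Permuted inner product of two Plücker lines; its sign tells on which
    side one line passes the other (zero means the lines are incident)."""
    d1, m1 = l1[:3], l1[3:]
    d2, m2 = l2[:3], l2[3:]
    return float(np.dot(d1, m2) + np.dot(d2, m1))

ray = plucker([0, 0, 0], [1, 0, 0])          # a ray along the x axis
edge = plucker([0.5, -1, 1], [0.5, 1, 1])    # a reflector edge above it
print(side(ray, edge))                        # nonzero: the ray misses the edge
```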

9.
IEEE Trans Image Process ; 25(3): 1109-23, 2016 Mar.
Article in English | MEDLINE | ID: mdl-26685239

ABSTRACT

Transform coding is routinely used for lossy compression of discrete sources with memory. The input signal is divided into N-dimensional vectors, which are transformed by means of a linear mapping. Then, transform coefficients are quantized and entropy coded. In this paper, we consider the problem of identifying the transform matrix as well as the quantization step sizes. First, we study the case in which the only available information is a set of P transform decoded vectors. We formulate the problem in terms of finding the lattice with the largest determinant that contains all observed vectors. We propose an algorithm that is able to find the optimal solution, and we formally study its convergence properties. Three potential realms of application are considered as example scenarios for the proposed theory: 1) parameter retrieval in the presence of a chain of two transform coders; 2) image tampering identification; and 3) parameter estimation for predictive coders. We show that, despite their differences, all three scenarios can be tackled by applying the same fundamental methodology. Experiments on both synthetic data and real images validate the proposed approach.
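
A one-dimensional analogue of the lattice search may clarify the idea: among candidate quantization steps, keep the largest one whose grid contains all observed coefficients. The paper's formulation handles the full N-dimensional lattice with an unknown transform; the candidate grid and tolerance below are illustrative.

```python
import numpy as np

def estimate_step(coeffs, steps=np.arange(0.05, 5.0, 0.01)):
    """1-D analogue of the lattice search: find the largest quantization
    step such that all observed (dequantized) coefficients sit on its grid.
    Real data would need a noise tolerance rather than exact divisibility."""
    coeffs = np.asarray(coeffs, float)
    good = [s for s in steps
            if np.all(np.abs(coeffs / s - np.round(coeffs / s)) < 1e-6)]
    return max(good) if good else None

print(round(estimate_step([1.5, -3.0, 4.5, 0.0, 7.5]), 2))   # 1.5
```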

10.
IEEE Trans Image Process ; 24(11): 3546-60, 2015 Nov.
Article in English | MEDLINE | ID: mdl-26080384

ABSTRACT

Binary local features represent an effective alternative to real-valued descriptors, leading to comparable results for many visual analysis tasks while being characterized by significantly lower computational complexity and memory requirements. When dealing with large collections, a more compact representation based on global features is often preferred, which can be obtained from local features by means of, e.g., the bag-of-visual-words model. Several applications, including visual sensor networks and mobile augmented reality, require visual features to be transmitted over a bandwidth-limited network, thus calling for coding techniques that reduce the required bit budget while attaining a target level of efficiency. In this paper, we investigate a coding scheme tailored to both local and global binary features, which aims at exploiting both spatial and temporal redundancy by means of intra- and inter-frame coding. In this respect, the proposed coding scheme can conveniently be adopted to support the analyze-then-compress (ATC) paradigm. That is, visual features are extracted from the acquired content, encoded at remote nodes, and finally transmitted to a central controller that performs the visual analysis. This is in contrast with the traditional approach, in which visual content is acquired at a node, compressed, and then sent to a central unit for further processing, according to the compress-then-analyze (CTA) paradigm. In this paper, we experimentally compare ATC and CTA by means of rate-efficiency curves in the context of two different visual analysis tasks: 1) homography estimation and 2) content-based retrieval. Our results show that the novel ATC paradigm based on the proposed coding primitives can be competitive with CTA, especially in bandwidth-limited scenarios.
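
As a toy illustration of inter-frame coding of binary features, the sketch below transmits the XOR residual between matched descriptors and estimates its entropy-coded size. The descriptor length and the matching step are assumed; the paper's actual entropy coder is not modeled.

```python
import numpy as np

def xor_residual(desc_prev, desc_curr):
    """Inter-frame coding idea for binary descriptors: transmit the XOR with
    the previous frame's matched descriptor; few flipped bits means few bits
    after entropy coding. Descriptors are 0/1 numpy arrays (e.g., 256 bits)."""
    residual = np.bitwise_xor(desc_prev, desc_curr)
    p = residual.mean()                       # probability of a flipped bit
    entropy = 0.0 if p in (0.0, 1.0) else -p*np.log2(p) - (1-p)*np.log2(1-p)
    return residual, entropy * residual.size  # approx. coded size in bits

rng = np.random.default_rng(0)
d0 = rng.integers(0, 2, 256)
d1 = d0.copy(); d1[:16] ^= 1                  # 16 bits flipped between frames
print(xor_residual(d0, d1)[1])                # well under 256 bits
```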

11.
IEEE Trans Image Process ; 23(5): 2262-76, 2014 May.
Article in English | MEDLINE | ID: mdl-24818244

ABSTRACT

Visual features are successfully exploited in several applications (e.g., visual search, object recognition, and tracking) due to their ability to efficiently represent image content. Several visual analysis tasks require features to be transmitted over a bandwidth-limited network, thus calling for coding techniques to reduce the required bit budget while attaining a target level of efficiency. In this paper, we propose, for the first time, a coding architecture designed for local features (e.g., SIFT, SURF) extracted from video sequences. To achieve high coding efficiency, we exploit both spatial and temporal redundancy by means of intraframe and interframe coding modes. In addition, we propose a coding mode decision based on rate-distortion optimization. The proposed coding scheme can be conveniently adopted to implement the analyze-then-compress (ATC) paradigm in the context of visual sensor networks. That is, sets of visual features are extracted from video frames, encoded at remote nodes, and finally transmitted to a central controller that performs visual analysis. This is in contrast to the traditional compress-then-analyze (CTA) paradigm, in which video sequences acquired at a node are compressed and then sent to a central unit for further processing. In this paper, we compare these coding paradigms using metrics that are routinely adopted to evaluate the suitability of visual features in the context of content-based retrieval, object recognition, and tracking. Experimental results demonstrate that, thanks to the significant coding gains achieved by the proposed coding scheme, ATC outperforms CTA with respect to all evaluation metrics.
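
The mode decision can be sketched as a standard Lagrangian rate-distortion comparison; the lambda value and the cost figures below are arbitrary examples, not the paper's operating points.

```python
def choose_mode(rate_intra, dist_intra, rate_inter, dist_inter, lam=0.2):
    """Lagrangian mode decision as used in RD-optimized coders: pick the
    mode minimizing J = D + lambda * R."""
    j_intra = dist_intra + lam * rate_intra
    j_inter = dist_inter + lam * rate_inter
    return "intra" if j_intra <= j_inter else "inter"

# inter coding costs fewer bits at slightly higher distortion -> "inter"
print(choose_mode(rate_intra=256, dist_intra=0.0,
                  rate_inter=90, dist_inter=4.0))
```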

12.
IEEE Trans Image Process ; 18(11): 2491-504, 2009 Nov.
Article in English | MEDLINE | ID: mdl-19635704

ABSTRACT

In the last decade, the increased possibility to produce, edit, and disseminate multimedia contents has not been adequately balanced by similar advances in protecting these contents from the unauthorized diffusion of forged copies. When the goal is to detect whether a digital content has been tampered with in order to alter its semantics, the use of multimedia hashes turns out to be an effective solution to offer proof of legitimacy and to possibly identify the introduced tampering. We propose an image hashing algorithm based on compressive sensing principles, which solves both the authentication and the tampering identification problems. The original content producer generates a hash using a small bit budget by quantizing a limited number of random projections of the authentic image. The content user receives the (possibly altered) image and uses the hash to estimate the mean square error distortion between the original and the received image. In addition, if the introduced tampering is sparse in some orthonormal basis or redundant dictionary, an approximation of it is given in the pixel domain. We emphasize that the hash is universal, i.e., the same hash signature can be used to detect and identify different types of tampering. At the cost of additional complexity at the decoder, the proposed algorithm is robust to moderate content-preserving transformations including cropping, scaling, and rotation. In addition, in order to keep the size of the hash small, hash encoding/decoding takes advantage of distributed source codes.
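
A toy version of the hashing idea, assuming seeded Gaussian projections and uniform quantization: because random projections are near-isometric, the hash difference yields a pixel-domain MSE estimate. The distributed source coding and sparse tampering localization steps are omitted, and all parameters below are illustrative.

```python
import numpy as np

def cs_hash(image, n_proj=64, q_step=4.0, seed=42):
    """Toy hash: quantize a few seeded random projections of the image.
    The seed plays the role of a secret shared with the content user."""
    rng = np.random.default_rng(seed)
    x = np.asarray(image, float).ravel()
    A = rng.standard_normal((n_proj, x.size)) / np.sqrt(n_proj)
    return np.round(A @ x / q_step)

def mse_estimate(h1, h2, n_pixels, q_step=4.0):
    """Distortion estimate from hashes alone: for near-isometric random
    projections, ||A(x1 - x2)||^2 / n_pixels approximates the pixel MSE."""
    return float(np.sum(((h1 - h2) * q_step) ** 2)) / n_pixels

img = np.random.default_rng(0).uniform(0, 255, (32, 32))
tampered = img.copy(); tampered[:8, :8] += 40       # sparse local change
h1, h2 = cs_hash(img), cs_hash(tampered)
print(round(mse_estimate(h1, h2, img.size), 1))     # rough estimate; true MSE is 100
```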

13.
IEEE Trans Image Process ; 17(7): 1129-43, 2008 Jul.
Article in English | MEDLINE | ID: mdl-18586621

ABSTRACT

Consider the problem of transmitting multiple video streams under a constant total bandwidth constraint. The available bit budget needs to be distributed across the sequences in order to meet some optimality criterion. For example, one might want to minimize the average distortion or, alternatively, minimize the distortion variance, in order to keep almost constant quality among the encoded sequences. By working in the rho-domain, we propose a low-delay rate allocation scheme that, at each time instant, provides a closed-form solution for either of the aforementioned problems. We show that minimizing the distortion variance instead of the average distortion leads, for each of the multiplexed sequences, to a coding penalty of less than 0.5 dB in terms of average PSNR. In addition, our analysis provides an explicit relationship between the model parameters and this loss. In order to smooth the distortion along time as well, we accommodate a shared encoder buffer to compensate for rate fluctuations. Although the proposed scheme is general, and can be adopted for any video and image coding standard, we provide experimental evidence by transcoding bitstreams encoded using the state-of-the-art H.264/AVC standard. The results of our simulations reveal that it is possible to achieve distortion smoothing both in time and across the sequences, without sacrificing coding efficiency.


Subject(s)
Algorithms , Data Compression/methods , Image Enhancement/methods , Image Interpretation, Computer-Assisted/methods , Signal Processing, Computer-Assisted , Video Recording/methods , Quality Control , Reproducibility of Results , Sensitivity and Specificity
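
A minimal sketch of a rho-domain allocation, assuming the linear model R_i = theta_i * (1 - rho_i) (rho being the fraction of zero coefficients) and an equal-rho split of the total budget; the paper's closed-form solutions for minimum average distortion and minimum distortion variance are more elaborate.

```python
import numpy as np

def rho_domain_allocation(thetas, r_total):
    """Split a total bit budget across sequences under R_i = theta_i*(1-rho),
    forcing the same rho (fraction of zero coefficients) for every sequence."""
    thetas = np.asarray(thetas, float)
    rho = 1.0 - r_total / thetas.sum()       # common rho meeting the budget
    return thetas * (1.0 - rho)              # per-sequence rates, summing to r_total

# three multiplexed sequences with different complexity (theta)
print(rho_domain_allocation([1000.0, 2000.0, 3000.0], r_total=1800.0))  # [300. 600. 900.]
```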