Results 1 - 20 of 44
1.
Int J Comput Vis ; 126(2): 333-357, 2018.
Article in English | MEDLINE | ID: mdl-31983807

ABSTRACT

Human behavior and affect are inherently dynamic phenomena involving the temporal evolution of patterns manifested through a multiplicity of non-verbal behavioral cues, including facial expressions, body postures and gestures, and vocal outbursts. A natural assumption in human behavior modeling is that a continuous-time characterization of behavior (e.g., continuous rather than discrete annotations of dimensional affect) is the output of a linear time-invariant system whose input is the behavioral cues. Here we study the learning of such a dynamical system under real-world conditions, namely in the presence of noisy behavioral cue descriptors and possibly unreliable annotations, by employing structured rank minimization. To this end, a novel structured rank minimization method and its scalable variant are proposed. The generalizability of the proposed framework is demonstrated by conducting experiments on 3 distinct dynamic behavior analysis tasks, namely (i) conflict intensity prediction, (ii) prediction of valence and arousal, and (iii) tracklet matching. The attained results outperform those achieved by other state-of-the-art methods for these tasks and hence evidence the robustness and effectiveness of the proposed approach.
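
As a rough illustration of the connection exploited above, the order of a linear time-invariant system is reflected in the rank of a Hankel matrix built from its output; the sketch below (plain NumPy, with toy signals and sizes that are purely illustrative assumptions) builds such a Hankel matrix from a noisy 1-D annotation-like signal and inspects its singular values. It is not the paper's structured rank minimization algorithm, only the underlying low-rank structure.

```python
# Minimal sketch: Hankel-matrix rank as a proxy for LTI system order.
import numpy as np

def hankel_matrix(signal, num_rows):
    """Stack delayed copies of a 1-D signal into a Hankel matrix."""
    num_cols = len(signal) - num_rows + 1
    return np.stack([signal[i:i + num_cols] for i in range(num_rows)])

rng = np.random.default_rng(0)
t = np.arange(200)
# A latent trajectory made of two sinusoids (Hankel rank 4) plus annotation noise.
clean = np.sin(0.07 * t) + 0.5 * np.cos(0.19 * t)
noisy = clean + 0.1 * rng.standard_normal(t.size)

H = hankel_matrix(noisy, num_rows=20)
U, s, Vt = np.linalg.svd(H, full_matrices=False)
print("leading singular values:", np.round(s[:6], 2))

# Truncating to the dominant singular values gives a crude (unstructured)
# low-rank denoiser; the paper instead minimizes rank while preserving the
# Hankel structure of the matrix.
H_lr = (U[:, :4] * s[:4]) @ Vt[:4]
```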

2.
Int J Comput Vis ; 122(1): 17-33, 2017.
Article in English | MEDLINE | ID: mdl-32269419

ABSTRACT

Fitting algorithms for Active Appearance Models (AAMs) are usually considered to be either robust but slow, or fast but less able to generalize well to unseen variations. In this paper, we look into AAM fitting algorithms and make the following orthogonal contributions: We present a simple "project-out" optimization framework that unifies and revises the most well-known optimization problems and solutions in AAMs. Based on this framework, we describe robust simultaneous AAM fitting algorithms whose complexity is not prohibitive for current systems. We then go one step further and propose a new approximate project-out AAM fitting algorithm which we coin Extended Project-Out Inverse Compositional (E-POIC). In contrast to current algorithms, E-POIC is both efficient and robust. Next, we describe a part-based AAM employing a translational motion model, which results in superior fitting and convergence properties. We also show that the proposed AAMs, when trained "in-the-wild" using SIFT descriptors, perform surprisingly well even for the case of unseen unconstrained images. Via a number of experiments on unconstrained human and animal face databases, we show that our combined contributions largely bridge the gap between exact and current approximate methods for AAM fitting and perform comparably with state-of-the-art face alignment systems.

3.
Int J Comput Vis ; 122(2): 270-291, 2017.
Article in English | MEDLINE | ID: mdl-32226226

ABSTRACT

The unconstrained acquisition of facial data in real-world conditions may result in face images with significant pose variations, illumination changes, and occlusions, affecting the performance of facial landmark localization and recognition methods. In this paper, a novel method, robust to pose, illumination variations, and occlusions, is proposed for joint face frontalization and landmark localization. Unlike the state-of-the-art methods for landmark localization and pose correction, which require large amounts of manually annotated images or 3D facial models, the proposed method relies on a small set of frontal images only. By observing that the frontal facial image of both humans and animals is the one with the minimum rank among all poses, a model is devised which jointly recovers the frontalized version of the face as well as the facial landmarks. To this end, a suitable optimization problem is solved, involving minimization of the nuclear norm (the convex surrogate of the rank function) and the matrix ℓ1 norm, which accounts for occlusions. The proposed method is assessed in frontal view reconstruction of human and animal faces, landmark localization, pose-invariant face recognition, face verification in unconstrained conditions, and video inpainting by conducting experiments on 9 databases. The experimental results demonstrate the effectiveness of the proposed method in comparison to the state-of-the-art methods for the target problems.
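
For readers unfamiliar with the optimization ingredients mentioned above, the following is a minimal, generic robust-PCA-style sketch: it splits a data matrix into a low-rank part (nuclear norm, via singular value thresholding) and a sparse part (ℓ1 norm, via soft thresholding) with a standard ADMM-type loop. Parameter defaults follow common robust PCA practice; this is not the paper's joint frontalization and landmark model.

```python
import numpy as np

def svt(X, tau):
    """Singular value thresholding: proximal operator of the nuclear norm."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def shrink(X, tau):
    """Soft thresholding: proximal operator of the elementwise l1 norm."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def robust_pca(D, lam=None, mu=None, iters=200):
    """Split D into a low-rank part L and a sparse part S with D ~ L + S."""
    m, n = D.shape
    lam = lam if lam is not None else 1.0 / np.sqrt(max(m, n))
    mu = mu if mu is not None else 0.25 * m * n / (np.abs(D).sum() + 1e-12)
    L = np.zeros_like(D); S = np.zeros_like(D); Y = np.zeros_like(D)
    for _ in range(iters):
        L = svt(D - S + Y / mu, 1.0 / mu)       # low-rank update
        S = shrink(D - L + Y / mu, lam / mu)    # sparse (occlusion) update
        Y = Y + mu * (D - L - S)                # dual update
    return L, S
```

In a face setting, the columns of D would be vectorized images, L would capture the shared low-rank structure, and S would absorb occlusions and other gross errors.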

4.
IEEE Trans Cybern ; 53(6): 3454-3466, 2023 Jun.
Article in English | MEDLINE | ID: mdl-35439155

ABSTRACT

Video-to-speech is the process of reconstructing the audio speech from a video of a spoken utterance. Previous approaches to this task have relied on a two-step process where an intermediate representation is inferred from the video and is then decoded into waveform audio using a vocoder or a waveform reconstruction algorithm. In this work, we propose a new end-to-end video-to-speech model based on generative adversarial networks (GANs) which translates spoken video to waveform end-to-end without using any intermediate representation or separate waveform synthesis algorithm. Our model consists of an encoder-decoder architecture that receives raw video as input and generates speech, which is then fed to a waveform critic and a power critic. The use of an adversarial loss based on these two critics enables the direct synthesis of the raw audio waveform and ensures its realism. In addition, the use of our three comparative losses helps establish direct correspondence between the generated audio and the input video. We show that this model is able to reconstruct speech with remarkable realism for constrained datasets such as GRID, and that it is the first end-to-end model to produce intelligible speech for Lip Reading in the Wild (LRW), featuring hundreds of speakers recorded entirely "in the wild." We evaluate the generated samples in two different scenarios (seen and unseen speakers) using four objective metrics which measure the quality and intelligibility of artificial speech. We demonstrate that the proposed approach outperforms all previous works in most metrics on GRID and LRW.
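
A minimal PyTorch sketch of the adversarial ingredient described above: a small 1-D convolutional waveform critic and hinge losses for the critic and the generator. The architecture, kernel sizes, and loss form are illustrative assumptions rather than the paper's exact model.

```python
import torch
import torch.nn as nn

class WaveformCritic(nn.Module):
    """Scores raw waveforms (higher = judged more realistic)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=15, stride=4, padding=7),
            nn.LeakyReLU(0.2),
            nn.Conv1d(32, 64, kernel_size=15, stride=4, padding=7),
            nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool1d(1),
            nn.Flatten(),
            nn.Linear(64, 1),
        )

    def forward(self, wav):                 # wav: (batch, 1, samples)
        return self.net(wav)

def critic_loss(critic, real_wav, fake_wav):
    # Hinge loss: push real scores above +1 and fake scores below -1.
    return (torch.relu(1.0 - critic(real_wav)).mean()
            + torch.relu(1.0 + critic(fake_wav.detach())).mean())

def generator_adv_loss(critic, fake_wav):
    # The generator is rewarded when the critic scores its output highly.
    return -critic(fake_wav).mean()

# Example shapes: one-second clips at 16 kHz.
real = torch.randn(2, 1, 16000)
fake = torch.randn(2, 1, 16000)
critic = WaveformCritic()
print(critic_loss(critic, real, fake).item(), generator_adv_loss(critic, fake).item())
```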

5.
IEEE Trans Pattern Anal Mach Intell ; 45(11): 12944-12959, 2023 Nov.
Article in English | MEDLINE | ID: mdl-37022892

ABSTRACT

This article presents a novel method for face clustering in videos using a video-centralised transformer. Previous works often employed contrastive learning to learn frame-level representation and used average pooling to aggregate the features along the temporal dimension. This approach may not fully capture the complicated video dynamics. In addition, despite the recent progress in video-based contrastive learning, few have attempted to learn a self-supervised clustering-friendly face representation that benefits the video face clustering task. To overcome these limitations, our method employs a transformer to directly learn video-level representations that can better reflect the temporally-varying property of faces in videos, while we also propose a video-centralised self-supervised framework to train the transformer model. We also investigate face clustering in egocentric videos, a fast-emerging setting that has not yet been studied in prior work on face clustering. To this end, we present and release the first large-scale egocentric video face clustering dataset named EasyCom-Clustering. We evaluate our proposed method on both the widely used Big Bang Theory (BBT) dataset and the new EasyCom-Clustering dataset. Results show that our video-centralised transformer surpasses all previous state-of-the-art methods on both benchmarks, exhibiting a self-attentive understanding of face videos.

6.
Article in English | MEDLINE | ID: mdl-35275815

ABSTRACT

Image-based age estimation aims to predict a person's age from facial images. It is used in a variety of real-world applications. Although end-to-end deep models have achieved impressive results for age estimation on benchmark datasets, their performance in-the-wild still leaves much room for improvement due to the challenges caused by large variations in head pose, facial expressions, and occlusions. To address this issue, we propose a simple yet effective method to explicitly incorporate facial semantics into age estimation, so that the model learns to correctly focus on the most informative facial components from unaligned facial images regardless of head pose and non-rigid deformation. To this end, we design a face parsing-based network to learn semantic information at different scales and a novel face parsing attention module to leverage these semantic features for age estimation. To evaluate our method on in-the-wild data, we also introduce a new challenging large-scale benchmark called IMDB-Clean. This dataset is created by semi-automatically cleaning the noisy IMDB-WIKI dataset using a constrained clustering method. Through comprehensive experiments on IMDB-Clean and other benchmark datasets, under both intra-dataset and cross-dataset evaluation protocols, we show that our method consistently outperforms all existing age estimation methods and achieves a new state-of-the-art performance. To the best of our knowledge, our work presents the first attempt at leveraging face parsing attention to achieve semantic-aware age estimation, which may inspire other high-level facial analysis tasks.
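
The following is a small PyTorch sketch of the general idea of parsing-driven attention pooling: semantic (face-parsing) features produce a spatial attention map that weights appearance features before a regression head. Channel sizes, the fusion scheme, and the module name are assumptions for illustration, not the paper's module.

```python
import torch
import torch.nn as nn

class ParsingAttentionPool(nn.Module):
    def __init__(self, parse_ch=64):
        super().__init__()
        # 1x1 conv turns parsing features into a single spatial attention map.
        self.attn = nn.Conv2d(parse_ch, 1, kernel_size=1)

    def forward(self, feats, parsing):
        # feats: (B, C, H, W) appearance features
        # parsing: (B, parse_ch, H, W) semantic (face-parsing) features
        a = torch.softmax(self.attn(parsing).flatten(2), dim=-1)   # (B, 1, H*W)
        pooled = (feats.flatten(2) * a).sum(dim=-1)                # (B, C)
        return pooled

pool = ParsingAttentionPool()
feats = torch.randn(4, 256, 28, 28)
parsing = torch.randn(4, 64, 28, 28)
age_head = nn.Linear(256, 1)                   # regression head for age
print(age_head(pool(feats, parsing)).shape)    # torch.Size([4, 1])
```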

7.
Front Neurosci ; 15: 781196, 2021.
Article in English | MEDLINE | ID: mdl-35069100

ABSTRACT

Understanding speech becomes a demanding task when the environment is noisy. Comprehension of speech in noise can be substantially improved by looking at the speaker's face, and this audiovisual benefit is even more pronounced in people with hearing impairment. Recent advances in AI have made it possible to synthesize photorealistic talking faces from a speech recording and a still image of a person's face in an end-to-end manner. However, it has remained unknown whether such facial animations improve speech-in-noise comprehension. Here we consider facial animations produced by a recently introduced generative adversarial network (GAN), and show that humans cannot distinguish between the synthesized and the natural videos. Importantly, we then show that the end-to-end synthesized videos significantly aid humans in understanding speech in noise, although natural facial motions yield an even higher audiovisual benefit. We further find that an audiovisual speech recognizer (AVSR) benefits from the synthesized facial animations as well. Our results suggest that synthesizing facial motions from speech can be used to aid speech comprehension in difficult listening environments.

8.
IEEE Trans Pattern Anal Mach Intell ; 43(3): 1022-1040, 2021 03.
Article in English | MEDLINE | ID: mdl-31581074

ABSTRACT

Natural human-computer interaction and audio-visual human behaviour sensing systems that achieve robust performance in-the-wild are needed more than ever, as digital devices are increasingly becoming an indispensable part of our lives. Accurately annotated real-world data are the crux of devising such systems. However, existing databases usually consider controlled settings, low demographic variability, and a single task. In this paper, we introduce the SEWA database of more than 2,000 minutes of audio-visual data of 398 people coming from six cultures, 50 percent female, and uniformly spanning the age range of 18 to 65 years old. Subjects were recorded in two different contexts: while watching adverts and while discussing adverts in a video chat. The database includes rich annotations of the recordings in terms of facial landmarks, facial action units (FAU), various vocalisations, mirroring, and continuously valued valence, arousal, liking, agreement, and prototypic examples of (dis)liking. This database aims to be an extremely valuable resource for researchers in affective computing and automatic human sensing and is expected to push forward research in human behaviour analysis, including cultural studies. Along with the database, we provide extensive baseline experiments for automatic FAU detection and automatic valence, arousal, and (dis)liking intensity estimation.


Subjects
Algorithms; Emotions; Adolescent; Adult; Aged; Attitude; Databases, Factual; Face; Female; Humans; Middle Aged; Young Adult
9.
IEEE Trans Cybern ; 50(5): 2288-2301, 2020 May.
Article in English | MEDLINE | ID: mdl-30561363

ABSTRACT

The ability to localize visual objects that are associated with an audio source and at the same time to separate the audio signal is a cornerstone in audio-visual signal-processing applications. However, available methods mainly focus on localizing only the visual objects, without audio separation abilities. In addition, these methods often rely either on laborious preprocessing steps to segment video frames into semantic regions, or on additional supervision to guide their localization. In this paper, we aim to address the problem of visual source localization and audio separation in an unsupervised manner and avoid all preprocessing or post-processing steps. To this end, we devise a novel structured matrix decomposition method that decomposes the data matrix of each modality as a superposition of three terms: 1) a low-rank matrix capturing the background information; 2) a sparse matrix capturing the correlated components among the two modalities and, hence, uncovering the sound source in the visual modality and the associated sound in the audio modality; and 3) a third sparse matrix accounting for uncorrelated components, such as distracting objects in the visual modality and irrelevant sound in the audio modality. The generality of the proposed method is demonstrated by applying it to three applications, namely: 1) visual localization of a sound source; 2) visually assisted audio separation; and 3) active speaker detection. Experimental results indicate the effectiveness of the proposed method on these application domains.


Subjects
Image Processing, Computer-Assisted/methods; Signal Processing, Computer-Assisted; Algorithms; Deep Learning; Humans; Sound Localization; Video Recording
10.
IEEE Trans Pattern Anal Mach Intell ; 31(1): 39-58, 2009 Jan.
Article in English | MEDLINE | ID: mdl-19029545

ABSTRACT

Automated analysis of human affective behavior has attracted increasing attention from researchers in psychology, computer science, linguistics, neuroscience, and related disciplines. However, the existing methods typically handle only deliberately displayed and exaggerated expressions of prototypical emotions, despite the fact that deliberate behavior differs in visual appearance, audio profile, and timing from spontaneously occurring behavior. To address this problem, efforts to develop algorithms that can process naturally occurring human affective behavior have recently emerged. Moreover, an increasing number of efforts are reported toward multimodal fusion for human affect analysis, including audiovisual fusion, linguistic and paralinguistic fusion, and multi-cue visual fusion based on facial expressions, head movements, and body gestures. This paper introduces and surveys these recent advances. We first discuss human emotion perception from a psychological perspective. Next, we examine available approaches to solving the problem of machine understanding of human affective behavior, and discuss important issues like the collection and availability of training and test data. We finally outline some of the scientific and engineering challenges to advancing human affect sensing technology.


Subjects
Affect/physiology; Algorithms; Artificial Intelligence; Emotions/physiology; Facial Expression; Monitoring, Physiologic/methods; Pattern Recognition, Automated/methods; Sound Spectrography/methods
11.
Article in English | MEDLINE | ID: mdl-29993690

ABSTRACT

We propose a Multi-Instance-Learning (MIL) approach for weakly-supervised learning problems, where a training set is formed by bags (sets of feature vectors or instances) and only labels at bag-level are provided. Specifically, we consider the Multi-Instance Dynamic-Ordinal-Regression (MI-DOR) setting, where the instance labels are naturally represented as ordinal variables and bags are structured as temporal sequences. To this end, we propose Multi-Instance Dynamic Ordinal Random Fields (MI-DORF). In this framework, we treat instance labels as temporally-dependent latent variables in an undirected graphical model. Different MIL assumptions are modelled via newly introduced high-order potentials relating bag and instance labels within the energy function of the model. We also extend our framework to address Partially-Observed MI-DOR problems, where a subset of instance labels is available during training. We show, on the tasks of weakly-supervised facial behavior analysis, namely Facial Action Unit intensity estimation (DISFA dataset) and pain intensity estimation (UNBC dataset), that the proposed framework outperforms alternative learning approaches. Furthermore, we show that MI-DORF can be employed to substantially reduce the data annotation effort in this context.
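
To make the multi-instance setting concrete, the toy NumPy sketch below uses the common assumption for ordinal intensities that a bag (sequence) label equals the maximum of its instance (frame) labels, one of the relations the MI-DORF potentials encode. The data and checks are synthetic assumptions, not the paper's graphical model.

```python
import numpy as np

rng = np.random.default_rng(1)

def bag_label_from_instances(instance_labels):
    """MIL assumption for intensities: a sequence's label is its peak frame label."""
    return int(np.max(instance_labels))

# Toy "videos": each bag is a sequence of latent per-frame intensities in 0..5.
bags = [rng.integers(0, 6, size=rng.integers(20, 60)) for _ in range(5)]
bag_labels = [bag_label_from_instances(b) for b in bags]

# A weak learner only ever sees (bag, bag_label); any per-frame predictor it
# produces can still be checked against the MIL assumption at bag level.
frame_predictions = [np.clip(b + rng.integers(-1, 2, size=b.size), 0, 5) for b in bags]
predicted_bag_labels = [bag_label_from_instances(p) for p in frame_predictions]
print(bag_labels, predicted_bag_labels)
```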

12.
IEEE Trans Image Process ; 26(2): 1040-1053, 2017 Feb.
Article in English | MEDLINE | ID: mdl-28026767

ABSTRACT

Active appearance models (AAMs) are generative models of shape and appearance that have proven very attractive for their ability to handle wide changes in illumination, pose, and occlusion when trained in the wild, while not requiring large training data sets like regression-based or deep learning methods. The problem of fitting an AAM is usually formulated as a non-linear least squares problem, and the main way of solving it is a standard Gauss-Newton algorithm. In this paper, we extend AAMs in two ways: we first extend the Gauss-Newton framework by formulating a bidirectional fitting method that deforms both the image and the template to fit a new instance. We then formulate a second-order method by deriving an efficient Newton method for AAM fitting. We derive both methods in a unified framework for two types of AAMs, holistic and part-based, and additionally show how to exploit the structure in the problem to derive fast yet exact solutions. We perform a thorough evaluation of all algorithms on three challenging and recently annotated in-the-wild data sets, and investigate fitting accuracy, convergence properties, and the influence of noise in the initialization. We compare our proposed methods to other algorithms and show that they yield state-of-the-art results, outperforming other methods while having superior convergence properties.
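
Since the fitting is cast as non-linear least squares solved by Gauss-Newton, here is a minimal, generic Gauss-Newton sketch on a toy exponential-decay problem. It shows the normal-equation update that such fitting algorithms build on, not the AAM cost itself.

```python
import numpy as np

def gauss_newton(residual, jacobian, p0, iters=20):
    """Generic Gauss-Newton loop for a small non-linear least-squares problem."""
    p = p0.astype(float)
    for _ in range(iters):
        r = residual(p)
        J = jacobian(p)
        # Normal equations: (J^T J) dp = -J^T r
        dp = np.linalg.solve(J.T @ J, -J.T @ r)
        p = p + dp
    return p

# Toy model y = a * exp(-b * t) with noisy observations.
rng = np.random.default_rng(0)
t = np.linspace(0, 4, 50)
y = 2.0 * np.exp(-1.3 * t) + 0.01 * rng.standard_normal(t.size)

residual = lambda p: p[0] * np.exp(-p[1] * t) - y
jacobian = lambda p: np.stack([np.exp(-p[1] * t),
                               -p[0] * t * np.exp(-p[1] * t)], axis=1)

print(gauss_newton(residual, jacobian, np.array([1.0, 1.0])))  # approx. [2.0, 1.3]
```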

13.
IEEE Trans Image Process ; 26(12): 5603-5617, 2017 Dec.
Article in English | MEDLINE | ID: mdl-28783634

ABSTRACT

The analysis of high-dimensional, possibly temporally misaligned, and time-varying visual data is a fundamental task in disciplines such as image, vision, and behavior computing. In this paper, we focus on dynamic facial behavior analysis and in particular on the analysis of facial expressions. Distinct from previous approaches, where sets of facial landmarks are used for face representation, raw pixel intensities are exploited for: 1) unsupervised analysis of the temporal phases of facial expressions and facial action units (AUs) and 2) temporal alignment of a certain facial behavior displayed by two different persons. To this end, slow features nonnegative matrix factorization (SFNMF) is proposed in order to learn slowly varying parts-based representations of time-varying sequences, capturing the underlying dynamics of temporal phenomena such as facial expressions. Moreover, SFNMF is extended in order to handle two temporally misaligned data sequences depicting the same visual phenomena. To do so, dynamic time warping is incorporated into SFNMF, allowing the temporal alignment of the data sets onto the subspace spanned by the nonnegative latent features shared between the two visual sequences. Extensive experimental results on two video databases demonstrate the effectiveness of the proposed methods in: 1) unsupervised detection of the temporal phases of posed and spontaneous facial events and 2) temporal alignment of facial expressions, outperforming the state-of-the-art methods they are compared to by a large margin.
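
Dynamic time warping, which the paper folds into SFNMF for alignment, follows a classic dynamic-programming recursion; a minimal NumPy sketch on toy 1-D sequences is given below. It illustrates only the alignment step, not the nonnegative factorization.

```python
import numpy as np

def dtw(x, y):
    """Return the DTW cost and alignment path between two 1-D sequences."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack the optimal warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i = i - 1
        else:
            j = j - 1
    return D[n, m], path[::-1]

# The same expression displayed at two different speeds.
slow = np.sin(np.pi * np.linspace(0, 1, 40))   # 40-frame version
fast = np.sin(np.pi * np.linspace(0, 1, 25))   # 25-frame version
cost, path = dtw(slow, fast)
print(round(float(cost), 3), len(path))
```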

14.
IEEE Trans Image Process ; 26(10): 4697-4711, 2017 Oct.
Article in English | MEDLINE | ID: mdl-28678708

ABSTRACT

Most existing models for facial behavior analysis rely on generic classifiers, which fail to generalize well to previously unseen data. This is because of inherent differences between source (training) and target (test) data, mainly caused by variation in subjects' facial morphology, camera views, and so on. All of these account for different contexts in which target and source data are recorded, and thus may adversely affect the performance of models learned solely from source data. In this paper, we exploit the notion of domain adaptation and propose a data-efficient approach to adapt already learned classifiers to new unseen contexts. Specifically, we build upon the probabilistic framework of Gaussian processes (GPs), and introduce domain-specific GP experts (e.g., for each subject). The model adaptation is facilitated in a probabilistic fashion, by conditioning the target expert on the predictions from multiple source experts. We further exploit the predictive variance of each expert to define an optimal weighting during inference. We evaluate the proposed model on three publicly available data sets for multi-class (MultiPIE) and multi-label (DISFA, FERA2015) facial expression analysis by performing adaptation of two contextual factors: "where" (view) and "who" (subject). In our experiments, the proposed approach consistently outperforms: 1) both source and target classifiers, while using a small number of target examples during adaptation, and 2) related state-of-the-art approaches for supervised domain adaptation.


Subjects
Face/diagnostic imaging; Facial Expression; Normal Distribution; Pattern Recognition, Automated/methods; Algorithms; Databases, Factual; Humans; Image Processing, Computer-Assisted
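
The variance-based weighting of experts mentioned in the abstract above can be illustrated with a small scikit-learn sketch: several per-domain Gaussian-process experts are fused by weighting each predictive mean with its predictive precision. The toy data, kernels, and fusion rule are assumptions, not the paper's exact conditioning of the target expert on source experts.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)

def make_domain(shift):
    """Each 'subject' sees the same function under a domain-specific shift."""
    X = rng.uniform(-3, 3, size=(40, 1))
    y = np.sin(X[:, 0]) + shift + 0.1 * rng.standard_normal(40)
    return X, y

experts = []
for shift in [0.0, 0.3, -0.2]:            # three source "subjects"
    X, y = make_domain(shift)
    gp = GaussianProcessRegressor(kernel=RBF(1.0) + WhiteKernel(0.01))
    experts.append(gp.fit(X, y))

X_test = np.linspace(-3, 3, 7).reshape(-1, 1)
means, stds = zip(*(gp.predict(X_test, return_std=True) for gp in experts))
means, stds = np.stack(means), np.stack(stds)

# Weight each expert by its predictive precision (1 / variance) per test point.
w = 1.0 / (stds ** 2)
fused = (w * means).sum(axis=0) / w.sum(axis=0)
print(np.round(fused, 2))
```
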
15.
Article in English | MEDLINE | ID: mdl-29606917

ABSTRACT

The field of Automatic Facial Expression Analysis has grown rapidly in recent years. However, despite progress in new approaches as well as benchmarking efforts, most evaluations still focus on either posed expressions, near-frontal recordings, or both. This makes it hard to tell how existing expression recognition approaches perform under conditions where faces appear in a wide range of poses (or camera views) and display ecologically valid expressions. The main obstacle to assessing this is the limited availability of suitable data, and the challenge proposed here addresses this limitation. The FG 2017 Facial Expression Recognition and Analysis challenge (FERA 2017) extends FERA 2015 to the estimation of Action Unit occurrence and intensity under different camera views. In this paper we present the third challenge in automatic recognition of facial expressions, to be held in conjunction with the 12th IEEE conference on Face and Gesture Recognition, May 2017, in Washington, United States. Two sub-challenges are defined: the detection of AU occurrence, and the estimation of AU intensity. In this work we outline the evaluation protocol, the data used, and the results of a baseline method for both sub-challenges.

16.
IEEE Trans Syst Man Cybern B Cybern ; 36(2): 433-49, 2006 Apr.
Article in English | MEDLINE | ID: mdl-16602602

ABSTRACT

Automatic analysis of human facial expression is a challenging problem with many applications. Most of the existing automated systems for facial expression analysis attempt to recognize a few prototypic emotional expressions, such as anger and happiness. Instead of representing another approach to machine analysis of prototypic facial expressions of emotion, the method presented in this paper attempts to handle a large range of human facial behavior by recognizing facial muscle actions that produce expressions. Virtually all of the existing vision systems for facial muscle action detection deal only with frontal-view face images and cannot handle the temporal dynamics of facial actions. In this paper, we present a system for automatic recognition of facial action units (AUs) and their temporal models from long, profile-view face image sequences. We exploit particle filtering to track 15 facial points in an input face-profile sequence, and we introduce facial-action-dynamics recognition from continuous video input using temporal rules. The algorithm performs both automatic segmentation of an input video into the facial expressions pictured and recognition of temporal segments (i.e., onset, apex, offset) of 27 AUs occurring alone or in combination in the input face-profile video. A recognition rate of 87% is achieved.


Subjects
Artificial Intelligence; Face/anatomy & histology; Face/physiology; Facial Expression; Image Interpretation, Computer-Assisted/methods; Movement/physiology; Pattern Recognition, Automated/methods; Algorithms; Cluster Analysis; Humans; Image Enhancement/methods; Information Storage and Retrieval/methods; Photography/methods; Reproducibility of Results; Sensitivity and Specificity; Subtraction Technique; Time Factors; Video Recording/methods
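
Particle filtering, used above to track the 15 facial points, can be sketched minimally as follows: a random-walk motion model, a Gaussian observation likelihood, and resampling. The 1-D state and noise levels are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def particle_filter(observations, n_particles=500, motion_std=2.0, obs_std=3.0):
    """Track a single scalar coordinate from noisy measurements."""
    particles = np.full(n_particles, observations[0], dtype=float)
    estimates = []
    for z in observations:
        # Predict: random-walk motion model.
        particles += motion_std * rng.standard_normal(n_particles)
        # Weight: Gaussian likelihood of the measurement given each particle.
        w = np.exp(-0.5 * ((z - particles) / obs_std) ** 2)
        w /= w.sum()
        estimates.append(float(np.sum(w * particles)))
        # Resample proportionally to the weights.
        particles = particles[rng.choice(n_particles, size=n_particles, p=w)]
    return np.array(estimates)

# A facial point drifting over 100 frames, observed with measurement noise.
true_track = np.cumsum(rng.standard_normal(100)) + 50.0
measured = true_track + 3.0 * rng.standard_normal(100)
print(np.round(particle_filter(measured)[:5], 1))
```
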
17.
IEEE Trans Syst Man Cybern B Cybern ; 36(3): 710-9, 2006 Jun.
Article in English | MEDLINE | ID: mdl-16761823

ABSTRACT

This paper addresses the problem of human-action recognition by introducing a sparse representation of image sequences as a collection of spatiotemporal events that are localized at points that are salient both in space and time. The spatiotemporal salient points are detected by measuring the variations in the information content of pixel neighborhoods not only in space but also in time. An appropriate distance metric between two collections of spatiotemporal salient points is introduced, which is based on the chamfer distance and an iterative linear time-warping technique that deals with time-expansion or time-compression issues. A classification scheme based on relevance vector machines and on the proposed distance measure is then introduced. Results on real image sequences from a small database depicting people performing 19 aerobic exercises are presented.


Subjects
Artificial Intelligence; Image Interpretation, Computer-Assisted/methods; Models, Biological; Movement; Pattern Recognition, Automated/methods; Task Performance and Analysis; Video Recording/methods; Algorithms; Computer Simulation; Humans; Subtraction Technique; Time Factors
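
The chamfer distance between two collections of spatiotemporal salient points, the basis of the distance metric above, can be written in a few lines of NumPy; the (x, y, t) point sets below are random toy data, and the symmetric form shown here is one common variant rather than the paper's exact metric (which additionally applies time warping).

```python
import numpy as np

def chamfer_distance(A, B):
    """Mean nearest-neighbour distance from A to B plus from B to A."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)  # pairwise distances
    return d.min(axis=1).mean() + d.min(axis=0).mean()

rng = np.random.default_rng(0)
A = rng.uniform(size=(30, 3))                   # 30 salient points of one sequence
B = A + 0.05 * rng.standard_normal((30, 3))     # a jittered second sequence
C = rng.uniform(size=(30, 3))                   # an unrelated sequence
print(round(float(chamfer_distance(A, B)), 3), round(float(chamfer_distance(A, C)), 3))
```
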
18.
IEEE Trans Image Process ; 25(5): 2021-34, 2016 May.
Article in English | MEDLINE | ID: mdl-27008268

ABSTRACT

Face images convey rich information which can be perceived as a superposition of low-complexity components associated with attributes such as facial identity, expressions, and activation of facial action units (AUs). For instance, low-rank components characterizing neutral facial images are associated with identity, while sparse components capturing non-rigid deformations occurring in certain face regions reveal expressions and AU activations. In this paper, discriminant incoherent component analysis (DICA) is proposed in order to extract low-complexity components, corresponding to facial attributes, which are mutually incoherent among different classes (e.g., identity, expression, and AU activation) from training data, even in the presence of gross sparse errors. To this end, a suitable optimization problem, involving the minimization of the nuclear and ℓ1 norms, is solved. Having found an ensemble of class-specific incoherent components by DICA, an unseen (test) image is expressed as a group-sparse linear combination of these components, where the non-zero coefficients reveal the class(es) of the respective facial attribute(s) that it belongs to. The performance of DICA is experimentally assessed on both synthetic and real-world data. Emphasis is placed on face analysis tasks, namely joint face and expression recognition, face recognition under varying percentages of training data corruption, subject-independent expression recognition, and AU detection, by conducting experiments on four data sets. The proposed method outperforms all the methods it is compared with, in all tasks and experimental settings.
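
The classification-by-group-sparse-coding step described above can be approximated with a plain ℓ1 (lasso) solver: code the test sample over the concatenated class dictionaries with ISTA and assign the class whose coefficients carry the most energy. This sketch uses random toy dictionaries and ignores the group structure, so it only gestures at the paper's procedure.

```python
import numpy as np

def ista(D, y, lam=0.05, iters=300):
    """Solve min_x 0.5*||D x - y||^2 + lam*||x||_1 by proximal gradient (ISTA)."""
    L = np.linalg.norm(D, 2) ** 2           # Lipschitz constant of the gradient
    x = np.zeros(D.shape[1])
    for _ in range(iters):
        g = D.T @ (D @ x - y)
        z = x - g / L
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return x

rng = np.random.default_rng(0)
dim, per_class, n_classes = 60, 8, 3
# One small dictionary of components per class, columns normalised.
dicts = [rng.standard_normal((dim, per_class)) for _ in range(n_classes)]
dicts = [Dc / np.linalg.norm(Dc, axis=0) for Dc in dicts]
D = np.hstack(dicts)

# A test sample generated from class 1's components plus noise.
y = dicts[1] @ rng.standard_normal(per_class) + 0.05 * rng.standard_normal(dim)

x = ista(D, y)
energy = [np.linalg.norm(x[c * per_class:(c + 1) * per_class]) for c in range(n_classes)]
print("predicted class:", int(np.argmax(energy)))   # expected: 1
```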

19.
IEEE Trans Cybern ; 46(12): 2758-2771, 2016 Dec.
Article in English | MEDLINE | ID: mdl-26513822

ABSTRACT

Accent is a soft biometric trait that can be inferred from the pronunciation and articulation patterns characterizing the speaking style of an individual. Past research has addressed the task of classifying accent as belonging to a native-language speaker or a foreign-language speaker by means of the audio modality only. However, features extracted from the visual stream of speech have been successfully used to extend or substitute audio-only approaches that target speech or language recognition. Motivated by these findings, we investigate to what extent the temporal visual speech dynamics attributed to accent can be modeled and identified when the audio stream is missing or noisy and the speech content is unknown. We present here a fully automated approach to discriminating native from non-native English speech, based exclusively on visual cues. A systematic evaluation of various appearance and shape features for the target problem is conducted, with the former consistently yielding superior performance. Subject-independent cross-validation experiments are conducted on mobile phone recordings of continuous speech and isolated word utterances spoken by 56 subjects from the challenging MOBIO database. High performance is achieved on a text-dependent (TD) protocol, with the best score of 76.5% yielded by fusion of five hidden Markov models trained on appearance features. Our framework also remains effective when tested on examples of speech unseen in the training phase, although it performs less accurately than in the TD case.

20.
IEEE Trans Pattern Anal Mach Intell ; 38(9): 1748-61, 2016 09.
Article in English | MEDLINE | ID: mdl-26595911

ABSTRACT

Certain inner feelings and physiological states like pain are subjective states that cannot be directly measured, but can be estimated from spontaneous facial expressions. Since they are typically characterized by subtle movements of facial parts, analysis of the facial details is required. To this end, we formulate a new regression method for continuous intensity estimation of facial behavior, called the Doubly Sparse Relevance Vector Machine (DSRVM). DSRVM enforces double sparsity by jointly selecting the most relevant training examples (a.k.a. relevance vectors) and the most important kernels associated with facial parts relevant for the interpretation of observed facial expressions. This advances prior work on multi-kernel learning, where sparsity over the relevant kernels is typically ignored. Empirical evaluation on the challenging Shoulder Pain videos and the benchmark DISFA and SEMAINE datasets demonstrates that DSRVM outperforms competing approaches while reducing training and testing running times several-fold.


Subjects
Algorithms; Facial Expression; Face; Humans; Regression Analysis
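
The multi-kernel ingredient of DSRVM (one kernel per facial part) can be illustrated with a small NumPy sketch that combines per-part RBF kernels with fixed weights and solves a kernel ridge regression on the combined kernel. The paper's model is instead a sparse Bayesian (RVM-style) learner that also selects relevance vectors and kernel weights; everything below is a toy assumption.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """Gaussian (RBF) kernel matrix between row-wise feature sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
n = 120
# Toy per-part features (e.g., eyes, brows, mouth regions) and an intensity label.
parts = {name: rng.standard_normal((n, 5)) for name in ["eyes", "brows", "mouth"]}
y = parts["mouth"][:, 0] + 0.5 * parts["eyes"][:, 1] + 0.1 * rng.standard_normal(n)

weights = {"eyes": 0.3, "brows": 0.2, "mouth": 0.5}      # fixed kernel weights
K = sum(w * rbf_kernel(parts[p], parts[p]) for p, w in weights.items())

# Kernel ridge regression on the combined kernel.
alpha = np.linalg.solve(K + 1e-2 * np.eye(n), y)
y_hat = K @ alpha
print("train RMSE:", round(float(np.sqrt(np.mean((y - y_hat) ** 2))), 3))
```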