ABSTRACT
3D morphable model (3DMM) fitting on 2D data is traditionally done via unconstrained optimization with regularization terms to ensure that the result is a plausible face shape and is consistent with a set of 2D landmarks. This paper presents inequality-constrained 3DMM fitting as the first alternative to regularization in optimization-based 3DMM fitting. Inequality constraints on the 3DMM's shape coefficients ensure face-like shapes without adding smoothness terms to the objective function, thus allowing more flexibility to capture person-specific shape details. Moreover, inequality constraints on landmarks increase robustness in a way that does not require per-image tuning. We show that the proposed method stands out with its ability to estimate person-specific face shapes by jointly fitting a 3DMM to multiple frames of a person. Further, when used with a robust objective function, namely gradient correlation, the method can work "in-the-wild" even with a 3DMM constructed from controlled data. Lastly, we show how to use the log-barrier method to implement the method efficiently. To our knowledge, we present the first 3DMM fitting framework that requires no learning yet is accurate, robust, and efficient. The absence of learning enables a generic solution that allows flexibility in the input image size, interchangeable morphable models, and incorporation of the camera matrix.
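To make the log-barrier idea concrete, the following is a minimal numerical sketch of inequality-constrained fitting on a toy linear problem: box constraints of the form |alpha_i| <= 3*sigma_i are enforced by a logarithmic barrier that is tightened over outer iterations. The matrix A, the bounds, and all parameter values are illustrative stand-ins rather than the paper's actual model or settings.

    # Minimal log-barrier sketch for box-constrained coefficient fitting (toy data).
    import numpy as np

    rng = np.random.default_rng(0)
    n, m = 40, 8                      # observations, coefficients
    A = rng.normal(size=(n, m))       # stand-in for a linearized 3DMM-to-landmark map
    sigma = np.linspace(1.0, 0.2, m)  # stand-in for per-coefficient standard deviations
    bound = 3.0 * sigma               # box constraint: -bound <= alpha <= bound
    alpha_true = rng.uniform(-bound, bound)
    y = A @ alpha_true + 0.05 * rng.normal(size=n)

    def barrier_objective(alpha, t):
        """Data term plus log-barrier for -bound <= alpha <= bound."""
        slack_hi, slack_lo = bound - alpha, alpha + bound
        if np.any(slack_hi <= 0) or np.any(slack_lo <= 0):
            return np.inf                      # outside the feasible box
        data = 0.5 * np.sum((A @ alpha - y) ** 2)
        return data - (np.sum(np.log(slack_hi)) + np.sum(np.log(slack_lo))) / t

    def barrier_gradient(alpha, t):
        grad_data = A.T @ (A @ alpha - y)
        grad_barrier = (1.0 / (bound - alpha) - 1.0 / (alpha + bound)) / t
        return grad_data + grad_barrier

    alpha = np.zeros(m)                        # strictly feasible starting point
    for t in [1.0, 10.0, 100.0, 1000.0]:       # outer loop: tighten the barrier
        for _ in range(200):                   # inner loop: gradient descent with backtracking
            g = barrier_gradient(alpha, t)
            step = 1.0
            while barrier_objective(alpha - step * g, t) > barrier_objective(alpha, t) - 1e-4 * step * (g @ g):
                step *= 0.5
                if step < 1e-12:
                    break
            alpha = alpha - step * g

    print("max |alpha| / bound:", np.max(np.abs(alpha) / bound))  # stays inside the box
    print("recovery error:", np.linalg.norm(alpha - alpha_true))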
ABSTRACT
The standard benchmark metric for 3D face reconstruction is the geometric error between reconstructed meshes and the ground truth. Nearly all recent reconstruction methods are validated on real ground truth scans, in which case one needs to establish point correspondence prior to error computation, which is typically done with the Chamfer (i.e., nearest neighbor) criterion. However, a simple yet fundamental question has not been asked: Is the Chamfer error an appropriate and fair benchmark metric for 3D face reconstruction? More generally, how can we determine which error estimator is a better benchmark metric? We present a meta-evaluation framework that uses synthetic data to evaluate the quality of a geometric error estimator as a benchmark metric for face reconstruction. Further, we use this framework to experimentally compare four geometric error estimators. Results show that the standard approach not only severely underestimates the error, but also does so inconsistently across reconstruction methods, to the point of even altering the ranking of the compared methods. Moreover, although non-rigid ICP leads to a metric with smaller estimation bias, it still could not correctly rank all compared reconstruction methods, and is significantly more time-consuming than Chamfer. In sum, we show several issues present in current benchmarking and propose a procedure that uses synthetic data to address these issues.
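As an illustration of the criterion in question, here is a minimal sketch of the Chamfer (nearest-neighbor) error between a reconstruction and a ground-truth scan, assuming both are given as Nx3 vertex arrays that have already been rigidly aligned; the data below are synthetic.

    # Chamfer (nearest-neighbor) geometric error on toy data.
    import numpy as np
    from scipy.spatial import cKDTree

    def chamfer_error(reconstruction: np.ndarray, ground_truth: np.ndarray) -> float:
        """Mean distance from each reconstructed vertex to its nearest ground-truth vertex."""
        tree = cKDTree(ground_truth)
        distances, _ = tree.query(reconstruction)   # nearest-neighbor correspondence
        return float(distances.mean())

    rng = np.random.default_rng(0)
    gt = rng.normal(size=(5000, 3))
    rec = gt + 0.01 * rng.normal(size=gt.shape)     # reconstruction = noisy copy of ground truth
    print(f"Chamfer error: {chamfer_error(rec, gt):.4f}")

Because each reconstructed vertex is matched to whichever ground-truth point happens to be closest, the reported distance can only be less than or equal to the error under true correspondence, which is the source of the underestimation discussed above.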
ABSTRACT
Advances in computational behavior analysis via artificial intelligence (AI) promise to improve mental healthcare services by providing clinicians with tools to assist diagnosis or measurement of treatment outcomes. This potential has spurred an increasing number of studies in which automated pipelines predict diagnoses of mental health conditions. However, a fundamental question remains unanswered: How do the predictions of the AI algorithms correspond and compare with the predictions of humans? This is a critical question if AI technology is to be used as an assistive tool, because the utility of an AI algorithm would be negligible if it provides little information beyond what clinicians can readily infer. In this paper, we compare the performance of 19 human raters (8 autism experts and 11 non-experts) and that of an AI algorithm in terms of predicting autism diagnosis from short (3-minute) videos of N = 42 participants in a naturalistic conversation. Results show that the AI algorithm achieves an average accuracy of 80.5%, which is comparable to that of clinicians with expertise in autism (83.1%) and clinical research staff without specialized expertise (78.3%). Critically, diagnoses that were inaccurately predicted by most humans (experts and non-experts, alike) were typically correctly predicted by AI. Our results highlight the potential of AI as an assistive tool that can augment clinician diagnostic decision-making.
ABSTRACT
Autism spectrum disorder (ASD) is a neurodevelopmental condition characterized in part by difficulties in verbal and nonverbal social communication. Evidence indicates that autistic people, compared to neurotypical peers, exhibit differences in head movements, a key form of nonverbal communication. Despite the crucial role of head movements in social communication, research on this nonverbal cue is relatively scarce compared to other forms of nonverbal communication, such as facial expressions and gestures. There is a need for scalable, reliable, and accurate instruments for measuring head movements directly within the context of social interactions. In this study, we used computer vision and machine learning to examine the head movement patterns of neurotypical and autistic individuals during naturalistic, face-to-face conversations, at both the individual (monadic) and interpersonal (dyadic) levels. Our model predicts diagnostic status using dyadic head movement data with an accuracy of 80%, highlighting the value of head movement as a marker of social communication. The monadic data pipeline had lower accuracy (69.2%) compared to the dyadic approach, emphasizing the importance of studying back-and-forth social communication cues within a true social context. The proposed classifier is not intended for diagnostic purposes, and future research should replicate our findings in larger, more representative samples.
ABSTRACT
Advances in computational behavior analysis have the potential to increase our understanding of behavioral patterns and developmental trajectories in neurotypical individuals, as well as in individuals with mental health conditions marked by motor, social, and emotional difficulties. This study investigates how head movement patterns during face-to-face conversations vary with age from childhood through adulthood. We rely on computer vision techniques due to their suitability for analysis of social behaviors in naturalistic settings, since video data capture can be unobtrusively embedded within conversations between two social partners. The methods in this work include unsupervised learning for movement pattern clustering, and supervised classification and regression as a function of age. The results demonstrate that 3-minute video recordings of head movements during conversations show patterns that distinguish participants younger than 12 years from those older than 12 years with 78% accuracy. Additionally, we extract the head movement patterns on which our models based this age distinction.
ABSTRACT
Motor imitation is a critical developmental skill area that has been strongly and specifically linked to autism spectrum disorder (ASD). However, methodological variability across studies has precluded a clear understanding of the extent and impact of imitation differences in ASD, underscoring a need for more automated, granular measurement approaches that offer greater precision and consistency. In this paper, we investigate the utility of a novel motor imitation measurement approach for accurately differentiating between youth with ASD and typically developing (TD) youth. Findings indicate that youth with ASD imitate body movements significantly differently from TD youth upon repeated administration of a brief, simple task, and that a classifier based on body coordination features derived from this task can differentiate between autistic and TD youth with 82% accuracy. Our method illustrates that group differences are driven not only by interpersonal coordination with the imitated video stimulus, but also by intrapersonal coordination. Comparison of 2D and 3D tracking shows that both approaches achieve the same classification accuracy of 82%, which is highly promising with regard to scalability for larger samples and a range of non-laboratory settings. This work adds to a rapidly growing literature highlighting the promise of computational behavior analysis for detecting and characterizing motor differences in ASD and identifying potential motor biomarkers.
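The abstract does not specify the exact coordination features, so the following is only an illustrative sketch of one plausible ingredient: a windowed, lag-tolerant correlation between a tracked body trajectory and the imitated stimulus (interpersonal coordination), which could equally be applied between two of the participant's own joints (intrapersonal coordination). The window length, lag range, and toy signals are hypothetical.

    # Illustrative windowed cross-correlation coordination score (not the paper's exact features).
    import numpy as np

    def windowed_coordination(a: np.ndarray, b: np.ndarray, win: int = 60, max_lag: int = 15) -> float:
        """Mean of the maximum absolute Pearson correlation over sliding windows and lags."""
        scores = []
        for start in range(0, len(a) - win - max_lag, win // 2):
            seg_a = a[start:start + win]
            best = 0.0
            for lag in range(-max_lag, max_lag + 1):
                seg_b = b[start + max_lag + lag: start + max_lag + lag + win]
                r = np.corrcoef(seg_a, seg_b)[0, 1]
                best = max(best, abs(r))
            scores.append(best)
        return float(np.mean(scores))

    rng = np.random.default_rng(0)
    stimulus = np.sin(np.linspace(0, 20, 600))                        # imitated movement
    participant = np.roll(stimulus, 5) + 0.3 * rng.normal(size=600)   # delayed, noisy imitation
    print(f"coordination score: {windowed_coordination(participant, stimulus):.2f}")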
ABSTRACT
Fitting 3D morphable models (3DMMs) on faces is a well-studied problem, motivated by various industrial and research applications. 3DMMs express a 3D facial shape as a linear sum of basis functions. The resulting shape, however, is a plausible face only when the basis coefficients take values within limited intervals. Methods based on unconstrained optimization address this issue with a weighted ℓ2 penalty on coefficients; however, determining the weight of this penalty is difficult, and the existence of a single weight that works universally is questionable. We propose a new formulation that does not require the tuning of any weight parameter. Specifically, we formulate 3DMM fitting as an inequality-constrained optimization problem, where the primary constraint is that basis coefficients should not exceed the intervals learned when the 3DMM is constructed. We employ additional constraints to exploit sparse landmark detectors, by forcing the facial shape to be within the error bounds of a reliable detector. To enable operation "in-the-wild", we use a robust objective function, namely Gradient Correlation. Our approach performs comparably with deep learning (DL) methods on "in-the-wild" data that have inexact ground truth, and better than DL methods on more controlled data with exact ground truth. Since our formulation does not require any learning, it enjoys a versatility that allows it to operate with multiple frames of arbitrary sizes. This study's results encourage further research on 3DMM fitting with inequality-constrained optimization methods, which have remained unexplored compared to unconstrained methods.
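Schematically, the contrast between the weighted-penalty formulation and the proposed inequality-constrained one can be written as below; the symbols (mean shape S-bar, basis B, coefficients alpha with standard deviations sigma_i, projection Pi, detected landmarks l_j with error bounds delta_j, and robust image term rho) are notation chosen here for illustration and are not the paper's exact equations.

    % Regularized (unconstrained) fitting: a weighted penalty keeps the coefficients
    % plausible, but its weight \lambda must be tuned.
    \min_{\alpha} \; E_{\mathrm{data}}(\alpha) \;+\; \lambda \sum_i \frac{\alpha_i^2}{\sigma_i^2}

    % Inequality-constrained fitting: plausibility and landmark agreement become
    % constraints, so no penalty weight is needed.
    \min_{\alpha} \; \rho\big(I,\, \Pi(\bar{S} + B\alpha)\big)
    \quad \text{s.t.} \quad |\alpha_i| \le \tau\,\sigma_i \ \ \forall i,
    \qquad \big\| \Pi_j(\bar{S} + B\alpha) - \ell_j \big\|_2 \le \delta_j \ \ \forall j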
ABSTRACT
Separating facial pose and expression within images requires a camera model for 3D-to-2D mapping. The weak perspective (WP) camera has been the most popular choice; it is the default, if not the only option, in state-of-the-art facial analysis methods and software. The WP camera is justified by the supposition that its errors are negligible when the subjects are relatively far from the camera, yet this claim has never been tested despite nearly 20 years of research. This paper critically examines the suitability of the WP camera for separating facial pose and expression. First, we theoretically show that WP causes pose-expression ambiguity, as it leads to the estimation of spurious expressions. Next, we experimentally quantify the magnitude of spurious expressions. Finally, we test whether spurious expressions have detrimental effects on a common facial analysis application, namely Action Unit (AU) detection. Contrary to conventional wisdom, we find that severe pose-expression ambiguity exists even when subjects are not close to the camera, leading to large false positive rates in AU detection. We also demonstrate that the magnitude and characteristics of spurious expressions depend on the point distribution model used to model expressions. Our results suggest that common assumptions about WP need to be revisited in facial expression modeling, and that facial analysis software should encourage and facilitate the use of the true camera model whenever possible.
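The following toy sketch contrasts the two camera models on the same 3D points: full perspective divides each point by its own depth, whereas weak perspective applies a single scale, and the gap between the two projections is the image-plane error that a WP-based fitter must explain, for example as spurious expression. The focal length, face size, and camera distances are illustrative values.

    # Weak-perspective vs. full perspective projection of the same 3D points.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.uniform(-0.1, 0.1, size=(68, 3))      # face landmarks in metres, centred at origin
    focal = 800.0                                  # focal length in pixels (illustrative)

    def perspective(X, tz):
        Z = X[:, 2] + tz
        return focal * X[:, :2] / Z[:, None]       # per-point division by depth

    def weak_perspective(X, tz):
        return (focal / tz) * X[:, :2]             # single scale for all points

    for tz in [0.4, 1.0, 3.0]:                     # camera-to-face distance in metres
        err = np.linalg.norm(perspective(X, tz) - weak_perspective(X, tz), axis=1)
        print(f"distance {tz:.1f} m -> mean WP error {err.mean():.2f} px, max {err.max():.2f} px")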
ABSTRACT
Finding the largest subset of sequences (i.e., time series) that are correlated above a certain threshold, within large datasets, is of significant interest for computer vision and pattern recognition problems across domains, including behavior analysis, computational biology, neuroscience, and finance. Maximal clique algorithms can be used to solve this problem, but they are not scalable. We present an approximate, but highly efficient and scalable, method that represents the search space as a union of sets called ϵ-expanded clusters, one of which is theoretically guaranteed to contain the largest subset of synchronized sequences. The method finds synchronized sets by fitting a Euclidean ball on ϵ-expanded clusters, using Jung's theorem. We validate the method on data from the three distinct domains of facial behavior analysis, finance, and neuroscience, where we respectively discover the synchrony among pixels of face videos, stock market item prices, and dynamic brain connectivity data. Experiments show that our method produces results comparable to, but up to 300 times faster than, maximal clique algorithms, with speed gains increasing exponentially with the number of input sequences.
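The method relies on viewing correlated sequences geometrically; one standard link between the two views, which the snippet below verifies on toy data, is that for z-normalized sequences of length n, corr(x, y) = 1 - ||x̂ - ŷ||² / (2n), so a correlation threshold becomes a Euclidean distance bound. The snippet is not the ϵ-expanded-cluster algorithm itself.

    # Identity linking Pearson correlation to Euclidean distance of z-normalized sequences.
    import numpy as np

    def z_normalize(x):
        return (x - x.mean()) / x.std()

    rng = np.random.default_rng(0)
    n = 200
    x = rng.normal(size=n)
    y = 0.8 * x + 0.6 * rng.normal(size=n)

    xz, yz = z_normalize(x), z_normalize(y)
    corr = np.corrcoef(x, y)[0, 1]
    corr_from_distance = 1.0 - np.sum((xz - yz) ** 2) / (2 * n)
    print(f"Pearson correlation:      {corr:.4f}")
    print(f"From Euclidean distance:  {corr_from_distance:.4f}")   # identical up to rounding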
ABSTRACT
Communication with humans is a multi-faceted phenomenon in which emotions, personality, and non-verbal behaviours, as well as verbal behaviours, play a significant role, and human-robot interaction (HRI) technologies should respect this complexity to achieve efficient and seamless communication. In this paper, we describe the design and execution of five public demonstrations made with two HRI systems that aimed to automatically sense and analyse human participants' non-verbal behaviour and predict their facial action units, facial expressions, and personality in real time while they interacted with a small humanoid robot. We give an overview of the challenges faced, together with the lessons learned from those demonstrations, in order to better inform the science and engineering fields in designing and building better robots with more purposeful interaction capabilities. This article is part of the theme issue 'From social brains to social robots: applying neurocognitive insights to human-robot interaction'.
Subjects
Emotions, Personality, Robotics, Facial Expression, Humans, Speech
ABSTRACT
Autism spectrum disorder (ASD) is a neurodevelopmental condition characterized by impaired social communication and the presence of restricted, repetitive patterns of behaviors and interests. Prior research suggests that restricted patterns of behavior in ASD may be cross-domain phenomena that are evident in a variety of modalities. Computational studies of language in ASD provide support for the existence of an underlying dimension of restriction that emerges during a conversation. Similar evidence exists for restricted patterns of facial movement. Using tools from computational linguistics, computer vision, and information theory, this study tests whether cognitive-motor restriction can be detected across multiple behavioral domains in adults with ASD during a naturalistic conversation. Our methods identify restricted behavioral patterns, as measured by entropy in word use and mouth movement. Results suggest that adults with ASD produce significantly less diverse mouth movements and words than neurotypical adults, with an increased reliance on repeated patterns in both domains. The diversity values of the two domains are not significantly correlated, suggesting that they provide complementary information.
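As a concrete illustration of the diversity measure, the sketch below computes Shannon entropy over a sequence of discrete symbols, which could stand for word tokens or cluster labels assigned to mouth-movement frames; the toy sequences and the choice of log base are illustrative.

    # Shannon entropy as a diversity measure over discrete symbols (toy sequences).
    import numpy as np
    from collections import Counter

    def shannon_entropy(symbols) -> float:
        counts = np.array(list(Counter(symbols).values()), dtype=float)
        p = counts / counts.sum()
        return float(-(p * np.log2(p)).sum())     # bits; higher = more diverse

    varied   = "the quick brown fox jumps over the lazy dog".split()
    repeated = "yes yes yes no yes yes yes yes no".split()
    print(f"varied   : {shannon_entropy(varied):.2f} bits")
    print(f"repeated : {shannon_entropy(repeated):.2f} bits")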
ABSTRACT
The extraction of descriptive features from sequences of faces is a fundamental problem in facial expression analysis. Facial expressions are represented by psychologists as a combination of elementary movements known as action units: each movement is localised and its intensity is specified with a score that is small when the movement is subtle and large when the movement is pronounced. Inspired by this approach, we propose a novel data-driven feature extraction framework that represents facial expression variations as a linear combination of localised basis functions, whose coefficients are proportional to movement intensity. We show that the linear basis functions required by this framework can be obtained by training a sparse linear model with Gabor phase shifts computed from facial videos. The proposed framework addresses generalisation issues that are not addressed by existing learnt representations, and achieves, with the same learning parameters, state-of-the-art results in recognising both posed expressions and spontaneous micro-expressions. This performance is confirmed even when the data used to train the model differ from test data in terms of the intensity of facial movements and frame rate.
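A rough stand-in for the training step can be sketched with off-the-shelf sparse dictionary learning: signals are approximated as sparse linear combinations of learned atoms, with coefficients acting as movement intensities. The real framework trains on Gabor phase shifts computed from facial videos; the synthetic data, dimensions, and sparsity settings below are placeholders.

    # Sparse linear model as a stand-in for learning localized basis functions.
    import numpy as np
    from sklearn.decomposition import DictionaryLearning

    rng = np.random.default_rng(0)
    n_samples, n_features, n_atoms = 300, 64, 16

    # Synthetic "movements": each sample activates a few localized bumps.
    true_atoms = np.zeros((n_atoms, n_features))
    for k in range(n_atoms):
        centre = rng.integers(4, n_features - 4)
        true_atoms[k, centre - 3:centre + 3] = 1.0
    codes = rng.exponential(1.0, size=(n_samples, n_atoms)) * (rng.random((n_samples, n_atoms)) < 0.2)
    X = codes @ true_atoms + 0.05 * rng.normal(size=(n_samples, n_features))

    model = DictionaryLearning(n_components=n_atoms, alpha=1.0, max_iter=100, random_state=0)
    coeffs = model.fit_transform(X)              # sparse coefficients ~ movement intensities
    print("mean nonzero coefficients per sample:", (np.abs(coeffs) > 1e-6).sum(1).mean())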
ABSTRACT
The growing use of cameras embedded in autonomous robotic platforms and worn by people is increasing the importance of accurate global motion estimation (GME). However, existing GME methods may degrade considerably under illumination variations. In this paper, we address this problem by proposing a biologically-inspired GME method that achieves high estimation accuracy in the presence of illumination variations. We mimic the early layers of the human visual cortex with the spatio-temporal Gabor motion energy by adopting the pioneering model of Adelson and Bergen and we provide the closed-form expressions that enable the study and adaptation of this model to different application needs. Moreover, we propose a normalisation scheme for motion energy to tackle temporal illumination variations. Finally, we provide an overall GME scheme which, to the best of our knowledge, achieves the highest accuracy on the Pose, Illumination, and Expression (PIE) database.
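The sketch below illustrates the core of the Adelson-Bergen computation on a synthetic space-time slice: a quadrature pair of oriented (x, t) Gabor filters is applied and their squared responses are summed, yielding high energy for the preferred motion direction and low energy for the opposite one. The filter parameters and drifting test pattern are illustrative, and the closed-form expressions and normalisation scheme from the paper are not reproduced here.

    # Spatio-temporal Gabor motion energy from a quadrature filter pair (toy stimulus).
    import numpy as np
    from scipy.signal import convolve2d

    def gabor_xt(size, spatial_freq, temporal_freq, sigma, phase):
        x = np.arange(size) - size // 2
        t = np.arange(size) - size // 2
        X, T = np.meshgrid(x, t, indexing="ij")
        envelope = np.exp(-(X**2 + T**2) / (2 * sigma**2))
        carrier = np.cos(2 * np.pi * (spatial_freq * X + temporal_freq * T) + phase)
        return envelope * carrier

    # Drifting sinusoid as a space-time image (rows: x, columns: t), moving in +x.
    x = np.arange(64)[:, None]
    t = np.arange(64)[None, :]
    stimulus = np.cos(2 * np.pi * (0.1 * x - 0.1 * t))

    even = gabor_xt(15, 0.1, -0.1, sigma=3.0, phase=0.0)        # tuned to +x motion
    odd  = gabor_xt(15, 0.1, -0.1, sigma=3.0, phase=np.pi / 2)
    energy_pref = convolve2d(stimulus, even, mode="valid")**2 + convolve2d(stimulus, odd, mode="valid")**2

    even_o = gabor_xt(15, 0.1, 0.1, sigma=3.0, phase=0.0)       # tuned to the opposite direction
    odd_o  = gabor_xt(15, 0.1, 0.1, sigma=3.0, phase=np.pi / 2)
    energy_anti = convolve2d(stimulus, even_o, mode="valid")**2 + convolve2d(stimulus, odd_o, mode="valid")**2

    print(f"energy (preferred direction): {energy_pref.mean():.1f}")
    print(f"energy (opposite direction):  {energy_anti.mean():.1f}")   # much smaller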
ABSTRACT
Accurate face registration is a key step for several image analysis applications. However, existing registration methods are prone to temporal drift errors or jitter among consecutive frames. In this paper, we propose an iterative rigid registration framework that estimates the misalignment with trained regressors. The input of the regressors is a robust motion representation that encodes the motion between a misaligned frame and the reference frame(s), and enables reliable performance under non-uniform illumination variations. Drift errors are reduced when the motion representation is computed from multiple reference frames. Furthermore, we use the L2 norm of the representation as a cue for performing coarse-to-fine registration efficiently. Importantly, the framework can identify registration failures and correct them. Experiments show that the proposed approach achieves significantly higher registration accuracy than the state-of-the-art techniques in challenging sequences.
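The iterative loop can be sketched with stand-ins: here the "motion representation" is a plain difference image and the regressor is a ridge model trained on synthetic translations of a single smooth reference, neither of which is the robust representation or the trained regressors described above. The sketch also prints the L2 norm of the representation, the quantity used as a coarse-to-fine cue.

    # Toy regression-based iterative registration with a difference-image representation.
    import numpy as np
    from scipy.ndimage import gaussian_filter, shift as translate
    from sklearn.linear_model import Ridge

    rng = np.random.default_rng(0)
    reference = gaussian_filter(rng.random((48, 48)), sigma=3.0)   # smooth reference frame

    # Train: random translations -> difference-image features.
    offsets = rng.uniform(-3, 3, size=(500, 2))
    features = np.stack([(translate(reference, o, order=1) - reference).ravel() for o in offsets])
    regressor = Ridge(alpha=1.0).fit(features, offsets)

    # Iteratively register a misaligned frame.
    true_offset = np.array([2.3, -1.7])
    frame = translate(reference, true_offset, order=1)
    estimate = np.zeros(2)
    for _ in range(5):
        current = translate(frame, -estimate, order=1)
        residual = (current - reference).ravel()
        print(f"L2 norm of representation: {np.linalg.norm(residual):.2f}")  # coarse-to-fine cue
        estimate += regressor.predict(residual[None])[0]
    print("estimated offset:", np.round(estimate, 2), "true:", true_offset)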
Subjects
Algorithms, Face/anatomy & histology, Image Processing, Computer-Assisted/methods, Databases, Factual, Facial Expression, Female, Humans, Male
ABSTRACT
Automatic affect analysis has attracted great interest in various contexts, including the recognition of action units and basic or non-basic emotions. In spite of major efforts, several questions remain open about which cues are important for interpreting facial expressions and how to encode them. In this paper, we review the progress across a range of affect recognition applications to shed light on these fundamental questions. We analyse state-of-the-art solutions by decomposing their pipelines into fundamental components, namely face registration, representation, dimensionality reduction, and recognition. We discuss the role of these components and highlight the models and new trends that are followed in their design. Moreover, we provide a comprehensive analysis of facial representations by uncovering their advantages and limitations; we elaborate on the type of information they encode and discuss how they deal with the key challenges of illumination variations, registration errors, head-pose variations, occlusions, and identity bias. This survey allows us to identify open issues and to define future directions for designing real-world affect recognition systems.