Results 1 - 20 of 26
1.
IEEE Trans Pattern Anal Mach Intell ; 46(7): 4944-4956, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38306260

ABSTRACT

Supervised person re-identification (Re-ID) approaches are sensitive to label-corrupted data, which is inevitable and generally ignored in the field of person Re-ID. In this paper, we propose a two-stage noise-tolerant paradigm (TSNT) for label-corrupted person Re-ID. Specifically, at stage one, we present a self-refining strategy to separately train each network in TSNT by concentrating more on pure samples. These pure samples are progressively refurbished via mining the consistency between annotations and predictions. To enhance the tolerance of TSNT to noisy labels, at stage two, we employ a co-training strategy to collaboratively supervise the learning of the two networks. Concretely, a rectified cross-entropy loss is proposed to learn the mutual information from the peer network by assigning large weights to the refurbished reliable samples. Moreover, a noise-robust triplet loss is formulated to further improve the robustness of TSNT by increasing inter-class distances and reducing intra-class distances in the label-corrupted dataset, where a constraint condition for reliability discrimination is carefully designed to select reliable triplets. Extensive experiments demonstrate the superiority of TSNT: for instance, on the Market1501 dataset, our paradigm achieves 90.3% rank-1 accuracy (a 6.2% improvement over the state-of-the-art method) under a noise ratio of 20%.
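The weighting idea behind the rectified cross-entropy loss can be sketched in a few lines. This is a minimal numpy illustration under assumed inputs, not the paper's implementation: the reliability weights here are hand-set stand-ins for the refurbishment scores TSNT computes.

```python
import numpy as np

def rectified_cross_entropy(probs, labels, weights):
    """Weighted cross-entropy over possibly noisy labels: samples judged
    reliable (refurbished) get large weights, suspected noisy samples get
    small ones, so label noise contributes little to the average loss."""
    eps = 1e-12
    picked = probs[np.arange(len(labels)), labels]   # p(y_i | x_i)
    losses = -np.log(picked + eps)                   # per-sample cross-entropy
    return float(np.sum(weights * losses) / (np.sum(weights) + eps))

probs = np.array([[0.9, 0.1],    # confident, label agrees -> likely clean
                  [0.2, 0.8]])   # label 0 disagrees -> likely noisy
labels = np.array([0, 0])
weighted = rectified_cross_entropy(probs, labels, np.array([1.0, 0.1]))
uniform = rectified_cross_entropy(probs, labels, np.ones(2))
```

Down-weighting the suspect sample keeps the corrupted label from dominating the loss, which is the effect the co-training stage relies on.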

2.
IEEE Trans Pattern Anal Mach Intell ; 45(12): 15394-15405, 2023 Dec.
Article in English | MEDLINE | ID: mdl-37773900

ABSTRACT

In many applications, we are constrained to learn classifiers from very limited data (few-shot classification). The task becomes even more challenging if it is also required to identify samples from unknown categories (open-set classification). Learning a good abstraction for a class with very few samples is extremely difficult, especially under open-set settings. As a result, open-set recognition has received limited attention in the few-shot setting. However, it is a critical task in many applications like environmental monitoring, where the number of labeled examples for each class is limited. Existing few-shot open-set recognition (FSOSR) methods rely on thresholding schemes, with some considering uniform probability for open-class samples. However, this approach is often inaccurate, especially for fine-grained categorization, and makes them highly sensitive to the choice of a threshold. To address these concerns, we propose the Reconstructing Exemplar-based Few-shot Open-set ClaSsifier (ReFOCS). By using a novel exemplar reconstruction-based meta-learning strategy, ReFOCS streamlines FSOSR, eliminating the need for a carefully tuned threshold by learning to be self-aware of the openness of a sample. The exemplars act as class representatives and can be either provided in the training dataset or estimated in the feature domain. By testing on a wide variety of datasets, we show that ReFOCS outperforms multiple state-of-the-art methods.

3.
IEEE Trans Image Process ; 30: 8886-8899, 2021.
Article in English | MEDLINE | ID: mdl-34665727

ABSTRACT

Prior works on text-based video moment localization focus on temporally grounding the textual query in an untrimmed video. These works assume that the relevant video is already known and attempt to localize the moment on that relevant video only. Different from such works, we relax this assumption and address the task of localizing moments in a corpus of videos for a given sentence query. This task poses a unique challenge as the system is required to perform: 1) retrieval of the relevant video, where only a segment of the video corresponds with the queried sentence, and 2) temporal localization of the moment in the relevant video based on the sentence query. Towards overcoming this challenge, we propose the Hierarchical Moment Alignment Network (HMAN), which learns an effective joint embedding space for moments and sentences. In addition to learning subtle differences between intra-video moments, HMAN focuses on distinguishing inter-video global semantic concepts based on sentence queries. Qualitative and quantitative results on three benchmark text-based video moment retrieval datasets - Charades-STA, DiDeMo, and ActivityNet Captions - demonstrate that our method achieves promising performance on the proposed task of temporal localization of moments in a corpus of videos.

4.
IEEE Trans Image Process ; 30: 3017-3028, 2021.
Article in English | MEDLINE | ID: mdl-33571092

ABSTRACT

Most person re-identification methods, being supervised techniques, suffer from the burden of massive annotation requirements. Unsupervised methods overcome this need for labeled data, but perform poorly compared to the supervised alternatives. In order to cope with this issue, we introduce the problem of learning person re-identification models from videos with weak supervision. The weak nature of the supervision arises from the requirement of video-level labels, i.e., the person identities who appear in the video, in contrast to the more precise frame-level annotations. Towards this goal, we propose a multiple instance attention learning framework for person re-identification using such video-level labels. Specifically, we first cast the video person re-identification task into a multiple instance learning setting, in which person images in a video are collected into a bag. The relations between videos with similar labels can be utilized to identify persons; on top of that, we introduce a co-person attention mechanism which mines the similarity correlations between videos with person identities in common. The attention weights are obtained based on all person images instead of person tracklets in a video, making our learned model less affected by noisy annotations. Extensive experiments demonstrate the superiority of the proposed method over related methods on two weakly labeled person re-identification datasets.
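The bag formulation can be illustrated with a toy attention-pooling step. This is a hedged numpy sketch: the scoring vector `w` is a hand-set stand-in for the learned co-person attention, and the frame features are invented.

```python
import numpy as np

def attention_pool(frames, w):
    """Soft attention over the frames (instances) in a video (bag): frames
    scoring high against w dominate the pooled feature, so mislabeled or
    noisy frames receive small weights and little influence."""
    scores = frames @ w
    alpha = np.exp(scores - scores.max())    # numerically stable softmax
    alpha /= alpha.sum()
    return alpha @ frames, alpha

# A bag of three frame features; the second frame matches the identity.
frames = np.array([[0.1, 0.9],
                   [1.0, 0.0],
                   [0.2, 0.8]])
w = np.array([5.0, 0.0])                     # hypothetical scoring vector
pooled, alpha = attention_pool(frames, w)
```

Because pooling is over all frames rather than curated tracklets, a few wrongly bagged frames simply receive small attention weights instead of corrupting the bag representation.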

5.
IEEE Trans Pattern Anal Mach Intell ; 42(3): 554-567, 2020 03.
Article in English | MEDLINE | ID: mdl-30387722

ABSTRACT

Activity recognition is a challenging problem with many practical applications. In addition to the visual features, recent approaches have benefited from the use of context, e.g., inter-relationships among the activities and objects. However, these approaches require the data to be labeled and entirely available beforehand, and are not designed to be updated continuously, which makes them unsuitable for surveillance applications. In contrast, we propose a continuous-learning framework for context-aware activity recognition from unlabeled video, which has two distinct advantages over existing methods. First, it employs a novel active-learning technique that not only exploits the informativeness of the individual activities but also utilizes their contextual information during query selection; this leads to a significant reduction in expensive manual annotation effort. Second, the learned models can be adapted online as more data becomes available. We formulate a conditional random field model that encodes the context and devise an information-theoretic approach that utilizes the entropy and mutual information of the nodes to compute the set of most informative queries, which are labeled by a human. These labels are combined with graphical inference techniques for incremental updates. We provide a theoretical formulation of the active learning framework with an analytic solution. Experiments on six challenging datasets demonstrate that our framework achieves superior performance with significantly less manual labeling.
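The informativeness part of query selection can be sketched with plain entropy scoring; the paper's criterion also uses mutual information between CRF nodes, which this toy version omits.

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a discrete label distribution."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def select_queries(posteriors, k):
    """Query the k unlabeled nodes whose predicted label distributions
    have the highest entropy, i.e. where the model is least certain."""
    scores = [entropy(p) for p in posteriors]
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]

posteriors = [
    [0.98, 0.02],   # confident prediction -> low entropy
    [0.55, 0.45],   # near-uniform -> high entropy, worth a human label
    [0.80, 0.20],
]
picked = select_queries(posteriors, 1)
```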

6.
Article in English | MEDLINE | ID: mdl-30998468

ABSTRACT

In this paper, we present a novel approach to find informative and anomalous samples in videos exploiting the concept of typicality from information theory. In most video analysis tasks, selection of the most informative samples from a huge pool of training data in order to learn a good recognition model is an important problem. Furthermore, it is also useful to reduce the annotation cost, as it is time-consuming to annotate unlabeled samples. Typicality is a simple and powerful technique which can be applied to compress the training data to learn a good classification model. In a continuous video clip, an activity shares a strong correlation with its previous activities. We assume that the activity samples that appear in a video form a Markov chain. We explicitly show how typicality can be utilized in this scenario. We compute an atypical score for a sample using typicality and the Markovian property, which can be applied to two challenging vision problems: (a) sample selection for learning activity recognition models, and (b) anomaly detection. In the first case, our approach leads to a significant reduction of manual labeling cost while achieving similar or better recognition performance compared to a model trained with the entire training set. For the latter case, the atypical score has been exploited in identifying anomalous activities in videos, where our results demonstrate the effectiveness of the proposed framework over other recent strategies.
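A minimal sketch of an atypical score under the Markov-chain assumption, with made-up prior and transition probabilities; the paper's typicality-based formulation is richer than this per-step negative log-likelihood.

```python
import numpy as np

def atypical_score(sequence, init, trans):
    """Per-step negative log-likelihood of an activity sequence under a
    first-order Markov chain; high scores flag atypical (informative or
    anomalous) samples."""
    logp = np.log(init[sequence[0]])
    for a, b in zip(sequence, sequence[1:]):
        logp += np.log(trans[a, b])
    return float(-logp / len(sequence))

# Two activity classes with hypothetical prior and transition probabilities.
init = np.array([0.7, 0.3])
trans = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
common = atypical_score([0, 0, 0], init, trans)  # stays in the likely state
rare = atypical_score([0, 1, 0], init, trans)    # two unlikely transitions
```

Sequences with improbable transitions score higher, so thresholding this score separates anomalous clips from typical ones.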

7.
IEEE Trans Image Process ; 28(7): 3286-3300, 2019 Jul.
Article in English | MEDLINE | ID: mdl-30703026

ABSTRACT

With advanced image journaling tools, one can easily alter the semantic meaning of an image by exploiting certain manipulation techniques, such as copy-clone, object splicing, and removal, which can mislead viewers. Identifying these manipulations, in contrast, is a very challenging task, as manipulated regions are not visually apparent. This paper proposes a high-confidence manipulation localization architecture that utilizes resampling features, long short-term memory (LSTM) cells, and an encoder-decoder network to segment out manipulated regions from non-manipulated ones. Resampling features are used to capture artifacts such as JPEG quality loss, upsampling, downsampling, rotation, and shearing. The proposed network exploits larger receptive fields (spatial maps) and frequency-domain correlation to analyze the discriminative characteristics between the manipulated and non-manipulated regions by incorporating the encoder and LSTM network. Finally, the decoder network learns the mapping from low-resolution feature maps to pixel-wise predictions for image tamper localization. With the predicted mask provided by the final (softmax) layer of the proposed architecture, end-to-end training is performed to learn the network parameters through back-propagation using the ground-truth masks. Furthermore, a large image splicing dataset is introduced to guide the training process. The proposed method is capable of localizing image manipulations at the pixel level with high precision, which is demonstrated through rigorous experimentation on three diverse datasets.

8.
IEEE Trans Image Process ; 26(10): 4712-4724, 2017 Oct.
Article in English | MEDLINE | ID: mdl-28574359

ABSTRACT

Most video summarization approaches have focused on extracting a summary from a single video; we propose an unsupervised framework for summarizing a collection of videos. We observe that each video in the collection may contain some information that other videos do not have, and thus exploring the underlying complementarity could be beneficial in creating a diverse informative summary. We develop a novel diversity-aware sparse optimization method for multi-video summarization by exploring the complementarity within the videos. Our approach extracts a multi-video summary, which is both interesting and representative in describing the whole video collection. To efficiently solve our optimization problem, we develop an alternating minimization algorithm that minimizes the overall objective function with respect to one video at a time while fixing the other videos. Moreover, we introduce a new benchmark data set, Tour20, which contains 140 videos with multiple manually created summaries, which were acquired in a controlled experiment. Finally, by extensive experiments on the new Tour20 data set and several other multi-view data sets, we show that the proposed approach clearly outperforms the state-of-the-art methods on the two problems: topic-oriented video summarization and multi-view video summarization in a camera network.

9.
Article in English | MEDLINE | ID: mdl-26887008

ABSTRACT

Technologically advanced imaging techniques have allowed us to generate and study the internal part of a tissue over time by capturing serial optical images that contain spatio-temporal slices of hundreds of tightly packed cells. Image registration of such live-imaging datasets of developing multicellular tissues is one of the essential components of all image analysis pipelines. In this paper, we present a fully automated 4D (X-Y-Z-T) registration method for live imaging stacks that takes care of both temporal and spatial misalignments. We present a novel landmark selection methodology for the case where the shape features of individual cells are neither of high quality nor highly distinguishable. The proposed registration method finds the best image slice correspondence from consecutive image stacks to account for vertical growth in the tissue and the discrepancy in the choice of the starting focal point. It then uses a local graph-based approach to automatically find corresponding landmark pairs, and finally the registration parameters are used to register the entire image stack. The proposed registration algorithm, combined with an existing tracking method, is tested on multiple image stacks of tightly packed cells of the Arabidopsis shoot apical meristem, and the results show that it significantly improves the accuracy of cell lineages and division statistics.


Subject(s)
Arabidopsis/cytology; Cytological Techniques/methods; Imaging, Three-Dimensional/methods; Microscopy, Confocal/methods; Algorithms; Meristem/cytology; Models, Biological; Plant Shoots/cytology
10.
IEEE Trans Pattern Anal Mach Intell ; 38(7): 1397-410, 2016 07.
Article in English | MEDLINE | ID: mdl-26441444

ABSTRACT

Distributed algorithms have recently gained immense popularity. With regard to computer vision applications, distributed multi-target tracking in a camera network is a fundamental problem. The goal is for all cameras to have accurate state estimates for all targets. Distributed estimation algorithms work by exchanging information between sensors that are communication neighbors. Vision-based distributed multi-target state estimation has at least two characteristics that distinguish it from other applications. First, cameras are directional sensors, and neighboring sensors often may not be sensing the same targets, i.e., they are naive with respect to those targets. Second, in the presence of clutter and multiple targets, each camera must solve a data association problem. This paper presents an information-weighted, consensus-based, distributed multi-target tracking algorithm referred to as the Multi-target Information Consensus (MTIC) algorithm, which is designed to address both the naivety and the data association problems. It converges to the centralized minimum mean square error estimate. The proposed MTIC algorithm and its extensions to non-linear camera models, termed the Extended MTIC (EMTIC), are robust to false measurements and to limited resources such as power, bandwidth, and real-time operational requirements. Simulation and experimental analysis are provided to support the theoretical results.

11.
IEEE Trans Pattern Anal Mach Intell ; 38(9): 1859-71, 2016 09.
Article in English | MEDLINE | ID: mdl-26485472

ABSTRACT

Existing data association techniques mostly focus on matching pairs of data-point sets and then repeating this process along space-time to achieve long-term correspondences. However, in many problems, such as person re-identification, a set of data-points may be observed at multiple spatio-temporal locations and/or by multiple agents in a network, and simply combining the local pairwise association results between sets of data-points often leads to inconsistencies over the global space-time horizons. In this paper, we propose a novel Network Consistent Data Association (NCDA) framework, formulated as an optimization problem, that not only maintains consistency in association results across the network, but also improves the pairwise data association accuracies. The proposed NCDA can be solved as a binary integer program leading to a globally optimal solution and is capable of handling the challenging data-association scenario where the number of data-points varies across different sets of instances in the network. We also present an online implementation of the NCDA method that can dynamically associate new observations to already observed data-points in an iterative fashion, while maintaining network consistency. We have tested both the batch and the online NCDA in two application areas - person re-identification and spatio-temporal cell tracking - and observed consistent and highly accurate data association results in all cases.
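The notion of network consistency can be illustrated in its simplest form with three cameras, where pairwise associations must compose consistently around the loop. The mappings below are toy examples, not the paper's optimization:

```python
def is_network_consistent(ab, bc, ac):
    """Loop consistency over three cameras: composing the A->B and B->C
    associations must agree with the direct A->C association for every
    person observed in camera A."""
    return all(bc[ab[p]] == ac[p] for p in ab)

# Toy identity mappings between person indices in three cameras.
good = is_network_consistent(ab={0: 1, 1: 0}, bc={0: 0, 1: 1}, ac={0: 1, 1: 0})
bad = is_network_consistent(ab={0: 1, 1: 0}, bc={0: 0, 1: 1}, ac={0: 0, 1: 1})
```

Independently computed pairwise matches can violate this loop constraint, which is exactly the inconsistency the NCDA optimization rules out.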

12.
IEEE Trans Pattern Anal Mach Intell ; 37(7): 1360-72, 2015 Jul.
Article in English | MEDLINE | ID: mdl-26352445

ABSTRACT

In this paper, rather than modeling activities in videos individually, we jointly model and recognize related activities in a scene using both motion and context features. This is motivated by the observations that activities related in space and time rarely occur independently and can serve as the context for each other. We propose a two-layer conditional random field model that represents the action segments and activities in a hierarchical manner. The model allows the integration of both motion and various context features at different levels and automatically learns the statistics that capture the patterns of the features. With weakly labeled training data, the learning problem is formulated as a max-margin problem and is solved by an iterative algorithm. Rather than generating activity labels for individual activities, our model simultaneously predicts an optimum structural label for the related activities in the scene. We show promising results on the UCLA Office Dataset and VIRAT Ground Dataset that demonstrate the benefit of hierarchical modeling of related activities using both motion and context features.

13.
IEEE Trans Pattern Anal Mach Intell ; 37(8): 1656-69, 2015 Aug.
Article in English | MEDLINE | ID: mdl-26353002

ABSTRACT

Person re-identification in a non-overlapping multicamera scenario is an open challenge in computer vision because of the large changes in appearance caused by variations in viewing angle, lighting, background clutter, and occlusion over multiple cameras. As a result of these variations, features describing the same person get transformed between cameras. To model the transformation of features, the feature space is nonlinearly warped to get the "warp functions". The warp functions between two instances of the same target form the set of feasible warp functions, while those between instances of different targets form the set of infeasible warp functions. In this work, we build upon the observation that feature transformations between cameras lie in a nonlinear function space of all possible feature transformations. The space consisting of all the feasible and infeasible warp functions is the warp function space (WFS). We propose to learn a discriminating surface separating these two sets of warp functions in the WFS and to re-identify persons by classifying a test warp function as feasible or infeasible. Towards this objective, a Random Forest (RF) classifier is employed, which effectively chooses the warp function components according to their importance in separating the feasible and the infeasible warp functions in the WFS. Extensive experiments on five datasets are carried out to show the superior performance of the proposed approach over state-of-the-art person re-identification methods. We show that our approach outperforms all other methods when large illumination variations are considered. At the same time, our method achieves the best average performance over multiple combinations of the datasets, showing that it is not designed only to address a specific challenge posed by a particular dataset.


Subject(s)
Algorithms; Biometric Identification/methods; Databases, Factual; Humans; Video Recording
14.
Med Image Anal ; 19(1): 149-63, 2015 Jan.
Article in English | MEDLINE | ID: mdl-25461334

ABSTRACT

Modern live imaging techniques enable us to observe the internal part of a tissue over time by generating serial optical images containing spatio-temporal slices of hundreds of tightly packed cells. Automated tracking of plant and animal cells from such time-lapse live-imaging datasets of a developing multicellular tissue is required for quantitative, high-throughput analysis of cell division, migration, and cell growth. In this paper, we present a novel cell tracking method that exploits the tight spatial topology of neighboring cells in a multicellular field as contextual information and combines it with physical features of individual cells for generating reliable cell lineages. The 2D image slices of multicellular tissues are modeled as a conditional random field, and pairwise cell-to-cell similarities are obtained by estimating marginal probability distributions through loopy belief propagation on this CRF. These similarity scores are further used in a spatio-temporal graph labeling problem to obtain the optimal and feasible set of correspondences between individual cell slices across the 4D image dataset. We present results on (3D+t) confocal image stacks of Arabidopsis shoot meristem and show that the method is capable of handling many visual analysis challenges associated with such cell tracking problems, viz. poor feature quality of individual cells, low SNR in parts of images, variable number of cells across slices, and cell division detection.


Subject(s)
Arabidopsis/cytology; Cell Tracking/methods; Imaging, Three-Dimensional/methods; Microscopy, Video/methods; Pattern Recognition, Automated/methods; Algorithms; Arabidopsis/physiology; Cell Count/methods; Cell Division/physiology; Cell Movement/physiology; Cells, Cultured; Image Enhancement/methods; Image Interpretation, Computer-Assisted/methods; Reproducibility of Results; Sensitivity and Specificity; Spatio-Temporal Analysis
15.
PLoS One ; 8(8): e67202, 2013.
Article in English | MEDLINE | ID: mdl-23940509

ABSTRACT

The need for quantification of cell growth patterns in a multilayer, multi-cellular tissue necessitates the development of a 3D reconstruction technique that can estimate the 3D shapes and sizes of individual cells from Confocal Microscopy (CLSM) image slices. However, current methods of 3D reconstruction using CLSM imaging require a large number of image slices per cell. In the case of live cell imaging of an actively developing tissue, however, high depth resolution is not feasible, as cells must be spared damage from prolonged exposure to laser radiation. In the present work, we propose an anisotropic Voronoi tessellation based 3D reconstruction framework for a tightly packed multilayer tissue with extreme z-sparsity (2-4 slices/cell) and a wide range of cell shapes and sizes. The proposed method, named the 'Adaptive Quadratic Voronoi Tessellation' (AQVT), is capable of handling both the sparsity problem and the non-uniformity in cell shapes by estimating the tessellation parameters for each cell from the sparse data-points on its boundaries. We have tested the proposed 3D reconstruction method on time-lapse CLSM image stacks of the Arabidopsis Shoot Apical Meristem (SAM) and have shown that the AQVT-based reconstruction method can correctly estimate the 3D shapes of a large number of SAM cells.


Subject(s)
Anisotropy; Arabidopsis/cytology; Image Processing, Computer-Assisted/methods; Imaging, Three-Dimensional/methods; Meristem/cytology; Animals
16.
Article in English | MEDLINE | ID: mdl-24384704

ABSTRACT

Study of the molecular control of organ growth requires establishment of the causal relationship between gene expression and cell behaviors. We seek to understand this relationship at the shoot apical meristem (SAM) of model plant Arabidopsis thaliana. This requires the spatial mapping and temporal alignment of different functional domains into a single template. Live-cell imaging techniques allow us to observe real-time organ primordia growth and gene expression dynamics at cellular resolution. In this paper, we propose a framework for the measurement of growth features at the 3D reconstructed surface of organ primordia, as well as algorithms for robust time alignment of primordia. We computed areas and deformation values from reconstructed 3D surfaces of individual primordia from live-cell imaging data. Based on these growth measurements, we applied a multiple feature landscape matching (LAM-M) algorithm to ensure a reliable temporal alignment of multiple primordia. Although the original landscape matching (LAM) algorithm motivated our alignment approach, it sometimes fails to properly align growth curves in the presence of high noise/distortion. To overcome this shortcoming, we modified the cost function to consider the landscape of the corresponding growth features. We also present an alternate parameter-free growth alignment algorithm which performs as well as LAM-M for high-quality data, but is more robust to the presence of outliers or noise. Results on primordia and guppy evolutionary growth data show that the proposed alignment framework performs at least as well as the LAM algorithm in the general case, and significantly better in the case of increased noise.
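A standard dynamic-time-warping distance conveys the flavor of aligning two primordia sampled at different rates; the LAM-M cost described above additionally matches landscapes of growth features, which this plain numpy sketch with invented curves omits.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic-time-warping distance between two 1-D growth curves,
    allowing one curve to be locally stretched against the other."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

fast = [0.0, 2.0, 4.0, 6.0]                  # coarsely sampled primordium
slow = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # same growth, finer sampling
```

Unlike a pointwise distance, the warping path lets curves of different lengths and rates be compared, which is the premise of temporal alignment here.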


Subject(s)
Algorithms; Arabidopsis/cytology; Arabidopsis/growth & development; Imaging, Three-Dimensional/methods; Microscopy, Video/methods; Plant Shoots/cytology; Plant Shoots/growth & development; Cell Enlargement; Cell Proliferation; Image Interpretation, Computer-Assisted/methods; Spatio-Temporal Analysis; Subtraction Technique
17.
IEEE Trans Image Process ; 21(7): 3282-95, 2012 Jul.
Article in English | MEDLINE | ID: mdl-22374359

ABSTRACT

The performance of dynamic scene algorithms often suffers because of the inability to effectively acquire features on the targets, particularly when they are distributed over a wide field of view. In this paper, we propose an integrated analysis and control framework for a pan, tilt, zoom (PTZ) camera network in order to maximize various scene understanding performance criteria (e.g., tracking accuracy, best shot, and image resolution) through dynamic camera-to-target assignment and efficient feature acquisition. Moreover, we consider the situation where processing is distributed across the network since it is often unrealistic to have all the image data at a central location. In such situations, the cameras, although autonomous, must collaborate among themselves because each camera's PTZ parameter entails constraints on the others. Motivated by recent work in cooperative control of sensor networks, we propose a distributed optimization strategy, which can be modeled as a game involving the cameras and targets. The cameras gain by reducing the error covariance of the tracked targets or through higher resolution feature acquisition, which, however, comes at the risk of losing the dynamic target. Through the optimization of this reward-versus-risk tradeoff, we are able to control the PTZ parameters of the cameras and assign them to targets dynamically. The tracks, upon which the control algorithm is dependent, are obtained through a consensus estimation algorithm whereby cameras can arrive at a consensus on the state of each target through a negotiation strategy. We analyze the performance of this collaborative sensing strategy in active camera networks in a simulation environment, as well as a real-life camera network.

18.
IEEE Trans Pattern Anal Mach Intell ; 33(8): 1681-8, 2011 Aug.
Article in English | MEDLINE | ID: mdl-21135432

ABSTRACT

Linear and multilinear models (PCA, 3DMM, AAM/ASM, and multilinear tensors) of object shape/appearance have been very popular in computer vision. In this paper, we analyze the applicability of these heuristic models from the fundamental physical laws of object motion and image formation. We prove that under suitable conditions, the image appearance space can be closely approximated to be multilinear, with the illumination and texture subspaces being trilinearly combined with the direct sum of the motion and deformation subspaces. This result provides a physics-based understanding of many of the successes and limitations of the linear and multilinear approaches existing in the computer vision literature, and also identifies some of the conditions under which they are valid. It provides an analytical representation of the image space in terms of different physical factors that affect the image formation process. Numerical analysis of the accuracy of the physics-based models is performed, and tracking results on real data are presented.

19.
IEEE Trans Image Process ; 19(10): 2564-79, 2010 Oct.
Article in English | MEDLINE | ID: mdl-20550994

ABSTRACT

Camera networks are being deployed for various applications such as security and surveillance, disaster response, and environmental modeling. However, there is little automated processing of the data. Moreover, most methods for multicamera analysis are centralized schemes that require the data to be present at a central server. In many applications, this is prohibitively expensive, both technically and economically. In this paper, we investigate distributed scene analysis algorithms by leveraging concepts of consensus that have been studied in the context of multiagent systems, but have had few applications in video analysis. Each camera estimates certain parameters based upon its own sensed data, which is then shared locally with the neighboring cameras in an iterative fashion, and a final estimate is arrived at in the network using consensus algorithms. We specifically focus on two basic problems: tracking and activity recognition. For multitarget tracking in a distributed camera network, we show how the Kalman-Consensus algorithm can be adapted to take into account the directional nature of video sensors and the network topology. For the activity recognition problem, we derive a probabilistic consensus scheme that combines the similarity scores of neighboring cameras to come up with a probability for each action at the network level. Thorough experimental results are shown on real data along with a quantitative analysis.
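The consensus idea can be sketched as plain average consensus on scalar estimates; the Kalman-Consensus filter used in the paper additionally accounts for estimate covariance and sensor directionality, which this toy numpy version omits.

```python
import numpy as np

def consensus_step(x, adjacency, rate=0.2):
    """One synchronous consensus iteration: each camera nudges its estimate
    toward the estimates of its communication neighbors."""
    new = x.copy()
    for i in range(len(x)):
        for j in np.nonzero(adjacency[i])[0]:
            new[i] += rate * (x[j] - x[i])
    return new

# Three cameras in a line topology with disagreeing estimates of a
# target's position; repeated local exchanges drive them to the average.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
x = np.array([1.0, 5.0, 9.0])
for _ in range(100):
    x = consensus_step(x, A)
```

No camera ever sees all the data, yet every local estimate converges to the network-wide average, which is the property the distributed schemes build on.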


Subject(s)
Algorithms; Computer Communication Networks; Pattern Recognition, Automated/methods; Population Surveillance/methods; Video Recording; Humans; Markov Chains; Movement/physiology; Reproducibility of Results; Video Recording/instrumentation; Video Recording/methods
20.
IEEE Trans Image Process ; 18(6): 1326-39, 2009 Jun.
Article in English | MEDLINE | ID: mdl-19398409

ABSTRACT

Pattern recognition in video is a challenging task because of the multitude of spatio-temporal variations that occur in different videos capturing the exact same event. While traditional pattern-theoretic approaches account for the spatial changes that occur due to lighting and pose, very little has been done to address the effect of temporal rate changes in the executions of an event. In this paper, we provide a systematic model-based approach to learn the nature of such temporal variations (time warps) while simultaneously allowing for the spatial variations in the descriptors. We illustrate our approach for the problem of action recognition and provide experimental justification for the importance of accounting for rate variations in action recognition. The model is composed of a nominal activity trajectory and a function space capturing the probability distribution of activity-specific time warping transformations. We use the square-root parameterization of time warps to derive geodesics, distance measures, and probability distributions on the space of time warping functions. We then design a Bayesian algorithm which treats the execution rate function as a nuisance variable and integrates it out using Monte Carlo sampling, to generate estimates of class posteriors. This approach allows us to learn the space of time warps for each activity while simultaneously capturing other intra- and interclass variations. Next, we discuss a special case of this approach which assumes a uniform distribution on the space of time warping functions and show how computationally efficient inference algorithms may be derived for this special case. We discuss the relative advantages and disadvantages of both approaches and show their efficacy using experiments on gait-based person identification and activity recognition.


Subject(s)
Algorithms; Models, Statistical; Movement/physiology; Pattern Recognition, Automated/methods; Anthropometry; Bayes Theorem; Gait/physiology; Humans; Monte Carlo Method; Video Recording