Results 1 - 20 of 21
1.
Article in English | MEDLINE | ID: mdl-38662568

ABSTRACT

While pre-training large-scale video-language models (VLMs) has shown remarkable potential for various downstream video-language tasks, existing VLMs still suffer from several common limitations, e.g., coarse-grained cross-modal alignment, under-modeling of temporal dynamics, and a detached video-language view. In this work, we aim to enhance VLMs with a fine-grained structural spatio-temporal alignment learning method (namely Finsta). First, we represent the input texts and videos with fine-grained scene graph (SG) structures, both of which are further unified into a holistic SG (HSG) for bridging the two modalities. Then, an SG-based framework is built, where the textual SG (TSG) is encoded with a graph Transformer, while the video dynamic SG (DSG) and the HSG are modeled with a novel recurrent graph Transformer for spatial and temporal feature propagation. A spatial-temporal Gaussian differential graph Transformer is further devised to strengthen the sense of changes in objects across the spatial and temporal dimensions. Next, based on the fine-grained structural features of the TSG and DSG, we perform object-centered spatial alignment and predicate-centered temporal alignment, respectively, enhancing video-language grounding in both spatiality and temporality. We design our method as a plug-and-play system, which can be integrated into existing well-trained VLMs for further representation augmentation, without training from scratch or relying on SG annotations in downstream applications. On 6 representative VL modeling tasks over 12 datasets, in both standard and long-form video scenarios, Finsta consistently improves 13 existing strong-performing VLMs and significantly refreshes the current state-of-the-art end-task performance, in both the fine-tuning and zero-shot settings.
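
A minimal sketch of what fine-grained cross-modal node alignment could look like, assuming pre-computed node embeddings for a textual scene graph and a video dynamic scene graph; the shapes, the max-over-nodes matching, and the InfoNCE form are illustrative assumptions, not the Finsta implementation.

```python
import torch
import torch.nn.functional as F

def fine_grained_alignment_loss(text_nodes, video_nodes, temperature=0.07):
    # text_nodes:  (B, Nt, d) node embeddings of each textual scene graph
    # video_nodes: (B, Nv, d) node embeddings of each video dynamic scene graph
    t = F.normalize(text_nodes, dim=-1)
    v = F.normalize(video_nodes, dim=-1)
    # node-to-node similarity for every (text graph, video graph) pair
    sim = torch.einsum("itd,jvd->ijtv", t, v)          # (B, B, Nt, Nv)
    # each text node is "grounded" by its best-matching video node; average over nodes
    pair_sim = sim.max(dim=-1).values.mean(dim=-1)     # (B, B)
    logits = pair_sim / temperature
    targets = torch.arange(t.size(0), device=t.device)
    # symmetric InfoNCE: matched text/video graphs are positives, the rest negatives
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# toy usage
loss = fine_grained_alignment_loss(torch.randn(4, 6, 256), torch.randn(4, 10, 256))
```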

2.
IEEE Trans Neural Netw Learn Syst ; 35(4): 5054-5063, 2024 Apr.
Article in English | MEDLINE | ID: mdl-37053061

ABSTRACT

The prevailing machine learning schema typically uses a one-pass model inference (e.g., forward propagation) to make predictions in the testing phase. This is inherently different from human students, who double-check their answers during examinations, especially when confidence is low. To bridge this gap, we propose a learning to double-check (L2D) framework, which formulates double-checking as a learnable procedure with two core operations: recognizing unreliable predictions and revising predictions. To judge the correctness of a prediction, we resort to counterfactual faithfulness in causal theory and design a contrastive faithfulness measure. In particular, L2D generates counterfactual features by imagining "what would the sample features be if its label were the predicted class?" and judges the prediction by the faithfulness of the counterfactual features. Furthermore, we design a simple and effective revision module to revise the original model prediction according to the faithfulness. We apply the L2D framework to three classification models and conduct experiments on two public datasets for image classification, validating the effectiveness of L2D in prediction correctness judgment and revision.
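
A toy sketch of the double-check idea, assuming the counterfactual feature for a predicted class can be approximated by a learned class prototype; the prototypes, threshold, and fallback revision rule are stand-ins, not the paper's actual counterfactual generator or revision module.

```python
import torch
import torch.nn.functional as F

def contrastive_faithfulness(features, logits, class_prototypes):
    """Compare the observed feature with the feature the sample *would* have if its
    label really were the predicted class (approximated here by a class prototype).
    A low score marks the prediction as unreliable."""
    pred = logits.argmax(dim=-1)                         # (B,)
    counterfactual = class_prototypes[pred]              # (B, d)
    return F.cosine_similarity(features, counterfactual, dim=-1)   # (B,)

def double_check(features, logits, class_prototypes, threshold=0.5):
    faith = contrastive_faithfulness(features, logits, class_prototypes)
    unreliable = faith < threshold
    # simplified revision: for unreliable samples, fall back to the class whose
    # prototype is closest to the observed feature
    dist = torch.cdist(features, class_prototypes)       # (B, C)
    revised = torch.where(unreliable, dist.argmin(dim=-1), logits.argmax(dim=-1))
    return revised, unreliable
```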

3.
Article in English | MEDLINE | ID: mdl-37556333

ABSTRACT

Video Question Answering (VideoQA) is the task of answering questions about a video. At its core is the understanding of the alignments between video scenes and question semantics to yield the answer. In leading VideoQA models, the typical learning objective, empirical risk minimization (ERM), tends to over-exploit the spurious correlations between question-irrelevant scenes and answers, instead of inspecting the causal effect of question-critical scenes, which undermines the prediction with unreliable reasoning. In this work, we take a causal look at VideoQA and propose a model-agnostic learning framework, named Invariant Grounding for VideoQA (IGV), to ground the question-critical scene, whose causal relations with answers are invariant across different interventions on the complement. With IGV, leading VideoQA models are forced to shield the answering from the negative influence of spurious correlations, which significantly improves their reasoning ability. To unleash the potential of this framework, we further provide a Transformer-Empowered Invariant Grounding for VideoQA (TIGV), a substantial instantiation of the IGV framework that naturally integrates the idea of invariant grounding into a transformer-style backbone. Experiments on four benchmark datasets validate our design in terms of accuracy, visual explainability, and generalization ability over the leading baselines. Our code is available at https://github.com/yl3800/TIGV.
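
A toy sketch of the invariance idea, assuming the video has already been split into question-critical ("causal") and complement scene features and that an answer head exists; the answer_head callable, batch-shuffling intervention, and KL penalty are illustrative assumptions rather than the released TIGV code.

```python
import torch
import torch.nn.functional as F

def invariant_grounding_loss(causal_feat, complement_feat, question, answers, answer_head):
    """The answer predicted from the grounded, question-critical scenes should not
    change when the complement scenes are replaced by those of another sample."""
    B = causal_feat.size(0)
    # intervention: shuffle the complement scenes across the batch
    perm = torch.randperm(B, device=causal_feat.device)
    logits_orig = answer_head(causal_feat, complement_feat, question)
    logits_intv = answer_head(causal_feat, complement_feat[perm], question)
    # supervised QA loss on the original composition ...
    qa_loss = F.cross_entropy(logits_orig, answers)
    # ... plus an invariance penalty: interventions on the complement must not matter
    inv_loss = F.kl_div(F.log_softmax(logits_intv, dim=-1),
                        F.softmax(logits_orig, dim=-1), reduction="batchmean")
    return qa_loss + inv_loss
```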

4.
IEEE Trans Pattern Anal Mach Intell ; 45(11): 13265-13280, 2023 Nov.
Article in English | MEDLINE | ID: mdl-37402185

ABSTRACT

We propose to perform video question answering (VideoQA) in a contrastive manner via a Video Graph Transformer model (CoVGT). CoVGT's uniqueness and superiority are three-fold: 1) It proposes a dynamic graph transformer module which encodes video by explicitly capturing the visual objects, their relations, and their dynamics, for complex spatio-temporal reasoning. 2) It designs separate video and text transformers for contrastive learning between the video and text to perform QA, instead of a multi-modal transformer for answer classification. Fine-grained video-text communication is done by additional cross-modal interaction modules. 3) It is optimized by joint fully- and self-supervised contrastive objectives between the correct and incorrect answers, as well as the relevant and irrelevant questions, respectively. With its superior video encoding and QA solution, we show that CoVGT achieves much better performance than previous methods on video reasoning tasks. Its performance even surpasses that of models pretrained with millions of external data. We further show that CoVGT can also benefit from cross-modal pretraining, yet with orders of magnitude less data. The results demonstrate the effectiveness and superiority of CoVGT, and additionally reveal its potential for more data-efficient pretraining.
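
A minimal sketch of answering-by-contrast rather than answer classification: the video (plus question) embedding should be closer to the correct answer's text embedding than to the incorrect candidates. Shapes and the temperature are assumptions; CoVGT's actual heads and joint objectives are richer.

```python
import torch
import torch.nn.functional as F

def contrastive_qa_loss(video_emb, answer_embs, correct_idx, temperature=0.07):
    # video_emb:   (B, d)    question-conditioned embedding from the video transformer
    # answer_embs: (B, K, d) embeddings of K candidate answers from the text transformer
    # correct_idx: (B,)      index of the correct candidate per sample
    v = F.normalize(video_emb, dim=-1).unsqueeze(1)      # (B, 1, d)
    a = F.normalize(answer_embs, dim=-1)                 # (B, K, d)
    logits = (v * a).sum(-1) / temperature               # (B, K) similarity per candidate
    return F.cross_entropy(logits, correct_idx)
```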

5.
IEEE Trans Pattern Anal Mach Intell ; 45(10): 12601-12617, 2023 Oct.
Article in English | MEDLINE | ID: mdl-37155378

ABSTRACT

Temporal grounding is the task of locating a specific segment from an untrimmed video according to a query sentence. This task has gained significant momentum in the computer vision community as it enables activity grounding beyond pre-defined activity classes by utilizing the semantic diversity of natural language descriptions. This semantic diversity is rooted in the principle of compositionality in linguistics, where novel semantics can be systematically described by combining known words in novel ways (compositional generalization). However, existing temporal grounding datasets are not carefully designed to evaluate compositional generalizability. To systematically benchmark the compositional generalizability of temporal grounding models, we introduce a new Compositional Temporal Grounding task and construct two new dataset splits, i.e., Charades-CG and ActivityNet-CG. We empirically find that existing models fail to generalize to queries with novel combinations of seen words. We argue that the inherent compositional structure (i.e., composition constituents and their relationships) inside the videos and language is the crucial factor for achieving compositional generalization. Based on this insight, we propose a variational cross-graph reasoning framework that explicitly decomposes video and language into hierarchical semantic graphs and learns fine-grained semantic correspondence between the two graphs. Meanwhile, we introduce a novel adaptive structured semantics learning approach to derive structure-informed and domain-generalizable graph representations, which facilitate the fine-grained semantic correspondence reasoning between the two graphs. To further evaluate the understanding of the compositional structure, we also introduce a more challenging setting, where one of the components in the novel composition is unseen. This requires a more sophisticated understanding of the compositional structure to infer the potential semantics of the unseen word, based on the other learned composition constituents appearing in both the video and language context and their relationships. Extensive experiments validate the superior compositional generalizability of our approach, demonstrating its ability to handle queries with novel combinations of seen words as well as novel words in the testing composition.
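
A toy illustration of the kind of split a compositional-generalization benchmark targets: test queries may only contain words seen in training, but must include at least one word pair never seen together. This greedy single-pass procedure is an assumption for illustration only, not the actual construction of Charades-CG or ActivityNet-CG.

```python
from itertools import combinations

def compositional_split(queries):
    """Assign queries to train/test so that test queries are novel compositions
    of seen words (all words seen, at least one unseen word pair)."""
    train, test = [], []
    seen_words, seen_pairs = set(), set()
    for q in queries:
        words = q.lower().split()
        pairs = set(combinations(sorted(set(words)), 2))
        novel_pair = any(p not in seen_pairs for p in pairs)
        all_words_seen = all(w in seen_words for w in words)
        if all_words_seen and novel_pair:
            test.append(q)                      # novel composition of seen words
        else:
            train.append(q)
            seen_words.update(words)
            seen_pairs.update(pairs)
    return train, test
```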

6.
IEEE Trans Pattern Anal Mach Intell ; 45(8): 10285-10299, 2023 Aug.
Article in English | MEDLINE | ID: mdl-37027600

ABSTRACT

In recommender systems, users' behavior data are driven by the interactions of user-item latent factors. To improve recommendation effectiveness and robustness, recent advances focus on latent factor disentanglement via variational inference. Despite significant progress, uncovering the underlying interactions, i.e., dependencies of latent factors, remains largely neglected by the literature. To bridge the gap, we investigate the joint disentanglement of user-item latent factors and the dependencies between them, namely latent structure learning. We propose to analyze the problem from the causal perspective, where a latent structure should ideally reproduce observational interaction data, and satisfy the structure acyclicity and dependency constraints, i.e., causal prerequisites. We further identify the recommendation-specific challenges for latent structure learning, i.e., the subjective nature of users' minds and the inaccessibility of private/sensitive user factors, which cause a universally learned latent structure to be suboptimal for individuals. To address these challenges, we propose the personalized latent structure learning framework for recommendation, namely PlanRec, which incorporates 1) differentiable Reconstruction, Dependency, and Acyclicity regularizations to satisfy the causal prerequisites; 2) Personalized Structure Learning (PSL), which personalizes the universally learned dependencies through probabilistic modeling; and 3) uncertainty estimation, which explicitly measures the uncertainty of structure personalization and adaptively balances personalization and shared knowledge for different users. We conduct extensive experiments on two public benchmark datasets from MovieLens and Amazon, and a large-scale industrial dataset from Alipay. Empirical studies validate that PlanRec discovers effective shared/personalized structures, and successfully balances shared knowledge and personalization via rational uncertainty estimation.
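
To make the acyclicity regularization concrete, here is one common differentiable acyclicity penalty (the NOTEARS-style h(A) = tr(exp(A ∘ A)) - d) on a soft adjacency matrix over latent factors. It is shown only to illustrate how an acyclicity constraint can be made trainable; the regularizers actually used in PlanRec may differ.

```python
import torch

def acyclicity_penalty(adj):
    """NOTEARS-style penalty: zero iff the (weighted) adjacency matrix is acyclic."""
    d = adj.size(0)
    return torch.trace(torch.matrix_exp(adj * adj)) - d

# toy usage: penalize cycles in a learnable soft dependency structure
A = torch.rand(5, 5, requires_grad=True)
loss = acyclicity_penalty(A)
loss.backward()
```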


Subject(s)
Algorithms , Learning , Humans
7.
IEEE Trans Pattern Anal Mach Intell ; 45(2): 2297-2309, 2023 Feb.
Article in English | MEDLINE | ID: mdl-35471869

ABSTRACT

Explainability is crucial for probing graph neural networks (GNNs), answering questions like "Why does the GNN model make a certain prediction?". Feature attribution is a prevalent technique for highlighting the explanatory subgraph in the input graph, which plausibly leads the GNN model to make its prediction. Various attribution methods have been proposed to exploit gradient-like or attention scores as the attributions of edges, and then select the salient edges with top attribution scores as the explanation. However, most of these works make an untenable assumption - that the selected edges are linearly independent - thus leaving the dependencies among edges largely unexplored, especially their coalition effect. We demonstrate unambiguous drawbacks of this assumption - it makes the explanatory subgraph unfaithful and verbose. To address this challenge, we propose a reinforcement learning agent, Reinforced Causal Explainer (RC-Explainer). It frames the explanation task as a sequential decision process - an explanatory subgraph is successively constructed by adding a salient edge to connect the previously selected subgraph. Technically, its policy network predicts the action of edge addition, and gets a reward that quantifies the action's causal effect on the prediction. Such a reward accounts for the dependency between the newly-added edge and the previously-added edges, thus reflecting whether they collaborate and form a coalition to pursue better explanations. The agent is trained via policy gradient to optimize the reward stream of edge sequences. As such, RC-Explainer is able to generate faithful and concise explanations, and has better generalization power to unseen graphs. When explaining different GNNs on three graph classification datasets, RC-Explainer achieves better or comparable performance to state-of-the-art approaches w.r.t. two quantitative metrics, predictive accuracy and contrastivity, and safely passes sanity checks and visual inspections. Codes and datasets are available at https://github.com/xiangwang1223/reinforced_causal_explainer.
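
A greedy stand-in for the sequential, dependency-aware construction (the real method trains a policy network with policy gradient): at each step, add the edge whose inclusion most increases the model's probability for the originally predicted class, conditioned on the edges already selected. The predict_prob callable is a hypothetical helper that evaluates the GNN on an edge subset and returns a class-probability vector.

```python
import torch

def greedy_explanatory_subgraph(model, graph_edges, predict_prob, k=5):
    """Greedily build an explanatory subgraph of at most k edges."""
    with torch.no_grad():
        target = predict_prob(model, graph_edges).argmax().item()
        selected = []
        remaining = list(graph_edges)
        for _ in range(min(k, len(remaining))):
            base = predict_prob(model, selected)[target]
            best_edge, best_gain = None, -float("inf")
            for e in remaining:
                # reward of adding edge e given the edges selected so far
                gain = predict_prob(model, selected + [e])[target] - base
                if gain > best_gain:
                    best_edge, best_gain = e, gain
            selected.append(best_edge)
            remaining.remove(best_edge)
    return selected
```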

8.
IEEE Trans Image Process ; 31: 3949-3960, 2022.
Article in English | MEDLINE | ID: mdl-35635814

ABSTRACT

Although single-image super-resolution (SISR) methods have achieved great success on single degradations, they still suffer performance drops under multiple degradation effects in real scenarios. Recently, some blind and non-blind models for multiple degradations have been explored. However, these methods usually degrade significantly under distribution shifts between the training and test data. To this end, we propose a novel conditional hyper-network framework for super-resolution with multiple degradations (named CMDSR), which helps the SR framework learn how to adapt to changes in the degradation distribution of the input. We extract a degradation prior at the task level with the proposed ConditionNet, which is then used to adapt the parameters of the basic SR network (BaseNet). Specifically, the ConditionNet of our framework first learns the degradation prior from a support set, which is composed of a series of degraded image patches from the same task. Then the adaptive BaseNet rapidly shifts its parameters according to the conditional features. Moreover, in order to better extract the degradation prior, we propose a task contrastive loss to shorten the inner-task distance and enlarge the cross-task distance between task-level features. Without predefining degradation maps, our blind framework can conduct one single parameter update to yield considerable improvement in SR results. Extensive experiments demonstrate the effectiveness of CMDSR over various blind and even several non-blind methods. The flexible BaseNet structure also reveals that CMDSR can be a general framework for a large series of SISR models. Our code is available at https://github.com/guanghaoyin/CMDSR.
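
An illustrative version of a task-level contrastive loss: degradation features from patches of the same task are pulled together, features from different tasks are pushed at least a margin apart. The margin form, shapes, and pairing rule are assumptions; CMDSR's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def task_contrastive_loss(task_feats, task_ids, margin=1.0):
    # task_feats: (N, d) degradation features; task_ids: (N,) task indices
    dist = torch.cdist(task_feats, task_feats)                  # (N, N) pairwise distances
    same = task_ids.unsqueeze(0) == task_ids.unsqueeze(1)       # (N, N) same-task mask
    eye = torch.eye(len(task_ids), dtype=torch.bool, device=task_feats.device)
    intra = dist[same & ~eye].mean()                            # shrink within-task distance
    inter = F.relu(margin - dist[~same]).mean()                 # enlarge cross-task distance
    return intra + inter

# toy usage with two degradation tasks of four patches each
feats = torch.randn(8, 64)
ids = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1])
loss = task_contrastive_loss(feats, ids)
```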

9.
IEEE Trans Image Process ; 31: 1204-1216, 2022.
Article in English | MEDLINE | ID: mdl-35015640

ABSTRACT

The task of video moment retrieval (VMR) is to retrieve a specific video moment from an untrimmed video, according to a textual query. It is a challenging task that requires effective modeling of the complex cross-modal matching relationship. Recent efforts primarily model the cross-modal interactions by hand-crafted network architectures. Despite their effectiveness, they rely heavily on expert experience to select architectures and have numerous hyperparameters that need to be carefully tuned, which significantly limits their application in real-world scenarios. How to design flexible architectures for modeling cross-modal interactions with less manual effort is crucial for the task of VMR but has received limited attention so far. To address this issue, we present a novel VMR approach that automatically searches for an optimal architecture to learn the cross-modal matching relationship. Specifically, we develop a cross-modal architecture search method. It first searches for repeatable cell network architectures based on a directed acyclic graph, which performs operation sampling over a customized task-specific operation set. Then, we adaptively modulate the edge importance in the graph with a query-aware attention network, which performs soft edge sampling in the searched cell. Different from existing neural architecture search methods, our approach can effectively exploit the query information to reach query-conditioned architectures for modeling cross-modal matching. Extensive experiments on three benchmark datasets show that our approach not only significantly outperforms the state-of-the-art methods but also runs more efficiently and robustly than manually crafted network architectures.
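
A toy sketch of the query-conditioned search idea: a set of candidate operations is mixed with weights produced from the query embedding, so the effective architecture depends on the textual query. The candidate operation set, the single mixed-op module, and the weighting network are assumptions; the paper's actual search space, sampling scheme, and cell structure are more involved.

```python
import torch
import torch.nn as nn

class QueryAwareMixedOp(nn.Module):
    """Mix candidate operations with query-dependent weights (DARTS-like softening)."""
    def __init__(self, dim):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Identity(),
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU()),
            nn.Sequential(nn.Linear(dim, dim), nn.Tanh()),
        ])
        self.op_logits = nn.Linear(dim, len(self.ops))   # query -> operation weights

    def forward(self, x, query_emb):
        # x: (B, dim) feature on a cell edge; query_emb: (B, dim) textual query embedding
        alpha = torch.softmax(self.op_logits(query_emb), dim=-1)   # (B, num_ops)
        outs = torch.stack([op(x) for op in self.ops], dim=1)      # (B, num_ops, dim)
        return (alpha.unsqueeze(-1) * outs).sum(dim=1)

# toy usage
mixed = QueryAwareMixedOp(128)
y = mixed(torch.randn(4, 128), torch.randn(4, 128))
```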


Subject(s)
Machine Learning
10.
IEEE Trans Pattern Anal Mach Intell ; 44(3): 1443-1456, 2022 03.
Article in English | MEDLINE | ID: mdl-32822293

ABSTRACT

Meta-learning has been proposed as a framework to address the challenging few-shot learning setting. The key idea is to leverage a large number of similar few-shot tasks in order to learn how to adapt a base-learner to a new task for which only a few labeled samples are available. As deep neural networks (DNNs) tend to overfit when using only a few samples, typical meta-learning models use shallow neural networks, thus limiting their effectiveness. In order to achieve top performance, some recent works have tried to use DNNs pre-trained on large-scale datasets, but mostly in straightforward manners, e.g., (1) taking their weights as a warm start for meta-training, and (2) freezing their convolutional layers as the feature extractor of base-learners. In this paper, we propose a novel approach called meta-transfer learning (MTL), which learns to transfer the weights of a deep NN for few-shot learning tasks. Specifically, "meta" refers to training on multiple tasks, and "transfer" is achieved by learning scaling and shifting functions of DNN weights (and biases) for each task. To further boost the learning efficiency of MTL, we introduce the hard task (HT) meta-batch scheme as an effective learning curriculum of few-shot classification tasks. We conduct experiments on five-class few-shot classification tasks on three challenging benchmarks, miniImageNet, tieredImageNet, and Fewshot-CIFAR100 (FC100), in both supervised and semi-supervised settings. Extensive comparisons to related works validate that our MTL approach trained with the proposed HT meta-batch scheme achieves top performance. An ablation study also shows that both components contribute to fast convergence and high accuracy.
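
A minimal sketch of the scaling-and-shifting idea: the pretrained convolution weights stay frozen, and only light per-channel scale/shift parameters are learned during meta-training. Layer and shape details are simplified assumptions, and the pretrained bias is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleShiftConv2d(nn.Module):
    """Wrap a frozen pretrained conv layer with learnable per-channel scale and shift."""
    def __init__(self, pretrained_conv: nn.Conv2d):
        super().__init__()
        self.weight = nn.Parameter(pretrained_conv.weight.detach().clone(),
                                   requires_grad=False)          # frozen pretrained weights
        out_ch = pretrained_conv.out_channels
        self.scale = nn.Parameter(torch.ones(out_ch, 1, 1, 1))   # meta-learned scaling
        self.shift = nn.Parameter(torch.zeros(out_ch))           # meta-learned shifting
        self.stride, self.padding = pretrained_conv.stride, pretrained_conv.padding

    def forward(self, x):
        return F.conv2d(x, self.weight * self.scale, self.shift,
                        stride=self.stride, padding=self.padding)

# toy usage: only scale/shift (and a task classifier) would be updated per task
layer = ScaleShiftConv2d(nn.Conv2d(3, 16, 3, padding=1))
out = layer(torch.randn(2, 3, 32, 32))
```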


Subject(s)
Algorithms , Neural Networks, Computer , Learning , Machine Learning
11.
IEEE Trans Pattern Anal Mach Intell ; 44(10): 6729-6751, 2022 10.
Article in English | MEDLINE | ID: mdl-34214034

ABSTRACT

Images can convey rich semantics and induce various emotions in viewers. Recently, with the rapid advancement of emotional intelligence and the explosive growth of visual data, extensive research efforts have been dedicated to affective image content analysis (AICA). In this survey, we comprehensively review the development of AICA over the past two decades, focusing especially on state-of-the-art methods with respect to three main challenges - the affective gap, perception subjectivity, and label noise and absence. We begin with an introduction to the key emotion representation models that have been widely employed in AICA and a description of the available datasets for evaluation, with a quantitative comparison of label noise and dataset bias. We then summarize and compare the representative approaches on (1) emotion feature extraction, including both handcrafted and deep features, (2) learning methods for dominant emotion recognition, personalized emotion prediction, emotion distribution learning, and learning from noisy data or few labels, and (3) AICA-based applications. Finally, we discuss some challenges and promising research directions for the future, such as image content and context understanding, group emotion clustering, and viewer-image interaction.


Subject(s)
Algorithms , Emotions , Image Processing, Computer-Assisted
12.
IEEE Trans Neural Netw Learn Syst ; 32(6): 2733-2743, 2021 06.
Article in English | MEDLINE | ID: mdl-32697723

ABSTRACT

With face recognition achieving superhuman-level performance, we are increasingly concerned with the recognition of fine-grained attributes, such as emotion, age, and gender. However, given that the label space is extremely large and follows a long-tail distribution, it is quite expensive to collect sufficient samples for fine-grained attributes. This results in imbalanced training samples and inferior attribute recognition models. To this end, we propose the use of arbitrary attribute combinations, without human effort, to synthesize face images. In particular, to bridge the semantic gap between the high-level attribute label space and low-level face images, we propose a novel neural-network-based approach that maps the target attribute labels to an embedding vector, which can be fed into a pretrained image decoder to synthesize a new face image. Furthermore, to regularize the attributes for image synthesis, we propose to use a perceptual loss to make the new image explicitly faithful to the target attributes. Experimental results show that our approach can generate photorealistic face images from attribute labels, and more importantly, by serving as augmented training samples, these images can significantly boost the performance of the attribute recognition model. The code is open-sourced at this link.
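
A minimal sketch of the label-to-embedding mapping and an attribute-faithfulness term: a small network maps a binary attribute vector to a latent code for a frozen, pretrained image decoder, and a frozen attribute classifier scores how well the synthesized image realizes the target attributes. All sizes, the decoder interface, and the classifier-based loss are assumptions, not the paper's architecture or its exact perceptual loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttributeToEmbedding(nn.Module):
    """Map a multi-hot attribute vector to a latent code for a pretrained decoder."""
    def __init__(self, num_attrs=40, latent_dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(num_attrs, 256), nn.ReLU(),
                                 nn.Linear(256, latent_dim))

    def forward(self, attrs):
        return self.net(attrs)

def attribute_faithfulness_loss(generated, attrs, attribute_classifier):
    """A frozen attribute classifier should recover the target attributes
    from the synthesized image (illustrative stand-in for the perceptual loss)."""
    pred = attribute_classifier(generated)          # (B, num_attrs) logits
    return F.binary_cross_entropy_with_logits(pred, attrs)
```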

13.
IEEE Trans Image Process ; 30: 1514-1526, 2021.
Article in English | MEDLINE | ID: mdl-33360994

ABSTRACT

Food recognition has attracted considerable research attention for its importance in health-related applications. The existing approaches mostly focus on the categorization of food according to dish names, while ignoring the underlying ingredient composition. In reality, two dishes with the same name do not necessarily share the exact list of ingredients. Therefore, the dishes under the same food category are not necessarily equal in nutrition content. Nevertheless, due to the limited datasets available with ingredient labels, the problem of ingredient recognition is often overlooked. Furthermore, as the number of ingredients is expected to be much smaller than the number of food categories, ingredient recognition is more tractable in real-world scenarios. This paper provides an insightful analysis of three compelling issues in ingredient recognition. These issues involve recognition at either the image level or the region level, pooling at either a single or multiple image scales, and learning in either a single-task or multi-task manner. The analysis is conducted on a large food dataset, Vireo Food-251, contributed by this paper. The dataset is composed of 169,673 images covering 251 popular Chinese food categories and 406 ingredients. The dataset includes adequate challenges in scale and complexity to reveal the limits of current approaches to ingredient recognition.


Subject(s)
Deep Learning , Food Ingredients/classification , Image Processing, Computer-Assisted/methods , Pattern Recognition, Automated/methods , China , Cooking , Humans
14.
IEEE Trans Neural Netw Learn Syst ; 31(8): 2791-2804, 2020 Aug.
Article in English | MEDLINE | ID: mdl-30676983

ABSTRACT

Matrix factorization (MF) has been widely used to discover the low-rank structure and to predict the missing entries of a data matrix. In many real-world learning systems, the data matrix can be very high-dimensional but sparse. This poses an imbalanced learning problem, since the scale of missing entries is usually much larger than that of the observed entries, but they cannot be ignored due to their valuable negative signal. For efficiency concerns, existing work typically applies a uniform weight to the missing entries to allow a fast learning algorithm. However, this simplification decreases modeling fidelity, resulting in suboptimal performance for downstream applications. In this paper, we weight the missing data nonuniformly, and more generically, we allow any weighting strategy on the missing data. To address the efficiency challenge, we propose a fast learning method for which the time complexity is determined by the number of observed entries in the data matrix rather than the matrix size. The key idea is twofold: 1) we apply truncated singular value decomposition on the weight matrix to get a more compact representation of the weights, and 2) we learn MF parameters with element-wise alternating least squares (eALS) and memorize the key intermediate variables to avoid unnecessary repeated computations. We conduct extensive experiments on two recommendation benchmarks, demonstrating the correctness, efficiency, and effectiveness of our fast eALS method.
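
A minimal sketch of the first key idea: replace the dense weight matrix on missing entries by a low-rank truncated-SVD representation, so later updates never need the full m x n weight matrix. The rank and variable names are illustrative, and the memoized eALS updates themselves are omitted.

```python
import numpy as np

def compact_weight_representation(W, rank=8):
    """Return factors U_k, V_k with W ≈ U_k @ V_k.T via truncated SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    U_k = U[:, :rank] * s[:rank]          # absorb singular values into U
    V_k = Vt[:rank, :].T
    return U_k, V_k

# toy check of the approximation quality
W = np.random.rand(100, 80)
U_k, V_k = compact_weight_representation(W, rank=8)
rel_err = np.linalg.norm(W - U_k @ V_k.T) / np.linalg.norm(W)
```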

15.
IEEE Trans Image Process ; 28(1): 32-44, 2019 Jan.
Article in English | MEDLINE | ID: mdl-30010565

ABSTRACT

Recently, great progress in automatic image captioning has been achieved by using semantic concepts detected from the image. However, we argue that the existing concepts-to-caption framework, in which the concept detector is trained using the image-caption pairs to minimize the vocabulary discrepancy, suffers from the deficiency of insufficient concepts. The reasons are two-fold: 1) the extreme imbalance between the numbers of positive and negative samples of each concept and 2) the incomplete labeling of training captions caused by biased annotation and the usage of synonyms. In this paper, we propose a method, termed online positive recall and missing concepts mining, to overcome these problems. Our method adaptively re-weights the loss of different samples according to their predictions for online positive recall, and uses a two-stage optimization strategy for missing concepts mining. In this way, more semantic concepts can be detected and higher accuracy can be expected. At the caption generation stage, we explore an element-wise selection process to automatically choose the most suitable concepts at each time step. Thus, our method can generate more precise and detailed captions to describe the image. We conduct extensive experiments on the MSCOCO image captioning dataset and the MSCOCO online test server, which show that our method achieves superior image captioning performance compared with other competitive methods.
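
An illustrative re-weighting in the spirit of online positive recall: positive concept samples that the detector currently scores low receive a larger loss weight, counteracting the extreme positive/negative imbalance. The focal-style weighting rule and the gamma parameter are assumptions, and the two-stage missing-concept mining is not reproduced here.

```python
import torch
import torch.nn.functional as F

def reweighted_concept_loss(logits, labels, gamma=2.0):
    # logits: (B, C) concept detector outputs; labels: (B, C) float multi-hot targets
    probs = torch.sigmoid(logits)
    # weight positives by how badly they are currently missed
    pos_weight = labels * (1.0 - probs).detach().pow(gamma)
    neg_weight = (1.0 - labels)
    per_elem = F.binary_cross_entropy_with_logits(logits, labels, reduction="none")
    return ((pos_weight + neg_weight) * per_elem).mean()
```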

16.
IEEE Trans Cybern ; 48(11): 3218-3231, 2018 Nov.
Article in English | MEDLINE | ID: mdl-29990033

ABSTRACT

Detecting events from massive social media data in social networks can facilitate browsing, search, and monitoring of real-time events by corporations, governments, and users. The short, conversational, heterogeneous, and real-time characteristics of social media data bring great challenges for event detection. The existing event detection approaches rely mainly on textual information, while the visual content of microblogs and the intrinsic correlation among the heterogeneous data are scarcely explored. To deal with the above challenges, we propose a novel real-time event detection method by generating an intermediate semantic level from social multimedia data, named microblog clique (MC), which is able to explore the high correlations among different microblogs. Specifically, the proposed method comprises three stages. First, the heterogeneous data in microblogs is formulated in a hypergraph structure. Hypergraph cut is conducted to group the highly correlated microblogs with the same topics as the MCs, which can address the information inadequateness and data sparseness issues. Second, a bipartite graph is constructed based on the generated MCs and the transfer cut partition is performed to detect the events. Finally, for new incoming microblogs, incremental hypergraph is constructed based on the latest MCs to generate new MCs, which are classified by bipartite graph partition into existing events or new ones. Extensive experiments are conducted on the events in the Brand-Social-Net dataset and the results demonstrate the superiority of the proposed method, as compared to the state-of-the-art approaches.

17.
EURASIP J Bioinform Syst Biol ; 2016(1): 18, 2016 Dec.
Article in English | MEDLINE | ID: mdl-27917229

ABSTRACT

Online community-based health services accumulate a huge amount of unstructured health question answering (QA) records at a continuously increasing pace. The ability to organize these health QA records has been found to be effective for data access. The existing approaches for organizing information are often not applicable to the health domain due to its nature, characterized by complex relations among entities, a large vocabulary gap, and the heterogeneity of users. To tackle these challenges, we propose a top-down organization scheme, which can automatically assign unstructured health-related records into a hierarchy with prior domain knowledge. Besides automatic hierarchy prototype generation, it also enables each data instance to be associated with multiple leaf nodes and profiles each node with terminologies. Based on this scheme, we design a hierarchy-based health information retrieval system. Experiments on a real-world dataset demonstrate the effectiveness of our scheme in organizing health QA into a topic hierarchy and retrieving health QA records from the topic hierarchy.

18.
IEEE Trans Image Process ; 25(3): 1033-46, 2016 Mar.
Article in English | MEDLINE | ID: mdl-26780785

ABSTRACT

We present a deep learning strategy to fuse multiple semantic cues for complex event recognition. In particular, we tackle the recognition task by answering how to jointly analyze human actions (who is doing what), objects (what), and scenes (where). First, each type of semantic features (e.g., human action trajectories) is fed into a corresponding multi-layer feature abstraction pathway, followed by a fusion layer connecting all the different pathways. Second, the correlations of how the semantic cues interacting with each other are learned in an unsupervised cross-modality autoencoder fashion. Finally, by fine-tuning a large-margin objective deployed on this deep architecture, we are able to answer the question on how the semantic cues of who, what, and where compose a complex event. As compared with the traditional feature fusion methods (e.g., various early or late strategies), our method jointly learns the essential higher level features that are most effective for fusion and recognition. We perform extensive experiments on two real-world complex event video benchmarks, MED'11 and CCV, and demonstrate that our method outperforms the best published results by 21% and 11%, respectively, on an event recognition task.
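
A minimal sketch of the fusion architecture the abstract describes: each semantic cue (action, object, scene features) passes through its own abstraction pathway, and a shared fusion layer combines the pathways before event classification. The cross-modality autoencoder pretraining and the large-margin objective are omitted; all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiCueFusion(nn.Module):
    """Per-cue abstraction pathways followed by a shared fusion layer."""
    def __init__(self, dims=(512, 512, 512), hidden=256, num_events=20):
        super().__init__()
        self.pathways = nn.ModuleList(
            nn.Sequential(nn.Linear(d, hidden), nn.ReLU(),
                          nn.Linear(hidden, hidden), nn.ReLU()) for d in dims)
        self.fusion = nn.Linear(hidden * len(dims), hidden)
        self.classifier = nn.Linear(hidden, num_events)

    def forward(self, action_feat, object_feat, scene_feat):
        cues = [p(x) for p, x in zip(self.pathways, (action_feat, object_feat, scene_feat))]
        fused = torch.relu(self.fusion(torch.cat(cues, dim=-1)))
        return self.classifier(fused)

# toy usage with random who / what / where features
model = MultiCueFusion()
scores = model(torch.randn(2, 512), torch.randn(2, 512), torch.randn(2, 512))
```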

19.
IEEE Trans Image Process ; 23(7): 2996-3012, 2014 Jul.
Article in English | MEDLINE | ID: mdl-24860032

ABSTRACT

Nonnegative matrix factorization (NMF) has received considerable attention in image processing, computer vision, and pattern recognition. An important variant of NMF is nonnegative graph embedding (NGE), which encodes the statistical or geometric information of data in the process of matrix factorization. NGE offers a general framework for unsupervised/supervised settings. However, NGE-like algorithms often suffer from noisy data, unreliable graphs, and noisy labels, which are commonly encountered in real-world applications. To address these issues, in this paper, we first propose a robust nonnegative graph embedding (RNGE) framework, where the joint sparsity in both graph embedding and data reconstruction endows robustness against undesirable noise. Next, we present a robust semi-nonnegative graph embedding (RsNGE) framework, which only constrains the coefficient matrix to be nonnegative while placing no constraint on the base matrix. This extends the applicable range of RNGE to data that are not nonnegative and endows the learnt base matrix with more discriminative power. RNGE/RsNGE provides a general formulation such that all the algorithms unified within the graph embedding framework can be easily extended to obtain their robust nonnegative/semi-nonnegative solutions. Further, we develop elegant multiplicative updating solutions that can solve RNGE/RsNGE efficiently and offer a rigorous convergence analysis. We conduct extensive experiments on four real-world datasets and compare the proposed RNGE/RsNGE to other representative NMF variants and data factorization methods. The experimental results demonstrate the robustness and effectiveness of the proposed approaches.
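
For readers unfamiliar with multiplicative update rules, here are the classic Lee-Seung updates for plain NMF, shown only as background for the kind of solutions the abstract refers to; the robust RNGE/RsNGE updates add graph-embedding and joint-sparsity terms on top and are not reproduced here.

```python
import numpy as np

def nmf_multiplicative(V, rank=10, iters=200, eps=1e-9):
    """Factor a nonnegative matrix V ≈ W @ H with multiplicative updates."""
    m, n = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((m, rank)) + eps
    H = rng.random((rank, n)) + eps
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update coefficients
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update basis
    return W, H
```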

20.
IEEE Trans Image Process ; 21(4): 2269-81, 2012 Apr.
Article in English | MEDLINE | ID: mdl-21965212

ABSTRACT

Recently, extensive research efforts have been dedicated to view-based methods for 3-D object retrieval due to the highly discriminative property of multiple views for 3-D object representation. However, most state-of-the-art approaches depend heavily on their own camera array settings for capturing views of 3-D objects. In order to move toward a general framework for 3-D object retrieval without the limitation of camera array restrictions, a camera constraint-free view-based (CCFV) 3-D object retrieval algorithm is proposed in this paper. In this framework, each object is represented by a free set of views, which means that these views can be captured from any direction without camera constraints. For each query object, we first cluster all query views to generate view clusters, which are then used to build the query models. For a more accurate 3-D object comparison, a positive matching model and a negative matching model are individually trained using positive and negative matched samples, respectively. The CCFV model is generated on the basis of the query Gaussian models by combining the positive matching model and the negative matching model. CCFV removes the constraint of static camera array settings for view capturing and can be applied to any view-based 3-D object database. We conduct experiments on the National Taiwan University 3-D model database and the ETH 3-D object database. Experimental results show that the proposed scheme achieves better performance than state-of-the-art methods.


Subject(s)
Image Interpretation, Computer-Assisted/methods , Imaging, Three-Dimensional/methods , Information Storage and Retrieval/methods , Pattern Recognition, Automated/methods , Photography/methods , Subtraction Technique , Algorithms , Image Enhancement/methods , Reproducibility of Results , Sensitivity and Specificity , Signal Processing, Computer-Assisted