1.
IEEE Trans Pattern Anal Mach Intell ; 45(11): 13363-13375, 2023 Nov.
Article in English | MEDLINE | ID: mdl-37405895

ABSTRACT

A medical diagnosis assistant (MDA) is an interactive diagnostic agent that sequentially inquires about symptoms to discriminate among diseases. However, because the dialogue records used to build a patient simulator are collected passively, they may be contaminated by task-unrelated biases, such as the collectors' preferences, which can prevent the diagnostic agent from capturing transportable knowledge from the simulator. This work identifies and resolves two representative non-causal biases: (i) default-answer bias and (ii) distributional inquiry bias. Bias (i) originates in the patient simulator, which answers unrecorded inquiries with biased default answers. To eliminate this bias, we build on a well-known causal inference technique, propensity score matching, and propose a novel propensity latent matching scheme so that the patient simulator can answer unrecorded inquiries effectively. Bias (ii) is inherent in passively collected data: the agent may simply memorize what to inquire within the training data and fail to generalize to out-of-distribution cases. To this end, we propose a progressive assurance agent with dual processes for symptom inquiry and disease diagnosis. The diagnosis process pictures the patient mentally and probabilistically, using intervention to eliminate the effect of the inquiry behavior, while the inquiry process is driven by the diagnosis process to ask about the symptoms that most enhance diagnostic confidence, which shifts as the patient distribution changes. In this cooperative manner, the proposed agent significantly improves out-of-distribution generalization. Extensive experiments demonstrate that our framework achieves new state-of-the-art performance and possesses the advantage of transportability.
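The core idea of answering unrecorded inquiries by matching, rather than by a biased default, can be pictured as a nearest-neighbor lookup in a learned latent space. The minimal Python sketch below assumes precomputed dialogue embeddings and illustrative names throughout; it is not the paper's propensity latent matching implementation.

```python
import numpy as np

def latent_match_answer(query_embedding, record_embeddings, record_answers, default="no"):
    """Answer an unrecorded symptom inquiry by matching in latent space.

    Instead of returning a fixed, biased default answer, find the recorded
    dialogue whose latent embedding is closest to the query inquiry and
    reuse its answer. All names here are illustrative, not the paper's API.
    """
    if len(record_embeddings) == 0:
        return default  # fall back only when no records exist at all
    dists = np.linalg.norm(record_embeddings - query_embedding, axis=1)
    return record_answers[int(np.argmin(dists))]
```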

2.
IEEE Trans Image Process ; 31: 1978-1993, 2022.
Article in English | MEDLINE | ID: mdl-35157584

ABSTRACT

Video self-supervised learning is a challenging task that requires significant expressive power from the model to leverage rich spatial-temporal knowledge and generate effective supervisory signals from large amounts of unlabeled videos. However, existing methods fail to increase the temporal diversity of unlabeled videos and neglect to model multi-scale temporal dependencies explicitly. To overcome these limitations, we take advantage of the multi-scale temporal dependencies within videos and propose a novel video self-supervised learning framework named Temporal Contrastive Graph Learning (TCGL), which jointly models inter-snippet and intra-snippet temporal dependencies for temporal representation learning with a hybrid graph contrastive learning strategy. Specifically, a Spatial-Temporal Knowledge Discovering (STKD) module first extracts motion-enhanced spatial-temporal representations from videos based on frequency-domain analysis with the discrete cosine transform. To explicitly model multi-scale temporal dependencies of unlabeled videos, TCGL integrates prior knowledge about frame and snippet orders into graph structures, i.e., the intra-/inter-snippet Temporal Contrastive Graphs (TCG). Specific contrastive learning modules then maximize the agreement between nodes in different graph views. To generate supervisory signals for unlabeled videos, we introduce an Adaptive Snippet Order Prediction (ASOP) module that leverages the relational knowledge among video snippets to learn the global context representation and adaptively recalibrate channel-wise features. Experimental results demonstrate the superiority of TCGL over state-of-the-art methods on large-scale action recognition and video retrieval benchmarks. The code is publicly available at https://github.com/YangLiu9208/TCGL.
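As a rough illustration of the "maximize the agreement between nodes in different graph views" step, the sketch below computes a generic InfoNCE-style contrastive loss over node embeddings from two views. The exact TCGL objective and graph construction differ; this is only the standard contrastive template.

```python
import torch
import torch.nn.functional as F

def graph_contrastive_loss(view_a, view_b, temperature=0.1):
    """InfoNCE-style agreement between node features of two graph views.

    view_a, view_b: (num_nodes, dim) embeddings of the same snippets under
    two augmented temporal graphs. Node i in view_a should match node i in
    view_b and repel all other nodes. A generic sketch, not TCGL's loss.
    """
    a = F.normalize(view_a, dim=1)
    b = F.normalize(view_b, dim=1)
    logits = a @ b.t() / temperature                     # pairwise similarities
    targets = torch.arange(a.size(0), device=a.device)   # positives on the diagonal
    return F.cross_entropy(logits, targets)
```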

3.
iScience ; 25(1): 103581, 2022 Jan 21.
Article in English | MEDLINE | ID: mdl-35036861

ABSTRACT

We propose CX-ToM, short for counterfactual explanations with theory of mind, a new explainable AI (XAI) framework for explaining decisions made by a deep convolutional neural network (CNN). In contrast to current XAI methods that generate explanations as a single-shot response, we pose explanation as an iterative communication process, i.e., a dialogue between the machine and the human user. More concretely, our CX-ToM framework generates a sequence of explanations in a dialogue by mediating the differences between the minds of the machine and the human user. To do this, we use theory of mind (ToM), which helps us explicitly model the human's intention, the machine's mind as inferred by the human, and the human's mind as inferred by the machine. Moreover, most state-of-the-art XAI frameworks provide attention-based (heat-map) explanations; in our work, we show that such explanations are not sufficient for increasing human trust in the underlying CNN model. In CX-ToM, we instead use counterfactual explanations called fault-lines, which we define as follows: given an input image I for which a CNN classification model M predicts class c_pred, a fault-line identifies the minimal semantic-level features (e.g., stripes on a zebra), referred to as explainable concepts, that need to be added to or deleted from I to alter M's classification of I to another specified class c_alt. Extensive experiments verify our hypotheses, demonstrating that CX-ToM significantly outperforms state-of-the-art XAI models.
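A fault-line can be read as the solution to a minimal-edit search problem over explainable concepts. The brute-force sketch below assumes two hypothetical helpers, `apply_edit` (renders a concept addition or deletion onto the image) and a classifier `model` that returns a label; the actual CX-ToM search is more sophisticated than this enumeration.

```python
import itertools

def find_fault_line(model, image, concept_edits, target_class, apply_edit, max_edits=3):
    """Search for a minimal set of concept edits that flips the prediction.

    concept_edits: list of (concept, op) pairs such as ("stripes", "add").
    Tries all edit sets of size 1, then 2, ... so the first hit is minimal
    in cardinality. Both helpers are assumptions, not the published code.
    """
    for k in range(1, max_edits + 1):
        for edits in itertools.combinations(concept_edits, k):
            edited = image
            for concept, op in edits:
                edited = apply_edit(edited, concept, op)
            if model(edited) == target_class:
                return list(edits)   # minimal fault-line found
    return None                      # no fault-line within the edit budget
```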

4.
IEEE Trans Neural Netw Learn Syst ; 33(7): 2758-2767, 2022 Jul.
Article in English | MEDLINE | ID: mdl-33385313

ABSTRACT

Although knowledge-based visual question answering (VQA) data sets encourage models to discover underlying knowledge by exploiting input-output correlations beyond image and text contexts, existing data sets are mostly annotated by crowdsourcing, e.g., by collecting questions and external reasons from different users via the Internet. Beyond the challenge of knowledge reasoning, how to deal with annotator bias remains unsolved; such bias often leads to superficial, overfitted correlations between questions and answers. To address this issue, we propose a novel data set named knowledge-routed visual question reasoning for VQA model evaluation. Considering that a desirable VQA model should correctly perceive the image context, understand the question, and incorporate its learned knowledge, our data set aims to cut off the shortcut learning exploited by current deep embedding models and push the research boundary of knowledge-based visual question reasoning. Specifically, we generate question-answer pairs from both the Visual Genome scene graph and an external knowledge base with controlled programs, to disentangle the knowledge from other biases. The programs select one or two triplets from the scene graph or knowledge base to require multistep reasoning, avoid answer ambiguity, and balance the answer distribution. In contrast to existing VQA data sets, we impose two major constraints on the programs to incorporate knowledge reasoning. First, multiple knowledge triplets can be related to the question, but only one of them relates to the image object; this forces the VQA model to correctly perceive the image instead of guessing the knowledge from the question alone. Second, all questions are based on different knowledge, but the candidate answers are the same for both the training and test sets; the testing knowledge is unused during training, so we can evaluate whether a model understands question words and handles unseen combinations. Extensive experiments with various baseline and state-of-the-art VQA models demonstrate that, even given the embedded knowledge base, a large gap remains between models with and without ground-truth supporting triplets. This reveals the weakness of current deep embedding models on the knowledge reasoning problem.
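The controlled generation programs can be pictured as chaining one image-grounded scene-graph triplet with one knowledge triplet while rejecting answers that would skew the distribution. The sketch below uses a hypothetical triplet layout and question template; it is not the data set's real generation code.

```python
import random
from collections import Counter

def generate_question(scene_graph, knowledge_base, answer_counts, cap=50):
    """Compose a two-hop question from one scene-graph triplet plus one
    knowledge triplet, rejecting answers that would unbalance the set.

    Assumed layout: every triplet is (subject, relation, object), and
    answer_counts is a Counter tracking how often each answer was used.
    """
    subj, rel, obj = random.choice(scene_graph)          # grounded in the image
    hops = [t for t in knowledge_base if t[0] == obj]    # chain knowledge on the object
    if not hops:
        return None                                      # no second reasoning hop
    _, k_rel, answer = random.choice(hops)
    if answer_counts[answer] >= cap:
        return None                                      # keep answers balanced
    answer_counts[answer] += 1
    question = f"What is the {k_rel} of the thing that the {subj} {rel}?"
    return question, answer

# Example round with toy triplets (hypothetical data):
counts = Counter()
sg = [("man", "rides", "horse")]
kb = [("horse", "taxonomic class", "mammal")]
print(generate_question(sg, kb, counts))
```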

5.
IEEE Trans Image Process ; 30: 5573-5588, 2021.
Article in English | MEDLINE | ID: mdl-34110991

ABSTRACT

Existing vision-based action recognition is susceptible to occlusion and appearance variations, whereas wearable sensors can alleviate these challenges by capturing human motion as one-dimensional time-series signals (e.g., acceleration, gyroscope, and orientation). For the same action, the knowledge learned from vision sensors (videos or images) and from wearable sensors may be related and complementary. However, there is a significant modality gap between action data captured by wearable sensors and by vision sensors in data dimension, data distribution, and inherent information content. In this paper, we propose a novel framework named Semantics-aware Adaptive Knowledge Distillation Networks (SAKDN) to enhance action recognition in the vision-sensor modality (videos) by adaptively transferring and distilling knowledge from multiple wearable sensors. SAKDN uses multiple wearable sensors as teacher modalities and RGB videos as the student modality. To preserve local temporal relationships and facilitate the use of visual deep learning models, we transform the one-dimensional time-series signals of wearable sensors into two-dimensional images with a Gramian Angular Field (GAF) based virtual image generation model. We then introduce a novel Similarity-Preserving Adaptive Multi-modal Fusion Module (SPAMFM) to adaptively fuse intermediate representation knowledge from the different teacher networks. Finally, to fully exploit and transfer the knowledge of multiple well-trained teacher networks to the student network, we propose a novel Graph-guided Semantically Discriminative Mapping (GSDM) module, which uses graph-guided ablation analysis to produce visual explanations that highlight the important regions across modalities while preserving the interrelations of the original data. Experimental results on the Berkeley-MHAD, UTD-MHAD, and MMAct datasets demonstrate the effectiveness of SAKDN for adaptive knowledge transfer from wearable-sensor modalities to vision-sensor modalities. The code is publicly available at https://github.com/YangLiu9208/SAKDN.
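The Gramian Angular Field transform itself is standard. A minimal version of the summation variant (GASF), on which the virtual-image generation model plausibly builds (the paper's exact variant may differ), looks like this:

```python
import numpy as np

def gramian_angular_field(series):
    """Encode a 1-D sensor signal as a 2-D Gramian Angular (Summation) Field,
    so that image CNNs can consume wearable-sensor data.

    Standard GASF: rescale to [-1, 1], map each value to a polar angle via
    arccos, then form G[i, j] = cos(phi_i + phi_j).
    """
    x = np.asarray(series, dtype=float)
    # Rescale to [-1, 1] so arccos is defined; eps guards constant signals.
    x = 2 * (x - x.min()) / (x.max() - x.min() + 1e-8) - 1
    phi = np.arccos(np.clip(x, -1.0, 1.0))
    return np.cos(phi[:, None] + phi[None, :])

# Example: a short accelerometer window becomes an N x N virtual image.
print(gramian_angular_field([0.1, 0.5, 0.9, 0.4]).shape)  # (4, 4)
```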

6.
IEEE Trans Pattern Anal Mach Intell ; 42(5): 1069-1082, 2020 05.
Article in English | MEDLINE | ID: mdl-30640601

ABSTRACT

Driven by recent computer vision and robotic applications, recovering 3D human poses has become increasingly important and has attracted growing interest. The task is quite challenging due to the diverse appearances, viewpoints, occlusions, and inherent geometric ambiguities in monocular images. Most existing methods focus on designing elaborate priors/constraints to directly regress 3D human poses from the corresponding 2D pose-aware features or 2D pose predictions. However, due to insufficient 3D pose data for training and the domain gap between 2D and 3D space, these methods have limited scalability in practical scenarios (e.g., outdoor scenes). To address this issue, this paper proposes a simple yet effective self-supervised correction mechanism that learns the intrinsic structure of human poses from abundant images. Specifically, the mechanism involves two dual learning tasks, 2D-to-3D pose transformation and 3D-to-2D pose projection, which serve as a bridge between 3D and 2D human poses and provide a form of "free" self-supervision for accurate 3D human pose estimation. The 2D-to-3D transformation sequentially regresses intermediate 3D poses by lifting the pose representation from the 2D domain to the 3D domain under a sequence-dependent temporal context, while the 3D-to-2D projection refines the intermediate 3D poses by maintaining geometric consistency between the 2D projections of the 3D poses and the estimated 2D poses. These two dual tasks enable our model to learn adaptively from both 3D human pose data and external large-scale 2D human pose data. We further apply this self-supervised correction mechanism to develop a 3D human pose machine that jointly integrates 2D spatial relationships, temporal smoothness of predictions, and 3D geometric knowledge. Extensive evaluations on the Human3.6M and HumanEva-I benchmarks demonstrate the superior performance and efficiency of our framework over the compared competing methods.
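The 3D-to-2D projection task reduces, at its core, to a reprojection-consistency loss. Below is a minimal sketch in which a simplified pinhole camera stands in for the paper's projection module; the tensor layout and camera parameterization are assumptions.

```python
import torch

def projection_consistency_loss(pose_3d, pose_2d, camera):
    """3D-to-2D consistency: project the regressed 3D joints and compare
    them with the estimated 2D pose.

    pose_3d: (J, 3) joints in camera coordinates; pose_2d: (J, 2) pixels;
    camera: (fx, fy, cx, cy) pinhole intrinsics. A simplified sketch.
    """
    fx, fy, cx, cy = camera
    z = pose_3d[:, 2].clamp(min=1e-6)          # avoid division by zero depth
    u = fx * pose_3d[:, 0] / z + cx            # perspective projection
    v = fy * pose_3d[:, 1] / z + cy
    projected = torch.stack([u, v], dim=1)
    return torch.mean((projected - pose_2d) ** 2)
```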


Subject(s)
Deep Learning; Imaging, Three-Dimensional/methods; Posture/physiology; Supervised Machine Learning; Algorithms; Databases, Factual; Humans; Video Recording
7.
IEEE Trans Pattern Anal Mach Intell ; 42(11): 2809-2824, 2020 11.
Article in English | MEDLINE | ID: mdl-31071019

ABSTRACT

Face hallucination is a domain-specific super-resolution problem that aims to generate a high-resolution (HR) face image from a low-resolution (LR) input. In contrast to existing patch-wise super-resolution models, which divide a face image into regular patches and independently apply an LR-to-HR mapping to each patch, we employ deep reinforcement learning to develop a novel attention-aware face hallucination (Attention-FH) framework, which recurrently learns to attend to a sequence of patches and performs facial part enhancement by fully exploiting the global interdependency of the image. Specifically, the proposed framework incorporates two components: a recurrent policy network that dynamically specifies a new attended region at each time step based on the status of the super-resolved image and the past sequence of attended regions, and a local enhancement network that hallucinates the selected patch and updates the global state. The Attention-FH model jointly learns the recurrent policy network and the local enhancement network by maximizing a long-term reward that reflects the hallucination result with respect to the whole HR image. Extensive experiments demonstrate that Attention-FH significantly outperforms state-of-the-art methods on in-the-wild face images with large pose and illumination variations.
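One attend-then-enhance step can be sketched as: sample a patch position from the policy, enhance that patch locally, and keep the log-probability for a REINFORCE-style update against the long-term reward. All names below (`policy_net`, `enhance_net`, the state layout, the grid of candidate positions) are placeholders, not the paper's released code.

```python
import torch

def attend_and_enhance(image, policy_net, enhance_net, state, grid=8, patch=16):
    """One step of an Attention-FH-style loop (sketch).

    policy_net scores grid*grid candidate patch corners given the current
    image and recurrent state; enhance_net super-resolves only the attended
    patch, which is written back in place.
    """
    logits, state = policy_net(image, state)      # (grid*grid,) patch scores
    dist = torch.distributions.Categorical(logits=logits)
    idx = dist.sample()
    i = int(idx)
    r, c = (i // grid) * patch, (i % grid) * patch
    out = image.clone()
    out[..., r:r + patch, c:c + patch] = enhance_net(
        image[..., r:r + patch, c:c + patch])
    return out, state, dist.log_prob(idx)         # log-prob feeds REINFORCE
```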


Subject(s)
Face/diagnostic imaging; Image Processing, Computer-Assisted/methods; Machine Learning; Algorithms; Humans; Neural Networks, Computer
8.
IEEE Trans Neural Netw Learn Syst ; 30(3): 834-850, 2019 03.
Article in English | MEDLINE | ID: mdl-30059324

ABSTRACT

Though quite challenging, leveraging large-scale unlabeled or partially labeled data in learning systems (e.g., model/classifier training) has attracted increasing attention due to its fundamental importance. To address this problem, many active learning (AL) methods have been proposed that employ up-to-date detectors to retrieve representative minority samples according to predefined confidence or uncertainty thresholds. However, these AL methods cause the detectors to ignore the remaining majority samples (i.e., those with low uncertainty or high prediction confidence). In this paper, by developing a principled active sample mining (ASM) framework, we demonstrate that cost-effectively mining samples from this unlabeled majority data is key to training more powerful object detectors while minimizing user effort. Specifically, our ASM framework involves a switchable sample selection mechanism that determines whether an unlabeled sample should be manually annotated via AL or automatically pseudo-labeled via a novel self-learning process. The process is compatible with mini-batch-based training (i.e., using a batch of unlabeled or partially labeled data as a one-time input) for object detection. In this process, the detector, such as a deep neural network, is first applied to the unlabeled samples (i.e., object proposals) to estimate their labels and output the corresponding prediction confidences. Our ASM framework then selects a number of samples and assigns pseudo-labels to them; these labels are specific to each learning batch, based on the confidence levels and additional constraints introduced by the AL process, and are discarded afterward. The temporarily labeled samples are then employed for network fine-tuning, while a few samples with low-confidence predictions are selected and annotated via AL. Notably, our method is suitable for object categories that are not seen in the unlabeled data during the learning process. Extensive experiments on two public benchmarks (the PASCAL VOC 2007/2012 data sets) clearly demonstrate that our ASM framework achieves performance comparable to that of alternative methods but with significantly fewer annotations.
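At its core, the switchable selection mechanism reduces to a two-threshold split over prediction confidences: confident proposals are pseudo-labeled for the current mini-batch, uncertain ones are routed to a human annotator. A minimal sketch, with illustrative thresholds and data layout:

```python
def switchable_selection(proposals, predictions, hi=0.9, lo=0.2):
    """ASM-style switchable sample selection (sketch).

    proposals: list of object proposals; predictions: list of (label, conf)
    pairs from the current detector. Proposals above `hi` get a temporary
    pseudo-label for this batch; those below `lo` are queued for active
    annotation; the rest are skipped this round. Thresholds are illustrative.
    """
    pseudo_labeled, to_annotate = [], []
    for prop, (label, conf) in zip(proposals, predictions):
        if conf >= hi:
            pseudo_labeled.append((prop, label))  # discarded after this batch
        elif conf <= lo:
            to_annotate.append(prop)              # low confidence -> human query
    return pseudo_labeled, to_annotate
```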

9.
Article in English | MEDLINE | ID: mdl-28092522

ABSTRACT

This paper develops a novel cost-effective framework for face identification, which progressively maintains a batch of classifiers as the face images of different individuals accumulate. By naturally combining two recently rising techniques, active learning (AL) and self-paced learning (SPL), our framework automatically annotates new instances and incorporates them into training under weak expert re-certification. We first initialize the classifier using a few annotated samples for each individual and extract image features using convolutional neural nets. Then, a number of candidates are selected from the unannotated samples for classifier updating, with the current classifiers ranking the samples by prediction confidence. In particular, our approach handles high-confidence samples in a self-paced way and low-confidence samples via active user queries. The neural nets are later fine-tuned based on the updated classifiers. This heuristic implementation is formulated as solving a concise active SPL optimization problem, which also advances the development of SPL by adding a rational dynamic curriculum constraint. The new model accords well with the "instructor-student-collaborative" learning mode in human education. The advantages of the proposed framework are twofold: (i) the required number of annotated samples is significantly decreased while comparable performance is maintained, with a dramatic reduction of user effort over other state-of-the-art active learning techniques; and (ii) the mixture of SPL and AL improves not only classifier accuracy compared to existing AL/SPL methods but also robustness against noisy data. We evaluate our framework on two challenging datasets, which include hundreds of persons under diverse conditions, and demonstrate very promising results. The code of this project is available at: http://hcp.sysu.edu.cn/projects/aspl/.
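The self-paced half of the objective can be illustrated with the classic hard-weighting rule: a sample enters training only once its loss falls below the pace parameter, which grows each round so that harder samples join later. This is the minimal SPL rule, not the full active-SPL optimization with its dynamic curriculum constraint:

```python
import numpy as np

def self_paced_weights(losses, lam):
    """Hard self-paced weights: v_i = 1 if loss_i < lambda else 0.

    Only 'easy' samples (low loss) are used at the current pace; raising
    lambda each round gradually admits harder samples into training.
    """
    return (np.asarray(losses) < lam).astype(float)

# Example rounds: easy samples first, then raise the pace parameter.
losses = [0.1, 0.8, 0.3, 1.5]
for lam in (0.5, 1.0, 2.0):
    print(lam, self_paced_weights(losses, lam))
```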


Subject(s)
Biometric Identification/economics; Biometric Identification/methods; Face/anatomy & histology; Image Processing, Computer-Assisted/methods; Supervised Machine Learning; Algorithms; Cost-Benefit Analysis; Databases, Factual; Humans
10.
IEEE Trans Image Process ; 24(10): 3019-33, 2015 Oct.
Article in English | MEDLINE | ID: mdl-25974938

ABSTRACT

Driven by recent vision and graphics applications such as image segmentation and object recognition, computing pixel-accurate saliency values that uniformly highlight foreground objects has become increasingly important. In this paper, we propose a unified framework called Pixelwise Image Saliency Aggregating (PISA) various bottom-up cues and priors. It generates spatially coherent yet detail-preserving, pixel-accurate, fine-grained saliency, and overcomes the limitations of previous methods, which rely on homogeneous superpixel-based, color-only treatment. PISA aggregates multiple saliency cues in a global context, such as complementary color and structure contrast measures, with their spatial priors in the image domain. The saliency confidence is further jointly modeled with a neighborhood consistency constraint in an energy minimization formulation, in which each pixel is evaluated against multiple hypothetical saliency levels. Instead of using global discrete optimization methods, we employ the cost-volume filtering technique to solve our formulation, assigning saliency levels smoothly while preserving edge-aware structure details. In addition, a faster version of PISA is developed using a gradient-driven image subsampling strategy that greatly improves runtime efficiency while keeping comparable detection accuracy. Extensive experiments on a number of public data sets show that PISA convincingly outperforms other state-of-the-art approaches. With this work, we also release a new data set containing 800 commodity images for evaluating saliency detection.
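Cost-volume filtering over hypothetical saliency levels can be pictured as: build a per-pixel cost slice for each discrete level, smooth every slice, then pick the cheapest level per pixel. In the sketch below a simple box filter stands in for PISA's edge-aware filter, and the cost term is an assumed squared deviation from the aggregated raw cue map:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def cost_volume_saliency(raw_saliency, levels=8, radius=5):
    """Cost-volume filtering sketch for pixelwise saliency assignment.

    raw_saliency: (H, W) aggregated bottom-up cue map in [0, 1]. Each pixel
    is evaluated against `levels` discrete saliency hypotheses; each cost
    slice is smoothed, and the per-pixel argmin gives the final level.
    """
    hyp = np.linspace(0.0, 1.0, levels)
    # cost slice k: deviation of hypothesis k from the raw cue map
    volume = np.stack([(raw_saliency - h) ** 2 for h in hyp])
    volume = np.stack([uniform_filter(s, size=2 * radius + 1) for s in volume])
    return hyp[np.argmin(volume, axis=0)]   # smooth, piecewise-constant labels
```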
